### Journal of Data Science, v.6, no.3, p.269-271

#### In Memory of Professor Jack Chao-sheng Lee --- A Scholar that Never Stops Learning

##### by Henry Horng-Shing Lu

- Full Text (PDF): [26.17kB]

This is the note from the guest editor, Professor Henry Horng-Shing Lu, of this special issue.

### Journal of Data Science, v.6, no.3, p.273-301

#### A Copula-based Approach to Option Pricing and Risk Assessment

##### by Shang C. Chiou and Ruey S. Tsay

- Full Text (PDF): [228.01kB]

Copulas are useful tools to study the relationship between random variables. In financial applications, they can separate the marginal distributions from the dynamic dependence of asset prices. The marginal distributions may assume some univariate volatility models, whereas the dynamic dependence can be time-varying and depend on some explanatory variables. In this paper, we consider applications of copulas in finance. First, we combine the risk-neutral representation and copula-based models to price multivariate exotic derivatives. Second, we show that copula-based models can be used to assess value at risk of multiple assets. We demonstrate the applications using daily log returns of two market indices and compare the proposed method with others available in the literature.
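To make the idea of separating marginals from dependence concrete, here is a minimal sketch of a copula-based value-at-risk estimate for two assets. It is not the authors' model: it assumes a static Gaussian copula, empirical marginal distributions, and a portfolio loss approximated by a weighted sum of log returns. The names `gaussian_copula_var`, `normal_scores`, `pearson`, and `empirical_quantile` are hypothetical helpers introduced for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def normal_scores(xs):
    """Replace each observation by the normal quantile of its rank."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def empirical_quantile(sorted_xs, u):
    """Inverse empirical CDF with linear interpolation, for u in [0, 1]."""
    pos = u * (len(sorted_xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac


def gaussian_copula_var(returns_a, returns_b, weights=(0.5, 0.5),
                        level=0.99, n_sims=20000, seed=0):
    """Simulated VaR (as a positive loss) under a Gaussian copula."""
    rng = random.Random(seed)
    # Dependence: correlation of the rank-based normal scores.
    rho = pearson(normal_scores(returns_a), normal_scores(returns_b))
    tail = math.sqrt(max(0.0, 1.0 - rho ** 2))
    sa, sb = sorted(returns_a), sorted(returns_b)
    losses = []
    for _ in range(n_sims):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + tail * rng.gauss(0.0, 1.0)
        # Marginals: map the copula uniforms back through each empirical CDF.
        ra = empirical_quantile(sa, STD_NORMAL.cdf(z1))
        rb = empirical_quantile(sb, STD_NORMAL.cdf(z2))
        # Assumption: portfolio loss ~ minus the weighted sum of log returns.
        losses.append(-(weights[0] * ra + weights[1] * rb))
    losses.sort()
    return losses[min(n_sims - 1, int(level * n_sims))]
```

Calling `gaussian_copula_var(returns_a, returns_b, level=0.99)` on two equal-length lists of daily log returns gives a simulated one-day 99% loss quantile. Swapping in a different copula or a volatility model for the marginals would change only the dependence step or the quantile step, which is the separation the abstract describes.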

### Journal of Data Science, v.6, no.3, p.303-312

#### Calibration Design of Implied Volatility Surfaces

##### by K. Detlefsen and W. K. Hardle

- Full Text (PDF): [143.04kB]

The calibration of option pricing models leads to the minimization of an error functional. We show that its usual specification as a root mean squared error implies prices of exotic options that can change abruptly when plain vanilla options expire. We propose a simple and natural method to overcome these problems, illustrate drawbacks of the usual approach and show advantages of our method. To this end, we calibrate the Heston model to a time series of DAX implied volatility surfaces and then price cliquet options.
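For orientation, a root-mean-squared-error calibration objective has the generic form below. This is a sketch under assumed notation, not the paper's exact functional: $\theta$ denotes the model parameters, $Q_i^{\text{mkt}}$ the $i$-th observed quote (a price or an implied volatility), $Q_i^{\text{mod}}(\theta)$ the corresponding model value, and $n$ the number of quoted options on a given day.

```latex
\hat{\theta}
  = \arg\min_{\theta}
    \sqrt{\frac{1}{n}\sum_{i=1}^{n}
      \left( Q_i^{\text{mod}}(\theta) - Q_i^{\text{mkt}} \right)^{2}}
```

In this form every quoted option carries equal weight, so when a set of options expires the terms in the sum and the value of $n$ change from one day to the next. That is one plausible reading of why calibrated parameters, and the exotic prices derived from them, can shift abruptly at expiry dates.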

### Journal of Data Science, v.6, no.3, p.313-331

#### Multivariate Regression Modeling for Functional Data

##### by Hidetoshi Matsui, Yuko Araki and Sadanori Konishi

- Full Text (PDF): [241.74kB]

We propose functional multivariate regression modeling, using Gaussian basis functions along with the technique of regularization. In order to evaluate the model estimated by the regularization method, we derive model selection criteria from information-theoretic and Bayesian viewpoints. Monte Carlo simulations are conducted to investigate the efficiency of the proposed model. We also apply our modeling strategy to the analysis of spectrometric data.
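As a rough guide to the ingredients named in the abstract, a Gaussian basis expansion and a regularized (penalized log-likelihood) criterion are commonly written as follows. The notation is assumed for illustration and may differ from the paper's: $x_i(t)$ is the $i$-th functional predictor, $\phi_k$ are Gaussian basis functions with centers $\mu_k$ and widths $\eta_k$, $\ell(\theta)$ is the log-likelihood of the regression model, $K$ is a penalty matrix, and $\lambda > 0$ is the regularization parameter.

```latex
x_i(t) = \sum_{k=1}^{m} w_{ik}\,\phi_k(t),
\qquad
\phi_k(t) = \exp\!\left\{ -\frac{(t-\mu_k)^2}{2\eta_k^{2}} \right\},
\qquad
\ell_{\lambda}(\theta) = \ell(\theta) - \frac{n\lambda}{2}\,\theta^{\top} K\,\theta
```

Model selection criteria of the kind the abstract mentions are then used to choose quantities such as $m$ and $\lambda$.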

### Journal of Data Science, v.6, no.3, p.333-355

#### Longitudinal Data Analysis Using t Linear Mixed Models with Autoregressive Dependence Structures

##### by Tsung-I Lin

- Full Text (PDF): [186.75kB]

The t linear mixed model with AR(p) dependence structure is proposed for the analysis of longitudinal data in which the underlying repeated measures contain thick tails and serial correlations simultaneously. For parameter estimation, I develop a hybrid maximization scheme that combines the stability of the Expectation Conditional Maximization Either (ECME) algorithm with the rapid convergence property of the scoring method. Empirical Bayes estimation of random effects and prediction of future values for the proposed model are also considered. The proposed methodologies are applied to a real example from a tumor growth study on twenty-two mice. Numerical comparisons indicate that the proposed model outperforms the normal model from both inferential and predictive perspectives.

### Journal of Data Science, v.6, no.3, p.357-368

#### Data Technology --- As a New Concept for Application of Statistics

##### by Sung H. Park and Moon W. Suh

- Full Text (PDF): [804.17kB]

Data technology (DT) is newly defined as distinct from Information Technology (IT). The major territory of DT is outlined, along with its roles in the information society. DT is concerned primarily with data collection, analysis of data, generation of information and creation of knowledge. On the other hand, IT is mainly concerned with transmission and communication of data, and development of engineering devices for information handling.

The roles of DT in creating knowledge from raw data are explained step by step. In order to exploit the DT concept for solving practical problems in a process or a product, a six-step workflow is suggested. Loss due to poor DT is illustrated with two examples. In addition, e-Statistics is proposed as one of the major vehicles for promoting the roles of DT. Overall, DT is explained as a new concept for broadened application of statistical science as a key technology for impacting global competition in the 21st century socio-economic environment.

### Journal of Data Science, v.6, no.3, p.369-388

#### Radio Frequency Identification: A new Opportunity for Data Science

##### by Vijay Wadhwa and Dennis K. J. Lin

- Full Text (PDF): [248.68kB]

Radio Frequency Identification (RFID) has taken center stage at retail and consumer products forums. RFID is not a new technology; it has been in use for many years. In this paper, we first review RFID technology and the components that form the backbone of the RFID system. Next, we demonstrate the usefulness of RFID in the supply chain and present some data mining challenges in RFID. Finally, a real-life case study is used to illustrate how organizations are using RFID data.

### Journal of Data Science, v.6, no.3, p.389-414

#### Data Mining and Hotspot Detection in an Urban Development Project

##### by Chamont Wang and Pin-Shuo Liu

- Full Text (PDF): [1.17MB]

Modern statistical analysis often involves large amounts of data from many application areas with diverse data types and complicated data structures. This paper gives a brief survey of certain large-scale applications. In addition, this paper compares a number of data mining tools in the study of a specific data set which has 1.4 million cases, 14 predictors and a binary response variable. The study focuses on predictive models that include Classification Tree, Neural Network, Stochastic Gradient Boosting, and Multivariate Adaptive Regression Splines. The study found that the variable importance scores generated by different data mining tools exhibit wide variability and that users need to be cautious in applying these scores. On the other hand, the response surfaces and the classification accuracies of most models are relatively similar, yet the financial implications can be very profound when the models select the top 10% of cases and when the cost and profit are incorporated in the calculation. Finally, the Decision Tree, Predictor Importance, and Geographic Information Systems (GIS) are used for Hotspot Detection to further enhance the profit to 95.5% of its full potential.
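The point about models with similar accuracy but different financial outcomes can be illustrated with a small sketch. It is not the study's calculation: the function name `top_fraction_profit`, the per-hit profit, and the per-case cost are all hypothetical values chosen for illustration.

```python
def top_fraction_profit(scores, responses, fraction=0.10,
                        profit_per_hit=100.0, cost_per_case=10.0):
    """Profit from acting only on the highest-scored fraction of cases.

    scores    -- model scores, higher means more likely to respond
    responses -- observed binary outcomes (1 = responder, 0 = not)
    The profit and cost figures are illustrative assumptions.
    """
    if len(scores) != len(responses):
        raise ValueError("scores and responses must have the same length")
    n_selected = max(1, int(len(scores) * fraction))
    # Rank cases by score, best first, and keep the top slice.
    ranked = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    hits = sum(r for _, r in ranked[:n_selected])
    return hits * profit_per_hit - n_selected * cost_per_case
```

Two models can agree on most cases yet rank the few responders differently at the top of the list, and in that situation this function returns very different profits for them even though their overall classification accuracy is close.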

### Journal of Data Science, v.6, no.3, p.415-427

#### A New Method for Gene Identification in Comparative Genomic Analysis

##### by Ching-Wei Chang, Wen Zou and James J. Chen

- Full Text (PDF): [185.38kB]

Microarray technology has been used to characterize intraspecies genetic diversity in bacteria at the genome level and to rapidly determine genetic profiles of pathogenic microorganisms for high-throughput screening in food safety research. In the use of microarray technology for bacterial identification and characterization in comparative genomic analysis, the primary objective is to determine the present (conserved) and absent (divergent) genes for each bacterial sample tested. The goal of the analysis is to estimate an optimal cutoff such that genes with intensity above the cutoff are classified as conserved and those below it as divergent. Standard statistical procedures developed for identifying differentially expressed genes are not appropriate, because they use the significance testing approach and often require a sufficient number of biological replicates. This paper proposes an analytic method to determine a cutoff based on change-point estimation using multivariate adaptive regression splines (MARS). The proposed estimation method is applied to two public datasets and compared with an existing classification algorithm, Genotyping Analysis by Charlie Kim (GACK). The proposed method performs consistently better than the GACK algorithm with respect to specificity and accuracy.

### Journal of Data Science, v.6, no.3, p.429-448

#### Computer-Aided Diagnosis of Liver Cirrhosis by Simultaneous Comparisons of the Ultrasound Images of Liver and Spleen

##### by Henry Horng-Shing Lu, Chung-Ming Chen, Yi-Ming Huang and Jen-Shian Wu

- Full Text (PDF): [494.07kB]

Ultrasound imaging is an important tool for early detection and regular check-ups of liver cirrhosis. The diagnosis can be performed by analysis of echo textures of the liver and of the accompanying spleen. The simultaneous comparison of liver and spleen images for the same person under the same system setup can be used to reduce subject, machine, and system variations. This study investigates computer-aided diagnosis using features derived from the ultrasound images of livers and the accompanying spleens. We incorporate the techniques of an early vision model, dimension reduction, fractal dimension, nonparametric discriminant rules by kernel density estimation and classification trees to improve the statistical analysis methods. These methods are tested on clinical images collected at National Taiwan University Hospital, with 64 normal livers and 30 cirrhotic ones. The smallest overall bootstrap prediction error is found to be 5.29% by these new methods.