#### Announcement of Our New Editor

Effective January 1, 2011, Journal of Data Science will have a new editor. Please send contributions to:

Professor Wen-Jang Huang

Department of Applied Mathematics

National University of Kaohsiung

Kaohsiung, Taiwan 811

huangwj@nuk.edu.tw

### Journal of Data Science, v.8, no.4, p.505-519

#### A Multivariate Method for Normalization in Affymetrix Oligonucleotide Microarray Experiments

##### by Zhide Fang, Xiaohu Li and Lizhe Xu

- Full Text (PDF): [388.85 KB]

Affymetrix high-density oligonucleotide microarrays make it possible to simultaneously measure, and thus compare, the expression profiles of hundreds of thousands of genes in living cells. Genes differentially expressed under different conditions are very important to both basic and medical research. However, before detecting these differentially expressed genes from a vast number of candidates, it is necessary to normalize the microarray data because of the significant variation caused by non-biological factors. During the last few years, normalization methods based on probe level or probeset level intensities have been proposed in the literature. These methods were motivated by different purposes. In this paper, we propose a multivariate normalization method, based on partial least squares regression, that aims to equalize the central tendency and to reduce and equalize the variation of the probe level intensities in any probeset across the replicated arrays. By so doing, we hope that one can precisely estimate the gene expression indexes.

### Journal of Data Science, v.8, no.4, p.521-539

#### True-Value Regression Theory

##### by Gordon G. Bechtel

- Full Text (PDF): [113.54 KB]

Design-based regression regards the survey response as a constant waiting to be observed. Bechtel (2007) replaced this constant with the sum of a fixed true value and a random measurement error. The present paper relaxes the assumption that the expected error is zero within a survey respondent. It also allows measurement errors in predictor variables as well as in the response variable. Reasonable assumptions about these errors over respondents, along with coefficient alpha in psychological test theory, enable the regression of true responses on true predictors. This resolves two major issues in survey regression, i.e. errors in variables and item non-response. The usefulness of this resolution is demonstrated with three large datasets collected by the European Social Survey in 2002, 2004 and 2006. The paper concludes with implications of true-value regression for survey theory and practice and for surveying large world populations.

### Journal of Data Science, v.8, no.4, p.541-553

#### A Bayesian Approach to Successive Comparisons

##### by A. Aghamohammadi, M. R. Meshkani and M. Mohammadzadeh

- Full Text (PDF): [97.94 KB]

The present article discusses and compares multiple testing procedures (MTPs) for controlling the familywise error rate. Machekano and Hubbard (2006) proposed an empirical Bayes approach, a resampling-based multiple testing procedure that asymptotically controls the familywise error rate. In this paper we provide some additional work on their procedure, and we develop a resampling-based step-down procedure that asymptotically controls the familywise error rate for testing families of one-sided hypotheses. We apply these procedures to make successive comparisons between treatment effects under a simple-order assumption. For example, the treatment means may correspond to a sequence of increasing dose levels of a drug. Using simulations, we demonstrate that the proposed step-down procedure is less conservative than Machekano and Hubbard's procedure. The application of the procedure is illustrated with an example.

### Journal of Data Science, v.8, no.4, p.555-577

#### Edition and Imputation of Multiple Time Series Data Generated by Repetitive Surveys

##### by Victor M. Guerrero and Blanca I. Gaspar

- Full Text (PDF): [213.31 KB]

This paper considers the statistical problems of editing and imputing data of multiple time series generated by repetitive surveys. The case under study is that of the Survey of Cattle Slaughter in Mexico's Municipal Abattoirs. The proposed procedure consists of two phases. First, the data of each abattoir are edited to correct gross inconsistencies. Second, the missing data are imputed by means of restricted forecasting. This method uses all the historical and current information available for the abattoir, as well as multiple time series models from which efficient estimates of the missing data are obtained. Some empirical examples are shown to illustrate the usefulness of the method in practice.

### Journal of Data Science, v.8, no.4, p.579-595

#### The Effect of Sample Composition on Inference for Random Effects Using Normal and Dirichlet Process Models

##### by Guofen Yan and J. Sedransk

- Full Text (PDF): [223.30 KB]

Good inference for the random effects in a linear mixed-effects model is important because of their role in decision making. For example, estimates of the random effects may be used to make decisions about the quality of medical providers such as hospitals, surgeons, etc. Standard methods assume that the random effects are normally distributed, but this may be problematic because inferences are sensitive to this assumption and to the composition of the study sample. We investigate whether using a Dirichlet process prior instead of a normal prior for the random effects is effective in reducing the dependence of inferences on the study sample. Specifically, we compare the two models, normal and Dirichlet process, emphasizing inferences for extrema. Our main finding is that using the Dirichlet process prior provides inferences that are substantially more robust to the composition of the study sample.

### Journal of Data Science, v.8, no.4, p.597-606

#### Application of Skew-normal in Classification of Satellite Image

##### by Mohammad Reza Zadkarami and Mahdi Rowhani

- Full Text (PDF): [1.43 MB]

The aim of this paper is to investigate the flexibility of the skew-normal distribution for classifying the pixels of a remotely sensed satellite image. In most remote sensing packages, for example ENVI and ERDAS, it is assumed that the populations follow a multivariate normal distribution. The linear discriminant function (LDF) or the quadratic discriminant function (QDF) is then used to classify the pixels when the covariance matrices of the populations are assumed equal or unequal, respectively. However, data obtained from satellite or airplane images suffer from non-normality. In this case, the skew-normal discriminant function (SDF) is one of the techniques for obtaining a more accurate image. In this study, we compare the SDF with the LDF and QDF using simulation for different scenarios. The results show that ignoring the skewness of the data increases the misclassification probability and consequently yields an incorrect image. An application is provided to identify the effect of wrong assumptions on the image accuracy.

### Journal of Data Science, v.8, no.4, p.607-617

#### Generalized Poisson-Poisson Mixture Model for Misreported Counts with an Application to Smoking Data

##### by Mavis Pararai, Felix Famoye and Carl Lee

- Full Text (PDF): [84.54 KB]

The assumption usually made when modeling count data is that the response variable, which is the count, is correctly reported. However, some counts might be over- or under-reported. We derive the Generalized Poisson-Poisson mixture regression (GPPMR) model, which can handle accurate, underreported and overreported counts. The parameters in the model are estimated via the maximum likelihood method. We apply the GPPMR model to a real-life data set.

### Journal of Data Science, v.8, no.4, p.619-630

#### The Analysis of Health Care Coverage through Transition Matrices Using a One Factor Model

##### by Eric D. Olson, Billie S. Anderson and J. Michael Hardin

- Full Text (PDF): [74.41 KB]

This paper studies the effect the tax environment has on the health care coverage of individuals. This study adds to the current literature on health care policy by examining how individuals switch types of health care coverage given a change in the tax environment. The distribution of health care coverage is investigated using transition matrices. Then, a model is used to determine how individuals might be expected to switch insurance types given a change in the tax environment. Based on the results of this study, the authors give some recommendations on what the results may imply for health care policy makers.

### Journal of Data Science, v.8, no.4, p.631-644

#### A Weighted-Least-Squares Estimation Approach to Comparing Trends in Age-Adjusted Cancer Rates Across Overlapping Regions

##### by Kimberly A. Walters, Yi Li, Ram C. Tiwari and Zhaohui Zou

- Full Text (PDF): [127.08 KB]

Li and Tiwari (2008) recently developed a corrected Z-test statistic for comparing the trends in cancer age-adjusted mortality and incidence rates across overlapping geographic regions, by properly adjusting for the correlation between the slopes of the fitted simple linear regression equations. One of their key assumptions is that the error variances have unknown but common variance. However, since the age-adjusted rates are linear combinations of mortality or incidence counts, arising naturally from an underlying Poisson process, this constant variance assumption may be violated. This paper develops a weighted-least-squares based test that incorporates heteroscedastic error variances, and thus significantly extends the work of Li and Tiwari. The proposed test generally outperforms the aforementioned test through simulations and through application to the age-adjusted mortality data from the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute.

### Journal of Data Science, v.8, no.4, p.645-664

#### On the Estimation and Comparison of Lifetime Morbid Risks

##### by Camil Fuchs, David M. Steinberg and Michael Poyurovsky

- Full Text (PDF): [112.91 KB]

Lifetime morbid risks are usually determined either by the Kaplan-Meier product limit estimator or by simpler estimators such as the lifetime prevalence, the Weinberg method or the Schulz method, which can be considered an elaboration of the Weinberg method.

We show that the Kaplan-Meier product limit estimator of lifetime morbid risk may yield unreliable estimates. Although the simplicity of the Schulz method and the Weinberg method is appealing, we suggest that under a proper model, those methods can be replaced by the original Strömgren estimator, which is almost equally simple and more accurate. Increased accuracy is achieved when the investigators have prior indication regarding the distribution of the ages at onset for those affected by the disorder, and even when that indication is vague and only limited knowledge of the distribution is available.