### Journal of Data Science, v.2, no.3, p.213-230

#### Factor Effects Testing for Mixture Distributions with Application to the Study of Emergence of Pontomyia Oceana

##### by Mong-Na Lo Huang, Chun-Sui Lin and Keryea Soong

- Full Text (PDF): [186.00kB]

This work discusses testing for factor effects in observed data from finite mixture distributions. Likelihood ratio tests are used to test whether factors of interest have significant effects on the mixture distribution model. To carry out the likelihood ratio tests, different computational algorithms for the maximum likelihood estimation (MLE) of the parameters in the mixture models are studied. These methods are applied to data from a laboratory study on the emergence of Pontomyia oceana, in which the effects of factors such as sex and temperature on the distribution of the emergence dates are investigated. In some cases, three-component logistic distributions are fitted to the data, with two peaks very close to each other. This is somewhat surprising, because two such close peaks are not easy to see in the histogram and would not usually be expected. From a practical point of view, the laboratory conditions excluded the possible effects of semi-lunar tidal fluctuations, which may have a dominating influence in nature. The laboratory results thus help to identify the possible factors that have minor effects. The difference between males and females nevertheless suggests that sex hormones may be involved in determining the emergence dates. The suggestion of a third peak is unexpected from our point of view and implies that there are factors we never suspected. It is worth noting that the rigorous statistical analysis presented here provides an objective estimate of the distribution of the emergence dates, as well as of the corresponding proportions and the peak synchronous emergence dates in each period under different factor effects. We begin to speculate on the possible adaptive meaning of the differences only after they have been established as a true phenomenon. This study reveals some additional biological phenomena worthy of further investigation.
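
As a rough illustration of the quantities involved, the sketch below evaluates the log-likelihood of a finite mixture of logistic components and the likelihood ratio statistic for two nested fits. The function names are hypothetical, the parameter values would come from whatever MLE algorithm is used, and none of this is taken from the paper itself.

```python
import math

def logistic_pdf(x, loc, scale):
    """Density of a logistic distribution with location loc and scale > 0."""
    z = (x - loc) / scale
    e = math.exp(-abs(z))  # the density is symmetric in z; abs() avoids overflow
    return e / (scale * (1.0 + e) ** 2)

def mixture_loglik(data, weights, locs, scales):
    """Log-likelihood of data under a finite mixture of logistic components."""
    total = 0.0
    for x in data:
        density = sum(w * logistic_pdf(x, m, s)
                      for w, m, s in zip(weights, locs, scales))
        total += math.log(density)
    return total

def lr_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic 2 * (l_alt - l_null) for nested models."""
    return 2.0 * (loglik_alt - loglik_null)
```

For a factor such as sex, the null fit would share one set of mixture parameters across groups and the alternative fit would allow them to differ; the statistic is then compared with a suitable reference distribution.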

### Journal of Data Science, v.2, no.3, p.231-244

#### Estimating Vehicle Speed from Traffic Count and Occupancy Data

##### by Martin L. Hazelton

- Full Text (PDF): [198.61kB]

Automatic vehicle detectors are now common on road systems across the world. Many of these detectors are based on single inductive loops, from which data on traffic volumes (i.e., vehicle counts) and occupancy (i.e., proportion of time during which the loop is occupied) are available for 20- or 30-second observation periods. However, for the purposes of traffic management it is frequently useful to have data on (mean) vehicle speeds, and these are not directly available from single loop detectors. While detector occupancy is related in a simple fashion to vehicle speed and length, the latter variable is not measured on the vehicles that pass. In this paper a new method for speed estimation from traffic count and occupancy data is proposed. By assuming a simple random walk model for successive vehicle speeds, an MCMC approach to speed estimation can be applied, in which missing vehicle lengths are sampled from an exogenous data set. Unlike earlier estimation methods, measurement error in occupancy data is explicitly modelled. The proposed methodology is applied to traffic flow data from Interstate 5 near Seattle, during a weekday morning. The efficacy of the estimation scheme is examined by comparing the estimates with independently collected vehicle speed data. The results are encouraging.
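
The simple relation between occupancy, speed and length mentioned above can be sketched as follows. This is not the paper's MCMC method: it assumes every vehicle in the period travels at the same speed and that a mean effective vehicle length (vehicle plus detection zone) is known. The function name and the numbers in the usage line are hypothetical.

```python
def mean_speed_estimate(count, occupancy, period_s, mean_effective_length_m):
    """Estimate mean speed (m/s) from single-loop count and occupancy.

    Occupancy is the fraction of the period during which the loop is covered,
    so occupancy * period_s is the total time spent over the loop. If all
    vehicles share one speed v, that time equals count * length / v.
    """
    if count == 0 or occupancy <= 0:
        return None  # no vehicles observed, speed is undefined
    return count * mean_effective_length_m / (occupancy * period_s)

# Hypothetical 20-second period: 10 vehicles, 10% occupancy, 6 m effective length.
speed = mean_speed_estimate(10, 0.1, 20, 6.0)
```

Because individual lengths are unobserved, an estimate of this kind is sensitive to the assumed mean length, which is the gap the abstract's sampling of lengths from an exogenous data set is meant to address.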

### Journal of Data Science, v.2, no.3, p.245-257

#### A GEE Approach for Estimating Correlation Coefficients Involving Left-censored Variables

##### by Jingli Song, Huiman X. Barnhart and Robert H. Lyles

- Full Text (PDF): [142.15kB]

HIV (Human Immunodeficiency Virus) researchers are often concerned with the correlation between HIV viral load measurements and CD4+ lymphocyte counts. Due to the lower limits of detection (LOD) of the available assays, HIV viral load measurements are subject to left-censoring. Motivated by these considerations, the maximum likelihood (ML) method under normality assumptions was recently proposed for estimating the correlation between two continuous variables that are subject to left-censoring. In this paper, we propose a generalized estimating equations (GEE) approach as an alternative to estimate such a correlation coefficient. We investigate the robustness to the normality assumption of the ML and the GEE approaches via simulations. An actual HIV data example is used for illustration.

### Journal of Data Science, v.2, no.3, p.259-272

#### Using the Box-Cox Power Transformation to Predict Temporally Correlated Longitudinal Data

##### by R. C. Hwang

- Full Text (PDF): [217.15kB]

In this paper, the repeated measurement linear model proposed by Diggle (1988) is applied to two real data examples to predict future values for temporally correlated longitudinal data. This model incorporates the population mean, variability among individuals, serial correlation within an individual, and measurement error. In practice, however, the original data may not fit well with the linearity assumption imposed on the mean function by Diggle's model, thereby deteriorating the overall prediction ability of the model. To overcome this potential drawback, the Box-Cox power transformation (Box and Cox 1964) is considered, and two different ways of conducting power transformations are suggested. One of these two approaches performs transformation inside of Diggle's model, and the other performs transformation outside of Diggle's model. Given Diggle's model using the power transformed data, two prediction methods (the maximum likelihood method and the approximate Bayesian approach) are used to predict future values. Using our real data examples, it is shown that both values of mean absolute difference and mean absolute relative difference for each of these two prediction methods without power transformation can be reduced by more than 10% by simply performing power transformation. Results indicate that the prediction ability of Diggle's model can be significantly improved by employing power transformation, because lower levels of both mean absolute difference and mean absolute relative difference can be obtained.
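
For reference, the Box-Cox power transformation itself can be written in a few lines. This is a minimal sketch of the standard one-parameter form for positive data; it does not reproduce either of the two ways of combining the transformation with Diggle's model described in the abstract, and the function names are illustrative.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with power parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lam tends to 0
    return (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale.

    Assumes lam * z + 1 > 0 when lam is nonzero.
    """
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)
```

Predictions made on the transformed scale would be passed through `inverse_box_cox` before computing error measures such as the mean absolute difference on the original scale.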

### Journal of Data Science, v.2, no.3, p.273-285

#### An Analysis of Quasi-complete Binary Data with Logistic Models: Applications to Alcohol Abuse Data

##### by Mandy C. Webb, Jeffrey R. Wilson and Jenny Chong

- Full Text (PDF): [151.67kB]

This paper examines the issues surrounding the analysis of quasi-complete binary data using logistic regression models with the aid of some popular statistical software programs. Results from three procedures in SAS (LOGISTIC, CATMOD and GENMOD) and the pull-down menu in SPSS were examined. The review was conducted in response to an observation that some users of these procedures do not always independently account for data irregularities when interpreting the computer results. This may be due partly to the fact that the information provided by some statistical software packages may not be sufficient for the user to make informed decisions regarding the results. The dataset that motivated this review came from a substance abuse treatment outcome study. Thirty subjects were followed up to determine the proportion that relapsed and the factors that may predict relapse. Binary logistic regression models were used to determine the predictors of a relapse. Results showed that there was quasi-complete separation of the data, and as such the interpretation is limited. In the analysis of quasi-complete data, the SAS procedures gave very large standard errors, computed more iterations, and provided a useful warning for researchers regarding the configuration of the data. In contrast, SPSS provided estimates with smaller standard errors and did not necessarily warn researchers about the data configuration. Thus researchers who use statistical software without knowledge of the iterative procedures used by the package should be aware of the possibility of erroneous conclusions when analyzing quasi-complete or complete data.
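
To make the data configuration concrete, the sketch below classifies separation for the simplest case of one non-constant numeric predictor and a 0/1 outcome. It is an illustrative check written for this summary, not the diagnostic used by SAS or SPSS, and models with several predictors require a more general test.

```python
def separation_type(x, y):
    """Classify separation for one numeric predictor x and a 0/1 outcome y.

    Returns "complete" if a threshold splits the two outcome groups with no
    ties, "quasi-complete" if the groups touch only at a single shared value,
    and "overlap" otherwise.
    """
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both outcome classes must be present")
    # Check both orderings: group 0 below group 1, and group 1 below group 0.
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) < min(high):
            return "complete"
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) == min(high):
            return "quasi-complete"
    return "overlap"
```

When the result is anything other than "overlap", the maximum likelihood estimate of the slope does not exist as a finite value, which is why very large standard errors and extra iterations are warning signs.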

### Journal of Data Science, v.2, no.3, p.287-295

#### On the Generalized Poisson Regression Model with an Application to Accident Data

##### by Felix Famoye, John T. Wulu, Jr. and Karan P. Singh

- Full Text (PDF): [98.89kB]

In this paper a random sample of drivers aged sixty-five years or older was selected from the Alabama Department of Public Safety Records. The sample data include information on many variables, including the number of accidents, demographic information, driving habits, and medication. The purpose of the sample was to assess the effects of demographic factors, driving habits, and medication use on elderly drivers. The generalized Poisson regression (GPR) model is considered for identifying the relationship between the number of accidents and some covariates. About 59% of drivers who rate their quality of driving as average or below are involved in automobile accidents. Drivers who take calcium channel blockers show a significantly reduced risk of about 34.5%. Based on the test for the dispersion parameter and the goodness-of-fit measure for the accident data, the GPR model performs as well as or better than the other regression models.
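
The abstract does not state the parametrization of the GPR model. As an assumption, the sketch below uses a commonly cited form of the generalized Poisson probability function with mean `mu` and dispersion parameter `alpha`, which reduces to the ordinary Poisson when `alpha` is 0 and has variance `mu * (1 + alpha * mu) ** 2`. In a regression setting `mu` would typically depend on covariates through a log link.

```python
import math

def generalized_poisson_pmf(y, mu, alpha):
    """P(Y = y) for a generalized Poisson variable with mean mu > 0.

    Assumes alpha >= 0 (equi- or over-dispersion) so that 1 + alpha * y > 0.
    """
    if y < 0:
        return 0.0
    theta = mu / (1.0 + alpha * mu)
    log_p = (y * math.log(theta)
             + (y - 1) * math.log(1.0 + alpha * y)
             - math.lgamma(y + 1)            # log(y!)
             - theta * (1.0 + alpha * y))
    return math.exp(log_p)
```

A test of the dispersion parameter, as mentioned in the abstract, amounts to asking whether `alpha` differs from 0, that is, whether the Poisson model is adequate.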

### Journal of Data Science, v.2, no.3, p.297-309

#### Identifying the Patterns of Hematopoietic Stem Cells Gene Expressions Using Clustering Methods: Comparison and Summary

##### by Jie Chen, Xi He, and Linheng Li

- Full Text (PDF): [192.61kB]

Clustering algorithms have been used to analyze microarray gene expression data in many recent applications. In this paper, we compare commonly used clustering methods, including hierarchical clustering with average, complete, and single linkages, k-means clustering, k-means clustering with hierarchical initialization, and the self-organizing map (SOM), using our hematopoietic stem cell (HSC) microarray data. Statistical clustering is an important tool for understanding the biological pathways from HSC to proliferative multipotent progenitor (MPP), and from MPP to either common lymphoid progenitor (CLP) or common myeloid progenitor (CMP). Our results demonstrated that the HSC microarray data set poses a challenge to clustering algorithms, as different algorithms produced clusters that were not all consistent. We compared the results by using the total within-cluster sum of squares of dispersions and the biological functions of the genes, and concluded that k-means clustering with hierarchical average or complete linkage initialization performed best among all the methods we compared. Our investigation of the clustering methods with HSC microarray data provides a useful approach and guide for medical researchers who use clustering algorithms to analyze their microarray or related data sets.
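
One of the two comparison criteria, the total within-cluster sum of squares, can be computed as in the sketch below. It assumes squared Euclidean distance to each cluster centroid, which the abstract does not spell out, and the function name and sample points are hypothetical.

```python
def total_within_cluster_ss(points, labels):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2
                     for m in members for d in range(dim))
    return total

# Hypothetical expression profiles and two alternative cluster assignments.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
tight = total_within_cluster_ss(points, [0, 0, 1, 1])
loose = total_within_cluster_ss(points, [0, 0, 0, 0])
```

For a fixed number of clusters, a smaller total indicates more compact clusters, so the labelings produced by different algorithms can be ranked on the same data.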