Announcement of Our New Editor
Effective January 1, 2011, Journal of Data Science will have a new editor. Please send contributions to:
Professor Wen-Jang Huang
Department of Applied Mathematics
National University of Kaohsiung
Kaohsiung, Taiwan 811
Journal of Data Science, v.8, no.3, p.361-378
Imputation Methods for Missing Categorical Questionnaire Data: A Comparison of Approaches
by W. Holmes Finch
- Full Text (PDF): [89.53kB]
Missing data are a common problem for researchers working with surveys and other types of questionnaires. Often, respondents do not respond to one or more items, making the conduct of statistical analyses, as well as the calculation of scores, difficult. A number of methods have been developed for dealing with missing data, though most of these have focused on continuous variables. It is not clear that these techniques for imputation are appropriate for the categorical items that make up surveys. However, methods of imputation specifically designed for categorical data are either limited in terms of the number of variables they can accommodate, or have not been fully compared with the continuous data approaches used with categorical variables. The goal of the current study was to compare the performance of these explicitly categorical imputation approaches with the more well-established continuous method used with categorical item responses. Results of the simulation study based on real data demonstrate that the continuous-based imputation approach and a categorical method based on stochastic regression appear to perform well in terms of creating data that match the complete datasets in terms of logistic regression results.
Journal of Data Science, v.8, no.3, p.379-396
A Mixed Effects Model for Overdispersed Zero Inflated Poisson Data with an Application in Animal Breeding
by Mariana Rodrigues-Motta, Daniel Gianola and Bjorg Heringstad
- Full Text (PDF): [169.51kB]
Response variables that are scored as counts, for example, number of mastitis cases in dairy cattle, often arise in quantitative genetic analysis. When the number of zeros exceeds the number expected under the Poisson density, the zero-inflated Poisson (ZIP) model is more appropriate. In using the ZIP model in animal breeding studies, it is necessary to accommodate genetic and environmental covariances. For that, this study proposes to model the mixture and Poisson parameters hierarchically, each as a function of two random effects, representing the genetic and environmental sources of variability, respectively. The genetic random effects are allowed to be correlated, leading to a correlation within and between clusters. The environmental effects are introduced by independent residual terms, accounting for overdispersion above that caused by extra zeros. In addition, an intercorrelation structure between random genetic effects affecting mixture and Poisson parameters is used to infer pleiotropy, an expression of the extent to which these parameters are influenced by common genes. The methods described here are illustrated with data on number of mastitis cases from Norwegian Red cows. Bayesian analysis yields posterior distributions useful for studying environmental and genetic variability, as well as genetic correlation.
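For readers unfamiliar with the ZIP model named in this abstract, its standard probability mass function is written out below. The symbols $\pi$ (mixture, or zero-inflation, probability) and $\lambda$ (Poisson mean) are notation assumed here for illustration; the abstract does not state the paper's own notation or the link functions used for the random effects.

```latex
P(Y = y \mid \pi, \lambda) =
\begin{cases}
\pi + (1-\pi)\, e^{-\lambda}, & y = 0, \\[4pt]
(1-\pi)\, \dfrac{e^{-\lambda} \lambda^{y}}{y!}, & y = 1, 2, \ldots
\end{cases}
\qquad
E(Y) = (1-\pi)\lambda, \quad
\mathrm{Var}(Y) = (1-\pi)\lambda\,(1 + \pi\lambda).
```

Because $\mathrm{Var}(Y) > E(Y)$ whenever $\pi > 0$ and $\lambda > 0$, zero inflation alone already induces overdispersion relative to the Poisson; the residual terms described in the abstract account for overdispersion beyond this.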
Journal of Data Science, v.8, no.3, p.397-412
On Bootstrap Tests of Symmetry about an Unknown Median
by Tian Zheng and Joseph L. Gastwirth
- Full Text (PDF): [198.03kB]
It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, using the asymptotic distribution may yield an inaccurate type I error rate and/or a loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests that assumed the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserves its level and shows reasonable power properties on the family of distributions evaluated.
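The resampling idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the statistic (mean minus median, divided by the standard deviation) is a simple stand-in and not the Miao, Gel and Gastwirth test itself; the symmetric empirical distribution is built by reflecting each observation about the sample median, which is one common construction and may differ from the paper's; and the function names are hypothetical.

```python
import random
import statistics


def symmetry_stat(xs):
    """Illustrative statistic: (mean - median) / sd. Zero for a perfectly symmetric sample."""
    sd = statistics.stdev(xs)
    if sd == 0:
        return 0.0  # degenerate resample in which every value is identical
    return (statistics.fmean(xs) - statistics.median(xs)) / sd


def bootstrap_symmetry_pvalue(xs, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for symmetry about an unknown median."""
    rng = random.Random(seed)
    m = statistics.median(xs)
    # Symmetrize the empirical distribution: pool the data with their
    # reflections about the sample median (an assumed construction).
    symmetrized = list(xs) + [2 * m - x for x in xs]
    observed = abs(symmetry_stat(xs))
    exceed = 0
    for _ in range(n_boot):
        resample = rng.choices(symmetrized, k=len(xs))
        if abs(symmetry_stat(resample)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_boot + 1)
```

Calling `bootstrap_symmetry_pvalue(sample)` on a list of numbers returns the estimated p-value; resampling from the symmetrized pool, rather than from the raw data, is what makes the bootstrap distribution respect the null hypothesis of symmetry.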
Journal of Data Science, v.8, no.3, p.413-427
On Intraday Shanghai Stock Exchange Index
by Hua Wang, Yan Yu and Min Li
- Full Text (PDF): [157.94kB]
This paper investigates the return, volatility, and trading on the Shanghai Stock Exchange with high-frequency intraday five-minute Shanghai Stock Exchange Composite Index (SHCI) data. The random walk hypothesis is rejected, indicating there are predictable components in the index. We adopt a time-inhomogeneous diffusion model using log penalized splines (log $P$-splines) to estimate the volatility. A GARCH volatility model is also fitted for comparison. De-volatilized series are obtained using the de-volatilization technique of Zhou (1991), which resamples the data into series with more desirable properties for trading. A trading program based on local trends extracted with a State Space model is then implemented on the de-volatilized five-minute SHCI return series for profit. Volatility estimates from both models are found to be competitive for the purpose of trading.
Journal of Data Science, v.8, no.3, p.429-441
Regression: Comparing Predictors and Groups of Predictors Based on a Robust Measure of Association
by Rand R. Wilcox
- Full Text (PDF): [99.71kB]
Let $\rho_j$ be Pearson's correlation between $Y$ and $X_j$ ($j=1$, 2). A problem that has received considerable attention is testing $H_0$: $\rho_1=\rho_2$. A well-known concern, however, is that Pearson's correlation is not robust (e.g., Wilcox, 2005), and the usual estimate of $\rho_j$, $r_j$, has a finite sample breakdown point of only $1/n$. The goal in this paper is to consider extensions to situations where Pearson's correlation is replaced by a particular robust measure of association. Included are results where there are $p>2$ predictors and the goal is to compare any two subsets of $m<p$ predictors.
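A breakdown point of $1/n$ means that a single aberrant observation can move $r_j$ essentially anywhere in $[-1, 1]$. The short Python sketch below illustrates this with made-up numbers (the data and the helper name `pearson_r` are purely illustrative and do not come from the paper): ten perfectly correlated points give $r = 1$, and corrupting one response flips the sign of $r$.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [float(v) for v in range(1, 11)]
y_clean = list(x)                        # exact linear relationship
y_contaminated = y_clean[:-1] + [-1000.0]  # one of the n = 10 responses corrupted

r_clean = pearson_r(x, y_clean)          # equals 1 up to rounding
r_contaminated = pearson_r(x, y_contaminated)  # negative despite 9 clean points
```

Nine of the ten points still lie exactly on a line with positive slope, yet the estimate is negative, which is the kind of behavior that motivates replacing Pearson's correlation with a robust measure of association.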
Journal of Data Science, v.8, no.3, p.443-455
An application of Multiple Imputation under the Two Generalized Parametric Families
by Hakan Demirtas
- Full Text (PDF): [99.82kB]
Multiple imputation under the multivariate normality assumption has often been regarded as a viable model-based approach in dealing with incomplete continuous data. Considering the fact that real data rarely conform to normality, there has been growing attention to generalized classes of distributions that cover a broader range of skewness and elongation behavior compared to the normal distribution. In this regard, two recent works have shown that creating imputations under Fleishman's power polynomials and the generalized lambda distribution may be a promising tool. In this article, essential distributional characteristics of these families are illustrated along with a description of how they can be used to create multiply imputed data sets. Furthermore, an application is presented using a data example from psychiatric research. Multiple imputation under these families that span most of the feasible area in the symmetry-peakedness plane appears to have substantial potential of capturing real missing-data trends that can be encountered in clinical practice.
Journal of Data Science, v.8, no.3, p.457-469
Evaluation of Agreement between Measurement Methods from Data with Matched Repeated Measurements via the Coefficient of Individual Agreement
by Michael Haber, Jingjing Gao and Huiman X. Barnhart
- Full Text (PDF): [100.49kB]
We propose a simple method for evaluating agreement between methods of measurement when the measured variable is continuous and the data consist of matched repeated observations made with the same method under different conditions. The conditions may represent different time points, raters, laboratories, treatments, etc. Our approach allows the values of the measured variable and the magnitude of disagreement to vary across the conditions. The coefficient of individual agreement (CIA), which is based on the comparison of the between- and within-methods mean squared deviation (MSD), is used to quantify the magnitude of agreement between measurement methods. The new approach is illustrated via two examples from studies designed to compare (a) methods of evaluating carotid stenosis and (b) methods of measuring percent body fat.
Journal of Data Science, v.8, no.3, p.471-482
Interval Estimation for Ratios of Correlated Age-Adjusted Rates
by Ram C. Tiwari, Yi Li and Zhaohui Zou
- Full Text (PDF): [101.30kB]
Providing reliable estimates of the ratios of cancer incidence and mortality rates across geographic regions has been important for the National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) Program as it profiles cancer risk factors as well as decides on cancer control planning. A fundamental difficulty, however, arises when such ratios have to be computed to compare the rate of a subregion (e.g., California) with that of a parent region (e.g., the US). Such a comparison is often made for policy-making purposes. Based on $F$-approximations as well as normal approximations, this paper provides new confidence intervals (CIs) for such rate ratios. Intensive simulations, which capture the real issues with the observed mortality data, reveal that these two CIs perform well. In general, for rare cancer sites, the $F$-intervals are often more conservative, and for moderate and common cancers, all intervals perform similarly.
Journal of Data Science, v.8, no.3, p.483-493
Bimodality of Plasma Glucose Distributions in Whites: A Bootstrap Approach to Testing Mixture Models
by Ying Yang, Juanjuan Fan, and Susanne May
- Full Text (PDF): [190.18kB]
The null distribution of the likelihood ratio test (LRT) of a one-component normal model versus a two-component normal mixture model is unknown. In this paper, we take a bootstrap approach to the likelihood ratio test for testing bimodality of plasma glucose concentrations from the Rancho Bernardo Diabetes Study. The small $p$-values from this approach support the hypothesis that a bimodal normal mixture model fits the data significantly better than a unimodal normal model. The size and power of the bootstrap-based LRT are evaluated through simulations. The results suggest that a sample size of close to 500 would be necessary in order to attain a power of 90% for detecting the unbalanced mixtures with means and variances similar to those in the Rancho Bernardo data. Besides sample size, the power also depends on the two means and variances of the two components in the data.
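A bootstrap LRT of this kind can be sketched as follows. The sketch assumes a parametric bootstrap (simulating from the fitted one-component normal), a basic EM fit with a midpoint-split start, and a variance floor of 1% of the overall variance to avoid degenerate components; none of these details are given in the abstract, and all function names are hypothetical.

```python
import math
import random
import statistics


def normal_logpdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def loglik_one(xs):
    """Maximized log-likelihood under a single normal component."""
    mu = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu)
    return sum(normal_logpdf(x, mu, var) for x in xs)


def loglik_two(xs, n_iter=100):
    """Log-likelihood of a two-component normal mixture fitted by EM."""
    n = len(xs)
    ordered = sorted(xs)
    total_var = statistics.pvariance(xs)
    floor = 0.01 * total_var  # assumed variance floor (not from the paper)
    # Crude starting values: split the sorted sample at its midpoint.
    w = 0.5
    mu1 = statistics.fmean(ordered[: n // 2])
    mu2 = statistics.fmean(ordered[n // 2:])
    v1 = v2 = total_var
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in xs:
            a = math.log(w) + normal_logpdf(x, mu1, v1)
            b = math.log(1.0 - w) + normal_logpdf(x, mu2, v2)
            resp.append(1.0 / (1.0 + math.exp(min(b - a, 700.0))))
        # M-step: weighted means, variances and mixing proportion.
        n1 = min(max(sum(resp), 1e-8), n - 1e-8)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n2
        v1 = max(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1, floor)
        v2 = max(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2, floor)
        w = n1 / n
    total = 0.0
    for x in xs:
        a = math.log(w) + normal_logpdf(x, mu1, v1)
        b = math.log(1.0 - w) + normal_logpdf(x, mu2, v2)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def lrt_stat(xs):
    """LRT statistic, clamped at zero in case EM stops short of the optimum."""
    return max(0.0, 2.0 * (loglik_two(xs) - loglik_one(xs)))


def bootstrap_lrt_pvalue(xs, n_boot=99, seed=0):
    """Return (observed LRT, parametric-bootstrap p-value)."""
    rng = random.Random(seed)
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs, mu)
    observed = lrt_stat(xs)
    exceed = 0
    for _ in range(n_boot):
        # Simulate a sample of the same size from the fitted null model.
        simulated = [rng.gauss(mu, sd) for _ in xs]
        if lrt_stat(simulated) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_boot + 1)
```

The reference distribution is generated under the null model because the usual chi-square approximation does not apply to this comparison; with `n_boot` replicates, the smallest attainable p-value is `1 / (n_boot + 1)`.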
Journal of Data Science, v.8, no.3, p.495-504
The Bayesian Multiple Logistic Random Effects Model for Analysis of Clinical Trial Data
by Karan P. Singh, Alfred A. Bartolucci and Sejong Bae
- Full Text (PDF): [84.12kB]
A prospective, multi-institutional and randomized surgical trial involving 724 early stage melanoma patients was conducted to determine whether excision margins for intermediate-thickness melanomas (1.0 to 4.0 mm) could be safely reduced from the standard 4-cm radius. Patients with 1- to 4-mm-thick melanomas on the trunk or proximal extremities were randomly assigned to receive either a 2- or 4-cm surgical margin with or without immediate node dissection (i.e., immediate vs. later, within 6 months). The median follow-up time was 6 years. Recurrence rates did not correlate with surgical margins, even among stratified thickness groups. The hospital stay was shortened from 7.0 days for patients receiving 4-cm surgical margins to 5.2 days for those receiving 2-cm margins ($p = 0.0001$). This reduction was largely due to reduced need for skin grafting in the 2-cm group. The overall conclusion was that the narrower margins significantly reduced the need for skin grafting and shortened the hospital stay. Given the adequacy of subject follow-up, a recent statistical focus was on which prognostic factors, usually called covariates, actually determined recurrence. As was anticipated, the thickness of the lesion ($p=0.0091$) and whether or not the lesion was ulcerated ($p=0.0079$) were determined to be significantly associated with recurrence events using the logistic regression model. This type of fixed effect analysis is rather routine.
The authors have determined that a Bayesian consideration of the results would afford a more coherent interpretation of the effect of the model assuming a random effect of the covariates of thickness and ulceration. Thus, using a Markov Chain Monte Carlo method of parameter estimation with noninformative priors, one is able to obtain the posterior estimates and credible regions of estimates of these effects as well as their interaction on recurrence outcome. Graphical displays of convergence history and posterior densities affirm the stability of the results. We demonstrate how the model performs under relevant clinical conditions. The conditions are all tested using a Bayesian statistical approach allowing for the robust testing of the model parameters under various recursive partitioning conditions of the covariates and hyperparameters which we introduce into the model. The convergence of the parameters to stable values is seen in trace plots which follow the convergence patterns. This allows for precise estimation for determining clinical conditions under which the response pattern will change. We give a numerical example of our results. The major platform for the theoretical development follows the Bayesian methodology and the multiple parameter logistic model with random effects having carefully chosen hyperparameters. We have built the basic infrastructure for the analysis using the commercially available WinBUGS software employing the Markov Chain Monte Carlo (MCMC) methodology. The BUGS language allows a concise expression of the parametric model to denote stochastic (probabilistic) relationships and deterministic (logical) relationships.