Volume 9, Number 1, January 2011

  • Rand R. Wilcox
    Comparing Two Dependent Groups: Dealing with Missing Values
  • Yixin Fang
    Asymptotic Equivalence between Cross-Validations and Akaike Information Criteria in Mixed-Effects Models
  • Balgobin Nandram, Jai-Won Choi and Hongyan Xu
    Maximum Likelihood Estimation for Ascertainment Bias in Sampling Siblings
  • Haydar Demirhan
    Latent Class Analysis for Models with Error of Measurement Using Log-Linear Models and An Application to Women's Liberation Data
  • Shenghai Zhang
    Estimating Transmissibility of Seasonal Influenza Virus by Surveillance Data
  • Biswajeet Pradhan
    An Assessment of the Use of an Advanced Neural Network Model with Five Different Training Strategies for the Preparation of Landslide Susceptibility Maps
  • Te-Hsin Liang
    Association between Use of Internet Services and Quality of Life in Taiwan
  • Md. Hasinur Rahaman Khan and J. Ewart H. Shaw
    Multilevel Logistic Regression Analysis Applied to Binary Contraceptive Prevalence Data
  • S. M. Sadooghi-Alvandi, A. R. Nematollahi, Reza Habibi
    Test Procedures for Change Point in a General Class of Distributions
  • Yuanjia Wang and Yixin Fang
    Adjusting for Treatment Effect when Estimating or Testing Genetic Effect is of Main Interest

Journal of Data Science, v.9, no.1, p.1-13

Comparing Two Dependent Groups: Dealing with Missing Values

by Rand R. Wilcox

The paper considers the problem of comparing measures of location associated with two dependent groups when values are missing at random, with an emphasis on robust measures of location. It is known that simply imputing missing values can be unsatisfactory when testing hypotheses about means, so the goal here is to compare several alternative strategies that use all of the available data. Included are results on comparing means and a 20% trimmed mean. Yet another method is based on the usual median but differs from the other methods in a manner that is made obvious. (It is somewhat related to the formulation of the Wilcoxon-Mann-Whitney test for independent groups.) The strategies are compared in terms of Type I error probabilities and power.

Journal of Data Science, v.9, no.1, p.15-21

Asymptotic Equivalence between Cross-Validations and Akaike Information Criteria in Mixed-Effects Models

by Yixin Fang

For model selection in mixed-effects models, Vaida and Blanchard (2005) demonstrated that the marginal Akaike information criterion is appropriate for questions regarding the population, and the conditional Akaike information criterion is appropriate for questions regarding the particular clusters in the data. This article shows that the marginal Akaike information criterion is asymptotically equivalent to leave-one-cluster-out cross-validation, and the conditional Akaike information criterion is asymptotically equivalent to leave-one-observation-out cross-validation.

Journal of Data Science, v.9, no.1, p.23-41

Maximum Likelihood Estimation for Ascertainment Bias in Sampling Siblings

by Balgobin Nandram, Jai-Won Choi and Hongyan Xu

When there is a rare disease in a population, it is inefficient to take a random sample to estimate a parameter. Instead one takes a random sample of all nuclear families with the disease by ascertaining at least one affected sibling (proband) of each family. In these studies, an estimate of the proportion of siblings with the disease will be inflated. For example, the question of whether a rare disease shows an autosomal recessive pattern of inheritance, where the Mendelian segregation ratios are of interest, has been investigated for several decades. How do we correct for this ascertainment bias? Methods, primarily based on maximum likelihood estimation, are available to correct for the ascertainment bias. We show that for ascertainment bias, although maximum likelihood estimation is optimal under asymptotic theory, it can perform badly. The problem is exacerbated in the situation where the proband probabilities are allowed to vary with the number of affected siblings. We use two data sets to illustrate the difficulties of the maximum likelihood estimation procedure, and we use a simulation study to assess the quality of the maximum likelihood estimators.
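
The inflation described above can be illustrated with a small Python sketch. It assumes the simplest textbook setting, which is narrower than the one the paper treats: every family with at least one affected sibling is sampled (complete ascertainment), all sibships have the same size, and each sibling is affected independently with probability `p`. The function name and parameters are hypothetical.

```python
from math import comb

def naive_affected_proportion(p, sibship_size):
    """Expected proportion of affected siblings among ascertained families.

    Assumes complete ascertainment: a family enters the sample if and
    only if it has at least one affected sibling, so the number of
    affected siblings follows a zero-truncated binomial distribution.
    """
    s = sibship_size
    p_at_least_one = 1.0 - (1.0 - p) ** s
    # Mean of the zero-truncated binomial, then divided by sibship size.
    expected_affected = sum(
        k * comb(s, k) * p**k * (1.0 - p) ** (s - k) for k in range(1, s + 1)
    ) / p_at_least_one
    return expected_affected / s

# With a true segregation ratio of 0.25, the naive proportion is
# larger than 0.25 for every sibship size considered here.
for size in (2, 4, 8):
    print(size, naive_affected_proportion(0.25, size))
```

Under these assumptions the expected naive proportion equals p / (1 - (1 - p)^s), which always exceeds p and approaches it only as the sibship size s grows.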

Journal of Data Science, v.9, no.1, p.43-54

Latent Class Analysis for Models with Error of Measurement Using Log-Linear Models and An Application to Women's Liberation Data

by Haydar Demirhan

This article deals with the latent class analysis of models with error of measurement. If the latent variable is ordinal and the manifest variables are nominal, an approach to handling the restrictions is given for latent class analysis of models with error of measurement using log-linear models. In this way, we incorporate the ordinal nature of the latent variable into the analysis. As a result, overall uncertainty is decreased, and our inferences become more precise. The new approach is applied to a women’s liberation data set.

Journal of Data Science, v.9, no.1, p.55-64

Estimating Transmissibility of Seasonal Influenza Virus by Surveillance Data

by Shenghai Zhang

It is important to estimate the transmissibility of influenza during its growth phase in order to understand the propagation of the virus. Procedures for estimating transmissibility are usually based on the data generated in flu seasons. The data-generating process of an influenza outbreak has many features. The data are generated not only by a biological process but also by control measures such as flu vaccination. The estimation is discussed by considering the aspects of the data-generating process and using a model to capture the essential characteristics of flu transmission during the growth phase of a flu season.

Journal of Data Science, v.9, no.1, p.65-81

An Assessment of the Use of an Advanced Neural Network Model with Five Different Training Strategies for the Preparation of Landslide Susceptibility Maps

by Biswajeet Pradhan

Data collection for landslide susceptibility modelling is often an almost prohibitive activity. For this reason, landslides have for quite some time been described and modelled on the basis of spatially distributed values of landslide-related attributes. This paper presents a landslide susceptibility analysis of the Selangor area, Malaysia, using an artificial neural network model with the aid of remote sensing data and geographic information system (GIS) tools. To meet the objectives, landslide locations were identified in the study area from interpretation of aerial photographs, supported by extensive field surveys. Then, the landslide inventory was grouped into two categories: (1) training data and (2) testing data. Further, topographical and geological data and satellite images were collected, processed, and constructed into a spatial database using GIS tools and image processing techniques. Nine landslide occurrence attributes were selected and analyzed using an artificial neural network model to generate the landslide susceptibility maps. Landslide location data (training data) were used for training the neural network, and five training sites were selected randomly in this case. The ensemble of five training sites was used to investigate the model reliability, including the role of the thematic variables used to construct the model, and the model sensitivity to changes in the selection of the training sites. By studying the variation of the neural network's susceptibility estimate, the error associated with the model is determined. The results of the neural network analysis are shown on five sets of landslide susceptibility maps. Then the susceptibility maps were validated using the receiver operating characteristic (ROC) method as a measure for model verification. Landslide location data which were not used during the training of the neural network were used for the verification of the maps. The results of the analysis were verified using the landslide location data and compared across the five cases. Qualitatively, the model seems to give reasonable results, with observed accuracies of 87%, 83%, 85%, 86% and 82% for the five training sites, respectively.

Journal of Data Science, v.9, no.1, p.83-92

Association between Use of Internet Services and Quality of Life in Taiwan

by Te-Hsin Liang

The study explored the association between the use of Internet services and quality of life in Taiwan. The use of broadband, wireless, and mobile Internet is found to be positively correlated with people's overall quality of life. The more the Internet services of e-Government are used, the higher the satisfaction with socio-economic status and social competence. People using more Internet services in their daily activities also have higher self-esteem and less psychological pressure. However, people who rely heavily on Internet services for e-Business, such as online shopping or ticket booking, have lower satisfaction with community support.

Journal of Data Science, v.9, no.1, p.93-110

Multilevel Logistic Regression Analysis Applied to Binary Contraceptive Prevalence Data

by Md. Hasinur Rahaman Khan and J. Ewart H. Shaw

In public health, demography and sociology, large-scale surveys often follow a hierarchical data structure as the surveys are based on multistage stratified cluster sampling. The appropriate approach to analyzing such survey data is therefore based on nested sources of variability which come from different levels of the hierarchy. When the residual errors are correlated between individual observations as a result of these nested structures, traditional logistic regression is inappropriate. We use the 2004 Bangladesh Demographic and Health Survey (BDHS) binary contraceptive data, which is a multistage stratified cluster dataset. This dataset is used to exemplify all aspects of working with multilevel logistic regression models, including model conceptualization, model description, understanding of the structure of required multilevel data, estimation of the model via the statistical package MLwiN, comparison between different estimations, and investigation of the selected determinants of contraceptive use.

Journal of Data Science, v.9, no.1, p.111-126

Test Procedures for Change Point in a General Class of Distributions

by S. M. Sadooghi-Alvandi, A. R. Nematollahi, Reza Habibi

This paper is concerned with change point analysis in a general class of distributions. The quasi-Bayes and likelihood ratio test procedures are considered for testing the null hypothesis of no change point. Exact and asymptotic behaviors of the two test statistics are derived. To compare the performance of the two test procedures, numerical significance levels and powers of the tests are tabulated for certain selected values of the parameters. Estimation of the change point based on these two test procedures is considered. The epidemic change point problem is studied as an alternative to the single change point model. A real data set with an epidemic change model is analyzed by the two test procedures.

Journal of Data Science, v.9, no.1, p.127-138

Adjusting for Treatment Effect when Estimating or Testing Genetic Effect is of Main Interest

by Yuanjia Wang and Yixin Fang

It is known that "standard methods for estimating the causal effect of a time-varying treatment on the mean of a repeated measures outcome (for example, GEE regression) may be biased when there are time-dependent variables that are simultaneously confounders of the effect of interest and are predicted by previous treatment" (Hernan et al. 2002). Inverse-probability-of-treatment weighted (IPTW) methods have been developed in the causal inference literature. In genetic studies, however, the main interest is to estimate or test the genetic effect rather than the treatment effect. In this work, we describe an IPTW method that provides an unbiased estimate of the genetic effect, and discuss how to develop a family-based association test using IPTW for family-based studies. We apply the developed methods to systolic blood pressure data from the Framingham Heart Study, where some subjects took antihypertensive treatment during the course of the study.