Volume 3, Number 4, October 2005

  • A Unified Computational Framework to Compare Direct and Sequential False Discovery Rate Algorithms for Exploratory DNA Microarray Studies
  • Application of a Mixed Effects Model for Biosurveillance of Regional Rail Systems
  • Deinstitutionalization in California: Mortality of Persons with Developmental Disabilities after Transfer into Community Care, 1997-1999
  • Industrial Effects and the CAPM: From the Views of Robustness and Longitudinal Data Analysis
  • Predicting Confidence Intervals for the Age-Period-Cohort Model
  • Skew-normal Linear Mixed Models
  • Sampling Random Variables: A Paradigm Shift for Opinion Polling
  • A Monte Carlo Comparison of Two Linear Dimension Reduction Matrices for Statistical Discrimination

Journal of Data Science, v.3, no.4, p.331-352

A Unified Computational Framework to Compare Direct and Sequential False Discovery Rate Algorithms for Exploratory DNA Microarray Studies

by Danh V. Nguyen

The problem of detecting differential gene expression with microarray data has led to innovative approaches to controlling false positives in multiple testing. The false discovery rate (FDR) has been widely used as a measure of error in this multiple testing context. Direct estimation of the FDR was recently proposed by Storey (2002, Journal of the Royal Statistical Society, Series B 64, 479-498) as a substantially more powerful alternative to the traditional sequential FDR controlling procedure pioneered by Benjamini and Hochberg (1995, Journal of the Royal Statistical Society, Series B 57, 289-300). Direct estimation of the FDR requires fixing a rejection region of interest and then conservatively estimating the associated FDR. The sequential FDR procedure, on the other hand, requires fixing an FDR control level and then estimating the rejection region. Thus, the sequential and direct approaches to FDR control appear very different. In this paper, we introduce a unified computational framework for sequential FDR methods and propose a class of more powerful sequential FDR algorithms that link the direct and sequential approaches. Under the proposed unified computational framework, both approaches simply approximate the least conservative (optimal) sequential FDR procedure. We illustrate the FDR algorithms and concepts with numerical studies (simulations) and with two real exploratory DNA microarray studies, one on the detection of molecular signatures in {\it BRCA}-mutation breast cancer patients and another on the detection of genetic signatures during colon cancer initiation and progression in the rat.
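For readers who want to see the two ingredients side by side, here is a minimal Python sketch of the textbook versions of both procedures (not the authors' unified algorithm): the Benjamini-Hochberg step-up rule for a fixed FDR level, and Storey's direct FDR estimate for a fixed rejection region. The tuning parameter lam and the simulated p-values are illustrative assumptions only.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        """Sequential (step-up) FDR control: fix the FDR level q, then find the
        largest k with p_(k) <= k*q/m and reject the k smallest p-values."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m
        below = p[order] <= thresholds
        k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject

    def storey_fdr(pvals, t, lam=0.5):
        """Direct approach: fix the rejection region [0, t], then conservatively
        estimate its FDR, with the null proportion estimated from p-values above lam."""
        p = np.asarray(pvals)
        m = len(p)
        pi0 = np.mean(p > lam) / (1.0 - lam)      # estimated null proportion
        r = max(np.sum(p <= t), 1)                # number of rejections in [0, t]
        return min(pi0 * m * t / r, 1.0)          # estimated FDR at threshold t

    # Illustrative p-values (hypothetical, not from the paper's microarray data)
    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.2, 5, size=100)])
    print(benjamini_hochberg(pvals, q=0.05).sum(), "tests rejected at FDR 0.05")
    print("estimated FDR for region p <= 0.01:", round(storey_fdr(pvals, t=0.01), 3))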

Journal of Data Science, v.3, no.4, p.353-370

Application of a Mixed Effects Model for Biosurveillance of Regional Rail Systems

by Robert J. Gallop, Charles J. Mode, Kenneth Blank, Sherri M. Jurgens and Chad Schaben

Although United States government planners and others outside government had recognized the potential risk of attacks by terrorists, the events of September 11, 2001, vividly revealed the nation's vulnerability to terrorism. Similarly, the 2004 terrorist attacks in Madrid illustrated that vulnerabilities to terrorism extend beyond the United States. Those attacks were overtly destructive acts whose primary purpose was mass casualties. Consider instead a bioterrorist attack conducted subtly through the release of a chemical or biological agent. If such an attack occurs through the release of a specific biological agent, an awareness of the potential threat of this agent, in terms of the number of infections and deaths that could occur in a community, is of paramount importance in preparing the public health community to respond. Increased biosurveillance and novel approaches to biosurveillance are needed. This paper illustrates the use of a mixed effects model for biosurveillance based on commuter counts for regional rail lines. With a mixed effects model we can estimate, for any station on a given rail system, the expected daily number of commuters and establish an acceptability criterion around this expected size. If the actual commuter count is significantly smaller than the estimate, this could be an indicator of a possible attack. We illustrate the method with an example based on the 2001 daily totals for the Port Authority Transit Corporation (PATCO) rail system, which serves residents of southern New Jersey and the Philadelphia region in the United States. In addition, we discuss ways to put this application in a real-time setting for continuous biosurveillance.
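As a rough illustration of the kind of acceptability criterion described, the following Python sketch fits a linear mixed model with a random station intercept and day-of-week fixed effects to hypothetical ridership counts (not PATCO data); the model specification and the 1.96-standard-deviation lower bound are simplifying assumptions, not the authors' exact formulation.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)

    # Hypothetical daily ridership counts for a few stations (not PATCO data)
    stations = ["A", "B", "C"]
    days = pd.date_range("2001-01-01", periods=120, freq="D")
    rows = []
    for s, base in zip(stations, [5000, 3000, 1500]):
        for d in days:
            weekday_effect = 0.4 * base if d.weekday() < 5 else 0.0
            rows.append({"station": s, "weekday": d.weekday(),
                         "riders": base + weekday_effect + rng.normal(0, 200)})
    df = pd.DataFrame(rows)

    # Linear mixed model: day-of-week fixed effects, random intercept per station
    model = smf.mixedlm("riders ~ C(weekday)", df, groups=df["station"])
    fit = model.fit()

    # Acceptability criterion: flag a day whose count falls far below its
    # model-based expectation (a rough lower bound on the residual scale)
    df["expected"] = fit.fittedvalues
    df["lower"] = df["expected"] - 1.96 * np.sqrt(fit.scale)
    df["flagged"] = df["riders"] < df["lower"]
    print(df.loc[df["flagged"], ["station", "weekday", "riders", "expected"]].head())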

Journal of Data Science, v.3, no.4, p.371-380

Deinstitutionalization in California: Mortality of Persons with Developmental Disabilities after Transfer into Community Care, 1997-1999

by Robert Shavelle, David Strauss and Steven Day

More than 2,000 persons with developmental disabilities were transferred from California institutions into community care between 1993 and early 1996. Using data on 1,878 children and adults moved between April 1, 1993 and March 5, 1996, Strauss, Shavelle, Baumeister and Anderson (1998) found a corresponding increase in mortality rates by comparison with those who stayed behind. Shavelle and Strauss (1999) updated the study through 1996 and found similar results. The present study is a further update through 1999. There were 81 deaths, a 47% increase in risk-adjusted mortality over that expected in institutions ($p < 0.01$). As in the two previous studies, we found that persons transferred later were at higher risk than those moved earlier, even after adjustment for differences in risk profiles. The difference cannot be explained by the short-term effects of the transfer and therefore appears to reflect an increased mortality rate associated with the less intensive medical care and supervision available in the community.
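The headline comparison is an observed-versus-expected calculation. The paper's risk adjustment is more elaborate, but the arithmetic implied by the reported figures can be sketched as follows (the one-sided Poisson test here is our simplification, not the authors' method).

    from scipy.stats import poisson

    observed = 81                        # deaths among those transferred (from the abstract)
    excess = 0.47                        # reported risk-adjusted excess
    expected = observed / (1 + excess)   # implied institutional expectation, about 55

    # One-sided Poisson test of observed vs. expected deaths
    p_value = poisson.sf(observed - 1, expected)
    print(f"SMR = {observed / expected:.2f}, one-sided p = {p_value:.4f}")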

Journal of Data Science, v.3, no.4, p.381-401

Industrial Effects and the CAPM: From the Views of Robustness and Longitudinal Data Analysis

by Tsung-Chi Cheng, Hung-Neng Lai, and Chien-Ju Lu

The traditional approach of Fama and MacBeth (1973) to testing the validity of an asset pricing model suffers from two drawbacks. First, it uses the ordinary least squares (OLS) method, which is sensitive to outliers, to estimate the time-series betas. Second, it averages the slope coefficients from the cross-sectional regressions, which ignores their time-series properties. In this article, robust estimators and a longitudinal approach are applied to avoid both problems. We use data on the electronics industry in Taiwan's stock market from September 1998 to December 2001 to examine whether betas from the Capital Asset Pricing Model (CAPM) are a valid measure of risk and whether the industries to which firms belong explain excess returns. The methods we propose yield more explanatory power than the traditional OLS results.
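For context, here is a brief Python sketch of the traditional two-pass Fama-MacBeth procedure that the authors criticize, on simulated returns; this is the OLS baseline only, not the robust or longitudinal estimators proposed in the paper.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical excess returns: T months x N stocks plus a market factor
    T, N = 60, 25
    market = rng.normal(0.01, 0.05, size=T)
    true_beta = rng.uniform(0.5, 1.5, size=N)
    returns = np.outer(market, true_beta) + rng.normal(0, 0.03, size=(T, N))

    # Pass 1 (traditional): time-series OLS beta for each stock
    X = np.column_stack([np.ones(T), market])
    betas = np.linalg.lstsq(X, returns, rcond=None)[0][1]      # slope on the market

    # Pass 2 (traditional): month-by-month cross-sectional regressions of returns
    # on the estimated betas, then an average of the slopes (the step that the
    # authors argue ignores the time-series structure of the coefficients)
    Z = np.column_stack([np.ones(N), betas])
    gammas = np.array([np.linalg.lstsq(Z, returns[t], rcond=None)[0] for t in range(T)])
    gamma_bar = gammas.mean(axis=0)
    gamma_se = gammas.std(axis=0, ddof=1) / np.sqrt(T)
    print("average risk premium on beta:", gamma_bar[1], "+/-", gamma_se[1])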

Journal of Data Science, v.3, no.4, p.403-414

Predicting Confidence Intervals for the Age-Period-Cohort Model

by Naser B. Elkum

Forecasting incidence and/or mortality rates of cancer is of special interest to epidemiologists, health researchers and other planners in predicting the demand for health care. This paper proposes a methodology for developing prediction intervals for forecasts from Poisson age-period-cohort (APC) models. Annual Canadian age-specific prostate cancer mortality rates among males aged 45 years or older for the period 1950 to 1990 are calculated using 5-year intervals. The data were analyzed by fitting an APC model to the logarithm of the mortality rate. Based on the fit to the 1950 to 1979 data, the known prostate cancer mortality in 1980 to 1990 is estimated. The period effects for 1970-1979 are extended linearly to estimate the next ten period effects. With the aims of parsimony, scientific validity, and a reasonable fit to the existing data, two possible forms are evaluated: the age-period and the age-period-cohort models. The asymptotic 95% prediction intervals are based on the standard errors under an assumption of normality (estimate $\pm 1.96 \times$ standard error of the estimate).
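A minimal Python sketch of the interval construction described above, using hypothetical numbers rather than the Canadian prostate-cancer fit; the linear extension of the period effects and the back-transformation of the log-scale interval to the rate scale are illustrative assumptions.

    import numpy as np

    # Linear extension of period effects (as described, earlier period effects are
    # extrapolated to later periods); the fitted effects here are hypothetical
    periods = np.arange(1970, 1980)
    period_effects = -0.002 * (periods - 1970) + 0.01
    slope, intercept = np.polyfit(periods, period_effects, 1)
    future = np.arange(1980, 1991)
    extended_effects = intercept + slope * future

    # 95% prediction interval: estimate +/- 1.96 * SE, formed on the log scale
    # for a single hypothetical age group and future period, then mapped back
    log_rate_forecast = np.log(120 / 100_000)   # hypothetical: 120 deaths per 100,000
    se_log_rate = 0.08                          # hypothetical asymptotic standard error
    lo, hi = log_rate_forecast + np.array([-1.96, 1.96]) * se_log_rate
    print("rate per 100,000: %.1f  (95%% PI %.1f to %.1f)"
          % (1e5 * np.exp(log_rate_forecast), 1e5 * np.exp(lo), 1e5 * np.exp(hi)))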

Journal of Data Science, v.3, no.4, p.415-438

Skew-normal Linear Mixed Models

by R. B. Arellano-Valle, H. Bolfarine and V. H. Lachos

Normality (symmetry) of the random effects and the within-subject errors is a routine assumption for the linear mixed model, but it may be unrealistic, obscuring important features of among- and within-subject variation. We relax this assumption by allowing the random effects and model errors to follow skew-normal distributions, which include normality as a special case and provide flexibility in capturing a broad range of non-normal behavior. The marginal distribution of the observed quantities is derived and expressed in closed form, so inference may be carried out using existing statistical software and standard optimization techniques. We also implement an EM-type algorithm, which seems to provide some advantages over direct maximization of the likelihood. Results of simulation studies and applications to real data sets are reported.
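For reference, here is the univariate skew-normal density underlying this model family, in the usual Azzalini parametrization; the paper works with multivariate versions in the mixed-model setting, and this sketch only shows that a zero shape parameter recovers the normal density.

    import numpy as np
    from scipy.stats import norm

    def skew_normal_pdf(x, loc=0.0, scale=1.0, shape=0.0):
        """Skew-normal density: (2/scale) * phi(z) * Phi(shape * z),
        with z = (x - loc) / scale.  shape = 0 recovers the normal density."""
        z = (x - loc) / scale
        return 2.0 / scale * norm.pdf(z) * norm.cdf(shape * z)

    x = np.linspace(-4, 4, 9)
    print(np.allclose(skew_normal_pdf(x, shape=0.0), norm.pdf(x)))   # True: normal special case
    print(skew_normal_pdf(x, shape=3.0))                             # right-skewed density values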

Journal of Data Science, v.3, no.4, p.439-448

Sampling Random Variables: A Paradigm Shift for Opinion Polling

by Gordon G. Bechtel

Conventional sampling in biostatistics and economics posits an individual in a fixed observable state (e.g., diseased or not, poor or not). Social, market, and opinion research, however, requires a cognitive sampling theory which recognizes that a respondent has a choice between two options (e.g., yes versus no). This new theory posits the survey respondent as a personal probability. Once the sample is drawn, a series of independent, non-identical Bernoulli trials is carried out. The outcome of each trial is a momentary binary choice governed by this unobserved probability. Liapunov's extended central limit theorem (Lehmann, 1999) and the Horvitz-Thompson (1952) theorem are then brought to bear on sampling unobservables, in contrast to sampling observations. This formulation reaffirms the usefulness of a weighted sample proportion, which is now seen to estimate a different target parameter than that of conventional design-based sampling theory.
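A small simulation may make the target parameter concrete. The Python sketch below (hypothetical numbers throughout) draws a finite population of personal probabilities, samples respondents with unequal inclusion probabilities, records one Bernoulli answer per sampled respondent, and forms the Horvitz-Thompson weighted sample proportion, which tracks the mean personal probability rather than a fixed-state population share.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical finite population of N respondents, each characterized by a
    # personal probability of answering "yes" (the unobservable in the new theory)
    N = 10_000
    personal_prob = rng.beta(2, 3, size=N)          # target: their population mean

    # Unequal-probability sample with known inclusion probabilities
    incl_prob = 0.002 + 0.008 * rng.uniform(size=N)
    sampled = rng.uniform(size=N) < incl_prob

    # Each sampled respondent gives one binary answer governed by his or her
    # personal probability (a single non-identical Bernoulli trial)
    answers = rng.uniform(size=N) < personal_prob

    # Horvitz-Thompson weighting with weights 1 / incl_prob; the weighted sample
    # proportion estimates the mean personal probability
    w = 1.0 / incl_prob[sampled]
    estimate = np.sum(w * answers[sampled]) / np.sum(w)
    print(round(estimate, 3), "vs population mean personal probability",
          round(personal_prob.mean(), 3))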

Journal of Data Science, v.3, no.4, p.449-464

A Monte Carlo Comparison of Two Linear Dimension Reduction Matrices for Statistical Discrimination

by J. Wade Davis, Dean M. Young and Karin B. Ernstrom-Keim

We compare two linear dimension-reduction methods for statistical discrimination in terms of average probabilities of misclassification in the reduced dimensions. Using Monte Carlo simulation, we compare the dimension-reduction methods over several different parameter configurations of multivariate normal populations and find that the two methods yield very different results. We also apply the two dimension-reduction methods examined here to data from a study on football helmet design and neck injuries.