Volume 7, Number 2, April 2009

  • Kasing Man and Chung Chen
    On a Stepwise Hypotheses Testing Procedure and Information Criterion in Identifying Dynamic Relations between Time Series
  • Reena Deutsch, Monica Rivera Mindt, Ronghui Xu, Mariana Cherner, Igor Grant, and the HNRC Group
    Quantifying Relative Superiority among Many Binary-valued Diagnostic Tests in the Presence of a Gold Standard
  • G. Jones, A. D. L. Noble, B. Schauer and N. Cogger
    Measuring the Attenuation in a Subject-specific Random Effect with Paired Data
  • Simon Sai Man Kwok, Wai Keung Li and Philip Leung Ho Yu
    The Autoregressive Conditional Marked Duration Model: Statistical Inference to Market Microstructure
  • Yungtai Lo
    Estimating Age-specific Prevalence of Testosterone Deficiency in Men Using Normal Mixture Models
  • Maela Kloareg and David Causeur
    Double Sampling Designs to Reduce the Non-discovery Rate: Application to Microarray Data
  • Timothy E. O'Brien and Gerald M. Funk
    Encouraging Students to Think Critically: Regression Modeling and Goodness-of-Fit
  • Gordon G. Bechtel
    Panel Regression of Arbitrarily Distributed Responses
  • Jae Keun Yoo
    Iterative Optimal Sufficient Dimension Reduction for Conditional Mean in Multivariate Regression
  • Fan C. Meng
    On Some Structural Importance of System Components

Journal of Data Science, v.7, no.2, p.139-159

On a Stepwise Hypotheses Testing Procedure and Information Criterion in Identifying Dynamic Relations between Time Series

by Kasing Man and Chung Chen

This paper studies an effective stepwise hypotheses testing procedure for identifying dynamic relations between time series, and its close connection with popular information criteria such as AIC and BIC. This procedure, labeled M2, extends Chen and Lee's (1990) procedure to cover both strong and weak form dynamic relations, and to be used with a guided choice of significance levels that are adaptive in nature. Intuitively, procedure M2 can be viewed as a backward-elimination approach that simplifies the all-possible pairwise comparisons approach implied by information criteria. New insights concerning identification of strong and weak form dynamic relations using these approaches are given. Extensive simulation experiments are conducted to illustrate the performance of the IC and M2 approaches in different settings. As an application, we study the dynamic relations between price level and interest rate in the US and UK, and also address the robustness of the identified model.
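
As a reminder of how the information criteria mentioned in this abstract trade fit against complexity, the following minimal sketch computes AIC and BIC from a log-likelihood and a parameter count and picks the model with the smallest value. It is not the M2 procedure of the paper. The model names, log-likelihoods and parameter counts are hypothetical values chosen for illustration only.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def select_by_criterion(candidates, n_obs, criterion="bic"):
    """Return (best_name, scores) for a dict name -> (log_likelihood, n_params).

    The model with the smallest criterion value is selected.
    """
    scores = {}
    for name, (ll, k) in candidates.items():
        scores[name] = aic(ll, k) if criterion == "aic" else bic(ll, k, n_obs)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted models for a pair of series (illustrative numbers only).
candidates = {
    "no_relation": (-520.0, 2),
    "x_leads_y": (-505.0, 4),
    "feedback": (-502.0, 6),
}

best_aic, _ = select_by_criterion(candidates, n_obs=200, criterion="aic")
best_bic, _ = select_by_criterion(candidates, n_obs=200, criterion="bic")
print(best_aic, best_bic)
```

With these made-up inputs AIC prefers the larger model while BIC, whose penalty grows with the sample size, prefers the smaller one.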

Journal of Data Science, v.7, no.2, p.161-177

Quantifying Relative Superiority among Many Binary-valued Diagnostic Tests in the Presence of a Gold Standard

by Reena Deutsch, Monica Rivera Mindt, Ronghui Xu, Mariana Cherner, Igor Grant, and the HNRC Group

Comparison of more than two diagnostic or screening tests for prediction of presence vs. absence of a disease or condition can be complicated when attempting to simultaneously optimize a pair of competing criteria such as sensitivity and specificity. We describe a technique for quantifying the relative superiority of a diagnostic test in this setting when a gold standard exists. The proposed superiority index is used to quantify and rank performance of diagnostic tests and combinations of tests. Development of a validated model containing a subset of the tests may be improved by eliminating tests having a very small value for this index. To illustrate, we present an example using a large battery of neuropsychological tests for prediction of cognitive impairment. Using the proposed index, the battery is reduced with favorable results.
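
The two competing criteria named in this abstract can be computed directly from binary test results and a gold standard. The sketch below does only that. It does not implement the superiority index, whose definition is not given here, and the data are hypothetical.

```python
def sensitivity_specificity(test_results, gold_standard):
    """Return (sensitivity, specificity) of a binary test against a gold standard.

    Both inputs are sequences of 0/1 values of equal length, where 1 means
    "condition present" (gold standard) or "test positive" (test result).
    """
    tp = fn = tn = fp = 0
    for t, g in zip(test_results, gold_standard):
        if g == 1:
            if t == 1:
                tp += 1
            else:
                fn += 1
        else:
            if t == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold standard must contain both positives and negatives")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: eight subjects, two candidate tests.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
test_a = [1, 1, 1, 0, 0, 0, 0, 1]
test_b = [1, 1, 0, 0, 0, 0, 0, 0]

print(sensitivity_specificity(test_a, gold))  # (0.75, 0.75)
print(sensitivity_specificity(test_b, gold))  # (0.5, 1.0)
```

Neither hypothetical test dominates the other on both criteria, which is the kind of situation that motivates a single ranking index.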

Journal of Data Science, v.7, no.2, p.179-188

Measuring the Attenuation in a Subject-specific Random Effect with Paired Data

by G. Jones, A. D. L. Noble, B. Schauer and N. Cogger

This paper is motivated by an investigation into the growth of pigs, which studied among other things the effect of short-term feed withdrawal on live weight. This treatment was thought to reduce the variability in the weights of the pigs. We represent this reduction as an attenuation in an animal-specific random effect. Given data on each pig before and after treatment, we consider the problems of testing for a treatment effect and measuring the strength of the effect, if significant. These problems are related to those of testing the homogeneity of correlated variances, and regression with errors in variables. We compare three different estimates of the attenuation factor using data on the live weights of pigs, and by simulation.

Journal of Data Science, v.7, no.2, p.189-201

The Autoregressive Conditional Marked Duration Model: Statistical Inference to Market Microstructure

by Simon Sai Man Kwok, Wai Keung Li and Philip Leung Ho Yu

We consider the Autoregressive Conditional Marked Duration (ACMD) model and apply it to 16 stocks traded on the Hong Kong Stock Exchange (SEHK). By examining the orderings of appropriate sets of model parameters, market microstructure phenomena can be explained. To substantiate these conclusions, a likelihood ratio test is used to test the significance of the parameter orderings of the ACMD model. While some of our results resolve a few controversial market microstructure hypotheses and echo some of the existing empirical evidence, we discover some interesting market microstructure phenomena that may be characteristic of SEHK.

Journal of Data Science, v.7, no.2, p.203-217

Estimating Age-specific Prevalence of Testosterone Deficiency in Men Using Normal Mixture Models

by Yungtai Lo

Testosterone levels decline as men age. There is little consensus on what testosterone levels are normal for aging men. In this paper, we estimate the age-specific prevalence of testosterone deficiency in men using normal mixture models when no generally agreed-upon cut-off value for defining testosterone deficiency is available. The Box-Cox power transformation is used to reduce skewness in the data so that they better suit normal mixture distributions. Parametric bootstrap tests are used to determine the number of components in a normal mixture.
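
For readers unfamiliar with the Box-Cox power transformation mentioned in this abstract, a minimal sketch follows. It applies the standard transformation for a given power and compares a moment-based skewness coefficient before and after. The data values and the choice of power are hypothetical, and the mixture fitting and bootstrap steps of the paper are not attempted.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam for lam != 0, log(x) for lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def sample_skewness(values):
    """Moment-based skewness coefficient m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (illustrative only).
data = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 8.0, 20.0]

raw_skew = sample_skewness(data)
log_skew = sample_skewness(box_cox(data, 0))  # lam = 0 is the log transform
print(raw_skew, log_skew)
```

In practice the power is usually chosen by maximum likelihood rather than fixed in advance; the fixed value here only keeps the example short.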

Journal of Data Science, v.7, no.2, p.219-234

Double Sampling Designs to Reduce the Non-discovery Rate: Application to Microarray Data

by Maela Kloareg and David Causeur

Simultaneous testing of a huge number of hypotheses is a core issue in high-throughput experimental methods such as microarrays for transcriptomic data. In the central debate about the type I error rate, Benjamini and Hochberg (1995) proposed a procedure that is shown to control the now popular False Discovery Rate (FDR) under the assumption of independence between the test statistics. These results have been extended to a larger class of dependency by Benjamini and Yekutieli (2001) and improvements have emerged in recent years, among which step-up procedures have shown desirable properties. The present paper focuses on the type II error rate. The proposed method improves the power by means of double-sampling test statistics integrating external information available both on the sample for which the outcomes are measured and also on additional items. The small sample distribution of the test statistics is provided and simulation studies are used to show the beneficial impact of introducing relevant covariates in the testing strategy. Finally, the present method is implemented in a situation where microarray data are used to select the genes that affect the degree of muscle destructuration in pigs. A phenotypic covariate is introduced in the analysis to improve the search for differentially expressed genes.
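
The Benjamini and Hochberg (1995) step-up procedure cited in this abstract can be sketched in a few lines. This is the standard FDR-controlling rule only, not the double-sampling method proposed in the paper, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= k * alpha / m,
    then reject the hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five hypotheses.
p = [0.029, 0.001, 0.2, 0.035, 0.025]
print(benjamini_hochberg(p, alpha=0.05))  # [True, True, False, True, True]
```

Note that the hypothesis with p = 0.025 is rejected even though it exceeds its own threshold of 0.02, because a larger p-value further down the ordering meets its threshold. This is what distinguishes a step-up procedure from a step-down one.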

Journal of Data Science, v.7, no.2, p.235-253

Encouraging Students to Think Critically: Regression Modeling and Goodness-of-Fit

by Timothy E. O'Brien and Gerald M. Funk

This note underscores important considerations that should be taken into account when teaching students to check for inadequacies of a given linear, nonlinear or logistic regression model. Key illustrations are provided that highlight the shortcomings of currently used procedures. A brief overview of nonlinear regression models is given in order to lay the foundation for testing for lack of fit in nonlinear models. This paper also introduces a new 'scaled' binary logistic regression model to highlight potential problems with the usual logistic model, and implications for choosing a robust optimal experimental design are also discussed.

Journal of Data Science, v.7, no.2, p.255-266

Panel Regression of Arbitrarily Distributed Responses

by Gordon G. Bechtel

The primary advantage of panel over cross-sectional regression stems from its control for the effects of omitted variables or "unobserved heterogeneity". However, panel regression is based on the strong assumptions that measurement errors are independently and identically distributed (i.i.d.) and normal. These assumptions are evaded by design-based regression, which dispenses with measurement errors altogether by regarding the response as a fixed real number.

The present paper establishes a middle ground between these extreme interpretations of longitudinal data. The individual is now represented as a panel of responses containing dependently non-identically distributed (d.n.d.) measurement errors. Modeling the expectations of these responses preserves the Neyman randomization theory, rendering panel regression slopes approximately unbiased and normal in the presence of arbitrarily distributed measurement error. The generality of this reinterpretation is illustrated with German Socio-Economic Panel (GSOEP) responses that are discretely distributed on a 3-point scale.

Journal of Data Science, v.7, no.2, p.267-276

Iterative Optimal Sufficient Dimension Reduction for Conditional Mean in Multivariate Regression

by Jae Keun Yoo

Recently, Yoo and Cook (2007) developed an optimal version of the method of Cook and Setodji (2003). When predictors are not highly skewed, the Yoo-Cook approach can be improved, especially with small samples, by iteratively estimating the inner product matrix used in their method without changing their asymptotic results. Since highly skewed predictors are often transformed for normality in the sufficient dimension reduction literature, the proposed method can have more useful application in practice than Yoo and Cook (2007).

Journal of Data Science, v.7, no.2, p.277-283

On Some Structural Importance of System Components

by Fan C. Meng

In this note a new method of comparing component structural importance is introduced and compared to existing ones. In particular, relationships of the new comparison method to the H-importance due to Hwang (2001, 2005), the criticality ordering due to Boland et al. (1989) and Birnbaum importance are obtained. Illustrative examples are given.
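
Of the measures named in this abstract, Birnbaum structural importance has a simple textbook definition: the fraction of states of the other components in which a given component is critical to the system. The sketch below computes it by enumeration for a small system. The three-component structure function is a hypothetical example, and the new comparison method of the paper is not implemented.

```python
from itertools import product


def birnbaum_structural_importance(structure_fn, n_components):
    """Return the Birnbaum structural importance of each component.

    structure_fn maps a tuple of 0/1 component states to the 0/1 system state.
    For component i, count the states of the other components for which
    switching i from 0 to 1 switches the system from 0 to 1, then divide
    by 2**(n - 1).
    """
    importances = []
    for i in range(n_components):
        critical = 0
        for others in product((0, 1), repeat=n_components - 1):
            up = others[:i] + (1,) + others[i:]
            down = others[:i] + (0,) + others[i:]
            critical += structure_fn(up) - structure_fn(down)
        importances.append(critical / 2 ** (n_components - 1))
    return importances


def series_parallel(x):
    """Hypothetical system: component 0 in series with components 1 and 2 in parallel."""
    return x[0] * max(x[1], x[2])


print(birnbaum_structural_importance(series_parallel, 3))  # [0.75, 0.25, 0.25]
```

Enumeration is exponential in the number of components, so this is only practical for small systems.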