Volume 5, Number 2, April 2007

  • Exploring Gene Expression Data, Using Plots
  • Principal Component Analysis in Linear Regression Survival Model with Microarray Data
  • Small F-ratios: Red Flags in the Linear Model
  • Alternative Tests of Independence in Two-Way Categorical Tables
  • Singular Spectrum Analysis: Methodology and Comparison
  • Automated Linking PUBMED Documents with GO Terms Using SVM
  • Missing Information as a Diagnostic Tool for Latent Class Analysis
  • Nonparametric Estimation of the Incubation Period of AIDS with Left Truncation and Right Censoring

In Memory of Professor Jack C. Lee (1941-2007)

Journal of Data Science, v.5, no.2, p.151-182

Exploring Gene Expression Data, Using Plots

by Dianne Cook, Heike Hofmann, Eun-Kyung Lee, Hao Yang, Basil Nikolau and Eve Wurtele

This paper describes how to explore gene expression data using a combination of graphical and numerical methods. We start from the general methodology for multivariate data visualization, describing heatmaps, parallel coordinate plots and scatterplots. We propose new methods for gene expression data analysis using direct manipulation graphics. With linked scatterplots and parallel coordinate plots we explore gene expression data in ways that differ from common practice. To check replicates in relation to treatments we introduce a new type of plot called a "replicate line" plot. A worked example focuses on an experimental study containing two two-level factors, genotype and cofactor presence, with two replicates.
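
As a rough illustration of one plot type discussed (not the authors' software; data and column names below are purely synthetic), a parallel coordinate plot of expression profiles across a two-by-two design with replicates can be sketched in Python:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    # Synthetic data: 50 genes under a 2x2 design (genotype x cofactor),
    # two replicates per cell -> 8 measurement columns.
    rng = np.random.default_rng(0)
    cols = ["wt_minus_r1", "wt_minus_r2", "wt_plus_r1", "wt_plus_r2",
            "mut_minus_r1", "mut_minus_r2", "mut_plus_r1", "mut_plus_r2"]
    expr = pd.DataFrame(rng.normal(size=(50, 8)), columns=cols)
    # Label each gene by the column where |expression| is largest,
    # a stand-in for a cluster or test-based grouping.
    expr["group"] = expr[cols].abs().idxmax(axis=1).str[:3]

    # One line per gene across the treatment/replicate axes; genes in
    # the same group share a color, making treatment patterns visible.
    parallel_coordinates(expr, "group", cols=cols, alpha=0.4)
    plt.ylabel("expression")
    plt.show()

The direct-manipulation linking described in the paper requires an interactive graphics system; this static sketch shows only the profile view.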

Journal of Data Science, v.5, no.2, p.183-198

Principal Component Analysis in Linear Regression Survival Model with Microarray Data

by Steven Ma

As a useful alternative to the Cox proportional hazards model, the linear regression survival model assumes a linear relationship between the covariates and a known monotone transformation, for example the logarithm, of an event time of interest. In this article, we study the linear regression survival model with right censored survival data, when high-dimensional microarray measurements are present. Such data may arise in studies investigating the statistical influence of molecular features on survival risk. We propose using the principal component regression (PCR) technique for model reduction based on the weighted least squares Stute estimate. Compared with other model reduction techniques, the PCR approach is relatively insensitive to the number of covariates and hence suitable for high-dimensional microarray data. Component selection based on the nonparametric bootstrap, and model evaluation using the time-dependent ROC (receiver operating characteristic) technique are investigated. We demonstrate the proposed approach with datasets from two microarray gene expression profiling studies of lymphoma cancers.
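
A sketch of the general approach, with illustrative data, an assumed no-ties simplification for the Kaplan-Meier jump weights, and a log transformation; this is not the paper's exact estimator:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    def km_jump_weights(time, event):
        # Kaplan-Meier jumps at uncensored times, used as Stute-type
        # case weights; assumes no tied observation times.
        order = np.argsort(time)
        d = event[order].astype(float)
        n = len(d)
        w = np.zeros(n)
        surv = 1.0
        for i in range(n):
            at_risk = n - i
            if d[i] == 1:
                w[i] = surv / at_risk          # jump of the KM estimate
                surv *= (at_risk - 1) / at_risk
        return w, order

    def pcr_aft(X, time, event, n_comp=5):
        w, order = km_jump_weights(time, event)
        Xs = (X - X.mean(0)) / X.std(0)        # standardize genes
        Z = PCA(n_components=n_comp).fit_transform(Xs)
        # Weighted least squares of log event time on component scores.
        return LinearRegression().fit(Z[order], np.log(time[order]),
                                      sample_weight=w)

    # Illustrative use: p >> n, as with microarray data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 500))
    time = rng.exponential(1.0, 100)
    event = rng.integers(0, 2, 100)
    fit = pcr_aft(X, time, event)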

Journal of Data Science, v.5, no.2, p.199-215

Small F-ratios: Red Flags in the Linear Model

by Gary E. Meek, Ceyhun Ozgur and Kenneth A. Dunning

All textbooks and articles dealing with classical tests in the context of linear models stress the implications of a significantly large F-ratio, since it indicates that the mean square for whatever effect is being evaluated contains significantly more than just error variation. In general, though, with one minor exception, all texts and articles known to the authors ignore the implications of an F-ratio that is significantly smaller than one would expect due to chance alone. Why this is so is difficult to explain, since such an occurrence is similar to a range falling below the lower limit on a control chart for variation, or a sample proportion falling below the lower limit on a control chart for proportion defective. In both of those cases the small value represents an unusual and significant occurrence and, if valid, a process change that indicates an improvement; it therefore behooves the quality manager to determine what that change is in order to have it continue. In the case of a significantly small F-ratio, some problem may be indicated that requires the designer of the experiment to identify it and take "corrective action". While graphical procedures are available to help identify some of the possible problems discussed, they are somewhat subjective when it comes to deciding whether one is looking at an actual effect (e.g., an interaction) or a result merely due to random variation. A significantly small F-ratio can support conclusions based on the graphical procedures by providing a level of statistical significance, as well as serving as a red flag that problems may exist in the design and/or analysis.
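
The lower-tail check itself is simple to compute; a minimal sketch in Python with illustrative degrees of freedom:

    from scipy import stats

    f_obs, df_effect, df_error = 0.08, 3, 24    # illustrative values
    p_lower = stats.f.cdf(f_obs, df_effect, df_error)
    print(f"P(F <= {f_obs}) = {p_lower:.4f}")
    # A tiny lower-tail probability (e.g. below 0.05) flags an F-ratio
    # significantly smaller than chance alone would produce.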

Journal of Data Science, v.5, no.2, p.217-237

Alternative Tests of Independence in Two-Way Categorical Tables

by Balgobin Nandram and Jai Won Choi

The chi-squared test for independence in two-way categorical tables depends on the assumption that the data follow a multinomial distribution, so we suggest alternatives for when this assumption does not hold. First, we consider the Bayes factor, which is used for hypothesis testing in Bayesian statistics. Unfortunately, it is sensitive to the choice of prior distributions. We note here that the intrinsic Bayes factor is not appropriate because the prior distributions under consideration are all proper. Thus, we propose using Bayesian estimation, which is generally not as sensitive to prior specifications as the Bayes factor. Our approach is to construct a 95% simultaneous credible region (i.e., a hyper-rectangle) for the interactions. A test that all interactions are zero is equivalent to a test of independence in a two-way categorical table, so a 95% simultaneous credible region for the interactions provides a test of independence by inversion.
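
A rough sketch of the estimation-based test, assuming (for illustration only) a flat Dirichlet prior and a max-standardized-deviation construction for the simultaneous region; the paper's exact prior and construction may differ:

    import numpy as np

    rng = np.random.default_rng(0)
    counts = np.array([[30.0, 10.0], [12.0, 28.0]])    # illustrative 2x2 table
    draws = rng.dirichlet(counts.ravel() + 1.0, 5000)  # posterior cell probabilities
    logp = np.log(draws).reshape(-1, 2, 2)
    # Log-linear interactions: double-centered log probabilities,
    # which are all exactly zero under independence.
    lam = (logp - logp.mean(1, keepdims=True) - logp.mean(2, keepdims=True)
           + logp.mean((1, 2), keepdims=True)).reshape(5000, -1)
    # 95% simultaneous credible hyper-rectangle via the max standardized deviation.
    m, s = lam.mean(0), lam.std(0)
    k = np.quantile(np.abs((lam - m) / s).max(axis=1), 0.95)
    lo, hi = m - k * s, m + k * s
    # Reject independence if zero falls outside the rectangle in any coordinate.
    print("reject independence:", bool(np.any((lo > 0) | (hi < 0))))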

Journal of Data Science, v.5, no.2, p.239-257

Singular Spectrum Analysis: Methodology and Comparison

by Hossein Hassani

In recent years Singular Spectrum Analysis (SSA), a powerful technique in time series analysis, has been developed and applied to many practical problems. In this paper, the performance of the SSA technique is assessed by applying it to a well-known time series data set, namely, monthly accidental deaths in the USA. The results are compared with those obtained using Box-Jenkins SARIMA models, the ARAR algorithm and the Holt-Winters algorithm (as described in Brockwell and Davis (2002)). The results show that the SSA technique gives a much more accurate forecast than the other methods indicated above.
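
A minimal sketch of basic SSA (embedding, SVD, diagonal averaging) in Python, using a synthetic series rather than the accidental-deaths data:

    import numpy as np

    def ssa_reconstruct(x, L, r):
        # Embed series x with window length L, keep the leading r
        # singular components, and diagonally average back to a series.
        N = len(x)
        K = N - L + 1
        X = np.column_stack([x[i:i + L] for i in range(K)])  # L x K trajectory matrix
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Xr = (U[:, :r] * s[:r]) @ Vt[:r]                     # rank-r approximation
        # Hankel (anti-diagonal) averaging recovers a series of length N.
        rec = np.zeros(N)
        cnt = np.zeros(N)
        for j in range(K):
            rec[j:j + L] += Xr[:, j]
            cnt[j:j + L] += 1
        return rec / cnt

    # Illustrative use: trend + annual seasonality + noise.
    rng = np.random.default_rng(0)
    t = np.arange(72)
    x = 0.05 * t + np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=72)
    smooth = ssa_reconstruct(x, L=24, r=3)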

Journal of Data Science, v.5, no.2, p.259-267

Automated Linking PUBMED Documents with GO Terms Using SVM

by Su-Shing Chen and Hyunki Kim

We have developed an automated scheme for linking PUBMED citations with GO terms using SVM (Support Vector Machine), a classification algorithm. With over 12 million citations, the PUBMED database has been essential to life science researchers. More recently, GO (Gene Ontology) has provided a graph structure for the biological processes, cellular components, and molecular functions of genomic data. By text mining the textual content of PUBMED and associating it with GO terms, we have built an ontological map for these databases, so that users can search PUBMED via GO terms and, conversely, GO entries via PUBMED classification. Consequently, some interesting and unexpected knowledge may be captured for further data analysis and biological experimentation. This paper reports our results on the SVM implementation and the need to parallelize the training phase.
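
A toy sketch of the classification step using scikit-learn stand-ins; the authors' feature set, kernel choice, and parallel training setup are not reproduced here, and the documents and labels below are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for PUBMED abstracts labeled with a GO category;
    # a real system would train one classifier per GO term.
    docs = ["kinase phosphorylates substrate in signal transduction",
            "ribosome assembly requires nucleolar processing",
            "receptor binding activates downstream kinase cascade",
            "nucleolus hosts ribosomal RNA transcription"]
    labels = ["signal transduction", "ribosome biogenesis",
              "signal transduction", "ribosome biogenesis"]

    # TF-IDF features fed to a linear-kernel SVM.
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["kinase cascade mediates receptor signaling"]))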

Journal of Data Science, v.5, no.2, p.269-288

Missing Information as a Diagnostic Tool for Latent Class Analysis

by Ofer Harel and Diana Miglioretti

Latent class analysis (LCA) is a popular method for analyzing multiple categorical outcomes. Given the potential for LCA model assumptions to influence inference, model diagnostics are a particularly important part of LCA. We suggest using the rate of missing information as an additional diagnostic tool. The rate of missing information indicates how much information is lost as a result of observing multiple surrogates in place of the underlying latent variable of interest, and provides a measure of how confident one can be in the model results. Simulation studies and real data examples are presented to explore the usefulness of the proposed measure.
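
As a numerical sketch of one common formulation (an assumption here, not necessarily the paper's exact definition): the rates of missing information can be read off the eigenvalues comparing observed-data to complete-data Fisher information, illustrated with made-up matrices:

    import numpy as np

    # Made-up information matrices for a two-parameter model.
    I_comp = np.array([[4.0, 0.5], [0.5, 2.0]])   # complete-data information
    I_obs = np.array([[2.8, 0.3], [0.3, 1.1]])    # observed-data information
    # Rates of missing information: eigenvalues of I - I_comp^{-1} I_obs.
    rates = 1.0 - np.linalg.eigvals(np.linalg.solve(I_comp, I_obs)).real
    print("rates of missing information:", np.round(np.sort(rates), 3))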

Journal of Data Science, v.5, no.2, p.289-296

Nonparametric Estimation of the Incubation Period of AIDS with Left Truncation and Right Censoring

by Swami Onkar Shivraj, Debasish Chattopadhya and Gurprit Grover

In the natural history of Human Immunodeficiency Virus Type-1 (HIV-1) infection, many studies include participants who were already seropositive at the time of enrollment. Estimating the unknown times since exposure to HIV-1 in such prevalent cohorts is of primary importance for estimating the incubation period of Acquired Immunodeficiency Syndrome (AIDS). To estimate the incubation period of AIDS we used a prior distribution of incubation times based on external data, as suggested by Bacchetti and Jewell (1991, Biometrics, 47, 947-960). In the present study, our estimate is nonparametric, based on a method proposed by Wang, Jewell and Tsai (1986, Annals of Statistics, 14, 1597-1605).
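
A sketch of a product-limit estimator under left truncation and right censoring, in the spirit of Wang, Jewell and Tsai (1986); it assumes no tied event times, and the data below are synthetic:

    import numpy as np

    def truncated_product_limit(entry, time, event):
        # Product-limit estimate under left truncation and right censoring:
        # the risk set at t contains subjects with entry <= t <= observed time.
        # Assumes no tied event times.
        death_times = np.sort(time[event == 1])
        S, surv = 1.0, []
        for t in death_times:
            at_risk = np.sum((entry <= t) & (time >= t))
            S *= 1.0 - 1.0 / at_risk
            surv.append(S)
        return death_times, np.array(surv)

    # Synthetic left-truncated, right-censored data.
    rng = np.random.default_rng(0)
    entry = rng.uniform(0, 2, 200)                 # delayed-entry times
    event_t = entry + rng.exponential(3.0, 200)    # event times past entry
    cens_t = entry + rng.exponential(5.0, 200)     # censoring times
    time = np.minimum(event_t, cens_t)
    event = (event_t <= cens_t).astype(int)
    t, S = truncated_product_limit(entry, time, event)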