Volume 7, Number 1, January 2009

  • Yuanjia Wang and Yixin Fang
    Least square and Empirical Bayes Approaches for Estimating Random Change Points
  • Sujuan Gao and Rodolphe Thiébaut
    Mixed-effect Models for Truncated Longitudinal Outcomes with Nonignorable Missing Data
  • Enrique E. Alvarez, Francisco J. Ciocchini and Kishori Konwar
    A Locally Stationary Markov Chain Model for Labor Dynamics
  • Alessio Pollice and Giovanna Jona Lasinio
    Two Approaches to Imputation and Adjustment of Air Quality Data from a Composite Monitoring Network
  • Geoffrey Jones and Paul Lyons
    Approximate Graphical Methods for Inverse Regression
  • A. Majumdar and R. Eubank
    Bayesian Semiparametric Sales Projections for the Texas Lottery
  • Terence C. Mills
    Modeling Current Temperature Trends
  • Aylin Alin, Serdar Kurt, Anthony Randal McIntosh, Adile Öniz and Murat Özgören
    Partial Least Squares Analysis in Electrical Brain Activity
  • Xiaoyin Wang and Kevin W. Koepenick
    A Statistical Analysis of Well Failures in Baltimore County
  • Roger L. Goodwin
    A Replicated Experiment Used in Manufacturing

Journal of Data Science, v.7, no.1, p.1-12

Least square and Empirical Bayes Approaches for Estimating Random Change Points

by Yuanjia Wang and Yixin Fang

Here we develop methods for applications where random change points are known a priori to be present and the interest lies in estimating them and investigating the risk factors that influence them. A simple least-squares method that estimates each individual's change point from that individual's own observations is first proposed. An easy-to-compute empirical Bayes type shrinkage is then proposed to pool information across the separately estimated change points. A method to improve the empirical Bayes estimates is developed. Simulations are conducted to compare the least-squares estimates and the Bayes shrinkage estimates. The proposed methods are applied to the Berkeley Growth Study data to estimate the transition age of pubertal height growth.
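The pooling step described in this abstract can be illustrated with a generic normal-normal shrinkage rule. The sketch below is not taken from the paper: the function name `eb_shrink`, the method-of-moments variance estimate, and the input numbers are all assumptions made for illustration. It takes change points that were already estimated separately for each individual, together with their sampling variances, and pulls each estimate toward the overall mean.

```python
from statistics import mean, variance


def eb_shrink(estimates, sampling_vars):
    """Shrink individual change-point estimates toward their overall mean.

    Assumed model (not from the paper): estimate_i ~ N(theta_i, s2_i) and
    theta_i ~ N(mu, tau2), with mu and tau2 estimated from the data.
    """
    if any(s2 <= 0 for s2 in sampling_vars):
        raise ValueError("sampling variances must be positive")
    mu = mean(estimates)
    # Method-of-moments estimate of the between-individual variance,
    # truncated at zero so it can never be negative.
    tau2 = max(0.0, variance(estimates) - mean(sampling_vars))
    shrunk = []
    for est, s2 in zip(estimates, sampling_vars):
        weight = tau2 / (tau2 + s2)  # weight in [0, 1) on the individual estimate
        shrunk.append(mu + weight * (est - mu))
    return shrunk


# Hypothetical individually estimated transition ages (years) and variances.
ages = [12.0, 14.0, 16.0]
pooled = eb_shrink(ages, [1.0, 1.0, 1.0])
```

Noisier individual estimates (larger sampling variance) receive a smaller weight and therefore move further toward the overall mean.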

Journal of Data Science, v.7, no.1, p.13-25

Mixed-effect Models for Truncated Longitudinal Outcomes with Nonignorable Missing Data

by Sujuan Gao and Rodolphe Thiébaut

Mixed effects models are often used for estimating fixed effects and variance components in continuous longitudinal outcomes. An EM based estimation approach for mixed effects models when the outcomes are truncated was proposed by Hughes (1999). We consider the situation when the longitudinal outcomes are also subject to non-ignorable missingness in addition to truncation. A shared random effect parameter model is presented where the missing data mechanism depends on the random effects used to model the longitudinal outcomes. Data from the Indianapolis-Ibadan dementia project is used to illustrate the proposed approach.

Journal of Data Science, v.7, no.1, p.27-42

A Locally Stationary Markov Chain Model for Labor Dynamics

by Enrique E. Alvarez, Francisco J. Ciocchini and Kishori Konwar

Labor market surveys usually partition individuals into three states: employed, unemployed, and out of the labor force. In particular, the Argentine "Encuesta Permanente de Hogares (EPH)" follows a rotating scheme so that each selected household is interviewed four times within two years. Each time, the current labor state of individuals is recorded, together with extensive demographic information. We model those labor paths as consecutive observations from independent Markov chains, where transition matrixes are related to covariates through a multivariate logistic link.

Because the EPH is severely affected by attrition, a significant fraction of the surveyed paths contain just one single point. Instead of discarding those observations, we opt to base estimation on the full data by (i) assuming the Markov chains are stationary and (ii) incorporating the chronological time of the first interview as an additional covariate for each individual. This novel treatment represents a convenient approximation, which we illustrate with data from Argentina in the period 1995-2002 via maximum likelihood estimation. Several interesting labor market indexes, which are functionally related to the transition matrixes, are also presented in the last portion of the paper and illustrated with real data.
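The stationarity assumption is what lets a path with a single observation contribute to the likelihood: its one observed state is treated as a draw from the stationary distribution of the chain. The sketch below illustrates that idea only. The transition matrix is hypothetical, the function names are invented for this example, and the covariates and logistic link described in the abstract are left out.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi satisfying pi = pi P by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def path_likelihood(path, P, pi):
    """Likelihood of an observed state path under a stationary chain."""
    like = pi[path[0]]  # first state is drawn from the stationary law
    for prev, cur in zip(path, path[1:]):
        like *= P[prev][cur]  # one transition probability per step
    return like


# Hypothetical transition matrix over the states
# 0 = employed, 1 = unemployed, 2 = out of the labor force.
P = [
    [0.90, 0.05, 0.05],
    [0.40, 0.40, 0.20],
    [0.10, 0.10, 0.80],
]
pi = stationary_distribution(P)
single_point = path_likelihood([1], P, pi)         # reduces to pi[1]
full_path = path_likelihood([0, 0, 1, 0], P, pi)   # pi[0] * 0.90 * 0.05 * 0.40
```

A one-point path therefore still carries information about the transition probabilities, because the stationary distribution is itself a function of the transition matrix.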

Journal of Data Science, v.7, no.1, p.43-59

Two Approaches to Imputation and Adjustment of Air Quality Data from a Composite Monitoring Network

by Alessio Pollice and Giovanna Jona Lasinio

An analysis of air quality data is provided for the municipal area of Taranto, which is characterized by high environmental risks due to the massive presence of industrial sites with elevated environmental impact activities. The present study is focused on particulate matter as measured by PM10 concentrations. Preliminary analysis involved addressing several data problems, mainly: (i) imputation techniques were considered to cope with the large number of missing data, due both to different working periods for groups of monitoring stations and to occasional malfunction of PM10 sensors; (ii) due to the use of different validation techniques for each of the three monitoring networks, a calibration procedure was devised to allow for data comparability. Missing data imputation and calibration were addressed by three alternative procedures sharing a leave-one-out type mechanism and based on ad hoc exploratory tools and on the recursive Bayesian estimation and prediction of spatial linear mixed effects models. The three procedures are introduced by motivating issues and compared in terms of performance.

Journal of Data Science, v.7, no.1, p.61-72

Approximate Graphical Methods for Inverse Regression

by Geoffrey Jones and Paul Lyons

Graphical procedures can be useful for illustrating and evaluating the process of inverse regression. We first review some simple and well-known graphical approaches for univariate linear and nonlinear models. We then propose a new graphical tool applicable to situations where the response is bivariate and repeated measures data are available. The proposed method is illustrated with an example of the age determination of tern chicks using measurements on body weight and wing length.

Journal of Data Science, v.7, no.1, p.73-87

Bayesian Semiparametric Sales Projections for the Texas Lottery

by A. Majumdar and R. Eubank

State lotteries employ sales projections to determine appropriate advertised jackpot levels for some of their games. This paper focuses on prediction of sales for the Lotto Texas game of the Texas Lottery. A novel prediction method is developed in this setting that utilizes functional data analysis concepts in conjunction with a Bayesian paradigm to produce predictions and associated precision assessments.

Journal of Data Science, v.7, no.1, p.89-97

Modeling Current Temperature Trends

by Terence C. Mills

Current trends in Northern Hemisphere and Central England temperatures are estimated using a variety of statistical signal extraction and filtering techniques and their extrapolations are compared with the predictions from coupled atmospheric-ocean general circulation models. Earlier warming trend epochs are also analysed and compared with the current warming trend, suggesting that the long-run patterns of temperature trends should also be considered alongside the current emphasis on global warming.

Journal of Data Science, v.7, no.1, p.99-110

Partial Least Squares Analysis in Electrical Brain Activity

by Aylin Alin, Serdar Kurt, Anthony Randal McIntosh, Adile Öniz and Murat Özgören

The partial least squares (PLS) method was designed to handle two problems that are commonly encountered in data from most applied sciences, including neuroimaging: 1) collinearity among the explanatory variables (X) or among the dependent variables (Y); 2) a small number of observations with a large number of explanatory variables. The idea behind the method is to explain as much of the covariance between the two blocks of X and Y variables as possible by a small number of uncorrelated variables. Apart from the other applied sciences in which PLS is used, in imaging applications PLS has been used to identify task-dependent changes in activity, changes in the relations between brain and behavior, and to examine the functional connectivity of one or more brain regions. The aim of this paper is to give some information about PLS and to apply it to electroencephalography (EEG) data to identify stimulation-dependent changes in EEG activity.
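The covariance-maximizing idea can be made concrete in the simplest setting, a single response variable. The sketch below is an illustrative special case and not the analysis used in the paper: the function names and the toy data are assumptions. After centering, it computes the unit-norm weight vector proportional to X'y, which is the direction whose scores have the largest sample covariance with y, and then the resulting latent scores.

```python
from math import sqrt


def center(column):
    """Subtract the mean from a list of numbers."""
    m = sum(column) / len(column)
    return [v - m for v in column]


def first_pls_component(X, y):
    """Return (weights, scores) of the first PLS component for one response."""
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    yc = center(y)
    # Weight vector proportional to X'y: it maximizes the sample covariance
    # between the scores Xw and y over all unit-norm w.
    w = [sum(c * v for c, v in zip(col, yc)) for col in cols]
    norm = sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("X and y have zero sample covariance")
    w = [v / norm for v in w]
    scores = [sum(w[j] * cols[j][i] for j in range(p)) for i in range(n)]
    return w, scores


# Toy data with two perfectly collinear explanatory variables.
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
y = [10.0, 20.0, 30.0]
w, t = first_pls_component(X, y)
```

Perfect collinearity between the two columns, which would make ordinary least squares ill-defined, poses no difficulty here because only one latent score per observation is extracted.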

Journal of Data Science, v.7, no.1, p.111-127

A Statistical Analysis of Well Failures in Baltimore County

by Xiaoyin Wang and Kevin W. Koepenick

A statistical evaluation of the Baltimore County water well database is performed to gain insight into the sustainability of domestic supply wells in crystalline bedrock aquifers over the last 15 years. Variables potentially related to well yield that are considered include well construction, geology, well depth, and static water level. A variety of statistical methods are utilized to assess correlation and significance from a database of approximately 8,500 wells, and a logistic regression model is developed to predict the probability of well failure by geology type. Results of a two-way analysis of variance technique indicate that the average well depth and yield are statistically different among the established geology groups, and between failed and non-failed wells. The static water level is shown to be statistically different among the geology groups but not between failed and non-failed wells. The logistic regression model indicates that well yield is the most influential variable for predicting well failure. Static water level and well depth were not found to be significant in predicting well failure.

Journal of Data Science, v.7, no.1, p.129-138

A Replicated Experiment Used in Manufacturing

by Roger L. Goodwin

Controlled experiments give researchers a statistical tool for determining the yield from subjecting an experimental unit to various treatments. We will discuss a replicated block design applied to yeast as the experimental unit. We subjected the yeast to six treatments. The purpose of the experiment is to extract a compound to be used in the manufacturing industry. We considered an ANOVA and a MANOVA model to analyze the data. The rationale for selecting one model over the other will be discussed. Results and recommendations on which treatments to use when processing the yeast will also be presented.