### Journal of Data Science, v.7, no.1, p.1-12

#### Least Square and Empirical Bayes Approaches for Estimating Random Change Points

##### by Yuanjia Wang and Yixin Fang

- Full Text (PDF): [142.52kB]

Here we develop methods for applications where random change points are known a priori to be present, and the interest lies in estimating them and in investigating the risk factors that influence them. A simple least-squares method that estimates each individual's change point from that individual's own observations is proposed first. An easy-to-compute empirical Bayes type shrinkage is then proposed to pool information across the separately estimated change points, and a method to improve the empirical Bayes estimates is developed. Simulations are conducted to compare the least-squares estimates with the Bayes shrinkage estimates. The proposed methods are applied to the Berkeley Growth Study data to estimate the transition age of pubertal height growth.
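The two-stage idea in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a broken-stick (two-segment linear) mean curve, a grid search over candidate change points, and a common, known sampling variance for the individual estimates. The function names `fit_change_point` and `shrink` are hypothetical.

```python
import numpy as np

def fit_change_point(t, y, candidates):
    """Least-squares change point for one individual (assumed broken-stick model).

    For each candidate tau, fit y = b0 + b1*t + b2*max(t - tau, 0) by ordinary
    least squares and keep the tau with the smallest residual sum of squares.
    """
    best_tau, best_sse = None, np.inf
    for tau in candidates:
        X = np.column_stack([np.ones_like(t), t, np.maximum(t - tau, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        if sse < best_sse:
            best_tau, best_sse = tau, sse
    return best_tau

def shrink(estimates, sampling_var):
    """Empirical Bayes type shrinkage toward the grand mean.

    Assumes every estimate has the same known sampling variance. The
    between-individual variance is estimated by the method of moments.
    """
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean()
    between = max(estimates.var(ddof=1) - sampling_var, 0.0)
    total = sampling_var + between
    weight = sampling_var / total if total > 0 else 0.0  # shrinkage factor
    return mean + (1.0 - weight) * (estimates - mean)
```

Noisy individual estimates are pulled toward the group mean, and the pull is stronger when the sampling variance is large relative to the spread between individuals.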

### Journal of Data Science, v.7, no.1, p.13-25

#### Mixed-effect Models for Truncated Longitudinal Outcomes with Nonignorable Missing Data

##### by Sujuan Gao and Rodolphe Thiébaut

- Full Text (PDF): [92.53kB]

Mixed effects models are often used for estimating fixed effects and variance components in continuous longitudinal outcomes. An EM-based estimation approach for mixed effects models when the outcomes are truncated was proposed by Hughes (1999). We consider the situation where the longitudinal outcomes are also subject to non-ignorable missingness in addition to truncation. A shared random effect parameter model is presented in which the missing data mechanism depends on the random effects used to model the longitudinal outcomes. Data from the Indianapolis-Ibadan dementia project are used to illustrate the proposed approach.

### Journal of Data Science, v.7, no.1, p.27-42

#### A Locally Stationary Markov Chain Model for Labor Dynamics

##### by Enrique E. Alvarez, Francisco J. Ciocchini and Kishori Konwar

- Full Text (PDF): [149.94kB]

Because the EPH (Encuesta Permanente de Hogares, Argentina's permanent household survey) is severely affected by attrition, a significant fraction of the surveyed paths contain only a single point. Instead of discarding those observations, we opt to base estimation on the full data by (i) assuming the Markov chains are stationary and (ii) incorporating the chronological time of the first interview as an additional covariate for each individual. This novel treatment represents a convenient approximation, which we illustrate via maximum likelihood estimation with data from Argentina for the period 1995-2002. Several interesting labor market indexes, which are functionally related to the transition matrices, are also presented in the last portion of the paper and illustrated with real data.
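The stationarity assumption in (i) is what lets a path with a single observation contribute to the likelihood: its first state can be scored with the stationary distribution of the transition matrix. The sketch below illustrates only that mechanism for a plain homogeneous chain. It omits the covariates and the local stationarity of the paper's model, and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi @ P = pi with sum(pi) = 1 for an irreducible transition matrix P."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the constraint sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def path_log_likelihood(path, P):
    """Log-likelihood of an observed state path under a stationary chain.

    The first state is scored with the stationary distribution, so a path of
    length one still carries information about P.
    """
    pi = stationary_distribution(P)
    ll = np.log(pi[path[0]])
    for a, b in zip(path[:-1], path[1:]):
        ll += np.log(P[a, b])
    return float(ll)
```

Summing `path_log_likelihood` over all individuals, including those observed once, gives an objective that could be maximized over the entries of `P`.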

### Journal of Data Science, v.7, no.1, p.43-59

#### Two Approaches to Imputation and Adjustment of Air Quality Data from a Composite Monitoring Network

##### by Alessio Pollice and Giovanna Jona Lasinio

- Full Text (PDF): [221.05kB]

An analysis of air quality data is provided for the municipal area of Taranto, which is characterized by high environmental risks due to the massive presence of industrial sites with activities of elevated environmental impact. The present study is focused on particulate matter as measured by PM10 concentrations. Preliminary analysis involved addressing several data problems, mainly: (i) imputation techniques were considered to cope with the large number of missing data, due to both different working periods for groups of monitoring stations and occasional malfunction of PM10 sensors; (ii) due to the use of different validation techniques for each of the three monitoring networks, a calibration procedure was devised to allow for data comparability. Missing data imputation and calibration were addressed by three alternative procedures sharing a leave-one-out type mechanism and based on *ad hoc* exploratory tools and on the recursive Bayesian estimation and prediction of spatial linear mixed effects models. The three procedures are introduced by motivating issues and compared in terms of performance.

### Journal of Data Science, v.7, no.1, p.61-72

#### Approximate Graphical Methods for Inverse Regression

##### by Geoffrey Jones and Paul Lyons

- Full Text (PDF): [139.92kB]

Graphical procedures can be useful for illustrating and evaluating the process of inverse regression. We first review some simple and well-known graphical approaches for univariate linear and nonlinear models. We then propose a new graphical tool applicable to situations where the response is bivariate and repeated measures data are available. The proposed method is illustrated with an example of the age determination of tern chicks using measurements on body weight and wing length.

### Journal of Data Science, v.7, no.1, p.73-87

#### Bayesian Semiparametric Sales Projections for the Texas Lottery

##### by A. Majumdar and R. Eubank

- Full Text (PDF): [615.64kB]

State lotteries employ sales projections to determine appropriate advertised jackpot levels for some of their games. This paper focuses on prediction of sales for the Lotto Texas game of the Texas Lottery. A novel prediction method is developed in this setting that utilizes functional data analysis concepts in conjunction with a Bayesian paradigm to produce predictions and associated precision assessments.

### Journal of Data Science, v.7, no.1, p.89-97

#### Modeling Current Temperature Trends

##### by Terence C. Mills

- Full Text (PDF): [181.61kB]

Current trends in Northern Hemisphere and Central England temperatures are estimated using a variety of statistical signal extraction and filtering techniques, and their extrapolations are compared with the predictions from coupled atmosphere-ocean general circulation models. Earlier warming trend epochs are also analysed and compared with the current warming trend, suggesting that the long-run patterns of temperature trends should also be considered alongside the current emphasis on global warming.

### Journal of Data Science, v.7, no.1, p.99-110

#### Partial Least Squares Analysis in Electrical Brain Activity

##### by Aylin Alin, Serdar Kurt, Anthony Randal McIntosh, Adile Öniz and Murat Özgören

- Full Text (PDF): [654.08kB]

The partial least squares (PLS) method was designed to handle two problems that are common in data from most applied sciences, including neuroimaging: 1) collinearity among the explanatory variables (X) or among the dependent variables (Y); 2) a small number of observations combined with a large number of explanatory variables. The idea behind the method is to explain as much of the covariance between the two blocks of X and Y variables as possible using a small number of uncorrelated variables. Beyond its use in other applied sciences, PLS has been applied to imaging data to identify task-dependent changes in activity and changes in the relations between brain and behavior, and to examine the functional connectivity of one or more brain regions. The aim of this paper is to give some information about PLS and to apply it to electroencephalography (EEG) data to identify stimulation-dependent changes in EEG activity.
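One common way to extract latent variables that maximize the covariance between two blocks is a singular value decomposition of their cross-covariance matrix. The sketch below shows that variant only. It is an assumption that this matches the paper's formulation, and the function name `pls_svd` is hypothetical.

```python
import numpy as np

def pls_svd(X, Y, n_components=1):
    """PLS via SVD of the X-Y cross-covariance matrix (one variant among several).

    X has shape (n, p), Y has shape (n, q). Returns the X weights, singular
    values, Y weights, and the latent scores for each block.
    """
    Xc = X - X.mean(axis=0)           # center each column
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (X.shape[0] - 1)  # p x q cross-covariance
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U, s, V = U[:, :n_components], s[:n_components], Vt[:n_components].T
    # The covariance between paired score columns equals the singular value.
    return U, s, V, Xc @ U, Yc @ V
```

Because only the p x q cross-covariance matrix is decomposed, the approach remains usable when the number of explanatory variables far exceeds the number of observations.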

### Journal of Data Science, v.7, no.1, p.111-127

#### A Statistical Analysis of Well Failures in Baltimore County

##### by Xiaoyin Wang and Kevin W. Koepenick

- Full Text (PDF): [1.04MB]

A statistical evaluation of the Baltimore County water well database is performed to gain insight into the sustainability of domestic supply wells in crystalline bedrock aquifers over the last 15 years. The variables considered as potentially related to well yield include well construction, geology, well depth, and static water level. A variety of statistical methods are utilized to assess correlation and significance from a database of approximately 8,500 wells, and a logistic regression model is developed to predict the probability of well failure by geology type. Results of a two-way analysis of variance indicate that the average well depth and yield are statistically different among the established geology groups, and between failed and non-failed wells. The static water level is shown to be statistically different among the geology groups but not between failed and non-failed wells. The logistic regression model indicates that well yield is the most influential variable for predicting well failure. Static water level and well depth were not found to be significant in predicting well failure.

### Journal of Data Science, v.7, no.1, p.129-138

#### A Replicated Experiment Used in Manufacturing

##### by Roger L. Goodwin

- Full Text (PDF): [80.76kB]

Controlled experiments give researchers a statistical tool for determining the yield from subjecting an experimental unit to various treatments. We discuss a replicated block design applied to the experimental unit yeast, which we subjected to six treatments. The purpose of the experiment is to extract a compound to be used in the manufacturing industry. We considered an ANOVA and a MANOVA model to analyze the data, and the rationale for selecting one model over the other is discussed. Results and recommendations on which treatments to use when processing the yeast are also presented.