# Volume 8, Number 1, January 2010

• Yoon Young Jung, Youngja Park, Dean P. Jones, Thomas R. Ziegler and Brani Vidakovic
Self-similarity in NMR Spectra: An Application in Assessing the Level of Cysteine
• Thomas L. Moore and Vicki Bentley-Condit
A Study of Permutation Tests in the Context of a Problem in Primatology
• Jing Wang
A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models
• Matthew J. Davis
Contrast Coding in Multiple Regression Analysis: Strengths, Weaknesses, and Utility of Popular Coding Structures
• Hossein Hassani, Shahin Gheitanchi and Mohammad Reza Yeganegi
On the Application of Data Mining to Official Data
• Viviana Giampaoli and Arnaldo Mandel
Language Rhythm Model Selection by Weighted Kappa
• Gianna Agro, Frank Lad and Giuseppe Sanfilippo
Sequentially Forecasting Economic Indices Using Mixture Linear Combinations of EP Distributions
• Marwa Ahmed and Mohamed Shoukri
A Bayesian Estimator of the Intracluster Correlation Coefficient from Correlated Binary Responses
• Zhao Chen
Empirical Bayes Analysis on the Power Law Process with Natural Conjugate Priors
• Juha Karvanen, Olli Saarela and Kari Kuulasmaa
Nonparametric Multiple Imputation of Left Censored Event Times in Analysis of Follow-up Data
• Betsy L. Cadwell, Theodore J. Thompson, James P. Boyle and Lawrence E. Barker
Bayesian Small Area Estimates of Diabetes Prevalence by U.S. County, 2005

### Journal of Data Science, v.8, no.1, p.1-19

#### Self-similarity in NMR Spectra: An Application in Assessing the Level of Cysteine

##### by Yoon Young Jung, Youngja Park, Dean P. Jones, Thomas R. Ziegler and Brani Vidakovic

High-resolution NMR spectroscopic data from biosamples are a rich source of information on the metabolic response to physiological variation or pathological events. NMR techniques have many advantages: sample preparation is fast, simple, and non-invasive. Statistical analysis of NMR spectra usually focuses on differential expression of large resonance intensities corresponding to abundant metabolites and involves several data preprocessing steps. In this paper we estimate functional components of spectra and test their significance using multiscale techniques. We also explore scaling in NMR spectra and use the systematic variability of scaling descriptors to predict the level of cysteine, an important precursor of glutathione, a key antioxidant in the human body. This is motivated by the high cost (in time and resources) of traditional methods for assessing cysteine levels by high-performance liquid chromatography (HPLC).

### Journal of Data Science, v.8, no.1, p.21-41

#### A Study of Permutation Tests in the Context of a Problem in Primatology

##### by Thomas L. Moore and Vicki Bentley-Condit

Female baboons, some with infants, were observed, and counts were made of interactions in which females interacted with the infants of other females (so-called infant handling). Independently of these observations, each baboon was assigned a dominance rank of "low," "medium," or "high." Researchers hypothesized that females tend to handle infants of females ranked below them. The data form an array with rows labeled by infants and columns labeled by females, where entry $(i,j)$ counts total handlings of infant $i$ by female $j$. Each count corresponds to one of 9 combinations of female rank by infant/mother rank, which induces a 3-by-3 table of total interactions. We use a permutation test, in which ranks are permuted at random, to support the research hypothesis. We also discuss statistical properties of our method, such as the choice of test statistic, power, and stability of results to individual observations.

We discover that the data support a nuanced view of baboon interaction: higher-ranked females prefer to handle down the hierarchy, while lower-ranked females must balance acceding to the wishes of high-ranked females against protecting their infants from the potential risks of such interactions.
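A minimal sketch of a permutation test of this kind, using hypothetical counts and ranks and a simple "handling down the hierarchy" statistic (the paper's actual data and test statistic may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical handling counts: rows = infants, columns = females
counts = rng.poisson(2, size=(10, 12))
infant_rank = rng.integers(0, 3, size=10)   # mother's rank: 0=low, 1=medium, 2=high
female_rank = rng.integers(0, 3, size=12)

def down_hierarchy_stat(counts, infant_rank, female_rank):
    """Total handlings in which the handling female outranks the infant's mother."""
    mask = female_rank[None, :] > infant_rank[:, None]
    return counts[mask].sum()

observed = down_hierarchy_stat(counts, infant_rank, female_rank)

# Permutation null: shuffle the rank labels at random, recompute the statistic
n_perm = 10_000
perm_stats = np.empty(n_perm)
for b in range(n_perm):
    perm_stats[b] = down_hierarchy_stat(
        counts, rng.permutation(infant_rank), rng.permutation(female_rank))

# One-sided p-value with the usual +1 correction
p_value = (1 + (perm_stats >= observed).sum()) / (n_perm + 1)
print(f"observed = {observed}, permutation p-value = {p_value:.3f}")
```

The statistic and the permutation scheme (permuting both rank vectors) are illustration choices; in practice one would match the paper's table-based statistic.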

### Journal of Data Science, v.8, no.1, p.43-59

#### A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models

##### by Jing Wang

In this paper, we propose a nonparametric approach using the Dirichlet process (DP) as a class of prior distributions for the distribution G of the random effects in the hierarchical generalized linear mixed model (GLMM). The support of the prior distribution (and of the posterior distribution) is large, allowing for a wide range of shapes for G. This provides great flexibility in estimating G and therefore produces a more flexible estimator than the parametric analysis does. We present some strategies for the posterior computations involved in DP modeling. The proposed method is illustrated with real examples as well as simulations.

### Journal of Data Science, v.8, no.1, p.61-73

#### Contrast Coding in Multiple Regression Analysis: Strengths, Weaknesses, and Utility of Popular Coding Structures

##### by Matthew J. Davis

The use of multiple regression analysis (MRA) has been on the rise over the last few decades, in part due to the realization that analysis of variance (ANOVA) statistics can be advantageously computed within MRA. Given the limitations of ANOVA strategies, it is argued that MRA is the better analysis; however, to carry out ANOVA within MRA, the researcher must employ coding structures, which can be confusing to understand. The present paper attempts to simplify this discussion by describing the most popular coding structures, with emphasis on their strengths, limitations, and uses. A visual analysis of each strategy is also included, along with all the steps necessary to create the contrasts. Finally, a decision tree is presented that researchers can use to determine which coding structure to employ in their current research project.
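A small illustration of two popular coding structures on hypothetical data: with dummy coding the intercept recovers the reference-group mean, while with effect (deviation) coding it recovers the grand mean:

```python
import numpy as np

# Hypothetical outcome for three groups (4 observations each)
y = np.array([2., 3., 3., 4.,   5., 6., 6., 7.,   8., 9., 9., 10.])
group = np.repeat([0, 1, 2], 4)   # group means: 3, 6, 9

# Dummy coding: group 0 is the reference category
X_dummy = np.column_stack([np.ones(12),
                           (group == 1).astype(float),
                           (group == 2).astype(float)])

# Effect (deviation) coding: the last group is coded -1 on every contrast
eff = {0: (1., 0.), 1: (0., 1.), 2: (-1., -1.)}
X_effect = np.column_stack([np.ones(12),
                            [eff[g][0] for g in group],
                            [eff[g][1] for g in group]])

b_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)
b_effect, *_ = np.linalg.lstsq(X_effect, y, rcond=None)

# Dummy: intercept = reference-group mean (3), slopes = differences from it
# Effect: intercept = grand mean (6), slopes = group deviations from it
print(np.round(b_dummy, 6))
print(np.round(b_effect, 6))
```

Both design matrices span the same column space, so fitted values and the omnibus F test agree; only the interpretation of the coefficients changes.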

### Journal of Data Science, v.8, no.1, p.75-89

#### On the Application of Data Mining to Official Data

##### by Hossein Hassani, Shahin Gheitanchi and Mohammad Reza Yeganegi

Retrieving valuable knowledge and statistical patterns from official data has great potential for supporting strategic policy making. Data mining (DM) techniques are well known for providing flexible and efficient analytical tools for data processing. In this paper, we provide an introduction to applications of DM to official statistics and flag the important issues and challenges. Considering recent advancements in software projects for DM, we propose an intelligent data control system design and specifications as an example of a DM application in official data processing.

### Journal of Data Science, v.8, no.1, p.91-99

#### Language Rhythm Model Selection by Weighted Kappa

##### by Viviana Giampaoli and Arnaldo Mandel

Given processes that assign binary vectors to data, one wishes to test models that simulate those processes and to uncover groupings among the processes. It is shown that a suitable test can be derived from a kappa-type agreement measure. This is applied to the analysis of stress placement in spoken phrases, based on previously obtained experimental data. The processes were Portuguese speakers, and the grouping corresponds to the Brazilian and European varieties of that language. Optimality Theory gave rise to different models. The agreement measure was successful in pointing out the relative fitness of the models to the language varieties.

### Journal of Data Science, v.8, no.1, p.101-126

#### Sequentially Forecasting Economic Indices Using Mixture Linear Combinations of EP Distributions

##### by Gianna Agro, Frank Lad and Giuseppe Sanfilippo

This article presents an application of the statistical method motivated by Bruno de Finetti's operational subjective theory of probability. We use exchangeable forecasting distributions based on mixtures of linear combinations of exponential power (EP) distributions to forecast the sequence of daily rates of return from the Dow-Jones index of stock prices over a 20-year period. The operational subjective statistical method for comparing distributions is quite different from that commonly used in data analysis, because it rejects the basic tenets underlying the practice of hypothesis testing. In its place, proper scoring rules for forecast distributions are used to assess the values of various forecasting strategies. Using a logarithmic scoring rule, we find that a mixture linear combination of EP distributions scores markedly better than a simple mixture over the EP family, which in turn scores much better than a simple Normal mixture. Surprisingly, a mixture over a linear combination of three Normal distributions also improves substantially on a simple Normal mixture, although it does not quite match the performance of even the simple EP mixture. All substantive forecasting improvements become most marked after extreme tail phenomena were actually observed in the sequence, in particular after the abrupt drop in market prices in October 1987. However, the improvements continue to be apparent over the long haul of 1985-2006, which has seen a number of extreme price changes. This result is supported by an analysis of the negentropies embedded in the forecasting distributions, and by a proper scoring analysis of these negentropies as well.
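A toy sketch of comparing forecast families by a logarithmic score, using SciPy's generalized normal distribution as the EP family and simulated heavy-tailed data; this is an in-sample comparison, far simpler than the paper's sequential, out-of-sample scoring of real Dow-Jones returns:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical heavy-tailed "returns" (Student-t as a stand-in for real data)
x = stats.t.rvs(df=3, size=2000, random_state=rng)

# Fit a Normal and an exponential power (generalized normal) forecast density
mu, sigma = x.mean(), x.std(ddof=1)
beta, loc, scale = stats.gennorm.fit(x)   # beta < 2 means heavier-than-Normal tails

# Logarithmic score: mean log predictive density (higher is better)
score_normal = stats.norm.logpdf(x, mu, sigma).mean()
score_ep = stats.gennorm.logpdf(x, beta, loc, scale).mean()
print(f"Normal log score: {score_normal:.3f}, EP log score: {score_ep:.3f}")
```

Because the EP family nests the Normal (shape parameter 2), its fitted log score cannot be meaningfully worse, and on heavy-tailed data it is typically clearly better, mirroring the qualitative finding described above.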

### Journal of Data Science, v.8, no.1, p.127-137

#### A Bayesian Estimator of the Intracluster Correlation Coefficient from Correlated Binary Responses

##### by Marwa Ahmed and Mohamed Shoukri

Clustered binary samples arise often in biomedical investigations. An important feature of such samples is that the binary responses within clusters tend to be correlated. The Beta-Binomial model is commonly applied to account for the intra-cluster correlation -- the correlation between responses within clusters -- among dichotomous outcomes in cluster sampling. The intracluster correlation coefficient (ICC) quantifies this correlation or level of similarity. In this paper, we propose Bayesian point and interval estimators for the ICC under the Beta-Binomial model. Using Laplace's method, the asymptotic posterior distribution of the ICC is approximated by a normal distribution. The posterior mean of this normal density is used as a point estimator for the ICC, and 95\% credible sets are calculated. A Monte Carlo simulation is used to evaluate the coverage probability and average length of the credible sets of the proposed interval estimator. The simulations indicate that when the number of clusters is above 40, the underlying mean response probability falls in the range $[0.3, 0.7]$, and the underlying ICC values are $\leq 0.4$, the proposed interval estimator performs quite well and attains the correct coverage level. Even for a number of clusters as small as 20, the proposed interval estimator may still be useful when the ICC is small ($\leq 0.2$).
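To make the Beta-Binomial ICC concrete, here is a small simulation sketch that generates correlated binary clusters and recovers the ICC with a moment (ANOVA) estimator; the authors' Bayesian Laplace-approximation estimator is different, and the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

k, n = 200, 10            # number of clusters, cluster size
pi, rho = 0.5, 0.2        # mean response probability and target ICC
theta = 1.0 / rho - 1.0   # Beta-Binomial: rho = 1 / (a + b + 1)
a, b = pi * theta, (1 - pi) * theta

p = rng.beta(a, b, size=k)                    # cluster-level success probabilities
y = rng.binomial(1, p[:, None], size=(k, n))  # correlated binary responses

# ANOVA (moment) estimator of the ICC for balanced clusters
cluster_means = y.mean(axis=1)
grand_mean = y.mean()
msb = n * ((cluster_means - grand_mean) ** 2).sum() / (k - 1)
msw = ((y - cluster_means[:, None]) ** 2).sum() / (k * (n - 1))
rho_hat = (msb - msw) / (msb + (n - 1) * msw)
print(f"true rho = {rho}, estimated rho = {rho_hat:.3f}")
```

With 200 clusters the moment estimate lands close to the true value; the paper's question is how well a Bayesian interval behaves when the number of clusters is much smaller.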

### Journal of Data Science, v.8, no.1, p.139-149

#### Empirical Bayes Analysis on the Power Law Process with Natural Conjugate Priors

##### by Zhao Chen

The power law process has been used extensively in software reliability models, reliability growth models, and, more generally, reliable systems. In this paper we study the power law process via an empirical Bayes (EB) approach. Based on a two-hyperparameter natural conjugate prior and a more general three-hyperparameter natural conjugate prior stated in Huang and Bier (1998), we work out an EB procedure and provide statistical inferences based on the natural conjugate priors. Given past experience with the parameters of the model, the EB approach uses the observed data to estimate the hyperparameters of the priors and then proceeds as though the prior were known.

### Journal of Data Science, v.8, no.1, p.151-172

#### Nonparametric Multiple Imputation of Left Censored Event Times in Analysis of Follow-up Data

##### by Juha Karvanen, Olli Saarela and Kari Kuulasmaa

In this paper, we consider the analysis of follow-up data where each event time is either right censored, observed, left censored, or left truncated. In the case of left censoring, the covariates measured at baseline are considered missing. The work is motivated by data from the MORGAM Project, which explores the association between cardiovascular diseases and their classic and genetic risk factors. We propose a nonparametric multiple imputation (NPMI) approach in which the left-censored event times and the missing covariates are imputed in a hot-deck manner. The left truncation due to deaths prior to baseline is compensated for by Lexis diagram imputation, introduced in this paper. After imputation, standard estimation methods for right-censored survival data can be applied directly. The performance of the proposed imputation approach is studied with simulated and real-world data. The results suggest that NPMI is a flexible and reliable approach to the analysis of left- and right-censored data.
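A rough sketch of hot-deck multiple imputation for left-censored times, under simplifying assumptions (each censored subject has a known upper bound, donors are the exactly observed times respecting that bound, and no covariate matching is done; the paper's procedure is considerably more elaborate):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: exact event times for most subjects, only an
# upper bound (left-censoring time) for the rest
exact = rng.exponential(5.0, size=400)
bounds = rng.uniform(1.0, 10.0, size=100)   # T_i known only to satisfy T_i < bound_i

def hot_deck_impute(exact, bounds, rng):
    """Impute each left-censored time by drawing a donor at random
    among the exactly observed times that respect the subject's bound."""
    imputed = np.empty_like(bounds)
    for i, c in enumerate(bounds):
        donors = exact[exact < c]
        imputed[i] = rng.choice(donors) if donors.size else c
    return imputed

# Multiple imputation: repeat the draw to reflect imputation uncertainty,
# yielding m completed data sets for standard right-censored survival methods
m = 5
completed = [np.concatenate([exact, hot_deck_impute(exact, bounds, rng)])
             for _ in range(m)]
print(len(completed), completed[0].shape)
```

In a real analysis each completed data set would be analyzed separately and the results combined with the usual multiple-imputation rules.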

### Journal of Data Science, v.8, no.1, p.173-188

#### Bayesian Small Area Estimates of Diabetes Prevalence by U.S. County, 2005

##### by Betsy L. Cadwell, Theodore J. Thompson, James P. Boyle and Lawrence E. Barker

County-specific estimates promote understanding of national and state patterns of the diabetes burden and can help better target diabetes programs. Using Bayesian multilevel models, the authors estimated the prevalence of self-reported diagnosed diabetes among adults aged 20 years or older for each of the United States' 3,141 counties/county equivalents. These estimates provide the first comprehensive county-level estimates of diabetes for the U.S. and open opportunities for the practical targeting of interventions and new lines of investigation into area-level risk factors for diabetes. The posterior distribution of the ranks was used to identify counties with extreme diabetes burden. Counties with high (low) diabetes burden were identified as those for which at least 95\% of the posterior distribution of the rank was above (below) the median. In 2005, 428 (480) counties had high (low) diabetes burden. Design-based estimates could be obtained for 232 large-population counties; the model-based estimates compared favorably with these design-based estimates.