### Journal of Data Science, v.3, no.2, p.123-136

#### History and Potential of Binary Segmentation for Exploratory Data Analysis

##### by James N. Morgan

- Full Text (PDF): [124.04kB]

Exploratory data analysis has become more important as large rich data sets become available, with many explanatory variables representing competing theoretical constructs. The restrictive assumptions of linearity and additivity of effects as in regression are no longer necessary to save degrees of freedom. Where there is a clear criterion (dependent) variable or classification, sequential binary segmentation (tree) programs are being used. We explain why, using the current enhanced version (SEARCH) of the original Automatic Interaction Detector program as an illustration. Even the simple example uncovers an interaction that might well have been missed with the usual multivariate regression. We then suggest some promising uses and provide one simple example.
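As a rough illustration of what a single binary segmentation step involves, the sketch below searches one numeric predictor for the cut point that maximizes the between-group sum of squares of the criterion variable. This is a minimal sketch under stated assumptions, not the SEARCH program itself: the function name `best_binary_split`, the single-predictor setup, and the toy data are all hypothetical, and a tree program would apply such a search repeatedly across many predictors and subgroups.

```python
from typing import Optional, Sequence, Tuple


def best_binary_split(
    x: Sequence[float], y: Sequence[float]
) -> Optional[Tuple[float, float]]:
    """Find the split `x <= threshold` that maximizes the between-group
    sum of squares of y. Returns (threshold, between_ss), or None when
    no split is possible (fewer than two distinct x values)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    if n < 2:
        return None

    total = sum(value for _, value in pairs)
    grand_mean = total / n

    best: Optional[Tuple[float, float]] = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        # A cut is only valid between two distinct predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        between_ss = (
            i * (left_mean - grand_mean) ** 2
            + (n - i) * (right_mean - grand_mean) ** 2
        )
        if best is None or between_ss > best[1]:
            best = (pairs[i - 1][0], between_ss)
    return best


# Hypothetical toy data: the criterion jumps once the predictor exceeds 3.
threshold, between_ss = best_binary_split(
    [1, 2, 3, 4, 5, 6], [10, 11, 9, 20, 21, 19]
)
print(threshold, between_ss)  # 3 150.0
```

Because each resulting subgroup is searched again independently, a predictor can matter in one branch and not in another, which is how this kind of procedure can surface interactions without specifying them in advance.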

### Journal of Data Science, v.3, no.2, p.137-151

#### A Diagnostic for Assessing the Influence of Cases on the Prediction of Random Effects in a Mixed Model

##### by Joseph E. Cavanaugh and Junfeng Shang

- Full Text (PDF): [151.08kB]

A diagnostic defined in terms of the Kullback-Leibler directed divergence is developed for identifying cases which impact the prediction of the random effects in a mixed model. The diagnostic compares two conditional densities governing the prediction of the random effects: one based on parameter estimates computed using the full data set, the other based on parameter estimates computed using a case-deleted data set. We present the definition of the diagnostic and derive a formula for its evaluation. Its performance is investigated in an application where exam scores are modeled using a mixed model containing a fixed exam effect and a random subject effect.
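For orientation, the Kullback-Leibler directed divergence between two densities $f$ and $g$ is generally defined as shown below. The notation in the second line is an assumption made here for illustration ($b$ for the random effects, $y$ for the data, $\hat{\theta}$ for the full-data estimates, $\hat{\theta}_{(i)}$ for the estimates with case $i$ deleted); the paper's own notation and choice of direction may differ.

$$
I(f, g) = \int f(u) \, \log \frac{f(u)}{g(u)} \, du
$$

$$
D_i = I\left( f(b \mid y; \hat{\theta}), \; f(b \mid y; \hat{\theta}_{(i)}) \right)
$$

Under this reading, a large $D_i$ would suggest that deleting case $i$ noticeably changes the conditional density used to predict the random effects.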

### Journal of Data Science, v.3, no.2, p.153-177

#### Influence Diagnostics for Linear Mixed Models

##### by Temesgen Zewotir and Jacky S. Galpin

- Full Text (PDF): [286.14kB]

in standard computing packages. We provide routine diagnostic tools, which are computationally inexpensive. The diagnostics are functions of basic building blocks: studentized residuals, error contrast matrix, and the inverse of the response variable covariance matrix. The basic building blocks are computed only once from the complete data analysis and provide information on the influence of the data on different aspects of the model fit. Numerical examples provide analysts with the complete pictures of the diagnostics.

### Journal of Data Science, v.3, no.2, p.179-197

#### Designing for Parameter Subsets in Gaussian Nonlinear Regression Models

##### by Timothy E. O'Brien

- Full Text (PDF): [201.38kB]

This article presents and illustrates several important subset design approaches for Gaussian nonlinear regression models and for linear models where interest lies in a nonlinear function of the model parameters. These design strategies are particularly useful in situations where currently-used subset design procedures fail to provide designs which can be used to fit the model function. Our original design technique is illustrated in conjunction with D-optimality, Bayesian D-optimality and Kiefer's $\Phi_{k}$-optimality, and is extended to yield subset designs which take account of curvature.

### Journal of Data Science, v.3, no.2, p.199-219

#### General Marginal Regression Models for the Joint Modeling of Event Frequency and Correlated Severities with Applications to Clinical Trials

##### by Andrew S. Allen and Huiman X. Barnhart

- Full Text (PDF): [178.04kB]

In many clinical trials, information is collected on both the frequency of event occurrence and the severity of each event. For example, in evaluating a new anti-epileptic medication both the total number of seizures a patient has during the study period as well as the severity (e.g., mild, severe) of each seizure could be measured. In order to arrive at a full picture of drug or treatment performance, one needs to jointly model the number of events and their correlated ordinal severity measures. A separate analysis is not recommended as it is inefficient and can lead to what we define as "zero length bias" in estimates of treatment effect on severity. This paper proposes a general, likelihood-based, marginal regression model for jointly modeling the number of events and their correlated ordinal severity measures. We describe parameter estimation issues and derive the Fisher information matrix for the joint model in order to obtain the asymptotic covariance matrix of the parameter estimates. A limited simulation study is conducted to examine the asymptotic properties of the maximum likelihood estimators. Using this joint model, we propose tests that incorporate information from both the number of events and their correlated ordinal severity measures. The methodology is illustrated with two examples from clinical trials: the first concerning a new drug treatment for epilepsy; the second evaluating the effect of a cholesterol-lowering medication on coronary artery disease.

### Journal of Data Science, v.3, no.2, p.221-232

#### Analyzing Collinear Data by Principal Component Regression Approach --- An Example from Developing Countries

##### by Abu Jafar Mohammad Sufian

- Full Text (PDF): [119.07kB]
- Data (XLS): [17.00kB]

### Journal of Data Science, v.3, no.2, p.233-241

#### On the Radical Views of Coal Miners in the Early Twentieth Century in Southern Illinois

##### by Stephane E. Booth and David E. Booth

- Full Text (PDF): [85.71kB]

There has been great interest in the Southern Illinois mine war by historians. An explanation has been that this war was caused by miners who had radical political beliefs. We examine this view by applying four methods of ecological inference to estimate the proportion of coal miners who were socialist voters in this time period. Based on these results (especially considering the assumptions of the methods) we conclude that miners were politically less radical than previously thought.