# Volume 8, Number 2, April 2010

• Announcement of Our New Editor
• Hoon Kim and HeeJeong Lim
Comparison of Bayesian Spatio-Temporal Models for Chronic Diseases
• Wen-Jang Huang and Chia-Ling Lai
A Study of the Suprenewal Process
• Peter Congdon and Patsy Lloyd
Estimating Small Area Diabetes Prevalence in the US Using the Behavioral Risk Factor Surveillance System
• Cheng Peng
Estimating and Testing Quantile-based Process Capability Indices for Processes with Skewed Distributions
• Stefan Jaeger and Su-Shing Chen
Information Fusion for Biological Prediction
• Dominique Haughton and Phong Nguyen
Multilevel Models and Inequality in Viet Nam
• Chong Ho Yu, Samuel DiGangi, Angel Jannasch-Pennell and Charles Kaprolet
A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year
• Basilio de B. Pereira, Calyampudi R. Rao, Rubens L. Oliveira and Emilia M. do Nascimento
Combining Unsupervised and Supervised Neural Networks in Cluster Analysis of Gamma-Ray Burst
• Matthew J. Hayat, Ram C. Tiwari, Kaushik Ghosh, Mark Hachey, Ben Hankey and Rocky Feuer
Age-Adjusted US Cancer Death Rate Predictions
• Stanley Xu, Paul Gargiullo, John Mullooly, David McClure, Simon J. Hambidge and Jason Glanz
Fitting Parametric and Semi-parametric Conditional Poisson Regression Models with Cox's Partial Likelihood in Self-controlled Case Series and Matched Cohort Studies

#### Announcement of Our New Editor

Effective January 1, 2011, Journal of Data Science will have a new editor. Please send contributions to:

Professor Wen-Jang Huang
Department of Applied Mathematics
National University of Kaohsiung
Kaohsiung, Taiwan 811
huangwj@nuk.edu.tw

### Journal of Data Science, v.8, no.2, p.189-211

#### Comparison of Bayesian Spatio-Temporal Models for Chronic Diseases

##### by Hoon Kim and HeeJeong Lim

This paper discusses a comprehensive statistical approach that will
be useful in answering health-related questions concerning mortality
and incidence rates of chronic diseases such as cancer and
hypertension. The developed spatio-temporal models will be useful
to explain the patterns of mortality rates of chronic disease in
terms of environmental changes and socio-economic conditions. In
addition to age and time effects, models include two components of
normally distributed residual effects and spatial effects, one to
represent average regional effects and another to represent changes
of subgroups within region over time. Numerical analysis is based on
male lung cancer mortality data from the state of Missouri. Gibbs
sampling is used to obtain the posterior quantities. As a result,
mortality rates, especially in the less populated areas. Due to the
richness of hierarchical settings, easy interpretation of parameters
and ease of implementation, any models proposed in this paper can be
applied generally to other sets of data.

### Journal of Data Science, v.8, no.2, p.213-234

#### A Study of the Suprenewal Process

##### by Wen-Jang Huang and Chia-Ling Lai

The classical coupon collector's problem is concerned with the
number of purchases in order to have a complete collection, assuming
that on each purchase a consumer can obtain a randomly chosen
coupon. For most real situations, a consumer may not just get
exactly one coupon on each purchase. Motivated by the classical
coupon collector's problem, in this work, we study the so-called
suprenewal process. Let $\{X_i,i\geq1\}$ be a sequence of
independent and identically distributed random variables,
$S_{n}=\sum^n_{i=1}X_{i},\ n\geq1,\ S_{0}=0$. For every $t\geq0$,
define $Q_{t}=\inf\{n\mid n\geq0,\ S_{n}\geq t\}$. For the classical
coupon collector's problem, $Q_t$ denotes the minimal number of
purchases, such that the total number of coupons that the consumer
has owned is greater than or equal to $t$, $t\geq 0$. First the
process $\{Q_t,\ t\geq 0\}$ and the renewal process $\{N_t,\ t\geq 0\}$,
where $N_t=\sup\{n\mid n\geq0,\ S_n\leq t\}$, generated
by the same sequence $\{X_i,i\geq1\}$ are compared. Next some
fundamental and interesting properties of $\{Q_t,\ t\geq 0\}$ are
provided. Finally limiting and some other related results are
obtained for the process $\{Q_t,\ t\geq 0\}$.
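
The two definitions in the abstract can be compared on a small path. The following is a minimal sketch, not code from the paper: the function names and the increment values are hypothetical, the increments are assumed to be strictly positive integers, and $N_t$ is computed only over the finite simulated path.

```python
from itertools import accumulate


def partial_sums(xs):
    """Return [S_0, S_1, ..., S_n] with S_0 = 0."""
    return [0] + list(accumulate(xs))


def q_t(sums, t):
    """Q_t = inf{n >= 0 : S_n >= t}; None if the path never reaches t."""
    for n, s in enumerate(sums):
        if s >= t:
            return n
    return None


def n_t(sums, t):
    """N_t = sup{n >= 0 : S_n <= t}, restricted to the simulated path."""
    return max(n for n, s in enumerate(sums) if s <= t)


# Hypothetical numbers of coupons obtained on four successive purchases.
coupons = [2, 3, 1, 4]
sums = partial_sums(coupons)

# Print Q_t and N_t side by side for every integer level on the path.
for t in range(sums[-1] + 1):
    print(t, q_t(sums, t), n_t(sums, t))
```

On this path, the level $t=4$ is skipped by the partial sums, so $Q_4 = 2 = N_4 + 1$, whereas $t=5$ is hit exactly and $Q_5 = N_5 = 2$.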

### Journal of Data Science, v.8, no.2, p.235-252

#### Estimating Small Area Diabetes Prevalence in the US Using the Behavioral Risk Factor Surveillance System

##### by Peter Congdon and Patsy Lloyd

Information regarding small area prevalence of chronic disease is
important for public health strategy and resourcing equity. This
paper develops a prevalence model taking account of survey and
census data to derive small area prevalence estimates for diabetes.
The application involves 32000 small area subdivisions (zip code
census tracts) of the US, with the prevalence estimates taking
account of information from the US-wide Behavioral Risk Factor
Surveillance System (BRFSS) survey on population prevalence
differentials by age, gender, ethnic group and education. The
effects of such aspects of population composition on prevalence are
widely recognized. However, the model also incorporates spatial or
contextual influences via spatially structured effects for each US
state; such contextual effects are allowed to differ between ethnic
groups and other demographic categories using a multivariate spatial
prior. A Bayesian estimation approach is used and analysis
demonstrates the considerably improved fit of a fully specified
compositional-contextual model as compared to simpler 'standard'
approaches which are typically limited to age and area effects.

### Journal of Data Science, v.8, no.2, p.253-268

#### Estimating and Testing Quantile-based Process Capability Indices for Processes with Skewed Distributions

##### by Cheng Peng

regarding the new family of quantile based process capability
indices (qPCI) $C_{MA}(\tau, v)$. We develop both asymptotic
parametric and non-parametric confidence limits and testing
procedures of $C_{MA}(\tau, v)$. A kernel density estimator of the
process distribution is proposed to obtain a consistent estimator of
the variance of the nonparametric consistent estimator of
$C_{MA}(\tau, v)$. Therefore, the proposed procedure is ready for
practical implementation for any process. Illustrative examples are
also provided to show the steps of implementing the proposed methods
directly on real-life problems. We also present a simulation
study on the sample size required for using asymptotic results.

### Journal of Data Science, v.8, no.2, p.269-288

#### Information Fusion for Biological Prediction

##### by Stefan Jaeger and Su-Shing Chen

Information fusion has become a powerful tool for challenging
applications such as biological prediction problems. In this paper,
we apply a new information-theoretical fusion technique to HIV-1
protease cleavage site prediction, which is a problem that has been
in the focus of much interest and investigation of the machine
learning community recently. It poses a difficult classification
task due to its high dimensional feature space and a relatively
small set of available training patterns. We also apply a new set of
biophysical features to this problem and present experiments with
neural networks, support vector machines, and decision trees.
Application of our feature set results in high recognition rates and
concise decision trees, producing manageable rule sets that can
guide future experiments. In particular, we found a combination of
neural networks and support vector machines to be beneficial for
this problem.

### Journal of Data Science, v.8, no.2, p.289-306

#### Multilevel Models and Inequality in Viet Nam

##### by Dominique Haughton and Phong Nguyen

This paper proposes to investigate inequality in Viet Nam from the
point of view of a study of the urban/rural gap by means of a
multilevel model. Using data from the Viet Nam Household Living
Standards Survey of 2002, the paper constructs a multilevel model,
yielding random effects in the urban/rural gap which can be seen as
location-specific random contributions to the urban/rural gap above
and beyond the effects of known location characteristics, such as
the level of education of the population, etc. The paper also
demonstrates how the multilevel model can be used to obtain small
area estimates at the commune level.

### Journal of Data Science, v.8, no.2, p.307-325

#### A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year

##### by Chong Ho Yu, Samuel DiGangi, Angel Jannasch-Pennell and Charles Kaprolet

Student retention is an important issue for all university policy
makers due to the potential negative impact on the image of the
university and the career path of the dropouts. Although this issue
has been thoroughly studied by many institutional researchers using
parametric techniques, such as regression analysis and logit
exploring the issue with the use of three data mining techniques,
namely, classification trees, multivariate adaptive regression
splines (MARS), and neural networks. Data mining procedures identify
transferred hours, residency, and ethnicity as crucial factors to
retention. Carrying transferred hours into the university implies
that the students have taken college level classes somewhere else,
suggesting that they are more academically prepared for university
study than those who have no transferred hours. Although residency
was found to be a crucial predictor to retention, one should not go
too far as to interpret this finding that retention is affected by
proximity to the university location. Instead, this is a typical
example of Simpson's Paradox. The geographical information system
analysis indicates that non-residents from the east coast tend to be
more persistent in enrollment than their west coast schoolmates.

### Journal of Data Science, v.8, no.2, p.327-338

#### Combining Unsupervised and Supervised Neural Networks in Cluster Analysis of Gamma-Ray Burst

##### by Basilio de B. Pereira, Calyampudi R. Rao, Rubens L. Oliveira and Emilia M. do Nascimento

The paper proposes the use of Kohonen's Self Organizing Map (SOM),
and supervised neural networks to find clusters in samples of
gamma-ray burst (GRB) using the measurements given in BATSE GRB.
The extent of separation between clusters obtained by SOM was
examined by cross validation procedure using supervised neural
networks for classification. A method is proposed for variable
selection to reduce the "curse of dimensionality". Six variables
were chosen for cluster analysis. Additionally, principal components
were computed using all the original variables, and 6 components
that accounted for a high percentage of variance were chosen for SOM
analysis. All these methods indicate 4 or 5 clusters. Further
analysis based on the average profiles of the GRB indicated a
possible reduction in the number of clusters.

### Journal of Data Science, v.8, no.2, p.339-348

#### Age-Adjusted US Cancer Death Rate Predictions

##### by Matthew J. Hayat, Ram C. Tiwari, Kaushik Ghosh, Mark Hachey, Ben Hankey and Rocky Feuer

The likelihood of developing cancer during one's lifetime is
approximately one in two for men and one in three for women in the
United States. Cancer is the second-leading cause of death and
accounts for one in every four deaths. Evidence-based policy
planning and decision making by cancer researchers and public health
site-specific cancer death rates. Because of the 3-year lag in
reporting, forecasting methodology is employed here to estimate the
current year's rates based on complete observed death data up
through three years prior to the current year. The authors expand
the State Space Model (SSM) statistical methodology currently in use
by the American Cancer Society (ACS) to predict age-adjusted cancer
death rates for the current year. These predictions are compared
with those from the previous Proc Forecast ACS method and results
suggest the expanded SSM performs well.

### Journal of Data Science, v.8, no.2, p.349-360

#### Fitting Parametric and Semi-parametric Conditional Poisson Regression Models with Cox's Partial Likelihood in Self-controlled Case Series and Matched Cohort Studies

##### by Stanley Xu, Paul Gargiullo, John Mullooly, David McClure, Simon J. Hambidge and Jason Glanz

The self-controlled case series (SCCS) and the matched cohort are
two frequently used study designs to adjust for known and unknown
confounding effects in epidemiological studies. Count data arising
from these two designs may not be independent. While conditional
Poisson regression models have been used to take into account the
dependence of such data, these models have not been available in
some standard statistical software packages (e.g., SAS). This
article demonstrates 1) the relationship of the likelihood function
and parameter estimation between the conditional Poisson regression
models and Cox's proportional hazards models in SCCS and matched
cohort studies; 2) that it is possible to fit conditional Poisson
regression models with procedures (e.g., PHREG in SAS)
using Cox's partial likelihood model. We tested both conditional
Poisson likelihood and Cox's partial likelihood models on data from
studies using either SCCS or a matched cohort design. For the SCCS
study, we fitted both parametric and semi-parametric models to model
age effects, and described a simple way to apply the parametric and
complex semi-parametric analysis to case series data.
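
The likelihood relationship described in the abstract can be checked numerically for a single individual. This is a minimal sketch under stated assumptions, not the authors' SAS code: the function names, the counts, the exposure lengths, the covariate values and the coefficient are all hypothetical. It assumes independent Poisson counts with means $\mu_j = e^{\alpha} e_j e^{\beta x_j}$ over the observation intervals, and it compares the conditional Poisson log-likelihood (conditioning on the individual's total count) with a Cox-type stratum contribution in which each event in interval $j$ adds $\log(w_j/\sum_k w_k)$, with $w_j = e_j e^{\beta x_j}$.

```python
import math


def poisson_logpmf(y, mu):
    """Log of the Poisson probability mass function."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def conditional_poisson_loglik(counts, exposures, x, beta, alpha):
    """log P(y_1, ..., y_J | sum_j y_j) for independent Poisson counts
    with means mu_j = exp(alpha) * e_j * exp(beta * x_j)."""
    mus = [math.exp(alpha + beta * xj) * ej for xj, ej in zip(x, exposures)]
    joint = sum(poisson_logpmf(y, mu) for y, mu in zip(counts, mus))
    total = poisson_logpmf(sum(counts), sum(mus))
    return joint - total


def partial_loglik(counts, exposures, x, beta):
    """Cox-type contribution of one stratum: every event in interval j
    adds log(w_j / sum_k w_k), where w_j = e_j * exp(beta * x_j)."""
    ws = [ej * math.exp(beta * xj) for xj, ej in zip(x, exposures)]
    denom = sum(ws)
    return sum(y * math.log(w / denom) for y, w in zip(counts, ws))


def log_multinomial_coef(counts):
    """log( n! / prod_j y_j! ), which does not involve beta."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)


# Hypothetical data for one individual observed over three intervals.
counts = [1, 0, 2]              # events per interval
exposures = [30.0, 60.0, 275.0]  # interval lengths in days
x = [1.0, 1.0, 0.0]             # 1 = risk period, 0 = control period
beta = 0.7

cond_a = conditional_poisson_loglik(counts, exposures, x, beta, alpha=0.0)
cond_b = conditional_poisson_loglik(counts, exposures, x, beta, alpha=-3.0)
part = partial_loglik(counts, exposures, x, beta)

print(cond_a, cond_b, part + log_multinomial_coef(counts))
```

In this sketch the individual-level intercept $\alpha$ cancels out of the conditional likelihood, and the conditional Poisson and partial log-likelihoods differ only by a term that is constant in $\beta$, so both are maximized by the same $\beta$.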