#### Announcement of Our New Editor

Effective January 1, 2011, Journal of Data Science will have a new editor. Please send contributions to:

Professor Wen-Jang Huang

Department of Applied Mathematics

National University of Kaohsiung

Kaohsiung, Taiwan 811

huangwj@nuk.edu.tw

### Journal of Data Science, v.8, no.2, p.189-211

#### Comparison of Bayesian Spatio-Temporal Models for Chronic Diseases

##### by Hoon Kim and HeeJeong Lim

- Full Text (PDF): [725.95kB]

This paper discusses a comprehensive statistical approach that will be useful in answering health-related questions concerning mortality and incidence rates of chronic diseases such as cancer and hypertension. The developed spatio-temporal models will be useful to explain the patterns of mortality rates of chronic disease in terms of environmental changes and social-economic conditions. In addition to age and time effects, models include two components of normally distributed residual effects and spatial effects, one to represent average regional effects and another to represent changes of subgroups within region over time. Numerical analysis is based on male lung cancer mortality data from the state of Missouri. Gibbs sampling is used to obtain the posterior quantities. As a result, all models discussed in this article fit well in stabilizing the mortality rates, especially in the less populated areas. Due to the richness of hierarchical settings, easy interpretation of parameters and ease of implementation, any models proposed in this paper can be applied generally to other sets of data.

### Journal of Data Science, v.8, no.2, p.213-234

#### A Study of the Suprenewal Process

##### by Wen-Jang Huang and Chia-Ling Lai

- Full Text (PDF): [188.70kB]

The classical coupon collector's problem is concerned with the number of purchases in order to have a complete collection, assuming that on each purchase a consumer can obtain a randomly chosen coupon. For most real situations, a consumer may not just get exactly one coupon on each purchase. Motivated by the classical coupon collector's problem, in this work, we study the so-called suprenewal process. Let $\{X_i,\ i\geq 1\}$ be a sequence of independent and identically distributed random variables, $S_{n}=\sum^n_{i=1}X_{i},\ n\geq 1,\ S_{0}=0$. For every $t\geq 0$, define $Q_{t}=\inf\{n\mid n\geq 0,\ S_{n}\geq t\}$. For the classical coupon collector's problem, $Q_t$ denotes the minimal number of purchases, such that the total number of coupons that the consumer has owned is greater than or equal to $t$, $t\geq 0$. First the process $\{Q_t,\ t\geq 0\}$ and the renewal process $\{N_t,\ t\geq 0\}$, where $N_t = \sup\{n\mid n\geq 0,\ S_n\leq t\}$, generated by the same sequence $\{X_i,\ i\geq 1\}$ are compared. Next some fundamental and interesting properties of $\{Q_t,\ t\geq 0\}$ are provided. Finally limiting and some other related results are obtained for the process $\{Q_t,\ t\geq 0\}$.
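As an illustration only (not taken from the paper), the two definitions in this abstract can be evaluated on a short, fixed sequence of increments. The function names `partial_sums`, `q_t` and `n_t` are hypothetical, the increments are assumed to be nonnegative, and the supremum in $N_t$ is taken over the finite sample only, so the value of `n_t` is capped at the sample length.

```python
from itertools import accumulate


def partial_sums(xs):
    """Return [S_0, S_1, ..., S_n] with S_0 = 0 and S_k = x_1 + ... + x_k."""
    return [0] + list(accumulate(xs))


def q_t(xs, t):
    """Q_t = inf{n >= 0 : S_n >= t}.

    Returns None when level t is not reached within the finite sample xs.
    """
    for n, s in enumerate(partial_sums(xs)):
        if s >= t:
            return n
    return None


def n_t(xs, t):
    """N_t = sup{n >= 0 : S_n <= t}, restricted to the finite sample xs.

    For t >= 0 the set is never empty because S_0 = 0 <= t.
    """
    return max(n for n, s in enumerate(partial_sums(xs)) if s <= t)


if __name__ == "__main__":
    # Hypothetical numbers of coupons obtained on five purchases.
    purchases = [2, 0, 3, 1, 4]
    for level in (0, 2, 3, 5):
        print(level, q_t(purchases, level), n_t(purchases, level))
```

With these increments the partial sums are 0, 2, 2, 5, 6, 10. At level 3 the first passage happens at the third purchase while the renewal count is still 2, and at level 2 the zero increment makes the renewal count (2) exceed the first-passage index (1). These are the kinds of discrepancies that a comparison of the two processes has to account for.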

### Journal of Data Science, v.8, no.2, p.235-252

#### Estimating Small Area Diabetes Prevalence in the US Using the Behavioral Risk Factor Surveillance System

##### by Peter Congdon and Patsy Lloyd

- Full Text (PDF): [263.70kB]

Information regarding small area prevalence of chronic disease is important for public health strategy and resourcing equity. This paper develops a prevalence model taking account of survey and census data to derive small area prevalence estimates for diabetes. The application involves 32000 small area subdivisions (zip code census tracts) of the US, with the prevalence estimates taking account of information from the US-wide Behavioral Risk Factor Surveillance System (BRFSS) survey on population prevalence differentials by age, gender, ethnic group and education. The effects of such aspects of population composition on prevalence are widely recognized. However, the model also incorporates spatial or contextual influences via spatially structured effects for each US state; such contextual effects are allowed to differ between ethnic groups and other demographic categories using a multivariate spatial prior. A Bayesian estimation approach is used and analysis demonstrates the considerably improved fit of a fully specified compositional-contextual model as compared to simpler "standard" approaches which are typically limited to age and area effects.

### Journal of Data Science, v.8, no.2, p.253-268

#### Estimating and Testing Quantile-based Process Capability Indices for Processes with Skewed Distributions

##### by Cheng Peng

- Full Text (PDF): [138.53kB]

This article extends the recent work of Vannman and Albing (2007) regarding the new family of quantile-based process capability indices (qPCI) $C_{MA}(\tau, v)$. We develop both asymptotic parametric and non-parametric confidence limits and testing procedures for $C_{MA}(\tau, v)$. A kernel density estimator of the process is proposed to obtain a consistent estimator of the variance of the nonparametric consistent estimator of $C_{MA}(\tau, v)$. The proposed procedure is therefore ready for practical implementation on any process. Illustrative examples are also provided to show the steps of implementing the proposed methods directly on real-life problems. We also present a simulation study on the sample size required for using asymptotic results.

### Journal of Data Science, v.8, no.2, p.269-288

#### Information Fusion for Biological Prediction

##### by Stefan Jaeger and Su-Shing Chen

- Full Text (PDF): [141.98kB]

Information fusion has become a powerful tool for challenging applications such as biological prediction problems. In this paper, we apply a new information-theoretical fusion technique to HIV-1 protease cleavage site prediction, which is a problem that has been in the focus of much interest and investigation of the machine learning community recently. It poses a difficult classification task due to its high dimensional feature space and a relatively small set of available training patterns. We also apply a new set of biophysical features to this problem and present experiments with neural networks, support vector machines, and decision trees. Application of our feature set results in high recognition rates and concise decision trees, producing manageable rule sets that can guide future experiments. In particular, we found a combination of neural networks and support vector machines to be beneficial for this problem.

### Journal of Data Science, v.8, no.2, p.289-306

#### Multilevel Models and Inequality in Viet Nam

##### by Dominique Haughton and Phong Nguyen

- Full Text (PDF): [238.68kB]

This paper proposes to investigate inequality in Viet Nam from the point of view of a study of the urban/rural gap by means of a multilevel model. Using data from the Viet Nam Household Living Standards Survey of 2002, the paper constructs a multilevel model, yielding random effects in the urban/rural gap which can be seen as location-specific random contributions to the urban/rural gap above and beyond the effects of known location characteristics, such as the level of education of the population, etc. The paper also demonstrates how the multilevel model can be used to obtain small area estimates at the commune level.

### Journal of Data Science, v.8, no.2, p.307-325

#### A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year

##### by Chong Ho Yu, Samuel DiGangi, Angel Jannasch-Pennell and Charles Kaprolet

- Full Text (PDF): [247.11kB]

Student retention is an important issue for all university policy makers due to the potential negative impact on the image of the university and the career path of the dropouts. Although this issue has been thoroughly studied by many institutional researchers using parametric techniques, such as regression analysis and logit modeling, this article attempts to bring in a new perspective by exploring the issue with the use of three data mining techniques, namely, classification trees, multivariate adaptive regression splines (MARS), and neural networks. Data mining procedures identify transferred hours, residency, and ethnicity as crucial factors to retention. Carrying transferred hours into the university implies that the students have taken college level classes somewhere else, suggesting that they are more academically prepared for university study than those who have no transferred hours. Although residency was found to be a crucial predictor to retention, one should not go too far as to interpret this finding that retention is affected by proximity to the university location. Instead, this is a typical example of Simpson's Paradox. The geographical information system analysis indicates that non-residents from the east coast tend to be more persistent in enrollment than their west coast schoolmates.

### Journal of Data Science, v.8, no.2, p.327-338

#### Combining Unsupervised and Supervised Neural Networks in Cluster Analysis of Gamma-Ray Burst

##### by Basilio de B. Pereira, Calyampudi R. Rao, Rubens L. Oliveira and Emilia M. do Nascimento

- Full Text (PDF): [1.00MB]

The paper proposes the use of Kohonen's Self Organizing Map (SOM) and supervised neural networks to find clusters in samples of gamma-ray bursts (GRB) using the measurements given in BATSE GRB. The extent of separation between clusters obtained by SOM was examined by a cross validation procedure using supervised neural networks for classification. A method is proposed for variable selection to reduce the "curse of dimensionality". Six variables were chosen for cluster analysis. Additionally, principal components were computed using all the original variables, and 6 components which accounted for a high percentage of variance were chosen for SOM analysis. All these methods indicate 4 or 5 clusters. Further analysis based on the average profiles of the GRB indicated a possible reduction in the number of clusters.

### Journal of Data Science, v.8, no.2, p.339-348

#### Age-Adjusted US Cancer Death Rate Predictions

##### by Matthew J. Hayat, Ram C. Tiwari, Kaushik Ghosh, Mark Hachey, Ben Hankey and Rocky Feuer

- Full Text (PDF): [227.76kB]

The likelihood of developing cancer during one's lifetime is approximately one in two for men and one in three for women in the United States. Cancer is the second-leading cause of death and accounts for one in every four deaths. Evidence-based policy planning and decision making by cancer researchers and public health administrators are best accomplished with up-to-date age-adjusted site-specific cancer death rates. Because of the 3-year lag in reporting, forecasting methodology is employed here to estimate the current year's rates based on complete observed death data up through three years prior to the current year. The authors expand the State Space Model (SSM) statistical methodology currently in use by the American Cancer Society (ACS) to predict age-adjusted cancer death rates for the current year. These predictions are compared with those from the previous Proc Forecast ACS method and results suggest the expanded SSM performs well.

### Journal of Data Science, v.8, no.2, p.349-360

#### Fitting Parametric and Semi-parametric Conditional Poisson Regression Models with Cox's Partial Likelihood in Self-controlled Case Series and Matched Cohort Studies

##### by Stanley Xu, Paul Gargiullo, John Mullooly, David McClure, Simon J. Hambidge and Jason Glanz

- Full Text (PDF): [94.33kB]

The self-controlled case series (SCCS) and the matched cohort are two frequently used study designs to adjust for known and unknown confounding effects in epidemiological studies. Count data arising from these two designs may not be independent. While conditional Poisson regression models have been used to take into account the dependence of such data, these models have not been available in some standard statistical software packages (e.g., SAS). This article demonstrates 1) the relationship of the likelihood function and parameter estimation between the conditional Poisson regression models and Cox's proportional hazard models in SCCS and matched cohort studies; 2) that it is possible to fit conditional Poisson regression models with procedures (e.g., PHREG in SAS) using Cox's partial likelihood model. We tested both conditional Poisson likelihood and Cox's partial likelihood models on data from studies using either SCCS or a matched cohort design. For the SCCS study, we fitted both parametric and semi-parametric models to model age effects, and described a simple way to apply the parametric and complex semi-parametric analysis to case series data.