### Journal of Data Science, v.5, no.1, p.1-21

#### Pseudo-likelihood Methods for the Analysis of Longitudinal Binary Data Subject to Nonignorable Non-monotone Missingness

##### by Michael Parzen, Stuart R. Lipsitz, Garrett M. Fitzmaurice, Joseph G. Ibrahim, Andrea Troxel and Geert Molenberghs

- Full Text (PDF): [192.81kB]

For longitudinal binary data with non-monotone non-ignorable missing outcomes over time, a full likelihood approach is complicated algebraically, and maximum likelihood estimation can be computationally prohibitive with many follow-up times. We propose pseudo-likelihoods to estimate the covariate effects on the marginal probabilities of the outcomes, in addition to the association parameters and missingness parameters. The pseudo-likelihood requires specification of the distribution for the data at all pairs of times on the same subject, but makes no assumptions about the joint distribution of the data at three or more times on the same subject, so the method can be considered semi-parametric. By contrast, maximum likelihood requires the full likelihood to be correctly specified in order to obtain consistent estimates. We show in simulations that our proposed pseudo-likelihood produces a more efficient estimate of the regression parameters than the pseudo-likelihood for non-ignorable missingness proposed by Troxel et al. (1998). Application to data from the Six Cities study (Ware et al., 1984), a longitudinal study of the health effects of air pollution, is discussed.

### Journal of Data Science, v.5, no.1, p.23-40

#### Assessing the Effectiveness of Anti-smoking Media Campaigns by Recall and Rating Scores --- A Pattern-Mixture GEE Model Approach

##### by Ming Ji, Chengjie Xiong, Elizabeth A. Gilpin and Lois Biener

- Full Text (PDF): [159.15kB]

Anti-smoking media campaigns are an effective tobacco control strategy. Identifying which types of advertising messages are effective is important for making the best use of the limited funding available for such campaigns. In this paper, we propose a statistical modeling approach for systematically assessing the effectiveness of anti-smoking media campaigns based on ad recall rates and rating scores. This research is motivated by the need to evaluate youth responses to the Massachusetts Tobacco Control Program (MTCP) media campaign. Pattern-mixture GEE models are proposed to evaluate the impact of viewer and ad characteristics on ad recall rates and rating scores, controlling for missing values, confounding and correlations in the data. A key difficulty for pattern-mixture modeling is that there were too many distinct missing data patterns, which causes convergence problems when fitting models to limited data. A heuristic argument based on collapsing missing data patterns is used to test the missing completely at random (MCAR) assumption in pattern-mixture GEE models. The proposed modeling approach and the recall-rating study design provide a complete system for identifying the most effective type of advertising messages.

### Journal of Data Science, v.5, no.1, p.41-51

#### Application of EM Algorithm to Mixture Cure Model for Grouped Relative Survival Data

##### by Binbing Yu and Ram C. Tiwari

- Full Text (PDF): [116.36kB]

Interest in estimating the probability of cure has been increasing in cancer survival analysis as cure for some cancer sites is becoming a reality. Mixture cure models have been used to model failure time data in the presence of long-term survivors. The mixture cure model assumes that a fraction of the survivors are cured of the disease of interest. The failure time distribution for the uncured individuals (latency) can be modeled by either parametric models or a semi-parametric proportional hazards model. In the model, the probability of cure and the latency distribution are both related to the prognostic factors and patients' characteristics. The maximum likelihood estimates (MLEs) of these parameters can be obtained using the Newton-Raphson algorithm. The EM algorithm has been proposed as a simple alternative by Larson and Dinse (1985) and Taylor (1995) in various settings for cause-specific survival analysis. This approach is extended here to grouped relative survival data. The methods are applied to analyze the colorectal cancer relative survival data from the Surveillance, Epidemiology, and End Results (SEER) program.
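To make the EM idea in this abstract concrete, here is a minimal sketch for a much simpler setting than the paper's: individual right-censored times, no covariates, and an exponential latency distribution, so that the survival function is S(t) = pi + (1 - pi) * exp(-lam * t). It is not the authors' grouped relative survival method. The function names `simulate_cure_data` and `em_mixture_cure` and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_cure_data(n, cure_prob, lam, censor_time, seed=0):
    """Hypothetical helper: simulate (time, event) pairs with a cured fraction.

    Cured subjects never fail and are censored at `censor_time`.
    """
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        if rng.random() < cure_prob:
            t, d = censor_time, 0
        else:
            t = rng.expovariate(lam)
            d = 1
            if t > censor_time:
                t, d = censor_time, 0
        times.append(t)
        events.append(d)
    return times, events


def em_mixture_cure(times, events, n_iter=200):
    """EM for S(t) = pi + (1 - pi) * exp(-lam * t) with right censoring.

    Returns (pi, lam): estimated cure probability and latency hazard rate.
    Assumes at least one observed event.
    """
    pi = 0.5
    lam = len(times) / sum(times)  # crude starting value
    for _ in range(n_iter):
        # E-step: posterior probability that each subject is uncured.
        weights = []
        for t, d in zip(times, events):
            if d == 1:
                weights.append(1.0)  # an observed failure cannot be cured
            else:
                su = math.exp(-lam * t)
                weights.append((1 - pi) * su / (pi + (1 - pi) * su))
        # M-step: closed-form updates for the exponential latency model.
        pi = 1.0 - sum(weights) / len(weights)
        lam = sum(events) / sum(w * t for w, t in zip(weights, times))
    return pi, lam
```

Both M-step updates are closed-form here only because the latency model is exponential. With covariates or a proportional hazards latency model, each M-step would itself require numerical optimization.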

### Journal of Data Science, v.5, no.1, p.53-66

#### Using Occupancy Models to Estimate the Number of Duplicate Cases in a Data System without Unique Identifiers

##### by Ruiguang Song, Timothy Green, Matthew McKenna, and M. Kathleen Glynn

- Full Text (PDF): [143.20kB]

Data systems collecting information from different sources or over long periods of time can receive multiple reports from the same individual. An important example is public health surveillance systems that monitor conditions with long natural histories. Several state-level systems for surveillance of one such condition, the human immunodeficiency virus (HIV), use codes composed of combinations of non-unique personal characteristics such as birth date, soundex (a code based on last name), and sex as patient identifiers. As a result, these systems cannot distinguish between several different individuals having identical codes and a unique individual erroneously represented several times. We applied results for occupancy models to estimate the potential magnitude of duplicate case counting for AIDS cases reported to the Centers for Disease Control and Prevention with only non-unique partial personal identifiers. Occupancy models with equal and unequal occupancy probabilities are considered. Unbiased estimators for the numbers of true duplicates within and between case reporting areas are provided. Formulas to calculate estimators' variances are also provided. These results can be applied to evaluating duplicate reporting in other data systems that have no unique identifier for each individual.
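The classical occupancy calculation behind this kind of analysis can be sketched as follows. If n distinct individuals are assigned codes independently, with code j having probability p_j, the expected number of distinct codes in use is the sum over j of 1 - (1 - p_j)^n. The shortfall from n is the number of code matches expected by chance alone. This is the textbook occupancy result, not the estimators derived in the paper, and the function names are hypothetical.

```python
def expected_distinct_codes(n, probs):
    """Expected number of distinct codes used by n distinct individuals.

    `probs` holds the probability of each possible code; equal entries give
    the equal-probability occupancy model, unequal entries the general one.
    """
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def expected_chance_matches(n, probs):
    """Expected number of records that repeat an already-used code by chance."""
    return n - expected_distinct_codes(n, probs)
```

Under these assumptions, observing more repeated codes than `expected_chance_matches` suggests is evidence of true duplicate reports rather than coincidental matches.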

### Journal of Data Science, v.5, no.1, p.67-83

#### A Frailty Model to Assess Plant Disease Spread from Individual Count Data

##### by Samuel Soubeyrand, Ivan Sache, Christian Lannou and Joel Chadoeuf

- Full Text (PDF): [179.04kB]

Spread of airborne plant diseases from a propagule source is classically assessed by fitting a gradient curve to aggregated data coming from field experiments. However, aggregating data reduces the information available about the processes involved in disease spread. To overcome this problem, individual count data can be collected; this was done in the case of short-distance spread of wheat brown rust. For such data, however, the gradient curve is a limited model since heterogeneity of hosts is ignored and, consequently, overdispersion occurs. We therefore propose a parametric frailty model in which the frailties represent propensities of hosts to be infected. The model is used to assess dispersal of propagules and heterogeneity of hosts.

### Journal of Data Science, v.5, no.1, p.85-101

#### Nonparametric Modeling of Quarterly Unemployment Rates

##### by Lujian Yang

- Full Text (PDF): [176.39kB]

A seasonal additive nonlinear vector autoregression (SANVAR) model is proposed for multivariate seasonal time series to explore the possible interaction among the various univariate series. Significant lagged variables are selected, and additive autoregression functions are estimated from the selected variables using a spline smoothing method. Conservative confidence bands are constructed for the additive autoregression functions. The model is fitted to two sets of bivariate quarterly unemployment rate data, with comparisons made to the linear periodic vector autoregression model. It is found that when the data do not significantly deviate from linearity, the periodic model is preferred. In cases of strong nonlinearity, however, the additive model is more parsimonious and has much higher out-of-sample prediction power. In addition, interactions among various univariate series are automatically detected.

### Journal of Data Science, v.5, no.1, p.103-129

#### Statistical Analysis of Electricity Prices

##### by Estate Khmaladze

- Full Text (PDF): [233.17kB]

The paper presents a statistical analysis of electricity spot prices in a deregulated market in New South Wales, Australia, in the period 10 May, 1996 - 7 March, 1998. It is unusual that a single set of data, such as this, allows one to consider a relatively systematic sequence of statistical problems, each resulting in clear, although not always obvious, solutions. This is the reason why these data and their analysis can be used as a relatively good base for training in practical statistical analysis. Existing formerly as a report, the material has been used in lecture courses in several universities in Australia and New Zealand.

### Journal of Data Science, v.5, no.1, p.131-142

#### Literate Life Expectancy in Bangladesh: A New Approach of Social Indicator

##### by Md. Hasinur Rahaman Khan and Md. Asaduzzaman

- Full Text (PDF): [104.08kB]

Social indicators have been used informally for a very long time, particularly in economics, to assess the state of the nation and progress towards national objectives. Measuring people's quality of life emphasizes human well-being and particularly issues of equity, poverty, and gender. In this context, this paper uses a recent indicator of social development, Literate Life Expectancy (LLE), which was introduced by Lutz (1995). We highlight the importance of using a pure social indicator that is largely demographically based and intentionally does not use any economic measurement, but rather combines both life expectancy and literacy in one number. In other words, Literate Life Expectancy is the aggregate average number of years that a person lives in a literate state. The Literate Life Expectancy index proved to be a very clear, simple and comprehensive measure of social development at the urban or rural level of spatial aggregation. Importantly, this index could be used to calculate future social development by adopting different mortality and educational scenarios, which can be associated with specific policy assumptions. To demonstrate the usefulness of Literate Life Expectancy, we assessed the levels of social development in Bangladesh by place of residence. The results obtained at the national level show a remarkable difference in Literate Life Expectancy between urban and rural people (men and women). Using the literacy and life expectancy information, sex differentials are examined and compared across age groups for both rural and urban areas, and they clearly show the existing gender difference in both rural and urban areas of Bangladesh.
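The definition above (average number of years lived in a literate state) can be illustrated with a prevalence-weighted life table calculation: person-years lived in each age group are weighted by the proportion literate in that group, summed, and divided by the size of the starting cohort. The sketch below assumes this standard form; the function name and the input layout are hypothetical, and the paper's own computation may differ in detail.

```python
def literate_life_expectancy(person_years, prop_literate, radix):
    """Average literate years lived per member of the starting cohort.

    person_years  -- life table person-years lived in each age group
    prop_literate -- proportion literate in each age group (0 to 1)
    radix         -- number of survivors at the starting age
    """
    if len(person_years) != len(prop_literate):
        raise ValueError("inputs must cover the same age groups")
    literate_years = sum(L * p for L, p in zip(person_years, prop_literate))
    return literate_years / radix
```

When every age group is fully literate, the result reduces to ordinary life expectancy at the starting age, which is a quick sanity check on any input table.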