Volume 6, Number 2, April 2008

  • A Statistical Approach for Dating Archaeological Contexts
  • Recovering Vote Choice from Partial Incomplete Data
  • Evaluating Aortic Stenosis Using the Archimedean Copula Methodology
  • On the Principles of Believe the Positive and Believe the Negative for Diagnosis Using Two Continuous Tests
  • Confidence Band for Additive Regression Model
  • Bayesian Wavelet Regression for Spatial Estimation
  • Application of the Pattern-Mixture Latent Trajectory Model in an Epidemiological Study with Non-Ignorable Missingness
  • Identifying the Unique Projection and Follow-up Runs for k = 4 or 5 Important Factors from the n = 12, 20 or 24-run Plackett Burman Designs
  • A Bayesian Approach to Zero-Numerator Problems Using Hierarchical Models

Journal of Data Science, v.6, no.2, p.135-154

A Statistical Approach for Dating Archaeological Contexts

by L. Bellanger, R. Tomassone, and P. Husi

This paper describes a statistical model, developed from correspondence analysis, to date archaeological contexts of the city of Tours (France) and to obtain an estimated absolute timescale. The data set used in the study is reported as a contingency table of ceramics against contexts. Because pottery is not intrinsically a dating indicator (a date is rarely inscribed on a piece of pottery), we estimate the dates of contexts from their finds, and we use coins to attest the dates of assemblages. The model-based approach uses classical tools (correspondence analysis, linear regression and resampling methods) in an iterative scheme. Archaeologists may find in the paper a useful set of known statistical methods, while statisticians can learn a way to order well-known techniques. No method is new, but their combination is characteristic of this application.

Journal of Data Science, v.6, no.2, p.155-171

Recovering Vote Choice from Partial Incomplete Data

by Wendy K. Tam Cho and George G. Judge

In voting rights cases, judges often infer unobservable individual vote choices from election data aggregated at the precinct level. That is, one must solve an ill-posed inverse problem to obtain the critical information used in these cases. The ill-posed nature of the problem means that traditional frequentist and Bayesian approaches cannot be employed without first imposing a range of assumptions. In order to mitigate the problems resulting from incorporating potentially inaccurate information in these cases, we propose the use of information theoretic methods as a basis for recovering an estimate of the unobservable individual vote choices. We illustrate the empirical non-parametric likelihood methods with some election data.

Journal of Data Science, v.6, no.2, p.173-187

Evaluating Aortic Stenosis Using the Archimedean Copula Methodology

by Pranesh Kumar and Mohamed M. Shoukri

In modeling and analyzing multivariate data, the conventional measure of dependence is Pearson's correlation coefficient. However, the use of correlation as a dependence measure has several pitfalls. Copulas have recently emerged as an alternative measure of dependence that overcomes most of the drawbacks of correlation. We discuss Archimedean copulas and their relationships with tail dependence. An algorithm to construct empirical and Archimedean copulas is described. Monte Carlo simulations are carried out to replicate and analyze data sets by identifying the appropriate copula. We apply the Archimedean copula based methodology to assess the accuracy of Doppler echocardiography in determining aortic valve area, using data from the "Aortic Stenosis: Simultaneous Doppler-Catheter Correlative" study carried out at the King Faisal Specialist Hospital and Research Centre, Riyadh, KSA.
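
As a rough illustration of the kind of simulation the abstract mentions, the sketch below samples from a Clayton copula, one well-known Archimedean family chosen here purely as an assumption (the abstract does not say which families the authors fit). It uses the standard conditional-inversion sampler and compares the sample Kendall's tau with the Clayton value theta / (theta + 2); the function names and the choice theta = 2 are hypothetical.

```python
import random

def sample_clayton(n, theta, seed=0):
    """Draw n pairs (u, v) from a Clayton copula with parameter theta > 0."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
        w = 1.0 - rng.random()  # conditional probability level
        # Invert the conditional distribution C(v | u) = w in closed form.
        inner = (w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0
        pairs.append((u, inner ** (-1.0 / theta)))
    return pairs

def kendall_tau(pairs):
    """Sample Kendall's tau from concordant and discordant pairs (O(n^2))."""
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            s = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (len(pairs) * (len(pairs) - 1) / 2)

theta = 2.0                            # illustrative dependence parameter
pairs = sample_clayton(800, theta)
tau_hat = kendall_tau(pairs)
tau_theory = theta / (theta + 2.0)     # Kendall's tau of the Clayton copula
lower_tail = 2.0 ** (-1.0 / theta)     # Clayton lower tail dependence
print(f"sample tau = {tau_hat:.3f}, theoretical tau = {tau_theory:.3f}")
print(f"lower tail dependence coefficient = {lower_tail:.3f}")
```

The Clayton family has lower tail dependence and no upper tail dependence, which is the sort of asymmetry that a single correlation coefficient cannot express.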

Journal of Data Science, v.6, no.2, p.189-205

On the Principles of Believe the Positive and Believe the Negative for Diagnosis Using Two Continuous Tests

by Changyu Shen

Believe the Positive (BP) and Believe the Negative (BN) rules for combining two continuous diagnostic tests are compared with procedures based on the likelihood ratio and on a linear combination of the two tests. The sensitivity-specificity relationship for BP/BN is illustrated through a graphical presentation of a "ROC surface", which leads to a natural approach to choosing between BP and BN. With a bivariate normal model, it is shown that the discriminating power of this approach is higher when the correlations between the two tests have different signs in the non-diseased and diseased populations, given that the locations and variations of the two distributions are fixed. The idea is illustrated through an example.
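
The two rules can be sketched in a few lines. BP calls a subject positive if either test is positive, while BN calls a subject positive only if both are. The example below is a minimal sketch under stated assumptions: higher test values indicate disease, the two tests are simulated as independent normals (not the correlated bivariate normal model of the paper), and the cutoffs and means are hypothetical.

```python
import random

def believe_positive(x1, x2, c1, c2):
    """BP: positive if EITHER test exceeds its cutoff."""
    return x1 > c1 or x2 > c2

def believe_negative(x1, x2, c1, c2):
    """BN: positive only if BOTH tests exceed their cutoffs."""
    return x1 > c1 and x2 > c2

def sens_spec(rule, diseased, healthy, c1, c2):
    """Empirical sensitivity and specificity of a combination rule."""
    sens = sum(rule(a, b, c1, c2) for a, b in diseased) / len(diseased)
    spec = sum(not rule(a, b, c1, c2) for a, b in healthy) / len(healthy)
    return sens, spec

rng = random.Random(0)
# Hypothetical populations: independent N(0, 1) and N(1.5, 1) test results.
healthy = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
diseased = [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(2000)]

bp = sens_spec(believe_positive, diseased, healthy, 1.0, 1.0)
bn = sens_spec(believe_negative, diseased, healthy, 1.0, 1.0)
print(f"BP: sensitivity = {bp[0]:.3f}, specificity = {bp[1]:.3f}")
print(f"BN: sensitivity = {bn[0]:.3f}, specificity = {bn[1]:.3f}")
```

At a common pair of cutoffs, BP can never be less sensitive than BN and BN can never be less specific than BP, because every BN-positive subject is also BP-positive. Varying the two cutoffs independently traces out the surface of sensitivity-specificity pairs.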

Journal of Data Science, v.6, no.2, p.207-217

Confidence Band for Additive Regression Model

by Lijian Yang

The additive model is widely recognized as an effective tool for dimension reduction. Existing methods for estimating the additive regression function, including backfitting, marginal integration, projection and spline methods, do not provide any level of uniform confidence. In this paper, a simple construction of a confidence band is proposed for the additive regression function, based on polynomial spline estimation and the wild bootstrap. Monte Carlo results show three desirable properties of the proposed band: excellent coverage of the true function, width rapidly shrinking to zero with increasing sample size, and minimal computing time. These properties make the procedure highly recommended for nonparametric regression with confidence when additive modeling is appropriate.

Journal of Data Science, v.6, no.2, p.219-229

Bayesian Wavelet Regression for Spatial Estimation

by G. Avarez and B. Sanso

We consider the problem of estimating the properties of an oil reservoir, like porosity and sand thickness, in an exploration scenario where only a few wells have been drilled. We use gamma ray records measured directly from the wells as well as seismic traces recorded around the wells. To model the association between the soil properties and the signals, we fit a linear regression model. Additionally we account for the spatial correlation structure of the observations using a correlation function that depends on the distance between two points. We transform the predictor variable using discrete wavelets and then perform a Bayesian variable selection using a Metropolis search. We obtain predictions of the properties over the whole reservoir providing a probabilistic quantification of their uncertainties, thanks to the Bayesian nature of our method. The cross-validated results show that a very high accuracy can be achieved even with a very small number of wavelet coefficients.

Journal of Data Science, v.6, no.2, p.231-246

Application of the Pattern-Mixture Latent Trajectory Model in an Epidemiological Study with Non-Ignorable Missingness

by Hiroko H. Dodge, Changyu Shen and Mary Ganguli

In longitudinal studies where the same individuals are followed over time, bias caused by unobserved data raises a serious concern, particularly when the data are missing in a non-ignorable manner. One approach to dealing with non-ignorable missing data is the pattern-mixture model. In this paper, we combine the pattern-mixture model with latent trajectory analysis using the SAS TRAJ procedure, which offers a practical solution to many problems of the same nature. Our model assumes a stochastic process that categorizes a relatively large number of missing-data patterns into several latent groups, each of which has a unique outcome trajectory. This allows patterns with missing values to share information with patterns that have more data points. We estimated the longitudinal trajectories of a memory test over 12 years of follow-up, using data from a prospective epidemiological study of dementia. Missing-data patterns were created conditional on survival, and the final marginal response was obtained by excluding those who had died at each time point. The approach presented here is appealing because it can be easily implemented using common software.

Journal of Data Science, v.6, no.2, p.247-259

Identifying the Unique Projection and Follow-up Runs for k = 4 or 5 Important Factors from the n = 12, 20 or 24-run Plackett Burman Designs

by J. Marcus Jobe and Tom Critzer

Complexities involved with identifying the projection for a specific set of k factors (k = 2, ... ,11) from an n-run (n = 12, 20 or 24) Plackett Burman design are described. Once the correct projection is determined, difficulties with selecting the necessary additional runs to complete either the full or half fraction factorial for the respective projection are noted, especially for n = 12, 20 or 24 and k = 4 or 5. Because of these difficulties, a user-friendly computational approach that identifies the projection and corresponding necessary follow-up runs to complete the full or half fraction factorial is given. The method is illustrated with a real data example.
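
To make the idea of a projection and its follow-up runs concrete, the sketch below builds the 12-run Plackett-Burman design from its standard cyclic generator row, projects it onto a chosen set of k = 4 columns, and lists the factor-level combinations still missing from the full 2^4 factorial. This is only an illustration and not the authors' computational approach: it does not classify the projection type or handle completion to a half fraction, and the helper names and the choice of columns are hypothetical.

```python
from itertools import combinations, product

# Standard generator row of the 12-run Plackett-Burman design (+1 / -1 coding).
GENERATOR = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]

def plackett_burman_12():
    """Eleven cyclic shifts of the generator plus a final row of all -1."""
    rows = [GENERATOR[-i:] + GENERATOR[:-i] for i in range(11)]
    rows.append([-1] * 11)
    return rows

def project(design, factors):
    """Keep only the columns of the chosen factors."""
    return [tuple(row[f] for f in factors) for row in design]

def follow_up_runs(design, factors):
    """Level combinations absent from the projection (full factorial gap)."""
    observed = set(project(design, factors))
    return [run for run in product((-1, 1), repeat=len(factors))
            if run not in observed]

design = plackett_burman_12()

# Sanity check: every pair of columns is orthogonal.
orthogonal = all(sum(r[a] * r[b] for r in design) == 0
                 for a, b in combinations(range(11), 2))

factors = (0, 1, 2, 3)  # hypothetical set of important factors
projection = project(design, factors)
missing = follow_up_runs(design, factors)
print("columns orthogonal:", orthogonal)
print(len(set(projection)), "distinct runs in the projection")
print(len(missing), "follow-up runs needed for the full factorial:")
for run in missing:
    print(run)
```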

Journal of Data Science, v.6, no.2, p.261-268

A Bayesian Approach to Zero-Numerator Problems Using Hierarchical Models

by Zhongxue Chen and Monnie McGee

The rule of three gives 3/n as the upper 95% bound for the success rate in zero-numerator problems. Although useful in practice, this bound is usually conservative. Some Bayesian methods with beta distributions as priors have been studied. However, choosing the parameters for the priors is subjective and can severely impact the corresponding posterior distributions. In this paper, some hierarchical models are proposed, which provide practitioners with other options for zero-numerator problems.
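
The arithmetic behind the rule of three can be checked directly. With zero events in n independent trials, the exact one-sided 95% upper bound solves (1 - p)^n = 0.05, giving p = 1 - 0.05^(1/n), and since -ln(0.05) is about 2.996 this is slightly below 3/n. The sketch below also shows a simple non-hierarchical Bayesian comparison under an assumed Beta(1, b) prior, for which the posterior after zero events is Beta(1, b + n) and its upper quantile has a closed form. The prior family and the values of b are illustrative assumptions, not the hierarchical models proposed in the paper.

```python
def rule_of_three(n):
    """Approximate 95% upper bound for the rate after 0 events in n trials."""
    return 3.0 / n

def exact_upper_bound(n, alpha=0.05):
    """Solve (1 - p)**n = alpha for p."""
    return 1.0 - alpha ** (1.0 / n)

def beta_posterior_upper(n, b=1.0, alpha=0.05):
    """Upper (1 - alpha) posterior quantile under an assumed Beta(1, b) prior.

    After 0 events in n trials the posterior is Beta(1, b + n), whose CDF is
    1 - (1 - p)**(b + n), so the quantile can be written in closed form.
    """
    return 1.0 - alpha ** (1.0 / (b + n))

for n in (20, 50, 100):
    print(f"n = {n:3d}: 3/n = {rule_of_three(n):.4f}, "
          f"exact = {exact_upper_bound(n):.4f}, "
          f"Beta(1,1) = {beta_posterior_upper(n, 1.0):.4f}, "
          f"Beta(1,10) = {beta_posterior_upper(n, 10.0):.4f}")
```

Changing b from 1 to 10 moves the posterior bound noticeably for small n, which illustrates the sensitivity to prior parameters that the abstract raises.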