Volume 3, Number 1, January 2005

  • An Application of Graphical Modeling to the Analysis of Intranet Benefits and Applications
  • Exact Robust Tests for Detecting Candidate-Gene Association in Case-Parents Trio Design
  • Increasing the Precision of Estimates of Immunization Coverage Among 19- to 35-Month-Old Children in the United States
  • A Comparison of the Posterior Choropleth Maps for Disease Mapping
  • Observer Variability: A New Approach in Evaluating Interobserver Agreement
  • Comparison of Distance Measures in Cluster Analysis with Dichotomous Data
  • Bayesian Analysis for Change Points in the Volatility of Latin American Emerging Markets

Journal of Data Science, v.3, no.1, p.1-17

An Application of Graphical Modeling to the Analysis of Intranet Benefits and Applications

by Raffaella Settimi, Linda V. Knight, Theresa A. Steinbach and James D. White

Applications of multivariate statistical techniques, including graphical models, are seldom found in e-commerce studies. However, as this paper demonstrates, probabilistic graphical models are useful in this area, both because they can handle large numbers of potentially interrelated variables and because they communicate statistical relationships clearly to both the researcher and the ultimate business audience. We show an application of this methodology to intranets, internal corporate information systems employing Internet technology. In particular, we study both the interrelationships among intranet benefits and the interrelationships among intranet applications. This approach confirms some hypothesized relationships and uncovers previously unanticipated relationships among intranet variables, providing guidance for business professionals seeking to develop effective intranet systems. The techniques described here also have potential applicability in other e-commerce arenas, including business-to-consumer and business-to-business applications.

Journal of Data Science, v.3, no.1, p.19-33

Exact Robust Tests for Detecting Candidate-Gene Association in Case-Parents Trio Design

by Zehua Chen and Gang Zheng

In the case-parents trio design for testing candidate-gene association, the distribution of the data under the null hypothesis of no association is completely known. Therefore, the exact null distribution of any test statistic can be simulated by the Monte Carlo method. In the literature, several robust tests have been proposed for testing association in the case-parents trio design when the genetic model is unknown, but all of these tests are based on the asymptotic null distributions of the test statistics. In this article, we promote exact robust tests based on Monte Carlo simulation, for two reasons: (i) the asymptotic tests are not accurate in terms of the probability of type I error when the sample size is small or moderate; (ii) asymptotic theory is not available for certain good candidate test statistics. We examined the validity of the asymptotic distributions of some of the test statistics studied in the literature and found that in certain cases the probability of type I error is greatly inflated in the asymptotic tests. We also propose new robust test statistics that are statistically more reasonable but for which no asymptotic theory is available. The powers of these robust statistics are compared with those of the existing statistics in the literature through a simulation study. These robust statistics are found to be preferable to the others in terms of efficiency and robustness.
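
The idea of simulating an exact null distribution can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes a simple TDT-style statistic computed from heterozygous parents only, for which each heterozygous parent transmits the candidate allele with probability 1/2 under the null hypothesis. The function names, the choice of statistic, and the add-one correction are assumptions made for the example.

```python
import random


def tdt_statistic(transmitted, total):
    """TDT-style statistic (b - c)^2 / (b + c).

    b = transmitted, c = total - transmitted, counted over
    heterozygous parents (an assumed, simplified statistic).
    """
    if total == 0:
        return 0.0
    not_transmitted = total - transmitted
    return (transmitted - not_transmitted) ** 2 / total


def monte_carlo_p_value(transmitted, total, n_sim=10000, seed=0):
    """Estimate the p-value from the simulated exact null distribution."""
    rng = random.Random(seed)
    observed = tdt_statistic(transmitted, total)
    exceed = 0
    for _ in range(n_sim):
        # Under the null, each heterozygous parent transmits the
        # candidate allele with probability 1/2.
        simulated = sum(rng.random() < 0.5 for _ in range(total))
        if tdt_statistic(simulated, total) >= observed:
            exceed += 1
    # The add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

Because the null distribution is simulated rather than approximated, the same loop works for any statistic that can be computed from the simulated data, including statistics with no known asymptotic distribution.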

Journal of Data Science, v.3, no.1, p.35-45

Increasing the Precision of Estimates of Immunization Coverage Among 19- to 35-Month-Old Children in the United States

by Lawrence E. Barker, Mary M. McCauley and Qian Li

The National Immunization Survey (NIS) is the United States' primary tool for assessing immunization coverage among 19- to 35-month-old children. Although annual estimates from the NIS are quite precise at the national level, US State-level estimates have much larger sampling error than national-level estimates. We combined two independent unbiased estimates of US State-level coverage within a given year to obtain new estimates that are more precise than previously published estimates. We first calculated a model-based estimate for each State for 2001 using multiple years of NIS data. Next, we combined each model-based estimate with the corresponding, previously reported NIS estimate for 2001. Our resulting estimates of State-level immunization coverage had smaller standard errors than the previously published estimates. To make similar improvements in precision by increasing sample size would, depending on the State, require an increase in sample size of 30% to 120%.
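
One common way to combine two independent unbiased estimates is inverse-variance weighting, which gives the minimum-variance unbiased linear combination. The abstract does not state which weighting the authors used, so the sketch below is an assumption, as are the function names and the rule of thumb that sampling variance is proportional to 1/n.

```python
import math


def combine_estimates(est1, se1, est2, se2):
    """Inverse-variance weighted combination of two independent estimates.

    Returns the combined estimate and its standard error.
    """
    w1 = 1.0 / se1 ** 2
    w2 = 1.0 / se2 ** 2
    combined = (w1 * est1 + w2 * est2) / (w1 + w2)
    combined_se = math.sqrt(1.0 / (w1 + w2))
    return combined, combined_se


def equivalent_sample_size_increase(se_original, se_combined):
    """Fractional sample-size increase that would yield the same precision.

    Assumes the sampling variance is proportional to 1/n.
    """
    return (se_original / se_combined) ** 2 - 1.0


# Hypothetical numbers, not taken from the NIS.
estimate, se = combine_estimates(0.75, 0.03, 0.79, 0.03)
increase = equivalent_sample_size_increase(0.03, se)
```

With two equally precise inputs, as in these hypothetical numbers, the combined variance is halved, which corresponds to doubling the sample size.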

Journal of Data Science, v.3, no.1, p.47-68

A Comparison of the Posterior Choropleth Maps for Disease Mapping

by Balgobin Nandram, Jie Liu and Jai Won Choi

In Bayesian analysis of mortality rates it is standard practice to present the posterior mean rates in a choropleth map, a stepped statistical surface identified by colored or shaded areas. A natural objection to the posterior mean map is that it may not be the "best" representation of the mortality rates. One should really present the map that has the highest posterior density over the ensemble of areas in the map, that is, the posterior modal map, whose coordinates maximize the joint posterior density of the mortality rates. We apply a Poisson regression model, a Bayesian hierarchical model that has been used to study mortality data and other rare events when there are occurrences from many areas. The model provides convenient Rao-Blackwellized estimators of the mortality rates. Our method enables us to construct the posterior modal map of mortality data from chronic obstructive pulmonary diseases (COPD) in the continental United States. We show how to fit the Poisson regression model using Markov chain Monte Carlo methods (i.e., the Metropolis-Hastings sampler), and both the posterior modal map and the posterior mean map are obtained by an output analysis from the Metropolis-Hastings sampler. The COPD data are used to provide an empirical comparison of these two maps. As expected, we found important differences between the two maps, and we recommend that the posterior modal map be used.

Journal of Data Science, v.3, no.1, p.69-83

Observer Variability: A New Approach in Evaluating Interobserver Agreement

by Michael Haber, Huiman X. Barnhart, Jingli Song and James Gruden

Existing indices of observer agreement for continuous data, such as the intraclass correlation coefficient or the concordance correlation coefficient, measure the total observer-related variability, which includes the variabilities between and within observers. This work introduces a new index that measures the interobserver variability, which is defined in terms of the distances among the "true values" assigned by different observers to the same subject. The new coefficient of interobserver variability (CIV) is defined as the ratio of the interobserver variability to the total observer variability. We show how to estimate the CIV and how to use bootstrap and ANOVA-based methods for inference. We also develop a coefficient of excess observer variability, which compares the total observer variability to the expected total observer variability when there are no differences among the observers. This coefficient is a simple function of the CIV. In addition, we show how the value of the CIV, estimated from an agreement study, can be used in the design of measurement studies. We illustrate the new concepts and methods with two examples, in which (1) two radiologists used calcium scores to evaluate the severity of coronary artery arteriosclerosis, and (2) two methods were used to measure knee joint angle.

Journal of Data Science, v.3, no.1, p.85-100

Comparison of Distance Measures in Cluster Analysis with Dichotomous Data

by Holmes Finch

The current study examines the performance of cluster analysis with dichotomous data using distance measures based on response pattern similarity. In many contexts, such as educational and psychological testing, cluster analysis is a useful means for exploring datasets and identifying underlying groups among individuals. However, standard approaches to cluster analysis assume that the variables used to group observations are continuous in nature. This paper focuses on four methods for calculating distance between individuals using dichotomous data, and the subsequent introduction of these distances to a clustering algorithm such as Ward's. The four methods in question are potentially useful for practitioners because they are relatively easy to carry out using standard statistical software such as SAS and SPSS, and have been shown to have potential for correctly grouping observations based on dichotomous data. Results of both a simulation study and an application to a set of binary survey responses show that three of the four measures behave similarly, and can yield correct cluster recovery rates of between 60% and 90%. Furthermore, these methods were found to work better, in nearly all cases, than using the raw data with Ward's clustering algorithm.
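
To make the notion of a distance based on response pattern similarity concrete, the sketch below computes two widely used measures for binary vectors, the simple matching distance and the Jaccard distance. The abstract does not name the four measures studied, so these two are illustrative assumptions and may not be among them; the function names are also made up for the example.

```python
def simple_matching_distance(a, b):
    """Proportion of items on which two 0/1 response patterns differ."""
    if len(a) != len(b):
        raise ValueError("response patterns must have equal length")
    if not a:
        raise ValueError("response patterns must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def jaccard_distance(a, b):
    """One minus the share of jointly endorsed items among items endorsed by either."""
    if len(a) != len(b):
        raise ValueError("response patterns must have equal length")
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    if either == 0:
        # Neither respondent endorsed any item; treat the patterns as identical.
        return 0.0
    return 1.0 - both / either


# Hypothetical responses of two individuals to five dichotomous items.
person_a = [1, 0, 1, 1, 0]
person_b = [1, 1, 0, 1, 0]
print(simple_matching_distance(person_a, person_b))  # 0.4
print(jaccard_distance(person_a, person_b))          # 0.5
```

The two measures differ in how they treat items that neither individual endorsed: simple matching counts them as agreement, while Jaccard ignores them. A matrix of such pairwise distances is the kind of input a hierarchical clustering algorithm can then group.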

Journal of Data Science, v.3, no.1, p.101-122

Bayesian Analysis for Change Points in the Volatility of Latin American Emerging Markets

by Rosangela H. Loschi, Cristiano R. Moura and Pilar L. Iglesias

We extend previous work by applying the product partition model (PPM) to identify multiple change points in the variance of a normal data sequence, assuming a mean equal to zero. This type of problem is very common in applied economics and finance. We consider the Gibbs sampling scheme proposed in the literature to obtain the posterior estimates, or product estimates, of the variance, as well as the posterior distributions of the instants when changes take place and of the number of change points in the sequence. The PPM is used to obtain the posterior behavior of the volatility (measured as the variance) in the series of returns of four important Latin American stock indexes (MERVAL-Argentina, IBOVESPA-Brazil, IPSA-Chile and IPyC-Mexico). The posterior number of change points and the posterior most probable partition for each index series are also obtained.