Volume 6, Number 1, January 2008

  • Estimating Optimum Linear Combination of Multiple Correlated Diagnostic Tests at a Fixed Specificity with Receiver Operating Characteristic Curves
  • Indirect Area Estimates of Disease Prevalence: Bayesian Evidence Synthesis with an Application to Coronary Heart Disease
  • Models for Value-added Investigations of Teaching Styles Data
  • Bayesian Circle Segmentation with Application to DNA Copy Number Alteration Detection
  • A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors
  • Identifying Multisubject Cortical Activation in Functional MRI: A Frequency Domain Approach
  • Joint Spatio-Temporal Modeling of Low Incidence Cancers Sharing Common Risk Factors
  • Underlying and Multiple Causes of Death in Preterm Infants

Journal of Data Science, v.6, no.1, p.1-13

Estimating Optimum Linear Combination of Multiple Correlated Diagnostic Tests at a Fixed Specificity with Receiver Operating Characteristic Curves

by Feng Gao, Chengjie Xiong, Yan Yan, Kai Yu and Zhengjun Zhang

Receiver operating characteristic (ROC) methodology is widely used to evaluate diagnostic tests. It is not uncommon in medical practice for multiple diagnostic tests to be applied to the same study sample, and a variety of methods have been proposed to combine such potentially correlated tests to increase diagnostic accuracy. Usually the optimum combination is sought by maximizing the area under the ROC curve (AUC), an overall summary statistic that measures the distance between the distributions of the diseased and non-diseased populations. For many clinical practitioners, however, a more relevant question may be: what would the sensitivity be at a given specificity (say, 90%), or the specificity at a given sensitivity? Generally there is no unique linear combination superior to all others over the entire range of specificities or sensitivities. Within the ROC framework, this paper presents a method for estimating the optimum linear combination that maximizes sensitivity at a fixed specificity, assuming the diagnostic tests follow a multivariate normal distribution. The method is applied to a real-world study in which the accuracy of two biomarkers was evaluated in the diagnosis of pancreatic cancer, and its performance is also evaluated in simulation studies.
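The normal-theory idea behind such a search can be sketched numerically. The snippet below is not the authors' estimator; it is a minimal illustration, with hypothetical biomarker parameters, of maximizing the closed-form sensitivity of a linear combination at a fixed 90% specificity under multivariate normality:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sensitivity_at_specificity(c, mu0, mu1, S0, S1, spec=0.90):
    """Sensitivity of the combination c'X at the threshold giving `spec` specificity,
    assuming X ~ N(mu0, S0) in non-diseased and N(mu1, S1) in diseased subjects."""
    c = np.asarray(c, float)
    z = norm.ppf(spec)                        # quantile fixing the specificity
    thr = c @ mu0 + z * np.sqrt(c @ S0 @ c)   # threshold on the combined score
    return norm.sf((thr - c @ mu1) / np.sqrt(c @ S1 @ c))

# Hypothetical parameters for two correlated biomarkers.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.8])
S0 = np.array([[1.0, 0.4], [0.4, 1.0]])
S1 = np.array([[1.2, 0.5], [0.5, 1.1]])

# Search over the direction c (the score is scale-invariant, so normalize c).
res = minimize(lambda c: -sensitivity_at_specificity(c / np.linalg.norm(c),
                                                     mu0, mu1, S0, S1),
               x0=np.array([1.0, 1.0]), method="Nelder-Mead")
c_opt = res.x / np.linalg.norm(res.x)
```

For these (made-up) parameters the combined score attains a higher sensitivity at 90% specificity than either marker alone.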

Journal of Data Science, v.6, no.1, p.15-32

Indirect Area Estimates of Disease Prevalence: Bayesian Evidence Synthesis with an Application to Coronary Heart Disease

by Peter Congdon

Risks for many chronic diseases (coronary heart disease, cancer, mental illness, diabetes, asthma, etc.) are strongly linked to both socio-economic status and ethnic group, and so prevalence varies considerably between areas. Variations in prevalence are important in assessing health care needs and in comparing health care provision (e.g. surgical intervention rates) to health need. This paper focuses on estimating the prevalence of coronary heart disease and uses a Bayesian approach to synthesize information of different types to make indirect prevalence estimates for geographic units where prevalence data are not otherwise available. One source is information on prevalence risk gradients from national health survey data; such data typically provide only regional identifiers (for confidentiality reasons), and so gradients by age, sex, ethnicity, broad region, and socio-economic status may be obtained by regression methods. Often a series of health surveys is available, and one may consider pooling strength over surveys by using information on prevalence gradients from earlier surveys (e.g. via a power prior approach). The second source of information is population totals by age, sex, ethnicity, etc. from censuses or intercensal population estimates, to which survey-based prevalence rates are applied. The other potential data source is information on area mortality, since for heart disease and some other major chronic diseases there is a positive correlation over areas between prevalence of disease and mortality from that disease. A case study considers the development of estimates of coronary heart disease prevalence in 354 English areas using (a) data from the Health Surveys for England for 2003 and 1999, (b) population data from the 2001 UK Census, and (c) area mortality data for 2003.
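As a hedged illustration of the pooling-over-surveys idea mentioned above (not the paper's regression-based synthesis), a power prior in its simplest conjugate beta-binomial form discounts a historical survey's likelihood by a weight a0 between 0 and 1; all counts below are hypothetical:

```python
from scipy.stats import beta

# Hypothetical survey counts for a prevalence p:
# historical survey (1999): y0 cases out of n0; current survey (2003): y of n.
y0, n0 = 45, 600
y, n = 52, 550
a0 = 0.5   # discounting weight on the historical likelihood, 0 <= a0 <= 1

# With a Beta(1, 1) initial prior, binomial likelihoods are conjugate, so the
# power-prior posterior is Beta(1 + y + a0*y0, 1 + (n - y) + a0*(n0 - y0)).
post = beta(1 + y + a0 * y0, 1 + (n - y) + a0 * (n0 - y0))
mean = post.mean()
lo, hi = post.ppf([0.025, 0.975])
```

Setting a0 = 0 discards the 1999 survey entirely, while a0 = 1 pools it at full weight; intermediate values borrow strength while guarding against drift in prevalence between surveys.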

Journal of Data Science, v.6, no.1, p.33-51

Models for Value-added Investigations of Teaching Styles Data

by Neil H. Spencer

This paper considers models of educational data where a value-added analysis is required. These models are multilevel in nature and contain endogenous regressors. Multivariate models are considered so as to simultaneously model results from different subject areas. Path models and factor models are considered as types of model that can be used to overcome the problem of endogeneity. Estimation methods available in MLwiN and EQS are used. The use of a factor model with EQS is shown to give estimates of the effects of teaching styles that have smaller standard errors than any other method studied.

Journal of Data Science, v.6, no.1, p.53-73

Bayesian Circle Segmentation with Application to DNA Copy Number Alteration Detection

by Junfeng Liu, E. James Harner and Harry Yang

Several statistical approaches have been proposed for circumstances in which no single distribution fits the whole domain. This paper studies Bayesian detection of multiple interior epidemic/square waves in an interval domain characterized by identical statistical distributions at its two ends. We introduce a simple dimension-matching parameter proposal to implement sampling-based posterior inference for the special case where each segmented distribution on a circle has the same set of regulating parameters. Molecular biology research reveals that cancer progression may involve DNA copy number alterations in genome regions, and that the connection of two biologically inactive chromosome ends results in a circle holding multiple epidemic/square waves. A slight modification of a simple novel Bayesian change point identification algorithm, random grafting-pruning Markov chain Monte Carlo (RGPMCMC), is proposed by adjusting the original change point birth/death symmetric transition probability with a differ-by-one change point number ratio. The algorithm's performance is studied through simulations connected to DNA copy number alteration detection, which promises potential application to cancer diagnosis at the genome level.

Journal of Data Science, v.6, no.1, p.75-87

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors

by Scott Berry and Kert Viele

Frequentist and Bayesian hypothesis testing are often viewed as "two separate worlds" by practitioners. While theoretical relationships of course exist, our goal here is to demonstrate a practical example where one must be careful conducting frequentist hypothesis testing, and in that context to illustrate a practical equivalence between Bayesian and frequentist testing. In particular, if the sample size is random (hardly unusual in practical problems where the sample size may be "all available experimental units"), then choosing an $\alpha$ level in advance such as 0.05 and using it for every possible sample size is inadmissible. In other words, one can find a different overall procedure which has the same overall type I error but greater power. Not coincidentally, this alternative procedure is based on Bayesian testing procedures.
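The inadmissibility claim can be checked in a toy setting. The sketch below (a simple-vs-simple normal mean test with a hypothetical two-point sample-size distribution, not an example taken from the paper) compares the fixed-$\alpha$ procedure with a single likelihood-ratio cutoff calibrated to the same average type I error:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Simple-vs-simple normal mean test, sigma = 1: H0: mu = 0 vs H1: mu = 1.
# The sample size is random: N = 10 or 100 with probability 1/2 each.
ns = np.array([10, 100])

def size_power(alpha_n):
    """Average type I error and power when level alpha_n[i] is used at ns[i]."""
    zcrit = norm.isf(alpha_n)               # per-n critical values for the z statistic
    power = norm.sf(zcrit - np.sqrt(ns))    # z-test power under mu = 1
    return alpha_n.mean(), power.mean()

# (a) Naive procedure: alpha = 0.05 at every realized sample size.
size_fix, power_fix = size_power(np.array([0.05, 0.05]))

# (b) Likelihood-ratio procedure: one common cutoff t on the log likelihood
#     ratio sum(X) - n/2, calibrated so the *average* size is still 0.05.
def avg_size(t):
    return norm.sf((t + ns / 2) / np.sqrt(ns)).mean()

t = brentq(lambda t: avg_size(t) - 0.05, -20.0, 20.0)
alpha_lr = norm.sf((t + ns / 2) / np.sqrt(ns))  # implied per-n levels
size_lr, power_lr = size_power(alpha_lr)
```

The common-cutoff rule spends more $\alpha$ at the small sample size, where it buys power, and almost none at the large one, where power is already near 1; both procedures have average size 0.05, but the second has strictly greater average power.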

Journal of Data Science, v.6, no.1, p.89-103

Identifying Multisubject Cortical Activation in Functional MRI: A Frequency Domain Approach

by Joao Ricardo Sato, Chang Chiann, Eduardo Hiromassa Taniguchi, Emerson Gomes dos Santos, Paula Ricci Arantes, Maria Lucia Mourao, Edson Amaro Junior and Pedro Alberto Morettin

Functional magnetic resonance imaging (fMRI) has, since its description fifteen years ago, become the most common in-vivo neuroimaging technique. fMRI allows the identification of brain areas related to specific tasks through statistical analysis of the BOLD (blood oxygenation level dependent) signal. Classically, the observed BOLD signal is compared to an expected haemodynamic response function (HRF) using a general linear model (GLM). However, the results of the GLM rely on the HRF specification, which is usually determined in an ad hoc fashion. For periodic experimental designs, we propose a multisubject frequency domain brain mapping approach, which requires only the stimulation frequency and consequently avoids subjective choices of HRF. We present computational simulations that demonstrate good performance of the proposed approach for short time series. In addition, an application to real fMRI datasets is presented.
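As a hedged sketch of the frequency-domain idea (not the authors' multisubject mapping procedure), the periodogram of a simulated voxel time series peaks at the stimulation frequency, and a Fisher-type peak statistic separates an "active" voxel from a resting one; all parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical periodic design: 128 scans, stimulation cycle of 16 scans.
n, f_stim = 128, 1 / 16
t = np.arange(n)
bold = np.sin(2 * np.pi * f_stim * t)             # task-locked component
voxel_active = bold + rng.normal(scale=1.0, size=n)
voxel_rest = rng.normal(scale=1.0, size=n)

def periodogram_peak(x):
    """Frequency with the largest periodogram ordinate (excluding DC),
    plus the share of total power concentrated at that peak."""
    x = x - x.mean()
    pgram = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x))
    k = 1 + np.argmax(pgram[1:])                  # skip the zero frequency
    return freqs[k], pgram[k] / pgram[1:].sum()   # peak freq, Fisher-type statistic
```

Because only the stimulation frequency is needed to locate the relevant periodogram ordinate, no HRF shape has to be assumed.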

Journal of Data Science, v.6, no.1, p.105-123

Joint Spatio-Temporal Modeling of Low Incidence Cancers Sharing Common Risk Factors

by Jacob J. Oleson, Brian J. Smith and Hoon Kim

In this article, we present a joint modeling approach that combines information from multiple diseases. Our model can be used to obtain more reliable estimates for rare diseases by incorporating information from more common diseases that share a set of important risk factors. Information is shared through both a latent spatial process and a latent temporal process. We develop a fully Bayesian hierarchical implementation of our spatio-temporal model in order to estimate relative risk, adjusted for age and gender, at the county level in Iowa in five-year intervals for the period 1973-2002. Our analysis includes lung, oral, and esophageal cancers, which share excessive tobacco and alcohol use as risk factors. Lung cancer risk estimates tend to be stable due to the large number of occurrences in small regions, i.e., counties. The lower-risk cancers (oral and esophageal) have fewer occurrences in small regions and thus have estimates that are highly variable and unreliable. Estimates from individual and joint modeling of these diseases are examined and compared. The joint modeling approach has a profound impact on estimates for the low-risk oral and esophageal cancers, while the higher-risk lung cancer is only minimally affected. Clearer spatial and temporal patterns are obtained, and the standard errors of the estimates are reduced, leading to more reliable estimates.

Journal of Data Science, v.6, no.1, p.125-134

Underlying and Multiple Causes of Death in Preterm Infants

by Panagiota Kitsantas

A limited number of studies have utilized multiple causes of death to investigate infant mortality patterns. The purpose of the present study was to examine the risk distribution of underlying and multiple causes of infant death for congenital anomalies, short gestation/low birth weight (LBW), respiratory conditions, infections, sudden infant death syndrome and external causes across four gestational age groups, namely $\leq 23$, $24$-$30$, $31$-$36$ and $\geq 37$ weeks, and to determine the extent to which mortality from each condition is underestimated when only the underlying cause of death is used. The data were obtained from the North Carolina linked birth/infant death files (1999 to 2003) and included 4908 death records. The findings of this study indicate that infants born at less than 30 weeks of gestation are more likely (odds ratios ranging from 1.99 to 6.03) to have multiple causes recorded when the underlying cause is congenital anomalies, respiratory conditions or infections, in comparison to infants whose gestational age is at least 37 weeks. The underlying cause of death underestimated mortality for a number of cause-specific deaths, including short gestation/LBW, respiratory conditions, infections and external causes. This was particularly evident among infants born preterm. Based on these findings, it is recommended that multiple causes, whenever available, should be studied in conjunction with the underlying cause of death data.