Volume 2, Number 4, October 2004

• Air Pollution Mix and Emergency Room Visits for Respiratory and Cardiac Diseases in Taipei
• Estimating Optimal Transformations for Multiple Regression Using the ACE Algorithm
• Evaluation of Missing Value Estimation for Microarray Data
• Korean Economic Condition Indicator Using a Neural Network Trained on the 1997 Crisis
• Relationship between Clinic and Ambulatory Blood Pressure Measurements in Children
• Bayes Factors for Comparing Two Restricted Means: An Example Involving Hypertense Individuals

Journal of Data Science, v.2, no.4, p.311-327

Air Pollution Mix and Emergency Room Visits for Respiratory and Cardiac Diseases in Taipei

by Jing-Shiang Hwang, Tsuey-Hwa Hu and Chang-Chuan Chan

To clarify the contribution of ambient air pollutants to acute health effects, we examined the association between daily air pollution levels and emergency room (ER) visits for respiratory and cardiac diseases in Taipei City, Taiwan from January 1997 to December 1998. Average daily concentrations of particulate matter less than 2.5 $\mu m$ in aerodynamic diameter (PM$_{2.5}$), PM$_{10}$, carbon monoxide, sulfur dioxide, nitrogen dioxide and ozone were obtained from ambient air quality monitoring stations. The daily counts of ER visits stratified by diagnosis and age were modeled by both single-pollutant and multi-pollutant Poisson regression models with adjustment for confounding factors to evaluate the effects of individual pollutants. A mixture model was constructed by adding ratios of the pollutants to the multi-pollutant model to examine the effect of the air pollution mixture on ER visits. The single-pollutant models showed that an interquartile range change of PM$_{2.5}$ (16 $\mu g/m^{3}$) was associated with increased ER visits for respiratory disease in all age groups, with relative risks of 1.04 to 1.06, and with increased ER visits for cardiac disease in adult and elderly age groups, with a relative risk of 1.05. Gaseous pollutants, mainly NO$_{2}$ and CO, were also associated with increased visits by children for respiratory disease and visits by adults and elderly individuals for cardiac disease. Examination of the joint effect of mixes of PM$_{2.5}$ and gaseous pollutants showed that an interquartile range increase was associated with increases in ER visits by children for respiratory disease and by adults for cardiac disease, with a relative risk of 1.09.
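
As a reading aid for the relative risks quoted above: in a log-linear Poisson regression, the relative risk for a pollutant increase of size $\Delta$ is $\exp(\beta \Delta)$, where $\beta$ is the pollutant's coefficient. The sketch below only illustrates that conversion. The function names are hypothetical, and the coefficient is back-calculated from the reported relative risk of 1.05 rather than taken from the paper.

```python
import math

def relative_risk(beta: float, delta: float) -> float:
    """Relative risk implied by a log-linear model for a pollutant change of `delta`."""
    return math.exp(beta * delta)

def beta_from_relative_risk(rr: float, delta: float) -> float:
    """Invert the relation above: the coefficient that yields `rr` for a change of `delta`."""
    return math.log(rr) / delta

# Interquartile range of PM2.5 quoted in the abstract (micrograms per cubic metre).
IQR_PM25 = 16.0

# Assumption: this coefficient is back-calculated from the reported RR of 1.05;
# it is not a coefficient published by the authors.
beta_pm25 = beta_from_relative_risk(1.05, IQR_PM25)

rr_iqr = relative_risk(beta_pm25, IQR_PM25)          # recovers 1.05
rr_per_10 = relative_risk(beta_pm25, 10.0)           # same model, 10-unit increase
```

Under this model the relative risk scales multiplicatively, so a smaller increase (for example 10 $\mu g/m^{3}$) gives a relative risk closer to 1 than the interquartile range increase does.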

Journal of Data Science, v.2, no.4, p.329-346

Estimating Optimal Transformations for Multiple Regression Using the ACE Algorithm

by Duolao Wang and Michael Murphy

This paper introduces the alternating conditional expectation (ACE) algorithm of Breiman and Friedman (1985) for estimating the transformations of a response and a set of predictor variables in multiple regression that produce the maximum linear effect between the (transformed) independent variables and the (transformed) response variable. These transformations can give the data analyst insight into the relationships between these variables so that the relationship between them can be best described and non-linear relationships can be uncovered. The power and usefulness of ACE-guided transformation in multivariate analysis are illustrated using a simulated data set as well as a real data set. The results from these examples clearly demonstrate that ACE is able to identify the correct functional forms, to reveal more accurate relationships, and to improve the model fit considerably compared to the conventional linear model.
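
The alternating scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a crude running-mean smoother as the conditional-expectation estimate (Breiman and Friedman used a more refined smoother), a fixed number of iterations instead of a convergence test, and simulated data of our own choosing. The names `running_mean_smooth` and `ace` are hypothetical.

```python
import numpy as np

def running_mean_smooth(x, z, window=31):
    """Estimate E[z | x] by averaging z over neighbours in the rank order of x."""
    order = np.argsort(x)
    zs = z[order]
    n = len(zs)
    half = window // 2
    csum = np.concatenate(([0.0], np.cumsum(zs)))
    lo = np.clip(np.arange(n) - half, 0, n)       # window start (truncated at the edges)
    hi = np.clip(np.arange(n) + half + 1, 0, n)   # window end, exclusive
    smoothed = (csum[hi] - csum[lo]) / (hi - lo)
    out = np.empty(n)
    out[order] = smoothed                         # restore the original ordering
    return out

def ace(X, y, n_iter=20, window=31):
    """Alternate between updating predictor transforms phi_j and the response transform theta."""
    n, p = X.shape
    theta = (y - y.mean()) / y.std()              # start from the standardised response
    phi = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: theta minus all predictor transforms except the j-th.
            partial = theta - phi.sum(axis=1) + phi[:, j]
            phi[:, j] = running_mean_smooth(X[:, j], partial, window)
            phi[:, j] -= phi[:, j].mean()
        theta = running_mean_smooth(y, phi.sum(axis=1), window)
        theta = (theta - theta.mean()) / theta.std()   # keep theta at mean 0, variance 1
    return theta, phi

# Simulated example (an assumption for illustration): log(y) is linear in x.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=400)
y = np.exp(3.0 * x + rng.normal(0.0, 0.2, size=400))

theta, phi = ace(x[:, None], y)
raw_corr = np.corrcoef(x, y)[0, 1]
ace_corr = np.corrcoef(theta, phi.sum(axis=1))[0, 1]
```

In this simulated case the estimated response transform should track $\log y$, and the correlation between the transformed variables (`ace_corr`) should exceed the correlation between the raw variables (`raw_corr`).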

Journal of Data Science, v.2, no.4, p.347-370

Evaluation of Missing Value Estimation for Microarray Data

by Danh V. Nguyen, Naisyin Wang and Raymond J. Carroll

Microarray gene expression data contains missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete expression data matrix. Current methods for estimating the MVs include sample mean and K-nearest neighbors (KNN). Whether the accuracy of estimation (imputation) methods depends on the actual gene expression has not been thoroughly investigated. Under this setting, we examine how the accuracy depends on the actual expression level and propose new methods that provide improvements in accuracy relative to the current methods in certain ranges of gene expression. In particular, we propose regression methods, namely multiple imputation via ordinary least squares (OLS) and missing value prediction using partial least squares (PLS). Mean estimation of MVs ignores the observed correlation structure of the genes and is highly inaccurate. Estimating MVs using KNN, a method which incorporates pairwise gene expression information, provides substantial improvement in accuracy on average. However, the accuracy of KNN across the wide range of observed gene expression is unlikely to be uniform and this is revealed by evaluating accuracy as a function of the expression level.
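
To make the two baseline methods concrete, the sketch below contrasts row-mean imputation with a simple KNN imputation on a small synthetic matrix (rows as genes, columns as samples). It is an illustration under stated assumptions, not the procedure evaluated in the paper: the distance is a root-mean-square difference over jointly observed samples, the neighbours are averaged without weights, and the function names and the synthetic data are hypothetical.

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry with the mean of the observed values in its row (gene)."""
    out = X.copy()
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    out[rows, cols] = row_means[rows]
    return out

def knn_impute(X, k=5):
    """Replace each missing entry with the average over the k most similar genes."""
    out = X.copy()
    n_genes = X.shape[0]
    for i, j in zip(*np.where(np.isnan(X))):
        scored = []
        for g in range(n_genes):
            if g == i or np.isnan(X[g, j]):
                continue                       # a neighbour must have sample j observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[g])
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[g, shared]) ** 2))
            scored.append((dist, g))
        scored.sort()
        neighbours = [g for _, g in scored[:k]]
        # Fall back to the row mean if no usable neighbour exists.
        out[i, j] = X[neighbours, j].mean() if neighbours else np.nanmean(X[i])
    return out

# Synthetic data (an assumption): two groups of six genes with opposite expression patterns.
rng = np.random.default_rng(1)
pattern = np.linspace(-2.0, 2.0, 10)
signs = np.repeat([1.0, -1.0], 6)
complete = signs[:, None] * pattern + rng.normal(0.0, 0.05, size=(12, 10))

with_gap = complete.copy()
with_gap[0, 3] = np.nan                        # hide one known value

knn_error = abs(knn_impute(with_gap, k=3)[0, 3] - complete[0, 3])
mean_error = abs(mean_impute(with_gap)[0, 3] - complete[0, 3])
```

Because the hidden value lies away from the gene's average expression, the row mean misses it, while the neighbouring genes that share the same pattern recover it closely. This is the sense in which mean imputation ignores the correlation structure of the genes.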

Journal of Data Science, v.2, no.4, p.371-381

Korean Economic Condition Indicator Using a Neural Network Trained on the 1997 Crisis

by Tae Yoon Kim, Changha Hwang and Jongkyu Lee

The main aim of this article is to develop an efficient indicator for Korean economic conditions based on its disastrous 1997 economic crisis experience. For this purpose, an artificial neural network, a well-known tool for pattern recognition, is employed. The dynamic movements of the 1997 stock price index are divided into three patterns or intervals according to a "volatility" level and then presented to the neural network as a training set. It turns out that the crisis-trained neural network has a surprisingly high degree of accuracy in judging the given economic condition, which strongly suggests that the post-crisis Korean economy has been profoundly influenced by the 1997 crisis. This result might also be useful to other countries trying to build an early crisis warning indicator.

Journal of Data Science, v.2, no.4, p.383-397

Relationship between Clinic and Ambulatory Blood Pressure Measurements in Children

by Dejian Lai, Tim S. Poffenbarger, Kathy D. Franco, Ronald Portman and Jonathan M. Sorof

Decision making on the diagnosis of hypertension is important to clinicians, patients and the general public. We analyzed the agreement between clinic blood pressure (BP) measurements (individual or in combination) and ambulatory wake BP in the diagnosis of hypertension in children. In this study, three sequential clinic BP measurements were performed at the initiation of the 24-hour ambulatory BP monitoring (ABPM) using the identical monitor for both clinic and ambulatory measurements. Ninety patients were reviewed. The Pearson correlation coefficient between clinic BP (individual or in combination) and wake ambulatory BP ranged from 0.81 to 0.85 for SBP and 0.52 to 0.60 for DBP. Multiple regression models showed no improvement using the mean of multiple versus single clinic BP measurements. We also tried a principal component method that formed an optimal combination of the clinic measurements. The first principal component accounted for about 95\% of the total variation, but there was little improvement of the regression model between the wake ambulatory BP and the first principal component of the three repeated clinic measurements. Our results suggest that assessment for hypertension in children by clinic BP alone is often unreliable and is not improved by multiple BP measurements on a single occasion.
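
The principal component step can be illustrated on synthetic data. The sketch below assumes three repeated readings that share a common underlying level plus independent measurement noise. The numbers are invented for illustration and are not the study's data, and all variable names are hypothetical. It shows why the first principal component of highly correlated repeated measurements carries most of the variation yet adds little beyond their simple mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 90

# Assumption: each child has an underlying level, and each reading adds independent noise.
underlying = rng.normal(115.0, 12.0, size=n_patients)
clinic = np.column_stack(
    [underlying + rng.normal(0.0, 3.0, size=n_patients) for _ in range(3)]
)

# Principal components from the eigen-decomposition of the sample covariance matrix.
centered = clinic - clinic.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order

explained = eigvals[::-1] / eigvals.sum()       # proportion of variance, largest first
pc1_scores = centered @ eigvecs[:, -1]          # scores on the first principal component

# Compare the first component with the plain mean of the three readings.
mean_reading = clinic.mean(axis=1)
pc1_vs_mean = abs(np.corrcoef(pc1_scores, mean_reading)[0, 1])
```

In this setting the first component is close to an equally weighted average of the three readings, so a regression on it behaves almost identically to a regression on their mean.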

Journal of Data Science, v.2, no.4, p.399-418

Bayes Factors for Comparing Two Restricted Means: An Example Involving Hypertense Individuals

by Viviana Giampaoli and Julio M. Singer

We are interested in comparing the mean diastolic blood pressure of individuals submitted to a stress stimulus to that of individuals under normal conditions with the prior knowledge that the subjects in both groups are hypertense. Essentially, this may be formulated as a two-sample problem for Gaussian populations with bounded means. For such purposes, we consider two different approaches to obtain Bayes factors. The first is based on predictive distributions and the second is based on Markov Chain Monte Carlo methods. The sensitivity of the Bayes factors with respect to the choice of prior distributions is also investigated.
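
For reference, the general definition of a Bayes factor is given below. The notation is ours and is not taken from the paper: $H_0$ and $H_1$ denote the two competing hypotheses, $x$ the observed data, $f(x \mid \theta)$ the likelihood, and $\pi_i$ the prior under $H_i$, whose support $\Theta_i$ would here respect the bounds on the means.

$$
B_{10} = \frac{m_1(x)}{m_0(x)}, \qquad m_i(x) = \int_{\Theta_i} f(x \mid \theta)\, \pi_i(\theta)\, d\theta, \quad i = 0, 1.
$$

Each $m_i(x)$ is the marginal (prior predictive) density of the data under $H_i$, so values of $B_{10}$ above 1 favor $H_1$ and values below 1 favor $H_0$. Because the priors enter through these integrals, the Bayes factor can be sensitive to how they are chosen.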