Volume 4, Number 1, January 2006

  • On the Use of Geostatistical Cross-Association Method for Lithostratigraphical Correlation
  • A Dynamic Spatial Model for Chronic Wasting Disease in Colorado
  • Using Hybrid Clustering to Approximate Fastest Paths on Urban Networks
  • A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data
  • Using Conditional Copula to Estimate Value at Risk
  • Zero-Inflated Generalized Poisson Regression Model with an Application to Domestic Violence Data

Journal of Data Science, v.4, no.1, p.1-20

On the Use of Geostatistical Cross-Association Method for Lithostratigraphical Correlation

by Walid Abdolqader Saqqa and Mohammad Fraiwan Al-Saleh

The aim of this paper is to determine the effectiveness of cross association in detecting the similarity between correlated geological columnar sections. For this purpose, cross association is used to compare several geological columnar sections arbitrarily selected from different localities in central and northern Jordan. In most of the study cases, sections consisting of the same rock units (formations) are statistically classified as similar ($p$-value $\ll 0.05$), while sections consisting of different rock units (formations) are statistically classified as dissimilar ($p$-value $\gg 0.05$).
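
As a rough illustration of the matching step behind cross association, the sketch below slides one coded lithology sequence past another and counts agreeing positions at each offset. It also computes the probability that two randomly drawn units match, which is the usual baseline for judging whether an observed match count is unusually high. The function names, the single-letter rock codes, and the offset convention are assumptions made for this example; the abstract does not state which significance test produces the reported $p$-values, so none is implemented here.

```python
from collections import Counter


def cross_association(a, b, min_overlap=1):
    """Slide sequence b past sequence a and count matches at each position.

    Returns a list of (offset, overlap, matches), where offset is the index
    in a that b[0] is aligned with (negative when b starts before a).
    """
    results = []
    for offset in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        start_a = max(0, offset)
        start_b = max(0, -offset)
        overlap = min(len(a) - start_a, len(b) - start_b)
        matches = sum(a[start_a + i] == b[start_b + i] for i in range(overlap))
        results.append((offset, overlap, matches))
    return results


def chance_match_probability(a, b):
    """Probability that one unit drawn from a equals one unit drawn from b."""
    count_a, count_b = Counter(a), Counter(b)
    shared = sum(count_a[k] * count_b[k] for k in count_a)
    return shared / (len(a) * len(b))


# Hypothetical coded sections: S = sandstone, L = limestone, M = marl.
section_1 = list("SLMSL")
section_2 = list("SLMSL")
for offset, overlap, matches in cross_association(section_1, section_2):
    print(offset, overlap, matches)
print(chance_match_probability(section_1, section_2))
```

Comparing the observed matches at each offset with `overlap * chance_match_probability(...)` gives a first impression of which alignments exceed what chance alone would produce.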

Journal of Data Science, v.4, no.1, p.21-37

A Dynamic Spatial Model for Chronic Wasting Disease in Colorado

by Craig J. Johns and Christopher H. Mehl

A spatio-temporal statistical model for Chronic Wasting Disease is presented. The model has underpinnings from traditional epidemic models with differential equations and uses a Bayesian hierarchy to directly incorporate existing prevalence data. Spatial dynamics are modeled explicitly through a system of difference equations rather than through covariance. The posterior distribution gives evidence of a long term stable level of disease prevalence, and approximates the probability of the movement of the disease from one area to another. Predictions for the future of Chronic Wasting Disease in Colorado are given. The model is used to formulate efficient sampling schemes for future data collection.

Journal of Data Science, v.4, no.1, p.39-65

Using Hybrid Clustering to Approximate Fastest Paths on Urban Networks

by Anjali Awasthi, Yves Lechevallier, Michel Parent and Jean-Marie Proth

Estimating fastest paths on large networks is a crucial problem for dynamic route guidance systems. The present paper proposes a statistical approach for approximating fastest paths on urban networks. The traffic data used for the statistical analysis are generated using macroscopic traffic simulation software that we developed. The traffic data consist of the input flows, the arc states (the number of cars in each arc), and the paths joining the various origins and destinations of the network. To find the relationship between the input flows, arc states and the fastest paths of the network, we subject the traffic data to hybrid clustering. The hybrid clustering uses two methods, namely $k$-means and Ward's hierarchical agglomerative clustering. The strength of the relationship among the traffic variables is measured using canonical correlation analysis. The results of hybrid clustering are decision rules that provide fastest paths as a function of arc states and input flows. These decision rules are stored in a database for performing predictive route guidance. Whenever a driver arrives at the entry point of the network, the current arc states and input flows are matched against the database parameters. If agreement is found, then the database provides the fastest path to the driver using the corresponding decision rule. In case of disagreement, the database recommends that the driver take the shortest path as the fastest path to reach the destination.
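
The lookup step described at the end of this abstract can be pictured with a small sketch. It assumes each stored decision rule is a cluster centroid over arc states and input flows paired with a fastest path, and that "agreement" means the current observation lies within some Euclidean distance of a centroid. Both assumptions, along with the names `recommend_path`, `rules`, and `tolerance`, are hypothetical; the abstract does not specify how matching is performed.

```python
import math


def recommend_path(arc_states, input_flows, rules, shortest_path, tolerance):
    """Return the stored fastest path of the closest rule, or fall back.

    rules is a list of dicts with keys "centroid" (arc states followed by
    input flows) and "fastest_path". Matching by Euclidean distance and the
    tolerance threshold are assumptions made for this sketch.
    """
    query = list(arc_states) + list(input_flows)
    best_rule, best_dist = None, math.inf
    for rule in rules:
        dist = math.dist(query, rule["centroid"])
        if dist < best_dist:
            best_rule, best_dist = rule, dist
    if best_rule is not None and best_dist <= tolerance:
        return best_rule["fastest_path"]
    # No rule agrees with the current traffic state: use the shortest path.
    return shortest_path


# Hypothetical rule database for a network with two arcs and one input flow.
rules = [
    {"centroid": [10.0, 5.0, 100.0], "fastest_path": ["A", "B", "D"]},
    {"centroid": [40.0, 30.0, 300.0], "fastest_path": ["A", "C", "D"]},
]
print(recommend_path([11.0, 5.0], [101.0], rules, ["A", "D"], tolerance=5.0))
print(recommend_path([90.0, 90.0], [900.0], rules, ["A", "D"], tolerance=5.0))
```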

Journal of Data Science, v.4, no.1, p.67-91

A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data

by Elaine L. Zanutto

We extend propensity score methodology to incorporate survey weights from complex survey data and compare the use of multiple linear regression and propensity score analysis to estimate treatment effects in observational data from a complex survey. For illustration, we use these two methods to estimate the effect of gender on information technology (IT) salaries. In our analysis, both methods agree on the size and statistical significance of the overall gender salary gaps in the United States in four different IT occupations after controlling for educational and job-related covariates. Each method, however, has its own advantages which are discussed. We also show that it is important to incorporate the survey design in both linear regression and propensity score analysis. Ignoring the survey weights affects the estimates of population-level effects substantially in our analysis.

Journal of Data Science, v.4, no.1, p.93-115

Using Conditional Copula to Estimate Value at Risk

by Helder Parra Palaro and Luiz Koodi Hotta

Value at Risk (VaR) plays a central role in risk management. There are several approaches for the estimation of VaR, such as historical simulation, the variance-covariance (also known as analytical), and the Monte Carlo approaches. Whereas the first approach does not assume any distribution, the last two approaches require the joint distribution to be known, which in the analytical approach is frequently the normal distribution. Copula theory is a fundamental tool in modeling multivariate distributions. It allows the definition of the joint distribution through the marginal distributions and the dependence between the variables. Recently, copula theory has been extended to the conditional case, allowing the use of copulae to model dynamical structures. Time variation in the first and second conditional moments is widely discussed in the literature, so allowing for time variation in the conditional dependence seems natural. This work presents some concepts and properties of copula functions and an application of copula theory to the estimation of the VaR of a portfolio composed of the Nasdaq and S&P500 stock indices.
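
For readers unfamiliar with the decomposition mentioned in this abstract, the standard statement (Sklar's theorem) in the bivariate case is sketched below, followed by its conditional analogue. The symbols $H$, $F_1$, $F_2$, $C$ and the information set $\mathcal{F}_{t-1}$ are notation chosen for this illustration and are not taken from the abstract.

```latex
% Unconditional case: joint distribution H with continuous margins F_1, F_2
H(x_1, x_2) = C\bigl(F_1(x_1),\, F_2(x_2)\bigr)

% Conditional case: margins and copula share the same conditioning set
H_t(x_1, x_2 \mid \mathcal{F}_{t-1})
  = C_t\bigl(F_{1,t}(x_1 \mid \mathcal{F}_{t-1}),\,
             F_{2,t}(x_2 \mid \mathcal{F}_{t-1}) \bigm| \mathcal{F}_{t-1}\bigr)
```

In the conditional version, both margins and the copula must be conditioned on the same information set for the right-hand side to define a valid joint distribution.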

Journal of Data Science, v.4, no.1, p.117-130

Zero-Inflated Generalized Poisson Regression Model with an Application to Domestic Violence Data

by Felix Famoye and Karan P. Singh

The generalized Poisson regression model has been used to model dispersed count data. It is a good competitor to the negative binomial regression model when the count data is over-dispersed. Zero-inflated Poisson and zero-inflated negative binomial regression models have been proposed for situations where the data generating process results in too many zeros. In this paper, we propose a zero-inflated generalized Poisson (ZIGP) regression model to model domestic violence data with too many zeros. Estimation of the model parameters using the method of maximum likelihood is provided. A score test is presented to test whether the number of zeros is too large for the generalized Poisson model to adequately fit the domestic violence data.
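
To make the zero-inflation idea concrete, the sketch below evaluates a ZIGP probability mass function under one common parameterization of the generalized Poisson distribution, with mean `mu` and dispersion `alpha`, mixed with a point mass at zero of weight `phi`. The abstract does not spell out the parameterization, so the formula and the names `gp_log_pmf`, `zigp_pmf`, `mu`, `alpha`, and `phi` are assumptions. The sketch is restricted to `alpha >= 0`, and it omits the regression structure linking the parameters to covariates.

```python
import math


def gp_log_pmf(y, mu, alpha):
    """Log pmf of a generalized Poisson distribution (assumed form).

    f(y) = (mu / (1 + alpha*mu))**y * (1 + alpha*y)**(y - 1) / y!
           * exp(-mu * (1 + alpha*y) / (1 + alpha*mu)),  for alpha >= 0.
    Setting alpha = 0 recovers the ordinary Poisson distribution.
    """
    theta = mu / (1.0 + alpha * mu)
    return (
        y * math.log(theta)
        + (y - 1) * math.log(1.0 + alpha * y)
        - math.lgamma(y + 1)
        - theta * (1.0 + alpha * y)
    )


def zigp_pmf(y, mu, alpha, phi):
    """Zero-inflated mixture: extra mass phi at zero, (1 - phi) on the GP."""
    base = (1.0 - phi) * math.exp(gp_log_pmf(y, mu, alpha))
    return phi + base if y == 0 else base


# Illustrative parameter values only, not estimates from the paper.
for y in range(5):
    print(y, zigp_pmf(y, mu=2.0, alpha=0.1, phi=0.3))
```

With `phi = 0` the mixture reduces to the generalized Poisson model. That is the null hypothesis a test for excess zeros would examine.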