Volume 2, Number 1, January 2004

  • Unexpected Features of Financial Time Series: Higher-order Anomalies and
    Predictability
  • The Poisson Inverse Gaussian Regression Model in the Analysis of Clustered
    Counts Data
  • Markov Chain Monte Carlo Methods for Inference in Frailty Models with
    Doubly-censored Data
  • The Environment of the Bowdoin College Museum of Art
  • A Two-stage Bayesian Model for Predicting Winners in Major League
    Baseball
  • Interpretation of Epidemiological Data Using Multiple Correspondence Analysis
    and Log-linear Models
  • SEER: A Graphical Tool for Multidimensional and Categorical Data

Journal of Data Science, v.2, no.1, p.1-15

Unexpected Features of Financial Time Series: Higher-order Anomalies and Predictability

by Erhard Reschenhofer

Examining the daily Dow Jones Industrial Average (DJI), we find evidence both of higher-order anomalies and predictability. While most researchers are only aware of the relatively harmless anomalies that occur just in the mean, the first part of this article provides empirical evidence of more dangerous kinds of anomalies occurring in higher-order moments. This evidence casts some doubt on the common practice of fitting standard time series models (e.g., ARMA models, GARCH models, or stochastic volatility models) to financial time series and carrying out tests based upon autocorrelation coefficients without making proper provision for these anomalies. The second part of this article provides evidence in favor of the predictability of the returns on the DJI and, more interestingly, against the efficient market hypothesis. The special value of this evidence is due to the simplicity of the involved methods.
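As a rough illustration of the autocorrelation coefficients the abstract mentions, the sketch below computes the lag-1 sample autocorrelation of log returns and of squared log returns (a simple second-moment check). It is not the article's procedure: the price series is made up for illustration, and the helper names `log_returns` and `autocorr` are assumptions.

```python
import math

def log_returns(prices):
    """Log returns r_t = log(p_t / p_{t-1}) from a list of prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Made-up closing prices, for illustration only (not DJI data).
prices = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0, 103.0]
r = log_returns(prices)

rho1 = autocorr(r, 1)                        # dependence in the mean
rho1_sq = autocorr([v * v for v in r], 1)    # dependence in squared returns
print(rho1, rho1_sq)
```

Tests built on such coefficients assume well-behaved moments, which is the assumption the abstract calls into question.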

Journal of Data Science, v.2, no.1, p.17-32

The Poisson Inverse Gaussian Regression Model in the Analysis of Clustered Counts Data

by M. M. Shoukri, M. H. Asyali, R. VanDorp and D. Kelton

We explore the possibility of modeling clustered count data using the Poisson Inverse Gaussian distribution. To illustrate the methodology, we develop a regression model that relates the number of mastitis cases in a sample of dairy farms in Ontario, Canada, to various farm-level covariates. Residual plots are constructed to explore the quality of the fit. We compare the results with those of a negative binomial regression model fitted by maximum likelihood and with those of the generalized linear mixed regression model fitted in SAS.
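For readers unfamiliar with the distribution, a Poisson Inverse Gaussian count can be viewed as a Poisson count whose rate is itself drawn from an inverse Gaussian distribution. The sketch below simulates that mixture and shows the resulting overdispersion (variance larger than the mean). It is only an illustration under assumed parameter values (`mu = 2.0`, `shape = 4.0`), with hypothetical helper names; it is not the regression model of the article.

```python
import math
import random

def inverse_gaussian(rng, mu, shape):
    """Draw from an inverse Gaussian(mu, shape) via the
    Michael-Schucany-Haas transformation method."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mu + mu * mu * y / (2.0 * shape)
         - mu / (2.0 * shape) * math.sqrt(4.0 * mu * shape * y + mu * mu * y * y))
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def poisson(rng, rate):
    """Draw a Poisson(rate) count with Knuth's multiplication method
    (adequate for the small rates used here)."""
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

rng = random.Random(0)       # fixed seed so the run is reproducible
mu, shape = 2.0, 4.0         # assumed values, for illustration only
counts = [poisson(rng, inverse_gaussian(rng, mu, shape)) for _ in range(20000)]

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print(mean, var)             # theory: mean = mu, variance = mu + mu**3 / shape
```

A plain Poisson model forces the variance to equal the mean, so the extra term `mu**3 / shape` is what lets the mixture accommodate clustered counts.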

Journal of Data Science, v.2, no.1, p.33-47

Markov Chain Monte Carlo Methods for Inference in Frailty Models with Doubly-censored Data

by Geoffrey Jones

Frailty models have become popular in survival analysis for dealing with situations where groups of observations are correlated. If the data comprise only exact or right-censored failure times, inference can be done either by integrating out the frailties directly or by using the EM algorithm. If there is both left- and right-censoring, this is no longer the case. However, the MCMC method of Clayton (1991, Biometrics 47, 467-485) can be easily extended by imputation of the left-censored times. Several schemes for doing this are suggested and compared. Application of the methods is illustrated using data on the joint failures of patients with fibrodysplasia ossificans progressiva.

Journal of Data Science, v.2, no.1, p.49-60

The Environment of the Bowdoin College Museum of Art

by Rosemary A. Roberts

Conservation of artifacts is a major concern of museum curators. Light, humidity, and air pollution are responsible for the deterioration of many artifacts and materials. We present here an exploratory analysis of humidity and temperature data that were collected to document the environment of the Bowdoin College Museum of Art, located in the Walker Art Building at Bowdoin College. As a result of this study, funds are being sought to install a climate control system.

Journal of Data Science, v.2, no.1, p.61-73

A Two-stage Bayesian Model for Predicting Winners in Major League Baseball

by Tae Young Yang and Tim Swartz

The probability of winning a game in major league baseball depends on various factors relating to team strength including the past performance of the two teams, the batting ability of the two teams and the starting pitchers. These three factors change over time. We combine these factors by adopting contribution parameters, and include a home field advantage variable in forming a two-stage Bayesian model. A Markov chain Monte Carlo algorithm is used to carry out Bayesian inference and to simulate outcomes of future games. We apply the approach to data obtained from the 2001 regular season in major league baseball.

Journal of Data Science, v.2, no.1, p.75-86

Interpretation of Epidemiological Data Using Multiple Correspondence Analysis and Log-linear Models

by Demosthenes B. Panagiotakos and Christos Pitsavos

In this work we present a combined approach to the analysis of contingency tables using correspondence analysis and log-linear models. Several investigators have previously recognized relations between these two methodologies. By combining them we may obtain a better understanding of the structure of the data and a more favorable interpretation of the results. As an application, we apply both methodologies to an epidemiological database (CARDIO2000) regarding coronary heart disease risk factors.

Journal of Data Science, v.2, no.1, p.87-105

SEER: A Graphical Tool for Multidimensional and Categorical Data

by Chris Chiu and Ronald Fecso

This paper introduces a visualization technique, SEER, developed for policy makers and researchers to graphically analyze and explore massive amounts of categorical data collected in longitudinal surveys. This technique (a) produces panels of graphs for multiple group analysis, where the groups do not have to be mutually exclusive, (b) profiles change patterns observed in longitudinal data, and (c) clusters data into groups to enable policy makers or researchers to observe the factors associated with the changing patterns. The paper also gives the hash function of the SEER method in matrix notation so that it can be implemented across computer packages. The SEER technique is illustrated using a national survey, the Survey of Doctorate Recipients (SDR), administered by the National Science Foundation (NSF). Occupational changes and career paths for a panel sample of 14,901 doctorate recipients are profiled and discussed. Results indicated that doctorate recipients in some science and engineering fields are roughly twice as likely to work in an occupation when it matches the discipline in which they received their doctorates.