Volume 7, Number 4, October 2009

  • Erhard Reschenhofer
    Super-Whiteness of Returns Spectra
  • Vicente G. Cancho, Edwin M. M. Ortega and Heleno Bolfarine
    The Log-exponentiated-Weibull Regression Models with Cure Rate: Local Influence and Residual Analysis
  • Rand R. Wilcox
    Two-by-two ANOVA: Global and Graphical Comparisons Based on an Extension of the Shift Function
  • Liang Zhu, Jianguo Sun and Phillip Wood
    Methods for the Analysis of Alcohol and Drug Uses for Young Adults
  • Cheng K. Lee and Jenq-Daw Lee
    A Fractional Survival Model
  • Haydar Demirhan and Canan Hamurkaroglu
    An Application of Bayesian Model Averaging Approach to Traffic Accidents Data Over Hierarchical Log-Linear Models
  • Weichung Joe Shih and Junfeng Liu
    A Simple Method for Screening Binary Models with Large Sample Size and Continuous Predictor Variables
  • Timothy E. O'Brien and Martin B. Berg
    Getting the Most from Data --- Maximizing Information and Power by Using Appropriate and Modern Statistical Methods

Journal of Data Science, v.7, no.4, p.423-431

Super-Whiteness of Returns Spectra

by Erhard Reschenhofer

Until the late 1970s, the spectral densities of stock returns and stock index returns exhibited a type of non-constancy that could be detected by standard tests for white noise. Since then these tests have been unable to find any substantial deviations from whiteness. But that does not mean that today's returns spectra contain no useful information. Using several sophisticated frequency domain tests to look for specific patterns in the periodograms of returns series, we find nothing or, more precisely, less than nothing. Actually, there is a striking power deficiency, which implies that these series exhibit even fewer patterns than white noise. To unveil the source of this "super-whiteness" we design a simple frequency domain test for characterless, fuzzy alternatives, which are not immediately usable for the construction of profitable trading strategies, and apply it to the same data. Because the power deficiency is now much smaller, we conclude that our puzzling findings may be due to trading activities based on excessive data snooping.
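
The abstract does not reproduce the specific frequency domain tests used. Purely for orientation, the following minimal sketch (in Python; the function name and SciPy calls are illustrative assumptions, not the paper's) shows one standard whiteness check of the kind such analyses build on, Bartlett's test based on the cumulative periodogram:

    import numpy as np
    from scipy import signal, stats

    def cumulative_periodogram_test(returns):
        # Bartlett-type whiteness check: under white noise the normalized
        # cumulative periodogram should hug a straight line, so its maximum
        # deviation can be referred to a Kolmogorov-Smirnov bound.
        x = np.asarray(returns, dtype=float)
        x = x - x.mean()
        _, pxx = signal.periodogram(x)
        pxx = pxx[1:]                         # drop the zero frequency
        cum = np.cumsum(pxx) / pxx.sum()      # normalized cumulative periodogram
        m = len(cum)
        d = np.max(np.abs(cum - np.arange(1, m + 1) / m))
        p_value = stats.kstwobign.sf(d * np.sqrt(m))   # asymptotic p-value
        return d, p_value

    # e.g. d, p = cumulative_periodogram_test(np.diff(np.log(prices)))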

Journal of Data Science, v.7, no.4, p.433-458

The Log-exponentiated-Weibull Regression Models with Cure Rate: Local Influence and Residual Analysis

by Vicente G. Cancho, Edwin M. M. Ortega and Heleno Bolfarine

In this paper the log-exponentiated-Weibull regression model is modified to allow for the possibility that long-term survivors are present in the data. The modification leads to a log-exponentiated-Weibull regression model with cure rate, encompassing as special cases the log-exponential and log-Weibull regression models with cure rate typically used to model such data. The models attempt to estimate simultaneously the effects of covariates on the acceleration/deceleration of the timing of a given event and on the surviving fraction, that is, the proportion of the population for which the event never occurs. Assuming censored data, we consider both classical and Bayesian analyses of the parameters of the proposed model. The normal curvatures of local influence are derived under various perturbation schemes, and two deviance-type residuals are proposed to assess departures from the log-exponentiated-Weibull error assumption as well as to detect outlying observations. Finally, a data set from the medical field is analyzed.
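
The abstract does not write out the cure-rate structure. As background only, the standard mixture formulation (a sketch; the paper's exact parameterization may differ) expresses the population survival function as

    % Mixture cure-rate model (background sketch, not necessarily the
    % authors' parameterization): pi is the cured (surviving) fraction,
    % S_0 is the survival function of the susceptible sub-population,
    % here built from the log-exponentiated-Weibull regression structure.
    S_{\mathrm{pop}}(t) = \pi + (1 - \pi)\, S_0(t), \qquad
    \lim_{t \to \infty} S_{\mathrm{pop}}(t) = \pi,

so the surviving fraction pi is exactly the proportion of the population for which the event never occurs.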

Journal of Data Science, v.7, no.4, p.459-468

Two-by-two ANOVA: Global and Graphical Comparisons Based on an Extension of the Shift Function

by Rand R. Wilcox

When comparing two independent groups, the shift function compares all of the quantiles in a manner that controls the probability of at least one Type I error, assuming random sampling only. It also provides a much more detailed sense of how the groups compare than a single measure of location does, and the associated plot of the data can yield valuable insights. This note examines the small-sample properties of an extension of the shift function in which the goal is to compare the distributions of two specified linear sums of the random variables under study, with an emphasis on a two-by-two design. A very simple method controls the probability of a Type I error, and very little power is lost relative to comparing means when sampling is from normal distributions with equal variances.
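
The abstract does not detail the quantile-comparison machinery. As a rough illustration of the underlying shift-function idea only (not the paper's extension to linear sums or its Type I error control; names are illustrative), one can compare estimated deciles of two independent groups:

    import numpy as np

    def decile_shift(x, y):
        # Basic shift-function idea: estimate matching deciles of two
        # independent groups and inspect their differences.  The paper's
        # method adds simultaneous Type I error control and an extension
        # to linear sums of random variables; this sketch gives only the
        # point estimates.
        qs = np.arange(0.1, 1.0, 0.1)
        return qs, np.quantile(y, qs) - np.quantile(x, qs)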

Journal of Data Science, v.7, no.4, p.469-485

Methods for the Analysis of Alcohol and Drug Uses for Young Adults

by Liang Zhu, Jianguo Sun and Phillip Wood

Alcohol and drug use is common in today's society, and it is well known that it can lead to serious consequences. Studies have been conducted, for example, to understand the short- or long-term temporal processes of alcohol and drug use. This paper discusses statistical modeling for the joint analysis of alcohol and drug use, and several models and the corresponding estimation approaches are presented. The methods are applied to a prospective study of alcohol and drug use among college freshmen, which motivated this investigation. The results suggest that female subjects experience far fewer consequences of alcohol and drug use than male subjects, and that these consequences decrease with age.

Journal of Data Science, v.7, no.4, p.487-495

A Fractional Survival Model

by Cheng K. Lee and Jenq-Daw Lee

A survival model is derived from the exponential function using the concept of fractional differentiation. The hazard function of the proposed model generates curves of various shapes, including increasing, increasing-constant-increasing, increasing-decreasing-increasing, and the so-called bathtub hazard curve. The model also contains a parameter representing the maximum survival time.

Journal of Data Science, v.7, no.4, p.497-511

An Application of Bayesian Model Averaging Approach to Traffic Accidents Data Over Hierarchical Log-Linear Models

by Haydar Demirhan and Canan Hamurkaroglu

In this article, a Bayesian model averaging approach for hierarchical log-linear models is considered. Posterior model probabilities are calculated approximately for hierarchical log-linear models, and the dimension of the model space of interest is reduced using the Occam's window and Occam's razor approaches. Road traffic accident data from Turkey for 2002 are analyzed using this approach.
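
The abstract only names the pruning devices. As a rough sketch of the Occam's window idea (the threshold, names, and the usual Madigan-Raftery formulation below are assumptions, not details taken from the paper), models whose posterior probability falls too far below the best model's are discarded before averaging:

    def occams_window(post_probs, c=20.0):
        # Keep only models whose posterior model probability is within a
        # factor c of the best model's, then renormalize over the survivors.
        # post_probs: dict mapping a model identifier to its (approximate)
        # posterior model probability.
        best = max(post_probs.values())
        kept = {m: p for m, p in post_probs.items() if p > 0 and best / p <= c}
        total = sum(kept.values())
        return {m: p / total for m, p in kept.items()}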

Journal of Data Science, v.7, no.4, p.513-536

A Simple Method for Screening Binary Models with Large Sample Size and Continuous Predictor Variables

by Weichung Joe Shih and Junfeng Liu

For a binary regression model with observed responses (Y), specified predictor vectors (X), an assumed model parameter vector (beta), and case probability function Pr(Y = 1 | X, beta), we propose a simple screening method for testing goodness-of-fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold tau in [0, 1], we classify each subject with predictor X into Y* = 1 or Y* = 0 (a deterministic binary variable, distinct from the observed random binary variable Y) according to whether the case probability Pr(Y = 1 | X, beta) calculated under the hypothesized true model is greater than or equal to, or less than, tau. For each tau, we compare the expected marginal classification error rate (false positives [Y* = 1, Y = 0] or false negatives [Y* = 0, Y = 1]) under the hypothesized true model with the observed marginal error rate induced by this classification rule. The screening profile is created by plotting the tau-specific marginal error rates (expected and observed) against tau in [0, 1]. Inconsistency indicates lack of fit, while consistency indicates good model fit. We note that the variation of the difference between the expected and observed marginal classification error rates is of order O(n^{-1/2}) and free of tau. This small, homogeneous variation at each tau allows flexible model discrepancies to be detected with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate calibration), is useful for detecting incorrect parameter values, incorrect subsets of predictor vector components, and link function misspecification. We also provide theoretical results as well as numerical examples showing that the ROC (receiver operating characteristic) curve is not suitable for testing binary model goodness-of-fit.
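
The profile described above translates almost directly into code. The following minimal sketch (Python; the logistic link, the function name, and the restriction to false positives are illustrative assumptions) computes the expected and observed tau-specific marginal error rates whose agreement, or lack of it, drives the CERC diagnostic:

    import numpy as np

    def cerc_profile(X, y, beta, taus=None):
        # Classification-error-rate calibration (CERC) profile: for each
        # threshold tau, compare the observed marginal false-positive rate
        # with its expectation under the hypothesized model
        # Pr(Y = 1 | X, beta) = logistic(X @ beta).
        if taus is None:
            taus = np.linspace(0.05, 0.95, 19)
        p = 1.0 / (1.0 + np.exp(-np.asarray(X) @ np.asarray(beta)))
        y = np.asarray(y)
        expected_fp, observed_fp = [], []
        for tau in taus:
            y_star = (p >= tau).astype(int)           # deterministic classification
            # expected rate of [Y* = 1, Y = 0] under the hypothesized model
            expected_fp.append(np.mean((p >= tau) * (1.0 - p)))
            # observed rate of [Y* = 1, Y = 0] from the actual responses
            observed_fp.append(np.mean((y_star == 1) & (y == 0)))
        return np.asarray(taus), np.array(expected_fp), np.array(observed_fp)

    # Plotting expected_fp and observed_fp against tau gives the screening
    # profile; systematic gaps between the two curves indicate lack of fit.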

Journal of Data Science, v.7, no.4, p.537-550

Getting the Most from Data --- Maximizing Information and Power by Using Appropriate and Modern Statistical Methods

by Timothy E. O'Brien and Martin B. Berg

Through a series of carefully chosen illustrations from biometry and biomedicine, this note underscores the importance of using appropriate analytical techniques to increase power in statistical modeling and testing. These examples also serve to highlight some of the important recent developments in applied statistics of use to practitioners.