Volume 4, Number 2, April 2006

  • A Bayesian Approach to the Multiple Comparisons Problem
  • A Growing Self-Organizing Neural Network for Lifestyle Segmentation
  • A Robust Approach to the Interest Rate Term Structure Estimation
  • Estimating Vaccine Efficacy from Household Data Using Surrogate Outcome and a Validation Sample
  • Training Students and Researchers in Bayesian Methods
  • Statistical Functional Modeling of Quality Changes of Garlic under Different Storage Regimes
  • Bifactorial Design Applied to Recombinant Protein Expression

Journal of Data Science, v.4, no.2, p.131-146

A Bayesian Approach to the Multiple Comparisons Problem

by Andrew A. Neath and Joseph E. Cavanaugh

Consider the problem of selecting independent samples from several populations for the purpose of between-group comparisons. An important aspect of the solution is the determination of clusters where mean levels are equal, often accomplished using multiple comparisons testing. We formulate the hypothesis testing problem of determining equal-mean clusters as a model selection problem. Information from all competing models is combined through Bayesian methods in an effort to provide a more realistic accounting of uncertainty. An example illustrates how the Bayesian approach leads to a logically sound presentation of multiple comparison results.
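As a rough illustration of treating equal-mean clusters as competing models, the sketch below enumerates every grouping of three populations and turns BIC values into approximate posterior model probabilities. The BIC approximation, the equal prior weights, the common-variance normal model, and the sample values are all assumptions made for this example; the abstract does not specify how the authors compute their posterior probabilities.

```python
import math

def set_partitions(items):
    """Yield every way of splitting `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def bic(samples, partition):
    """BIC of a normal model with one mean per block and a common variance."""
    n = sum(len(values) for values in samples.values())
    rss = 0.0
    for block in partition:
        pooled = [x for name in block for x in samples[name]]
        mean = sum(pooled) / len(pooled)
        rss += sum((x - mean) ** 2 for x in pooled)
    k = len(partition) + 1  # block means plus the common variance
    return n * math.log(rss / n) + k * math.log(n)

def posterior_model_probabilities(samples):
    """Approximate P(model | data) as proportional to exp(-BIC / 2)."""
    models = list(set_partitions(sorted(samples)))
    bics = [bic(samples, model) for model in models]
    best = min(bics)
    weights = [math.exp(-(b - best) / 2) for b in bics]
    total = sum(weights)
    return [(model, w / total) for model, w in zip(models, weights)]

# Hypothetical data: populations A and B look alike, C is clearly higher.
samples = {
    "A": [5.0, 5.2, 4.9, 5.1],
    "B": [5.1, 4.8, 5.0, 5.2],
    "C": [8.0, 8.2, 7.9, 8.1],
}
for model, prob in posterior_model_probabilities(samples):
    print(model, round(prob, 3))
```

Reporting a probability for each grouping, rather than a set of pairwise accept/reject decisions, avoids results that are mutually inconsistent across pairs.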

Journal of Data Science, v.4, no.2, p.147-168

A Growing Self-Organizing Neural Network for Lifestyle Segmentation

by Reinhold Decker

Lifestyles can be used to explain existing consumer behavior and to anticipate future behavior, in both a geographical and a temporal context. Basing market segmentations on consumer lifestyles enables the development of purposeful advertising strategies and the design of new products meeting future demands. The present paper introduces a new growing self-organizing neural network which identifies lifestyles, or rather consumer types, in survey data largely autonomously. Before applying the algorithm to real marketing data, we demonstrate its general performance and adaptability by means of synthetic 2D data featuring distinct heterogeneity with respect to the arrangement of the individual data points.

Journal of Data Science, v.4, no.2, p.169-188

A Robust Approach to the Interest Rate Term Structure Estimation

by Min Li and Yan Yu

This paper estimates the interest rate term structures of Treasury and individual corporate bonds using a robust criterion. The Treasury term structure is estimated with Bayesian regression splines based on nonlinear least absolute deviation. The number and locations of the knots in the regression splines are adaptively chosen using the reversible jump Markov chain Monte Carlo method. Due to the small sample size, the individual corporate term structure is estimated by adding a positive parametric credit spread to the estimated Treasury term structure using a Bayesian approach. We present a case study of U.S. Treasury STRIPS (Separate Trading of Registered Interest and Principal of Securities) and AT\&T bonds from April 1994 to December 1996. Compared with several existing term structure estimation approaches, the proposed method is robust to outliers in our case study.

Journal of Data Science, v.4, no.2, p.189-205

Estimating Vaccine Efficacy from Household Data Using Surrogate Outcome and a Validation Sample

by Xiaohong M. Davis and Michael Haber

Household data are frequently used in estimating vaccine efficacy because they provide information about every individual's exposure to vaccinated and unvaccinated infected household members. This information is essential for reliable estimation of vaccine efficacy for infectiousness ($VE_I$), in addition to estimating vaccine efficacy for susceptibility ($VE_S$). However, accurate infection outcome data are not always available for each person due to high cost or lack of feasible methods to collect this information. Lack of reliable data on true infection status may result in biased or inefficient estimates of vaccine efficacy. In this paper, a semiparametric method that uses surrogate outcome data and a validation sample is introduced for estimation of $VE_S$ and $VE_I$ from a sample of households. The surrogate outcome data are usually based on illness symptoms. We report the results of simulations conducted to examine the performance of the estimates, compare the proposed semiparametric method with maximum likelihood methods that use either the validation data only or the surrogate data only, and address study design issues. The new method shows improved precision as compared to a method based on the validation sample only and smaller bias as compared to a method using surrogate outcome data only. In addition, the use of household data is shown to greatly improve the attenuation in the estimate of $VE_S$ due to misclassification of the outcome, as compared to the use of a random sample of unrelated individuals.

Journal of Data Science, v.4, no.2, p.207-232

Training Students and Researchers in Bayesian Methods

by Bruno Lecoutre

Frequentist Null Hypothesis Significance Testing (NHST) is such an integral part of scientists' behavior that its uses cannot be discontinued by flinging it out of the window. Faced with this situation, the suggested strategy for training students and researchers in statistical inference methods for experimental data analysis involves a smooth transition towards the Bayesian paradigm. Its general outlines are as follows. (1)~To present natural Bayesian interpretations of NHST outcomes to draw attention to their shortcomings. (2)~To create as a result of this the need for a change of emphasis in the presentation and interpretation of results. (3)~Finally to equip users with a real possibility of thinking sensibly about statistical inference problems and behaving in a more reasonable manner. The conclusion is that teaching the Bayesian approach in the context of experimental data analysis appears both {\it desirable} and {\it feasible}. This feasibility is illustrated for analysis of variance methods.

Journal of Data Science, v.4, no.2, p.233-246

Statistical Functional Modeling of Quality Changes of Garlic under Different Storage Regimes

by E. T. Castano, E. S. Mercado, F. G. Leon, C. H. Gorrostieta, J. J. Chamorro, E. B. Vazquez and V. T. Aguirre

In this paper we analyze the weight loss behaviour of Mexican garlic under different storage conditions. Garlic is an important Mexican export product. Quality losses during storage are important to understand due to cost and sale opportunity implications. Weight loss profiles for each experimental condition, represented as functions, are modeled by means of functional linear models, and hypothesis tests are performed to compare treatments. Monte Carlo sampling versions of permutation tests are used to obtain $p$-values. The functional approach clearly identified storage regimes that significantly decrease the speed of deterioration of the product relative to traditional Mexican agricultural practices.
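A Monte Carlo permutation test of the kind mentioned above can be sketched in simplified form as follows. This is only an illustration under stated assumptions: it compares two groups on a scalar summary (a difference in means) rather than on whole weight-loss curves as in the functional setting, and the function name, data values, and number of resamples are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Monte Carlo two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    # Statistic for the labelling that was actually observed.
    observed = abs(mean_diff(pooled))

    # Randomly relabel the pooled data and count relabellings that are
    # at least as extreme as the observed one.
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean_diff(pooled)) >= observed:
            extreme += 1

    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical percentage weight losses under two storage regimes.
traditional = [10.1, 10.3, 9.9, 10.2, 10.0]
controlled = [8.1, 8.0, 7.9, 8.2, 8.3]
print(permutation_p_value(traditional, controlled, n_resamples=2000))
```

Sampling random relabellings instead of enumerating all of them keeps the cost fixed by `n_resamples`, which matters once the number of observations makes full enumeration impractical.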

Journal of Data Science, v.4, no.2, p.247-255

Bifactorial Design Applied to Recombinant Protein Expression

by Martinez-Luaces, V., Guineo-Cobs, G., Velazquez, B., Chabalgoity, A. and Massaldi, H.

We have studied the effect of several factors that influence recombinant protein production, using the expression of recombinant streptolysin-O as our model. This protein, produced by {\it Streptococcus pyogenes}, is important in the biotechnological industry, where it is used to produce immunodiagnostic reagents. In order to improve the yield of this protein, we tried an alternative production method using strains of {\it Escherichia coli} and recombinant DNA technology. We have evaluated this method at the laboratory scale, taking into account factors such as inductor concentration, temperature of induction, proportion of culture medium volume to total flask volume, and strain of {\it Escherichia coli} used. To this end we applied techniques of experimental design, particularly a ``fixed-effects bifactorial design'', with the expression level of recombinant streptolysin-O in {\it E. coli} being the response to the factors. All the effects studied were found to be significant and relevant to the economics of the protein production.