Journal of Data Science, v.5, no.3, p.297-313
Linear Information Models: An Introduction
by Philip E. Cheng, Jiun W. Liou, Michelle Liou and John A. D. Aston
- Full Text (PDF): [139.13kB]
Relative entropy identities yield basic decompositions of categorical data log-likelihoods. These naturally lead to the development of information models, in contrast to hierarchical log-linear models. A recent study by the authors clarified the principal difference between the two model types in the analysis of data likelihoods. The proposed log-likelihood decomposition scheme introduces a prototype of linear information models, from which a basic model selection scheme can be formulated. Empirical studies with high-way contingency tables illustrate the natural selection of information models in contrast to hierarchical log-linear models.
Journal of Data Science, v.5, no.3, p.315-333
Stochastic Diffusion Modeling of Degradation Data
by Sheng-Tsaing Tseng and Chien-Yu Peng
- Full Text (PDF): [282.86kB]
Accelerated degradation tests (ADTs) can provide timely reliability information about a product. Hence ADTs have been widely used to assess the lifetime distribution of highly reliable products. To predict the lifetime distribution properly, modeling the product's degradation path plays a key role in degradation analysis. In this paper, we use a stochastic diffusion process to describe the product's degradation path; a recursive formula for the product's lifetime distribution can then be obtained from the first passage time (FPT) of the degradation path. In addition, two approximate formulas for the product's mean time to failure (MTTF) and median life ($B50$) are given. Finally, we extend the proposed method to the ADT case, and real LED data are used to illustrate the proposed procedure. The results demonstrate that the proposed method performs well for LED lifetime prediction.
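The FPT idea can be sketched with a minimal Monte Carlo simulation, assuming (as one common special case, not the paper's exact model) that the degradation path is a Wiener process with drift `mu` and diffusion `sigma`; for that process the FPT to a threshold `D` is inverse Gaussian with mean `D/mu`, so the simulated MTTF should land near that value:

```python
import random

def simulate_fpt(mu, sigma, threshold, dt=0.01, n_paths=2000, seed=0):
    """Monte Carlo first-passage times of a drifted Wiener degradation path
    (Euler discretization; a hypothetical special case for illustration)."""
    rng = random.Random(seed)
    fpts = []
    for _ in range(n_paths):
        x, t = 0.0, 0.0
        while x < threshold:                       # degrade until threshold crossed
            x += mu * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
            t += dt
        fpts.append(t)
    return fpts

fpts = simulate_fpt(mu=1.0, sigma=0.2, threshold=5.0)
mttf = sum(fpts) / len(fpts)
# For the Wiener case the theoretical MTTF is threshold / mu = 5.0
```

Real degradation data would replace the simulated paths, with `mu` and `sigma` estimated from observed increments.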
Journal of Data Science, v.5, no.3, p.335-356
How to Find Multiple Systems Underlying a Two-Way Table of 0's and 1's, With Applications to Cognitive Impairments and Medical Laboratory Science
by T. P. Hutchinson
- Full Text (PDF): [152.25kB]
Datasets are sometimes encountered that consist of a two-way table of 0's and 1's. For example, this might show which patients are impaired on which of a battery of tests, or which compounds are successful at inactivating which of several micro-organisms. The present paper describes a method of analyzing such tables that reveals and specifies two (or more) systems or modes of action, if indeed they are needed to explain the data. The approach is an extension of what, in the context of cognitive impairments, is termed double dissociation. In order to be simple enough to be practicable, the approach is deterministic rather than probabilistic.
Journal of Data Science, v.5, no.3, p.357-378
The Time Resolution in Lag-Sequential Analysis: A Choice with Consequences
by Andre Berchtold and Gene P. Sackett
- Full Text (PDF): [169.30kB]
The creation of data sets using observational methods for the lag-sequential study of behavior requires selection of a recording time unit. This is an important issue, because standard methods such as momentary sampling and partial-interval sampling consistently underestimate the frequency of some behaviors. This leads to inaccurate estimation of both unconditional and conditional probabilities of the different behaviors, the basic descriptive and analytic tools of sequential analysis methodology. The purpose of this paper is to investigate the creation of data sets usable for the purpose of sequential analysis. We show that such data vary depending on the time resolution and that inaccurate choices lead to biased estimations of transition probabilities.
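The resolution effect can be illustrated with a toy sketch (the sequence and the momentary-sampling scheme below are invented for illustration, not taken from the paper): recording the same behavior stream at a coarser time unit changes the estimated transition probabilities, here making the self-transition A→A disappear entirely.

```python
from collections import Counter

def transition_probs(seq):
    """First-order transition probability estimates from a behavior sequence."""
    pairs = Counter(zip(seq, seq[1:]))
    totals = Counter(seq[:-1])
    return {(a, b): n / totals[a] for (a, b), n in pairs.items()}

def momentary_sample(seq, unit):
    """Re-record the sequence at a coarser time unit (keep every unit-th sample)."""
    return seq[::unit]

fine = list("AAABBBAAACCCAAABBB")   # one observation per fine time unit
coarse = momentary_sample(fine, 3)  # same stream, coarser recording unit
p_fine = transition_probs(fine)     # P(A -> A) = 6/9
p_coarse = transition_probs(coarse) # A -> A no longer observed at all
```

The coarse series is "ABACAB": every bout shorter than the recording unit collapses to a single sample, biasing both unconditional frequencies and transition estimates.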
Journal of Data Science, v.5, no.3, p.379-392
New Change-Point Rank Tests
by Abd-Elnaser S. Abd-Rabou and Ahmed M. Gad
- Full Text (PDF): [126.31kB]
New rank-based test statistics are proposed for the problem of a possible change in the distribution of independent observations. We extend the two-sample test statistic of Damico (2004) to the change-point setup. The finite-sample critical values of the proposed tests are estimated. We also conduct a Monte Carlo simulation to compare the power of the new tests with that of their competitors. Using the Nile data of Cobb (1978), we demonstrate the applicability of the new tests.
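The general shape of a rank-based change-point scan can be sketched as follows; this is a generic Wilcoxon-type statistic maximized over candidate change points, not the specific statistic of the paper or of Damico (2004):

```python
def rank_changepoint_stat(x):
    """Scan all candidate change points k, returning (max |Z_k|, argmax k),
    where Z_k standardizes the rank sum of the first k observations."""
    n = len(x)
    # midranks (ties get the average rank)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    best, best_k = 0.0, 1
    for k in range(1, n):
        s = sum(ranks[:k])                  # rank sum of first k observations
        mean = k * (n + 1) / 2              # null mean of the rank sum
        var = k * (n - k) * (n + 1) / 12    # null variance (no ties correction)
        z = abs(s - mean) / var ** 0.5 if var > 0 else 0.0
        if z > best:
            best, best_k = z, k
    return best, best_k

z, k = rank_changepoint_stat([1, 2, 1, 2, 9, 8, 9, 10])
# the scan locates the shift after the fourth observation (k = 4)
```

Critical values for the maximized statistic are typically obtained by simulation, as in the paper's Monte Carlo study.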
Journal of Data Science, v.5, no.3, p.393-412
Statistics in Metrology: International Key Comparisons and Interlaboratory Studies
by Andrew L. Rukhin and N. Sedransk
- Full Text (PDF): [222.36kB]
Stochastic modeling and analysis of international key comparisons (interlaboratory comparisons) pose several fundamental questions for statistical methodology. A key comparison (KC) is specifically designed to derive the key comparison reference value and to assess conformance of calibrations by participating national metrology laboratories at a few "key" settings for a particular measurement process. An approach to the statistical study of key comparisons data is proposed using a model taken from meta-analysis. This model leads to a class of weighted means estimators for the consensus value and to a method of assessing the uncertainty of the resulting estimates.
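The simplest member of this class of weighted means estimators is the fixed-effects inverse-variance (Graybill-Deal) mean, sketched below; the four laboratory results are hypothetical numbers invented for illustration:

```python
def weighted_consensus(means, variances):
    """Inverse-variance weighted mean and its standard uncertainty,
    as in a fixed-effects meta-analysis (Graybill-Deal estimator)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    consensus = sum(w * m for w, m in zip(weights, means)) / total
    uncertainty = (1.0 / total) ** 0.5   # standard uncertainty of the weighted mean
    return consensus, uncertainty

# hypothetical results from four laboratories: (mean, variance)
labs = [(10.1, 0.04), (9.9, 0.01), (10.3, 0.09), (10.0, 0.02)]
kcrv, u = weighted_consensus([m for m, _ in labs], [v for _, v in labs])
# the consensus value is pulled toward the most precise laboratory (9.9)
```

More refined estimators in this class add a between-laboratory variance component, which widens the reported uncertainty when the laboratories disagree.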
Journal of Data Science, v.5, no.3, p.413-423
Dirichlet-multinomial Model with Varying Response Rates over Time
by Jeffrey R. Wilson and Grace S. C. Chen
- Full Text (PDF): [122.08kB]
Overdispersion, or extravariation as it is often called, is believed to be commonly present in survey data due to heterogeneity among and between the units. One approach to addressing this phenomenon is the generalized Dirichlet-multinomial model. In its applications, the generalized Dirichlet-multinomial model assumes that the clusters are of equal size and that the number of clusters remains the same over time. In practice this is rarely the case when clusters are observed over time. In this paper the random variability and the varying response rates are accounted for in the model, which requires modeling another level of variation. In effect, this can be considered a hierarchical model that allows varying response rates in the presence of overdispersed multinomial data. The model and its applicability are demonstrated through an illustrative application to a subset of the well-known High School and Beyond survey data.
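The basic (non-generalized) Dirichlet-multinomial mechanism and the overdispersion it induces can be sketched with a small simulation; the parameter values are arbitrary choices for illustration:

```python
import random
from statistics import pvariance

def dirichlet_multinomial(n, alpha, rng):
    """One draw: p ~ Dirichlet(alpha), then counts ~ Multinomial(n, p)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]  # Dirichlet via gammas
    s = sum(gammas)
    p = [g / s for g in gammas]
    counts = [0] * len(alpha)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(p):
            acc += pj
            if u <= acc:
                counts[j] += 1
                break
        else:                      # guard against floating-point rounding at u ~ 1
            counts[-1] += 1
    return counts

rng = random.Random(1)
draws = [dirichlet_multinomial(50, [2.0, 2.0, 2.0], rng) for _ in range(500)]
var_first = pvariance([d[0] for d in draws])
# a plain multinomial would give variance n*p*(1-p) = 50*(1/3)*(2/3) ~ 11.1;
# the Dirichlet-multinomial inflates it by (n + a0)/(1 + a0) = 56/7 = 8, to ~ 88.9
```

The generalized model of the paper goes further by letting cluster sizes and response rates vary over time, which this fixed-`n` sketch does not capture.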
Journal of Data Science, v.5, no.3, p.425-439
Comparisons of Gene Expression Indexes for Oligonucleotide Arrays
by Mounir Aout
- Full Text (PDF): [124.03kB]
High-density oligonucleotide arrays have become a standard research tool for monitoring the expression of thousands of genes simultaneously. Affymetrix GeneChip arrays, the most popular, use short oligonucleotides to probe for genes in an RNA sample. However, important challenges remain in estimating expression levels from raw hybridization intensities on the array. In this paper, we deal with the problem of estimating gene expression based on a statistical model. The present method is similar to the Li and Wong (2001a) model but is more general. More precisely, we show how the model introduced by Li and Wong can be generalized to provide a new measure of gene expression. Moreover, we provide a comparison of the two models.
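The Li-Wong model fits a rank-one structure y_ij ≈ theta_i * phi_j (array-level expression times probe affinity). A minimal alternating least-squares sketch of that baseline model, on a tiny synthetic data set invented for illustration, looks like this:

```python
def li_wong_fit(y, n_iter=50):
    """Alternating least-squares fit of the rank-one Li-Wong style model
    y[i][j] ~ theta[i] * phi[j], with the identifiability constraint
    sum(phi_j^2) = J (number of probes)."""
    I, J = len(y), len(y[0])
    phi = [1.0] * J
    theta = [0.0] * I
    for _ in range(n_iter):
        ss_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(J)) / ss_phi for i in range(I)]
        ss_th = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(I)) / ss_th for j in range(J)]
        scale = (J / sum(p * p for p in phi)) ** 0.5   # enforce sum(phi^2) = J
        phi = [p * scale for p in phi]
        theta = [t / scale for t in theta]
    return theta, phi

# tiny synthetic example: 3 arrays x 4 probes, exact rank-one data
true_theta, true_phi = [1.0, 2.0, 3.0], [0.5, 1.0, 1.5, 1.0]
y = [[t * p for p in true_phi] for t in true_theta]
theta, phi = li_wong_fit(y)   # products theta[i]*phi[j] recover y exactly
```

The paper's generalization modifies this baseline; real use would also need the outlier handling and standard-error estimation of the original Li-Wong procedure.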
Journal of Data Science, v.5, no.3, p.441-449
A Posterior Distribution for the Normal Mean Arising from a Ratio
by Saralees Nadarajah and Arjun K. Gupta
- Full Text (PDF): [106.27kB]
It is shown that the most popular posterior distribution for the mean of the normal distribution is obtained by deriving the distribution of the ratio X/Y, where X and Y are normal and Student's t random variables distributed independently of each other. Tabulations of the associated percentage points are given, along with a computer program for generating them.
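Percentage points of such a ratio can be approximated by straightforward Monte Carlo, as in this sketch (the simulation below is a generic approximation, not the paper's tabulation program; the standard-normal and t parameter choices are illustrative):

```python
import random

def ratio_percentiles(df, probs, n=100000, seed=7):
    """Monte Carlo percentage points of R = X/Y with X ~ N(0,1)
    independent of Y ~ Student's t with df degrees of freedom."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))  # chi-square_df
        y = rng.gauss(0.0, 1.0) / (v / df) ** 0.5             # t_df variate
        draws.append(x / y)
    draws.sort()
    return [draws[int(p * n)] for p in probs]

q = ratio_percentiles(df=5, probs=[0.5, 0.75, 0.95])
# symmetry of the ratio about zero implies the estimated median is near 0
```

Because Y places positive density at zero, the ratio is heavy-tailed, so extreme percentage points converge slowly and benefit from large `n`.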
Journal of Data Science, v.5, no.3, p.451-469
Detection of Differentially Expressed Genes In Small Sets of cDNA Microarrays
by Simon Rosenfeld
- Full Text (PDF): [3.18MB]
Methods for testing the equality of two means are of critical importance in many areas of applied statistics. In the microarray context, such tests must often be applied to small samples containing no more than a dozen elements, in which case their power is inevitably low. We suggest augmenting the classical t-test by introducing a new test statistic which we call the "bio-weight." We show by simulation that, in the practically important case of small sample sizes, the test based on this statistic is substantially more powerful than the classical t-test. The power computations are accompanied by ROC and FDR analyses of the simulated microarray data.
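The low-power baseline that motivates the bio-weight statistic (whose formula is not reproduced here) can be demonstrated with a power simulation for the classical t-test at microarray-scale sample sizes; all settings below (n = 6 per group, unit effect size, two-sided 5% critical value 2.228 at 10 df) are illustrative choices:

```python
import math
import random

def welch_t(a, b):
    """Two-sample t statistic (Welch form)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def power_t(n, delta, crit=2.228, reps=4000, seed=3):
    """Monte Carlo power of the rule |t| > crit for N(0,1) vs N(delta,1),
    n observations per group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(delta, 1.0) for _ in range(n)]
        if abs(welch_t(a, b)) > crit:
            hits += 1
    return hits / reps

p = power_t(n=6, delta=1.0)
# even a one-standard-deviation effect is detected well under half the time
```

Simulations of this kind, extended to the bio-weight statistic, underlie the power, ROC, and FDR comparisons in the paper.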