### Journal of Data Science, v.1, no.2, p.103-121

#### Analysis of Unbalanced Microarray Data

##### by Mei-Ling Ting Lee, G.A. Whitmore, Rus Y. Yukhananov

- Full Text (PDF): [142.90kB]

This paper investigates statistical procedures for analyzing microarray gene expression data obtained from studies with an unbalanced experimental design. We demonstrate the methods using microarray data from a study of opioid dependence in mice. The experiment was designed to investigate how morphine dependence alters gene expression in spinal cord mRNA. The aim was to identify genes that characterize the tolerance, withdrawal and two abstinence stages of dependence and to describe how gene expression is altered in moving from one stage to the next. The study design was unbalanced in several respects. First, for mice receiving morphine, arrays were made for four dependence stages, while for mice receiving placebo, arrays were made for only three stages. Second, administrative error led to an omitted replication for one treatment combination. Third, some expression readings were missing. Extending the two-stage ANOVA model of Lee et al. (2000, 2002a), this paper first uses a chi-square statistic to identify a small set of genes that exhibit differential expression over one or more treatment combinations. This gene set is then examined further using cluster analysis and novel inference methods to uncover specific genes and gene clusters that play a role at different stages of opioid dependence and, in particular, a role in the persistence of effect into the late abstinence stage. The latter effect implies that morphine dependence has a long-term genetic impact. The statistical power of the study to uncover differentially expressed genes is calculated as a prelude to further investigation. The analytical results proved useful to scientists in understanding the link between opioid dependence and gene function.

### Journal of Data Science, v.1, no.2, p.123-147

#### Consistent Parameter Estimation for Lagged Multilevel Models

##### by Neil H. Spencer

- Full Text (PDF): [169.56kB]

The estimation of parameters of lagged multilevel models is considered. This type of model is used in many application areas, including psychology and education, where changes in test results over time can be modelled. Standard estimation techniques are shown to give inconsistent results for this formulation of the multilevel model. Under two simple assumptions concerning the nature of the model covariate, first- and second-difference instrument methods for consistent estimation are developed. Simulations are used to demonstrate their success in obtaining consistent parameter estimates. Use of the instrument methods with more complex multilevel models is considered.

### Journal of Data Science, v.1, no.2, p.149-165

#### A Classification Statistic for GEE Categorical Response Models

##### by John M. Williamson, Hung-Mo Lin and Huiman X. Barnhart

- Full Text (PDF): [138.36kB]

A kappa-like classification statistic is proposed for assessing the fit of GEE regression models with a categorical response. The proposed statistic is a summary measure depicting how well categorical responses are predicted from the fitted GEE model. The statistic takes on a value of 1 if prediction is perfect and a value of 0 if the fitted model fares no better than random chance, i.e., fitting the repeated categorical responses with an intercept-only model. To demonstrate the usefulness of the classification statistic, we present simulation results as well as two examples from biomedical studies.
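The two boundary values described in this abstract (1 for perfect prediction, 0 for doing no better than a baseline) are shared by any chance-corrected agreement score. The sketch below is a generic illustration of that behaviour only; it is not the statistic proposed in the paper. It assumes hard predicted labels and, as a simplification, takes the intercept-only baseline to be "always predict the most frequent observed category". The function name `chance_corrected_agreement` is hypothetical.

```python
from collections import Counter


def chance_corrected_agreement(observed, predicted):
    """Generic kappa-like score: (p_model - p_null) / (1 - p_null).

    p_model is the proportion of responses predicted correctly.
    p_null is the proportion a majority-class baseline would get right
    (an assumed stand-in for an intercept-only model).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    p_model = sum(o == p for o, p in zip(observed, predicted)) / n
    p_null = max(Counter(observed).values()) / n
    if p_null == 1.0:
        # Only one observed category: the score is undefined.
        raise ValueError("baseline already predicts perfectly")
    return (p_model - p_null) / (1.0 - p_null)


# Hypothetical responses, for illustration only.
observed = ["a", "a", "a", "b", "c"]
print(chance_corrected_agreement(observed, observed))    # perfect prediction
print(chance_corrected_agreement(observed, ["a"] * 5))   # baseline prediction
```

A score below 0 is possible under this definition when the predictions do worse than the baseline.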

### Journal of Data Science, v.1, no.2, p.167-183

#### Assessing the Effect of an Open-ended Category on the Trend in 2xK Ordered Tables

##### by Shiva Gautam and Takamaru Ashikaga

- Full Text (PDF): [125.35kB]

Trend in proportion in 2 by K ordered tables is evaluated by assigning scores to ordered categories. Investigators often encounter 2 by K ordered tables with an open-ended category. An open-ended category arises when category scores for the first K-1 categories are known or given a priori but the score for the last category is unknown. In such situations, an arbitrary score is often assigned to the open-ended (or the last) category before evaluating the trend. Thus two investigators analyzing the same data set may assign different scores and may arrive at different conclusions. In the spirit of preliminary data analysis, it is shown through examples that there are situations where the conclusion is not affected by the choice of scores assigned to the open-ended category. The paper also explores situations where the conclusion may depend on the choice of a score for the open-ended category. In the former case, the usual trend analysis may be performed after assigning a score to the open-ended category. In the latter case, the trend may be evaluated after adjusting for the open-ended category as demonstrated in this paper. Alternatively, the trend may be evaluated by Gautam's method, which does not depend on a particular choice of a score.
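The sensitivity described in this abstract can be checked numerically. The sketch below assumes the trend is evaluated with the standard Cochran-Armitage statistic (the abstract does not name the test) and recomputes it for several arbitrary scores assigned to the last category. The table counts and the function name `trend_z` are hypothetical, and the adjustment and score-free methods mentioned in the abstract are not implemented here.

```python
import math


def trend_z(cases, totals, scores):
    """Cochran-Armitage trend statistic Z for a 2 x K ordered table.

    cases[i]  : number of 'successes' in category i
    totals[i] : column total for category i
    scores[i] : score assigned to category i
    """
    n_all = sum(totals)
    p = sum(cases) / n_all  # pooled proportion
    # T = sum_i s_i * (r_i - n_i * p)
    t = sum(s * (r - n * p) for s, r, n in zip(scores, cases, totals))
    sum_sn = sum(s * n for s, n in zip(scores, totals))
    sum_s2n = sum(s * s * n for s, n in zip(scores, totals))
    # Var(T) = p (1 - p) [ sum n_i s_i^2 - (sum n_i s_i)^2 / N ]
    var_t = p * (1.0 - p) * (sum_s2n - sum_sn ** 2 / n_all)
    return t / math.sqrt(var_t)


# Hypothetical 2 x 4 table whose last category is open-ended.
cases = [12, 18, 25, 15]
totals = [100, 100, 100, 100]

for last_score in (4, 6, 10):
    z = trend_z(cases, totals, [1, 2, 3, last_score])
    print(f"score for open-ended category = {last_score}: Z = {z:.3f}")
```

If Z falls on the same side of the chosen critical value for every plausible score, the conclusion does not depend on the score assigned to the open-ended category; otherwise it does.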

### Journal of Data Science, v.1, no.2, p.185-197

#### Comparing Reliabilities of the Strength of Two Container Designs: A Case Study

##### by Esteban Walker and Frank Guess

- Full Text (PDF): [134.06kB]
- Data (DOC): [103.00kB]

Two designs for PET (polyethylene terephthalate) beverage bottles were compared. These bottles are used for carbonated beverages, so a critical property is their burst strength. The burst strengths of bottles from each design across 24 cavities were measured. Standard nonparametric methods suggested a highly significant difference in the reliability of the two designs. Using simple graphical techniques, it was found that the reliability data of the new design appeared to be a mixture of distributions caused by the presence of "early mortality," due possibly to different failure modes. Even though they were clearly different, neither design was uniformly more reliable than the other. Standard parametric methods showed inadequate fit due to the bimodality of the strength data of the new design. The paper stresses (1) the need for clear operational definitions of "reliability," (2) the need for graphical exploratory analysis to discover anomalies in the data, (3) the value of nonparametric methods, and (4) the problems of using parametric techniques when the assumptions are violated. To justify work on improvement of the new design, the potential effect of the removal of the early mortality on the new design was analyzed.

### Journal of Data Science, v.1, no.2, p.199-230

#### Analysis of Bank Failure Using Published Financial Statements: The Case of Indonesia (Part 1)

##### by Loso Judijanto and E. V. Khmaladze

- Full Text (PDF): [244.70kB]
- Data-1 (DOC): [51.00kB]
- Data-2 (XLS): [165.00kB]
- Data-3 (XLS): [178.00kB]

The published financial statement is the only publicly available report on the financial condition of a bank operating in Indonesia. It contains limited information, but we want to exploit it to discriminate between normal, problem, and liquidated banks and to find factors underlying these conditions.

We observed 213 banks and analyzed 32 initial variables representing earning and profitability, productivity and efficiency, quality of assets, capital adequacy, growth and aggressiveness, credibility, size, income and source of fund diversification, liquidity, and dependence on affiliates.

In the classification we used the ranks of each variable rather than its numerical value as such. After making necessary transformations, creating new variables and deleting unnecessary variables, we found that the ranks of 12 variables out of the initial 32 could discriminate the three groups of banks significantly two years before failure, while the ranks of just two variables could discriminate significantly one year before failure.

In this first paper we outline our approach and consider variables describing earning and profitability, productivity and efficiency, and quality of assets. In the second paper we continue the analysis of other variables. Then we show that, for good discrimination, it is sufficient to select seven basic aspects of the financial structure and performance of a bank, which can be efficiently and consistently measured by variables with a simple and clear intuitive meaning (see the list of abbreviations below in the text). These are: efficiency in productivity and earning (ranks of EBT/SE, PM, ROE and ROEA), capital adequacy (ranks of E/EA and E/L), interest gap (ranks of IM and NII/L), credibility (ranks of ARCF), liquidity (ranks of LA/D), dependence on affiliates (ranks of NFA/L), and security of earning assets (ranks of PLL/L).