### Journal of Data Science, v.4, no.3, p.257-274

#### A Modified PLSR Method in Prediction

##### by Bo Cheng and Xizhi Wu

- Full Text (PDF): [181.37kB]

Among the many statistical methods for linear models with multicollinearity, partial least squares regression (PLSR) has in recent years become increasingly popular and, very often, the best choice. However, while dealing with a prediction problem from the automobile market, we noticed that the results from PLSR appear unstable, though it is still the best among several standard statistical methods. This instability is likely due to the impact of information contained in the explanatory variables that is irrelevant to the response variable. Based on the PLSR algorithm, this paper introduces a new method, modified partial least squares regression (MPLSR), to emphasize the impact of the relevant information in the explanatory variables on the response variable. With MPLSR, satisfactory prediction results are obtained for the above practical problem. The performance of MPLSR, PLSR and some standard statistical methods is compared in a set of Monte Carlo experiments. This paper shows that MPLSR is the most stable and accurate method, especially when the ratio of the number of observations to the number of explanatory variables is low.
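The base PLSR algorithm that the paper modifies can be sketched with the NIPALS iteration. Below is a minimal single-response (PLS1) version in Python; it is the standard algorithm only, not the MPLSR modification, and all names are illustrative.

```python
import numpy as np

def pls1(X, y, n_components):
    """One-response PLS regression (PLS1) via the NIPALS algorithm.

    Components are extracted to maximize covariance with y, which is
    what makes PLSR usable when the columns of X are collinear.
    """
    X = X - X.mean(axis=0)          # center predictors
    y = y - y.mean()                # center response
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk               # weight: direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                  # score vector
        p = Xk.T @ t / (t @ t)      # X loading
        c = yk @ t / (t @ t)        # y loading
        Xk = Xk - np.outer(t, p)    # deflate X
        yk = yk - c * t             # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    # regression coefficients for the centered variables
    return W @ np.linalg.solve(P.T @ W, q)
```

With as many components as the rank of X, this reproduces the least squares fit; the stability issue the abstract describes arises when irrelevant variation in X leaks into the extracted components.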

### Journal of Data Science, v.4, no.3, p.275-289

#### Testing for Activation in Data from FMRI Experiments

##### by Martina Pavlicova, Noel Cressie, and Thomas J. Santner

- Full Text (PDF): [587.60kB]

The traditional method for processing functional magnetic resonance imaging (FMRI) data is based on a voxel-wise general linear model. For experiments conducted using a block design, where periods of activation are interspersed with periods of rest, a haemodynamic response function (HRF) is convolved with the design function and, for each voxel, the convolution is regressed on prewhitened data. An initial analysis of the data often involves computing voxel-wise two-sample t-tests, which avoids a direct specification of the HRF. Assuming only the length of the haemodynamic delay is known, scans acquired in transition periods between activation and rest are omitted, and the two-sample t-test is used to compare mean levels during activation with mean levels during rest. However, the validity of the two-sample t-test rests on the assumption that the data are Gaussian with equal variances. In this article, we consider the Wilcoxon rank test as well as modified versions of the classical t-test that correct for departures from these assumptions. The relative performance of the tests is assessed by applying them to simulated data and comparing their size and power; one of the modified tests (the CW test) is shown to be superior.
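As an illustration of the comparison the abstract describes (the paper's CW test is not reproduced here), the classical pooled t-test, the Welch-corrected t-test, and the Wilcoxon rank-sum test can be run on a simulated voxel with unequal variances between conditions; the simulated values are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Toy voxel time course: steady-state "rest" and "activation" scans with
# unequal variances; transition scans are assumed already discarded.
rest = rng.normal(loc=100.0, scale=2.0, size=60)
active = rng.normal(loc=104.0, scale=5.0, size=60)

t_pooled = stats.ttest_ind(active, rest, equal_var=True)   # classical two-sample t
t_welch = stats.ttest_ind(active, rest, equal_var=False)   # Welch correction for unequal variances
ranksum = stats.ranksums(active, rest)                     # Wilcoxon rank-sum, no Gaussian assumption

for name, res in [("pooled t", t_pooled), ("Welch t", t_welch), ("rank-sum", ranksum)]:
    print(f"{name}: stat={res.statistic:.2f}, p={res.pvalue:.4f}")
```

When the equal-variance assumption fails, the pooled t-test can be badly sized even though all three tests agree on this easy example; that size distortion is what motivates the modified tests in the paper.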

### Journal of Data Science, v.4, no.3, p.291-306

#### An Evaluation of Multiple Behavioral Risk Factors for Cancer in a Working Class, Multi-Ethnic Population

##### by Melody S. Goodman, Yi Li, Gary G. Bennett, Anne M. Stoddard and Karen M. Emmons

- Full Text (PDF): [165.88kB]

Behavioral risk factors for cancer tend to cluster within individuals, which can compound risk beyond that associated with the individual risk factors alone. There has been increasing attention paid to the prevalence of multiple risk factors (MRF) for cancer, and to the importance of designing interventions that help individuals reduce their risks across multiple behaviors simultaneously. The purpose of this paper is to develop methodology to identify an optimal linear combination of multiple risk factors (score function) which would facilitate evaluation of cancer interventions.

### Journal of Data Science, v.4, no.3, p.307-321

#### Reducing Subjectivity in the Likelihood

##### by S. James Press

- Full Text (PDF): [143.72kB]

Some scientists prefer to exercise substantial judgment in formulating a likelihood function for their data. Others prefer to try to get the data to tell them which likelihood is most appropriate. We suggest here that one way to reduce the judgment component of the likelihood function is to adopt a mixture of potential likelihoods and let the data determine the weights on each likelihood. We distinguish several different types of subjectivity in the likelihood function and show with examples how these subjective elements may be given more equitable treatment.
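One concrete way to "let the data determine the weights" — sketched here under the simplifying assumption of just two candidate likelihoods (normal vs. Laplace) whose parameters are fit separately by maximum likelihood — is an EM update on the mixture weight:

```python
import numpy as np
from scipy import stats

def likelihood_weight(x, n_iter=200):
    """Estimate the weight on a normal vs. a Laplace likelihood by EM.

    The returned weight pi is the 'vote' the data give to the normal
    likelihood; 1 - pi is the weight on the Laplace alternative.
    """
    f_norm = stats.norm.pdf(x, loc=x.mean(), scale=x.std())
    f_lap = stats.laplace.pdf(x, loc=np.median(x),
                              scale=np.mean(np.abs(x - np.median(x))))
    pi = 0.5                                             # neutral starting weight
    for _ in range(n_iter):
        r = pi * f_norm / (pi * f_norm + (1 - pi) * f_lap)  # E-step: responsibilities
        pi = r.mean()                                       # M-step: update weight
    return pi
```

Because EM never decreases the likelihood, the fitted mixture is at least as well supported as the equal-weight mixture; if one candidate dominates everywhere, the weight drifts to a boundary, which is the data rejecting the other likelihood outright.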

### Journal of Data Science, v.4, no.3, p.323-341

#### Application of One-Sided t-tests and a Generalized Experimentwise Error Rate to High-Density Oligonucleotide Microarray Experiments: An Example Using Arabidopsis

##### by W. M. Muir, J. Romero-Severson, S. D. Rider Jr., A. Simons, and J. Ogas

- Full Text (PDF): [183.96kB]

The GFWER(k) is defined as the probability of rejecting k or more true null hypotheses at a given significance level. Controlling the GFWER(k) was shown to be simple to apply and, depending on the value of k, can greatly increase power. A k value as small as 2 or 3 was concluded to be adequate for large or small experiments, respectively. A one-sided t-test along with GFWER(2) = .05 identified 43 genes as exhibiting PICKLE-dependent expression. Expression of all 43 genes was re-examined by qRT-PCR, of which 36 (83.7%) were confirmed to exhibit PICKLE-dependent expression.
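The power gain from relaxing k can be seen in the standard single-step rule controlling the generalized familywise error rate (Lehmann and Romano's generalized Bonferroni); the paper's exact procedure may differ in detail, so this is a sketch of the idea rather than the authors' implementation.

```python
import numpy as np

def gfwer_reject(pvals, k=2, alpha=0.05):
    """Single-step generalized Bonferroni rule: reject H_i when
    p_i <= k * alpha / m, which controls GFWER(k) -- the probability of
    k or more false rejections -- at level alpha.

    k = 1 recovers ordinary Bonferroni; k = 2 or 3 doubles or triples
    the per-test threshold, which is where the extra power comes from.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    return pvals <= k * alpha / m
```

For example, with p-values (0.001, 0.02, 0.03, 0.5) and alpha = 0.05, ordinary Bonferroni (k = 1) rejects only the first hypothesis, while GFWER(2) also rejects the second.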

### Journal of Data Science, v.4, no.3, p.343-356

#### Developing Multivariate Survival Trees with a Proportional Hazards Structure

##### by Feng Gao, Amita K. Manatunga, Shande Chen

- Full Text (PDF): [160.76kB]

In this paper, a tree-structured method is proposed to extend the Classification and Regression Trees (CART) algorithm to multivariate survival data, assuming a proportional hazards structure in the whole tree. The method works on the marginal survivor distributions and uses a sandwich estimator of variance to account for the association between survival times. The Wald test statistic is used as the splitting rule, and the survival trees are developed by maximizing between-node separation. The proposed method aims to classify patients into subgroups with distinctively different prognoses. Unlike conventional tree-growing algorithms, which work on a subset of the data at every partition, the proposed method deals with the whole data set and searches for the globally optimal split at each partition. The method is applied to a prostate cancer data set, and its performance is also evaluated in several simulation studies.
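The splitting idea — scan candidate cutpoints and keep the one with the largest Wald statistic for the between-node difference — can be sketched in a deliberately simplified form. This toy version assumes exponential hazards within each child node and independent observations, rather than the paper's marginal proportional-hazards model with a sandwich variance estimator.

```python
import numpy as np

def wald_split(x, time, event):
    """Find the binary split of covariate x maximizing a Wald statistic
    for the difference in log hazards between the two child nodes.

    Assumes an exponential survival model per node: the hazard MLE is
    (number of events) / (total time at risk), and var(log hazard) is
    approximately 1/(number of events).
    """
    best = (None, 0.0)
    for c in np.unique(x)[:-1]:
        left, right = x <= c, x > c
        d1, d2 = event[left].sum(), event[right].sum()
        if d1 == 0 or d2 == 0:
            continue                          # no events: hazard not estimable
        lam1 = d1 / time[left].sum()          # exponential hazard MLE, left node
        lam2 = d2 / time[right].sum()         # exponential hazard MLE, right node
        w = (np.log(lam1) - np.log(lam2)) ** 2 / (1 / d1 + 1 / d2)
        if w > best[1]:
            best = (c, w)
    return best                               # (cutpoint, Wald statistic)
```

Growing a tree then means applying this search recursively; the paper's contribution is doing the search on the whole data set with a variance estimate that remains valid when survival times within a cluster are correlated.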

### Journal of Data Science, v.4, no.3, p.357-370

#### Critical Values and Power for a Small Sample Test of Difference in Proportions in the Presence of Extra-Binomial Variation

##### by John S. Lawson and Benjamin Ahlstrom

- Full Text (PDF): [174.16kB]

We develop a likelihood ratio test statistic, based on the beta-binomial distribution, for comparing a single treated group with dichotomous data to dual control groups. This statistic is useful in cases where there is overdispersion, or extra-binomial variation. We apply the statistic to data from a two-year rodent carcinogenicity study with dual control groups. The test statistic we developed is similar to others that have been developed for incorporating historical control groups into rodent carcinogenicity experiments. However, for the small-sample case we considered, the large-sample theory used by the other test statistics did not apply, so we determined the critical values of our statistic by enumerating its distribution. A small Monte Carlo study shows that the new test statistic controls the significance level much better than Fisher's exact test when there is overdispersion, and that it has adequate power.
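The key small-sample device — enumerating the exact null distribution of a statistic instead of relying on large-sample theory — can be illustrated with a simpler statistic than the paper's likelihood ratio (a difference in tumour proportions, which is not the authors' statistic) under a beta-binomial null:

```python
from itertools import product
from scipy import stats

def exact_critical_value(n_t, n_c, a, b, alpha=0.05):
    """Enumerate the exact null distribution of d = x_t/n_t - x_c/n_c
    when both groups follow a BetaBinomial(n, a, b) null (an
    overdispersed binomial), and return the smallest critical value
    whose exact one-sided size is <= alpha, together with that size.
    """
    pmf_t = [stats.betabinom.pmf(x, n_t, a, b) for x in range(n_t + 1)]
    pmf_c = [stats.betabinom.pmf(x, n_c, a, b) for x in range(n_c + 1)]
    dist = {}                                  # exact pmf of the statistic d
    for xt, xc in product(range(n_t + 1), range(n_c + 1)):
        d = round(xt / n_t - xc / n_c, 10)     # round to merge equal values
        dist[d] = dist.get(d, 0.0) + pmf_t[xt] * pmf_c[xc]
    for crit in sorted(dist):
        size = sum(p for d, p in dist.items() if d >= crit)
        if size <= alpha:
            return crit, size
```

Because every attainable value of the statistic is enumerated, the reported size is exact rather than asymptotic, which is exactly what the small-sample setting of the paper requires.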

### Journal of Data Science, v.4, no.3, p.371-386

#### Improved Tolerance Limits by Combining Analytical and Experimental Data: An Information Integration Methodology

##### by A. Alexandre Trindade and Stan Uryasev

- Full Text (PDF): [170.29kB]

We propose a coherent methodology for integrating different sources of information on a response variable of interest, in order to accurately predict percentiles of its distribution. Under the assumption that one of the sources is more reliable than the other(s), the approach combines factors formed from the data into an additive linear regression model. Quantile regression, designed for quantifying the goodness of fit precisely at a desired quantile, is used as the optimality criterion in model-fitting. Asymptotic confidence interval construction methods for the percentiles are adopted to compute statistical tolerance limits for the response. The approach is demonstrated on a materials science case study that pools together information on failure load from physical tests and computer model predictions. A small simulation study assesses the precision of the inferences. The methodology gives plausible percentile estimates, and the resulting tolerance limits are close to nominal coverage probability levels.
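Quantile regression, the model-fitting criterion used above, amounts to minimizing the pinball (check) loss at the desired quantile. A minimal sketch, assuming a single covariate and illustrative names, and using direct numerical minimization rather than the usual linear-programming formulation:

```python
import numpy as np
from scipy.optimize import minimize

def quantile_fit(x, y, tau):
    """Linear quantile regression by minimizing the pinball loss.

    tau = 0.10 targets the conditional 10th percentile of y, a natural
    basis for a one-sided lower tolerance limit on the response.
    """
    def pinball(beta):
        r = y - (beta[0] + beta[1] * x)
        # positive residuals weighted tau, negative residuals tau - 1
        return np.mean(np.where(r >= 0, tau * r, (tau - 1) * r))

    res = minimize(pinball, x0=[np.quantile(y, tau), 0.0],
                   method="Nelder-Mead", options={"maxiter": 2000})
    return res.x                     # (intercept, slope)
```

In practice an off-the-shelf implementation such as statsmodels' `QuantReg` fits the same objective; the point of the sketch is that the loss, not the estimator, is what targets a specific percentile of the response distribution.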