### Journal of Data Science, v.3, no.3, p.213-222

#### Efficient Sampling Design in Audit Data

##### by Yan Liu, Mary Batcher and Fritz Scheuren

- Full Text (PDF): [807.48kB]

Auditors are often faced with reviewing a sample drawn from special populations. One is the population in which invoices fall into two categories according to whether or not they are qualified; in other words, the qualified amount follows a nonstandard mixture distribution, equal to zero with a certain probability and to the known invoice amount otherwise. The other is the population in which some invoices are partially qualified, that is, have a qualified amount between zero and the full invoice amount. For these settings, the typical design is stratified random sampling, with estimation by a ratio-type method. This paper focuses on efficient sample design for this setting and provides guidelines for setting stratum boundaries, calculating the sample size, and allocating it optimally across strata.
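The optimal allocation step the abstract mentions can be sketched with Neyman allocation, the textbook rule for allocating a fixed sample size across strata in proportion to stratum size times within-stratum spread. This is an illustrative sketch only, not the authors' procedure; the stratum counts and standard deviations below are hypothetical.

```python
# Neyman (optimal) allocation: n_h proportional to N_h * S_h.
# Stratum sizes and qualified-amount SDs below are made up for illustration.
def neyman_allocation(total_n, sizes, sds):
    """Allocate total_n sample units across strata proportionally to N_h * S_h."""
    weights = [N * S for N, S in zip(sizes, sds)]
    total = sum(weights)
    return [round(total_n * w / total) for w in weights]

# Hypothetical invoice strata: counts and within-stratum SDs of qualified amount.
sizes = [5000, 1500, 300]    # invoices per stratum
sds = [20.0, 150.0, 900.0]   # within-stratum SDs

print(neyman_allocation(400, sizes, sds))
```

Note how the small stratum of large, variable invoices receives a disproportionately large share of the sample, which is the usual motivation for stratifying audit populations by invoice amount.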

### Journal of Data Science, v.3, no.3, p.223-240

#### Estimating the Interest Rate Term Structures of Treasury and Corporate Debt with Bayesian Penalized Splines

##### by Min Li and Yan Yu

- Full Text (PDF): [399.65kB]

This paper provides a Bayesian approach to estimating the interest rate term structures of Treasury and corporate debt with a penalized spline model. Although the literature on term structure modeling is vast, to the best of our knowledge, all methods developed so far belong to the frequentist school. In this paper, we develop a two-step estimation procedure from a Bayesian perspective. The Treasury term structure is first estimated with a Bayesian penalized spline model. The smoothing parameter is naturally embedded in the model as a ratio of posterior variances and does not need to be selected as in the frequentist approach. The corporate term structure is then estimated by adding a credit spread to the estimated Treasury term structure, incorporating knowledge of the positive credit spread into the Bayesian model as an informative prior. In contrast to the frequentist method, the small sample size of the corporate debt data poses no particular difficulty to the proposed Bayesian approach.
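The variance-ratio role of the smoothing parameter comes from the mixed-model view of a penalized spline, in which the ridge penalty λ equals σ²(error)/σ²(spline coefficients). The sketch below illustrates that identity on the frequentist side with a truncated-line basis; the basis, knots, λ value, and data are all hypothetical, and the paper's actual estimation is Bayesian, not this ridge fit.

```python
# Mixed-model view of a penalized spline: fit y ~ [1, x, (x-k)_+, ...]
# with a ridge penalty lam on the knot coefficients, where lam plays the
# role of the variance ratio sigma_eps^2 / sigma_b^2. All inputs are
# hypothetical; this is an illustrative sketch, not the paper's method.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def pspline_fit(xs, ys, knots, lam):
    """Penalized least squares: (B'B + lam*D) beta = B'y, penalizing knot terms."""
    B = [[1.0, float(x)] + [max(x - k, 0.0) for k in knots] for x in xs]
    p = 2 + len(knots)
    BtB = [[sum(row[a] * row[b2] for row in B) for b2 in range(p)] for a in range(p)]
    for j in range(2, p):  # penalize only the spline (knot) coefficients
        BtB[j][j] += lam
    Bty = [sum(B[i][a] * ys[i] for i in range(len(xs))) for a in range(p)]
    return solve(BtB, Bty)
```

When the data lie exactly on a straight line, the fit recovers the line and shrinks the knot coefficients to zero, which is the shrinkage behavior the smoothing parameter controls.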

### Journal of Data Science, v.3, no.3, p.241-256

#### Application and Comparison of Methods for Analysing Correlated Interval-censored Data from Sexual Partnerships

##### by Khangelani Zuma and Mark N. Lurie

- Full Text (PDF): [151.53kB]

In epidemiological studies where subjects are seen periodically on follow-up visits, interval-censored data occur naturally. The exact time at which the change of state (such as HIV seroconversion) occurs is not known; it is only known to have occurred sometime within a specific time interval. This paper considers estimation of parameters when HIV infection times are interval-censored and correlated. It is assumed that each sexual partnership has a specific unobservable random effect that induces association between infection times. Parameters are estimated using the expectation-maximization algorithm and the Gibbs sampler, and the results from the two methods are compared. Both methods yield comparable fixed-effect and baseline hazard estimates. However, standard errors and frailty variance estimates from the expectation-maximization algorithm are underestimated compared with those from the Gibbs sampler. The Gibbs sampler is considered a plausible alternative to the expectation-maximization algorithm.

### Journal of Data Science, v.3, no.3, p.257-278

#### Testing Statistical Significance of the Area under a Receiver Operating Characteristic Curve for Repeated Measures Design with Bootstrapping

##### by Honghu Liu, Gang Li, William G. Cumberland and Tongtong Wu

- Full Text (PDF): [389.34kB]

The receiver operating characteristic (ROC) curve is an effective and widely used method for evaluating the discriminating power of a diagnostic test or statistical model, and a wealth of literature on its theory and computation has been established. Research on ROC curves, however, has focused mainly on cross-sectional designs; very little work on estimating ROC curves and their summary statistics, especially significance testing, has been done for repeated measures designs. Because of the complexity of estimating the standard error of the area under a ROC curve, there is no established statistical method for testing the significance of ROC curves under a repeated measures design. In this paper, we estimate the area under a ROC curve for a repeated measures design through a generalized linear mixed model (GLMM), using the predicted probability of disease or positivity of a condition, and propose a bootstrap method to estimate the standard error of the area under the curve for such designs. Statistical significance of the area under the ROC curve is then tested using the bootstrapped standard error. The validity of the bootstrap approach and of the significance test was assessed through simulation. Dedicated software written in SAS/IML/MACRO v8 was also created to implement the bootstrapping algorithm and carry out the calculations and statistical testing.
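The key point for repeated measures is that the bootstrap must resample whole subjects, keeping each subject's repeated observations together, before recomputing the AUC. The sketch below illustrates that subject-level resampling with a Mann-Whitney AUC estimate; the data layout, predicted probabilities, and replicate count are hypothetical, and this is not the paper's SAS implementation.

```python
# Subject-level bootstrap SE for the AUC under repeated measures.
# Each subject maps to a list of (predicted probability, outcome) pairs;
# whole subjects are resampled so repeated measures stay together.
import random

def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def bootstrap_se(subjects, n_boot=500, seed=1):
    """Resample subjects with replacement, recompute the AUC each time,
    and return the bootstrap standard error."""
    random.seed(seed)
    ids = list(subjects)
    aucs = []
    for _ in range(n_boot):
        sample = [subjects[random.choice(ids)] for _ in ids]
        pos = [s for subj in sample for (s, y) in subj if y == 1]
        neg = [s for subj in sample for (s, y) in subj if y == 0]
        if pos and neg:  # skip degenerate resamples with one class only
            aucs.append(auc(pos, neg))
    m = sum(aucs) / len(aucs)
    return (sum((a - m) ** 2 for a in aucs) / (len(aucs) - 1)) ** 0.5

# Hypothetical data: 10 subjects, 3 visits each; subjects 5-9 are positive.
subjects = {i: [(0.1 * i + 0.05 * j, int(i > 4)) for j in range(3)] for i in range(10)}
print(round(bootstrap_se(subjects, n_boot=200), 3))
```

A z-test of the observed AUC against 0.5 using this SE then gives the significance test the abstract describes.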

### Journal of Data Science, v.3, no.3, p.279-294

#### Monitoring the SARS Epidemic in China: A Time Series Analysis

##### by Dejian Lai

- Full Text (PDF): [943.20kB]

In this article, we studied three time series methods for modeling and forecasting the severe acute respiratory syndrome (SARS) epidemic in mainland China. The first was a Box-Jenkins model, a first-order autoregressive model (AR(1)). The second was a random walk (ARIMA(0,1,0)) model on the log-transformed daily reported SARS cases, and the third combined growth curve fitting with an autoregressive moving average model, ARMA(1,1). We applied all three methods to monitor the dynamics of SARS in China based on the daily probable new cases reported by the Ministry of Health of China.
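The two simpler models can be sketched directly: an AR(1) fit by least squares, and the random-walk model, whose one-step forecast is just the last observed value. The daily counts below are invented for illustration; the paper uses the Ministry of Health's reported series, and its third model (growth curve plus ARMA(1,1)) is not reproduced here.

```python
# Illustrative sketches of the AR(1) and random-walk models on a
# hypothetical daily case series (not the actual MOH data).

def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    return my - phi * mx, phi

def random_walk_forecast(series):
    """ARIMA(0,1,0): the one-step-ahead forecast equals the last observation."""
    return series[-1]

cases = [30, 45, 60, 85, 110, 140, 165, 180]  # hypothetical daily counts
c, phi = fit_ar1(cases)
print(c + phi * cases[-1])           # one-step AR(1) forecast
print(random_walk_forecast(cases))   # random-walk forecast
```

During the growth phase of an epidemic the fitted phi typically exceeds 1, so the AR(1) forecast extrapolates continued growth, while the random walk simply carries today's count forward.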

### Journal of Data Science, v.3, no.3, p.295-304

#### Is the Scientific Discovery of DNA Fingerprint by Chance or by Design?

##### by Harry Yang and Iksung Cho

- Full Text (PDF): [119.84kB]

DNA fingerprinting is a microbiological technique widely used to find a DNA sequence specific to a microbe. It involves slicing the microbe's genome into DNA fragments of manageable size, sorting the pieces by length, and finally identifying a DNA sequence unique to the microbe using probe-based assays. This unique sequence is referred to as the DNA fingerprint of the microbe under study. In this paper, we introduce a probabilistic model to estimate the chance of identifying the DNA fingerprint from the genome of a microbe when this method is employed. We derive a closed-form relationship between the chance of finding the fingerprint and factors that can be experimentally controlled fully, in part, or not at all. Because the odds of finding a specific DNA fingerprint can be improved by experimental design only to a certain degree, we show that, in a broader sense, the discovery of a DNA fingerprint is a process governed more by chance than by design. Nevertheless, the results can potentially be used to guide experiments in maximizing the chance of finding a DNA fingerprint of interest.

### Journal of Data Science, v.3, no.3, p.305-330

#### Multiple Change Point Analysis for the Regular Exponential Family using the Product Partition Model

##### by R. H. Loschi, F. R. B. Cruz and R. B. Arellano-Valle

- Full Text (PDF): [255.16kB]

As an extension of previous research efforts, the product partition model (PPM) is applied to the identification of multiple change points in the parameter indexing the regular exponential family. We define the PPM for Yao's prior cohesions and contiguous blocks. Because the exponential family provides a rich set of models, we also present the PPM for some particular members of this family in both the continuous and discrete cases, and we apply it to identify multiple change points in real data. First, multiple changes are identified in the crime rates of one of the largest cities in Brazil. To illustrate the continuous case, multiple changes are identified in the volatility (variance) and expected return (mean) of some Latin American emerging-market return series.