### Journal of Data Science, v.5, no.4, p.471-490

#### Distortion Diagnostics for Covariate-adjusted Regression: Graphical Techniques Based on Local Linear Modeling

##### by Danh V. Nguyen and Damla Senturk

- Full Text (PDF): [301.85kB]

Linear regression models are often useful tools for exploring the relationship between a response and a set of explanatory (predictor) variables. When both the observed response and the predictor variables are contaminated/distorted by unknown functions of an observable confounder, inferring the underlying relationship between the latent (unobserved) variables is more challenging. Recently, Senturk and Muller (2005) proposed the method of covariate-adjusted regression (CAR) analysis for this distorted data setting. In this paper, we describe graphical techniques for assessing departures from or violations of specific assumptions regarding the type and form of the data distortion. The type of data distortion may be multiplicative, additive, or absent (no distortion). The form of the distortion encompasses a class of general smooth distorting functions. However, common confounding adjustment methods in regression analysis implicitly make distortion assumptions, such as assuming additive or multiplicative linear distortions. We illustrate graphical detection of departures from such assumptions on the distortion. The graphical diagnostic techniques are illustrated with numerical and real data examples. The proposed graphical assessment of distortion assumptions is feasible due to the CAR estimation method, which utilizes a local regression technique to estimate a set of transformed distorting functions (Senturk and Nguyen, 2006).
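As a rough illustration of the local regression idea mentioned in the abstract, the following sketch fits a kernel-weighted straight line around a single point. It is a generic local linear smoother, not the CAR estimator itself; the function name `local_linear_fit`, the Gaussian kernel, and the `bandwidth` parameter are assumptions made for this example.

```python
import math


def local_linear_fit(x, y, x0, bandwidth):
    """Kernel-weighted least squares fit of y ~ a + b * (x - x0).

    Returns (a, b): a estimates the regression function at x0 and
    b estimates its slope there. Generic sketch, not the CAR estimator.
    """
    # Gaussian kernel weights: observations near x0 count more.
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]

    # Weighted sums that make up the 2x2 normal equations.
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))

    det = s0 * s2 - s1 ** 2
    if det == 0:
        raise ValueError("design is degenerate near x0; widen the bandwidth")

    intercept = (s2 * t0 - s1 * t1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return intercept, slope
```

Evaluating the fit over a grid of `x0` values traces out a smooth curve. Plotting such a curve against an assumed form (for example, a constant or a straight line) is one way a graphical check of this kind could be set up.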

### Journal of Data Science, v.5, no.4, p.491-502

#### Count Regression Models with an Application to Zoological Data Containing Structural Zeros

##### by Ilknur Ozmen and Felix Famoye

- Full Text (PDF): [126.49kB]

Recently, count regression models have been used to model over-dispersed and zero-inflated count response variables that are affected by one or more covariates. Generalized Poisson (GP) and negative binomial (NB) regression models have been suggested to deal with over-dispersion. Zero-inflated count regression models such as the zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB) and zero-inflated generalized Poisson (ZIGP) regression models have been used to handle count data with many zeros. The aim of this study is to model the number of C. caretta hatchlings dying from exposure to the sun. We present a framework for evaluating the suitability of the Poisson, NB, GP, ZIP and ZIGP models for a zoological data set in which the count data may exhibit many zeros and over-dispersion. Estimation of the model parameters by the method of maximum likelihood (ML) is provided. Based on the score test and the goodness-of-fit measure for the zoological data, the GP regression model performs better than the other count regression models.
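To make the zero-inflation idea concrete, here is a minimal sketch of the standard ZIP probability mass function and the log-likelihood that maximum likelihood estimation would maximize. It has no covariates, so it is simpler than the regression models in the paper; the names `zip_pmf`, `zip_loglik`, `rate`, and `zero_prob` are assumptions for this example.

```python
import math


def zip_pmf(k, rate, zero_prob):
    """P(Y = k) under a zero-inflated Poisson distribution.

    With probability zero_prob the count is a structural zero;
    otherwise it is drawn from Poisson(rate).
    """
    poisson = math.exp(-rate) * rate ** k / math.factorial(k)
    if k == 0:
        # Zeros come from both the structural part and the Poisson part.
        return zero_prob + (1 - zero_prob) * poisson
    return (1 - zero_prob) * poisson


def zip_loglik(counts, rate, zero_prob):
    """Log-likelihood of i.i.d. counts (no covariates in this sketch)."""
    return sum(math.log(zip_pmf(k, rate, zero_prob)) for k in counts)
```

Setting `zero_prob` to 0 recovers the ordinary Poisson model. That is why nested comparisons, such as a score test for zero inflation, are natural in this setting.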

### Journal of Data Science, v.5, no.4, p.503-518

#### Application of Multiple Imputation to Data from Two-phase Sampling: Estimation of the Incidence Rate of Cognitive Impairment

##### by Changyu Shen

- Full Text (PDF): [210.24kB]

An epidemiological cohort study that adopts a two-phase design raises the serious issue of how to treat a fairly large amount of missing values that are either Missing At Random (MAR) due to the study design or potentially Missing Not At Random (MNAR) due to non-response and loss to follow-up. Cognitive impairment (CI) is an evolving concept that needs epidemiological characterization in order to mature. In this work, we attempt to estimate the incidence rate of CI by accounting for the aforementioned missing-data process. We consider baseline and first follow-up data of 2191 African-Americans enrolled in a prospective epidemiological study of dementia that adopted a two-phase sampling design. We developed a multiple imputation procedure in the mixture model framework that can be easily implemented in SAS. Sensitivity analysis is carried out to assess the dependence of the estimates on specific model assumptions. It is shown that African-Americans aged 65-75 have a much higher incidence rate of CI than younger or older elderly. In conclusion, multiple imputation provides a practical and general framework for the estimation of epidemiological characteristics in two-phase sampling studies.
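The final step of most multiple imputation analyses combines the estimates from the separately imputed data sets. The sketch below shows the standard combining rules (Rubin's rules) in Python rather than SAS. It does not reproduce the paper's mixture-model imputation procedure, and the function name `pool_rubin` is an assumption for this example.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates using Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: their squared standard errors, in the same order
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)       # average point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 divisor)

    # Total variance inflates the between part by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what carries the extra uncertainty due to the missing data. A sensitivity analysis would typically repeat this pooling under different imputation model assumptions.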

### Journal of Data Science, v.5, no.4, p.519-534

#### Power Calculations for ZIP and ZINB Models

##### by John M. Williamson, Hung-Mo Lin, Robert H. Lyles and Allen W. Hightower

- Full Text (PDF): [158.90kB]

We present power calculations for zero-inflated Poisson (ZIP) and zero-inflated negative-binomial (ZINB) models. We detail direct computations for a ZIP model based on a two-sample Wald test using the expected information matrix. We also demonstrate how Lyles, Lin, and Williamson's method (2006) of power approximation for categorical and count outcomes can be extended to both zero-inflated models. This method can be used for power calculations based on the Wald test (via the observed information matrix) and the likelihood ratio test, and can accommodate both categorical and continuous covariates. All the power calculations can be conducted when covariates are used in the modeling of both the count data and the "excess zero" data, or in either part separately. We present simulations to detail the performance of the power calculations. Analysis of a malaria study is used for illustration.
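For orientation, the generic large-sample power of a two-sided Wald test can be sketched as follows. This is the textbook normal approximation, not the ZIP-specific computation or the Lyles, Lin, and Williamson method described in the abstract. The function name `wald_power` is an assumption, and `std_error` is taken as given, whereas the paper derives it from the information matrix.

```python
from statistics import NormalDist


def wald_power(effect, std_error, alpha=0.05):
    """Approximate power of a two-sided Wald test.

    effect: true value of the parameter under the alternative
            (the null value is taken to be 0)
    std_error: standard error of the estimator at the planned sample size
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect / std_error       # noncentrality on the z scale

    # Probability that the Wald statistic falls in either rejection region.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)
```

When `effect` is 0 the function returns `alpha`, as it should. In practice `std_error` shrinks with sample size, so sample size planning amounts to finding the size at which this value reaches the target power.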

### Journal of Data Science, v.5, no.4, p.535-554

#### Distribution-Free Regression: Reinterpreting Design-Based Sampling

##### by Gordon G. Bechtel

- Full Text (PDF): [165.33kB]

An individual in a finite population is represented by a random variable whose expectation is linearly composed of explanatory variables and a personal effect. This expectation locates her (his) random variable on a scale when s(he) responds to a questionnaire item or physical instrument. This formulation reinterprets design-based sampling, which represents an individual as a constant waiting to be observed. Retaining constant expectations, however, along with fixed realizations of random variables, preserves and strengthens design-based theory through the Horvitz-Thompson (1952) theorem. This interpretation reaffirms the usual design-based regression estimates, whose normality is seen to be free of any assumptions about the distribution of the outcome variable. It also formulates response error in a way that renders a superpopulation, postulated by model-based sampling, unnecessary. The value of distribution-free regression is illustrated with an analysis of American presidential approval.
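The Horvitz-Thompson estimator that underlies the design-based theory referred to above weights each sampled value by the inverse of its inclusion probability. The following is a minimal sketch of the estimator of a population total; the function name `horvitz_thompson_total` is an assumption, and the paper's regression estimates are not reproduced here.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Horvitz-Thompson estimate of a finite-population total.

    values: observed values for the sampled units
    inclusion_probs: each unit's probability of being included in the sample
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")

    # Each sampled unit stands in for 1 / p units of the population.
    return sum(y / p for y, p in zip(values, inclusion_probs))
```

The estimator is unbiased under the sampling design whatever the distribution of the outcome, which is the sense in which design-based estimates are distribution-free.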

### Journal of Data Science, v.5, no.4, p.555-576

#### The Determinants of Birth Interval in Ahvaz-Iran: A Graphical Chain Modeling Approach

##### by Abdolrahman Rasekh and Majid Momtaz

- Full Text (PDF): [193.83kB]

Birth interval is a major determinant of the rates of fertility. In this paper a graphical modeling approach is used to study the effect of different socio-economic factors on birth intervals of children in Ahvaz-Iran. This approach provides an easily interpretable empirical description and illustrates explicitly the conditional independence structure between each pair of variables. The interpretation can be read directly from a mathematical graph. Besides examining the direct association of each determinant with birth interval, we also examine the effects of socio-economic determinants on intermediate determinants to understand the pathways through which the socio-economic determinants affect the birth interval. The data analysed come from a sample of women referred to "Health and Medical Centres" during October and November 2002.

### Journal of Data Science, v.5, no.4, p.577-596

#### Impact of Foreign Direct Investment on Regional Innovation Capability: A Case of China

##### by Yufen Chen

- Full Text (PDF): [189.54kB]

Foreign direct investment (FDI) has traditionally been considered an important channel for the diffusion of advanced technology. Whether it can promote technological progress in the host country is a question that has attracted much attention. This paper analyzes the relationship between FDI and regional innovation capability (RIC). We find that the spillover effects of FDI are not as significant as is usually thought. The impact of FDI on RIC is weak; the entry of FDI does not enhance indigenous innovation capability. Moreover, inward FDI might have a crowding-out effect on innovation and domestic R&D activity. The research shows that increasing domestic R&D inputs and strengthening the innovation capabilities and absorptive capacity of domestic enterprises are the determinants of improved RIC.

### Journal of Data Science, v.5, no.4, p.597-612

#### Textbooks on Differential Calculus in Eighteenth Century Europe: A Comparative Stylistic Analysis

##### by Monica Blanco Abellan

- Full Text (PDF): [165.50kB]

Comparative mathematical textbook analysis aims to determine differences among countries in the development and transmission of mathematics. Textual statistics, in turn, provides a means to quantify a text by applying multivariate statistical techniques. So far this statistical approach has not been applied to comparative mathematical textbook analysis. The object of this paper is to quantify and compare the style of a number of textbooks on differential calculus written in 18th century Europe. For that purpose two multivariate statistical techniques have been applied: 1) simple correspondence analysis and 2) hierarchical clustering analysis. The results of both analyses help to detect some interesting associations among the analysed textbooks.