Journal of Data Science, v.1, no.3, p.231-248
Exploratory Model Selection for Spatially Designed Experiments --- Some Examples
by Walter T. Federer
- Full Text (PDF): [112.84kB]
Exploratory model selection was used to find a response model that accounted for the spatial variability present in the experimental results from four examples of spatially designed field experiments. The class of differential gradients within incomplete blocks was useful for finding such a response model in the first example. The class of orthogonal polynomial regressions of response on row and column position, together with interactions of these regressions, was useful for discovering an appropriate response model for the data of examples two, three, and four. The results obtained from the selected response model were compared with standard textbook analyses. Considerable differences in residual mean squares, coefficients of variation, and F-values for treatment to residual mean squares were found. The increase in replication for the selected response model over the textbook response model is demonstrated. The increase can be manyfold.
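As a rough illustration of the regressors this abstract describes, the sketch below builds orthonormal polynomial contrasts over row and column positions and forms one interaction term. The function name `orthogonal_polynomials`, the 4 x 5 grid, and the choice of Gram-Schmidt orthogonalization are assumptions made for the example; the abstract does not say how the paper constructs its polynomials.

```python
def orthogonal_polynomials(positions, degree):
    """Return orthonormal polynomial contrasts of order 1..degree.

    Gram-Schmidt is applied to the powers of the centered positions.
    `degree` must be smaller than the number of distinct positions.
    """
    n = len(positions)
    center = sum(positions) / n
    centered = [x - center for x in positions]
    basis = [[1.0 / n ** 0.5] * n]  # normalized constant term
    for power in range(1, degree + 1):
        vec = [c ** power for c in centered]
        # Remove the components along all lower-order terms.
        for b in basis:
            proj = sum(v * e for v, e in zip(vec, b))
            vec = [v - proj * e for v, e in zip(vec, b)]
        norm = sum(v * v for v in vec) ** 0.5
        basis.append([v / norm for v in vec])
    return basis[1:]  # drop the constant term


# Hypothetical field layout: 4 rows by 5 columns of plots.
n_rows, n_cols = 4, 5
row_polys = orthogonal_polynomials(list(range(1, n_rows + 1)), 2)
col_polys = orthogonal_polynomials(list(range(1, n_cols + 1)), 2)

# Linear-row by linear-column interaction, one value per plot (r, c).
interaction = [
    [row_polys[0][r] * col_polys[0][c] for c in range(n_cols)]
    for r in range(n_rows)
]
```

Each contrast sums to zero and is orthogonal to the others, so a row, column, or interaction term can be added to or dropped from a candidate response model without changing the estimates of the other polynomial terms in a complete rectangular layout.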
Journal of Data Science, v.1, no.3, p.249-260
Analysis Methods for Supersaturated Design: Some Comparisons
by Runze Li and Dennis K. J. Lin
- Full Text (PDF): [100.70kB]
Supersaturated designs (SSDs) are very cost-effective with respect to the number of runs and as such are highly desirable in many preliminary studies in industrial experimentation. Variable selection plays an important role in analyzing data from supersaturated designs. Traditional approaches, such as best subset variable selection and stepwise regression, may not be appropriate in this situation. In this paper, we introduce a variable selection procedure that screens active effects in SSDs via a nonconvex penalized least squares approach. An empirical comparison with Bayesian variable selection approaches is conducted. Our simulation shows that the nonconvex penalized least squares method compares very favorably with the Bayesian variable selection approach proposed in Beattie, Fong and Lin (2001).
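The abstract does not name the penalty. As one widely used example of a nonconvex penalty, the sketch below implements the SCAD thresholding rule for a single coefficient, which is the closed-form solution only when the design columns are orthonormal. That assumption does not hold for a supersaturated design, so this shows how the penalty treats small and large effects, not the paper's fitting procedure. The function name and the default `a = 3.7` are assumptions for the example.

```python
def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding of a least squares estimate z (orthonormal design).

    lam is the tuning parameter; a must be greater than 2.
    """
    sign = 1.0 if z >= 0 else -1.0
    abs_z = abs(z)
    if abs_z <= 2 * lam:
        # Small estimates: soft thresholding, which can set the effect to zero.
        return sign * max(abs_z - lam, 0.0)
    if abs_z <= a * lam:
        # Intermediate estimates: the shrinkage is gradually relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large estimates are left unshrunk.
    return z


small = scad_threshold(0.5, 1.0)   # screened out: 0.0
medium = scad_threshold(1.5, 1.0)  # shrunk to 0.5
large = scad_threshold(5.0, 1.0)   # kept at 5.0
```

Unlike a convex L1 penalty, this rule leaves large effects unshrunk while still removing small ones, which is what makes it suited to screening a few active effects among many candidates.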
Journal of Data Science, v.1, no.3, p.261-274
Estimation of the Proportion of Sterile Couples Using the Negative Binomial Distribution
by Mohammad Fraiwan Al-Saleh and Fatima Khalid AL-Batainah
- Full Text (PDF): [116.26kB]
A sterile family is a couple that has no children, either by deliberate choice or because of biological infertility. Couples who are childless by chance are not considered sterile. The objective is to estimate the proportion of sterile couples in Jordan indirectly from the 1994 population census by separating childless couples into two types, sterile and fertile. Three methods of fitting a negative binomial distribution to the completed family size data obtained from the 1994 population census are investigated. The third method appeared to give the best fit. Based on the fitted distribution, the proportion of sterile couples is estimated at 6.1% of all couples. This is much lower than the corresponding estimate for the USA, which is 11%. The difference may be due to socio-cultural factors influencing the deliberate choice of couples to have no children. The method of estimation can be applied to other populations.
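The separation of childless couples into sterile and fertile can be sketched with a simple mixture identity. If a fraction s of couples is sterile and the family size of fertile couples follows a negative binomial distribution, the observed childless proportion is s + (1 - s) P(0), where P(0) is the negative binomial probability of zero children. The code below solves this for s. It is not one of the paper's three fitting methods, which the abstract does not describe, and the parameter values are hypothetical, not census figures.

```python
def nb_zero_probability(size, mean):
    """P(X = 0) for a negative binomial with the given size and mean."""
    return (size / (size + mean)) ** size


def sterile_proportion(observed_childless, size, mean):
    """Solve observed = s + (1 - s) * P(0) for the sterile fraction s."""
    p_zero = nb_zero_probability(size, mean)
    return (observed_childless - p_zero) / (1.0 - p_zero)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
p0 = nb_zero_probability(size=2.0, mean=2.0)       # 0.25 childless by chance
s = sterile_proportion(0.40, size=2.0, mean=2.0)   # 0.2
```

In practice the negative binomial parameters would themselves have to be estimated from the family size data, which is where the choice of fitting method matters.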
Journal of Data Science, v.1, no.3, p.275-292
Imputation Allowing Standard Variance Formulas
by Michael P. Cohen
- Full Text (PDF): [128.08kB]
Although deletion of cases is still a common method of dealing with item nonresponse, imputation is a major alternative. With traditional methods of imputation, though, the usual variance formulas understate the variance of estimates. This paper proposes that items be imputed from distributions more diffuse than those of the real data, thereby compensating for the underestimation of variance by the usual formulas. The impact on covariances is considered in the design of the method. The method is intended for use by data analysts applying techniques based on functions of first and second moments of means only.
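The understatement of variance mentioned in this abstract is easy to see for the simplest traditional method, mean imputation. With n observed values and N values after imputation, the usual sample variance of the completed data is the observed sample variance multiplied by (n - 1)/(N - 1). The sketch below checks this on hypothetical numbers. It illustrates the problem only and does not implement the diffuse-distribution method the paper proposes.

```python
from statistics import mean, variance


def mean_imputation_shrinkage(observed, n_missing):
    """Ratio of the completed-data sample variance to the observed one
    when every missing item is replaced by the observed mean."""
    completed = observed + [mean(observed)] * n_missing
    return variance(completed) / variance(observed)


# Hypothetical item with 4 respondents and 4 nonrespondents.
observed = [2.0, 4.0, 6.0, 8.0]
ratio = mean_imputation_shrinkage(observed, n_missing=4)  # (4 - 1) / (8 - 1)
```

Imputing from a distribution that is more diffuse than the real data, as the abstract proposes, is aimed at offsetting this kind of shrinkage so that the standard formulas can be applied to the completed data.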
Journal of Data Science, v.1, no.3, p.293-312
Comparison of Two Multiple Imputation Procedures in a Cancer Screening Survey
by Coen A. Bernaards, Melissa M. Farmer, Karen Qi, Gareth S. Dulai, Patricia A. Ganz and Katherine L. Kahn
- Full Text (PDF): [142.30kB]
In survey research, one or more researchers commonly conduct several different analyses on the same data set. The conclusions from these analyses should be consistent despite the presence of missing data. Multiple imputation is frequently used to ensure consistency of analyses. Two methods for multiple imputation of missing data are a combination of hot deck and regression imputation, and multivariate normal multiple imputation. It is unknown whether these methods will give similar results in practical situations with large numbers of variables. We applied both multiple imputation methods to a cancer screening survey data set with 2 continuous items, 48 Likert scale items, and 74 binary response items. Correlations and variances of the imputed data sets were compared in a first attempt to investigate the similarity of the imputation methods. The results of both methods were found to be similar; either method is endorsed for surveys similar to the data set presented.
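For context on how analyses of multiply imputed data are usually combined, the sketch below applies the standard pooling rules (often called Rubin's rules): the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The abstract does not describe this step, so treat it as background under that assumption. The function name and input values are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, within_variances):
    """Pool m point estimates and their variances from m imputed data sets."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)   # average sampling variance
    between = variance(estimates)     # variance across imputed data sets
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets.
pooled, total = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Two imputation methods that produce similar correlations and variances across imputed data sets, as compared in the abstract, would be expected to feed similar inputs into this pooling step.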
Journal of Data Science, v.1, no.3, p.313-336
Analysis of Bank Failure Using Published Financial Statements: The Case of Indonesia (Part 2)
by Loso Judijanto and E. V. Khmaladze
- Full Text (PDF): [208.29kB]
- Data-1 (DOC): [51.00kB]
- Data-2 (XLS): [165.00kB]
- Data-3 (XLS): [178.00kB]
The published financial statement is the only publicly available report on the financial condition of a bank operating in Indonesia. It contains limited information, but we want to exploit it to discriminate between normal, problem, and liquidated banks and to find the factors underlying these conditions. We observed 213 banks and analysed 42 initial variables representing earning and profitability, productivity and efficiency, quality of assets, capital adequacy, growth and aggressiveness, credibility, size, income and source-of-fund diversification, liquidity, and dependence on affiliates. In the classification we used the ranks of each variable rather than its numerical value as such. After studying the characteristics of the variables theoretically, applying certain statistical tests, making necessary transformations, creating new variables and deleting unnecessary variables, we found that the ranks of 12 variables out of the initial 42 could discriminate significantly among the three groups of banks two years before failure, while the ranks of just two variables could discriminate significantly one year before failure. We considered three major groups of variables in our first paper. In this second paper we start with capital adequacy variables and consider altogether six groups of variables. We then show that it is sufficient to select seven basic aspects of the financial structure and performance of a bank, which can be efficiently and consistently measured by variables with a simple and clear intuitive meaning (see the list of abbreviations below in the text). These are: efficiency in productivity and earning (ranks of EBT/SE, PM, ROE and ROEA), capital adequacy (ranks of E/EA and E/L), interest gap (ranks of IM and NII/L), credibility (ranks of ARCF), liquidity (ranks of LA/D), dependence on affiliates (ranks of NFA/L), and security of earning assets (ranks of PLL/L).
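Replacing each variable by its rank across banks, as described above, can be sketched as follows. The abstract does not say how ties were handled; average ranks are assumed here, and the ratio values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


# Hypothetical values of one ratio for four banks.
ratio_values = [0.12, -0.03, 0.40, 0.12]
ratio_ranks = average_ranks(ratio_values)  # [2.5, 1.0, 4.0, 2.5]
```

Working with ranks makes the classification insensitive to extreme values and to monotone transformations of a ratio, which can matter when financial ratios are heavily skewed.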
Journal of Data Science, v.1, no.3, p.337-360
A Survey for Technological Innovation in Taiwan
by Hsien-Ta Wang, Tsui Mu, Li-Kung Chen, Tzy-Mei Lin, Chih-Ming Chiang, Hsin-Neng Hsieh, Yu-Ting Cheng and Ben-Chang Shia
- Full Text (PDF): [127.37kB]
Statistical data on R&D in Taiwan has been formally incorporated into the OECD/MSTI database (Main Science and Technology Indicators). Our surveys and analysis of R&D activities are clear and complete. However, the development of the knowledge economy, aside from the R&D activities themselves, depends on the production, circulation, and application of knowledge and/or technology. A direct relationship between the new knowledge produced by R&D activities and industrial sector products or manufacturing process innovations is difficult to visualize. R&D activities are high in risk and their outcome is uncertain; there is no assurance that investment in R&D will yield innovative products for the marketplace. By contrast, enterprises without any investment in R&D activities can still bring out products that are new in technology.
Thus, in the 1990s the EU began for the first time to develop the Community Innovation Survey (CIS). The main goal was to collect data on how enterprises in different countries invest in the technology innovation process, and on the products of this process. The results of the analysis of such surveys can contribute to the development of innovation policies and to the spread and transfer of new technology. The industrial structure of Taiwan is composed mainly of small and medium-size enterprises, which account for 98% of all enterprises. Major R&D activity is concentrated in larger organizations, as the 300 largest domestic enterprises account for about 70% of R&D spending. According to the results of the first Taiwan Technological Innovation Survey (TTIS1), in the three years from 1998 to 2000, overall 50.2% of business enterprises engaged in technology innovation activity. In this survey, the definition of "technology innovation activity" of the Organization for Economic Cooperation and Development (OECD) was used (for enterprises with over 20 employees, the results were weighted according to the number of enterprises in a stratum). The figures for the manufacturing sector and service sector were 51.1% and 49.3% respectively.
Looking at the results weighted by number of employees, in the three years from 1998 to 2000, for all enterprises with 20 employees or more, approximately 63.7% of employees had engaged in technology innovation activity; the figures for the manufacturing and service sectors were 68.3% and 58.6% respectively.
For funds spent on innovation activity, in the year 2000, the total invested by enterprises with 20 employees or more was approximately 563.86 billion NT$, which accounted for 2.81% of the total revenues of these enterprises. This is an underestimate. Looking specifically at the manufacturing sector, technology innovation activity spending accounted for 4.08% of revenues, versus 1.84% for the service sector. As for constraints on innovation, or factors hampering innovation, the most important factor was the lack of appropriate technology and the percentage of R&D employees. The second most important factor was excessive economic risk. The main information source for most enterprises was customers or consumers; the next most important was sources within the company itself.