Volume 2, Number 2, April 2004

  • Selection of an Artificial Neural Network Model for the Post-calibration of Weather Radar Rainfall Estimation
  • A State Duration Model for Brand Choice and Inter-purchase Time
  • Wavelet Analysis of Tide-affected Low Streamflows Series
  • Identify Breast Cancer Subtypes by Gene Expression Profiles
  • Geostatistical Analysis of Chinese Cancer Mortality: Variogram, Kriging and Beyond
  • Statistical Analysis of Market Penetration in a Mandatory Privatized Pension Market Using Generalized Logistic Curves

Journal of Data Science, v.2, no.2, p.107-124

Selection of an Artificial Neural Network Model for the Post-calibration of Weather Radar Rainfall Estimation

by Masoud Hessami, Francois Anctil and Alain A. Viau

A statistical approach, based on artificial neural networks, is proposed for the post-calibration of weather radar rainfall estimation. The artificial neural networks tested include multilayer feedforward networks and radial basis networks. The multilayer feedforward training algorithms consisted of four variants of the gradient descent method, four variants of the conjugate gradient method, Quasi-Newton, One Step Secant, Resilient backpropagation, the Levenberg-Marquardt method and the Levenberg-Marquardt method using Bayesian regularization. The radial basis networks were the radial basis functions and the generalized regression networks. In general, results showed that the Levenberg-Marquardt algorithm using Bayesian regularization is a robust and reliable algorithm for the post-calibration of weather radar rainfall estimation. This method benefits from the convergence speed of the Levenberg-Marquardt algorithm and from the overfitting control of Bayes' theorem. All the other multilayer feedforward training algorithms failed, since they often overfitted or converged to a local minimum, which prevented them from generalizing from the data. Radial basis networks are also problematic since they are very sensitive when used with sparse data.

Journal of Data Science, v.2, no.2, p.125-147

A State Duration Model for Brand Choice and Inter-purchase Time

by Lynn Kuo and Zhen Chen

A new approach for analyzing state duration data in brand-choice studies is explored. This approach not only incorporates the correlation among repeated purchases for a subject, but also models the purchase timing and the brand decision jointly. The former is accomplished by applying transition model approaches from longitudinal studies, while the latter is done by conditioning on the brand choice variable. Mixed multinomial logit models and Cox proportional hazards models are then employed to model the marginal densities of the brand choice and the conditional densities of the inter-purchase time given the brand choice. We illustrate the approach using a Nielsen household scanner panel data set.

Journal of Data Science, v.2, no.2, p.149-163

Wavelet Analysis of Tide-affected Low Streamflows Series

by Yeo-Howe Lim and Leonard M. Lye

In certain rivers that drain very flat terrains in coastal areas, the streamflow series observed at a flow-gauging station may come under the direct influence of the backwater effects of tides. This phenomenon may be negligible under conditions of high flows but can be critical under some extreme low-flow conditions. The errors in low-flow estimation are large if proper de-noising is not implemented to remove the tidal effects. Scrutinizing the hydrologic time series using a standard Fourier transform methodology in the time-frequency domain cannot conclusively resolve the sources of the noise. However, a new perspective can be obtained by using a wavelet transformation to analyze the time series in the time-scale domain. By using this approach, a case study involving a streamflow series observed at Kapit, Sarawak, Malaysia yielded conclusive evidence of the influence of tides at the flow-gauging site during the low-flow period. Upon confirmation that the noise is indeed of tidal origin, the observed water level series was subjected to an appropriate wavelet-based de-noising procedure to derive a smoothed series. Then, together with an established rating curve, a de-noised discharge series could also be approximated. Low-flow quantiles were subsequently derived by fitting a suitable frequency distribution to the annual minimum series abstracted from the de-noised discharge series. The methodology presented illustrates the potential of using wavelet analysis methods in solving other similar problems.
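As a rough illustration of wavelet-based de-noising in general, the sketch below decomposes a series with the Haar wavelet, shrinks the detail coefficients by soft thresholding, and reconstructs a smoothed series. The abstract does not state which wavelet, threshold rule or number of levels the authors used, so the Haar basis, the single fixed `threshold` and the function names are all assumptions made for illustration only.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def inverse_haar_step(approx, detail):
    """Invert one Haar level, interleaving the reconstructed samples."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def denoise(signal, threshold, levels):
    """Illustrative de-noising; len(signal) must be divisible by 2**levels."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    shrunk_details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        shrunk_details.append([soft_threshold(d, threshold) for d in detail])
    # Rebuild from the coarsest level back to the original resolution.
    for detail in reversed(shrunk_details):
        approx = inverse_haar_step(approx, detail)
    return approx
```

With a threshold of zero the series is reconstructed unchanged; a larger threshold removes more of the short-scale fluctuation, which in the setting described above would correspond to the tidal component.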

Journal of Data Science, v.2, no.2, p.165-175

Identify Breast Cancer Subtypes by Gene Expression Profiles

by Grace S. Shieh, Chy-Huei Bai and Chih Lee

Support vector machines (SVMs), with linear, polynomial and radial kernels, were applied to classify subtypes of breast cancer by gene expression profiles of tissue samples. Using the top 500 genes ranked by the ratio of between-group to within-group sum of squares, SVMs with a linear kernel had an average accuracy rate of about 97% when applied to a balanced dataset; this accuracy rate was significantly higher than that obtained with the original data. After imputation, the smallest subsample of the balanced dataset was comparable in size to the other subsamples (containing more than 10 samples). In biomedical sciences, it is of interest to identify genes that can be used to classify subtypes of breast cancer well. Using SVMs, we identified 500 genes and looked up the functions of 297 of them in databases. About 65% of these 297 genes were known to be related to breast cancer, which confirms the consistency of our results with existing biomedical knowledge. The remaining 203 genes may also be investigated further to see if they are involved in breast cancer; any novel findings will be important.
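The gene-ranking step mentioned in the abstract can be sketched as follows. This assumes the common definition of the statistic, in which each gene is scored by the between-group sum of squares of the class means divided by the within-group sum of squares around those means; the abstract does not spell out the formula, and the function names and the dictionary layout of `expression` are hypothetical.

```python
from collections import defaultdict

def bss_wss_ratio(values, labels):
    """Between-group to within-group sum-of-squares ratio for one gene.

    values: expression levels, one per sample.
    labels: class label of each sample, in the same order.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    overall_mean = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        bss += len(members) * (group_mean - overall_mean) ** 2
        wss += sum((v - group_mean) ** 2 for v in members)
    # A gene with no within-group spread separates the classes perfectly.
    return bss / wss if wss > 0 else float("inf")

def rank_genes(expression, labels, top=500):
    """Return the names of the `top` genes with the largest ratio.

    expression: dict mapping gene name -> list of values (assumed layout).
    """
    ranked = sorted(
        expression,
        key=lambda gene: bss_wss_ratio(expression[gene], labels),
        reverse=True,
    )
    return ranked[:top]
```

The selected genes would then be passed to a classifier such as an SVM; that step is omitted here because it depends on the software used.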

Journal of Data Science, v.2, no.2, p.177-193

Geostatistical Analysis of Chinese Cancer Mortality: Variogram, Kriging and Beyond

by Dejian Lai

In this study, we used geostatistical tools to spatially analyze the Chinese cancer mortality rates. We first quantified the spatial variations of the observations using the variogram and fitted spherical and exponential parametric models to the sample variograms. Then, utilizing the fitted variogram function, we performed ordinary Kriging on the Chinese cancer mortality rates based on both models and produced a set of contour maps.
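For readers unfamiliar with the tools named in the abstract, the sketch below shows an empirical semivariogram and the two parametric models mentioned, in their textbook forms. The parameterization (`nugget`, `psill` for partial sill, `rng` for range), the binning scheme and the use of planar Euclidean distance are assumptions for illustration, not details taken from the paper.

```python
import math

def spherical(h, nugget, psill, rng):
    """Spherical variogram model; reaches nugget + psill at distance rng."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + psill
    r = h / rng
    return nugget + psill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, psill, rng):
    """Exponential variogram model; approaches nugget + psill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-h / rng))

def empirical_semivariogram(points, values, bin_width, n_bins):
    """Average half squared difference of values, binned by pair distance.

    Returns one estimate per distance bin, or None for empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distance = math.dist(points[i], points[j])
            b = int(distance // bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]
```

Fitting a model then amounts to choosing `nugget`, `psill` and `rng` so that the model curve tracks the binned estimates; the fitted function supplies the covariance structure that ordinary Kriging needs.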

Journal of Data Science, v.2, no.2, p.195-211

Statistical Analysis of Market Penetration in a Mandatory Privatized Pension Market Using Generalized Logistic Curves

by Victor M. Guerrero and Tapen Sinha

In this paper we analyze market penetration of the Mexican Pension System. This market is unique in two respects: it is mandatory and it is private. Very few markets in the world have these two characteristics. In Mexico, the pension system became privatized in July 1997. By the end of 1999, more than 98% of workers (in the formal sector) had affiliated themselves with some private pension provider. We used two simple statistical methods of analysis of market share to draw some conclusions about how market share unfolds in a mandatory but privatized market. Our first descriptive analysis is based on a generalized logistic growth curve and the second on some simple linear regression fitting. Our results show that individual pension funds did not have similar growth patterns. Early market leaders (in terms of market share) did not necessarily stay leaders in the end. However, the first 12 months turned out to be critical for market share.
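A generalized logistic growth curve of the kind referred to above can be sketched in a few lines. The abstract does not give the exact functional form the authors fitted, so the Richards-type parameterization below, along with the parameter names `lower`, `upper`, `growth`, `q` and `nu`, is an assumption chosen for illustration.

```python
import math

def generalized_logistic(t, lower, upper, growth, q=1.0, nu=1.0):
    """Richards-type growth curve (assumed form, not the paper's).

    lower, upper: asymptotes as t -> -inf and t -> +inf (for growth > 0).
    growth: growth rate; q shifts the curve in time; nu controls asymmetry.
    With q = 1 and nu = 1 this reduces to the ordinary logistic curve.
    """
    return lower + (upper - lower) / (1.0 + q * math.exp(-growth * t)) ** (1.0 / nu)

# Hypothetical market-share path over 24 months, rising toward a 20% ceiling.
shares = [generalized_logistic(month, 0.0, 20.0, 0.4, q=10.0) for month in range(25)]
```

Allowing `nu` to differ from 1 lets the curve approach its ceiling faster or slower than it left its floor, which is one way a single family of curves can describe funds with dissimilar growth patterns.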