#### Special Issue: Data mining in Chemistry and traditional Chinese medicine

##### Guest editors: K.-T. Fang and Y.-Z. Liang

- A note from the guest editors (PDF): [30.83kB]

### Journal of Data Science, v.1, no.4, p.361-389

#### The Matrix Expression, Topological Index and Atomic Attribute of Molecular Topological Structure

##### by Qian-Nan Hu, Yi-Zeng Liang, and Kai-Tai Fang

- Full Text (PDF): [202.24kB]

This paper reviews the matrix expressions, topological indices and atomic attributes used to describe molecular topological structure. Nine matrices, twenty-six kinds of indices and eight methods for handling weighted molecular graphs are summarized in three tables. Three shortcomings of topological indices are discussed: (1) their physicochemical meaning is not explicit; (2) QSAR and QSPR models derived from them are difficult to interpret; and (3) they usually neglect stereochemical information or the three-dimensional structure of the molecule. Three directions of research on topological indices are highlighted: (1) description of local information; (2) studies of the inter-correlation among topological indices; and (3) variable indices.
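As one concrete illustration of how a topological index is derived from a matrix expression, the sketch below computes the Wiener index (the sum of shortest-path distances between all pairs of atoms) from the distance matrix of a hydrogen-suppressed molecular graph. The Wiener index is chosen here only because it is a widely known example; the abstract does not list which indices the review covers, and the function names and adjacency-list encoding are assumptions made for this example.

```python
from collections import deque


def distance_matrix(adjacency):
    """All-pairs shortest-path distances (counted in bonds) via BFS from each atom."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def wiener_index(adjacency):
    """Sum of the upper triangle of the distance matrix."""
    d = distance_matrix(adjacency)
    n = len(d)
    return sum(d[i][j] for i in range(n) for j in range(i + 1, n))


# Hydrogen-suppressed carbon skeletons, atoms numbered from 0 (illustrative encoding).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))
```

The two isomers have the same atoms but different index values, which is the sense in which such an index encodes branching. It also illustrates shortcoming (3): nothing in the graph distance carries three-dimensional information.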

### Journal of Data Science, v.1, no.4, p.391-404

#### Boosting Applied to Classification of Mass Spectral Data

##### by K. Varmuza, Ping He and Kai-Tai Fang

- Full Text (PDF): [117.20kB]

Boosting is a machine learning algorithm that is not well known in chemometrics. We apply boosting trees to the classification of mass spectral data. In the experiment, the recognition of 15 chemical substructures from mass spectral data is considered. The performance of boosting is very encouraging: compared with previous results, boosting significantly improves the accuracy of classifiers based on mass spectra.
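For readers new to boosting, the following is a minimal sketch of the general idea, assuming the AdaBoost variant with one-level decision trees (stumps) as weak learners. The abstract does not say which boosting variant or tree depth the authors used, and the data here is a made-up one-dimensional toy set, not mass spectra; all function names are hypothetical.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best threshold stump."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] <= t else -sign


def adaboost(X, y, rounds=3):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the positive class lies on both sides of the negative class,
# so no single threshold can separate the two.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

A single stump misclassifies at least two of these six points, whereas the weighted vote of three stumps fits all of them, which is the mechanism by which boosting improves on a weak base classifier.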

### Journal of Data Science, v.1, no.4, p.405-423

#### Application of Orthogonal Block Variables and Canonical Correlation Analysis in Modeling Pharmacological Activity of Alkaloids from Plant Medicines

##### by Qian-Nan Hu, Yi-Zeng Liang, Xiao-Ling Peng, Yin Hong and Lian Zhu

- Full Text (PDF): [138.45kB]

A new kind of orthogonal block variable, derived from subspace projection and canonical correlation analysis, is applied to model the pharmacological activity of alkaloids from plant drugs. The alkaloids are grouped into three cases according to the route of injection: intravenous, intraperitoneal, and subcutaneous. The four block variables (families of variables) investigated in this work are the valence molecular connectivity index, the alpha kappa index, the E-State index and the element counts of the molecules. A regression model of pharmacological activity that uses only a few of the new orthogonal block variables shows a significant improvement, in both fitting and prediction ability, over multiple linear regression (MLR) on the original variables, principal component regression (PCR), and models that select only one or two of the original variable families. The reason may be that the new orthogonal block variables retain almost all of the information in the original variables but without collinearity between them.

### Journal of Data Science, v.1, no.4, p.425-445

#### The Classification Tree Combined with SIR and Its Applications to Classification of Mass Spectra

##### by Ping He, Kai-Tai Fang and Cheng-Jian Xu

- Full Text (PDF): [164.43kB]

A new approach combining the classification tree (CT) with sliced inverse regression (SIR) is proposed and applied to the classification of mass spectra. The classification tree has been widely used to generate classifiers from mass spectral data because of its powerful capacity for automatic variable selection and automatic interaction detection. However, it is often weak at representing linear and global relationships among variables. When the variables enter a model as a linear combination, the classification tree cannot detect this form, which leads to low accuracy. SIR is an effective method for finding linear combinations of predictor variables that are useful for regressing the response variable. Merging CT and SIR can therefore inherit the advantages of both. Experiments in the paper show that the proposed approach improves the classification accuracy of the decision tree and gives better results than other classical classification methods.

### Journal of Data Science, v.1, no.4, p.447-460

#### Multivariate Chemometric Study on the Interfacial Properties of Nucleic-Acid Bases

##### by Hai-Bin Luo, Yuen-Ting Fong, and Yuen-Kit Cheng

- Full Text (PDF): [749.58kB]

Systematic quantitative structure-retention relationship studies of nucleic acid bases were carried out by the combined use of multivariate analysis and experimental chromatographic techniques. The results revealed a multiple linear relationship between the chromatographic retention and the molecular structural parameters, yielding a regression $R^2$ value of 0.8113 (cross-validated $Q^2$ = 0.6945). Five molecular descriptors, viz., moment of inertia ($I_x, I_y \textrm{ and } I_z$), molar volume, and polar surface area, are able to account for the retention behavior of the compounds. Principal component analysis and factor analysis results indicate that the descriptors moment of inertia and molar volume have a primary influence on the chromatographic retention. The results provide useful insights for future experimental and theoretical studies in the medicinal research of nucleic-acid base compounds.
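For reference, the two statistics quoted in this abstract are conventionally defined as below. These are the standard textbook forms, assuming leave-one-out cross-validation for $Q^2$; the abstract does not state which cross-validation scheme the authors used. Here $y_i$ is the observed retention of compound $i$, $\bar{y}$ is the mean of the observations, $\hat{y}_i$ is the value fitted by the full model, and $\hat{y}_{(i)}$ is the value predicted by a model fitted with compound $i$ left out.

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2},
\qquad
Q^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_{(i)})^2}{\sum_{i} (y_i - \bar{y})^2}
$$

Because each $\hat{y}_{(i)}$ comes from a model that never saw compound $i$, $Q^2$ is normally lower than $R^2$, and the gap between the two values gives a rough indication of how much the fit depends on individual compounds.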

### Journal of Data Science, v.1, no.4, p.461-480

#### Boiling Points Predictions Study via Dimension Reduction Methods: SIR, PCR and PLSR

##### by Hong Yin, Yi-Zeng Liang and Qinnan Hu

- Full Text (PDF): [218.20kB]

Variable selection is an important tool in QSAR. In this article, we employ three known techniques, sliced inverse regression (SIR), principal components regression (PCR) and partial least squares regression (PLSR), to build models that predict the boiling points of 530 saturated hydrocarbons. With 122 topological indices as input variables, our results show that these three methods perform well and outperform some existing methods in the literature.

### Journal of Data Science, v.1, no.4, p.481-496

#### Data Mining in Chemometrics: Sub-structures Learning via Peak Combinations Searching in Mass Spectra

##### by Yu Tang, Yi-Zeng Liang and Kai-Tai Fang

- Full Text (PDF): [163.45kB]

In this paper, a new approach to finding sub-structures in chemical compounds by searching for peak combinations in mass spectra is presented. Based on these peak combinations, further identification and classification methods are also proposed. As an application of these methods, saturated alcohols and ethers are classified efficiently using a variable selection method.

### Journal of Data Science, v.1, no.4, p.497-509

#### Building an Honest Tree for Mass Spectra Classification Based on Prior Logarithm Normal Distribution

##### by Cheng-Jian Xu, Ping He and Yi-Zeng Liang

- Full Text (PDF): [171.06kB]

Structure elucidation is one of the major tasks for analytical researchers, and it often needs an efficient classifier. The decision tree is especially attractive because it is easy to understand and gives an intuitive representation. However, a small change in the data set due to experimental error can often result in a very different series of splits. In this paper, a prior logarithm normal distribution is adopted to weight the original mass spectra. It helps to build an honest tree for later structure elucidation.