
Open Access Research. First Online: 23 September 2015. DOI: 10.1186/1755-8794-8-S3-S4
Li et al. BMC Medical Genomics 2015, 8(Suppl 3):S4 http://www.biomedcentral.com/1755-8794/8/S3/S4

RESEARCH

Open Access

Patient classification of hypertension in Traditional Chinese Medicine using multi-label learning techniques

Guo-Zheng Li1,2†, Zehui He1†, Feng-Feng Shao2, Ai-Hua Ou1*, Xiao-Zhong Lin3

From IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014), Belfast, UK, 2-5 November 2014

Abstract

Background: Hypertension is one of the major risk factors for cardiovascular diseases. Research on the patient classification of hypertension has become an important topic because the practice of Traditional Chinese Medicine rests primarily on "treatment based on syndrome differentiation of the patients".

Methods: Clinical data on hypertension were collected, covering 12 syndromes and 129 symptoms including inspection, tongue, inquiry, and palpation symptoms. Syndrome differentiation was modeled as a patient classification problem in the field of data mining, and a new multi-label learning model, BrSmoteSvm, was built to deal with the class imbalance of the dataset.

Results: The experiments showed that BrSmoteSvm achieved better results than other multi-label classifiers on the evaluation criteria of Average precision, Coverage, One-error, and Ranking loss.

Conclusions: BrSmoteSvm models the syndrome differentiation of hypertension better by taking the class-imbalance problem into account.

Background

Hypertension is one of the major risk factors for cardiovascular diseases. It contributes to one half of the coronary heart disease burden and approximately two thirds of the cerebrovascular disease burden [1]. There are over 972 million hypertension patients in the world [2]. Traditional Chinese Medicine (TCM) has been playing an important role in treating hypertension, and its practice rests primarily on "treatment based on syndrome differentiation of the patients". Traditionally, syndrome differentiation is performed by TCM practitioners, who must have a solid theoretical foundation and plentiful experience. In the field of data mining, syndrome differentiation can be regarded as a patient classification problem

* Correspondence: [email protected]
† Contributed equally
1 Department of Big Medical Data, Guangdong Provincial Hospital of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
Full list of author information is available at the end of the article

which can be solved with specific data mining and machine learning techniques. It has become a fast-developing field with the accumulation of clinical data [3-6]. In traditional classification problems, each case is assigned to only one category (i.e. label), which is called single-label classification. In TCM, however, one patient may have more than one syndrome, so syndrome differentiation is a multi-label classification problem in the data mining field. Multi-label learning has been used in the TCM field and has achieved better results than conventional learning methods. Liu et al. compared the performance of Multi-label-KNN and KNN on a coronary heart disease dataset. Li et al. investigated the contribution of symptoms to syndrome diagnosis using symptom fusion with the ML-KNN classifier [7]. Li et al. and Shao et al. proposed the embedded multi-label feature selection method MEFS [8] and the wrapper multi-label feature selection method HOML [9], respectively, to achieve better performance in multi-label classification.

© 2015 Li et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


In the past, multi-label classification was mainly motivated by the tasks of text categorization and medical diagnosis. The existing methods for multi-label classification can be grouped into two main categories: a) problem transformation methods, and b) algorithm adaptation methods. Problem transformation methods transform the multi-label classification problem into one or more single-label classification or regression problems, and many learning algorithms build on such transformations. Algorithm adaptation methods extend specific learning algorithms to deal with multi-label data directly [10].

In classification, a dataset is said to be imbalanced when the number of cases representing one class is much smaller than the number from the other classes [11]. Furthermore, the class with the lowest number of cases is usually the class of interest from the point of view of the learning task [12]. This phenomenon is of great interest as it turns up in many real-world classification problems, such as risk management [13], fraud detection [14], and especially medical diagnosis [15-19].

In this study, a new classification method named BrSmoteSvm is built for hypertension syndrome differentiation. BrSmoteSvm handles both multi-label data and the class-imbalance problem. It is a combination of Binary Relevance (BR), the Synthetic Minority Over-sampling Technique (SMOTE) [16] and the Support Vector Machine (SVM) [17]. First, the BR algorithm is used to transform the multi-label classification problem into single-label classification problems, which turn out to be class-imbalanced. Then, SMOTE is applied to decrease the effect of the class-imbalance problem. Finally, SVM is used as the binary classifier to differentiate the syndromes.

The rest of this paper is arranged as follows. Section 2 describes the materials and the methods of this study. Section 3 presents the results and discussion of our experiment.
Section 4 presents the conclusions.

Methods

Materials

The study dataset originated from hypertension patients who visited the in-patient and out-patient departments of Internal Medicine, Nerve Internal Medicine and the Health Management Center of the Guangdong Provincial Hospital of Chinese Medicine and the Li Wan District Community Hospital in Guangzhou, China, between November 2006 and December 2008. This study was approved by the ethics committee of the Guangdong Provincial Hospital of Chinese Medicine, China. Informed written consent was obtained from each participant prior to data collection. In total, 908 cases were collected with 13 syndromes and 129 TCM symptoms


from inspection symptoms, tongue symptoms, inquiry symptoms, palpation symptoms and other symptoms. Four cases were excluded from the analysis because of missing answers on features, and one syndrome was excluded because of its nonnumeric value, to ensure the smooth application of data mining methods. Finally, we obtained 904 cases with 12 syndromes and 129 symptoms.

Table 1 shows the number of cases (N); the number of features (M); the number of labels (|L|); the Label Cardinality (LC), which is the average number of labels associated with each example, defined by

LC(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |Y_i| ;

the Label Density (LD), which is the cardinality normalized by the number of labels, defined by

LD(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i|}{|L|}, \qquad L = \bigcup_{i=1}^{|D|} Y_i ;

and the number of Distinct Combinations (DC) of labels. Here |D| represents the number of examples and |Y_i| represents the number of labels of the i-th case.

Computational methods

In multi-label classification, each case may have several syndromes. The cases are associated with a subset of labels Y ⊆ L, where L is the set of possible labels. What follows is a brief introduction to the algorithms used in this study.

1) SMOTE

SMOTE is used to decrease the influence of the class-imbalance problem. It is an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples. The main idea of SMOTE can be described as follows.

Step 1: Compute the k nearest neighbors of each minority class instance. Randomly choose N of the k nearest neighbors of each minority class instance and save them as Populate.

Step 2: Take the difference between the feature vector of each minority class instance and its nearest neighbors in Populate. Multiply this difference by a random number between 0 and 1, and add it to the feature vector of the minority class instance.

The synthetic examples generated by SMOTE cause the classifier to create larger and less specific decision regions rather than smaller and more specific ones. More general regions are thus learned for the minority class samples, rather than those samples being subsumed by the majority class samples around them.

Table 1. Description of the datasets

Dataset       Domain   N    M    |L|  LC    LD    DC
hypertension  medical  904  129  12   0.86  0.07  57
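The LC and LD values in Table 1 follow directly from the definitions above; a small self-contained check (the toy label sets below are illustrative, not the real data):

```python
def label_cardinality(label_sets):
    """LC(D) = (1/|D|) * sum(|Y_i|): average labels per example."""
    return sum(len(y) for y in label_sets) / len(label_sets)

def label_density(label_sets):
    """LD(D): label cardinality normalized by the total label count |L|."""
    all_labels = set().union(*label_sets)
    return label_cardinality(label_sets) / len(all_labels)

# Example: 4 cases, 3 possible syndromes
Y = [{"A"}, {"A", "B"}, {"C"}, {"B", "C"}]
print(label_cardinality(Y))  # 1.5
print(label_density(Y))      # 0.5
```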


2) SVM

SVM is used as the binary classifier in BR. The original SVM algorithm was invented by Vladimir N. Vapnik, and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1995. The basic SVM takes a set of input data with associated labels and, for each given input, predicts which of two possible classes it belongs to, making it a non-probabilistic binary linear classifier. Given a set of training instances, each marked as belonging to one of two classes, an SVM training algorithm builds a model that can assign new instances to one class or the other. An SVM model is a representation of the instances as points in space, mapped so that the instances of the separate classes are divided by a clear gap that is as wide as possible. Test instances are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall on. This describes a linear classification; in addition, SVM can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping its inputs into high-dimensional feature spaces.

3) BrSmoteSvm

The main idea of BrSmoteSvm is described as follows. In each fold of the 10-fold cross validation, BR, a problem transformation method, is applied first. The basic idea of BR is to decompose the multi-label learning problem into q independent binary classification problems, where q is the number of labels and each binary classification problem corresponds to a possible label in the label space [18]. Therefore, for any multi-label training example, each instance is involved in the learning process of q binary classifiers. Then SMOTE is applied to the training data to decrease the effect of the class-imbalance problem. Finally, SVM is used as the binary classifier. After the 10-fold cross validation, we obtain the predicted label set.

Experimental design and evaluation
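As a concrete sketch of the BR + SMOTE + SVM training step used in the experiments, the fragment below trains one binary SVM per label after oversampling that label's positive (minority) class. All names here are illustrative: scikit-learn's `SVC` stands in for the paper's SVM, and a minimal interpolation routine stands in for full SMOTE.

```python
import numpy as np
from sklearn.svm import SVC

def _oversample(X_min, k, n_new, rng):
    """SMOTE-style interpolation: n_new synthetic points per minority sample."""
    out = []
    for x in X_min:
        d = np.linalg.norm(X_min - x, axis=1)
        nbrs = X_min[np.argsort(d)[1:k + 1]]  # skip x itself
        for _ in range(n_new):
            nb = nbrs[rng.integers(len(nbrs))]
            out.append(x + rng.random() * (nb - x))
    return np.array(out)

def br_smote_svm_fit(X, Y, k=3, n_new=1, seed=0):
    """Binary Relevance: one SVM per label column of Y, trained on data
    whose positive class has been oversampled.  Assumes each label has
    both classes present and more than k positive examples."""
    rng = np.random.default_rng(seed)
    models = []
    for j in range(Y.shape[1]):
        y = Y[:, j]
        X_syn = _oversample(X[y == 1], k, n_new, rng)
        Xr = np.vstack([X, X_syn])
        yr = np.concatenate([y, np.ones(len(X_syn))])
        models.append(SVC(kernel="linear").fit(Xr, yr))
    return models

def br_predict(models, X):
    """Stack the q binary predictions back into a multi-label matrix."""
    return np.column_stack([m.predict(X) for m in models])
```

In the paper's setup this whole fit/predict cycle runs inside each fold of the 10-fold cross validation.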

In our experiment, 10-fold cross validation is utilized to test the accuracy of the classification. In each fold, 700 cases of the data serve as the training set and 204 cases as the testing set. In order to validate the performance of BrSmoteSvm, it is compared with other popular multi-label classifiers:

1) ML-KNN. The number of neighbors is set to 10 and the smoothing factor is set to 1, as recommended.

2) Random k-Labelsets (RAKEL) [19]. J48 is used as the base learner; the number of models is set to 5; the size of subsets is set to 8.

3) Instance-based learning and logistic regression (IBLR) [20]. The number of nearest neighbors is set to 10.

4) Ensemble of Classifier Chains (ECC). J48 is used as the base learner for each Classifier Chains model; the number of models is set to 10.


5) A lazy multi-label classification method based on KNN (BRKNN) [21]. The number of nearest neighbors is set to 10.

Finally, for SMOTE, N is set to the fixed value 10 and k is chosen from {10, 12, 14, 16, 18, 20}; then k is set to the fixed value 16 and N is chosen from {6, 8, 10, 12, 14, 16}, to evaluate the robustness of our method.

Let X denote the domain of cases and let Y = {1, 2, ..., Q} be the set of labels. The purpose of the learning system is to output a multi-label classifier h: X → 2^Y for the given training set by optimizing some specific evaluation metric. In other words, a successful learning system outputs larger values for labels in Y_i than for those not in Y_i for a given instance x_i and its label set Y_i; that is, f(x_i, y_1) > f(x_i, y_2) for any y_1 ∈ Y_i and y_2 ∉ Y_i. The real-valued function f(·,·) can be transformed into a ranking function rank(·,·), which maps the outputs of f(x_i, y) for any y ∈ Y to {1, 2, ..., Q} such that if f(x_i, y_1) > f(x_i, y_2) then rank(x_i, y_1) < rank(x_i, y_2).
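Three of the four criteria named in the Abstract (Coverage, Ranking loss, Average precision) are available in scikit-learn's ranking metrics; One-error is not, but it is easy to write out from the ranking function above. The toy score matrix below is invented purely for illustration:

```python
import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score)

def one_error(Y_true, scores):
    """Fraction of cases whose single top-ranked label is not a true label."""
    top = np.argmax(scores, axis=1)
    return float(np.mean([Y_true[i, top[i]] == 0 for i in range(len(Y_true))]))

# 2 cases, 3 labels; rows of scores play the role of f(x_i, .)
Y_true = np.array([[1, 0, 0],
                   [0, 1, 1]])
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.6]])

print(one_error(Y_true, scores))                              # 0.0
print(coverage_error(Y_true, scores))                         # 1.5
print(label_ranking_loss(Y_true, scores))                     # 0.0
print(label_ranking_average_precision_score(Y_true, scores))  # 1.0
```

Here every true label is ranked above every false one, so Ranking loss and One-error are 0 and Average precision is perfect.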