IJICIS, Vol.15, No. 2 APRIL 2015

International Journal of Intelligent Computing and Information Sciences

USING ROUGH SET AND BOOSTING ENSEMBLE TECHNIQUES TO ENHANCE CLASSIFICATION PERFORMANCE OF HEPATITIS C VIRUS

M. E. Helal

M. Elmogy

R. M. Al-Awady

Information Systems Department, Faculty of Computers and Information, Mansoura University, Egypt

Information Technology Department, Faculty of Computers and Information, Mansoura University, Egypt

Electronics and Communications Department, Faculty of Engineering, Mansoura University, Egypt

[email protected]

[email protected]

[email protected]

Abstract- Machine learning techniques have been extensively applied to help medical experts diagnose many diseases. Classification is a machine learning technique used to predict the class to which a data sample belongs. It is an essential task in many applications, such as image classification and medical diagnosis. There are several classification techniques, such as SVM, C5.0, neural networks, k-nearest neighbor, and the Naive Bayes classifier. Feature selection for the classification of cancer data means discovering the feature values that distinguish malignant tumors from benign ones and using this knowledge to predict the state of new cases. In this paper, we use rough sets as a feature selection technique to create a feature subset from the original features. We then use the resulting subset with different classification and ensemble techniques to discover the classes of unknown data in an HCV data set. SVM, C5.0, and ensemble classifiers are used as the classification techniques. The percentage of accuracy, sensitivity, and specificity are used as evaluation parameters for the tested classification techniques. Experimental results show that the proposed hybrid RS-Boosting SVM technique achieves higher accuracy, sensitivity, and specificity rates with the selected feature subset than the other tested techniques.

Keywords: Rough Set Theory (RST), Feature Selection, Classification, Ensemble Classifier, C5.0, Support Vector Machine (SVM), Hepatitis C Virus (HCV).

1. Introduction

Machine learning technologies have become well suited for analyzing data. In the medical area, classification and treatment are the main tasks for a physician. Machine learning studies concentrate on learning how to recognize complicated patterns and make intelligent decisions based on the tested data [1]. Medical data consists of attributes in which missing values and redundant information need to be discarded. In the medical domain, one of the most fundamental requirements for feature selection and classification is the ability to deal with inconsistent and imprecise information, due to the considerable quantity of noisy, unrelated, or misleading features [2]. Medical data analysis is a complicated task because it requires knowledge from the medical data set as well as advanced techniques for processing,

Helal et. al.: Using Rough Set and Boosting Ensemble Techniques to Enhance Classification Performance of Hepatitis C Virus

storing, and accessing information from the data. Traditional techniques are not capable enough of producing optimal results from incomplete or redundant data during the analysis process [3]. Feature selection is the process of creating a subset of the original features that makes machine learning easier and less time-consuming. The best feature selection algorithm tests all possible combinations of features against benchmarks, such as prediction accuracy and execution time [4]. Feature selection plays a significant role in building intelligent classification systems [5]. It reduces the dimension of the data and the execution time, and can therefore lead to good classification performance [6]. Feature selection has the ability to choose useful attributes and reduce cardinality [7]. In an iterative search, if the subset selected in step m+1 is better than the subset selected in step m, the subset chosen in step m+1 becomes the current optimum subset. Feature selection is thus capable of minimizing the number of input features of a classifier, increasing accuracy, and producing a less time-consuming model [8]. When applying any data mining technique, dissimilar attributes need to be minimized as a preprocessing step. The most important objectives of feature selection are to improve model performance, prevent overfitting, and produce faster and more efficient models.

Classification is a method used for detecting the classes of unknown data. In the classification task, the data set being mined is divided into a training set and a testing set; the classification process accordingly includes two stages, known as the training and testing stages. A familiar problem in machine learning in general, and in classification specifically, is to overcome the problem of "overfitting". When the number of training patterns is somewhat smaller than the number of features, this is known as data overfitting. Feature selection can be used as a preprocessing step to solve this problem.
Classification of a tumor is one of the essential steps in diagnosing cancer. Tumors can be classified into benign and malignant. Malignant tumors are called cancer, and they are more serious than benign tumors. Early diagnosis needs a precise and reliable diagnosis algorithm that helps specialists to differentiate between malignant tumors and benign ones without the need for a surgical biopsy. In addition, it offers an accurate, timely analysis of a patient's particular type of cancer and the available treatment options that can lead to successful treatment. Not all specialists are experts across domains, so the automation of the diagnostic system is needed. In these systems, real-world data may be unrelated, superfluous, or noisy, and not all the attributes are valuable for classification; in this case, feature selection is necessary when dealing with actual data sets.

Ensemble classifiers build a predictive classifier by integrating multiple classifiers: they learn a target function by training several individual classifiers and combining their predictions. An ensemble classifier often has higher accuracy than any of the individual classifiers in the ensemble. Bagging and Boosting are used for producing ensemble models. These methods depend on getting a different training set for each classifier that makes up the ensemble. Bagging is based on a variety of bootstrap training sets generated from the original training set, using each of them to create a classifier for inclusion in the ensemble. This technique is very useful for large and high-dimensional data, where finding an effective model or classifier that can work in one step is impractical because of the difficulty and size of the problem [9]. Boosting is a method used to enhance the performance of a weak classifier. It works by repeatedly running a classifier on reweighted training data and then combining the classifiers into an ensemble classifier.
Depending on the error rate of the base classifier, boosting reweights the training set and enhances its behavior based on the latest error it reaches. Furthermore, if the error rate of a single classifier is equal to 0 or greater than 0.5, the sequential building of single classifiers stops [10].
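The reweight-and-stop loop described above can be illustrated with a minimal AdaBoost over one-feature threshold classifiers (decision stumps). This is a generic sketch of boosting, not the specific booster built into C5.0; the stopping rule follows the text (stop when the weighted error is zero or no better than chance):

```python
import math

def stump_train(X, y, w):
    """Best one-feature threshold classifier under sample weights w."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[j] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = stump_train(X, y, w)
        if err >= 0.5:            # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)     # avoid division by zero when err == 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, sign))
        preds = [sign if x[j] >= thr else -sign for x in X]
        # reweight: misclassified samples gain weight, correct ones lose it
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
        if err <= 1e-10:          # perfect round: nothing left to reweight
            break
    return ensemble

def predict(ensemble, x):
    s = sum(a * (sg if x[j] >= thr else -sg) for a, j, thr, sg in ensemble)
    return 1 if s >= 0 else -1

X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y)
print([predict(model, x) for x in X])  # → [-1, -1, 1, 1]
```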



In this paper, we examine and evaluate the performance of using the rough set as a preprocessing technique with SVM, decision tree, and ensemble classifiers. The rest of this paper is organized as follows. Section 2 highlights the concepts of rough set theory and describes some fundamental concepts of the decision tree and SVM classifiers. Section 3 contains related work that reviews some current and previous studies in medical diagnosis based on machine learning techniques. Section 4 discusses the proposed model. Section 5 shows the results of applying the different techniques and their classification accuracy. Finally, the conclusion and future work are presented in Section 6.

2. Fundamental Concepts:

2.1. Rough Set Theory:

Rough set theory (RST) is an approach that deals with incomplete data. Its methodology is concerned with the analysis of incomplete, uncertain, or inexact information and knowledge. RST is especially useful for discovering relationships in data, which is called knowledge discovery or data mining. Rough set theory is used for several types of data of different sizes for approximate, or rough, classification. A given class C is approximated by an upper and a lower approximation. The lower approximation consists of all data tuples that, based on the knowledge of the attributes, belong to C without uncertainty. The upper approximation consists of all tuples that, based on the knowledge of the attributes, possibly belong to C. Let S = (U, R) be an approximation space and let X ⊆ U. The lower approximation of X by R in S is the set of objects of U that are surely in X, defined as:

R̲X = { x ∈ U : [x]R ⊆ X }

The upper approximation of X is the set of objects of U that are possibly in X, defined as:

R̄X = { x ∈ U : [x]R ∩ X ≠ ∅ }
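The lower and upper approximations can be computed directly from the equivalence classes of the indiscernibility relation. A minimal sketch on a toy decision table (the attribute names and values are illustrative only):

```python
from collections import defaultdict

def equivalence_classes(universe, attrs):
    """Group objects that agree on every attribute in `attrs` (IND relation)."""
    classes = defaultdict(set)
    for name, values in universe.items():
        key = tuple(values[a] for a in attrs)
        classes[key].add(name)
    return list(classes.values())

def approximations(universe, attrs, target):
    """Lower/upper approximation of `target` w.r.t. attributes `attrs`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(universe, attrs):
        if cls <= target:          # [x]_R ⊆ X: surely in X
            lower |= cls
        if cls & target:           # [x]_R ∩ X ≠ ∅: possibly in X
            upper |= cls
    return lower, upper

# toy decision table: four objects described by two attributes
U = {
    "x1": {"fever": 1, "ascites": 0},
    "x2": {"fever": 1, "ascites": 0},
    "x3": {"fever": 0, "ascites": 1},
    "x4": {"fever": 1, "ascites": 1},
}
X = {"x1", "x3"}                   # target concept
lo, up = approximations(U, ["fever", "ascites"], X)
boundary = up - lo                 # the boundary region BN_R(X)
print(sorted(lo), sorted(up))      # → ['x3'] ['x1', 'x2', 'x3']
```

Here x1 and x2 are indiscernible, so neither can be placed in X with certainty; they end up in the boundary region.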



BN_R(X) = R̄X - R̲X is the boundary region; the pair (R̲X, R̄X) defines a rough set in S. The rough set is a useful intelligent method that is applied in the medical area and used for finding data dependencies. It evaluates the importance of attributes, reduces redundant objects and attributes, and seeks the minimum subset of attributes. In medical fields, it aims at defining the subsets of the most important attributes that affect the treatment of patients [2]. In this research, the data is the HCV data set. However, RST can be applied to any data that contains an outcome (or class) and some measurements (or attributes).

2.2. Data Mining Algorithms:

Decision Tree:

Decision trees are one of the most popular algorithms for classification; they classify input data depending on its attributes. Decision trees attempt to find a reliable association between input and target values. They classify instances by sorting them down the tree, from the root to a leaf node, according to their feature values. Each node in the tree represents a feature of the instance to be classified, and each branch represents a value that the node can assume. Decision trees have many advantages: they provide human-readable classification rules and model transparency. There are various algorithms in this area, such as ID3, C5.0, and random trees. C5.0 is an extension of the C4.5 and ID3 decision tree algorithms. It is designed to analyze massive data sets. It can produce classifiers expressed as either decision trees or rule sets. It


automatically extracts classification rules in the form of a decision tree from the given training data. The difference between C5.0 and C4.5 is that C5.0 requires less memory space and time; the tree produced by C5.0 is very small, which ultimately improves the classification accuracy [12]. C5.0 uses conventional splitting criteria, including entropy-based information gain, to split the sample into subsamples. The gain ratio is robust and consistently gives a better choice of tests for large data sets. The model works by splitting the sample based on the attribute with the maximum information gain. Each subsample is then split again until the subsamples cannot be split further. In the end, the subsamples that do not contribute to the model are removed. C5.0 is a robust model for problems such as missing data and large numbers of input features. It usually does not need long training times. Moreover, C5.0 models tend to be easily understood, and C5.0 provides a powerful boosting method that raises classification accuracy above some other model types [20].
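The entropy-based splitting criterion can be sketched as a plain information-gain computation. This is a generic illustration, not C5.0's exact implementation (C5.0 additionally normalizes the gain by the split information to obtain the gain ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from splitting `rows` on attribute `attr`."""
    total = entropy(labels)
    n = len(rows)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return total - remainder

# toy sample: splitting on ALT level separates the classes perfectly
rows = [{"ALT": "high"}, {"ALT": "high"}, {"ALT": "low"}, {"ALT": "low"}]
labels = [1, 1, -1, -1]
print(information_gain(rows, labels, "ALT"))  # → 1.0
```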

Support Vector Machine (SVM)

SVM is a supervised machine learning method commonly used in classification, data analysis, and pattern recognition. An SVM training algorithm builds a model that assigns new samples to one of the classes. An SVM model carries out classification tasks by building hyperplanes in a multidimensional space; the examples of the separate classes are divided by a gap that should be as large as possible. New samples are assigned to a class based on which side of the gap they fall on [10]. SVM is a strong classification technique that increases the predictive accuracy of a model without overfitting the training data [12]. There are numerous kernel functions for the SVM classifier, such as the linear, RBF, polynomial, and sigmoid kernels.

3. Related Work

Many researchers have handled medical work in the field of classification of different cancer diseases. Different algorithms have been proposed by various authors for the prediction of cancer regions, and some of the existing techniques are presented in this section. For example, Wahed et al. [11] presented three different ways of using SVM as a feature selection technique. The first way is to use SVM as a classifier based on training data to make a model. The second way is to use SVM as a learner, where the data is clustered into three, four, and five clusters through the K-Means technique. The third way is to use SVM for feature weighting, by forecasting feature importance with respect to a target class. In this study, it is observed that the SVM technique presents the best accuracy as a classifier, a learner, and a feature weighting technique compared with the other techniques used in the study. Hota [12, 13] proposed predictive models for breast cancer. In this work, many classification techniques have been used to classify data related to breast cancer. The various individual models developed are tested and combined to form an ensemble model.
In these papers, he utilized the ensemble models (Bayesian and C5.0) and (SVM and C5.0). The testing accuracy of the models shows the efficiency of these ensemble models. Feature subsets are obtained by applying a feature selection algorithm, and the models are tested on these data sets. Results showed that an ensemble model with a selected feature subset could be the best alternative health care predictive model for the diagnosis of breast cancer. Chen et al. [14] conducted research that uses an SVM classifier with rough set as a feature selection technique for the diagnosis of breast cancer. This method consists of two phases. In the first phase, they used rough set as a feature selection method to find the optimal features, which provided an exclusion of unnecessary data. In the second phase, the selected feature subset is used as the input to an SVM


classifier. The performance and efficiency of the method are validated on the Wisconsin Prognostic Breast Cancer data set (WPBC [26]). In this work, it was observed that the proposed method achieved the highest classification accuracies for the selected subset that contained five features. They will evaluate the proposed RS_SVM on other, larger breast cancer data sets. In addition, they will develop a more efficient approach to identifying the optimal model parameters. Jacob and Ramani [15] built a data-mining framework for prognostic breast cancer. They focused their study on building an efficient classifier for the WPBC data set by carrying out twenty classification algorithms. In addition, they investigated the effect of feature selection using six algorithms to enhance classification accuracy and decrease the feature subset size. They demonstrated that the Random Tree and C4.5 classification algorithms achieved high accuracy in the training and test phases of classification with suitable evaluation parameters. Babalık et al. [16] designed an SVM classifier based on an artificial bee colony (ABC) as a pre-processing approach to improve the accuracy of the classifier. In this research, they examined their work on three different online data sets [26]. With both the pure data sets and the data sets weighted by the artificial bee colony, SVM is used as the classifier, with the RBF kernel as the kernel function. The k-fold cross-validation algorithm is used to validate the data sets to improve reliability. The classification accuracy of the proposed approach is observed to be higher than the accuracy of the pure SVM classifier. They intend to use this approach with different data sets or different problems in the future. Gupta et al. [17] introduced research that used three different machine-learning tools over four different healthcare data sets in a performance analysis of several data mining classification techniques. They used the Knowledge Discovery in Databases approach as the research methodology.
In this study, they used the Clementine machine learning tool [30], Tanagra [28], and WEKA [31] to achieve the proposed objectives. Results showed that different classification techniques behave differently based on the nature, attributes, and size of the different data sets. The classification technique that demonstrated the highest accuracy rate and lowest error rate over a data set was selected as the best classification technique for that data set. Elsayad [18] proposed an ensemble model that applies three different data mining methods. A multilayer perceptron neural network, a C5.0 decision tree, and linear discriminant analysis are used to build an ensemble model for the problem of differential diagnosis of erythemato-squamous diseases. The proposed ensemble combined the models using a confidence-weighted voting scheme. The classification performance of the proposed system was presented using statistical accuracy, specificity, and sensitivity. The performance of the multilayer perceptron neural network (MLPNN) was enhanced using the scored predictions of the C5.0 DT and LDA models in the proposed ensemble arrangement. The proposed ensemble model achieved an accuracy of 98.23%, which is very close to the work of Ubeyli [19]. Srivastava et al. [20] presented a Rough-SVM approach based on the hybridization of SVM and the Rough Set Exploration System (RSES). RSES is used as a pre-processing stage to detect reducts that are then passed to SVM to get high classification results. The classification experiments are conducted on the reduced training and testing Heart data set. In these experiments, LIBSVM with the RBF kernel function has been used. The RBF kernel parameter γ and the cost parameter C have been determined using the 5-fold cross-validation method. Classification accuracy increased using Rough-SVM compared to using SVM alone. Elsayad and Elsalamony [21] assessed the classification performance of four different decision tree models: CHAID, C&R, C5.0, and QUEST.
Experimental results showed the effectiveness of all the models.


RBF-SVM identified a set of attributes that is sufficient to achieve 100% classification sensitivity on the training subsets, with C5.0 sharing a test-subset result of 98.198%. Durairaj and Sathyavathi [3] presented an intelligent technique based on rough set theory for analyzing imprecise medical data. They used the rough set reduction technique to compute the optimal reduct set without changing the knowledge of the original set. In the experiments, they used "In Vitro Fertilization (IVF)" medical data sets in the analysis process. The reduction algorithm produced the factors that affect the success rate of IVF treatment. In the ROSETTA toolkit, the Johnson reduction algorithm is used in this analysis process to predict the optimal reduct set. The experimental results show that rough set theory is an efficient tool for identifying the influential parameters in determining the success rate of IVF treatment. Hassanien and Ali [22] proposed a model that used a rough set for creating classification rules for breast cancer data. The first step in this model is to select and normalize the attributes. The second step is to generate the rough set dependency rules directly from the real-valued attribute vector. Then, the rough set is applied to get all reducts of the data that identify the minimal subset of attributes associated with a class label for classification. The obtained results of the rough set and the ID3 decision tree classifier algorithm are compared with each other. Rough sets produced the highest accuracy rates and the most compact rules. They intend to integrate rough sets with other intelligent tools, such as fuzzy sets and neural networks, for classification and rule generation. Setiawan et al. [23] proposed a method to reduce the high number of rules extracted from a Coronary Artery Disease data set to a smaller number of rules. They used support filtering and RST rule importance selection as a hybrid approach.
High-quality rules and the most important rules are selected by the support filtering and RST methods, respectively. Results show that the proposed method can select a small number of rules from a large number of rules without reducing the quality of classification. As reviewed above, different authors have proposed different algorithms for the prediction of cancer regions, and several of the existing techniques use a feature selection technique as a preprocessing step to enhance classifier performance. The proposed model uses the rough set as a feature selection technique to improve the performance of the individual classifiers and ensemble classifiers used in this paper.

4. The Proposed Model:

This study focuses on integrating a feature selection technique with individual and ensemble classifiers (Boosting C5.0 and Boosting SVM) and compares the result for each classifier with and without feature selection. In this proposed model, the rough set is employed as the feature selection technique, and C5.0, SVM, Boosting C5.0, and Boosting SVM are employed as the classification techniques. The proposed model consists of three stages, as shown in Figure 1.

First Stage: Data Collection and Preprocessing: The HCV data set was collected from clinical trials of a newly developed medication for HCV [23, 24]. Data preprocessing is one of the most critical steps in a data mining process; it transforms the data into a format that can be processed more easily and efficiently. The data preparation process is often the most time-consuming and computationally intensive. In this step, the data set has been split into a training set, which is used to build the model, and a testing set, which is used to evaluate the proposed model.
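Such a split can be sketched as a simple shuffled partition; the 0.75/0.25 ratio matches the splitting factor reported in the experiments, and the seed is an arbitrary choice for reproducibility:

```python
import random

def train_test_split(rows, train_frac=0.75, seed=42):
    """Shuffle the records and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = list(range(119))     # the HCV data set has 119 cases
train, test = train_test_split(records)
print(len(train), len(test))   # → 89 30
```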



Second Stage: Building the Proposed Model: In this model, RST is used as a feature selection technique to create a subset of the original features that makes machine learning easier and less time-consuming. After creating the subset, different classification techniques are used to classify new data. In addition, we separately use C5.0 and SVM as classification techniques. In this study, we classify data by using each classification technique individually, by using the combination of RS with the different classification techniques mentioned in the previous section, and by using the combination of Boosting classification techniques with RS. In this paper, RS-Boosting SVM and RS-Boosting C5.0 are used as the proposed models. The results are compared with the other tested techniques.

Figure 1: The architecture of the proposed model. (First stage: data collection and preprocessing; the data set is pre-processed, transformed, and split into training and testing sets. Second stage: building the proposed model; feature selection using RST discovers data dependencies, evaluates the importance of attributes, and seeks the minimum subset of attributes, and the relevant feature subset is then classified using the C5.0, SVM, and Boosting classifiers. Third stage: evaluating the proposed model using three statistical measures: accuracy, sensitivity, and specificity.)

Third Stage: Evaluate the Proposed Model: To evaluate the performance of each classification technique, three statistical measures are used. Accuracy indicates the percentage of correctly classified cases among all the classified samples. The second criterion, sensitivity, refers to the rate of correctly classified positives. The third criterion, specificity, refers to the rate of correctly classified negatives.
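The three measures can be computed from the confusion counts. A minimal sketch, using the -1/1 decision labels of the HCV data set and hypothetical predictions:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for the given positive class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

# HCV decision labels: 1 = present, -1 = absent (predictions are made up)
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
acc, sens, spec = evaluate(y_true, y_pred)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
# → accuracy=0.667 sensitivity=0.667 specificity=0.667
```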


The pseudo code of the proposed model:

Step 1: Collect data and transform it into a decision table A = (U, A ∪ {d}), where U = {x1, x2, x3, …, xn} is called the universe and A is a nonempty finite set of attributes such that a : U → Va for every a ∈ A. The set Va is called the value set of a.

Step 2: Split the data into training and testing data sets.

Step 3: Find an optimal subset of features using the rough set on the training data set.

- Generate the discretized table from the training data set: transform the original decision table A = (U, A ∪ {d}) into a new discrete decision table A^D = (U, A^D ∪ {d}), where A^D = {a^Da : a ∈ A} and Da is a set of cuts. The table A^D is called the D-discretized table of A.

- Calculate reducts by using the exhaustive algorithm. A decision reduct is a set B ⊆ A of attributes that cannot be further reduced and that satisfies IND(B) = IND(A), where the indiscernibility relation is defined as: x IND(B) y ⇔ a(x) = a(y) for all a ∈ B, with x, y ∈ U. If a pair (x, y) ∈ U × U belongs to IND(B), then x and y are indiscernible by the attributes from B.

- Minimize the reducts by using the shortening method with a 90% shortening ratio, without reducing the accuracy and coverage measurements.

- Generate rules and remove rules with small support. In the discrete decision table A^D = (U, A^D ∪ {d}), every x ∈ U determines a sequence a1(x), a2(x), …, an(x); d(x), where {a1, a2, …, an} = A. The decision rule induced by x has the form (a1 = a1(x)) ∧ … ∧ (an = an(x)) → (d = d(x)).

- Test the efficiency of using the rough set as feature selection.

Step 4: Classify the data by using the combination of Boosting classification techniques with RS. A^D_min = (U, A^D ∪ {d}) is the minimized discrete decision table that contains the optimal features selected

by the rough set. The optimal features are used with the Boosting techniques (Boosting C5.0 or Boosting SVM) to classify the data.

Step 5: Evaluate the performance of the classification techniques using accuracy, sensitivity, and specificity.

5. Experiment Results and Analysis:

This section evaluates the performance of the individual SVM and C5.0 techniques. In addition, it evaluates some hybrid classification techniques, such as RS-SVM, RS-C5.0, RS-Boosting C5.0, and RS-Boosting SVM.

Hepatitis C Virus Data Set: The HCV data set consists of 119 cases, each of which is described by 28 attributes: 23 numerical and 5 categorical attributes. The objective of the data set is to predict the existence or nonexistence of the hepatitis virus in relation to the proposed medication. These attributes are listed in Table 1.



TABLE 1: THE TESTED ATTRIBUTES.

1. Sex: male or female
2. Source: source of HCV (blood transfusion, non-sterile tools by dentist, or surgery)
3. S.G.P.T (ALT): normal range between 0 and 40 U/L
4. S.G.O.T (AST): normal range between 0 and 45 U/L
5. Serum Bilirubin (SB): normal range between 0 and 1.1 mg/dL
6. Serum Albumen (SA): serum albumin; normal range between 3.5 and 5.1 g/dL
7. Serum Ferritin: normal range between 22 and 300
8. Ascites: No, Mild, and Ascites
9. Spleen: Normal, Absent, and Enlarged
10. Lesions: 0, 1, or 2
11. Portal vein (P.V): natural diameter is 12 mm
12. PCR: quantitative analysis of the virus, U/mL
13. PLT: platelets; normal range between 150 and 450 /cmm
14. WBC: white blood corpuscles; normal range between 4 and 11 /cmm
15. HGB (Haemoglobin): range for a male between 12.5 and 17.5 g/dL; range for a female between 11.5 and 16.5 g/dL
16. Headache: Yes or No
17. Blood Pressure: Yes or No
18. Nausea: Yes or No
19. Vertigo: Yes or No
20. Vomiting: Yes or No
21. Constipation: Yes or No
22. Diarrhea: Yes or No
23. Appetite: Yes or No
24. Gasp: Yes or No
25. Fatigue: Yes or No
26. Skin colour: Yes or No
27. Eye colour: Yes or No
28. Decision Class: -1 (absent) or 1 (present) of HCV

In this study, machine-learning tools (RSES v2.2 [27], Tanagra 1.4.50 [28], and C5.0 [29] (See5 release 2.1)) are used to achieve the proposed objectives. The percentage of accuracy, sensitivity, and specificity of the classification techniques are calculated as the measurement parameters for the analysis. A high accuracy rate for a classification technique applied to a data set shows that the obtained classifier classifies the data set highly correctly, while a low accuracy rate shows that the obtained classifier classifies the data set less correctly. The sensitivity parameter measures the rate of the positive class that is correctly classified. The specificity parameter measures the rate of the negative class that is correctly classified. In this paper, the performance of each classification model is evaluated using three statistical measures: classification accuracy, sensitivity, and specificity. These measures are defined as follows:

- Accuracy = (TP + TN) / (TP + FP + FN + TN) = (TP + TN) / N
- Sensitivity = TP / (TP + FN)


- Specificity = TN / (TN + FP)

TP represents an instance that is positive and predicted by the model as positive. FP represents an instance that is negative but predicted by the model as positive. FN represents an instance that is positive but predicted by the model as negative. TN represents an instance that is negative and predicted by the model as negative. During the experiment, the HCV data set was divided into training and testing data sets with splitting factors of nearly 0.75 and 0.25 for the training set and testing set, respectively. The training set is used to build the classifier, and the testing data is then used to validate it. The participating classification techniques are applied to generate the classifiers via the Tanagra 1.4.50 and C5.0 machine learning tools. The results are recorded in terms of the percentage of accuracy, sensitivity, and specificity, as shown in Table 3. In this section, the algorithms and methods described in the previous sections are tested on the real HCV data sets. The goal of the research is to compare the different methods described previously and find the best method. HCV treatment may take 12 months or more. In this experiment, the HCV data set has three snapshots during treatment: after three, six, and nine months. Therefore, data pre-processing, feature selection, and classification have been done three times to measure the progress of the HCV cases. In this paper, the RSES tool is used for feature selection on the three snapshots of the data set. After splitting the data set, the discretize method is applied to generate cuts. We then use those cuts to generate the discretized table and use this table to calculate reducts with the exhaustive algorithm. This method generated 56 reduct sets. The following step is to minimize the reducts using the shortening method without reducing the accuracy and coverage measurements. After this, the generated reducts are minimized to five.
Then, we use the generate-rules method, which produced 109 rules. The next task is to remove rules with support less than three, after which we obtain 55 rules. After this, the test set is used to test the efficiency of using the rough set as feature selection; the result is shown in Figure 2.

Figure 2: View of classification results when applying RST on 3 Months Dataset

Applying the same steps to the 6-months data set, the shorten method reduces the reducts to 18. The generate rules method then produces 405 rules, and removing rules with support less than four leaves 154 rules. The final step is to use the test set to assess the efficiency; the result is shown in Figure 3.
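The discretization step applied to each snapshot (generating cuts and mapping raw attribute values onto intervals) can be sketched with simple equal-frequency cuts. RSES uses more sophisticated cut-selection algorithms, so this is only an illustration, and the ALT readings below are hypothetical.

```python
def equal_frequency_cuts(values, n_bins):
    """Generate n_bins - 1 cut points so each interval holds roughly equal counts."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(round(i * step))] for i in range(1, n_bins)]

def discretize(value, cuts):
    """Map a raw value to the index of the interval it falls into."""
    return sum(value >= c for c in cuts)

alt_levels = [12, 18, 25, 33, 41, 55, 62, 78, 90, 120]  # hypothetical ALT readings
cuts = equal_frequency_cuts(alt_levels, n_bins=3)
print(cuts, discretize(70, cuts))
```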

IJICIS, Vol.15, No. 2 APRIL 2015

Figure 3: View of classification results when applying RST on 6 Months Dataset

Applying the same steps to the 9-months data set, the shorten method reduces the reducts to 56. The generate rules method then produces 740 rules, and removing rules with support less than 14 leaves 13 rules. The final step is to use the test set to assess the efficiency; the result is shown in Figure 4.

Figure 4: View of classification results when applying RST on 9 Months Dataset

C5.0 and SVM are applied to the original data sets without any feature selection technique. In addition, the rough set technique is applied to the three snapshots, and the proposed hybrid RS-SVM, RS-C5.0, RS-Boosting C5.0, and RS-Boosting SVM models are tested on the selected feature subsets shown in Table 2. The results are listed in Table 3.

TABLE 2: THE SELECTED FEATURE SUBSETS

Dataset     No. of Features   Selected Features
3 Months    7                 Sex, WBC, S.G.P.T (ALT), HGB, Fatigue, Serum Bilirubin (SB), Gasp
6 Months    7                 S.G.O.T (AST), HGB, Skin color, Serum Bilirubin (SB), Serum Ferritin, S.G.P.T (ALT), Appetite
9 Months    13                Sex, Source, S.G.O.T (AST), S.G.P.T (ALT), HGB, Serum Bilirubin (SB), Serum Albumen (SA), Serum Ferritin, Spleen, PLT, Gasp, Fatigue, Skin color
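Once the reduct for a snapshot is known, the data set is projected onto that feature subset before training the classifiers. A minimal sketch, using the 3-months subset from Table 2 with a hypothetical patient record (the attribute names and values here are illustrative abbreviations, not the paper's exact encoding):

```python
# Selected feature subset for the 3-months snapshot (abbreviated from Table 2).
SELECTED = ["Sex", "WBC", "ALT", "HGB", "Fatigue", "SB", "Gasp"]

def project(records, features):
    """Keep only the selected feature columns plus the class label."""
    return [{k: r[k] for k in features + ["class"]} for r in records]

# Hypothetical patient record carrying extra, unselected attributes.
records = [
    {"Sex": "M", "WBC": 5.1, "ALT": 44, "HGB": 13.2, "Fatigue": "yes",
     "SB": 0.9, "Gasp": "no", "Age": 41, "PLT": 210, "class": "positive"},
]
reduced = project(records, SELECTED)
print(sorted(reduced[0]))  # unselected attributes such as Age and PLT are gone
```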

Helal et. al.: Using Rough Set and Boosting Ensemble Techniques to Enhance Classification Performance of Hepatitis C Virus

TABLE 3: THE CLASSIFICATION ACCURACY FOR DIFFERENT CLASSIFICATION ALGORITHMS

[Table 3 lists, for each snapshot (the 3-, 6-, and 9-months data sets, split 75% training / 25% testing), the number of features and the number of classifiers used, together with the training/testing accuracy, sensitivity, and specificity of Rough Set, SVM, C5.0, RS-SVM, RS-C5.0, and the proposed RS-Boosting C5.0 and RS-Boosting SVM models. In several runs, the number of boosting trials was reduced because the last classifier was very inaccurate; the proposed RS-Boosting SVM model uses an SVM Gamma parameter of 3.]

Figure 5: Classification Techniques Performance on 3 Months Dataset


Figure 6: Classification Techniques Performance on 6 Months Dataset

Figures 5 and 7 show that SVM achieved high accuracy, sensitivity, and specificity rates, unlike the C5.0 classifier. In addition, the proposed hybrid RS-Boosting SVM model achieves higher accuracy, sensitivity, and specificity rates with the selected feature subsets than any other hybrid model. Figure 6 shows that SVM achieved the same accuracy and sensitivity rates as the C5.0 classifier but a higher specificity rate. Again, the proposed hybrid RS-Boosting SVM model achieves the highest accuracy, sensitivity, and specificity rates with the selected feature subsets.

Figure 7: Classification Techniques Performance on 9 Months Dataset

Table 3 shows the classification accuracy, sensitivity, and specificity rates of the different classification algorithms, and Figures 5, 6, and 7 show the performance of each classification technique through three statistical measures (accuracy, sensitivity, and specificity). The SVM classifier achieved 100% accuracy, sensitivity, and specificity rates, unlike the C5.0 classifier, for two snapshots of the data set (the 3-months and 9-months data sets). The proposed hybrid RS-Boosting SVM model achieves higher accuracy, sensitivity, and specificity rates with the selected feature subsets than any other hybrid model for all tested snapshots (3 months, 6 months, and 9 months).
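The boosting stage of the proposed hybrid models can be illustrated with a minimal discrete AdaBoost over decision stumps. The paper boosts C5.0 trees and SVMs via tool-provided implementations; stumps are used here only to keep the sketch self-contained, and the two-feature sample data are hypothetical.

```python
import math

def stump_train(X, y, w):
    """Find the best threshold stump (error, feature, cut, polarity) under weights w."""
    best = None
    for f in range(len(X[0])):
        for cut in sorted({x[f] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[f] >= cut else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, cut, pol)
    return best

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost: reweight samples toward the previous round's mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, cut, pol = stump_train(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:          # stop when the last classifier is too inaccurate,
            break               # mirroring the reduced-trials behaviour in Table 3
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, cut, pol))
        preds = [pol if x[f] >= cut else -pol for x in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * (pol if x[f] >= cut else -pol) for a, f, cut, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 2-feature samples labelled +1 (responder) / -1 (non-responder).
X = [[44, 13.2], [70, 11.0], [30, 14.1], [90, 10.2], [25, 13.8], [65, 10.9]]
y = [1, -1, 1, -1, 1, -1]
model = adaboost(X, y)
print([predict(model, x) for x in X])  # → [1, -1, 1, -1, 1, -1]
```

In the full pipeline, the base learner inside each boosting round would be a C5.0 tree or an SVM trained on the reduct features, but the reweight-and-vote structure is the same.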


6. Conclusion

Classification techniques are used to detect the classes of unknown data, and classifying a tumor is one of the important steps in diagnosing cancer. Feature selection plays an essential role in constructing intelligent classification systems and can enhance classification performance. An ensemble classifier often achieves better accuracy than any of the individual classifiers in the ensemble. In this paper, we have proposed a model for classifying HCV data based on RST as a feature selection technique combined with different classification techniques. The model attempts to classify patients into two classes. The classification performances of the models listed in the paper are evaluated and compared to each other using three statistical measures: classification accuracy, specificity, and sensitivity. The experimental results show that the proposed hybrid RS-Boosting SVM model has higher accuracy, sensitivity, and specificity rates with the selected feature subsets than the hybrid RS-Boosting C5.0 model for all snapshots (the 3-months, 6-months, and 9-months data sets). In future work, we intend to combine two or more different classifiers into ensemble classifiers and show how an ensemble classifier improves the efficiency of classification.
