An adaptive rule-based classifier for mining big biological data

Expert Systems With Applications 64 (2016) 305–316


Dewan Md. Farid (a,*), Mohammad Abdullah Al-Mamun (b), Bernard Manderick (a), Ann Nowe (a)

(a) Computational Modeling Lab, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
(b) Department of Population Medicine & Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 14850, USA

Article info

Article history: Received 2 January 2016; Revised 1 August 2016; Accepted 2 August 2016; Available online 3 August 2016.

Keywords: Brugada syndrome; Classification; Decision tree; Genomic data; Rule-based classifier

Abstract

In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems of classifying such data: overfitting, noisy instances and class-imbalanced data. Rules are an attractive way of representing data because they are human interpretable. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without global optimisation. It uses the random subspace approach to avoid overfitting, the boosting approach to handle noisy instances and the ensemble of decision trees to deal with the class-imbalance problem. The classifier builds on two popular classification techniques: decision tree induction and the k-nearest-neighbour algorithm. Decision trees are used to derive classification rules from the training data, while k-nearest-neighbour is used to analyse the misclassified instances and to remove ambiguity between contradictory rules. The classifier runs a series of k iterations to develop the rule set, paying more attention to the misclassified instances in each subsequent iteration, which gives it a boosting flavour. The paper particularly focuses on producing an ensemble classifier that improves the prediction accuracy of DNA variant identification and classification. The performance of the proposed classifier is compared with well-established machine learning and data mining algorithms on genomic data (148 Exome data sets) of Brugada syndrome and on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier achieves strong classification accuracy on different types of biological data. Overall, it offers good prediction accuracy when classifying new DNA variants, where noisy and misclassified variants are handled to increase test performance.

1. Introduction

The emerging field of bioinformatics combines the challenging research areas of biology and informatics to develop methods and tools for analysing biological data. The main challenge is to extract the relevant information from the large amount of clinical and genomic data and then transform it into useful knowledge (Taminau, 2012). Three major issues are involved in this process: (a) collecting clinical and genomic data, (b) retrieving relevant information from the data, and (c) extracting new knowledge from the information. Over the last decade, various life science research groups have generated huge amounts of clinical and genomic data through the Human Genome Project (HGP), and some of them are publicly available through online repositories (Artimo et al., 2012; Berman et al., 2000; Lichman, 2013).




Computational intelligence researchers routinely apply machine learning (ML) and data mining (DM) algorithms to interpret these biological data (Latkowski & Osowski, 2015; Tzanis, Kavakiotis, & Vlahavas, 2011; Yang & Chen, 2015). Typically, biological data are noisy and high dimensional (thousands of features), have small sample sizes (dozens of instances), and some gene sequences have large variance (Alter et al., 2011), which creates a danger of overfitting and low classification efficiency. Biological data mining (BDM) is the process of extracting new (previously unknown) knowledge from biological data. It brings extensive DM concepts, theories and applications into biological research. DM uses ML algorithms to discover patterns and useful information in large data sets or databases (Farid et al., 2013; Han, Kamber, & Pei, 2011). DM has two major functions: (a) classification (supervised learning) and (b) clustering (unsupervised learning). In classification, the mining classifiers predict the class value of a new/unseen instance after learning from the training data (Farid, Zhang, Rahman, Hossain, & Strachan, 2014; Nápoles, Grau, Bello, & Grau, 2014).

The training instances are grouped into classes before mining the data. Clustering (or segmentation), on the other hand, groups the instances into clusters based on the similarities among the instances over predefined features (Milone, Stegmayer, Kamenetzky, López, & Carrari, 2013). The instances within a cluster are more similar to one another and very dissimilar to instances in other clusters (Al-Mamun et al., 2016). Both classification and clustering play an important role in analysing biological data, such as genomic/DNA microarray data classification and analysis (Hanczar & Nadif, 2011; 2012; Liew, Yan, & Yang, 2005; Lin, Liu, Chen, Chao, & Chen, 2006). Mining becomes more difficult, however, when biological data have a large number of features and a small number of instances/variants (Gheyas & Smith, 2010; Hua, Tembe, & Dougherty, 2009). This paper presents an adaptive rule-based (ARB) classifier for classifying multi-class biological/genomic data to improve the prediction accuracy of the DNA variant classification task. It uses two efficient and effective supervised learning algorithms: decision tree (DT) induction and the k-nearest-neighbour (kNN) method. DTs are used for deriving a set of classification rules from the training data, while kNN is used for analysing the misclassified instances and removing ambiguity between contradictory rules. Rules make it easier to deal with complex classification problems. A rule-based classifier has several advantages: (a) it is as highly expressive as a DT, (b) it is easy to interpret, (c) it is easy to generate, (d) it can classify new instances rapidly, and (e) its performance is comparable to a DT. In addition, new rules can be added to an existing rule set without disturbing the ones already there, and rules can be executed in any order (Han et al., 2011). Usually, a rule-based classifier has two characteristics: (a) it contains mutually exclusive rules if the rules are independent of each other, so that every instance is covered by at most one rule, and (b) it has exhaustive coverage if it accounts for every possible combination of feature values, so that each instance is covered by at least one rule (Witten, Frank, & Hall, 2011). A rule-based classifier makes use of a set of IF-THEN rules for classification. The IF part of a rule is called the rule antecedent or precondition; the THEN part is called the rule consequent. The antecedent consists of one or more feature tests that are logically ANDed, and the consequent consists of the class prediction. The proposed ARB classifier combines the random subspace and boosting approaches with an ensemble of DTs to construct a set of classification rules. We use the random subspace approach to avoid overfitting, the boosting approach to classify noisy instances and the ensemble of DTs to deal with the class-imbalance problem. The classifier runs a series of k iterations to generate classification rules. In each iteration a rule set is generated from the training data using a DT built on a random subspace of the features. One rule is generated for each leaf node of the tree; each path in the tree from the root to a leaf corresponds to a rule. The rules are extracted from the training data using the C4.5 (DT induction) algorithm.
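For illustration, rules of this form can be read directly off a C4.5 tree. On the UCI iris data (one of the benchmark data sets used in Section 6), a C4.5-style tree typically yields rules such as the following (the exact thresholds depend on the training sample):

IF petal-width <= 0.6 THEN class = Iris-setosa
IF petal-width > 0.6 AND petal-width <= 1.7 AND petal-length <= 4.9 THEN class = Iris-versicolor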
The ARB classifier concentrates on the misclassified training instances in the next iteration and thereby identifies the instances that are difficult to classify. An instance's weight reflects how difficult it is to classify. The weights of the instances are adjusted according to how they are classified in each iteration: if an instance is correctly classified its weight is decreased, and if it is misclassified its weight is increased. The ARB classifier produces good classification rules without any need for global optimisation. We have tested the performance of the proposed ARB classifier on 148 Exome data sets (Brugada syndrome variant classification) and on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository (Lichman, 2013). The proposed classifier is compared experimentally with popular and strong classification algorithms: (a) the RainForest tree, (b) DT (C4.5), (c) the naïve Bayes (NB) classifier, and (d) the kNN classifier.

The DT, NB and kNN classifiers have been widely used for mining biological data in recent years. We chose these classifiers because they also produce transparent models that are human interpretable. Other models, such as random forests, usually give better performance but are poor from a transparency point of view. The rest of the paper is organised as follows. Section 2 presents the knowledge discovery process for big biological data. Section 3 introduces the DT and kNN algorithms. Section 4 presents the proposed ARB classifier in detail. Section 5 presents the algorithm for analysing misclassified instances. Section 6 provides experimental results on genomic and benchmark life sciences data sets. Finally, Section 7 concludes the findings and proposes directions for future work.

2. Mining big biological data

2.1. Mining big data

Mining big data is the process of extracting knowledge to uncover hidden information in massive amounts of complex data or databases (Al-Jarrah, Yoo, Muhaidat, Karagiannidis, & Taha, 2015; Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015). The data in big data come in different forms, including two-dimensional tables, images, documents and complex records from multiple sources (Kambatla, Kollias, Kumar, & Grama, 2014), and must support search, retrieval and analysis (Barbierato, Gribaudo, & Iacono, 2014; Singh & Reddy, 2014). Three V's define big data: Volume (the quantity of data), Variety (the category of data) and Velocity (the speed of data in and out). A few more V's are sometimes added to the mix: Vision (having a purpose/plan), Verification (ensuring that the data conform to a set of specifications) and Validation (checking that the purpose is fulfilled) (Jin, Wah, Cheng, & Wang, 2015). Regarding data and their uses, there is a big difference between big data and business intelligence (BI). BI provides historical, current and predictive views of business operations, which includes reporting, online analytical processing, business performance management, competitive intelligence, benchmarking and predictive analytics. In big data, by contrast, the data are processed with advanced ML and DM algorithms to extract meaningful information/knowledge from the data (Chen & Zhang, 2014; Najafabadi et al., 2015). In big data mining applications, very large training sets of millions of instances are common, and most often the training data will not fit in memory. The efficiency of existing ML and DM algorithms, such as DT, NB and kNN, has been well established for relatively small data sets (Gehrke, Ramakrishnan, & Ganti, 2000). These algorithms become inefficient due to swapping of the training instances in and out of main and cache memories. So, more scalable approaches are required to handle training data that are too large to fit in memory.

2.2. Classifying big biological data

In this 'omics' era, life science generates genomic, transcriptomic, epigenomic, proteomic and metabolomic data scaling from terabytes (TB) to petabytes (PB), even exabytes (EB) (López, Aguilar, Alonso, & Moreno, 2012). These biological data share the 3V properties (volume, variety and velocity) of big data.
These biological data, however, are more heterogeneous than other contemporary big data such as those of Google, WeChat and Alibaba, and their 3V features reflect causal relationships among elements such as genes, proteins and pathways at the molecular level (Toga & Dinov, 2015). Knowledge discovery from biological data is a DM application.

Table 1
Gene panel of Brugada syndrome.

Chromosome   Name of Gene
Chr 1        KCND3
Chr 3        SCN5A, GPD1L, SLMAP, CAV3, SCN10A
Chr 4        ANK2
Chr 7        CACNA2D1, AKAP9, KCNH2
Chr 10       CACNAB2
Chr 11       KCNE3, SCN3B, SCN2B, KCNJ5, KCNQ1, SCN4B
Chr 12       CACNA1C, KCNJ8
Chr 15       HCN4
Chr 17       RANGRF, KCNJ2
Chr 19       SCN1B, TRPM4
Chr 20       SNTA1
Chr 21       KCNE1, KCNE2
Chr X        KCNE1L

Fig. 1. Knowledge discovery from genomic/biological data.

Fig. 2. Heterogeneous nuclear RNA (hnRNA).

The process of knowledge discovery from genomic data is illustrated in Fig. 1; it is slightly different from the Knowledge Discovery in Databases (KDD) process (Han et al., 2011). In this process, visualisation and knowledge representation techniques are used to present the mined knowledge to bioinformaticians. In this study, the Exome data sets of 148 patients were analysed for Brugada syndrome at the Centre for Medical Genetics of UZ Brussel (Universitair Ziekenhuis Brussel, www.uzbrussel.be) as part of the BRiDGEIris (BRussels big Data platform for sharing and discovery in clinical GEnomics) project. Brugada syndrome (BrS), also known as sudden adult death syndrome (SADS), is a genetic disease which increases the risk of sudden cardiac death (SCD) at a young age (Hofman et al., 2013). BrS is named after the Spanish cardiologists Pedro Brugada and Josep Brugada. It is detected by abnormal electrocardiogram (ECG) findings called a type 1 Brugada ECG pattern, and it is much more common in men. People with BrS have a risk of SCD originating from the lower chambers of the heart (ventricular arrhythmias). BrS is a heart rhythm disorder. Each heartbeat is triggered by an electrical impulse generated by special cells in the right upper chamber of the heart. Channels (tiny pores) on each of these cells direct the electrical activity that makes the heart beat. In BrS, a defect in these channels can cause an abnormal heartbeat and make the heart spin electrically out of control in an abnormally fast and dangerous rhythm (ventricular fibrillation). SCD occurs when the heart does not pump effectively and not enough blood travels to the rest of the body. DNA sequencing can be used to detect the genes that are associated with a genetic disease such as BrS. In this paper, we have used 148 Exome data sets; the exome is the part of the genome formed by exons. An exon is a nucleotide/DNA sequence within a gene. As shown in Fig. 2, a gene contains exons, which include the coding sequence (CDS) and untranslated regions (UTR), as well as unused sequences called introns. The notation 5' and 3' in Fig. 2 refers to the direction of the DNA template in the chromosome and is used to distinguish between the two untranslated regions. The exome consists of all DNA that is transcribed into mature RNA. The exome of the human genome consists of roughly 180,000 exons, constituting about 1% of the total genome, or about 30 megabases of DNA. First, the Exome data sets (annotated vcf files) are filtered against the gene panel of the disease (BrS) to reduce the number of variants.
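As a rough illustration of this filtering step (a sketch under our own assumptions, not the project's actual pipeline), the following Java fragment keeps only the records of an annotated vcf file whose gene annotation appears in the BrS gene panel of Table 1. The file names and the assumption that the gene symbol is stored as a GENE=<symbol> field in the INFO column are hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GenePanelFilter {
  public static void main(String[] args) throws Exception {
    // Hypothetical BrS gene panel (see Table 1); in practice this would be read from a file.
    Set<String> panel = new HashSet<>(Arrays.asList(
        "KCND3", "SCN5A", "GPD1L", "SLMAP", "CAV3", "SCN10A", "ANK2",
        "CACNA2D1", "AKAP9", "KCNH2", "CACNAB2", "KCNE3", "SCN3B", "SCN2B",
        "KCNJ5", "KCNQ1", "SCN4B", "CACNA1C", "KCNJ8", "HCN4", "RANGRF",
        "KCNJ2", "SCN1B", "TRPM4", "SNTA1", "KCNE1", "KCNE2", "KCNE1L"));
    try (BufferedReader in = new BufferedReader(new FileReader("patient001.annotated.vcf"));
         PrintWriter out = new PrintWriter("patient001.brs_panel.vcf")) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("#")) { out.println(line); continue; }   // keep vcf header lines
        // assumption: the gene symbol is stored as GENE=<symbol> in the INFO column (column 8)
        String info = line.split("\t")[7];
        for (String field : info.split(";")) {
          if (field.startsWith("GENE=") && panel.contains(field.substring(5))) {
            out.println(line);   // keep this variant: its gene is in the BrS panel
            break;
          }
        }
      }
    }
  }
}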

The gene panel of BrS contains 406 records. Table 1 shows the chromosomes and genes in the gene panel of BrS. Data pre-processing techniques are applied to the data sets to transform the raw data into an understandable format for further processing (Yang & Chen, 2015). Data pre-processing is needed for mining biological data and includes several techniques: (a) data cleaning, (b) data integration, (c) data transformation, (d) data reduction, and (e) data discretisation. Then a feature selection method is applied to the formatted data. Feature selection is the process of selecting a subset of relevant features from the full set of original features in the data (Farid & Rahman, 2013). It is a form of search based on a given optimisation principle that improves the performance of a mining model (Latkowski & Osowski, 2015). In biological data, features may contain false correlations, and the information they add may already be contained in other features (Bolón-Canedo, Sánchez-Maroño, & Alonso-Betanzos, 2012). Extra features also increase the computational time. The three main reasons for feature selection are: (a) simplification of models, (b) shorter training times, and (c) reduced overfitting. Finally, a mining algorithm is used for classifying the biological/genomic data. Recently, several ML and DM algorithms have been applied to mining biological data. Yang and Chen (2015) presented a framework to find the correlation between clinical and pathological data using DM algorithms. The authors applied data pre-processing to deal with redundant instances and missing values in pathological data, and DT induction for classifying clinical data. The proposed algorithm extracted rules from pathological and clinical data. Latkowski and Osowski (2015) developed an ensemble classifier for selecting the highest ranked genes and gene sequences from gene expression microarray data. The algorithm used feature selection methods to select genes that were used as input features for an ensemble model composed of support vector machines (SVMs) and a random forest of DTs. A number of classifiers using SVMs were built on the sets of genes selected by the different feature selection methods, and the results of the different classifiers were combined into a final decision by random forest trees. Bolón-Canedo et al. (2012) presented an ensemble model for analysing DNA microarray data which combined three well-approved DM algorithms: (a) the DT classifier C4.5, (b) the NB classifier and (c) the nearest-neighbour instance-based learner IB1. A different subset of features was used for classification by each specific classifier, and the outputs of these classifiers were combined by a simple voting technique. Stelle, Barioni, and Scott (2011) proposed a method for predicting protein structure and protein folding. This method designed and implemented a local database with 20,000 proteins extracted from the Protein Data Bank (PDB; http://www.rcsb.org/pdb/) (Berman et al., 2000). It then used rule mining to identify hydrophobicity patterns that achieve a specific secondary structure in different proteins.

Liu, Liu, and Zhang (2010) presented an ensemble gene selection method (EGS) based on information theory to choose multiple gene subsets for classification. A majority voting technique was used to train multiple classifiers on the different gene subsets, and the NB and kNN classifiers were used to construct the classification model. Conventional learning algorithms like DT, NB and kNN give lower accuracy when mining big biological data because they are designed for classifying small data sets with a small number of features (Farid et al., 2014). Biological data usually have a high number of features, and the data are generated continuously; each Exome data set for BrS, for example, has 143 features with numeric, nominal and string values. It is also difficult to update a mining model continually. Therefore, in this paper, we use a rule-based classifier that can be updated continuously with new rules: new rules can be added to the existing rules without disturbing the ones already there, and rules can be executed in any order.

3. Classification methods

In supervised learning, data classification is a two-step process of training and testing. A mining model/classifier is built from the instances in the training phase, where the instances are associated with class labels. In the testing phase, the classifier is used to predict the class labels of the testing instances. Given a data set D = {x1, ..., xi, ..., xN} that contains N instances, each instance xi ∈ D has n features Aj, j = 1, ..., n. The value of feature Aj of instance xi is aij. X ⊂ D is a subset of instances and each instance belongs to one of M classes {c1, ..., cl, ..., cM}. Table 2 summarises the symbols and terms that are commonly used throughout the paper.

Table 2
Commonly used symbols and terms.

Symbol   Term
D        Training data
xi       A data instance
X        A subset of instances
Aj       A feature
aij      A feature's value
cl       A class label
DT       A decision tree

3.1. Decision tree (DT)

DT induction is a top-down, recursive, divide-and-conquer algorithm for the multi-class classification task (Han et al., 2011). It is easy to interpret and explain, and requires little prior knowledge. Nonlinear relationships between features do not affect the tree performance. The goal of DT induction is to iteratively partition the data into smaller subsets until all the subsets belong to a single class or the stopping criteria of the tree-building process are met. In the early 1980s, Quinlan (1986) proposed the ID3 (Iterative Dichotomiser 3) algorithm, which uses information theory to select the best feature Aj. The Aj with the maximum Information Gain is chosen as the root node of the tree. To classify an instance xi ∈ D, the average amount of information needed to identify a class cl is shown in Eq. (1), where pi is the probability that xi belongs to the class cl and is estimated by |cl, D|/|D|.

Info(D) = -\sum_{i=1}^{N} p_i \log_2(p_i)    (1)

InfoA(D) is the expected information required to correctly classify xi ∈ D based on the partitioning by Aj. Eq. (2) shows the InfoA(D) calculation, where |Dj|/|D| acts as the weight of the jth partition.

Info_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \times Info(D_j)    (2)

Information gain is defined as the difference between Info(D) and InfoA(D), as shown in Eq. (3).

Gain(A) = Info(D) - Info_A(D)    (3)

Quinlan later presented C4.5 (a successor of the ID3 algorithm), which became a benchmark in supervised learning algorithms. C4.5 uses an extension of Information Gain known as the Gain Ratio (Quinlan, 1993). It applies a kind of normalisation of Information Gain using the Split Information, defined analogously to Info(D) as shown in Eq. (4).

SplitInfo_A(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)    (4)

The Aj with the maximum Gain Ratio, defined in Eq. (5), is selected as the splitting feature.

GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}    (5)

The Gini Index is used in the Classification and Regression Trees (CART) algorithm, which generates a binary classification tree for decision making (Breiman, Friedman, Stone, & Olshen, 1984). It measures the impurity of D, a data partition or X, as shown in Eq. (6), where pi is the probability that xi ∈ D belongs to class cl and is estimated by |cl, D|/|D|. The sum is computed over the M classes.

Gini(D) = 1 - \sum_{i=1}^{N} p_i^2    (6)

It considers a binary split and a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini Index of D given that partitioning is shown in Eq. (7).

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)    (7)

For each Aj, each of the possible binary splits is considered. The Aj that maximises the reduction in impurity is selected as the splitting feature, as shown in Eq. (8).

Gini(A) = Gini(D) - Gini_A(D)    (8)

The time and space complexity of a tree depend on the size of the data set, the number of features and the size of the generated tree (Farid et al., 2014). The key disadvantage of DTs is that, without proper pruning (or limiting tree growth), trees tend to overfit the training data. In this paper, we use the C4.5 method to extract the classification rules from the training data. One rule is generated for each leaf node of the tree; each path in the tree from the root to a leaf corresponds to a rule. Algorithm 1 outlines the DT induction algorithm.

Algorithm 1 Decision tree induction.
Input: D = {x1, ..., xi, ..., xN}
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with best splitting, Aj ∈ D;
3: DT = create the root node;
4: DT = add arc to root node for each split predicate and label;
5: for each arc do
6:   Dj created by applying splitting predicate to D;
7:   if stopping point reached for this path then
8:     DT' = create a leaf node and label it with cl;
9:   else
10:    DT' = DTBuild(Dj);
11:  end if
12:  DT = add DT' to arc;
13: end for
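To make Eqs. (1)-(5) concrete, the following minimal Java sketch (our illustration, not code from the original implementation) computes the entropy, information gain and gain ratio of a discrete feature from raw class-count tables. The toy counts in main correspond to a 14-instance, two-class data set split by a three-valued feature.

import java.util.Arrays;

public class SplitMeasures {

  /** Info(D) of Eq. (1): entropy of a vector of class counts. */
  static double info(int[] classCounts) {
    int total = Arrays.stream(classCounts).sum();
    double h = 0.0;
    for (int c : classCounts) {
      if (c == 0) continue;
      double p = (double) c / total;
      h -= p * Math.log(p) / Math.log(2);
    }
    return h;
  }

  /** InfoA(D) of Eq. (2): weighted entropy of the partitions Dj induced by a feature. */
  static double infoA(int[][] partitionCounts, int total) {
    double result = 0.0;
    for (int[] dj : partitionCounts) {
      int sizeDj = Arrays.stream(dj).sum();
      result += (double) sizeDj / total * info(dj);
    }
    return result;
  }

  /** SplitInfoA(D) of Eq. (4). */
  static double splitInfo(int[][] partitionCounts, int total) {
    double s = 0.0;
    for (int[] dj : partitionCounts) {
      int sizeDj = Arrays.stream(dj).sum();
      if (sizeDj == 0) continue;
      double frac = (double) sizeDj / total;
      s -= frac * Math.log(frac) / Math.log(2);
    }
    return s;
  }

  public static void main(String[] args) {
    // Toy example: 14 instances, two classes (9 vs. 5), split by a 3-valued feature.
    int[] classCounts = {9, 5};
    int[][] partitions = {{2, 3}, {4, 0}, {3, 2}};   // class counts inside each Dj
    double gain = info(classCounts) - infoA(partitions, 14);    // Eq. (3)
    double gainRatio = gain / splitInfo(partitions, 14);        // Eq. (5)
    System.out.printf("Gain = %.4f, GainRatio = %.4f%n", gain, gainRatio);
  }
}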

3.2. K-nearest-neighbour (kNN) classifier

kNN is a simple classifier which uses distance measurement techniques that are widely used in pattern recognition (Han et al., 2011; Witten et al., 2011). kNN finds the k instances X = {x1, x2, ..., xk} ∈ Dtraining that are closest to the test instance xtest and assigns the most frequent class label among X to xtest, cl → xtest. When a classification is to be made for a new instance xnew, its distance to each instance in Dtraining must be determined, and only the k closest instances X ∈ Dtraining are considered further. Closeness is defined in terms of a distance metric, such as the Euclidean distance. The Euclidean distance between two points x1 = (x11, x12, ..., x1n) and x2 = (x21, x22, ..., x2n) is shown in Eq. (9).

dist(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}    (9)

For the kNN classifier, the unknown instance xunknown is assigned the most common class cl among its k nearest neighbours. k is chosen to be odd for a two-class classification and, in general, not to be a multiple of the number of classes M. Usually, kNN achieves good results when the data set is large, and the value of k should be large when classifying noisy data. Majority voting over the k nearest neighbours can also be used to deal with noisy instances (Witten et al., 2011). The main disadvantage of the kNN classifier is that it is a lazy learner: it does not learn anything from the training data and simply uses the training data itself for classification. In this paper, we use the kNN classifier to check the class labels of the misclassified instances and to remove ambiguity between contradictory rules. Algorithm 2 outlines the k-nearest-neighbour algorithm.

Algorithm 2 k-nearest-neighbour classifier.
Input: D = {x1, ..., xi, ..., xn}
Output: kNN classifier, kNN.
Method:
1: find X ∈ D that identifies the k nearest neighbours, regardless of class label, cl.
2: out of these instances, X = {x1, x2, ..., xk}, identify the number of instances, ki, that belong to class cl, l = 1, 2, ..., M. Obviously, Σ_i ki = k.
3: assign xtest to the class cl with the maximum number ki of instances.
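As a brief usage sketch (assuming the Weka toolkit described in Section 6.1; the ARFF file names are placeholders), a kNN classifier with Euclidean distance can be trained and queried as follows:

import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnSketch {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train.arff");     // example training file
    train.setClassIndex(train.numAttributes() - 1);
    IBk knn = new IBk(5);                                 // k = 5 nearest neighbours
    knn.buildClassifier(train);                           // lazy learner: essentially stores D
    Instances test = DataSource.read("test.arff");
    test.setClassIndex(test.numAttributes() - 1);
    for (int i = 0; i < test.numInstances(); i++) {
      double predicted = knn.classifyInstance(test.instance(i)); // majority vote of the k neighbours
      System.out.println(test.classAttribute().value((int) predicted));
    }
  }
}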

4. Rule-based classifier

Rules are an approved alternative to DTs (Apté & Weiss, 1997). A rule is a series of tests together with a probability distribution over the class or classes that apply to the instances covered by that rule. In this section, we discuss the proposed ARB classifier for generating a set of classification rules, which combines the random subspace and boosting approaches with an ensemble of DTs.

4.1. Constructing classification rules

Extracting classification rules from DTs is an easy and well-known process (Lawrence & Wright, 2001). Rules are as highly expressive as DTs, so the performance of a rule-based classifier is comparable to that of a DT. One rule is generated for each leaf of the DT; each path in the DT from the root node to a leaf node corresponds to a rule. DT induction is a top-down divide-and-conquer approach which recursively partitions the training data into sub-data sets based on the best splitting feature (Farid et al., 2014). The best splitting feature separates the classes in the training data. The same top-down divide-and-conquer strategy is also used for generating rules; DTs correspond exactly to rules and there is no difference in effect. The key advantage of constructing classification rules from DTs is that each rule represents an independent piece of knowledge. New rules can be added to an existing rule set without disturbing the ones already there, whereas adding to a tree structure may require reshaping the whole tree (Han et al., 2011; Witten et al., 2011). In multi-class classification problems, a DT split takes all classes into account in trying to maximise the purity of the split, whereas the rule-constructing approach concentrates on one class at a time, disregarding what happens to the other classes. No ordering is implied between the rules for one class and those for another, so the rules can be executed in any order. To make incremental changes to existing rule sets, rules with exceptions can be used to represent the entire concept description that correctly classifies all the instances. Usually, rules are pruned to deal with redundant and contradictory instances in the training data. It is also necessary to produce effective classification rules to reduce overfitting on the training data. In ML and DM, overfitting occurs when a mining model describes random error/noise instead of the underlying relationship; generally it occurs when a model is complex and has too many features relative to the number of instances. In this study, we apply the C4.5 algorithm for generating rules from several genomic and benchmark life sciences data sets.

4.2. The proposed adaptive rule-based (ARB) classifier

The proposed ARB classifier combines the random subspace and boosting approaches with an ensemble of DTs to construct a set of classification rules. The random subspace method (or attribute bagging) is an ensemble technique that consists of several classifiers, each operating in a subspace of the original feature space, and outputs the class based on the outputs of these individual classifiers. It has been used for DTs (random decision forests) and is also applicable to many other types of classifiers, such as nearest neighbours, SVMs and one-class classifiers. The random subspace method is an attractive choice for classifying data sets with large numbers of features, such as biological data. Boosting, on the other hand, is designed specifically for classification, to convert weak classifiers into strong ones (Farid, Rahman, & Rahman, 2011; 2011). It is an iterative process that uses voting to combine the outputs of the individual classifiers. In the proposed classifier, we use the random subspace approach to avoid overfitting, the boosting approach to classify noisy instances and the ensemble of decision trees to deal with the class-imbalance problem.
The ARB classifier runs a series of k iterations to generate good classification rules without any need for global optimisation; the procedure is summarised in Algorithm 3. Initially, an equal weight of 1/N was assigned to each instance xi in the original training data set D, where N is the total number of instances. The weights of the training instances were adjusted according to how they were classified in each iteration: if an instance xi was correctly classified, its weight was decreased;

otherwise, if it was misclassified, its weight was increased. An instance's weight therefore reflects how difficult it was to classify. In each iteration, a sub-data set Dj was created from the original training data set D and the previous sub-data set Dj−1 with the maximum-weighted instances. Sampling with replacement was used to create the sub-data set D1 from the original training data D in the first iteration. A tree DTj was built from the sub-data set Dj with randomly selected features in each iteration. Then one rule was generated for each leaf node of DTj; each path in DTj from the root to a leaf corresponds to a rule. The error rate of DTj was calculated as the sum of the weights of the misclassified instances, as shown in Eq. (10), where err(xi) is the misclassification error of instance xi: err(xi) is one if the instance xi is misclassified and zero if it is correctly classified.

error(DT_j) = \sum_{i=1}^{n} w_i \times err(x_i)    (10)

If the error rate of DTj is less than the threshold value, rules are extracted from DTj. We set the threshold value to 0.51, so if a tree DTj correctly classified 49% of the training instances of the sub-data set Dj, we generated rules from that tree. The weights of the correctly classified instances were updated after extracting the rules: if an instance xi in the jth iteration was correctly classified, its weight was multiplied by error(DTj)/(1 − error(DTj)). Then the weights of all instances (including the misclassified instances) were normalised; to normalise a weight, we multiplied it by the sum of the old weights divided by the sum of the new weights. In this way the weights of misclassified instances were increased and the weights of correctly classified instances were decreased. Finally, a sub-data set Dmisclassified was created from Dj with the misclassified instances.

Algorithm 3 Adaptive rule-based (ARB) classifier.
Input: D = {x1, ..., xi, ..., xN}, training data set; k, number of iterations; DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3:   xi = 1/N; // initialising weights of each xi ∈ D.
4: end for
5: for j = 1 to k do
6:   if j == 1 then
7:     create Dj, by sampling D with replacement;
8:   else
9:     create Dj, by Dj−1 and D with maximum weighted X;
10:  end if
11:  build a tree, DTj ← Dj by randomly selected features;
12:  compute error(DTj); // the error rate of DTj.
13:  if error(DTj) ≥ threshold-value then
14:    go back to step 6 and try again;
15:  else
16:    rules ← DTj; // extracting the rules from DTj.
17:  end if
18:  for each xi ∈ Dj that was correctly classified do
19:    multiply the weight of xi by (error(DTj)/(1 − error(DTj))); // update weights.
20:  end for
21:  normalise the weight of each xi ∈ Dj;
22:  rule-set = rule-set ∪ rules;
23: end for
24: return rule-set;
25: create sub-data set, Dmisclassified with misclassified instances from Dj;
26: analyse Dmisclassified employing Algorithm 4.

The ARB classifier can also be used for mining biological big data. We can create several smaller samples (or subsets) of the big data, each of which fits in main memory. Each subset of data is used to construct a set of rules with the ARB classifier, resulting in several sets of classification rules. These rules are then examined and merged to construct the final set of classification rules for the big data. Fig. 3 shows this process.

Fig. 3. Mining big data using the ARB classifier.

In biological big data, new data are generated continuously, so the mining model needs to be updated periodically to adapt to the new data (Farid et al., 2013). The ARB classifier is a good choice for dealing with big biological/genomic data, as new rules derived from new data can be added to the existing rules without disturbing them, and rules can be executed in any order. For mining biological data we sometimes also need constant-based rules from bioinformaticians, which are very difficult to incorporate into traditional ML classifiers like the DT and NB classifiers, but which can easily be combined with the rules generated by the ARB classifier.
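The sketch below illustrates how one boosting-style ARB iteration could look on top of Weka (an approximation for illustration only, not the authors' implementation): it uses J48 as the C4.5 learner, a Remove filter wrapped in a FilteredClassifier for the random subspace, and weighted resampling as a stand-in for the construction of Dj from Dj−1 and D. The class and method names are our own.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;

public class ArbSketch {

  public static List<String> train(Instances data, int k) throws Exception {
    Random rnd = new Random(1);
    List<String> ruleSet = new ArrayList<>();
    for (int i = 0; i < data.numInstances(); i++) {
      data.instance(i).setWeight(1.0 / data.numInstances());   // equal initial weights 1/N
    }
    Instances sub = data.resample(rnd);                        // D1: sampling with replacement
    for (int j = 0; j < k; j++) {
      Remove remove = new Remove();                            // random subspace of the features
      remove.setInvertSelection(true);                         // keep the listed attributes
      remove.setAttributeIndicesArray(randomSubspace(data, rnd));
      FilteredClassifier tree = new FilteredClassifier();
      tree.setFilter(remove);
      tree.setClassifier(new J48());                           // C4.5-style tree on the subspace
      tree.buildClassifier(sub);
      double error = 0, sumW = 0;                              // error(DTj) of Eq. (10)
      for (int i = 0; i < sub.numInstances(); i++) {
        Instance x = sub.instance(i);
        sumW += x.weight();
        if (tree.classifyInstance(x) != x.classValue()) error += x.weight();
      }
      error /= sumW;
      if (error >= 0.51) { j--; continue; }                    // retry this iteration, as in Algorithm 3
      ruleSet.add(tree.toString());                            // one rule per root-to-leaf path
      double beta = error / (1.0 - error);                     // weight-update factor
      double oldSum = sumW, newSum = 0;
      for (int i = 0; i < sub.numInstances(); i++) {
        Instance x = sub.instance(i);
        if (tree.classifyInstance(x) == x.classValue()) x.setWeight(x.weight() * beta);
        newSum += x.weight();
      }
      for (int i = 0; i < sub.numInstances(); i++) {           // renormalise all weights
        Instance x = sub.instance(i);
        x.setWeight(x.weight() * oldSum / newSum);
      }
      sub = sub.resampleWithWeights(rnd);                      // next Dj favours the hard instances
    }
    return ruleSet;
  }

  private static int[] randomSubspace(Instances data, Random rnd) {
    List<Integer> keep = new ArrayList<>();
    for (int a = 0; a < data.numAttributes(); a++) {
      if (a == data.classIndex() || rnd.nextBoolean()) keep.add(a);   // keep class + ~half the features
    }
    int[] idx = new int[keep.size()];
    for (int i = 0; i < idx.length; i++) idx[i] = keep.get(i);
    return idx;
  }
}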

5. Effective rule induction

Rules are easy to deal with in complex classification problems. However, conflicts arise when several rules with different classes apply to the same instance (Han et al., 2011), so we must decide what to do when different rules lead to different classes for the same instance. Incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than re-engineering the entire set. Rules also require modification so that new instances can be classified correctly. In this section, we discuss an algorithm for analysing misclassified instances to remove the ambiguity between rules.

5.1. Reduced-error pruning

In multi-class classification, rule pruning is used to reduce overfitting on noisy training data. Incremental reduced-error pruning for generating rules performs quite well as a global optimisation step on large data sets, and it is well known that repeated incremental pruning produces error reduction. Both the size and the performance of classification rules are significantly improved by post-induction optimisation. An initial set of classification rules and the misclassified instances are examined to reduce classification errors. Error pruning reduces the complexity of the rules and improves classification accuracy by reducing overfitting. Reduced-error pruning has the advantages of simplicity and speed. We split the original training data into two parts: (a) a growing set (or training set) and (b) a pruning set (or testing set). Usually, two-thirds of the training instances were used for the growing set and one-third for the pruning set. The growing set was used to generate a set of classification rules using the ARB classifier summarised in Algorithm 3. Since the classification rules were generated from the instances in the growing set only, the ARB classifier might miss important rules because some key instances had been assigned to the pruning set. The idea of using a separate set for pruning is called reduced-error pruning. A rule generated from the growing set is deleted, and the effect is evaluated by trying out the truncated rule set on the pruning set and seeing whether it performs better than the original; if the truncated rule set performs better, it is kept. This process continues for each rule and for each class, and the overall best rules are established by evaluating the rules on the pruning set. We could also apply another approach that first creates a full, unpruned rule set and prunes it afterwards by discarding individual tests, but this method is much slower (Witten et al., 2011).

5.2. Analysing misclassified instances

Contradictory and redundant training instances make a mining model more complex and less accurate. Most probability-based ML algorithms, like the DT and NB classifiers, suffer from the overfitting problem because of redundant instances in the training data (Farid & Rahman, 2013). Contradictory instances, on the other hand, confuse the mining model, because the same instance may appear with different class values in the training data; in most cases, contradictory instances are mainly responsible for contradictory rules. Therefore, we needed to analyse the instances that were misclassified by the trees in Algorithm 3. To check the classes of the misclassified instances we used the kNN classifier with a feature selection and weighting approach, applying DT induction for the feature selection and weighting. To analyse the misclassified instances, we first built a tree DT from the misclassified instances. Each feature Aj ∈ Dmisclassified that was tested in the tree was then assigned a weight of 1/d, where d is the depth of the tree; features that were not tested in the tree were not considered in the similarity measure of the kNN classifier. Secondly, we applied the kNN classifier to classify each misclassified instance based on the weighted features. Finally, the algorithm updated the class labels of the misclassified instances. After analysing and updating the classes of the misclassified instances we checked for contradictory rules, if any, and removed the ambiguity between them. The process of analysing the misclassified instances is summarised in Algorithm 4.

Algorithm 4 Analysing misclassified instances.
Input: D, original training data; Dmisclassified, data set with misclassified instances;
Output: A set of instances, X, with the right class labels.
Method:
1: build a tree, DT, using Dmisclassified;
2: for each Aj ∈ Dmisclassified do
3:   if Aj is tested in DT then
4:     assign a weight of 1/d to Aj, where d is the depth of DT;
5:   else
6:     do not consider Aj for the similarity measure;
7:   end if
8: end for
9: for each xi ∈ Dmisclassified do
10:  find X ∈ D with the similarity of weighted A = {A1, ..., Aj, ..., An};
11:  find the most frequent class, cl, in X;
12:  assign xi ← cl;
13: end for
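A compact sketch of the relabelling step in Algorithm 4 (our illustration of the idea, not the authors' code) using a feature-weighted Euclidean distance and a majority vote could look as follows; the feature-weight vector is assumed to come from the depth-based 1/d scheme described above, with a weight of zero disabling a feature that is not tested in the tree.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class WeightedKnnRelabel {

  /** Returns the majority class among the k training instances closest to x
   *  under a feature-weighted Euclidean distance (weight 0 disables a feature). */
  static int relabel(double[] x, double[][] train, int[] labels,
                     double[] featureWeights, int k) {
    int n = train.length;
    double[] dist = new double[n];
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      double d = 0.0;
      for (int j = 0; j < x.length; j++) {
        double diff = x[j] - train[i][j];
        d += featureWeights[j] * diff * diff;   // weighted squared Euclidean distance
      }
      dist[i] = d;
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));   // nearest first
    Map<Integer, Integer> votes = new HashMap<>();
    for (int i = 0; i < k && i < n; i++) {
      votes.merge(labels[order[i]], 1, Integer::sum);                 // count votes of the k neighbours
    }
    int best = -1, bestCount = -1;
    for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
      if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
    }
    return best;   // the corrected class label for x
  }
}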

6. Experiments

6.1. Experimental setup

In this study, we implemented the ARB classifier in Java. We used the NetBeans integrated development environment (IDE) 8.1, a software development platform written in Java (https://netbeans.org) that supports all Java application types, such as Java SE, Java ME and Java EE. The code for the DT and kNN classifiers is adapted from Weka 3, the data mining software in Java developed at the University of Waikato in New Zealand (the name stands for Waikato Environment for Knowledge Analysis) (Hall et al., 2009). An advantage of using Weka is that we can access its programs from inside our own Java code, so we can solve the ML problem in our application with minimal additional Java programming. We used the classification accuracy, precision, recall, F-score and 10-fold cross validation to test the performance of the ARB classifier against J48 (the C4.5 DT learner), IBk (the kNN classifier) and NaïveBayes (the standard probabilistic naïve Bayes classifier) in Weka. The classification accuracy is measured by Eq. (11).

accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, \quad x_i \in X    (11)

If xi is correctly classified then assess(xi) = 1; if xi is misclassified then assess(xi) = 0. The calculations of precision, recall and F-score are shown in Eqs. (12) to (14), where TP, TN, FP and FN denote true positives (xi predicted to be in cl and actually in it), true negatives (xi not predicted to be in cl and not actually in it), false positives (xi predicted to be in cl but not actually in it) and false negatives (xi not predicted to be in cl but actually in it), respectively. We considered the weighted average for the precision, recall and F-score analysis; the weight used in the evaluation is the number of instances belonging to one class divided by the total number of instances in the data set.

precision = \frac{TP}{TP + FP}    (12)

recall = \frac{TP}{TP + FN}    (13)

F\text{-}score = \frac{2 \times precision \times recall}{precision + recall}    (14)
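For reference, the metrics of Eqs. (11)-(14) and the 10-fold cross-validation protocol used below can be obtained directly from Weka's Evaluation class. The sketch assumes an ARFF file name of our own choosing.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvalSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("ecoli.arff");        // example benchmark data set
    data.setClassIndex(data.numAttributes() - 1);
    J48 c45 = new J48();                                   // C4.5 learner in Weka
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(c45, data, 10, new Random(1)); // 10-fold cross validation
    System.out.println("Accuracy (%): " + eval.pctCorrect());               // Eq. (11)
    System.out.println("Precision (w. avg.): " + eval.weightedPrecision()); // Eq. (12)
    System.out.println("Recall (w. avg.): " + eval.weightedRecall());       // Eq. (13)
    System.out.println("F-score (w. avg.): " + eval.weightedFMeasure());    // Eq. (14)
  }
}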

The 10-fold cross validation breaks the training data into 10 sets of equal size, trains the classifier on nine of them and tests it on the remaining one; this is repeated 10 times and the mean accuracy rate is taken. The experiments were conducted on a MacBook Pro with a 2.7 GHz quad-core Intel Core i7 processor and 16 GB of RAM. The objective of the experiments is to test the performance of the ARB classifier for classifying DNA variants of BrS in comparison with existing ML and DM algorithms.

6.2. Experiments on genomic data

The disease-causing DNA variants for Brugada syndrome (BrS) are classified into five classes: nonpathogenic; variant of unknown significance (VUS) type 1, 2, or 3; or pathogenic mutation (Hofman et al., 2013). Table 3 shows the classification of gene variants for BrS.

Table 3
Classification of DNA variants for Brugada syndrome.

Class Label   Description
Class I       Nonpathogenic
Class II      VUS1 - Unlikely pathogenic
Class III     VUS2 - Unclear
Class IV      VUS3 - Likely pathogenic
Class V       Pathogenic

Each Exome data set contains 143 features, and the feature values are numeric, nominal and string. First, we applied the gene panel and data pre-processing techniques to each Exome data set (annotated vcf file) to find the variants relevant to BrS. This process reduced the number of variants in each annotated variant call file (vcf file). Fig. 4 shows the number of variants in each annotated vcf file (blue line), the number of variants that satisfy the gene panel (green line), and the number of variants that satisfy BrS (red line).

Fig. 4. Genomic data: 148 Exome data sets.

There were in total 19,687 variants in the 148 Exome data sets (annotated vcf files), and among these 17,795 variants satisfy the gene panel. Finally, after data pre-processing we had only 2143 variants that satisfy BrS. Table 4 tabulates the variants in the 148 Exome data sets with their total, average, maximum and minimum values.

Table 4
Number of variants in 148 Exome data sets.

                          Annotated vcf File   Gene Panel   BrS Variants
Total variants            19,687               17,795       2143
Average no. of variants   133                  120          14
Maximum no. of variants   267                  228          135
Minimum no. of variants   6                    6            0

Table 5
The accuracy, precision, recall and F-score of RainForest, NB, kNN and ARB classifier using training data.

Algorithm        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
RainForest       83.33                         0.76                        0.83                     0.79
NB               83.33                         0.79                        0.83                     0.78
kNN              75                            0.56                        0.75                     0.64
ARB classifier   91.66                         0.95                        0.91                     0.92

We extracted classification rules from each Exome data set by employing the ARB classifier. The efficiency of existing ML and DM algorithms such as C4.5, NB and kNN has been well established for relatively small data sets, but they perform poorly on the Exome data sets. As each Exome data set contains 143 features and only a few variants, we applied the RainForest tree method with the C4.5 algorithm to build a DT from the 148 Exome data sets. RainForest adapts to the amount of main memory available and applies to any DT induction algorithm (Gehrke et al., 2000). The method maintains an AVC-set (Attribute-Value, Class-label) for each feature at each tree node, describing the training instances at that node: the AVC-set of a feature at a tree node gives the class label counts for each value of that feature over the instances at the node. We also used AVC-sets to calculate the prior and class-conditional probabilities of the NB classifier on the 148 Exome data sets. The performance of the ARB classifier against the RainForest, NB and kNN classifiers on the 148 Exome data sets is shown in Table 5. The ARB classifier correctly classifies 91% of the gene variants for BrS using the training data. We considered five iterations for the ARB classifier on each Exome data set.
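The AVC-set bookkeeping described above can be sketched as a single pass over the instances at a node (a toy illustration under our own assumptions, not the RainForest implementation):

import java.util.HashMap;
import java.util.Map;

public class AvcSetSketch {

  /** AVC-set of one feature: for every feature value, a count per class label. */
  static Map<String, Map<String, Integer>> buildAvcSet(String[] featureValues,
                                                       String[] classLabels) {
    Map<String, Map<String, Integer>> avc = new HashMap<>();
    for (int i = 0; i < featureValues.length; i++) {
      avc.computeIfAbsent(featureValues[i], v -> new HashMap<>())
         .merge(classLabels[i], 1, Integer::sum);
    }
    return avc;
  }

  public static void main(String[] args) {
    // Toy node with five instances of a nominal feature and two variant classes.
    String[] values = {"missense", "missense", "synonymous", "frameshift", "synonymous"};
    String[] labels = {"Class IV", "Class I", "Class I", "Class V", "Class I"};
    System.out.println(buildAvcSet(values, labels));
    // e.g. {missense={Class IV=1, Class I=1}, synonymous={Class I=2}, frameshift={Class V=1}}
  }
}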

Table 6
The accuracy, precision, recall and F-score of RainForest, NB, kNN and ARB classifier using 10-fold cross-validation.

Algorithm        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
RainForest       58.33                         0.46                        0.58                     0.51
NB               58.33                         0.63                        0.58                     0.6
kNN              50                            0.33                        0.5                      0.4
ARB classifier   75                            0.73                        0.75                     0.68

Table 7
The accuracy, precision, recall and F-score of RainForest, NB, kNN and ARB classifier using testing data.

Algorithm        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
RainForest       50                            0.33                        0.5                      0.4
NB               50                            0.25                        0.5                      0.62
kNN              50                            0.25                        0.5                      0.33
ARB classifier   66.66                         0.44                        0.66                     0.53

Table 8
Data set descriptions.

No.   Data sets        Instances   No of Att.   Att. Types   Classes
1     Appendicitis     106         7            Numeric      2
2     Breast cancer    286         9            Nominal      2
3     Contraceptive    1473        9            Numeric      3
4     Ecoli            336         7            Numeric      8
5     Heart            270         13           Numeric      2
6     Pima diabetes    768         8            Numeric      2
7     Iris             150         4            Numeric      3
8     Soybean          683         35           Nominal      19
9     Thyroid          215         5            Numeric      2
10    Yeast            1484        8            Numeric      10

We tested the performance of the ARB classifier against the RainForest, NB and kNN classifiers using 10-fold cross validation on the 148 Exome data sets, as shown in Table 6. Table 7 shows the performance of the ARB classifier against the RainForest, NB and kNN classifiers on the unseen test variants of 45 Exome data sets, where the remaining 103 Exome data sets were used for training the models. Knowledge extraction from genomic big data is a very challenging task, as genomic data contain a very small number of variants/instances with a large number of input features. Feature selection therefore plays an important role in DNA variant classification: each Exome data set has 143 features, but we considered the 92 most relevant features for the experiments. The ARB classifier generates classification rules from each Exome data set and merges the new rules (generated from new Exome data sets) with the existing rules. It can correctly classify the BrS DNA variants with an acceptable accuracy rate. So, the research hypothesis is that the ARB classifier (an ensemble method) can increase the prediction accuracy for the classification of genomic data.

6.3. Experiments on benchmark life sciences data

We tested the performance of the ARB classifier against the C4.5, kNN and NB classifiers on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) ML repository (Lichman, 2013). Table 8 describes the data sets. Firstly, we evaluated the performance of the ARB classifier against the C4.5, kNN and NB classifiers using the classification accuracy, precision, recall and F-score with 10-fold cross validation on the 10 benchmark life sciences data sets shown in Table 8. The comparison of the classification accuracies of the ARB, C4.5, kNN and NB classifiers is given in Table 9.


Table 9
The classification accuracy (%) of C4.5, kNN, NB and ARB classifier with 10-fold cross validation.

Data sets        C4.5    kNN     NB      ARB classifier
Appendicitis     85.84   86.79   85.84   87.73
Breast cancer    75.52   73.42   71.67   75.52
Contraceptive    50.98   49.76   48.13   50.1
Ecoli            79.76   83.03   78.86   83.92
Heart            77.40   78.88   83.7    83.7
Pima diabetes    73.82   73.17   76.3    75.65
Iris             96      95.33   96      95.33
Soybean          91.50   90.19   92.97   91.94
Thyroid          98.13   97.2    98.13   98.13
Yeast            56.73   56.94   57.88   61.99

Table 10
The accuracy, precision, recall and F-score for C4.5 classifier with 10-fold cross validation.

Data sets        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
Appendicitis     85.84                         0.84                        0.85                     0.85
Breast cancer    75.52                         0.75                        0.75                     0.71
Contraceptive    50.98                         0.5                         0.51                     0.5
Ecoli            79.76                         0.78                        0.79                     0.79
Heart            77.4                          0.77                        0.77                     0.77
Pima diabetes    73.82                         0.73                        0.73                     0.73
Iris             96                            0.96                        0.96                     0.96
Soybean          91.5                          0.91                        0.91                     0.91
Thyroid          98.13                         0.98                        0.98                     0.98
Yeast            56.73                         0.55                        0.56                     0.56

Fig. 5. The comparison of classification accuracies among C4.5, kNN, NB and ARB classifiers.

The ARB classifier outperforms the C4.5, kNN and NB classifiers on the appendicitis, ecoli and yeast data sets. It also outperforms the kNN and NB classifiers on the breast cancer and contraceptive data sets, and the C4.5 and kNN classifiers on the heart, pima diabetes and soybean data sets. For the thyroid data set, the C4.5, NB and proposed classifiers have the same classification accuracy rate. Fig. 5 shows the comparison of the classification accuracies of the C4.5, kNN, NB and proposed classifiers on the 10 life sciences data sets with 10-fold cross validation. The results shown in Tables 10-13 indicate that the ARB classifier obtains similar, and often better, results compared with traditional ML and DM techniques like the C4.5, kNN and NB classifiers. The C4.5 classifier is one of the most popular classification methods and is applied in many real-life applications. Table 10 presents the classification accuracy and the weighted averages of precision, recall and F-score of the C4.5 classifier with 10-fold cross validation on the 10 data sets.

Table 11
The accuracy, precision, recall and F-score for kNN classifier (where k = 5) with 10-fold cross validation.

Data sets        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
Appendicitis     86.79                         0.86                        0.86                     0.86
Breast cancer    73.42                         0.72                        0.73                     0.67
Contraceptive    49.76                         0.49                        0.49                     0.49
Ecoli            83.03                         0.82                        0.83                     0.82
Heart            78.88                         0.78                        0.78                     0.78
Pima diabetes    73.17                         0.72                        0.73                     0.76
Iris             95.33                         0.95                        0.95                     0.95
Soybean          90.19                         0.91                        0.90                     0.90
Thyroid          97.2                          0.97                        0.97                     0.97
Yeast            56.94                         0.56                        0.56                     0.56

Table 12
The accuracy, precision, recall and F-score for NB classifier with 10-fold cross validation.

Data sets        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
Appendicitis     85.84                         0.86                        0.85                     0.86
Breast cancer    71.67                         0.7                         0.71                     0.7
Contraceptive    48.13                         0.5                         0.48                     0.48
Ecoli            78.86                         0.79                        0.78                     0.79
Heart            83.7                          0.83                        0.83                     0.83
Pima diabetes    76.3                          0.75                        0.76                     0.76
Iris             96                            0.96                        0.96                     0.96
Soybean          92.97                         0.93                        0.93                     0.92
Thyroid          98.13                         0.98                        0.98                     0.98
Yeast            57.88                         0.58                        0.57                     0.57

Table 13
The accuracy, precision, recall and F-score for ARB classifier (where k = 3) with 10-fold cross validation.

Data sets        Classification accuracy (%)   Precision (weighted avg.)   Recall (weighted avg.)   F-score (weighted avg.)
Appendicitis     87.73                         0.87                        0.87                     0.87
Breast cancer    75.52                         0.75                        0.75                     0.71
Contraceptive    50.1                          0.49                        0.5                      0.5
Ecoli            83.92                         0.82                        0.83                     0.83
Heart            83.7                          0.83                        0.83                     0.83
Pima diabetes    75.65                         0.75                        0.75                     0.75
Iris             95.33                         0.95                        0.95                     0.95
Soybean          91.94                         0.92                        0.91                     0.91
Thyroid          98.13                         0.98                        0.98                     0.98
Yeast            61.99                         0.61                        0.62                     0.61

The C4.5 classifier has good accuracy rates of more than 90% on the Iris, Soybean and Thyroid data sets. To compare the results of different classification methods, we also considered the kNN classifier in this study. kNN gives results similar to C4.5 on these data sets, as shown in Table 11, but it reaches 86% accuracy on the Appendicitis data set and 83% accuracy on the Ecoli data set. The NB classifier is another strong classification method, based on Bayes' theorem with independence assumptions between the features. It calculates the prior and class-conditional probabilities from the training data and then predicts the class label of a test instance as the one with the highest posterior probability. Table 12 tabulates the classification accuracy and the weighted averages of precision, recall and F-score of the naïve Bayes (NB) classifier with 10-fold cross validation on the 10 data sets.

Table 14
The classification accuracy (%) of C4.5, kNN, NB and ARB classifier with 10-fold cross validation on data sets having 10% noisy instances.

Data sets        C4.5    kNN     NB      ARB classifier
Appendicitis     76.41   77.35   75.47   79.58
Breast cancer    69.23   70.27   70.62   70.83
Contraceptive    49.96   45.01   47.92   50.1
Ecoli            69.04   67.55   67.55   71.27
Heart            66.29   68.14   71.48   73.22
Pima diabetes    65.88   68.09   71.35   72.28
Iris             84      85.33   80.66   86
Soybean          85.21   84.91   86.67   88.32
Thyroid          87.21   85.72   88.59   89.47
Yeast            51.81   52.41   51.97   54.19

Table 15
The classification accuracy (%) of C4.5, kNN, NB and ARB classifier with 10-fold cross validation on data sets having 20% noisy instances.

Data sets        C4.5    kNN     NB      ARB classifier
Appendicitis     66.98   63.2    56.6    69.17
Breast cancer    69.58   65.03   67.48   70.78
Contraceptive    46.15   42.34   46.97   48.14
Ecoli            52.08   59.82   59.82   63.86
Heart            63.33   63.33   65.92   68.74
Pima diabetes    61.58   59.37   64.84   65.06
Iris             66      68      70      72
Soybean          78.18   77.3    79.94   81.13
Thyroid          77.93   75.17   79.76   81.59
Yeast            43.36   43.91   43.27   45.45

The NB classifier has good classification accuracies on the Appendicitis, Heart, Pima diabetes, Iris, Soybean and Thyroid data sets. Table 13 presents the classification accuracy and the weighted averages of precision, recall and F-score of the ARB classifier with 10-fold cross validation on the 10 data sets; we considered three iterations for the ARB classifier on each data set. It obtains 88% accuracy on the Appendicitis data set, 84% accuracy on the Ecoli data set, 84% accuracy on the Heart data set and 98% accuracy on the Thyroid data set. Therefore, the ARB classifier achieves better, or at least equal, performance compared with the C4.5, kNN and NB classifiers. Secondly, we used noisy data sets to test the performance of the C4.5, kNN, NB and ARB classifiers. We artificially injected contradictory instances into each data set in Table 8; contradictory instances have the same feature values but different class labels, which makes the mining classifiers less accurate. We also changed the class labels of randomly selected instances and added redundant instances to each data set in Table 8. The classification accuracies of the C4.5, kNN, NB and ARB classifiers with 10-fold cross validation on the 10 data sets containing 10% and 20% noisy instances are presented in Tables 14 and 15, respectively; Figs. 6 and 7 plot the data tabulated in Tables 14 and 15. The ARB classifier outperforms the C4.5, kNN and NB classifiers on every noisy data set with 10% and 20% noisy instances. To deal with noisy instances, we considered three iterations for the proposed ARB classifier, which pays more attention to the misclassified instances in the next iteration when extracting the classification rules from the training data. Regarding Table 15 (data sets with 20% noisy training instances), the proposed method achieved the best result for all data sets. The traditional learning algorithms like C4.5, kNN and NB perform poorly on high-dimensional noisy data, and they also perform poorly when small data sets contain contradictory and redundant training instances.


Fig. 6. The comparison of classification accuracies among C4.5, kNN, NB and ARB classifier on data sets having 10% noisy instances.


Fig. 7. The comparison of classification accuracies among C4.5, kNN, NB and ARB classifier on data sets having 20% noisy instances.

7. Conclusions

Mining biological/genomic big data is a challenging task due to its high-dimensional, incomplete, noisy and inconsistent nature. Traditional ML for DM algorithms such as C4.5, kNN and NB perform well on small data sets, but their classification accuracy declines drastically on big biological data. This paper investigated the performance of traditional ML and DM algorithms on 148 Exome data sets and 10 real benchmark life sciences data sets from the UCI machine learning repository, and proposed an ARB classifier to address the shortcomings of the traditional classifiers. The ARB classifier combines the random subspace and boosting approaches with an ensemble of DTs to construct a set of classification rules for classifying multi-class genomic data. It uses DTs to evolve classification rules from the training data and kNN to analyse the training instances that were misclassified by the DTs. The ARB classifier deals with overfitting, noisy instances and the class-imbalance problem, whereas the traditional ML algorithms fail to reach a similar degree of accuracy. The main advantage of the ARB classifier is that it can be used for mining big, heterogeneous biological data. It also offers good adaptability: new rules can be added to the existing rules without disturbing the ones already there, and the rules can be executed in any order. The ARB classifier outperforms the traditional ML classifiers on all the data sets. The experimental results show that the ARB classifier achieves excellent classification accuracy on different types of biological data sets, from small data sets to big biological data. This paper addresses these research challenges by developing an optimal ensemble method for knowledge discovery in genomic big data. A further goal is to extract relevant information and new knowledge from genetic data and transfer it to the clinical setting, with the aim of providing an improved and more effective diagnosis process for patients. In future work, we will use the ARB classifier to extract classification rules for other oligogenic diseases such as Epileptic Encephalopathies and Cleft Lip and/or Palate. Rule weighting will also be used to extract information from genomic and clinical data for pathology classification tasks. In ML for DM applications, we will use the ARB classifier as a base classifier for active learning and for multi-class classification with a dynamic feature set.

Acknowledgment

We appreciate the support for this research received from the BRiDGEIris (BRussels big Data platform for sharing and discovery in clinical GEnomics) project, which is hosted by IB2 (Interuniversity Institute of Bioinformatics in Brussels) and funded by INNOVIRIS (Brussels Institute for Research and Innovation). This research was also supported by FWO research project G004414N, "Machine Learning for Data Mining Applications in Cancer Genomics".



