Knowl Inf Syst (2009) 19:79–105 DOI 10.1007/s10115-008-0147-1 REGULAR PAPER

Encoding and decoding the knowledge of association rules over SVM classification trees

Shaoning Pang · Nikola Kasabov

Received: 15 May 2007 / Revised: 16 March 2008 / Accepted: 25 April 2008 / Published online: 24 June 2008 © Springer-Verlag London Limited 2008

Abstract This paper presents a constructive method for association rule extraction, where the knowledge of the data is encoded into an SVM classification tree (SVMT), and linguistic association rules are extracted by decoding the trained SVMT. The method of rule extraction over the SVMT (SVMT-rule), in the spirit of decision-tree rule extraction, achieves rule extraction not only from SVMs, but also over the decision-tree structure of the SVMT. Thus, the rules obtained from SVMT-rule have the good comprehensibility of decision-tree rules while retaining the good classification accuracy of SVM. Moreover, profiting from the strong generalization ability of the SVMT owing to the aggregation of a group of SVMs, the SVMT-rule is capable of performing very robust classification on datasets that have a seriously, even overwhelmingly, class-imbalanced data distribution. Experiments with a Gaussian synthetic dataset, seven benchmark cancer-diagnosis datasets, and one application of cell-phone fraud detection have highlighted the utility of SVMT and SVMT-rule for comprehensible and effective knowledge discovery, as well as the superior properties of SVMT-rule as compared to a purely support-vector based rule extraction. (A version of the SVMT Matlab software is available online at http://kcir.kedri.info)

Keywords Association rule extraction · Support vector machine · SVM aggregating intelligence · SVM ensemble · SVM classification tree · Class imbalance · Class overlap

S. Pang (B) · N. Kasabov
Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand
e-mail: [email protected]
N. Kasabov
e-mail: [email protected]

1 Introduction

Since support vector machines (SVMs) [7–9] demonstrate a good ability in classification and regression, rule extraction from a trained SVM (SVM-rule) accordingly performs well on

the decision making for data mining and knowledge discovery, sometimes even better than the SVM itself [1–6,31,33]. However, the rules obtained from SVM-rule in practice are less comprehensible than expected, because a large number of incomprehensible numerical parameters (i.e., support vectors) appear in these rules. Compared to SVM-rule, the decision tree is a simple but very efficient rule extraction method in terms of comprehensibility [35]. The rules extracted from a decision tree may not be as accurate as SVM rules, but they are easy to comprehend because every rule represents one decision path that is traceable in the decision tree.

1.1 Background

In the literature, rule extraction based on a single SVM focuses on interpolating the support vectors and the hyperplane boundary of a trained SVM into a set of linguistic rule expressions. Nunez et al. [1] extract a rule by first clustering support vectors by k-means, then, for each cluster, choosing the prototype vector (i.e., the center of the cluster) and the support vector furthest from the prototype vector to build an ellipsoid using a geometric method. Finally, the ellipsoid is translated into a linguistic if-then rule by a symbolic interpretation. Nunez's method uses a reduced number of support vectors, but the generated ellipsoid rules overlap each other. In addition, this method requires parameters such as the number of clusters and the initial cluster centers as prior knowledge. As an improvement over Nunez's method, Zhang et al. [2] proposed a hyper-rectangle rule extraction (HRE) method. Instead of clustering support vectors, HRE clusters the original data using a support vector clustering (SVC) method to find the prototypes of each class of samples, then constructs hyper-rectangles from the obtained prototypes and the support vectors of the trained SVM. Because of the merits of the SVC algorithm, HRE can generate high-quality rules even when the training data contains outliers. Later, Fu et al. developed rule extraction from support vector machines (RulExSVM) [3,6]. RulExSVM generates rules straightforwardly based on each of the support vectors. The rule condition is structured as a conjunction of several attribute conditions, each of which is built upon a hyper-rectangle associated with a certain support vector. RulExSVM is easy, but needs a tuning and pruning phase to reduce rules that overlap and outliers.

The above SVM rule extraction methods are regarded as incomprehensible techniques because they are completely support-vector based rule extraction methods, and the knowledge of the obtained rules is all concealed in a number of numerical support vectors that are normally not transparent to the user. To mitigate this problem, the decision tree is a good rule extraction example that is recommended here for SVM rule extraction, because every rule generated by a decision tree represents a certain decision path that has a comprehensible rule antecedent and rule consequence. For instance, the C4.5 Rule [18,32] interpolates every path from the root to a leaf of a trained C4.5 decision tree into a rule, by regarding all the test conditions as the conjunctive rule antecedents and the class label held by the leaf as the rule consequence. The comprehensibility of the C4.5 Rule is better than that of the C4.5 decision tree, because the C4.5 Rule has the knowledge over a decision tree fused and grouped, and comes with a concise knowledge structure.
1.2 Relevant works

Unlike decision trees, the SVM classification tree (SVMT) is a type of SVM-based aggregation method that combines a family of concurrent SVMs for optimized problem-solving in artificial intelligence [12]. SVMT differs from the well-known SVM ensemble [11,19] aggregation in that (1) the SVM ensemble takes the number of SVMs for aggregation

and the structure of aggregation as prior knowledge that is assumed to be known in advance, whereas the SVMT learns this prior knowledge automatically from data; and (2) the SVMT has an even more outstanding generalization ability than the SVM ensemble, in particular when it is confronted by tasks with a serious class imbalance [13]. Towards a rule extraction that is both more comprehensible and more robust to class imbalance, this paper presents a new type of SVMT in the form of depth-first spanning (DFS) SVMT and breadth-first spanning (BFS) SVMT, and delivers rule extraction from the SVMT through a rule induction over the constructed SVM decision tree and the support vectors (SVs) from local SVMs.

1.3 Paper organization

The remainder of this paper is structured as follows: Sect. 2 introduces the difficulties and motivations for SVMT-rule. Section 3 derives the decision function and loss function of the SVMT. Section 4 presents the spanning algorithms of DFS-SVMT and BFS-SVMT. Section 5 describes the techniques of rule extraction over the SVMT (i.e., SVMT-rule). In Sect. 6, experiments with a synthetic dataset and two applications of cancer diagnosis and fraud detection are discussed in detail. Finally, the conclusions and outlines of future work are given in Sect. 7.

2 Difficulties and motivations for SVMT-rule

2.1 Comprehensibility

A difficulty with SVM rule extraction is that the extracted rules have a comprehensibility problem, because support vectors (SVs) are represented in the rules as a set of incomprehensible numerical parameters, and in many applications we have to deal with a large number of SVs. To solve the comprehensibility problem of SVM rule extraction, achieving a smaller number of SVs while keeping a good accuracy of the extracted rules, several types of method have been researched so far: (1) the decision-tree interpolation method [38,39], which uses no SVs but a decision tree to interpolate the result of the trained SVM for rule extraction; owing to the decision-tree interpolation, the extracted rules are as comprehensible as decision-tree rules, but lose the accuracy of the SVM because no SVs are encoded in the extracted rules. (2) The SV clustering method [2], which reduces the number of SVs by clustering them and using only the SV cluster centers for rule extraction; the retention of SVM classification accuracy is not ensured for the extracted rules, because the altered SVs actually make the hyperplane different from the original SVM. (3) The SV grouping and selecting method [40], which is more effective than the above two because it introduces cost functions for grouping and selecting representative SVs for rule extraction; the disadvantage of this method is that it is difficult to define a cost function that follows all performance evaluation criteria, including true positives (TPs), false positives (FPs), and AUC (the area under the receiver operating characteristic (ROC) curve).

2.2 Class imbalance problem

On the other hand, support-vector based rule extraction also has the difficulty of class imbalance. In other words, when class imbalance occurs in a dataset, the number of samples from one class is much larger than that from the other class (in the case of a binary-class

dataset, the class with the smaller number of samples is called the 'skewed' class, and the other class is called the 'overweighted' class). In this situation, the SVM often fails to classify the skewed class correctly [42–45], as most classification algorithms, including SVM, assume a balanced distribution of classes [41]. Thus, for an SVM rule extraction, the extracted rules frequently favor the classification of the overweighted class, but lose almost completely the accuracy of the skewed-class classification. For the class-imbalance problem, data sampling is a straightforward solution. The approaches include undersampling, which reduces the number of samples in the bigger class closer to the smaller class; oversampling, which repeatedly takes random subsamples from the two classes [48] and keeps a desired ratio of the two classes in each subset of samples; and resampling, which is a mixture of undersampling and oversampling [46,47]. Algorithmic methods are a more effective solution. The approaches include adjusting the decision class probability, using different penalties for different classes [44,45], and modifying the learning function to adjust the class boundary [42,43]. As class imbalance often occurs together with class overlap in many applications, the above approaches still confront the following problem: while improving the recognition of the skewed class, they degrade too much the performance of the overweighted-class classification, in particular when a serious class overlap is involved.

2.3 Motivations

Towards dealing with both of the above difficulties of SVM rule extraction, our idea is to put the numerical support-vector rules into a comprehensible decision-tree structure, so that the comprehensibility of the rule set is improved, because different sets of support-vector rules become traceable within the decision tree, and the number of support vectors is reduced, as part of the support-vector rules are transformed into simple and comprehensible decision-tree rules. On the other hand, the support-vector rules are encoded from a group of individual simple SVMs, such as linear SVMs, instead of one SVM with a complicated classification hyperplane, so that the class-imbalance problem can be treated easily by allocating more computing power (i.e., individual SVMs) to the classification of the skewed class.

Fig. 1 The summary of the steps of the proposed SVMT-rule algorithm: input training dataset X; construct the SVM classification tree; rule extraction over the SVMT (rule extraction from the tree structure, over the SVM nodes, and over the one-class SVM nodes); output the merged rule set R


Motivated by this, we propose in this paper a new method for rule extraction from SVM. We encode the knowledge of rules by aggregating a set of SVMs into a decision tree, and we extract rules by decoding the knowledge out of the SVMT. The proposed SVMT-rule algorithm consists of SVM classification tree construction, followed by three steps of rule extraction over the SVMT: rule extraction over the tree structure, over the SVM nodes, and over the one-class SVM nodes, respectively. The individual steps of SVMT-rule are detailed in the sections below and shown schematically in Fig. 1.

3 SVM classification tree modeling

The SVMT was first proposed in [13], in the context of a face membership authentication application [11]. The SVMT was shown to be very capable of reducing the classification difficulty due to class overlap in a recursive procedure, and thus is useful in pattern recognition problems with large training samples but with noise information suppressed via feature extraction. Unlike SVM ensemble aggregation [19], which assumes that the number of SVMs in the aggregation is known in advance as prior knowledge even though this number is often difficult to determine in real applications, SVMT aggregation solves this difficulty of the SVM ensemble by learning the number of SVMs in an SVMT automatically from the data. However, the spanning of the previous SVMT in [13] is completely data-driven, which easily leads to overfitting and produces a large decision tree. Obviously, this type of SVMT is not optimal for rule extraction and decision making [37]. To deal with this difficulty, a new type of SVMT with a spanning-order preference is modeled as follows.

3.1 SVMT decision function

Mathematically, a 2-SVMT can be formulated as a composite structural model as follows: given a 2-class dataset D for classification, and a predefined data partitioning function P on D, the whole dataset D can be divided through an optimized P* into N partitions {g_1, g_2, ..., g_N}. Then, a 2-SVMT can be modeled as

T_{2\text{-}SVMT}\bigl(P^*, f_{Svm_i}, f_{Sgn_j^{<1>}}, f_{Sgn_k^{<2>}}\bigr), \quad i = 1,\ldots,I, \; j = 1,\ldots,J, \; k = 1,\ldots,K,    (1)

where Svm_i is a local 2-class SVM classifier, and Sgn_j^{<1>} and Sgn_k^{<2>} represent a regional 'one-class classifier' of class 1 and class 2, respectively. I, J, K, such that I + J + K = N, correspond to the numbers of the three data-partition types: partitions with data from class 1 and class 2, partitions with only data from class 1, and partitions with only data from class 2, respectively. I, J, K and N are determined after the SVM tree model generation. Figure 2 gives an example of a 2-class SVM and a one-class SVM over a data partition with data in two classes and in one class, respectively.

In the case that g_i contains two-class data, a typical 2-class SVM f_{Svm}, as in Fig. 2a, is applied to model a local f_i on g_i as

f_{Svm} = \sum_{i=1}^{l} y_i \bigl( (w_i)^T \varphi(x_i) + b_i^* \bigr),    (2)

Fig. 2 Example of a 2-class SVM and a one-class SVM over one data partition, in the case that the data is in two classes or all in one class: a two-class linear SVM, b one-class SVM

where ϕ is the kernel function, l is the number of training samples, and w, b are optimized through

\min \; \frac{1}{2} (w_i)^T w_i + C \sum_{t=1}^{L} (\xi_t^i)^k,    (3)
\text{subject to} \quad y_t \bigl( (w_i)^T \varphi(x_t) + b_i \bigr) \ge 1 - \xi_t^i,

where C and k are used to weight the penalizing variable ξ, and ϕ(·) acts as the kernel function.

In the other case, when g_i contains only data from one class, g_i can be modeled strictly by a one-class classifier [14–17], where outlier samples are identified as negative samples among the positive samples, the original data of the partition. Following [14], a one-class SVM function can be modeled as an outlier classifier, as in Fig. 2b, by setting positive on the original data of the partition S and negative on the complement S̄:

f_{Sgn^{<i>}} = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \in \bar{S} \end{cases}    (4)

where i represents the class label '1' or '2' in binary classification. Note that a one-class SVM node is modeled under the assumption that g_i contains not only the one-class samples, but also some outlier samples (i.e., samples that do not belong to that one class). In the case that g_i contains no outlier sample, the one-class SVM (node) is equivalent to a clustering decision function: if x, by P, belongs to class i, then the output of the one-class classifier is assigned as i. Thus, we have the decision function of the 2-SVMT, f̂, as

\hat{f}(x) = \begin{cases} f_{Sgn^{<1>}} & \text{if } P(x) \in \text{class 1} \\ f_{Sgn^{<2>}} & \text{if } P(x) \in \text{class 2} \\ f_{Svm} & \text{otherwise} \end{cases}    (5)

where the one-class classifiers Sgn^{<1>} and Sgn^{<2>} are the one-class regional decision makers of the 2-SVMT, and the typical single SVM classifier Svm is the two-class regional decision maker of the 2-SVMT.

3.2 SVMT loss function

Clearly, errors may occur in the classification of the constructed f̂, as f̂ may differ from the true classification function f. Thus, a suitably chosen real-valued loss function L = L(f̂, f) is used to capture the extent of this error. The loss L is therefore data dependent:

L = |\hat{f} - y|    (6)

where y = f(x) represents the true classification value. As f̂ is applied to datasets drawn from the whole data D under a distribution g, the expected loss can be quantified as

E[L] = \int |\hat{f} - y| \, g(D) \, dD.    (7)

In experiments, g can be realized by applying a tenfold cross-validation policy on D. Since we have no information about the real classification function f, for simplicity we can fix y as the class label of the training dataset. Thus, given data drawn under a distribution g, constructing an SVM tree requires choosing a function f* such that the expected loss is minimized:

f^* = \arg\min_{\hat{f} \in F} \int |\hat{f} - y| \, g(D) \, dD    (8)

where F is a suitably defined set of SVM tree functions. Substituting f̂ into Eq. (6), the loss function of the 2-SVMT is

L = \sum_{i=1}^{I} |f_{Svm_i} - f_i| \, p_{2i} + \sum_{j=1}^{J} |f_{Sgn_j^{<1>}} - f_j| \, q_{1j} + \sum_{k=1}^{K} |f_{Sgn_k^{<2>}} - f_k| \, q_{2k}    (9)

where f_i, f_j and f_k are the local true values of f, p_{2i} represents the distribution probability of the ith partition that contains both class 1 and class 2 data, and q represents the distribution probability of a partition that contains the data of one class (i.e., either class 1 or class 2). In Eq. (9), L is determined by the performance of every regional classification from a regional SVM classifier or a one-class classifier. Given every SVM in the 2-SVMT with the


same kernel and penalty parameters, the loss function L now depends only on the data distribution of the local regions. In other words, the data partitioning method P eventually determines the loss function of the resulting SVM tree. In this sense, the construction of the SVMT f* is equivalent to seeking a partition function P* that is able to minimize the following loss function:

P^* = \arg\min_{P \in [P]} \int |\hat{f} - y| \, g(D) \, dD
    = \arg\min_{P \in [P]} \Bigl( \sum_{i=1}^{I} |f_{Svm_i} - f_i| \, p_{2i} + \sum_{j=1}^{J} |f_{Sgn_j^{<1>}} - f_j| \, q_{1j} + \sum_{k=1}^{K} |f_{Sgn_k^{<2>}} - f_k| \, q_{2k} \Bigr)    (10)

where [P] is a set of suitably defined 2-split data partition functions P. The first term of the above function is the risk from SVM classification, and the remaining two terms are the risks from the one-class classifiers f_{Sgn}. The risk from f_{Sgn} can be minimized through the training of a one-class SVM. In the case that Sgn is simplified by making every classification decision through P, the classification risk can be simply evaluated by the loss function of P, which has already been minimized when the data partitioning is performed. For a binary partitioning [X_1, X_2] = P(X), the risk can likewise be minimized by an SVM training using X_1 and X_2 as class '+1' and '−1', respectively. So, Eq. (10) can be updated as

P^* = \arg\min_{P \in [P]} \int |\hat{f} - y| \, g(D) \, dD
    = \arg\min_{P \in [P]} \sum_{i=1}^{I} |f_{Svm_i} - f_i| \, p_{2i}.    (11)

Thus, we can solve the above function by discovering every individual partition with a minimized SVM classification risk,

P^* = \arg\min_{P \in [P]} |f_{Svm_i} - f_i| \, p_{2i}.    (12)

Equation (12) is proportional to the size of the region (i.e., the number of samples in the region); thus, in practice, Eq. (12) can be implemented by seeking data partitions of different sizes with a minimized SVM classification risk. To this end, P* is modelled as the following recursive, supervised and scalable data partitioning procedure:

[g_1, g_2, \ldots, g_j, \ldots, g_N] = P^n(X, \rho_0),    (13)
\text{subject to:} \quad L_{Svm}(g_j) < \xi \quad \text{or} \quad g_j \in \text{one class},

where X is the input dataset, ρ_0 is the initial partitioning scale, L_{Svm} = |f_{Svm_i} - f_i| is a local 2-class SVM loss function, and ξ is the SVM class-separability threshold. P^n represents a recursive partitioning function such that P^n(X, ρ) = P(P^{n-1}(X, ρ_{n-1}), ρ_n). At every iteration, the optimal partitions are extracted, and the remaining samples go to the next iteration of partitioning. As a result of Eqs. (12) and (13), partitions with L_{Svm} minimized are produced by P at different scales. Each of them builds a 2-class regional decision maker f_{Svm}. Meanwhile, partitions that contain only one-class data are also produced; these build the one-class regional decision makers f_{Sgn^{<1>}} and f_{Sgn^{<2>}}.
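As a concrete illustration of Eq. (13), the following is a minimal Python sketch of the recursive, scalable partitioning, assuming K-means as the partitioning function P (as in the experiments of Sect. 6), linear SVMs, and the empirical 0-1 error on a partition as a stand-in for L_Svm. The function names recursive_partition and svm_loss are ours, not the paper's, and the scale is refined here by increasing the number of clusters rather than by the decrement Δρ used for scale-based partitioners.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def svm_loss(X, y, C=0.1):
    """Empirical 0-1 error of a linear SVM trained on one partition (a stand-in for L_Svm)."""
    svm = SVC(kernel="linear", C=C).fit(X, y)
    return float(np.mean(svm.predict(X) != y)), svm

def recursive_partition(X, y, rho0=2, xi=0.02, max_rounds=20):
    """Recursive, scalable partitioning P^n(X, rho_0) in the spirit of Eq. (13):
    accept a partition when it is single-class or its local SVM loss is below xi;
    everything else is carried to the next round at a refined scale."""
    partitions, rho = [], rho0
    rem_X, rem_y = X, y
    for _ in range(max_rounds):
        if len(rem_y) == 0:
            break
        k = min(rho, len(rem_y))                 # cannot ask K-means for more clusters than samples
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(rem_X)
        keep = np.ones(len(rem_y), dtype=bool)
        accepted = False
        for c in range(k):
            idx = labels == c
            Xc, yc = rem_X[idx], rem_y[idx]
            if len(np.unique(yc)) == 1:          # one-class partition -> one-class node later
                partitions.append(("one-class", Xc, yc))
                keep[idx], accepted = False, True
            else:
                loss, _ = svm_loss(Xc, yc)
                if loss < xi:                    # locally separable enough -> two-class SVM node later
                    partitions.append(("svm", Xc, yc))
                    keep[idx], accepted = False, True
        if not accepted:
            rho += 1                             # refine the partitioning scale and try again
        rem_X, rem_y = rem_X[keep], rem_y[keep]
    return partitions
```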


4 The encoding of SVM tree

To construct an SVM tree, a data-driven spanning (DDS) approach is to split the data whenever class mixture exists in the current data partition [13], and to partition the data recursively until each data partition contains just one class of data. As mentioned above, such an SVMT grows automatically without any prior knowledge, but this DDS type of SVMT easily overfits the learning and thus gives a huge SVM tree, which obviously is not optimal for the purpose of efficient rule extraction. To approach this problem, we construct here a new type of SVMT by applying the following two spanning-order preferences.

4.1 Depth-first spanning tree

The idea of depth-first spanning is to expand the tree to a node of the next layer as quickly as possible before fanning out to other nodes of the current layer. The procedure is as follows: at a data partitioning scale ρ, once one data partition is judged as classification optimal (either a one-class partition or a partition with L_{Svm} < ξ), the tree expansion goes to the next layer, and one terminal node is produced at the current layer of the tree. Thus, Eq. (13) can be realized by repeating the following binary data partitioning:

[X_\rho, X'] = P(X, \rho, L_{Svm}),    (14)
\text{subject to:} \quad L_{Svm} > \xi,

where X_ρ is the one optimal partition at scale ρ generated by P, and X' is the remaining dataset such that X = X_ρ ∪ X'.

4.2 Breadth-first spanning tree

Breadth-first is another tree-spanning approach. The basic idea is to fan out to as many nodes as possible of the current layer before expanding the tree to a node of the next layer. This can be explained as follows: at a data partitioning scale ρ, if there are υ data partitions judged to be classification optimal (either a one-class partition or a partition with L_{Svm} < ξ), then υ nodes are produced at the current layer of the SVMT. In this case, Eq. (13) uses a multiple data partitioning:

[\{X_1, X_2, \ldots, X_\upsilon\}, X'] = P(X, \rho, L_{Svm}),    (15)
\text{subject to:} \quad L_{Svm} > \xi,

where {X_1, X_2, ..., X_υ} is a set of optimal partitions at scale ρ, and X' is the remaining dataset such that X = X_1 ∪ X_2 ∪ ... ∪ X_υ ∪ X'. υ is the number of optimal partitions at scale ρ, and it is determined by the partitioning function ℘ and the used partitioning scale ρ.

4.3 The SVMT algorithms

When constructing an SVMT, Eq. (10) is implemented by a two-step recursive procedure. First, a partition function is taken to split the data at a certain scale into two or multiple partitions. Next, for those partitions that are optimal for the 2-class pattern classification, a one-class classifier or an SVM classifier is built to serve as a node of the 2-SVMT, as sketched below.


Algorithms 1 and 2 below present the detailed steps of constructing a depth-first spanning (DFS) 2-SVMT and a breadth-first spanning (BFS) 2-SVMT, respectively.

[Algorithm 1. Depth-first spanning 2-SVMT]
Inputs: Xtrain (training dataset), ρ_0 (the initial partitioning scale), ξ (the threshold of the SVM loss function), K (the type of SVM kernel used)
Output: T (a trained DFS 2-SVMT)
Step 1: Initialize T as the root node, and the current partitioning scale ρ as ρ_0.
Step 2: If Xtrain is empty, then the iteration stops; return the constructed SVMT T.
Step 3: Perform the data partitioning of Eq. (14) with the current partitioning scale ρ, and obtain the data partition X_ρ.
Step 4: If X_ρ belongs to one class, then train a one-class SVM as in Eq. (4), and append a one-class SVM node to T.
Step 5: Otherwise, X_ρ belongs to two classes and the classification loss function of the partition is greater than ξ; train a standard two-class SVM as in Eqs. (2) and (3) over this partition, and append a two-class SVM node to T.
Step 6: If no new node is added to T, then adjust the current partitioning scale by ρ = ρ − Δρ.
Step 7: Set Xtrain as Xtrain − X_ρ, and go to Step 2.

[Algorithm 2. Breadth-first spanning 2-SVMT]
Inputs: Xtrain (training dataset), ρ_0 (the initial partitioning scale), ξ (the threshold of the SVM loss function), K (the type of SVM kernel used)
Output: T (a trained BFS 2-SVMT)
Step 1: Initialize T as the root node, and the current partitioning scale ρ as ρ_0.
Step 2: If Xtrain is empty, then the iteration stops; return the constructed SVMT T.
Step 3: Perform the data partitioning of Eq. (15) with the current partitioning scale ρ, and obtain the data partitions {X_1, X_2, ..., X_υ}.
Step 4: For each data partition X_i, do the following operations:
(a) if X_i belongs to one class, then train a one-class SVM as in Eq. (4), and append a one-class SVM node to T;
(b) otherwise, X_i belongs to two classes such that the classification loss function of X_i is greater than ξ; train a standard two-class SVM as in Eqs. (2) and (3) over this partition, and append a two-class SVM node to T.
Step 5: If no new node is added to T, then adjust the current partitioning scale by ρ = ρ − Δρ.
Step 6: Set Xtrain as Xtrain − {X_1, X_2, ..., X_υ}, and go to Step 2.

To test the above constructed 2-SVMTs, an input sample x is first judged by the test function T(x) at the root node of the SVM tree. Depending on the decision made by the root node, x is branched to one of the children of the root node. This procedure is repeated until a leaf node or an SVM node is reached; then the final classification decision for the testing sample is made by the node's one-class classifier or a single SVM classification. Algorithm 3 illustrates the testing procedure of a 2-SVMT (DFS 2-SVMT or BFS 2-SVMT).


[Algorithm 3. 2-SVMT testing algorithm]
Inputs: T (a trained 2-SVMT), x (a testing sample)
Output: C (class label of the testing sample)
Step 1: Set the root node as the current node.
Step 2: If the current node is an internal node, then start the following loop operations:
(a) branch x to the next generation of T by performing the current node's test function T_current(x);
(b) find the next node that x is branched to in T;
(c) set the node from (b) as the new current node, and go to Step 2.
Step 3: The above loop is stopped when a terminal node is reached, and the class label C is computed by the decision test T_current(x). Here, the current node is a terminal node, which is either an SVM node or a node with a single class.

In the above SVMT algorithms, K specifies the type of SVM used in the 2-SVMT construction. ρ_0 determines an initial resolution at which the 2-SVMT starts zooming into the data and constructing an SVMT over it. ξ is the permitted minimum value of the loss function, which gives a criterion of partition selection for optimal SVM classification. The default ρ_0 can be set as the biggest partitioning scale that the used partitioning method allows. For example, for the K-means partitioning method, ρ_0 can be set to 2; for the ECM partitioning method, ρ_0 can be set to 0.9. Given a serious class imbalance and class overlap of the data, a finer scale can be taken to enable the 2-SVMT to separate classes at a more accurate resolution. The default ξ is 0; in practice, a small error is given. Certainly, the smaller ξ is, the greater the time cost to train the 2-SVMT.

4.4 Coping with class imbalance and class overlap

In the above SVMT algorithms, the class imbalance is addressed in two parallel ways:
1. The recursive data zooming-in schema, where larger data partitions (particularly those from the overweighted class, e.g., class 1 of the Fig. 6 dataset) are separated out in the first few rounds. As a result, the imbalance interference is reduced for the learning of the remaining data. Meanwhile, data with class overlap/class mixture is left for the next round of learning. So, the data is drilled down in the SVMT, and decisions are made only at a scaled region where the decision is reachable.
2. The skewed class is protected by the SVMT by turning the SVM loss function L_{Svm} towards the skewed class as

L_{Svm} = \frac{1}{l} \sum_{i=1}^{l} \bigl( 1 - ( y_i \lambda_i^* K(x, x_i) + b^* ) \bigr).    (16)

where l is the number of regional skewed-class samples, x is an input sample of the regional skewed class, y_i λ_i^* K(x, x_i) + b^* is the SVM boundary function, and K is an SVM kernel function. Note that the regional skewed class denotes the skewed class within a data partition; in other words, a regional skewed class could also be a globally overweighted class, when the overweighted class is distributed skewed regionally.

Thus, Eq. (16) performs no overbalanced adjustment, as most data sampling methods do when they favor the skewed class by using the skewed-class data repeatedly, which does obtain some improvement in the classification of the skewed class but meanwhile causes an even larger performance reduction in the classification of the overweighted class. In fact, the regional skewed-class loss function of Eq. (16) protects both the skewed class and the overweighted class equally from being undervalued in a regional decision making (acting as a node in the SVMT) due to the class imbalance. Equation (16) eventually works effectively for the classification of the skewed class, because the probability of the skewed-class data being distributed skewed regionally is, in practice, often much higher than that of the overweighted class.
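A minimal sketch of one reading of Eq. (16) follows, assuming a fitted scikit-learn SVC whose decision_function returns the boundary value Σ_i y_i λ_i^* K(x, x_i) + b^*, and labels encoded as ±1; the function name skewed_class_loss is ours.

```python
import numpy as np

def skewed_class_loss(svm, X_skewed, y_skewed):
    """One reading of Eq. (16): the mean margin deficit of the regional skewed-class samples
    under the local SVM boundary function (decision_function in scikit-learn).
    A large value means the skewed class sits on the wrong side of, or close to, the boundary."""
    margins = y_skewed * svm.decision_function(X_skewed)   # signed margins of the skewed-class samples
    return float(np.mean(1.0 - margins))

# usage, with a local linear SVM trained on one data partition (y in {-1, +1}):
# from sklearn.svm import SVC
# svm = SVC(kernel="linear", C=0.1).fit(X_partition, y_partition)
# loss = skewed_class_loss(svm, X_partition[y_partition == -1], y_partition[y_partition == -1])
```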

5 The decoding of SVM tree

5.1 Logic association rules

According to the literature of rule mining [3,28,37,49], given an n-dimensional data space D, a standard logic rule is formed as

\text{if } x \in g_i \text{ then } Class(x) = C_k,    (17)

where the if part is the rule antecedent, and the then part is the rule consequence. The rule means that if x belongs to the subspace g_i, then it is assigned the class label C_k. Typically, there is a conjunction of logical predicate functions T(x) = {True, False} to tell whether the condition part of the above rule is satisfied or not. In the common case, the predicate function T is a test on a single attribute of D; for example, if x has an attribute a_i value that belongs to an interval, x_i ∈ [a_i^-, a_i^+], then x ∈ g_i.

However, when interpolating the knowledge from the decision-tree structure of an SVMT, the predicate function T is extended to be on multiple attributes of D, because every subspace g_i in an SVMT is obtained through the supervised data partitioning of Eq. (13), and both the loss function and the partitioning function of Eq. (13) are multivariate computations. Thus, the predicate function T will be a multivariate function determined by the choice of partitioning function P:

T(x) = P(g_i, x).    (18)

T(x) judges whether x is partitioned into g_i. It could be a linear or non-linear distance function, depending on the choice of partitioning function. On the other hand, when interpolating a support vector from a local SVM into a linguistic rule, the predicate function T is a conjunction of a set of tests on several relevant attributes:

T(x) = x_i \in [a_i^-, a_i^+] \wedge \cdots \wedge x_j \in [a_j^-, a_j^+].    (19)

123

Encoding and decoding the knowledge of association rules over SVM classification trees

91

5.2.1 2-class SVM node For a 2-class SVM node (called SVM node hereafter), rule extraction according to [6] is based on hyper-rectangular boundary built upon support vector of a trained SVM. Given support vector sm of class l, the hyper-rectangular boundary for sm is, {smi + λ2i ≥ xi ≥ smi − λ1i , }

(20)

where 1 ≥ λ pi ≥ 0, p = {1, 2}, and i = 1, . . . , n the sequence number of dimension. Set L o (i) and Ho (i) as the lower limit and upper limit of the hyper-rectangular rule along the ith dimension, respectively, where, L o (i) = smi − λ1i

(21)

Ho (i) = smi + λ2i .

(22)

and Given a trained SVM, the rule based on support vector sm can be generated by Fu’s algorithm [6] as,

[Algorithm 4. two-class support vector rule extraction] Input: sm (support vector) Output: L o (the lower limit of the hyper-rectangular rule), Ho (the upper limit of the hyperrectangular rule) Step 1: set l = 1, refers to dimension l Step 2: calculate xl subject to f (x) = 0 and x j = sm j ( j = 1, . . . , n, and j  = l) by the Newton’s method Step 3: determine L o and Ho according to the solutions of the problem in step 2. The number of solutions of xl may be different under different data distribution: (a) if there is no solution, i.e., there is no cross point between the line extended from sm along dimension l and the decision boundary, then L o (l) = 0, Ho = 1 (b) if there is one solution: if sml ≥ xl , L o (l) = xl , and Ho (l) = 1, else L o (l) = 0, and Ho (l) = xl2 (c) if there are two solutions, xl1 and xl2 ((xl1 ≤ xl2 ) : L o (l) = xl1 , and Ho (l) = xl2 (d) if there are more than two solutions the nearest neighbor xl1 and xl2 of xl2 are chosen from the solutions under the condition xl1 ≤ sm j , xl2 ≥ xl2 and the data points {x|x j = sm j , j = 1, . . . , n, j  = l, xl1 ≤ xl ≤ xl2 } are with the same class label with the support vector sm : L o (l) = xl1 , and Ho (l) = xl2 Step 4: l = l + 1, if l < n, go to step 2, else end. Figure 3 gives an example of hyper-rectangular rule extraction from a trained 2-class SVM in SVMT, where the black circles and points represent the support vectors from class 1 and class 2, respectively. Given a support vector sm = (sm1 , sm2 ) from class 2, there is two cross points (xi1 , xi2 ) and (x j1 , x j2 ) between the extend lines of sm and the decision boundary. According to Algorithm 4., a hyper-rectangular rule with L o (1) = xi1 , Ho (2) = x j1 , L o (2) = xi2 and H o(2) = x j2 is obtained. The coverage area of this rule is identified as the dashed rectangle in Fig. 3.

123

92

S. Pang, N. Kasabov

Fig. 3 Example hyper-rectangles for rule extraction from a two-class SVM

Class 1 Support Vector

Class 1 Class 2

Class 2 Support Vector

1.0

0.9 (xj1 ,xj2)

0.8

Sm

(sm1 ,sm2)

(xi1 ,xi2)

X1 0.7

0.6

0.5 0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

X2

5.2.2 One-class SVM node The one-class SVM of Eq. (4) models a one-class data partition by distinguishing the outliers from the major data distribution of the class. In this principle, for the knowledge interpolation of one-class SVM node, the hyper-rectangular rules from one-class SVM node are required to cover as much of the class data as possible. To expand the coverage of the hyper-rectangular rule, given two support vectors sm and sn from a trained one-class SVM, where Smi  = sni along the ith dimension, the above L o and Ho of Eqs. (21) and (22) are updated as, L o (i) = min(smi , sni )

(23)

Ho (i) = max(smi , sni ).

(24)

and Accordingly, Algorithm 4 can be modified to perform a one-class support vector rule extraction as,

[Algorithm 5. One-class support vector rule extraction] Input: sm and sn (two support vectors) Output: L o (the lower limit of the hyper-rectangular rule), Ho (the upper limit of the hyperrectangular rule) Step 1: set l = 1, refers to dimension l Step 2: calculate xl subject to f (x) = 0 and x j = sm j ( j = 1, . . . , n, and j  = l) by the Newton’s method. Step 3: determine L o and Ho according to the solutions of the problem in Step 2. The number of solutions of xl may be different under different data distribution: (a) if there is no solution, i.e., there is no cross point between the line extended from sm along dimension l and the decision boundary, then L o (l) = min(0, sml , snl ) Ho = max(1, sml , snl )

123

Encoding and decoding the knowledge of association rules over SVM classification trees

93

(b) if there is one solution: L o (l) = min(xl , sml , snl ), and Ho (l) = max(1, sml , snl ), else (c) if there are two solutions: L o (l) = min(xl1 , xl2 , sml , snl ), and Ho (l) = max(xl1 , xl2 , sml , snl ) (d) if there are more than two solutions the two neighbor xl1 and xl2 nearest to sm and sn are chosen, L o (l) = min(xl1 , xl2 , sml , snl ), and Ho (l) = L o (l) = max(xl1 , xl2 , sml , snl ). Step 4: l = l + 1, if l < n, go to Step 2, else end. It is noticeable that Algorithm 5 has the problem of rule disaster. Given m one-class n dimensional support vectors, it generates n.m(m −1)/2 rules for Algorithm 5. In other words, a small number of support vectors gives possible a huge number of rules for the same partition and the same class of data. Clearly, the redundancy occurs in the extracted rule set. Alternatively, we use all one-class support vectors together for the rule extraction of oneclass SVM, because the data from one-class SVM, both the outliers and the origin of the class eventually belong to the same class, and there is no need to separate one from another from the viewpoint of classification. In this way, the above rule extraction of Eqs. (23) and (24) can be re-written as, L o (i) = min(s1i , s2i , . . . , smi , . . .)

(25)

Ho (i) = max(s1i , s2i , . . . , smi , . . .).

(26)

Algorithm 5. is simplified into Algorithm 6 to extract rectangular rules only from the pair of vectors (maximum and minimum). Figure 4b gives an example of such rule extraction, where a set of smaller rectangles of Fig. 4a are replaced with one rectangle, which is built on a pair of support vectors, and which covers most of the one-class data.

[Algorithm 6. Simplified one-class support vector rule extraction] Input: s1 , s2 , . . . , sm (set of support vector from a trained one-class SVM) Output: L o (the lower limit of the hyper-rectangular rule), Ho (the upper limit of the hyperrectangular rule) Step 1: set l = 1, refers to dimension l Step 2: calculate L o (l) by Eq. (25) Step 3: calculate Ho (l) by Eq. (26) Step 4: l = l + 1, if l < n, go to Step 2, else end.

5.3 The SVMT-rule algorithm For extracting rule from a SVMT, the rule can be decoded through decision-tree rule extraction and SVM nodes knowledge interpolation. The description of the proposed SVMT-rule is presented as Algorithm 7, where we construct SVMT by Algorithm 1 or Algorithm 2, and extract rules by tree interpolation and SVM node decoding through Algorithm 4 and Algorithm 6.

123

94

S. Pang, N. Kasabov

1.0 Sn

-1

0.9

+1

0.8 X1

+1

0.7 Sm

0.6 0.5 0.3

0.4

0.5

0.6 X2

0.7

0.8

0.9

1.0

(a) 1.0

-1

0.9

+1

0.8 X1

+1

0.7 0.6 0.5 0.3

0.4

0.5

0.6 X2

0.7

0.8

0.9

1.0

(b) Fig. 4 Example hyper-rectangles for rule extraction from a one-class SVM

[Algorithm 7. Rule extraction over SVM trees (SVMT-rule)]
Inputs: Xtrain (training dataset)
Output: R (extracted rule set)
Step 1: Apply the SVM classification tree algorithm (Algorithm 1 for DFS SVMT, Algorithm 2 for BFS SVMT) to construct T over Xtrain.
Step 2: For every path in T from the root node to a terminal node (i.e., a one-class node or an SVM node) of the SVMT, extract rules C by taking all test functions along the path as the conjunctive rule antecedent and the class label held by the leaf as the rule consequence, and let R = C.
Step 3: For every decision-tree rule in C, do the following operations:
(a) if the terminal node to which the leaf is attached is a 2-class SVM node, apply Algorithm 4 to the terminal node for the SV rule extraction, and merge the current decision-tree rule with each obtained SV rule;
(b) otherwise, the terminal node to which the leaf is attached is a one-class SVM node; then apply Algorithm 6 for the one-class SV rule extraction. Additional rules are produced by merging the current decision-tree rule with each obtained one-class SV rule.
Step 4: Label the samples in Xtrain that fall into the region of each of the above obtained rules in R.
Step 5: If the set of samples in a certain rule region is a subset of the samples covered by another rule, this rule is removed.
Step 6: Repeat Step 3 until no rule is removed from R; output R.

Fig. 5 An illustration of SVMT-rule rule extraction, where a, b, c, and d are four decision paths of a double-layer SVMT, and P is the node partitioning function

Figure 5 gives an illustration of SVMT rule extraction, where a, b, c, and d are four decision paths of an example SVMT. For the decision paths a–c, decision rules are extracted by taking all test functions along the path as the conjunctive rule antecedent, and the class label held by the leaf as the rule consequence. If outlier samples are considered for a one-class node, additional SV rules are extracted from the one-class SVM on the node by Algorithm 6. Decision path d leads to a 2-class SVM node, thus Algorithm 4 is applied for the local SV rule extraction. Therefore, the rule set extracted by SVMT-rule consists of decision-tree rules and local SV rules structured by the decision tree.
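Steps 4–6 of Algorithm 7 (the coverage-based redundancy removal) can be sketched as follows, reusing the covers() predicate of the IntervalRule sketch given after Eq. (19); the function name prune_redundant_rules and the tie-breaking of identical coverage sets are ours.

```python
import numpy as np

def prune_redundant_rules(rules, X):
    """Sketch of Steps 4-6 of Algorithm 7: find which training samples fall into each rule's
    region, then drop any rule whose covered sample set is a subset of another rule's set."""
    cover = [frozenset(np.where([r.covers(x) for x in X])[0]) for r in rules]
    kept = []
    for i, r in enumerate(rules):
        redundant = any(
            j != i and cover[i] <= cover[j] and (cover[i] < cover[j] or j < i)
            for j in range(len(rules))
        )
        if not redundant:
            kept.append(r)
    return kept
```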

6 Experiments and applications

In our experiments, the algorithms were implemented in Matlab Version 6.5, run on a Pentium 4 PC (3.0 GHz, 512 MB RAM), and the comparison tests were performed in the environment of NeuCom 2.0 [30].

For extracting rules over the SVMT, all single 2-class SVMs are set with a standard linear kernel. A one-class SVM with an RBF kernel (β = 3.5) is used only for our fraud-detection application, where a large database with noise is involved. All SVMs have the same penalty coefficient of 0.1. ℘ is assigned as a K-means clustering partitioning function, and all 2-SVMTs use the same default parameters ρ_0 = 2 and ξ = 0.02. Because polynomial SVMs perform better than linear SVMs on most classification tasks [36], for a simple validity verification we compare the SVMT-rule with the SVM-rule, i.e., the rule extraction from a trained 2nd-order polynomial single SVM [6]. Note that all single SVMs are also set with the same penalty coefficient of 0.1.

To evaluate the quality of the extracted rules, we measure accuracy as the percentage of correct classification, and comprehensibility as the number of rules plus the number of antecedents per rule. In addition, we also measure the robustness of the extracted rules to class imbalance by computing the class bias ratio (CBR) as

CBR = 1 - \min(\text{class 1}, \text{class 2}) / \max(\text{class 1}, \text{class 2}).    (27)

A larger CBR value means a more imbalanced class data distribution, or a more imbalanced class classification accuracy.

6.1 Synthetic dataset

We first experimented with the SVMT-rule on a synthetic dataset for classification. Figure 6 presents the distribution of the dataset, where the class 1 and class 2 data both follow a 2D Gaussian distribution, and the number of class 2 samples is approximately 1/10 of that of class 1, so that the CBR of this dataset in terms of class data distribution is 0.9. Constructing an SVMT over the dataset of Fig. 6, Fig. 7 gives an example of the DFS-SVMT and the BFS-SVMT generated by Algorithm 1 and Algorithm 2, respectively. The root node P1 represents the data partitioning function at scale ρ_0, P2 the data partitioning at scale ρ_1, and so on.
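As an illustration of the setup above, a minimal sketch that generates such a class-imbalanced Gaussian dataset and computes its CBR by Eq. (27); the sample sizes, means and seed are ours, chosen only to reproduce the 10:1 imbalance.

```python
import numpy as np

def class_bias_ratio(n_class1, n_class2):
    """Eq. (27): CBR = 1 - min/max. Usable both for class sizes (data imbalance)
    and for per-class accuracies (imbalance of a classifier's output)."""
    lo, hi = sorted((float(n_class1), float(n_class2)))
    return 1.0 - lo / hi if hi > 0 else 0.0

# a 2D Gaussian dataset with roughly a 10:1 class ratio, in the spirit of Fig. 6
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))    # class 1 (overweighted)
X2 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))     # class 2 (skewed)
X = np.vstack([X1, X2])
y = np.hstack([np.ones(len(X1), dtype=int), 2 * np.ones(len(X2), dtype=int)])

print(class_bias_ratio(len(X1), len(X2)))   # 0.9, matching the CBR of the Fig. 6 dataset
```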

Fig. 6 2-Dimensional synthetic data for classification (class 1 vs. class 2), CBR = 0.9

Fig. 7 Example of BFS-SVMT (top) and DFS-SVMT (bottom) for SVMT rule extraction over the Fig. 6 dataset

Fig. 8 The illustration of the hyper-plane for a single SVM and the SVMT: a single SVM hyper-plane, b SVMT hyper-plane, and c the data partitioning boundary part of the SVMT hyper-plane

Every one-class node here is denoted as a circle labelled with class '1' or '2', and an SVM node is denoted as an ellipse with an 'SVM' label. Each SVM node has two sons, which means that each SVM node provides two decision choices, class 1 and class 2. They are symbolized in the same way as the one-class nodes, but they are not counted here as one-class nodes. The DFS-SVMT is seen to have 8 SVM nodes and 14 one-class nodes, and the BFS-SVMT has 8 SVM nodes and 24 one-class nodes. The number of one-class nodes is several times that of the SVM nodes, suggesting that the decisions of the BFS-SVMT are more dependent on the tree than the decisions made by the DFS-SVMT.

In the construction of the SVMT, the data is zoomed in recursively by a supervised data partitioning. Figure 8b shows the resulting hyper-plane from Algorithm 2 as the data partitioning step goes to the second iteration. Compared to the single SVM hyper-plane in Fig. 8a, the SVMT hyper-plane in Fig. 8b is a combination of the part of the data partitioning boundary estimated in Fig. 8c and two pieces of short SVM hyper-planes. It turns out that the SVMT conducts classification using fewer support vectors from SVM, but more decision boundaries from data partitioning. In other words, an SVM is not used unless it is absolutely necessary.

Table 1 compares the SVMT-rule with the SVM-rule and the original SVMT on the classification of the above synthetic dataset, based on the following six measurements:
(1) the number of obtained rules (N. of Rules);
(2) the number of support vectors (N. of SVs);
(3) the average general classification accuracy (Ave. Acc.);
(4) the classification accuracy of class 1 (Cls1 Acc.);
(5) the classification accuracy of class 2 (Cls2 Acc.);
(6) the class bias ratio in terms of class classification accuracy (CBR).

For the CBR = 0.9 dataset of Fig. 6, the SVMT-rules come up with a class classification accuracy of CBR ≤ 0.3, while the SVM-rule gives a completely imbalanced output with CBR = 1.0. It is evident that the SVMT-rules have profited from the outstanding generalization ability of the SVMTs, as both statistics are shown to be quite similar in Table 1. The two BFS-type SVMT methods (i.e., BFS SVMT-rule and BFS-SVMT) perform even better than the DFS-type SVMT methods (i.e., DFS SVMT-rule and DFS-SVMT); this suggests that BFS is able to fit the data more appropriately than DFS for constructing an SVMT.

Table 1 The results of comparison between SVM-rule and SVMT-rule on the classification of the Fig. 6 dataset

Measurements   SVM rule   DFS SVMT-rule   DFS-SVMT   BFS SVMT-rule   BFS-SVMT
N. of rules    58         23 (8/15)       -          33 (9/24)       -
N. of SVs      32         16              16         18              18
Ave. Acc.      91.3%      92.4%           92.7%      90.0%           90.1%
Cls1 Acc.      100%       94.9%           94.8%      91.2%           92.0%
Cls2 Acc.      0.0%       66.3%           70.5%      76.9%           79.0%
CBR            1.0        0.30            0.25       0.16            0.14

6.2 Cancer diagnosis

We have applied the BFS SVMT-rule technique to cancer diagnosis and decision support. Table 2 gives the seven well-known cancer microarray datasets, each with a two-class classification problem. As seen in the table, most datasets are biased in the sense of having an unbalanced number of patients in the two classes, e.g., normal versus tumor.

Table 2 Cancer datasets used for testing the algorithm

Cancer           Class 1 vs. Class 2        Genes    Train data     Test data   Ref.
Lymphoma(1)      DLBCL vs. FL               7,129    (58/19)77      -           [20]
Leukaemia        ALL vs. AML                7,129    (27/11)38      34          [21]
CNS cancer       Survivor vs. failure       7,129    (21/39)60      -           [22]
Colon cancer     Normal vs. tumor           2,000    (22/40)62      -           [23]
Ovarian cancer   Cancer vs. normal          15,154   (91/162)253    -           [24]
Breast cancer    Relapse vs. non-relapse    24,482   (34/44)78      19          [25]
Lung cancer      MPM vs. ADCA               12,533   (16/16)32      149         [26]

Columns for the training and validation data show the total number of patients; the numbers in brackets are the ratios of the patients in the two classes.

Table 3 The results of comparison between SVM-rule and BFS SVMT-rule on seven cancer diagnoses

                 SVM rule                              BFS SVMT-rule                         BFS SVMT
Dataset          Acc. (%)  No. of rules  No. of SVs    Acc. (%)  No. of rules  No. of SVs    Acc. (%)
Lymphoma(1)      89.6      233           60            80.8      221           38            89.6
Leukaemia        73.5      245           29            94.1      108           20            94.1
CNS cancer       61.7      234           37            62.9      278           26            65.0
Colon cancer     71.3      205           48            75.3      202           32            77.3
Ovarian cancer   96.5      178           34            98.4      198           21            97.3
Breast cancer    52.6      269           62            78.9      206           33            78.9
Lung cancer      94.6      171           23            98.5      188           17            99.3

For the three datasets that had an independent validation dataset available, we selected 100 genes from the training dataset by a typical t-test gene selection [27], and used them on the validation dataset. For the remaining four datasets, we verified the method using tenfold cross-validation, randomly removing one tenth of the data and then using the remainder of the data as a training set. For each fold, we similarly selected 100 genes over the training set, and then applied the obtained 100 genes to the corresponding testing set.

The comparison of the SVMT-rule with the single SVM-rule and the original SVMT on the seven cancer diagnoses is shown in Table 3, where rule extraction is measured in terms of diagnosis accuracy, the number of rules, and the number of involved support vectors, respectively. Although the SVMT-rule does not perform better than the SVMT, it is still impressive that Table 3 indicates the generalization advantage of the SVMT-rule over the SVM-rule is about 10% (((0.808−0.896)/(1−0.808)+(0.941−0.735)/(1−0.735)+···+(0.985−0.946)/(1−0.946))/7 = 0.1096). More importantly, the comprehensibility of the BFS SVMT-rule is clearly improved as compared to the SVM-rule, because the average number of support vectors for the SVMT-rule is only about 60% of that of the SVM-rule. For CNS and lung cancer, the BFS SVMT-rule is found to use fewer support vectors, but even more rules, than the SVM-rule. This inconsistency arises because these two cancers are more difficult than the other types of cancer for a correct knowledge representation, and the BFS-SVMT has allocated more effort to decision-tree spanning.
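A minimal sketch of the t-test gene selection used above, assuming expression matrices with genes as columns and a two-class label vector; the function name select_top_genes and the unequal-variance option are ours.

```python
import numpy as np
from scipy.stats import ttest_ind

def select_top_genes(X, y, n_genes=100):
    """A typical two-sample t-test gene selection: rank genes by |t| between
    the two classes and keep the indices of the top n_genes columns."""
    classes = np.unique(y)
    t, _ = ttest_ind(X[y == classes[0]], X[y == classes[1]], axis=0, equal_var=False)
    t = np.nan_to_num(t)                      # guard against constant genes
    return np.argsort(-np.abs(t))[:n_genes]

# selected = select_top_genes(X_train, y_train)     # fit on the training fold only
# X_train_sel, X_test_sel = X_train[:, selected], X_test[:, selected]
```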

Table 4 The statistics of rule extraction for fraud detection

Method           N. of rules   N. of SVs   Rules per SV   Accuracy (%)
SVM-rule         269           35          7.6            78.5
DFS SVMT-rule    98            17          5.7            89.3
BFS SVMT-rule    123           19          6.4            92.3
C4.5-rule        176           -           -              80.8

6.3 Fraud detection

Cell-phone fraud often occurs in the telecommunication industry. We have investigated fraud detection using SVMT-rule rule extraction. The database was obtained from a mobile telecom company. It recorded one year's action data of 53,696 customers. The historical data on fraud includes 15 tables. Among them, the three core tables for this topic are IBS_USERINFO (customer profile), IBS_USRBILL (customer payment information), and IBS_USERPHONE (customer action); all the others are additional tables.

Fraud detection is basically a binary classification problem, where customers are divided into two classes: fraud or non-fraud. The aim of fraud detection is to distinguish the fraud customers from the remaining honest customers. The method of fraud detection is to analyze the historical data, which is a typical data mining procedure as follows:
Step 1: Data cleaning, to purge redundant data for a certain fraud action analysis.
Step 2: Feature selection and extraction, to discover indicators corresponding to changes in behavior indicative of fraud.
Step 3: Modeling, to determine the fraud pattern by classifier and predictor.
Step 4: Fraud action monitoring and prediction, to issue the alarm.

Step 2 extracts the eight salient features that measure the customer payment behaviors using PROC DATA of SAS/BASE, and combines the customer profile (IBS_USERINFO) and customer action (IBS_BILL) into one table (IBS_USERPERFORMANCE). The table IBS_USERPERFORMANCE has 53,696 records and nine attribute items. To discover the fraud patterns, the IBS_USERPERFORMANCE data is divided into a training set and a test set, in which the training set contains 70% of the 53,696 records, and the test set is the remaining 30%.

Table 4 lists the statistical results of rule extraction using SVMT-rule, SVM-rule, and C4.5-rule for the fraud detection. As expected, the SVMT-rule shows a more accurate fraud detection than the SVM-rule, as well as the C4.5-rule, where the BFS SVMT-rule performs slightly better than the DFS SVMT-rule at the price of using more support vectors and more rules. It is surprising that the number of rules from the SVMT-rule is only about half of that from the SVM-rule. One possible explanation is that the SVMT-rule better suits a large database, where the decision-tree part of the SVMT plays an absolutely more important role in encapsulating data for rule extraction.

7 Discussions and conclusions

The SVMT aggregates the decision tree and the support vector machine in an SVM classification tree, and manages rule extraction from both sides by selecting the simple part of the problem for a comprehensible decision-tree representation, while leaving the remaining difficult part to a support vector machine approximation. For extracting rules, the SVMT-rule is

introduced in the form of DFS SVMT-rule and BFS SVMT-rule, which encode the knowledge of rules by an induction over the decision tree of SVMs and the interpolation of the support vectors from the local SVMs. The experimental results and applications have shown that the SVMT-rule, compared to single SVM rule extraction, has the following desirable properties: (1) SVMT-rule is more accurate, and especially more robust, for the classification of data with a class-imbalanced data distribution; (2) SVMT-rule has a better comprehensibility for rule extraction, because only about half the number of single-SVM support vectors are kept in the SVMT rules; (3) the BFS SVMT-rule commonly outperforms the DFS SVMT-rule.

The SVMT-rule can also be applied to the multi-class case by extending the 2-class SVMT to an m-class (m ≥ 2) SVMT. One straightforward way is to do a 'one-to-one' or 'one-to-all' SVMT integration; however, this approach creates the new problems of decision forests [29]. In the principle of SVM aggregation, the construction of an m-SVMT is to decompose an m-class task into a certain number of 1- to m-class regional tasks as

T_{m\text{-}SVMT}\bigl(\wp^*, f_{Svm^{<2>}}, \ldots, f_{Svm^{<m>}}, f_{Sgn^{<1>}}, \ldots, f_{Sgn^{<m>}}, x\bigr),    (28)

where ℘*, with split number ρ ≥ 2, represents here a multi-split data partitioning function. Similarly, f_{Svm^{<i>}}, 2 ≤ i ≤ m, represents an i-class SVM classifier function, and f_{Sgn^{<j>}}, 1 ≤ j ≤ m, represents a jth-class one-class SVM. Accordingly, the classification function of an m-SVMT can be written as

\hat{f}(x) = \begin{cases} f_{Sgn^{<i>}} & \text{if } \wp(x) \in \text{class } i \\ f_{Svm^{<j>}} & \text{otherwise} \end{cases}    (29)

where 1 ≤ i ≤ m and 2 ≤ j ≤ m.

The SVMT-rule has improved the comprehensibility of SVM rule extraction by effectively reducing the number of support vectors. However, there are still a few support vectors left in the rules of the SVMT-rule. To further improve the comprehensibility of the SVMT-rule, it is suggested that a symbolic rule interpolation [34] could be developed for the SVMT-rule in the future.

Acknowledgments The research presented in the paper was partially funded by the New Zealand Foundation for Research, Science and Technology under the grant NERF/AUTX02-01.

References

1. Nunez H, Angulo C, Catala A (2002) Rule-extraction from support vector machines. In: The European symposium on artificial neural networks, Bruges, pp 107–112
2. Zhang Y, Su HY, Jia T, Chu J (2005) Rule extraction from trained support vector machines. In: PAKDD 2005, LNAI 3518. Springer, Heidelberg, pp 61–70
3. Wang L, Fu X (2005) Rule extraction from support vector machine. In: Data mining with computational intelligence. Advanced information and knowledge processing. Springer, Berlin
4. Barakat N, Bradley AP (2006) Rule extraction from support vector machines: measuring the explanation capability using the area under the ROC curve. In: The 18th international conference on pattern recognition (ICPR'06), August 2006, Hong Kong
5. Fung G, Sandilya S, Rao B (2005) Rule extraction for linear support vector machines. In: KDD 2005, August 21–24, 2005, Chicago
6. Fu X, Ong C, Keerthi S, Huang GG, Goh L (2004) Extracting the knowledge embedded in support vector machines. In: Proceedings of the IEEE international joint conference on neural networks, vol 1, 25–29 July 2004, pp 291–296
7. Vapnik V (1982) Estimation of dependences based on empirical data. Springer, Heidelberg
8. Vapnik V (1995) The nature of statistical learning theory. Springer, Heidelberg

9. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
10. Pang S, Ozawa S, Kasabov N (2004) One-pass incremental membership authentication by face classification. In: ICBA 2004, LNCS, vol 3072. Springer, Heidelberg, pp 155–161
11. Pang S, Kim D, Bang SY (2003) Membership authentication in the dynamic group by face classification using SVM ensemble. Patt Recogn Lett 24:215–225
12. Pang S (2005) SVM aggregation: SVM, SVM ensemble, SVM classification tree. IEEE SMC eNewsletter, Dec 2005. http://www.ieeesmc.org/Newsletter/Dec2005/R11Pang.php
13. Pang S, Kim D, Bang SY (2005) Face membership authentication using SVM classification tree generated by membership-based LLE data partition. IEEE Trans Neural Netw 16(2):436–446
14. Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (1999) Estimating the support of a high-dimensional distribution. Technical report, Microsoft Research, MSR-TR-99-87
15. Tax DMJ (2001) One-class classification: concept-learning in the absence of counter-examples. PhD thesis
16. Tax DMJ, Duin RPW (2001) Combining one-class classifiers. LNCS 2096:299–308
17. Xu Y, Brereton RG (2005) Diagnostic pattern recognition on gene expression profile data by using one-class classifiers. J Chem Inf Model 45:1392–1401
18. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
19. Kim H-C, Pang S, Je H-M, Kim D, Bang SY (2003) Constructing support vector machine ensemble. Patt Recogn 36(12):2757–2767
20. Shipp MA, Ross KN et al (2002) Supplementary information for diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8(1):68–74
21. Golub TR (2004) Toward a functional taxonomy of cancer. Cancer Cell 6(2):107–108
22. Pomeroy S, Tamayo P et al (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870):436–442
23. Alon U, Barkai N et al (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96:6745–6750
24. Petricoin EF, Ardekani AM et al (2002) Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359:572–577
25. Van't Veer LJ et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–536
26. Gordon GJ, Jensen R et al (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62:4963–4967
27. Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:111–139
28. Schuster A, Wolff R, Trock D (2005) A high-performance distributed algorithm for mining association rules. Knowl Inf Syst 7:458–475
29. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Patt Anal Mach Intell 20(8):832–844
30. NeuCom—a neuro-computing decision support environment. Knowledge Engineering and Discovery Research Institute, Auckland University of Technology. http://www.theneucom.com
31. Nunez H, Angulo C, Catala A (2003) Hybrid architecture based on support vector machines. In: Computational methods in neural modeling. Lecture Notes in Computer Science, vol 2686, pp 646–653
32. Zhou ZH, Jiang Y (2003) Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Trans Inf Technol Biomed 7(1):37–42
33. Chen Y, Wang JZ (2003) Support vector learning for fuzzy rule-based classification systems. IEEE Trans Fuzzy Syst 11(6):716–728
34. Nunez H, Angulo C, Catala A (2002) Support vector machines with symbolic interpretation. In: Proceedings of the VII Brazilian symposium on neural networks, 11–14 Nov 2002, pp 142–147
35. Duch W, Setiono R, Zurada JM (2004) Computational intelligence methods for rule-based data understanding. Proc IEEE 92(5):771–805
36. Pang S, Kim D, Bang SY (2001) Fraud detection using support vector machine ensemble. In: ICONIP 2001, Shanghai, China
37. Terabe M, Washio T, Motoda H, Katai O, Sawaragi T (2002) Attribute generation based on association rules. Knowl Inf Syst 4:329–349
38. Barakat N, Diederich J (2004) Learning-based rule-extraction from support vector machines: performance on benchmark data sets. In: Proceedings of the conference on neuro-computing and evolving intelligence, Dec 2004
39. Barakat N, Diederich J (2005) Eclectic rule-extraction from support vector machines. Int J Comput Intell 2(1):59–62

40. Barakat N, Bradley A (2007) Rule extraction from support vector machines: a sequential covering approach. IEEE Trans Knowl Data Eng 19(6):729–741
41. Provost F (2000) Machine learning from imbalanced data sets 101. In: Working notes of the AAAI'00 workshop on learning from imbalanced data sets, pp 1–3
42. Wu G, Chang E (2005) KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans Knowl Data Eng 17(6):786–795
43. Wu G, Chang E (2003) Adaptive feature-space conformal transformation for imbalanced-data learning. In: Proceedings of the 20th international conference on machine learning, pp 816–823
44. Lin Y, Lee Y, Wahba G (2002) Support vector machines for classification in nonstandard situations. Mach Learn 46:191–202
45. Veropoulos K, Campbell C, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on artificial intelligence, pp 55–60
46. Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20:18–36
47. Zhou ZH, Liu XY (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77
48. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) 16:321–357
49. De Falco I, Della Cioppa A, Iazzetta A, Tarantino E (2005) An evolutionary approach for automatically extracting intelligible classification rules. Knowl Inf Syst 7:179–201

Author biographies

Dr. Shaoning Pang received the BSc degree in physics (1994) and the MSc degree in electronic engineering (1997) from Xinjiang University, China, and the PhD degree in computer science (2000) from Shanghai Jiao Tong University, China. From 2001 to 2003, he was a Research Associate at the Pohang University of Science and Technology (POSTECH), South Korea. Currently, he is a Senior Research Fellow (permanent academic) with the Knowledge Engineering and Discovery Research Institute (KEDRI), Auckland University of Technology, Auckland, New Zealand. His research interests include support vector machines (SVM), SVM aggregating intelligence, incremental and multitask machine learning, and bioinformatics. Dr. Pang has served as a program committee member and session chair for a number of international conferences, including ISNN, ICONIP, FSKD, ICNC, and ICNNSP. He was a best-paper winner at the 2003 IEEE International Conference on Neural Networks and Signal Processing and an invited speaker at BrainIT 2007. He currently acts as a regular reviewer for a number of refereed international journals and conferences. Dr. Pang is a Senior Member of the IEEE, and a Member of the Institute of Electronics, Information and Communication Engineers (IEICE) and the Association for Computing Machinery (ACM).

Professor Nikola Kasabov is the Founding Director and Chief Scientist of the Knowledge Engineering and Discovery Research Institute (KEDRI), Auckland (www.kedri.info/). He holds a Chair of Knowledge Engineering at the School of Computing and Mathematical Sciences at Auckland University of Technology. He is a Fellow of the Royal Society of New Zealand, a Fellow of the New Zealand Computer Society, and a Senior Member of the IEEE. He is President-Elect of the International Neural Network Society (INNS) and the current President of the Asia Pacific Neural Network Assembly (APNNA). He is a member of several technical committees of the IEEE Computational Intelligence Society and of the IFIP AI TC12. Kasabov serves on the editorial boards of several international journals, including IEEE Transactions on Neural Networks, IEEE Transactions on Fuzzy Systems, Information Sciences, and the Journal of Theoretical and Computational Nanoscience. He chairs the ANNES/NCEI series of international conferences in New Zealand. Kasabov holds MSc and PhD degrees from the Technical University of Sofia. His main research interests are in intelligent information systems, soft computing, neuro-computing, bioinformatics, brain study, speech and image processing, and novel methods for data mining and knowledge discovery. He has published more than 400 works, including 15 books, 120 journal papers, 60 book chapters, 32 patents, and numerous conference papers. He has extensive academic experience at various academic and research organisations: University of Otago, New Zealand; University of Essex, UK; University of Trento, Italy; Technical University of Sofia, Bulgaria; University of California at Berkeley; RIKEN and KIT, Japan; Technical University of Kaiserslautern, Germany; and others. More information about Prof. Kasabov can be found on the KEDRI website: http://www.kedri.info.
