2010 International Conference on Networking and Information Technology
Decision Tree based Support Vector Machine for Intrusion Detection

Prof. P. R. Devale, HOD, Department of Information Technology, Bharati Vidyapith's COE, Pune, India ([email protected])
Mrs. Snehal A. Mulay, Department of Information Technology, Bharati Vidyapith's COE, Pune, India ([email protected])
Prof. G. V. Garje, HOD, Department of Computer and IT, PVG's COET, Pune, India ([email protected])
Abstract- Support Vector Machines (SVM) are classifiers which were originally designed for binary classification, but many classification applications require solving multi-class problems. A decision-tree-based support vector machine, which combines support vector machines and decision trees, can be an effective way of solving multi-class problems in Intrusion Detection Systems (IDS). This method can decrease the training and testing time of the IDS, increasing the efficiency of the system. The different ways of constructing a binary tree divide the data set into two subsets from the root to the leaves until every subset consists of only one class. The construction order of the binary tree has great influence on the classification performance. In this paper we study two decision tree approaches, hierarchical multiclass SVM and tree-structured multiclass SVM, to construct a multiclass intrusion detection system.
Keywords- intrusion detection system, support vector machine, decision tree

I. INTRODUCTION

Intrusion detection is a problem of great significance to protecting information systems security, especially in view of the worldwide increasing incidents of cyber attacks. Since the ability of an IDS to classify a large variety of intrusions in real time with accurate results is important, we consider performance measures in three critical aspects: training and testing times, scalability, and classification accuracy. The patterns of user activities and audit records are examined and the intrusions are located. IDSs are classified, based on their functionality, as misuse detectors and anomaly detectors. Misuse detection systems primarily search for recognized patterns of attacks. Usually, misuse detection is simpler than anomaly detection, as it uses rule-based or signature comparison methods. Anomaly detection requires storage of normal usage behavior and operates upon audit data generated by the operating system. Support vector machines (SVM) are classifiers which were originally designed for binary classification[5][6] and can be used to classify the attacks. If we combine binary SVMs with decision trees, we get multiclass SVMs, which can classify the four types of attacks (Probing, DoS, U2R and R2L) and normal data, and can prepare five classes for anomaly detection. The paper is organized as follows. Section II describes the basic principles of SVM. Section III discusses multiclass SVM. In Section IV we study two decision tree algorithms that implement multiclass SVM. Section V analyses the two algorithms.

II. BASIC PRINCIPLES OF SUPPORT VECTOR MACHINE

SVMs have a solid theoretical foundation rooted in statistical learning theory[5]. They implement the Structural Risk Minimization (SRM) principle, which minimizes an upper bound on the generalization error. In this method, one first maps the data into a high dimensional space via a nonlinear map and constructs an optimal separating hyper-plane or linear regression function in that space; in the original input space this yields a nonlinear separating hyper-plane or regression function. The process involves a quadratic programming problem, which gives a global optimal solution. Suppose we have N training data points {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n and y_i ∈ {+1, −1}. Consider a hyper-plane defined by (w, b), where w is a weight vector and b is a bias. The classification of a new object x is done with

    f(x) = sign(w · x + b) = sign( Σ_{i=1..N} α_i y_i (x_i · x) + b )

The training vectors x_i occur only in the form of a dot product. For each training point there is a Lagrangian multiplier α_i, whose value reflects the importance of the data point. When the maximal margin hyper-plane is found, only the points that lie closest to the hyper-plane have α_i > 0; these points are called support vectors. All other points have α_i = 0. That means only the points that lie closest to the hyper-plane give the representation of the hypothesis/classifier. These data points serve as support vectors, and their values can be used to give an independent boundary with regard to the reliability of the hypothesis/classifier.
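To make the decision function concrete, the following minimal sketch evaluates f(x) from a set of support vectors. It is only an illustration of the equation above; the kernel, data layout and variable names are assumptions, not the paper's implementation.

    import numpy as np

    def svm_decision(x, support_vectors, alphas, labels, b, kernel):
        # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b ); only the
        # support vectors (alpha_i > 0) contribute to the sum
        s = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors))
        return np.sign(s + b)

    # the linear kernel used in the equation above: K(u, v) = u . v
    linear = lambda u, v: float(np.dot(u, v))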
III. MULTICLASS SUPPORT VECTOR MACHINE

The previous works on multi-class SVMs can be divided into two categories. For a k-class (k > 2) pattern recognition problem, one approach considers all classes at once to construct a QP problem. The obtained optimization problem is quadratic in (k − 1) × l variables (where l denotes the number of training patterns), and the training of this multi-class SVM is usually very time-consuming. Other methods solve a series of two-class SVMs to construct many decision functions, instead of solving one large QP problem. Some typical methods [2][9][10][11] are described as follows.

A. One-versus-rest (OVR)

It constructs k two-class SVMs. The ith SVM (i = 1, 2, ..., k) is trained with all training patterns: the ith class patterns are labeled +1 and the rest are labeled −1. The class of an unknown pattern x is determined by argmax_{i=1..k} f_i(x), where f_i(x) is the decision function of the ith two-class SVM. In short, a binary classifier is constructed to separate the instances of class i from the rest of the classes. The training and test phases of this method are usually very slow.
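As a hedged illustration of the OVR scheme (not the paper's code), scikit-learn's OneVsRestClassifier can wrap a binary SVM; the synthetic five-class data set below is only a stand-in for KDDCUP'99.

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # toy stand-in for a five-class intrusion data set
    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=8, n_classes=5, random_state=0)
    ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))  # k binary SVMs
    ovr.fit(X, y)
    print(ovr.predict(X[:5]))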
B. One-versus-one (OVO)

It constructs all possible two-class SVM classifiers: there are k(k − 1)/2 classifiers, obtained by training each classifier on only two out of the k classes. A Max Wins algorithm is used in the test phase: each classifier casts one vote for its favored class, and finally the class with the most votes wins. The number of classifiers increases superlinearly as k grows, so OVO becomes slow on problems where the number of classes is large.
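A corresponding OVO sketch, again only an assumed scikit-learn illustration rather than anything from the paper:

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=8, n_classes=5, random_state=0)
    ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale"))  # k(k-1)/2 SVMs
    ovo.fit(X, y)                     # Max Wins voting happens inside predict
    print(ovo.predict(X[:5]))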
C. Method based on a Directed Acyclic Graph (DAG)

The training of DAG is the same as OVO. DAG uses tree-structured two-class SVMs to determine which class an unknown pattern belongs to. The DAG is traversed in the test phase, so the classification of DAG is usually faster than OVO.
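The test-phase elimination that makes DAG fast can be sketched as follows; the pairwise-classifier interface here is a hypothetical assumption for illustration.

    def dag_predict(x, pairwise, k):
        # pairwise[(i, j)](x) > 0 means "x is not class j"; <= 0 means
        # "x is not class i" -- the loser is eliminated at each node
        left, right = 0, k - 1
        while left < right:
            if pairwise[(left, right)](x) > 0:
                right -= 1   # class `right` eliminated
            else:
                left += 1    # class `left` eliminated
        return left          # only k - 1 classifiers are evaluated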
IV. MULTICLASS SVM USING DECISION TREES

Decision-tree-based SVM[7][8], which combines SVMs and decision trees, is an effective alternative to the above methods for solving multiclass problems. The division of the feature space depends on the structure of a decision tree, and the structure of the tree relates closely to the performance of the classifier. Distance measures such as the Euclidean distance and the Mahalanobis distance are often used as separability measures[8] between classes, but the distance between class centers cannot reflect the structure of the classes if the distribution of the classes is not taken into consideration. Here we study two novel algorithms, hierarchical multiclass SVM[1] and tree-structured multiclass SVM[2]. These algorithms consider the distribution of the classes along with the distances of the class centers. They can help to build a pattern recognition model for IDS, classify the attacks and increase the efficiency of the IDS by decreasing the training and testing time. Fig. 1 shows the methods using which we can build an anomaly detection system [3].

[Fig. 1 Method based intrusion detection systems]

Hierarchical multiclass SVM and tree-structured multiclass SVM are model based algorithms which have been used for solving multi-class pattern recognition problems on the UCI data sets and the US Postal Service data set. The results of these algorithms show that they are effective methods and can improve the performance of our IDS. We are using the KDDCUP'99 data set for training and testing of the system. Our aim is to classify the different attacks and normal data into five classes, prepare a training model based on that, and use test data to evaluate the performance of the model used for building the IDS. The USSVM is adopted to train and test every SVM.
A. Hierarchical multiclass SVM

The hierarchical multiclass SVM is proposed in [1]. There are some definitions:

Definition 1: The center of class i in the feature space is given by:

    m_i = (1/l_i) Σ_{s=1..l_i} φ(x_s)

where φ is the nonlinear map into the feature space and l_i is the number of patterns in class i.

Definition 2: The distance between class i and class j in the feature space is given by:

    D_ij = ||m_i − m_j|| = || (1/l_i) Σ_{s=1..l_i} φ(x_s) − (1/l_j) Σ_{t=1..l_j} φ(x_t) ||
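Because m_i lives in the feature space, D_ij never has to be computed from φ explicitly; it expands into kernel evaluations. The sketch below makes that step explicit under an assumed RBF kernel (the kernel choice and function names are illustrative, not from the paper).

    import numpy as np

    def rbf_kernel(A, B, gamma=0.5):
        # pairwise RBF kernel matrix between the rows of A and the rows of B
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def center_distance(Xi, Xj, kernel=rbf_kernel):
        # ||m_i - m_j||^2 = mean K(Xi, Xi) - 2 mean K(Xi, Xj) + mean K(Xj, Xj)
        d2 = (kernel(Xi, Xi).mean() - 2 * kernel(Xi, Xj).mean()
              + kernel(Xj, Xj).mean())
        return np.sqrt(max(d2, 0.0))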
Definition 3: The radius of the hyper-sphere of class i in the feature space is given by:

    R_i = max_{s=1..l_i} ||φ(x_s) − m_i||

R_i represents the distribution of class i.
Definition 4: The class similarity is defined as:

    similarity(i, j) = (R_i + R_j) / ||m_i − m_j||

It relates the class distributions to the distance between the class centers: the bigger similarity(i, j) is, the more similar the two classes are.

The hierarchical order is determined according to the class similarity, which considers both the class distance and the class distribution; the classes with the bigger classification areas are made the upper classes, and in the hierarchical binary tree the inclusion of classes is according to the distance of the class centers. The binary tree is constructed from the bottom to the top, as an inverted binary tree: the classes with less similarity become the upper nodes and the classes with larger similarity become the lower nodes.

How do we construct the hierarchical multi-class SVM? The following steps describe the construction order of the binary tree for classifying five classes, where A, B, C, D are the types of attack and E is the normal data:
1) Use Definition 4 and compute the class similarity.
2) Unite the two classes which have the biggest similarity(i, j) and train an SVM on them.
3) Compute the new united class center and class distribution by using Definition 1 and Definition 3, and similarly compute the class similarity between the united class and the other classes using Definition 4.
4) If the number of classes is bigger than 2, execute steps 2 and 3 again.
5) When the number of classes becomes equal to 2, construct the uppermost SVM of the binary tree (a sketch of this procedure follows Fig. 2).

[Fig. 2 Inverted binary tree hierarchical multiclass SVM (top node: AC vs BDE)]

Fig. 2 shows the inverted binary tree of hierarchical multi-class SVM with five classes. At the first stage, the class similarity between B and E is computed, which is the biggest, so SVM_BE is constructed: B and E are united into one class and the class center of BE is computed. Now consider the similarity between A, C, D and BE. At the second stage SVM_AC is constructed, because the class similarity between A and C is the biggest; A and C are united into one class and the class center of AC is computed. Next consider the similarity between D, BE and AC. At the third stage the similarity between D and BE is the biggest, so SVM_DBE is constructed by uniting D and BE into one class. There are now only two united classes, and the top SVM is trained on these two classes (AC vs BDE). While testing a sample, the classification procedure goes from the root node to a leaf node with the help of the binary SVMs.
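A minimal sketch of steps 1-5, assuming input-space centers and radii for brevity (the paper computes them in the kernel feature space) and hypothetical string class labels:

    import numpy as np

    def class_stats(X):
        center = X.mean(axis=0)                            # Definition 1
        radius = np.linalg.norm(X - center, axis=1).max()  # Definition 3
        return center, radius

    def build_inverted_tree(classes):
        # classes: dict of string label -> pattern matrix; returns the merge
        # order, i.e. one binary SVM to train per merge (SVM_BE, SVM_AC, ...)
        data = dict(classes)
        stats = {c: class_stats(X) for c, X in data.items()}

        def similarity(i, j):                              # Definition 4
            (mi, ri), (mj, rj) = stats[i], stats[j]
            return (ri + rj) / np.linalg.norm(mi - mj)

        merges = []
        while len(data) > 2:                               # step 4
            i, j = max(((a, b) for a in data for b in data if a < b),
                       key=lambda p: similarity(*p))       # steps 1-2
            merges.append((i, j))
            united = np.vstack((data.pop(i), data.pop(j)))
            label = i + j                                  # e.g. "B"+"E" -> "BE"
            data[label] = united
            stats[label] = class_stats(united)             # step 3
        merges.append(tuple(sorted(data)))                 # step 5: top SVM
        return merges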
B. Tree structured multiclass SVM

A tree-structured multi-class SVM is proposed in [2]. It is a series of two-class SVMs. The distance between two classes' patterns and the number of each class's patterns are used to determine the structure of the tree. Let n_i denote the number of i-th class patterns x^i_m, where i = 1, 2, ..., k. The center point of the i-th class patterns is calculated as:

    c_i = (1/n_i) Σ_{m=1..n_i} x^i_m        (1)
The Euclidean distance between the i-th class and the j-th class is calculated as:

    E_d = ||c_i − c_j||

The separability of two class patterns, taking their distribution into account, is given by the distance equation:

    d_ij = E_d / (γ_i + γ_j)        (2)

where

    γ_i = (1/n_i) Σ_{m=1..n_i} ||x^i_m − c_i||        (3)

Obviously d_ij equals d_ji. Find a pair (i*, j*) by calculating the distances of all pairwise classes, where the distance between class i* and class j* is extreme. To explain how to construct the tree, an example of a five-class pattern recognition problem is illustrated in Fig. 3. The five classes are denoted by the set {A, B, C, D, E}.
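Equations (1)-(3) and the search for the extreme pair translate directly into the following sketch (names and data layout are assumptions):

    import numpy as np

    def separability(Xi, Xj):
        # d_ij = ||c_i - c_j|| / (gamma_i + gamma_j), equations (1)-(3)
        ci, cj = Xi.mean(axis=0), Xj.mean(axis=0)
        gi = np.linalg.norm(Xi - ci, axis=1).mean()  # average spread of class i
        gj = np.linalg.norm(Xj - cj, axis=1).mean()
        return np.linalg.norm(ci - cj) / (gi + gj)

    def extreme_pair(classes):
        # classes: dict label -> pattern matrix; returns the most separable pair
        return max(((i, j) for i in classes for j in classes if i < j),
                   key=lambda p: separability(classes[p[0]], classes[p[1]]))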
[Fig. 3 How to separate five class patterns into two classes (panels 3.1-3.3)]

From Fig. 3.1, the distance between Class A and Class E, d_AE, is the biggest. In the pair (A, E), let A be Class 1 and E be Class −1. Next, separate the remaining three class patterns {B, C, D} into the two classes (Class 1 or −1). Compare the distances d_BA and d_BE: since d_BA < d_BE, Class B is put into Class 1. Similarly, calculate the distances for C and D, where d_CA < d_CE and d_DA > d_DE. The result is as shown in Fig. 3.3. At last, compare the numbers of patterns in Class 1 and Class −1; let n_1 (n_−1) denote the number of patterns in Class 1 (−1). Parameter μ is the balance of the distribution of patterns:

    μ = min(n_1, n_−1) / max(n_1, n_−1)        (4)
Suppose one class is moved from one subset to the other, and let n_F denote the number of its patterns; the balance then becomes:

    μ' = min(n_1 + n_F, n_−1 − n_F) / max(n_1 + n_F, n_−1 − n_F)        (5)

The supposition holds if μ' is larger than the original parameter μ; if not, the supposition is canceled. The operation is repeated until the parameter μ is not increased any more. At the end, the five class patterns are divided into two subsets, {A, B, C} and {D, E}, and these two subsets give us a classifier. Now construct a decision tree by separating the subsets and constructing the corresponding classifiers recursively, until each subset contains only one element. Fig. 4 shows the tree-structured SVM obtained using this method: Fig. 4.1 is the division of the five class patterns and Fig. 4.2 is the decision tree. An unknown pattern starts testing from the root and ends at one of the leaves by traveling the decision tree. A sketch of the top-level division follows.

[Fig. 4.1 An intuitive division of the five class patterns (ABC vs DE)]
[Fig. 4.2 A decision tree for classifying the five classes]
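A hedged sketch of this top-level division; the μ' rebalancing loop of equation (5) is reduced to a single balance report, and the seeds, names and input-space distances are assumptions:

    import numpy as np

    def split_two_subsets(classes):
        # classes: dict label -> pattern matrix
        centers = {c: X.mean(axis=0) for c, X in classes.items()}
        counts = {c: len(X) for c, X in classes.items()}
        # seed the split with the pair whose centers are farthest apart
        a, b = max(((i, j) for i in classes for j in classes if i < j),
                   key=lambda p: np.linalg.norm(centers[p[0]] - centers[p[1]]))
        pos, neg = {a}, {b}
        for c in classes:
            if c in (a, b):
                continue
            da = np.linalg.norm(centers[c] - centers[a])
            db = np.linalg.norm(centers[c] - centers[b])
            (pos if da < db else neg).add(c)   # join the nearer seed
        n1 = sum(counts[c] for c in pos)
        n_1 = sum(counts[c] for c in neg)
        mu = min(n1, n_1) / max(n1, n_1)       # equation (4); refine with (5)
        return pos, neg, mu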
V. ANALYSIS OF THE TWO ALGORITHMS

Both algorithms have used the same kernel function but different data sets and different values of σ and C, so their results cannot be compared directly. But they have some common factors:
• Both of them have fast training and testing speed.
• Both try to remove the drawbacks of OVO, OVR and DAG, in particular the problem of the unclassifiable region.
• Both work on the basis of the distance of the class centers to obtain the binary tree.
• Both have approximately the same accuracy level.

We can implement both algorithms and use the same data sets, kernel function and parameter values for an accurate performance comparison.
CONCLUSION

The paper proposes the hierarchical multiclass SVM and tree-structured multiclass SVM algorithms for the implementation of an Intrusion Detection System. Results for the proposed system are not yet available, but it seems that multi-class pattern recognition problems can be solved using tree-structured binary SVMs and that the resulting intrusion detection system could be faster than other methods.

REFERENCES

[1] Lili Cheng, Jianpei Zhang, Jing Yang, Jun Ma, "An improved hierarchical multi-class support vector machine with binary tree architecture," 2008 IEEE, DOI 10.1109/ICICSE.2008.
[2] Jun Guo, "An efficient algorithm for multiclass support vector machines," 2008 IEEE, DOI 10.1109/ICACTE.2008.48.
[3] F. Sabahi, A. Movaghar, "Intrusion detection: a survey," 2008 IEEE, DOI 10.1109/ICSNC.2008.44.
[4] Latifur Khan, Mamoun Awad, Bhavani Thuraisingham, "A new intrusion detection system using support vector machines and hierarchical clustering," The VLDB Journal, DOI 10.1007/s00778-006-0002, 2007.
[5] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, 1995.
[6] C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[7] K. P. Bennett, J. A. Blue, "A support vector machine approach to decision trees," in Proceedings of IJCNN'98, Anchorage, Alaska, 1998, pp. 2396-2401.
[8] Takahashi Fumitake, Abe Shigeo, "Decision-tree-based multi-class support vector machines," in Proceedings of ICONIP'02, 2002, vol. 3, pp. 1418-1422.
[9] J. Weston, C. Watkins, "Multi-class support vector machines," Royal Holloway College, TR CSD-TR-98-04, 1998.
[10] C. W. Hsu, C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[11] J. Platt, et al., "Large margin DAGs for multi-class classification," in Advances in Neural Information Processing Systems 12, Cambridge, MA: MIT Press, 2000, pp. 547-553.
[12] Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, Johnson Thomas, "Modeling intrusion detection system using hybrid intelligent systems," Journal of Network and Computer Applications, DOI 10.1016/j.jnca.2005.06.003.
[13] Xiaodan Wang, Zhaohui Shi, Chongming Wu, Wei Wang, "An improved algorithm for decision-tree-based SVM," 1-4244-0332-4/06, 2006 IEEE.