Support Vector Classifiers and Network Intrusion Detection

John Mill
Spokane Falls Community College
Spokane, WA 99224
Email: [email protected]

Atsushi Inoue
Department of Computer Science
Eastern Washington University
Cheney, WA 99004
[email protected]

Abstract— Within network security, there is the task of intrusion detection. Intrusion detection is a classification task that attempts to discern whether a given request for network service is an intrusion attempt or a safe request. Since the creation of the 1999 KDD Cup network intrusion data set, several machine learning approaches to this task have proven successful. In this work we apply the Support Vector Machine (SVM) learning approach to classifying network requests. We use computational experiments to explore two factors that influence SVM performance on this task and demonstrate two novel approaches to it.

I. INTRODUCTION

The increase in the usage of the Internet has also brought an increase in attempts to compromise network security. From 1990 to 2002, there was a 300 percent increase in network compromise incidents[1]. One way to deal with network compromise is intrusion detection: the task of classifying network service requests as either hostile (i.e., an attack) or normal (i.e., safe). One of the most popular intrusion detection toolkits is SNORT[2]. While successful, SNORT currently relies on security administrators to fine-tune and configure the detection system.

The 1999 KDD Cup[3] involved a network intrusion task. Participants were given training sets consisting of 5 million data points collected from real-world networks and asked to train machine learning classifiers on the data. The competing classifiers were then compared on a testing set collected from a different time period. Many machine learning classifiers have been tested on the KDD Cup data set [4][5][6][7] and have been shown to be successful. This has prompted a line of work investigating a collaborative, multi-agent intrusion detection system based on soft computing[4]. In this framework, multiple agents contribute their individual classification decisions to an overall master agent. This master agent can issue reinforcement commands to constituent agents to help them adapt to new attack profiles.

One candidate agent for this framework is the Support Vector Machine (SVM). SVMs are maximal margin classifiers first developed by Vapnik[8] and have been shown to be effective in a number of domains[9][10][11][12]. To be useful within this framework, an SVM-based agent must fulfill several key requirements: minimal training time, maximal accuracy, and reinforcement capability. The issue of reinforcement

capable SVMs has already received considerable attention [12][13][14]. Training time is a pressing issue, considering that network intrusion detection can generate massive amounts of training data; as an example, the KDD Cup data set contains roughly 5 million examples. Since the intrusion detection system must respond in real time, expensive training processes are undesirable. Because SVMs solve a quadratic optimization problem over a set of variables whose size equals the number of training set points, training can be extremely time consuming. A number of methods have been developed to deal with this issue, and most involve a trade-off between classification accuracy and training time. One purpose of this paper is to compare some of the popular massive data set SVM training algorithms with regard to this trade-off, as well as to present two novel SVM training methods.

We first describe Support Vector Machines in general before considering some of the more recent SVM training methods (as well as two novel methods). We then present computational experiments and finish with some concluding remarks.

II. SUPPORT VECTOR MACHINES

At its heart, an SVM attempts to find a hyperplane that separates a training set S of labeled vectors into two subsets. For the binary classification problem, each vector is labeled as belonging to either the positive class (+1) or the negative class (-1). In this case, the SVM attempts to find a hyperplane separating the positive points from the negative points. To discover the hyperplane, an SVM solves the following optimization problem[15]: Given:

a training set S = ((\vec{x}_1, y_1), \ldots, (\vec{x}_l, y_l)),

maximize:
\[
W(\vec{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j)
\]

subject to:
\[
\sum_{i=1}^{l} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l
\]

In the above problem, K(\vec{x}_i, \vec{x}_j) is the so-called kernel function, which is used to transform the data into more tractable spaces. In this work, we use the linear kernel function K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j. The final output of the above process

is a vector \vec{\alpha} which is used to define a separating hyperplane as follows:

\[
h(\vec{x}) = \sum_{i=1}^{l} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b
\]
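A minimal Python sketch of how h is evaluated for the linear kernel used in this work (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    """Compute the margin h(x) = sum_i alpha_i * y_i * K(x_i, x) + b,
    using the linear kernel K(x_i, x) = x_i . x."""
    kernel_values = support_vectors @ x          # one dot product per support vector
    return float(np.sum(alphas * labels * kernel_values) + b)

def classify(x, support_vectors, alphas, labels, b):
    """The sign of the margin gives the class; a 0 margin is assigned
    to the positive class, as described in the text."""
    return 1 if svm_decision(x, support_vectors, alphas, labels, b) >= 0 else -1
```

Note that only the support vectors (those \vec{x}_i with \alpha_i > 0) need to be stored to evaluate h.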

Typically, the number of \alpha_i in \vec{\alpha} with \alpha_i > 0 is much smaller than the size of S. The training set vectors with \alpha_i > 0 are the only vectors saved after the optimization problem is run, since they are the only vectors needed to evaluate the hyperplane defined by h. These vectors are called the support vectors, and they have some interesting properties often exploited by SVM trainers. One such property is that, as the only training set vectors retained for future evaluation, they represent a condensation of the training set.

For a given \vec{x}, the output of h(\vec{x}) is called the margin. The sign of the margin determines which class \vec{x} is classified into: if h(\vec{x}) is positive, \vec{x} is classified as positive; if h(\vec{x}) is negative, \vec{x} is classified as negative. The magnitude determines the strength of the classification. In the case of a 0 margin, the point is assigned to the positive class. Geometrically speaking, the sign of the margin determines on which side of the hyperplane the vector falls, while the magnitude of the margin is a measure of the distance between the vector and the hyperplane. Points with greater distances from the hyperplane (far points) are more confidently classified than points with closer distances (near points), since small changes in the hyperplane are less likely to change the classification of far points.

III. LARGE SCALE TRAINING METHODS

Solving a quadratic optimization problem is a time consuming process. When 5 million training examples are considered, training becomes infeasible for application purposes. To this end, numerous methods have been proposed to train SVMs on large scale data sets. We detail some of these methods, non-exhaustively, below.

A. Chunking and Decomposition

Chunking is one of the first large scale training methods considered[16]. The idea is to only consider a portion of the entire problem at a time. In chunking, a subset of the training set, M, is optimized using an off-the-shelf SVM optimizer.
M is often called the working set. A portion of M, the further subset M_SV, is retained while the rest is discarded. The generated SVM is then used to classify the rest of the data set, and the N greatest violators of the KKT conditions (necessary conditions for SVM optimality) are added to M_SV to create a new working set, on which a new SVM is trained. This process is repeated until no violators of the KKT conditions remain. The main idea behind chunking is to identify all the potential support vectors, place them into the working set, and then run the optimizer on this assembled collection. One of chunking's major failings is that it scans the entire data set repeatedly. In addition, the problem may be so large

that even when the support vectors are identified, there may be too many of them for an off-the-shelf optimizer to handle efficiently. Thus, the problem may still be intractable.

Chunking was later expanded into decomposition[17]. In decomposition, the optimizer does not attempt to find the set of support vectors, but only discovers \alpha's for a subset of the training data. As stated in [15], the goal of decomposition is "to optimize the global problem by only acting on a small subset of the data at a time." The small subset acted upon is called the active set. An extreme case of decomposition is the SMO algorithm[18], which has an active set of size two and uses an analytical (as opposed to an iterative optimization) solution to optimize the working set. In practice[19], chunking and decomposition are unable to tackle massive data sets in a timely way.

B. Clustering Support Vector Candidates

One key realization is that the support vectors are all that is needed to discover the optimal hyperplane. This is the same observation that inspired chunking. In clustering [7][20][19], the idea is to first detect those vectors that could be support vectors. These support vector candidates are then used to train a classifier. The usual approach with a clustering SVM is to examine clusters of points. If the points of a cluster have a homogeneous label, then their exemplar or center is added to the set of support vector candidates. If the points of a cluster have heterogeneous labels, then it is assumed that these points are harder to differentiate, and they are all admitted to the set of support vector candidates. Once the support vector candidates have been assembled, an off-the-shelf optimizer is run on the candidates to find the actual support vectors. In practice, clustering algorithms that require only a single pass through the data are used on massive data sets. The advantage of this approach, again, is to reduce the set of points an SVM classifier needs to run on.
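The clustering pre-filter can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a simple single-pass "leader" clustering as a stand-in for whatever single-pass clusterer is employed, and all names are hypothetical.

```python
import numpy as np

def leader_cluster(X, radius):
    """Single-pass 'leader' clustering: assign each point to the first
    leader within `radius`, else make the point a new leader."""
    leaders, assign = [], []
    for x in X:
        for j, leader in enumerate(leaders):
            if np.linalg.norm(x - leader) <= radius:
                assign.append(j)
                break
        else:
            leaders.append(x)
            assign.append(len(leaders) - 1)
    return np.array(leaders), np.array(assign)

def candidate_set(X, y, radius=1.0):
    """Keep the center of each homogeneous cluster, but keep every point
    of a mixed-label cluster, as support vector candidates."""
    leaders, assign = leader_cluster(X, radius)
    cand_X, cand_y = [], []
    for j in range(len(leaders)):
        mask = assign == j
        labels = np.unique(y[mask])
        if len(labels) == 1:
            cand_X.append(X[mask].mean(axis=0))   # homogeneous: exemplar only
            cand_y.append(labels[0])
        else:
            cand_X.extend(X[mask])                # mixed: harder to separate, keep all
            cand_y.extend(y[mask])
    return np.array(cand_X), np.array(cand_y)
```

An off-the-shelf SVM optimizer would then be run on `(cand_X, cand_y)` instead of the full data set.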
There are two general downfalls to this approach. The first is that actual support vectors may be eliminated from the candidate set, which decreases classification accuracy. The other is that the set of support vector candidates may in fact be too large for an off-the-shelf optimizer to handle efficiently.

C. Incremental Support Vector Machines

Incremental Support Vector Machines (ISVMs)[13][14] are based on the same principle of discovering support vector candidates. Generally speaking, a data set S is broken up into n sequential subsets, S_1, S_2, ..., S_n. An SVM classifier is first trained on S_1. The support vectors discovered by this process are added to S_2, and a new classifier is trained. The support vectors from that process are then added to S_3 and a new classifier is trained, and so on until all the subsets have been processed. The idea is to discover which vectors at each stage are the most important to the classification process; this knowledge is then carried forward to the next subset. Unlike chunking, multiple passes through the training set are not required, so the run time of the algorithm should be less than that of chunking or decomposition. ISVMs have the advantage

over other support vector candidate methods (such as clustering) that no expensive, global preprocessing step is needed. A practical problem with this approach, however, is that as successive subsets are considered, the support vectors carried over from previous subsets increase the size of the training problem. This can lead to successive training problems far larger than the optimizer can reasonably complete. In addition, there is no guarantee that an actual support vector will be recognized as a support vector within S_i, due to the distribution of vectors in that subset; such a vector would then not be carried forward into successive subsets and into the global solution. So, the faster training time might come at the expense of global optimality.

To combat this, a novel adaptation of ISVM has been developed. TreeSVM begins by breaking a training set S into n subsets, as in ISVM. An SVM is trained on each subset individually to discover support vectors for each subset, SV_1, ..., SV_n. These sets of support vectors are then concatenated to form a new training set S1. We then divide S1 into n subsets and discover the support vectors for these subsets, which we combine to create S2. The process is repeated until no substantial reduction in the number of support vector candidates is achieved. The remaining candidates are then run through an SVM optimizer to discover the final support vectors. Unlike a traditional ISVM, where the support vectors of the previous subset are carried over to the next subset, the support vectors for each subset are discovered before a new training set is created. TreeSVM combines the support vectors of the subsets into a new training set in a tree-like manner. This approach has a potential time advantage over ISVM in that the size of the subproblems is kept to a constant size.
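The tree-style reduction described above might be sketched as follows. Since the paper uses SVM-light as the optimizer, which we do not model here, the routine that extracts a chunk's support vectors is passed in as a parameter (`find_support_vectors`); this structure is a sketch, not the authors' code.

```python
def tree_svm(train, find_support_vectors, chunk_size=1000):
    """Repeatedly split the current point set into chunks of at most
    `chunk_size`, reduce each chunk to its support vectors, and concatenate
    the results; stop once a round yields no reduction, then run one final
    optimization pass on the surviving candidates."""
    current = list(train)
    while True:
        chunks = [current[i:i + chunk_size]
                  for i in range(0, len(current), chunk_size)]
        reduced = [p for chunk in chunks for p in find_support_vectors(chunk)]
        if len(reduced) >= len(current):        # no substantial reduction: stop
            return find_support_vectors(reduced)
        current = reduced
```

Because each call to `find_support_vectors` sees at most `chunk_size` points, the subproblem size stays constant, which is the claimed advantage over ISVM.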
Without the subproblem size creep usually involved with ISVM, the run time of the algorithm is expected to be much lower. TreeSVM should, however, still be susceptible to the same local subset distributions as ISVM.

D. ArraySVM

ArraySVM is a new method we devised, based on the Fuzzy SVM[19] approach used to solve multi-class problems with SVMs. For a multi-class problem, an SVM binary classifier is trained for each class to distinguish that class from the others. When a novel point is presented for classification, the classifier that outputs the margin with the greatest magnitude is used to classify the example. If there is a tie, one of the competing classifiers is chosen at random.

Our novel approach is to apply this same idea to training on a massive data set. Training takes place as follows. The data set is broken into segments and a classifier is trained on each segment (as in TreeSVM and ISVM). These classifiers are then placed into an array. When a novel point is presented for classification, each classifier in the array classifies the point, and the classifier whose margin has the greatest magnitude is used. Training time should be less than for TreeSVM, since the size of the subproblems is kept small and only one pass through the training set is required. In addition, since the size of the subproblems does not grow as it does in ISVM, ArraySVM

can be expected to be faster than ISVM. ArraySVM can be viewed as a mixture of experts approach: each sub-SVM is the expert on its subset. To the overall classifier, a sub-SVM signals its confidence in its classification through the magnitude of its margin. The most confident sub-SVM is then used to classify the point.

IV. EXPERIMENTAL SETUP

Some of the above methods were tested on the KDD Cup data set. SMO and SVM Light[17] were tested as exemplars of decomposition, while ISVM and TreeSVM represented support vector candidacy. ArraySVM was also tested. For ISVM, TreeSVM, and ArraySVM, SVM Light was used as the generic SVM optimizer and the training files were segmented into files no larger than 1000 points. Each method was tested on random subsamples of the full KDD Cup data set: ten subsamples of size one thousand and ten subsamples of size ten thousand. After training, the classifiers were tested on the full KDD Cup testing data (some three hundred thousand testing examples). The testing set was collected at a different time than the training set, so the distribution of examples is not identical. The algorithms were measured for training time, testing set accuracy, and number of support vectors. Previous work[12] had determined that there was a strong relationship between the number of support vectors and accuracy. All tests were conducted on an Athlon XP 1800 processor with 512MB of RAM, running Red Hat Linux 9.0.

V. RESULTS

Table I contains the data for the data sets containing one thousand examples. The results for each item were averaged across all ten training sets. The first thing to notice is that SMO performs poorly in terms of both accuracy and training time. Also of note is the low number of support vectors identified on average by SMO; this leads to a less complex hypothesis that is unable to handle all the complexities of the testing data. SVM Light had the best training time.
This is expected since, aside from SMO, the other algorithms repeatedly train classifiers. The accuracy of ArraySVM was the greatest, probably due to its mixture of experts nature. The other two algorithms performed less well for reasons discussed above.

TABLE I
AVERAGED RESULTS FOR 1K

Algorithm   | Acc     | Training Time  | # of SV's
SMO         | 75.97%  | 1962.25 secs   | 28.00
SVM Light   | 88.55%  | 24.10 secs     | 194.10
ISVM        | 88.12%  | 26.00 secs     | 181.10
TreeSVM     | 85.99%  | 28.2 secs      | 134.10
ArraySVM    | 90.78%  | 45.00 secs     | 199.3

Testing on the data sets with ten thousand examples yields the results in Table II. SMO was omitted from the results after it failed to complete a single training session in under 48 hours. The first thing to notice is that while the problem size increased by a factor of ten, the run times of SVM Light, ISVM, and TreeSVM increased by a factor of about 40. This does not bode well for scaling these approaches to the full KDD Cup data set. ArraySVM experienced a time increase only proportional to the increase in the size of the data set.

The other thing to notice is that the accuracy of all but ArraySVM decreased despite the additional training data. For the other programs, the number of support vectors increased by roughly a factor of ten. As noted in [12], a higher number of support vectors can yield a more specific and complex hypothesis that may overgeneralize; this may be the case here. ArraySVM is composed of SVMs trained on subsets of size 1000, so each classifier in the array can be expected to have the average accuracy recorded above and thus not be susceptible to the same overgeneralization. ArraySVM clearly benefits from the increased training and outperforms the other classifiers in terms of both training time and accuracy. This is no surprise considering that the next most accurate classifier is the SVM Light classifier trained on one thousand data points. ArraySVM takes these already high-performing classifiers and combines them to improve overall performance.
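The margin-magnitude combination rule that ArraySVM applies at prediction time can be sketched as follows. Here `classifiers` is a list of margin functions standing in for the per-segment SVMs; the names are illustrative, not the authors' code, and ties between equal magnitudes are broken by position rather than at random as in the paper.

```python
def array_svm_predict(x, classifiers):
    """Each sub-SVM reports its signed margin h(x); the sub-SVM whose
    margin has the greatest magnitude (the most confident expert)
    decides the class of x."""
    margins = [h(x) for h in classifiers]
    best = max(margins, key=abs)        # most confident sub-SVM
    return 1 if best >= 0 else -1
```

The alternatives mentioned later (averaging, summing, or weighting the margins) would replace only the `max` step.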

TABLE II
AVERAGED RESULTS FOR 10K

Algorithm   | Acc     | Training Time  | # of SV's
SVM Light   | 75.26%  | 13174.4 secs   | 848.00
ISVM        | 80.52%  | 1119.2 secs    | 1412
TreeSVM     | 70.75%  | 1115.19 secs   | 1075
ArraySVM    | 91.30%  | 491.00 secs    | 2030.23

VI. CONCLUDING REMARKS AND FUTURE WORK

Several methods for employing SVMs for network intrusion detection were presented, two of them newly developed. Of these methods, the new method ArraySVM was shown to be the best in both training time and accuracy. SMO was shown to perform poorly on this data set, while the other methods presented tended to sacrifice training time for accuracy.

There are two general categories of future work that this work suggests. The first category contains experimental considerations. The KDD Cup data set contains 5 million training examples, and at most ten thousand training examples were used here. Obviously, the above methods need to be tested on the full training set to validate the conclusions drawn here. Another experimental consideration is how ArraySVM makes use of the individual classifiers. ArraySVM currently uses only the most confident classifier. Other possibilities include averaging, summing, or weighting the individual classifiers.

The second, and perhaps farther reaching, category of future work lies in incorporating SVMs within a multi-agent intrusion detection framework. For this to happen, training on massive data sets would need to be combined with reinforcement learning. In this scenario, the current set of support vectors would have to be controlled carefully. Vectors whose usefulness has diminished due to outdated attack methods would have to be pruned, while vectors representing new classes of attacks would have to be added. One future line of work along these lines is using a set of fuzzy vectors for support vectors and handling updates as either adding a new vector or updating one of the fuzzy support vectors.

REFERENCES

[1] CERT, 2003, http://www.cert.org/.
[2] B. Caswell and M. Roesch, 2003, http://www.snort.org/.
[3] KDD Cup 1999 Intrusion Detection Dataset, 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[4] P. Miller, J. Mill, and A. Inoue, "Synergistic and Perceptual Intrusion Detection with Reinforcement (SPIDER)," in Midwest Conference on Artificial Intelligence and Cognitive Science, 2003.
[5] P. Miller and A. Inoue, "Collaborative intrusion detection system," in North American Fuzzy Information Processing Society, 2003, submitted.
[6] A. Inoue, "Perceptual intrusion detection system," in North American Fuzzy Information Processing Society, 2003.
[7] H. Yu, J. Yang, and J. Han, "Classifying large data sets using SVMs with hierarchical clusters," in Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[8] B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Fifth Annual Workshop on Computational Learning Theory, 1992.
[9] J. Mill, "Support Vector Machines, N-gram Kernels, and Text Classification," Master's thesis, Eastern Washington University, 2002.
[10] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
[11] G. Siolas and F. d'Alché-Buc, "Support vector machines based on a semantic kernel for text categorization," in IEEE-IJCNN, 2000.
[12] J. Mill and A. Inoue, "Reinforcement of a support vector classifier – toward a framework of perceptual information processing –," in Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2004.
[13] C. Domeniconi and D. Gunopulos, "Incremental support vector machine construction," in ICDM, 2001, pp. 589–592.
[14] N. A. Syed, H. Liu, and K. K. Sung, "Handling concept drifts in incremental learning with support vector machines," in Knowledge Discovery and Data Mining, 1999, pp. 317–321.
[15] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[16] E. Osuna and F. Girosi, "Reducing the run-time complexity of support vector machines," in International Conference on Pattern Recognition, 1998.
[17] T. Joachims, "Making Large-Scale SVM Learning Practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds., MIT Press, 1999.
[18] J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds., MIT Press, 1999.
[19] S. Abe, Support Vector Machines for Pattern Classification, Kobe University Press, 2003.
[20] D. Boley and D. Cao, "Training support vector machine using adaptive clustering," in 2004 SIAM International Conference on Data Mining, 2004, to appear.