KEDRI-NICT Project Report - APPENDIX:D
String Kernel Based SVM for Internet Security Implementation

Zbynek Michlovský (1), Shaoning Pang (1), Nikola Kasabov (1), Tao Ban (2), and Youki Kadobayashi (2)

(1) Knowledge Engineering & Discovery Research Institute, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand, {spang,nkasabov}@aut.ac.nz
(2) Information Security Research Center, National Institute of Information and Communications Technology, Tokyo, 184-8795 Japan, [email protected], [email protected]

Published in: C.S. Leung, M. Lee, and J.H. Chan (Eds.): ICONIP 2009, Part II, LNCS 5864, pp. 530-539, 2009. © Springer-Verlag Berlin Heidelberg 2009
Abstract. For network intrusion and virus detection, ordinary methods detect malicious network traffic and viruses by examining packets, flow logs or the content of memory for any signatures of the attack. This implies that if no signature is known or created in advance, attack detection will be problematic. To address the detection of unknown attacks, we develop in this paper a network traffic and spam analyzer using a string kernel based SVM (support vector machine) supervised machine learning. The proposed method is capable of detecting network attacks without known or previously determined attack signatures, as the SVM automatically learns attack signatures from traffic data. For application to internet security, we have implemented the proposed method for spam email detection over the SpamAssassin and E. M. Canada datasets, and for network application authentication via real connection data analysis. The obtained accuracies above 99% demonstrate the usefulness of string kernel SVMs in network security for either detecting 'abnormal' or protecting 'normal' traffic.
1 Introduction
As computers and the Internet become more and more integrated into our everyday life, higher security requirements have been imposed on our computer and network systems. One of the many measures we can take to increase security is to use an Intrusion Detection System (IDS). Intrusion detection is the process of monitoring events occurring in a computer system or network and analyzing them for signs of possible incidents. IDS, as described in [10], can be grouped into two categories: statistical anomaly based IDS and signature based IDS. The idea of statistical anomaly IDS is to detect intrusions by comparing traffic with a model of normal traffic and looking for deviations. Due to the diversity of network traffic, it is difficult to model normal traffic, since normal email relaying or peer-to-peer queries may also exhibit characteristics similar to intrusion traffic. Moreover, even abnormal traffic does not in fact
constitute an intrusion/attack. Hence, anomaly detection often has a high false alarm rate and is therefore seldom used in practice. For signature based IDS, network traffic is examined for predetermined attack patterns known as signatures. A signature consists of a string of characters (or bytes); nowadays many intrusion detection systems also support regular expressions and even behavioral fingerprints [13]. The difficulty of signature based IDS is that only intrusions whose signatures are known can be detected, and it is necessary to constantly update the collection of signatures to mitigate emerging threats [14].

Another approach for enhancing network security is to authenticate and protect legitimate network traffic. Traditional traffic authentication based on the network protocol (e.g. the port number) is becoming inaccurate and inappropriate for the identification of P2P and other new types of network traffic. Other methods are mostly based on protocol anomaly detection, which is highly limited because legitimate Internet traffic does not strictly conform to any specific traffic model. Thus those methods often suffer from high false positive rates in real applications. For example, legitimate SMTP traffic could be identified as malicious traffic if a misconfiguration in the MTA server adds suspicious fields to the header of an email message. Thus, even valid SMTP traffic following just the standard SMTP network protocol is likely to be mis-authenticated.

For either attack detection or legitimate network traffic authentication, most signatures/rules are created by security experts who analyze network traffic and host logs after intrusions have occurred; sifting through thousands of lines of log files looking for characteristics that uniquely identify an intrusion is a vast and error prone undertaking. To overcome this shortcoming and detect unknown attacks (i.e. attacks whose signatures have not been determined), we investigated machine learning techniques for string content recognition. The motivation is to train a classifier to distinguish between malicious and harmless flows or memory dumps, and to use the trained classifier to classify real network flows and memory dumps.

The support vector machine (SVM) is known as one of the most successful classification algorithms for data mining. In this work we address string, rather than numerical, data analysis, and implement string kernel SVMs for network security in the form of intrusion detection and network application authentication. Despite the limitation of the SVM in training efficiency, the advantage of string kernel SVMs is that they do not require complete knowledge of attack signatures or normal application features, since a string kernel SVM is able to automatically learn the problem knowledge during the training procedure. In this work, we develop SVM based string kernel methods according to different mathematical similarity expressions between two strings/substrings. For network security, we derive string kernel SVMs for automatic attack (i.e. spam email) signature analysis, conducting spam filtering without previously determined spam signatures. Moreover, we use string kernel SVMs to authenticate legitimate network applications, learning the SVM from connection differences against normal connections.
2 SVM and Kernels Theory
Support vector machines (SVM) are a group of supervised learning methods applicable to classification or regression. SVM maps the data points into a high dimensional feature space, where a linear learning machine is used to find a maximal margin separation [7,8]. One of the main statistical properties of the maximal margin solution is that its performance does not depend on the dimensionality of the space where the separation takes place. In this way, it is possible to work in very high dimensional spaces, such as those induced by kernels, without overfitting.

Kernels provide support vector machines with the capability of implicitly mapping non-linearly separable data points into a different, higher dimensional space, where they are more separable than in the original space. This is also called the kernel trick [7]. A kernel function K(x, y) can be expressed as a dot product in a high dimensional space. If the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite, i.e. for any finite subset {x_1, ..., x_n} of X and any objects {c_1, ..., c_n}

    Σ_{i,j} K(x_i, x_j) c_i c_j ≥ 0,    (1)
then there must exist a function φ(x) whose range is in an inner product space of possibly high dimension, such that K(x, y) = φ(x) · φ(y). The kernel method allows a linear algorithm to be transformed into a non-linear one, equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the function φ is never explicitly computed. The kernel representation of data amounts to a nonlinear projection of the data into a high-dimensional space where it is easier to separate the classes [12]. The most popular kernels suitable for SVM, e.g. the polynomial kernel, the Gaussian radial basis function kernel and the hyperbolic tangent kernel [11], all operate on numerical data. For our purpose it is necessary to use string kernels, which are described in the following section.
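As a minimal numeric illustration of the kernel trick (not taken from the paper; the degree-2 polynomial kernel and the toy vectors below are our own choice), the following Python sketch checks that K(x, y) = (x · y)^2 equals the inner product of an explicit degree-2 feature map that the kernel itself never has to build:

```python
import numpy as np

def poly2_kernel(x, y):
    # Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2.
    return np.dot(x, y) ** 2

def poly2_features(x):
    # Explicit feature map: all ordered products x_i * x_j,
    # so that phi(x) . phi(y) = (x . y)^2.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print(poly2_kernel(x, y))                            # 20.25
print(np.dot(poly2_features(x), poly2_features(y)))  # 20.25 as well
```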
3 String Kernels Used in SVM
Regular kernels for SVM work merely on numerical data, which is unsuitable for internet security, where a huge amount of string data is present. Towards extending SVM for string data processing, we implemented the following string kernel algorithms in our experiments.
3.1 Gap-Weighted Subsequence Kernel
The theory of the subsequence kernel is described in the book Kernel Methods for Pattern Analysis [2]. The main idea behind the gap-weighted subsequence kernel is to compare strings by means of the subsequences they contain: the more subsequences and the fewer gaps they share, the more similar they are. For reducing the dimensionality of the feature space we consider non-contiguous substrings of fixed length p. The feature space of the gap-weighted subsequence kernel is defined as

    φ_u^p(s) = Σ_{i: u = s(i)} λ^{l(i)},   u ∈ Σ^p,    (2)

where λ ∈ (0, 1) is a decay factor, i indexes the occurrence of the subsequence u = s(i) in the string s, and l(i) is the length of that occurrence in s. We weight the occurrence of u with the exponentially decaying factor λ^{l(i)}. The associated kernel is defined as

    κ(s, t) = ⟨φ^p(s), φ^p(t)⟩ = Σ_{u ∈ Σ^p} φ_u^p(s) φ_u^p(t).    (3)

To evaluate Eq. (3), it is required to construct an intermediate dynamic programming table DP_p whose entries are

    DP_p(k, l) = Σ_{i=1}^{k} Σ_{j=1}^{l} λ^{k-i+l-j} κ_{p-1}^S(s(1:i), t(1:j)).    (4)

Then the kernel is evaluated as

    κ_p^S(sa, tb) = λ^2 DP_p(|s|, |t|)  if a = b;  0 otherwise,    (5)

from which it follows that, for a single value of p, the complexity of computing κ_p^S is O(|s||t|). Thus, the overall computational complexity of κ_p(s, t) is O(p|s||t|).
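As an illustration of how this kernel can be computed, the following Python sketch uses the standard recursive formulation from [2,5] rather than the dynamic programming table of Eq. (4). It is a naive reference version (roughly O(p·|s|·|t|^2)); the function name, the decay parameter lam and the example strings are ours, and this is not the authors' implementation:

```python
from functools import lru_cache

def gap_weighted_subsequence_kernel(s, t, p, lam=0.5):
    # Kernel kappa_p(s, t) of Eq. (3), computed with the recursion of [2,5].

    @lru_cache(maxsize=None)
    def k_prime(i, m, n):
        # Auxiliary kernel K'_i evaluated on the prefixes s[:m], t[:n].
        if i == 0:
            return 1.0
        if min(m, n) < i:
            return 0.0
        x = s[m - 1]
        value = lam * k_prime(i, m - 1, n)            # drop the last character of s
        for j in range(1, n + 1):                     # match x inside t[:n]
            if t[j - 1] == x:
                value += k_prime(i - 1, m - 1, j - 1) * lam ** (n - j + 2)
        return value

    def k(m, n):
        # Main kernel K_p on the prefixes s[:m], t[:n].
        if min(m, n) < p:
            return 0.0
        x = s[m - 1]
        value = k(m - 1, n)
        for j in range(1, n + 1):
            if t[j - 1] == x:
                value += k_prime(p - 1, m - 1, j - 1) * lam ** 2
        return value

    return k(len(s), len(t))

# "cat" and "cart" share the length-2 subsequences "ca", "ct" and "at";
# the extra gap in "cart" is penalised by the decay factor lam.
print(gap_weighted_subsequence_kernel("cat", "cart", p=2, lam=0.5))
```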
3.2 Levenshtein Distance
The Levenshtein (or edit) distance [4] counts the differences between two strings. The distance is the number of substitutions, deletions or insertions required to transform a string s of length n into a string t of length m. The formal definition of the Levenshtein distance [17] is given as follows. Given a string s, let s(i) stand for its i-th character. For two characters a and b, define

    r(a, b) = 0 if a = b;  r(a, b) = 1 otherwise.    (6)

Assuming two strings s and t of length n and m, respectively, an (n + 1) × (m + 1) array d furnishes the required value of the Levenshtein distance L(s, t). The calculation of d is a recursive procedure. First set d(i, 0) = i for i = 0, 1, ..., n and d(0, j) = j for j = 0, 1, ..., m; then, for the other pairs i, j, we have

    d(i, j) = min(d(i − 1, j) + 1, d(i, j − 1) + 1, d(i − 1, j − 1) + r(s(i), t(j))).    (7)

In our implementation, we use D = e^{−λ·d(i,j)} to obtain better results. Analogous to the above substring kernel, the computational complexity of the Levenshtein distance is O(|s||t|). In the case that s and t have the same length n, the complexity is O(n^2).
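A minimal NumPy sketch of the recurrence in Eqs. (6)-(7) and of the exponential transform D (the decay value lam and the example strings are illustrative choices of ours):

```python
import numpy as np

def levenshtein(s, t):
    # Edit distance d(n, m) computed with the recurrence of Eq. (7).
    n, m = len(s), len(t)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)                        # d(i, 0) = i
    d[0, :] = np.arange(m + 1)                        # d(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r = 0 if s[i - 1] == t[j - 1] else 1      # Eq. (6)
            d[i, j] = min(d[i - 1, j] + 1,            # deletion
                          d[i, j - 1] + 1,            # insertion
                          d[i - 1, j - 1] + r)        # substitution
    return int(d[n, m])

def edit_distance_similarity(s, t, lam=0.1):
    # D = exp(-lam * L(s, t)), used as the precomputed kernel entry.
    return float(np.exp(-lam * levenshtein(s, t)))

print(levenshtein("kitten", "sitting"))               # 3
print(edit_distance_similarity("kitten", "sitting"))
```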
3.3 Bag of Words Kernel
The bag of words kernel represents a document as an unordered collection of words, disregarding grammar and word order. Words are any sequences of letters from the basic alphabet separated by punctuation or spaces. We represent a bag as a vector in a space in which each dimension is associated with one term from the dictionary,

    φ : d → φ(d) = (tf(t_1, d), tf(t_2, d), ..., tf(t_N, d)) ∈ R^N,    (8)

where tf(t_i, d) is the frequency of the term t_i in the document d. Hence, a document is mapped into a space of dimensionality N, N being the size of the dictionary, typically a very large number [2].
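A small sketch of the corresponding term-frequency inner product (the tokenisation rule, lower-cased letter sequences, is our assumption):

```python
from collections import Counter
import re

def term_frequencies(document):
    # tf(t, d): words are letter sequences separated by punctuation or spaces.
    return Counter(re.findall(r"[a-z]+", document.lower()))

def bag_of_words_kernel(d1, d2):
    # Inner product of the two term-frequency vectors of Eq. (8).
    tf1, tf2 = term_frequencies(d1), term_frequencies(d2)
    return sum(count * tf2[term] for term, count in tf1.items() if term in tf2)

print(bag_of_words_kernel("Buy cheap meds now", "Cheap meds, buy now!!!"))   # 4
```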
3.4 N-Gram Kernel
N-grams transform documents into high dimensional feature vectors where each feature corresponds to a contiguous substring [5]. The kernel associated with the n-gram feature space is defined as

    κ(s, t) = ⟨φ^n(s), φ^n(t)⟩ = Σ_{u ∈ Σ^n} φ_u^n(s) φ_u^n(t),    (9)

where φ_u^n(s) = |{(v_1, v_2) : s = v_1 u v_2}|, u ∈ Σ^n. We used a naive approach for computing the n-gram kernel, so the time complexity is O(n|s||t|).
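The following sketch shows the naive n-gram (spectrum) computation of Eq. (9); the helper names and the example strings are ours:

```python
from collections import Counter

def ngram_profile(s, n):
    # phi_u^n(s): occurrence count of every contiguous substring u of length n.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(s, t, n=3):
    # kappa(s, t) = sum over u of phi_u^n(s) * phi_u^n(t).
    ps, pt = ngram_profile(s, n), ngram_profile(t, n)
    return sum(count * pt[u] for u, count in ps.items() if u in pt)

print(ngram_kernel("GET /index.html HTTP/1.1", "GET /home.html HTTP/1.1", n=3))
```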
4 Experiments and Discussions
For internet security, one useful approach is to actively detect and filter spam/attacks by building Bayesian filters. Another practical approach is to protect legitimate network communication by authenticating every type of legitimate network application. In this section, we apply string kernel SVMs to the SpamAssassin public mail corpus for email spam detection, and experiment with the authentication of 12 categories of standard network applications. For multi-class classification, we used the support vector machine software libSVM [1]. In our experiments, we used kernel matrices precomputed from the prepared training and testing datasets. All values in the precomputed kernel matrices were scaled to the interval [-1, 1]. Optimal parameters of the string kernel functions were determined by cross validation tests on the training datasets. The kernel matrices were then supplied as input to libSVM [1].
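To make the precomputed-kernel workflow concrete, the sketch below builds a Gram matrix with the n-gram kernel of Section 3.4 and feeds it to an SVM. We use scikit-learn's precomputed-kernel SVC merely as a stand-in for libSVM's precomputed-kernel mode; the toy strings, labels and C value are illustrative only, and the [-1, 1] scaling step used in the paper is omitted for brevity:

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def ngram_kernel(s, t, n=3):
    # Contiguous n-gram spectrum kernel (cf. Section 3.4).
    ps = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    pt = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(c * pt[u] for u, c in ps.items() if u in pt)

def gram_matrix(X, Y):
    # Precomputed kernel matrix K[i, j] = k(X[i], Y[j]).
    return np.array([[ngram_kernel(x, y) for y in Y] for x in X], dtype=float)

# Toy strings standing in for normalised messages/flows (hypothetical data).
train_x = ["free meds online now", "meeting at noon today",
           "cheap meds online", "project report draft"]
train_y = [1, 0, 1, 0]                        # 1 = spam, 0 = ham
test_x  = ["cheap meds now", "draft of the report"]

K_train = gram_matrix(train_x, train_x)
K_test  = gram_matrix(test_x, train_x)        # rows: test points, columns: training points

clf = SVC(kernel="precomputed", C=10.0)
clf.fit(K_train, train_y)
print(clf.predict(K_test))                    # predicted labels for the two test strings
```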
4.1 Spam Detection
For the spam detection experiment, we used 5500 ham messages from the SpamAssassin public mail corpus [18] and 24038 spam messages from E. M. Canada [19]. The ham dataset consists of two categories: Easy Ham, 5000 non-spam messages without any spam signatures, and Hard Ham, 500 non-spam messages similar in many aspects to spam messages, using unusual HTML markup, colored text, spam-sounding phrases, etc. Each email message has a header, a body and potentially some attachments. Note that, for the convenience of comparison, we employed exactly the same data setup as in [20]: a training dataset of 23630 messages (19230 spam vs. 4400 ham) and a testing dataset of 5918 messages (4808 spam vs. 1110 ham). Analogous to [20], we intended to determine which part of the email message has a critical influence on the classification results. To this end, we prepared four subsets: Subject, Body, Header and All. The Subject subset uses only the subject field of the email message, and all Subject data are normalized to a length of 100 characters; the Body subset is the body part of the email message, normalized to a length of 1000 characters; the Header subset is the header section of the email message, normalized to a length of 100 characters; the All subset comprises the From field and the Subject field of the header section plus the whole body of the email message. Every instance of the All subset is normalized to a length of 1200 characters.
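A rough sketch of how each message could be cut into the four subsets above (illustrative only: the use of Python's email module is our choice, and padding short fields by repeating their content is an assumption borrowed from the connection preprocessing in Section 4.2):

```python
import email

def normalize(text, length):
    # Truncate to `length` characters; pad shorter strings by repeating their content
    # (assumed here; the paper does not state how short email parts are padded).
    text = text or " "
    while len(text) < length:
        text += text
    return text[:length]

def email_subsets(raw_message):
    # Build the Subject/Header/Body/All strings used as kernel inputs.
    msg = email.message_from_string(raw_message)
    head, _, body = raw_message.partition("\n\n")   # header block ends at the first blank line (Unix line endings assumed)
    subject = msg.get("Subject", "")
    all_part = msg.get("From", "") + " " + subject + " " + body
    return {
        "Subject": normalize(subject, 100),
        "Header":  normalize(head, 100),
        "Body":    normalize(body, 1000),
        "All":     normalize(all_part, 1200),
    }
```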
Table 1. Results of email classification using each kernel function, with a comparison to [20] (percentage accuracy)

Features | N-gram | Subsequence | Edit Distance | Bag of words | Ref. Acc. [20]
Subject  |  96.38 |       96.64 |         95.78 |        81.16 |          92.64
Body     |  99.28 |       99.23 |         97.19 |        81.04 |          84.41
Header   |  99.80 |       99.80 |         99.75 |        82.68 |          92.12
All      |  99.34 |       99.38 |         98.02 |        81.24 |          90.13
Table 1 presents the percentage of correctly classified email messages for each type of string kernel and email subset; the percentage accuracy of [20] is given in the Ref. Acc. column for comparison. As seen from the table, the classification results reached by the string kernel SVM exceed the accuracy reported in the reference paper [20]. The results for the Header subset demonstrate that the first 100 characters of a message header are enough for correct spam classification. Results for the other subsets are consistently good, although they seem more susceptible to spammers' tricks. Among the four string kernels, the outstanding performance of the N-gram and Subsequence kernel functions shows that classification with substring/subsequence kernels is more suitable for spam detection than kernels using the whole string, such as Edit Distance. The Bag of words kernel proves problematic for spam detection, which can be explained by spammers using largely the same words that appear in ham, so word-based similarity cannot separate the two classes.
4.2 Network Application Authentication
In this experiment, we used TCP network traffic produced by common network applications using different communication protocols such as http, https, imap, pop3, ssh and ftp, and by some applications using dynamic ports, such as BitTorrent. All network data was captured and sorted by the program Wireshark [15] into separate files according to protocol, and then split into individual flows (connections) by the program tcpflow [16]; see Fig. 1.
[Figure 1: Network traffic data → Wireshark → single-protocol data → tcpflow → individual connection data]
Fig. 1. Schema of data preparation for network application authentication. Network traffic data are sorted into separate files according to protocol with the program Wireshark [15] and then split into individual flows using the program tcpflow [16].
At the preprocessing stage, we removed traffic in unreliable protocols such as UDP, and reordered the traffic data by type of connection using the program tcpflow [16]. All connections were shortened to a length of 450 bytes and 750 bytes respectively; connections shorter than 450/750 bytes are normalized by repeating their content. Every connection is labelled with its connection port number, within the interval [1, 49151]. For connections using dynamic ports or the same port as other applications, we labelled the connections with unoccupied ports. For example, Http, Msnms and Ocsp connections all use port 80; to avoid conflicts, we label Msnms and Ocsp as ports 6 and 8, respectively. The summary of all connections is given in Table 2. Table 3 presents the best percentage accuracies obtained by the four types of string kernels. Gen. Acc. is recorded as the ratio of correctly classified connections to total connections. The parameters Lambda and C were obtained from prior 5-fold cross validation tests. The parameter Substring length represents the length of the contiguous (non-contiguous) substring for the N-gram (Subsequence) kernel function.
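A minimal sketch of this normalisation and labelling step (function and constant names are ours; only the repeat-padding rule and the port overrides described above are taken from the text):

```python
def normalize_flow(payload: bytes, target_len: int = 450) -> bytes:
    # Truncate a flow payload to target_len bytes; shorter flows are
    # padded by repeating their own content.
    payload = payload or b"\x00"
    while len(payload) < target_len:
        payload += payload
    return payload[:target_len]

# Manual label overrides for applications sharing port 80 or using dynamic ports
# (values taken from Table 2).
PORT_OVERRIDES = {"msnms": 6, "ocsp": 8, "bittorrent": 4}

def flow_label(app_name: str, server_port: int) -> int:
    # Class label: the server port, unless the application needs an unoccupied port.
    return PORT_OVERRIDES.get(app_name.lower(), server_port)

print(flow_label("http", 80), flow_label("msnms", 80), flow_label("ocsp", 80))   # 80 6 8
```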
Table 2. Number of connections (flows) for each protocol for training and testing Protocol name (port number) Class label Training set Testing set Http (80) 80 675 235 BitTorrent (dynamic) 4 299 99 Https (443) 443 292 90 Imap (993) 993 42 14 Pop3 (110) 993 32 10 Aol (5190) 5190 21 7 Ftp (21) 21 12 4 Msnms (80) 6 9 4 Microsoft-ds (445) 445 9 3 Ocsp (80) 8 8 3 Ssh (22) 22 6 2 Pptp (1723) 1723 2 1 Sum 1407 472 Table 3. The best results of network application classification for each kernel function Kernel Function
Parameters Gen. Acc Substring length Connection length C of C-SVC Lambda Subsequence kernel 4 750 65536 0.5 99.58% N-gram kernel 4 450 4096 0.5 99.58% Edit distance kernel 450 16 0.001 99.15% Bag of word kernel 750 1 0.25 49.75%
As seen, Subsequence and N-gram kernel functions give the best 99.58% general accuracy, which follows that only 2 of total 472 connections are misclassified (two Msnms connections are misclassified as Http connections). It is worth noting that N-gram kernel function performs extremely well for distinguishing those network applications with shorter (450 chars) connection instance. However, the Bag of word kernel function is unable to recognize network applications because this kernel distinguishes network applications based on word similarity, when the word (continuous sequence of characters between any two gaps) represents a whole network connection, it is often too long for the kernel to differentiate applications. Fig. 2 discloses the relationship between the substring length and general accuracy over network flows (connections) with length 450 and 750 chars, respectively. As seen, the general accuracy decreases over the length of substring for both N-gram and Subsequence kernels. This suggests that an longer substring normally leads to a decreased classification performance from the presented two string kernel functions. In general, results in Table 3 and Fig. 2 have proved the effectiveness of applying Edit Distance, Subsequence and N-gram kernel functions for recognizing hidden network traffic including encrypted connections. Edit Distance kernel
[Figure 2: two panels, "Flows (connections) with length 450 chars" and "Flows (connections) with length 750 chars"; x-axis: substring length (4-16), y-axis: general accuracy [%] (94-100); curves for the N-gram and Subsequence kernel functions]
Fig. 2. Relation between the substring length and general accuracy for network flows (connections) with length 450 and 750 chars, respectively
The Edit Distance kernel identifies a global similarity of the string (connection), yet it surprisingly gives a 99.15% authentication accuracy. This implies that functions based on complete string comparison are also suitable, to some extent, for network application recognition.
5 Conclusions and Future Work
In this paper, we propose a new network security technique for spam and hidden network application detection. Our technique is based on known string kernel functions with precomputed kernel matrices. In our implementation of the proposed technique, we used support vector machines with an optimal configuration associated with each kernel function.

This paper makes two major contributions. First, a study of spam detection. Our email classification results demonstrate the excellent ability of string kernel SVMs to separate spam from legitimate emails. Kernel functions using substrings/subsequences are evidently less susceptible to spammers' tricks than functions comparing whole strings. The best classification accuracy was reached for the Header email subset, and this finding shows that the email header has a critical influence on spam classification. Second, detection of hidden network application traffic. Our experimental analysis shows that the first 450 bytes of a TCP connection are sufficient for accurately distinguishing network applications. From the results it is evident that suitable functions for application recognition from TCP connections are the N-gram, Subsequence and Edit Distance functions. Our experiments also show that the best results in network application recognition were obtained with short substring lengths as the function parameter.

For future work, we will continue to develop new string kernels and address further network security tasks, including memory dump testing and network intrusion detection.
References

1. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York (2004)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)
4. Charras, C., Lecroq, T.: Sequence comparison (1998), http://www-igm.univ-mlv.fr/~lecroq/seqcomp/index.html
5. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
6. Fisk, M., Varghese, G.: Applying Fast String Matching to Intrusion Detection (September 2002)
7. Aizerman, A., Braverman, E.M., Rozoner, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
8. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: COLT 1992: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM, New York (1992)
9. Yuan, G.-X., Chang, C.-C., Lin, C.-J.: LIBSVM: libsvm experimental code for string inputs, http://140.112.30.28/~cjlin/libsvmtools/string/libsvm-2.88-string.zip
10. Scarfone, K., Mell, P.: Guide to intrusion detection and prevention systems (IDPS). NIST: National Institute of Standards and Technology (2007), http://csrc.nist.gov/publications/nistpubs/800-94/SP800-94.pdf
11. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
12. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
13. Caswell, B., Beale, J., Foster, J.C., Faircloth, J.: Snort 2.0 Intrusion Detection. Syngress (2003), http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/1931836744
14. Whitman, M.E., Mattord, H.J.: Principles of Information Security. Course Technology Press, Boston (2004)
15. Combs, G., et al.: Wireshark: network protocol analyzer, http://www.wireshark.org/
16. Elson, J.: tcpflow: a program that reconstructs the actual data streams and stores each flow in a separate file for later analysis, http://www.circlemud.org/jelson/software/tcpflow/
17. Bogomolny, A.: Distance Between Strings, http://www.cut-the-knot.org/doyouknow/Strings.shtml
18. SpamAssassin public mail corpus, http://spamassassin.apache.org/publiccorpus/
19. Spam dataset, http://www.em.ca/~bruceg/spam/
20. Lai, C.-C.: An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems 20, 249–254 (2007)