Real-time Traffic Classification Based on Statistical and Payload Content Features
Fereshte Dehghani
Department of Computer Engineering, University of Isfahan, Isfahan, Iran
Nasser Movahhedinia
Mohammad Reza Khayyambashi
Sahar Kianian
Abstract—In modern networks, different applications generate various traffic types with diverse service requirements. The identification and classification of traffic therefore play an important role in improving network management performance. Early applications used well-known transport-layer ports, so their traffic could be classified by port number. Recent applications, however, increasingly use unpredictable port numbers, so later methods are based on “deep packet inspection”. Despite their good accuracy, these methods impose a heavy operational load and cannot handle encrypted flows. More recent methods classify traffic based on statistical packet characteristics. However, in real-time traffic only a small part of the statistical flow information is available, which may degrade the performance of these methods. Considering the advantages and disadvantages of the latter two methods, in this paper we propose an approach based on payload content and statistical traffic characteristics, using the Naïve Bayes algorithm, for real-time network traffic classification. The performance and low complexity of the proposed approach confirm its suitability for real-time traffic classification.
Keywords-intelligent traffic classification; machine learning; payload content features; statistical features
I. INTRODUCTION
Traffic classification is the dynamic identification of network traffic flows and their assignment to different classes. It has become increasingly important because of the growing number and variety of applications [1]. Dynamic classification offers essential advantages for network traffic management [2].
Real-time multimedia flows, such as video and voice, have fairly stable behavior but are sensitive to delay, delay variation and, to some extent, packet loss. Other applications, such as games, exhibit rapidly changing traffic characteristics and are more sensitive to delay. Intelligent traffic control therefore requires prior knowledge of network service types to choose proper strategies, as well as real-time classification that provides the traffic parameters needed to support these decisions [3, 4]. The importance of traffic classification is underscored by its role in solving network problems such as traffic management for Internet Service Providers (ISPs), intrusion detection, Denial of Service (DoS) attack detection and automatic reservation of network resources for customers [5]. The identification and classification of traffic are at the core of Quality of Service (QoS) products and automated QoS architectures. Moreover, traffic classification is useful for producing and studying statistics of network traces [6]. Traffic classification remains a challenging problem because network applications are evolving rapidly while little information about traffic flows is available [7].
A. Traffic classification methods review
Generally, there are three methods for traffic classification. In this section, each method and its advantages and disadvantages are briefly described.
1) Port based traffic classification
Early techniques identified traffic by the transport-layer port number. These methods assume that all applications use predictable well-known port numbers listed by the Internet Assigned Numbers Authority (IANA). However, applications may use non-standard port numbers for four reasons: (1) non-privileged users frequently have to use port numbers above 1024; (2) some applications hide their port numbers to bypass port-based filters; (3) multiple servers may share a single IP address (host); and (4) some applications (e.g. passive FTP) use dynamic, unpredictable port numbers [8].
2) Deep packet inspection
Because of these limitations, more recent traffic classification approaches use payload-based or deep packet inspection techniques. These techniques recognize application types by inspecting the packet payload carried over TCP/UDP [9]. One way of inspecting the packet payload is the signature-based method, which looks for a signature in the application-layer payload. For example, HTTP requests begin with a method, a URL and the protocol version, whereas eDonkey packets frequently begin with fields containing the payload size. Signature-based methods check parts of the payload against the defined signatures of known protocols [10]. These methods therefore rest on two assumptions: (1) third parties, independent of source and destination, can inspect the payload of each packet; and (2) the classifier knows the payload syntax of each application. Unfortunately, the effectiveness of these methods has declined because customers may use encrypted flows, and governments may forbid third parties from inspecting payloads for security reasons. Furthermore, inspecting payload syntax imposes a heavy operational load and delay [6]. Although the outstanding advantage of this technique is its high accuracy, its high complexity makes it unsuitable for real-time classification.
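As a rough illustration of the first two methods, the following Python sketch contrasts a port lookup with a payload-prefix check. The port table reuses the ports listed in Table I of this paper; the signature prefixes, the function names and the data layout are illustrative assumptions, not the rule set of any tool cited here.

```python
from typing import Optional

# Port table taken from Table I of this paper (protocol, port) -> application.
WELL_KNOWN_PORTS = {
    ("tcp", 80): "http",
    ("tcp", 25): "smtp",
    ("tcp", 20): "ftp",
    ("tcp", 22): "ssh",
    ("udp", 53): "dns",
    ("tcp", 110): "pop3",
}

# Hypothetical payload prefixes, chosen only for illustration.
SIGNATURES = {
    "http": (b"GET ", b"POST ", b"HEAD ", b"HTTP/1."),
    "ssh": (b"SSH-",),
}


def classify_by_port(proto: str, src_port: int, dst_port: int) -> Optional[str]:
    """Port-based method: look up either endpoint in the well-known table."""
    for port in (dst_port, src_port):
        app = WELL_KNOWN_PORTS.get((proto, port))
        if app is not None:
            return app
    return None  # non-standard or dynamic port: the method fails


def classify_by_signature(payload: bytes) -> Optional[str]:
    """Signature-based method: match the start of the application payload."""
    for app, prefixes in SIGNATURES.items():
        if payload.startswith(prefixes):
            return app
    return None  # unknown syntax or encrypted payload: the method fails
```

The two `None` branches correspond to the weaknesses discussed above: the port lookup fails for an HTTP server on a non-standard port, and the prefix check fails as soon as the payload is encrypted.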
3) Machine learning and statistical feature
Most modern traffic classification methods use Machine Learning (ML) to classify the statistical patterns present in externally observable features of traffic. As these techniques do not rely on port numbers or protocol characteristics, they do not suffer from the problems of the previous methods [7]. However, in order to distinguish different applications, statistical attributes such as packet inter-arrival times or inter-packet length statistics must be distinctive for each application class [6]. The disadvantage of these methods is that they typically have access to only a few statistical characteristics of a flow, which limits their real-time classification performance.
Considering the advantages and disadvantages of both deep packet inspection and classification using statistical features, in this paper we propose an approach based on flow payload information and statistical traffic features, using the Naïve Bayes algorithm. The rest of this paper is organized as follows. Section II reviews related work in this field. Section III describes classification by ML, the Naïve Bayes algorithm, the feature filtering/selection method, flows and features, and the training and testing datasets. Section IV presents simulation results. Finally, Section V concludes the paper.
II. RELATED WORK
Generally, current ML research results in traffic classification differ in the algorithms, the ML features and the dataset characteristics they use, so their performance cannot be compared directly. ML algorithms were first applied to traffic classification in 2004 [2]. The Nearest Neighbors (NN), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) algorithms are used in [11], where the classification error rate varied with the number of classes to be distinguished. In [12] traffic is classified with the Naïve Bayes algorithm, and Naïve Bayes Kernel Estimation (NBKE) and the Fast Correlation Based Filter (FCBF) are then applied to improve the results. Real-time traffic is classified with the Naïve Bayes algorithm in [4] using only the most recent N packets of a flow. That work was later extended to use statistical features calculated over multiple short sub-flows [13].
Most intelligent traffic classification work uses statistical features only. The payload content (the first n bytes of a data stream) was first used in 2005 [14]. This method is able to classify prevalent encrypted applications such as ssh and https, and it has low computational overhead because it disregards the structure of the flow. Because it avoids this disadvantage of deep packet inspection, it can be used for high-speed real-time classification. A combination of an ML algorithm and flow content information is used in [15]. In [16] the packet payload size is employed as the exclusive feature instead of statistical features; that work deployed three tree-based ML algorithms and achieved 98% accuracy.
To benefit from the advantages of both approaches and improve performance, we use both statistical and packet payload features in this research.
III. TRAFFIC CLASSIFICATION USING BAYESIAN ALGORITHM
A. Traffic classification with Machine Learning
Generally, traffic classification with ML algorithms is performed in the four steps illustrated in Figure 1. The inputs are packets captured online from the network, although they can also be taken from an offline traffic trace. First, the packets are grouped into flows according to source and destination IP addresses, source and destination ports, and protocol. In the second step, as a preliminary to feature extraction, the flow characteristics are computed. To reduce the search space of the ML algorithm when the dataset is very large, data sampling can be applied to obtain a subset of the flow features. These features, together with a model derived from the flow attributes, are passed to the feature filtering/selection step, in which the necessary and important features are selected and the redundant ones are filtered out. Finally, the ML algorithm is run on the output of the previous step. The output of the ML algorithm can be used for result evaluation, QoS mapping, network analysis, etc.

[Figure 1 blocks: Preparing Dataset; Offline Traces; Online Packet Sniffing; Traffic Statistic Computation; Extraction Flows from Packets; Flow Attribute Model; Flow Statistics; Data Sampling; Machine Learning Training Process; Machine Learning Features Filtering/Selection; Machine Learning Algorithm; Classification Results; Results Evaluation; QoS Mapping etc.]
Figure 1. Traffic classification process by Machine Learning

B. Naïve Bayes classifier
This section briefly presents the Naïve Bayes classifier as described in [17]. Naïve Bayes is a supervised ML algorithm based on Bayesian theory. Consider a collection of flows $x_1, \dots, x_n$ such that each flow $x_i$ is described by $m$ discriminators, which can take numeric or discrete values:

$$x_i = \{A_1^{(i)}, A_2^{(i)}, \dots, A_m^{(i)}\} \qquad (1)$$

For example, for Internet traffic, $A_j^{(i)}$ may represent the inter-packet lengths. In this paper, each flow belongs to exactly one of the traffic classes described in Section III-E. Each class $c_j$ in the Naïve Bayes algorithm is represented by a statistical model, so each new flow $y$ is assigned a probability of belonging to a particular class according to Bayes' rule:

$$p(c_j \mid y) = \frac{p(c_j)\, f(y \mid c_j)}{\sum_{c_k} p(c_k)\, f(y \mid c_k)} \qquad (2)$$

where $p(c_j)$ is the probability of obtaining class $c_j$ independently of the observed data, and $f(y \mid c_j)$ is the distribution function of $y$ given $c_j$. The denominator is the probability of the input $y$. The Naïve Bayes algorithm is implemented with Weka, a toolkit of ML algorithms [18].

C. Features filtering/selection
Finding the smallest necessary subset of features is an important step in building a classifier: it removes redundant features, reduces the learning time, increases classification accuracy and identifies the most relevant features [19]. Generally, feature selection algorithms are divided into filter and wrapper methods. Filter algorithms select the optimal features based on characteristics of the data, independently of the chosen classification algorithm; they therefore evaluate feature subsets against specific criteria before learning begins. Wrapper algorithms estimate the effectiveness of different feature subsets using the selected ML algorithm, so their results depend on that algorithm [6].
We use the Sequential Forward Selection (SFS) method, which is a wrapper method. SFS starts by evaluating each attribute on its own and places the one with the best result in a selected attribute list, SEL(1). In the next step, all combinations of SEL(1) with a second attribute not already in SEL(1) are examined, and the best combination is placed in SEL(2). The process stops when the new best combination no longer improves the performance of the ML algorithm. The attributes in the final list are taken as the best attributes [8].

D. Flows and features
Traffic flows are bidirectional packet streams between two hosts. Each flow is a series of packets sharing the same five-tuple: source and destination IP addresses, source and destination ports, and protocol number. The statistical characteristics of a flow differ between the forward and backward directions, so the direction of each flow must be taken into account during feature extraction. Backward flows, which have a well-known source port, are converted to forward flows by exchanging the destination port (and destination IP address) with the source port (and source IP address). This lets the classifier learn and classify independently of flow direction. Finally, features are extracted from the resulting flows. The features used in this paper are divided into packet statistical features and payload-based features:
• Packet statistical features: (1) packet length statistics (min, max, mean, standard deviation); (2) inter-arrival time statistics (min, max, mean, standard deviation).
• Payload-based features: (1) packet payload size statistics (min, max, mean, standard deviation); (2) byte encoding of the first n bytes of payload.
The features are acquired and extracted with Tcptrace [20] and Wireshark [21] on the Linux operating system.
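The flow construction and feature extraction of Section III-D can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Tcptrace/Wireshark pipeline used in the paper: the `Packet` record, the `WELL_KNOWN` port set, the zero padding of short payloads and the reading of "byte encoding" as raw byte values are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Packet(NamedTuple):  # hypothetical packet record
    ts: float              # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    length: int            # total packet length in bytes
    payload: bytes         # application payload


WELL_KNOWN = {20, 22, 25, 53, 80, 110}  # assumption: ports of Table I

FlowKey = Tuple[str, int, str, int, str]


def flow_key(p: Packet) -> FlowKey:
    """Five-tuple key; backward packets (well-known source port) are swapped."""
    if p.src_port in WELL_KNOWN and p.dst_port not in WELL_KNOWN:
        return (p.dst_ip, p.dst_port, p.src_ip, p.src_port, p.proto)
    return (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)


def summary(values: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (min, max, mean, standard deviation); zeros for an empty list."""
    if not values:
        return (0.0, 0.0, 0.0, 0.0)
    return (min(values), max(values), mean(values), pstdev(values))


def extract_features(packets: List[Packet], n: int = 4) -> Dict[FlowKey, list]:
    """Group packets into flows and compute the feature vector of each flow."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for p in sorted(packets, key=lambda q: q.ts):
        flows[flow_key(p)].append(p)

    features = {}
    for key, pkts in flows.items():
        lengths = [p.length for p in pkts]
        gaps = [b.ts - a.ts for a, b in zip(pkts, pkts[1:])]
        sizes = [len(p.payload) for p in pkts]
        stream = b"".join(p.payload for p in pkts)[:n]
        first_bytes = list(stream) + [0] * (n - len(stream))  # zero padding
        features[key] = [*summary(lengths), *summary(gaps),
                         *summary(sizes), *first_bytes]
    return features
```

Each flow yields 12 statistical values followed by n byte values, so both directions of a connection contribute to a single feature vector regardless of which side sent the first captured packet.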
E. Training and testing datasets
The dataset is prepared from the public traces gathered by the MAWI Working Group [22]. The training and testing datasets are independent of each other for performance evaluation. The training dataset was captured on December 29, 2000 over 1391.77 seconds, and the testing dataset was captured on December 28, 2000 over 725.82 seconds. The traffic types and the number of packets used are presented in Table I.

TABLE I. TRAFFIC CLASSIFICATION AND NUMBER OF PACKETS IN TRAINING AND TESTING TIME
Application   Ports    Training set   Testing set
http          tcp80    822854         538557
smtp          tcp25    48873          47101
ftp           tcp20    266148         72506
ssh           tcp22    1746           1923
dns           udp53    546            642
pop3          tcp110   1833           4581

The public MAWI traces used in this paper are anonymized for privacy reasons, and the true applications that generated the flows cannot be inferred. Therefore, the port-based method is used to determine the application type of each flow. This may introduce errors into the flow labels, since port-based identification is increasingly ineffective. However, the percentage of mislabeled instances is low because the majority of the traffic used belongs to the default application of each port. Moreover, a few inaccurate values in a dataset are a common problem, and an appropriate ML algorithm must be able to cope with them given the large amount of input data.

IV. RESULTS
To evaluate classifier performance, two metrics, Recall and Precision, are used. Recall is the percentage of class X members that are correctly classified as belonging to class X. Precision is the percentage of instances that truly belong to class X among all instances classified as class X. The classification results are shown in Table II. The http and ftp applications achieved very good results. However, because the amount of DNS data was insufficient, this application obtained the lowest precision of all applications. Overall, the results are promising, and the approach is suitable for real-time traffic classification owing to the simplicity of the features used.

TABLE II. RESULT OF TRAFFIC CLASSIFICATION BY NAÏVE BAYES ALGORITHM
            http    smtp    ftp     ssh     pop3    dns
Recall      93.1%   87.6%   91.6%   84.1%   82.9%   83.2%
Precision   91%     82.4%   90.1%   81%     82%     80%

V. CONCLUSIONS
The aim of traffic classification is to identify traffic flows and cluster them into groups with the same traffic patterns. Key requirements for real-time traffic classification are high Precision and Recall together with low time and memory complexity. The contribution of this paper is the combination of statistical and payload-based features as attributes. The features used for both methods are simple and have low time and memory complexity for extraction, which makes the technique suitable for real-time traffic classification. The obtained results also show that the two feature types combine well.

ACKNOWLEDGMENT
The authors would like to thank Abbas Akbari for his useful discussions and help with flow extraction and feature selection.

REFERENCES
[1] F. Qian, G. Hu, and X. Yao, “Semi-supervised internet network traffic classification using a Gaussian mixture model,” AEU - International Journal of Electronics and Communications, vol. 62, pp. 557-564, 2008.
[2] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow clustering using machine learning techniques,” Lecture Notes in Computer Science, vol. 3015, pp. 205-214, 2004.
[3] W. Li and A. W. Moore, “Learning for accurate classification of real-time traffic,” 2006.
[4] T. T. T. Nguyen and G. Armitage, “Training on multiple sub-flows to optimise the use of machine learning classifiers in real-world IP networks,” 2006, pp. 369-376.
[5] T. T. T. Nguyen and G. Armitage, “Clustering to assist supervised machine learning for real-time IP traffic classification,” 2008.
[6] T. T. T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Communications Surveys and Tutorials, vol. 10, pp. 56-76, 2008.
[7] J. Erman, A. Mahanti, M. Arlitt, and C. Williamson, “Identifying and discriminating between web and peer-to-peer traffic in the network core,” 2007, p. 892.
[8] S. Zander, T. Nguyen, and G. Armitage, “Self-learning IP traffic classification based on statistical flow characteristics,” 2005, p. 83.
[9] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of p2p traffic using application signatures,” 2004, pp. 512-521.
[10] F. Risso, M. Baldi, O. Morandi, A. Baldini, and P. Monclus, “Lightweight, payload-based traffic classification: An experimental evaluation,” 2008, pp. 5869-5875.
[11] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification,” 2004, pp. 135-148.
[12] A. W. Moore and D. Zuev, “Internet traffic classification using Bayesian analysis techniques,” ACM SIGMETRICS Performance Evaluation Review, vol. 33, pp. 50-60, 2005.
[13] T. Nguyen and G. Armitage, “Synthetic sub-flow pairs for timely and stable IP traffic identification,” 2006.
[14] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: automated construction of application signatures,” 2005, p. 202.
[15] J. Ma, K. Levchenko, C. Kreibich, S. Savage, and G. M. Voelker, “Unexpected means of protocol inference,” 2006, pp. 313-326.
[16] Y. Wang and R. Nelson, “Identifying Network Application Layer Protocol with Machine Learning.”
[17] I. H. Witten and E. Frank, “Data Mining,” 2000.
[18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, 2009.
[19] Y. Wang and S. Z. Yu, “Supervised Learning Real-time Traffic Classifiers,” Journal of Networks, vol. 4, 2009.
[20] S. Ostermann, “tcptrace.”
[21] G. Combs, “Wireshark-network protocol analyzer,” Current July, 2008.
[22] MAWI Working Group, “http://mawi.wide.ad.jp/mawi/,” Sept. 15, 2008.