Demystifying Internet Traffic Identification, or: why can't I know what really goes on my network?

Arthur Callado
Federal University of Ceará
Quixadá, Brazil
[email protected]

Eliane Bischoff
University of Brasília
[email protected]

Carlos Alberto Kamienski
Federal University of ABC
[email protected]

Judith Kelner, Djamel Sadok
Federal University of Pernambuco
{jk, jamel}@cin.ufpe.br
Abstract—Available literature on Internet traffic identification is bursting with significant contributions. Regrettably, the careful reader needs to sift through the published material, as not all results are sufficiently validated, some cannot be taken for granted, and many suffer from important limitations. This work classifies such shortcomings into four categories: careless measurement (unreal traces), unreliable reference identification, excessive filtering, and the use of wrong or too few metrics. This paper details the problems that affect many recent works in the literature. Further, it summarizes our experience with the implementation and evaluation of diverse identification algorithms on different traffic traces, showing that none of the most cited algorithms is systematically better than the others.
Keywords—traffic identification; machine learning
I. INTRODUCTION
The identification of the applications generating users' traffic is very important for network managers of both access and backbone networks. The interests in identifying traffic applications include blocking unwanted traffic, anomaly detection (network attacks, defective equipment), per-application charging, per-application Quality of Service (QoS) or traffic engineering, the analysis of traffic trends for capacity planning, and the offering of value-added services based on user utilization and service popularity. Initially, network designers proposed the use of the well-known ports (http://www.iana.org/assignments/port-numbers) observed at the transport layer to identify the application generating the traffic. However, in the last decade many network applications started to avoid identification by simply not respecting the well-known ports, which are therefore no longer reliable for that purpose [1][2]. This behavior of rogue applications made it harder for network administrators to identify traffic, especially unwanted applications, creating a new challenge in network management. Presently, most papers in the literature address the problem of traffic identification using either payload signatures [3] or traffic behavior analysis [4][5][6][7][2][8]. After carefully studying [9] and implementing some of these techniques, we observed that it is hard to achieve the level of accuracy advertised in those papers. A deeper analysis of the literature showed that all works suffer from one or more of the following limitations:
1) Unreal traffic: the traffic analyzed (usually a previously captured trace) is either very small or comes from a very specific network;
2) Changed/filtered traces: the trace is filtered to contain only a subset of the traffic, such as only TCP, only TCP with complete flows, or only traffic recognized by a payload-based signature matching algorithm;
3) Non-dependable reference: to evaluate algorithms, traffic must be previously identified using a trustable method, either one that assures traffic is correctly classified (such as manual identification by a specialist or capturing traffic on controlled client machines) or one that was proven correct. Frequently [4][6][8], a non-validated payload analyzer is used to create that reference;
4) Wrong or few metrics: the most relevant metrics [6] are completeness (how much of an application is identified) and accuracy (how correct the identification is). Using only one of these metrics [3] is not enough to evaluate an identification algorithm, and other metrics (e.g., false negatives) are less meaningful and should not be used.
This paper focuses on the correction of these weaknesses and, summarizing the authors' experience on the subject, details the pitfalls that must be avoided for proper traffic identification on real networks. This work performed a thorough evaluation of many algorithms with varying configurations, using four different traces.
The remainder of this paper is organized as follows. Section II reviews the literature, exposing the shortcomings of previous work. Section III explains our methodology and Section IV compares the identification results obtained. Finally, Section V discusses these results, while Section VI draws final conclusions.
II. STATE-OF-THE-ART
A trustable classification reference (baseline) for evaluating the quality of any prediction based on inference is an important step in this area. This baseline may be created by:
• Injecting a synthetic trace whose flows, derived from real captures, are previously identified or known;
• Undertaking a hand-made identification (normally unfeasible and unreliable); or
• Applying a trustable packet analyzer (validated with one of the previous methods).
After obtaining a trustable reference, one must select the metrics used for the evaluation of the identification methods. The most used inference quality metrics are completeness and accuracy, first described for application identification in [6]. Completeness is calculated as [9]:
Completeness = DA / OA    (1)
where DA is the amount of traffic (flows, packets or bytes) detected as belonging to a given application and OA is the amount of traffic originally belonging to that application in the trace. Please notice that an identification algorithm can (wrongly) detect more traffic than actually belongs to a particular application, i.e., per-application completeness may yield values greater than 100%. Accuracy is calculated as:
Accuracy = CDA / DA    (2)
where CDA is the amount of traffic correctly detected for that application. Both completeness and accuracy may be computed over flow, packet and byte counts (with different results!) and are complementary, so reporting just one of them is not enough. Only by combining both metrics is it possible to understand the performance of an identification technique.
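To make the two metrics concrete, the sketch below (an editor illustration, not code from this study) computes per-application completeness and accuracy over both flow and byte counts from (baseline label, predicted label, byte count) records; the application names in the usage example are hypothetical.

```python
# Minimal sketch of per-application completeness and accuracy (Eqs. 1 and 2),
# computed for flow and byte counts from a labeled baseline.
from collections import defaultdict

def evaluate(flows):
    """flows: iterable of (baseline_app, predicted_app, byte_count) tuples."""
    oa = defaultdict(lambda: {"flows": 0, "bytes": 0})   # original (baseline) per app
    da = defaultdict(lambda: {"flows": 0, "bytes": 0})   # detected per app
    cda = defaultdict(lambda: {"flows": 0, "bytes": 0})  # correctly detected per app

    for baseline_app, predicted_app, nbytes in flows:
        oa[baseline_app]["flows"] += 1
        oa[baseline_app]["bytes"] += nbytes
        da[predicted_app]["flows"] += 1
        da[predicted_app]["bytes"] += nbytes
        if predicted_app == baseline_app:
            cda[predicted_app]["flows"] += 1
            cda[predicted_app]["bytes"] += nbytes

    results = {}
    for app in oa:
        results[app] = {
            unit: {
                # Completeness = DA / OA; may exceed 100% if traffic is over-attributed.
                "completeness": 100.0 * da[app][unit] / oa[app][unit] if oa[app][unit] else None,
                # Accuracy = CDA / DA.
                "accuracy": 100.0 * cda[app][unit] / da[app][unit] if da[app][unit] else None,
            }
            for unit in ("flows", "bytes")
        }
    return results

# Example with three hypothetical flows: P2P completeness exceeds 100% in flows
# because one WEB flow was wrongly attributed to P2P.
print(evaluate([("P2P", "P2P", 10_000), ("WEB", "P2P", 2_000), ("WEB", "WEB", 5_000)]))
```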
A. Baseline Building
Traditional methods for creating a trustable baseline for evaluating identification algorithms are manual identification, payload identification and active measurement. Manual identification is a difficult task even for an expert. Most packets do not carry human-readable information and therefore this technique is not expected to provide better results than a payload signature analyzer.
Payload identification presents itself as an inherently error-prone identification method, since its results heavily depend on the signatures utilized and on the type of traffic captured. Its usage for testing another identification algorithm requires the payload-based algorithm itself to be previously evaluated.
Active measurement is an adequate method for creating a baseline. By generating each traffic flow independently and aggregating all flows into a synthetic aggregate trace, this method provides stringent guarantees of identification correctness. However, the traffic generated must be compatible with the network studied, both in application type composition and in application diversity. Synthetic traffic generators should be avoided, since they usually do not completely reproduce application behavior, especially when it comes to payload generation and application diversity. Szabó [10] suggested a novel approach to baseline building using active measurement: a program is installed on client machines to mark traffic according to the generating applications. Despite the potential improvements of this approach, it has drawbacks: limited scalability, collisions (due to hashing the application name into two bytes), simple evasion (renaming an application) and insufficiency (knowing the application is not enough, because the same application may be responsible for different types of traffic).
B. Payload Identification
Complete protocol parsing is often considered the most accurate identification method [10], despite its inability to deal with encrypted traffic. One way to make this identification less resource-consuming is to use a heuristic algorithm that searches for specific byte patterns called payload signatures [3]. Some vendors provide network equipment capable of identifying traffic based on payload signatures. However, maintaining good signatures is a continuous process, since application protocols evolve and new applications appear daily.
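As an illustration of the idea (not the signature set used in this work nor l7-filter's actual rules), the sketch below matches a few simplified byte patterns against packet payloads; real signature sets are far larger and need continuous maintenance.

```python
# Simplified sketch of payload-signature matching in the spirit of l7-filter-style
# classifiers. The regular expressions are illustrative only.
import re

SIGNATURES = {
    # HTTP request line at the start of the payload (illustrative).
    "WEB": re.compile(rb"^(GET|POST|HEAD|PUT) [^ ]+ HTTP/1\.[01]"),
    # BitTorrent handshake: length byte 19 followed by the protocol string.
    "P2P": re.compile(rb"^\x13BitTorrent protocol"),
    # SMTP server greeting (illustrative).
    "MAIL": re.compile(rb"^220 [^\r\n]*\r\n"),
}

def classify_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return "Unknown"

print(classify_payload(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"))  # WEB
print(classify_payload(b"\x13BitTorrent protocol" + b"\x00" * 8))              # P2P
```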
C. Statistical Identification
To decrease the complexity of traffic identification, some heuristics based on statistics have been proposed. Two common approaches are the use of statistical signatures and flow clustering [4], using either flow information (addresses, ports, protocols, flow size, duration) or packet information (minimum/maximum/average packet sizes [5], congestion window, TCP flags). To evaluate three different clustering techniques, Erman [4] used two traces (ignoring UDP packets and TCP flows without SYN/FIN), using port numbers to establish the baseline for the first trace and a non-validated payload identification for the baseline of the second trace. For the analyzed set all algorithms converge, but it is known that clustering sometimes does not converge to a result, especially when using large datasets. Results are presented only in terms of flow percentage accuracy. Based on packet headers, Bernaille [5] performs traffic identification using only the sizes of the first 5 packets in a bidirectional flow. Only TCP flows are considered, and a non-validated payload analysis is performed to create the baseline. The identification is performed by clustering the sizes of the first 5 packets of a flow. Results are presented only in terms of flow identification accuracy.
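The sketch below illustrates the general idea of clustering flows by early packet sizes and labeling each cluster with the dominant baseline application; it is an editor illustration with made-up feature values and a plain k-means clusterer, not Bernaille et al.'s actual implementation.

```python
# Minimal sketch: cluster flows by the sizes of their first 5 packets, label clusters
# by majority baseline application, then classify new flows by their cluster.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

# Each row: sizes (bytes) of the first 5 packets of a flow; labels come from a baseline.
X_train = np.array([
    [120, 1460, 1460, 1460, 800],   # bulk transfer pattern (e.g., WEB)
    [130, 1460, 1460, 1200, 900],
    [60, 68, 68, 72, 68],           # chatty signaling pattern (e.g., P2P control)
    [64, 70, 66, 74, 70],
])
y_train = ["WEB", "WEB", "P2P", "P2P"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# Label each cluster with the most frequent baseline application among its members.
cluster_labels = {}
for cluster_id in range(kmeans.n_clusters):
    members = [y for y, c in zip(y_train, kmeans.labels_) if c == cluster_id]
    cluster_labels[cluster_id] = Counter(members).most_common(1)[0][0]

# Classify a new flow by the cluster of its first-5-packet sizes.
new_flow = np.array([[125, 1460, 1400, 1460, 850]])
print(cluster_labels[int(kmeans.predict(new_flow)[0])])  # expected: WEB
```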
D. Behavioral Study
Karagiannis et al. [6] propose the use of structures called graphlets to describe behavior signatures for applications. The identification algorithm, called BLINC, analyzes these structures at the social level (how many nodes a node connects to), the functional level (server/client) and the application level (the relationship among all fields of information). It uses a non-validated payload identification method to construct its baseline. Results are presented in terms of flow (but not byte) completeness and accuracy. We tried to reproduce their results, but our implementation of BLINC was not successful [13], even after months of hard work and contact with the authors.
In another attempt to identify applications by behavioral analysis, Xu et al. [7] propose the use of application profiling, through four steps: data preprocessing, significant cluster extraction, classification of cluster behavior and interpretation of behavior classes. This technique does not rely on volume information and therefore does not need to wait for flows to finish. However, no baseline is used and only a few examples of traffic behavior are compared to the port-based method.
E. Machine Learning
Machine learning algorithms have also been used for application identification [2][4][8]. Instead of signatures, a set of example traffic from the same network, called the training set, is used to train the selected machine learning method, and the remaining traffic is identified based on these examples. Relevant choices when using machine learning include the algorithm itself, the method for selecting samples, and the size (number of samples) of the training set. According to the literature [8][2], the most promising algorithms for traffic identification using machine learning are Bayesian Networks and Support Vector Machines (SVM). Moore and Zuev [2] use packet behavior statistics to characterize a flow and employ four different types of Bayesian network algorithms. Their baseline is built by manual identification, excluding UDP flows and incomplete flows, and the evaluation is performed using n-fold cross-validation: the classified data is divided into n parts, each part is used once for training while the remaining n-1 parts are used for testing, and the results are the aggregate of the n evaluations. Two metrics are presented: correctness (the count of correctly identified flows and bytes divided by the total) and trust (per-class accuracy). Another algorithm is NBTree [11], a hybrid of the Naïve Bayes classifier that uses a decision tree with Naïve Bayes classifiers on the leaf nodes. This allows some specialization of the classifier's configuration according to groups in the training set. Li [8] studies the results of bidirectional flow identification with the SVM method, using both flow and packet information collected on a campus network. Their baseline is created using non-validated payload signatures and results are shown in terms of per-class false positives and false negatives, for flow (but not byte) counts.
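For readers unfamiliar with this workflow, the sketch below shows flow classification from per-flow features with cross-validation; it is an editor illustration with made-up features and scikit-learn classifiers, not the Weka/NBTree setup evaluated later in this paper.

```python
# Minimal sketch of machine-learning-based flow classification with cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Per-flow features: [duration (s), packets, bytes, mean packet size, src port, dst port]
X = np.array([
    [12.0, 300, 420000, 1400, 51321, 80],
    [9.5, 250, 350000, 1400, 49152, 80],
    [300.0, 80, 9000, 112, 51413, 6881],
    [250.0, 60, 7000, 117, 6881, 51413],
    [0.5, 10, 900, 90, 5060, 5060],
    [0.7, 12, 1100, 92, 5060, 5060],
])
y = np.array(["WEB", "WEB", "P2P", "P2P", "VoIP", "VoIP"])

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    # 2-fold cross-validation over the (tiny) labeled set; scores are per-flow accuracy,
    # i.e., they say nothing about byte-level accuracy.
    scores = cross_val_score(clf, X, y, cv=2)
    print(f"{name}: mean flow accuracy = {scores.mean():.2f}")
```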
III. CASE STUDIES
Most techniques presented in Section II yield only reasonable results, which are not good enough for network operators (who demand 90% accuracy or more, according to our interviews). Therefore, we decided to investigate the algorithms themselves. Our first step was to implement a signature-based tool, called Analyser-PX, to be used as the baseline, since this is the most common reference [4][6]. It is able to recognize the most popular applications in different classes: Web, P2P, Games, Chat, Mail, Network Management, Streaming, Secure/SSL and VoIP. For validation, we performed many captures of single-application traces, identified by application process ID, and merged those traces into a full, aggregated traffic trace.
The first trace was based on active measurement (traffic explicitly generated), merging a set of individual application traces collected from 2004/May/13 to 2008/Sep/04. It includes flows from Web, P2P (BitTorrent and eMule), Streaming (YouTube), Mail (SMTP) and VoIP (Skype) traffic. The longest flow in the trace lasts 22 hours. The total volume of the trace is 144 MB, split among 13,877 flows. Despite the insight that evaluating this trace provides, the authors felt that a passive capture of real network traffic was also very important.
Our second trace was captured at the main router of a university laboratory with more than 100 machines on 2007/Aug/30. The total volume is 7.2 GB, split among 135,169 flows.
The third trace is a packet trace captured in a Brazilian academic backbone, on the link that separates the universities (with more than 3,000 active machines) of a Point-of-Presence (PoP) from the backbone network itself. The capture was performed on 2006/Nov/22 and lasted 71 minutes. The total volume gathered is 86 GB, split among 6.5 M flows.
Finally, the fourth trace comes from a commercial backbone (with more than 1,000,000 broadband clients) in Brazil. This capture was performed on a core 1 Gbps network link. Though the link is bidirectional, due to the use of dynamic routing one of the directions was mostly unused during the capture. This capture was performed on 2008/May/27 and lasted 107 minutes. The total volume is 36 GB, split among 2.9 M flows.
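To illustrate how a labeled baseline can be assembled from single-application captures, the sketch below merges hypothetical per-application flow exports and tags every flow with its known application; the file names, column names and output format are assumptions, not the authors' actual tooling.

```python
# Minimal sketch of building a labeled baseline from single-application captures.
# Each input CSV is assumed to hold flow records exported from a controlled capture
# in which the generating application is known.
import csv

# Hypothetical per-application flow exports (one file per controlled capture).
CAPTURES = {
    "WEB": "web_flows.csv",
    "P2P": "bittorrent_flows.csv",
    "VoIP": "skype_flows.csv",
}

def build_baseline(output_path: str) -> None:
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["src_ip", "dst_ip", "src_port", "dst_port", "proto",
                         "bytes", "application"])
        for application, path in CAPTURES.items():
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    # Every flow in a single-application capture inherits that
                    # application as its ground-truth label.
                    writer.writerow([row["src_ip"], row["dst_ip"], row["src_port"],
                                     row["dst_port"], row["proto"], row["bytes"],
                                     application])

if __name__ == "__main__":
    build_baseline("baseline_flows.csv")
```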
IV. RESULTS
For the active measurement, the baseline was derived directly from the individual application captures used. For the other traces, the payload identification, validated on the first trace, was used to create the baseline. The baseline composition of the traces is shown in Table 1.

Table 1. Baseline composition for (a) active measurement, (b) laboratory measurement, (c) academic backbone measurement and (d) commercial backbone measurement

                 (a) Active         (b) Laboratory     (c) Academic           (d) Commercial
                 measurement        measurement        backbone meas.         backbone meas.
Application      Flow %    Byte %   Flow %    Byte %   Flow %     Byte %      Flow %    Byte %
Unknown          -         -        14.87     0.87     18.46      7.43        16.16     8.43
P2P              83.06     74.64    6.47      89.78    5.62       63.08       0.69      16.79
WEB              14.12     21.83    9.92      2.73     21.96      14.73       42.99     30.56
Chat             -         -        0.038     0.012    0.4        0.85        0.15      1.95
Games            -         -        -         -        0.000061   < 0.000001  0.004     0.12
Management       -         -        6.61      0.18     6.61       0.18        8.78      0.39
Streaming        0.014     0.58     0.13      5.56     0.13       5.56        0.74      19.95
Mail             1.11      0.28     2.22      1.23     2.22       1.23        1.66      6.25
Bulk Transfer    -         -        0.23      0.12     0.23       0.12        0.36      0.17
VoIP             1.70      2.67     9.71      0.56     9.71       0.56        8.69      1.14
Secure/SSL       -         -        1.46      5.18     1.46       5.18        3.62      6.91
No-Payload       -         -        33.19     1.09     33.19      1.09        16.17     7.33
We used the Weka tool [12], which defines a common data format for analyses with different algorithms, to evaluate different machine learning algorithms. Three different methods of selecting the training set were tested [13], but the proposed method based on clustering systematically outperformed the other methods, and therefore only its results are shown here due to space constraints (for the others, see [13]). In a preliminary evaluation we compared the results of 66 identification algorithms implemented in Weka and chose the best 3 algorithms (NBTree, PART and J48), along with the algorithms most commonly used in the literature: Bayesian Networks with Simple Estimator (BayesNet), Bayesian Networks with Kernel Estimator (Bayes Kernel) and Support Vector Machines (SVM). Results are shown in Table 2.

Table 2. Algorithm evaluation (completeness and accuracy, in %) for (a) active measurement, (b) laboratory measurement, (c) academic backbone measurement and (d) commercial backbone measurement

                   (a) Active measurement        (b) Laboratory measurement    (c) Academic backbone         (d) Commercial backbone
                   Flow %         Byte %         Flow %         Byte %         Flow %         Byte %         Flow %         Byte %
Algorithm          Compl.  Acc.   Compl.  Acc.   Compl.  Acc.   Compl.  Acc.   Compl.  Acc.   Compl.  Acc.   Compl.  Acc.   Compl.  Acc.
Payload Signature  27.67   73.75  98.40   98.24  -       -      -       -      -       -      -       -      -       -      -       -
NBTree             100     92.84  100     98.97  94.52   68.82  99.83   94.01  62.49   27.92  96.09   74.78  10.65   25.96  74.12   59.33
PART               100     57.51  100     98.18  82.13   73.81  99.69   93.39  72.99   38.58  98.58   79.26  72.29   59.3   89.38   66.03
J48                100     56.67  100     98.71  92.45   60.69  99.66   94.28  49.13   34.54  94.21   79.53  71.05   60.86  92.77   65.27
BayesNet           100     95.31  100     96.09  95.55   66.1   99.91   73.5   93.2    55.6   99.57   72.32  11.93   23.9   75.79   56.8
Bayes Kernel       100     3.09   100     77.3   99.53   53.14  99.89   86.31  71.87   41.16  99.62   59.5   77.65   12.06  99.16   35.7
SVM                100     96.53  100     98.3   96.49   60.52  98.97   94.17  82.44   41.19  95.8    76.34  76.45   39.75  88.24   57.56

A. Active Single-Machine Measurement
While the use of active measurement allows the creation of a baseline with 100% certainty, the composition of this trace is far from what a real network would show. The composition, shown in Table 1 (a), results from the selection of the individual application traces that were merged. Since the baseline was created from known applications, it is considered trustable. Using it, the identification algorithms were run and their metrics computed against the baseline, as shown in Table 2 (a). From the results, we observed that all machine learning algorithms achieved 100% completeness in both flows and bytes, because there is no unknown class and therefore no flow is left unidentified. Although the payload signature algorithm did not identify a high percentage of the flows, it correctly identified a high percentage of the traffic volume (in bytes). We attribute this to the characteristics of the P2P traffic (dominant in this trace), where many control connections are created for a single download and only the data connections (with a higher data volume) are well identified, since control connections have different signatures. The same observation also applies to the machine learning algorithms, with the exception of Bayes Kernel.
B. Passive Laboratory Measurement
Due to the size of the trace, manual identification of the 135,169 observed flows was not feasible. It would also be error-prone, since most network protocols do not provide human-readable information that can be used to differentiate applications. Therefore, given the high completeness and accuracy of the payload signature algorithm for byte identification in the previous scenario, we decided to use it as the baseline for the remaining traces. According to this baseline, the composition of the trace is shown in Table 1 (b).
The payload identification algorithm failed to identify only 0.87% of the traffic volume in bytes, shown in Table 1 (b) as the "Unknown" class. Using this baseline, the traffic identification algorithms performed worse than in the previous scenario, as seen in Table 2 (b). However, the results for most algorithms are still adequate in terms of byte volume. The high number of VoIP flows is explained by the control messages exchanged by Skype (the official communicator in this laboratory).
C. Passive Academic Backbone Measurement
Using the same method as in the previous trace, payload signature identification was used to build the baseline. The traffic composition is shown in Table 1 (c). Again, it is possible to observe a significant increase in the unknown byte volume (7.43%), while unknown flows amount to 18.46% of the total. This is clearly a limitation of the payload identification implementation, since many protocols are proprietary and satisfactory traffic signatures for them may not be easily found. However, it is not a limitation of the technique per se, since it is always possible to add new signatures. The identification results of all the machine learning algorithms, using the payload identification as the comparison baseline, are shown in Table 2 (c). They show a systematic decrease in performance, attributed to the greater range of individual applications and users. Overall, identification is significantly more trustworthy in terms of byte volume percentage than in flow percentage. Here, a network manager would consider the flow identification results unusable for traffic filtering, while byte volume identification would still serve its purposes, especially capacity planning.
D. Passive Commercial Backbone Measurement
The composition of this trace according to the payload signature identification baseline is shown in Table 1 (d). Interestingly, when compared to the academic network, more flows are identified by the payload signatures, but a lower volume of traffic is identified. The results of the machine learning algorithms based on the payload identification baseline are shown in Table 2 (d). A considerable decrease can be observed in the total completeness and accuracy for most cases compared to the academic network, because 99% of the flows are unidirectional. This is due to dynamic routing within a backbone network: sometimes a forward path in the Internet does not share the same links as the reverse path. In this case, the studied router was selected for one path while the reverse path used an alternative router. The remaining traffic (less than 1%) consisted of link-local routing protocol flows. Therefore, only one direction of the traffic was observed in this trace, which may have affected the detection of disguised traffic. For example, P2P traffic that disguises itself as HTTP may be classified as HTTP because the telltale message travels in the unobserved reverse direction. We decided to investigate this unidirectional capture problem with an additional analysis, using only the incoming part of the traffic from the academic network trace. The results (not shown here due to lack of space) were very similar to those from the commercial network, showing the problems of feeding an incomplete (unidirectional) capture to an identification algorithm.
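As a practical aside (an editor illustration, not part of the original analysis), the share of unidirectional flows in a capture can be estimated by checking whether the reverse 5-tuple of each flow also appears in the trace; the flow records below are hypothetical.

```python
# Minimal sketch: a flow is considered bidirectional only if the reverse 5-tuple
# also appears in the capture.
def unidirectional_ratio(flows):
    """flows: iterable of (src_ip, dst_ip, src_port, dst_port, proto) tuples."""
    seen = set(flows)
    unidirectional = sum(
        1 for (src, dst, sport, dport, proto) in seen
        if (dst, src, dport, sport, proto) not in seen
    )
    return unidirectional / len(seen) if seen else 0.0

example = [
    ("10.0.0.1", "192.0.2.8", 51321, 80, "TCP"),
    ("192.0.2.8", "10.0.0.1", 80, 51321, "TCP"),       # reverse direction present
    ("10.0.0.2", "198.51.100.7", 6881, 51413, "TCP"),  # reverse direction missing
]
print(f"{unidirectional_ratio(example):.0%} of flows are unidirectional")
```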
V. DISCUSSION AND RECOMMENDATIONS
From our results, we learned that evaluating active measurements or measurements on small-to-medium access networks is not enough to assume a similar identification accuracy in larger networks. In broader terms, experimenting with traffic identification algorithms on one trace is not enough to estimate their accuracy on different networks. This is evident in our results, where each scenario had a different best algorithm in bytes or flows (no algorithm was the best in the majority of cases).
We noticed that all machine learning algorithms try to maximize the accuracy of their identification over the number of samples (flows). Larger training sets provide better accuracy in element prediction (number of flows), with unpredictable effects on the number of bytes [13]. Since larger flows have a higher importance in traffic prediction, this lack of control is a serious issue.
Another important aspect of identification is the selection of the training set. Many papers do not address this aspect, while others simply state that the training set was selected at random from a different trace. Since all of our identification runs, except those that used clustering for training set selection, performed poorly or unstably, we recommend that any traffic identification method based on machine learning apply clustering to a trace from the same network and then select one flow from each cluster for the training set (a sketch of this procedure is given at the end of this section).
Another issue is the excessive trust placed in payload signature analyzers. Several papers [4][6][8] simply use payload signatures for training and refer to l7-filter¹ to demonstrate acceptance by the Internet community, without assessing the identification performance of the signatures. We used some l7-filter signatures along with signatures we developed in order to improve the quality of our baselines. We believe that updating and validating signatures (for example, against a baseline built with active measurements, as shown in this paper) must be a continuous process.
Lastly, it is important to select adequate metrics for evaluating and comparing different algorithms. The literature shows that completeness and accuracy are both important metrics, and using only one of them is not enough to decide which algorithm actually performed better in a particular scenario.
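The sketch below illustrates the recommended training-set selection procedure: cluster flows from the target network and pick one representative flow per cluster. It is an editor illustration with made-up feature values and a plain k-means clusterer, not the authors' implementation.

```python
# Minimal sketch of clustering-based training set selection: one representative flow
# per cluster, chosen as the flow closest to the cluster centre.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Per-flow feature vectors from the network to be identified (hypothetical values):
# duration (s), packets, bytes.
flows = np.array([
    [12.0, 300, 420000],
    [11.0, 280, 400000],
    [300.0, 80, 9000],
    [250.0, 60, 7000],
    [0.5, 10, 900],
    [0.6, 12, 1100],
])

n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flows)

# For each cluster, pick the flow closest to its centre as the training sample.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, flows)
training_set = flows[closest]
print("Indices of selected training flows:", sorted(closest.tolist()))
```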
VI. CONCLUSIONS
The main challenge in traffic identification is that the performance evaluation of a particular algorithm on one network should not be considered a reasonable estimate for another network. Mechanisms and techniques for adapting identification methodologies to achieve better results across different networks are still missing. The authors believe that the recommendations given here form important practices for the proper evaluation of traffic identification algorithms.
The contributions of this paper are manifold. The first contribution is exposing the limitations and flaws of previous studies published in the literature, which are usually biased toward particular configurations of the evaluation environment and/or the identification algorithm. The second contribution is a comprehensive evaluation of traffic identification that overcomes some of the problems reported. The third contribution is the introduction of a methodology, based on active measurement, for validating a payload signature algorithm; after this validation, the algorithm may be used to generate a baseline for evaluating a different identification method. The fourth contribution is the use of clustering for training set selection, which significantly improved the average results across the different scenarios. Finally, the creation of a trustable baseline is very important in the evaluation of traffic identification; most baselines are either not representative, such as a handful of flows generated mechanically, or are used without validation, such as unevaluated payload identification analyzers.
¹ Application Layer Packet Classifier for Linux, http://l7-filter.sourceforge.net
An important direction for future work is the creation of an identification baseline, using active measurement, that mimics the composition (in the proportion of each application) of traces available on the Internet, something that is unfeasible to obtain for backbone traffic using manual identification. Another future work is the creation of a methodology that can take the best of a set of identification algorithms, yielding better identification results across scenarios.
REFERENCES
[1] Karagiannis, Thomas; Broido, Andre; Brownlee, Nevil; Claffy, K. C.; Faloutsos, Michalis, "Is P2P dying or just hiding?", Global Telecommunications Conference (GLOBECOM 2004), November 2004.
[2] Moore, Andrew; Zuev, Denis, "Internet Traffic Classification Using Bayesian Analysis Techniques", Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ACM Press, June 2005, pp. 50-60.
[3] Sen, Subhabrata; Spatscheck, Oliver; Wang, Dongmei, "Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures", Proceedings of the 13th International Conference on World Wide Web, 2004, pp. 512-521.
[4] Erman, Jeffrey; Arlitt, Martin; Mahanti, Anirban, "Traffic Classification Using Clustering Algorithms", Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, 2006, pp. 281-286.
[5] Bernaille, Laurent; Teixeira, Renata; Akodkenou, Ismael; Soule, Augustin; Salamatian, Kave, "Traffic Classification On The Fly", ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, April 2006, pp. 23-26.
[6] Karagiannis, T.; Papagiannaki, K.; Faloutsos, M., "BLINC: Multilevel Traffic Classification in the Dark", ACM SIGCOMM 2005, August/September 2005.
[7] Xu, Kuai; Zhang, Zhi-Li; Bhattacharyya, Supratik, "Profiling Internet Backbone Traffic: Behavior Models and Applications", ACM SIGCOMM 2005.
[8] Li, Zhu; Yuan, Ruixi; Guan, Xiaohong, "Accurate Classification of the Internet Traffic Based on the SVM Method", IEEE International Conference on Communications (ICC 2007), June 2007, pp. 1373-1378.
[9] Callado, Arthur; Kamienski, Carlos; Szabó, Géza; Gerö, Balázs Péter; Kelner, Judith; Fernandes, Stênio; Sadok, Djamel, "A Survey on Internet Traffic Identification", IEEE Communications Surveys and Tutorials, vol. 11, 2009, pp. 37-52.
[10] Szabó, Géza; Orincsay, Dániel; Malomsoky, Szabolcs; Szabó, István, "On the Validation of Traffic Classification Algorithms", Passive and Active Measurement Conference (PAM 2008), Cleveland, April 2008.
[11] Jun, Li; Shunyi, Zhang; Yanqing, Lu; Zailong, Zhang, "Internet Traffic Classification Using Machine Learning", Proceedings of the 2nd International Conference on Communications and Networking in China (CHINACOM 2007), August 2007, pp. 239-243.
[12] Witten, Ian H.; Frank, Eibe, "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, second edition, June 2005.
[13] Callado, Arthur; Bischoff, Eliane; Sadok, Djamel; Kelner, Judith; Kamienski, Carlos, "Demystifying Internet Traffic Identification, or: why can't I know what really goes on my network?", technical report, August 2008. https://www.gprt.ufpe.br/~arthur/GPRT-UFPE-tech-report-01-2008.pdf