Hybrid Traffic Classification Approach Based on Decision Tree

Wei Lu, Mahbod Tavallaee and Ali A. Ghorbani
Faculty of Computer Science, University of New Brunswick, Fredericton, NB E3B 5A3, Canada
{wlu,m.tavallaee,ghorbani}@unb.ca

Classifying network traffic is very challenging and remains an open problem because of the growth of new applications and of traffic encryption. In this paper, we propose a novel hybrid approach to network flow classification: a payload signature based classifier first identifies the application of each flow, and the flows it leaves unknown are then identified by a decision tree based classifier running in parallel. We evaluate our approach with over 100 million flows collected over three consecutive days on a large-scale WiFi ISP network, and the results show that the proposed approach classifies all the flows with an accuracy approaching 93%.

Index Terms - Traffic Classification, Machine Learning

I. INTRODUCTION

Accurate classification of network traffic has received a lot of attention because of its important role in many areas such as network planning, QoS provisioning and class of service mapping, to name a few. Traditionally, traffic application classification relied to a large extent on transport layer port numbers, which was effective in the early days of the Internet. Port numbers, however, provide very limited information nowadays because of the increase of HTTP tunnel applications, the constant emergence of new protocols and the domination of P2P networking applications. An alternative is to examine the payload of network flows and create signatures for each application. This, however, has two major limitations: (1) legal issues related to privacy, and (2) the impossibility of identifying encrypted traffic. By observing traffic on a large-scale WiFi ISP network over a half-year period, we found that even with flow content examination, about 40% of network flows still cannot be classified into specific applications (i.e. 40% of network flows are labelled as unknown applications). Investigating such a large amount of unknown traffic is unavoidable, since it might stand for missed known applications, abnormalities in the traffic, malicious behaviours or simply novel applications.

Recent studies on network traffic application classification mainly focus on "applying machine learning algorithms for clustering and classifying traffic flows based on a set of statistical features" [1,2,3], and on "identifying traffic based on heuristics derived from analysis of communication patterns of hosts" [4,5]. Although existing traffic classification mechanisms have produced a number of good ideas, they are far from complete because of the limited number of applications they can identify (e.g. only 3 in [2]) and their coarse application scopes (e.g. BLINC in [4] attempts to identify general P2P traffic instead of the specific underlying P2P applications such as eDonkey and BitTorrent). Moreover, comparing all the above mentioned methods is difficult because of the lack of a sharable dataset and appropriate metrics [6].

Addressing the limitations of the aforementioned approaches, we propose in this paper a hybrid mechanism for classifying flow applications on the fly. We first model and generate signatures for more than 470 applications according to the port numbers and protocol specifications of these applications. Then, concentrating on the unknown flows that cannot be identified by signatures, we investigate their temporal-frequent characteristics in order to assign them to the already labelled applications, using a decision tree trained with the corresponding temporal-frequent characteristics of known flows.

According to [6], an open issue in the application classification community is how to define application classes in a standard way so that the performance of different classifiers can be fairly evaluated and compared (e.g. "Network Management" protocols might consist of different applications across various classification techniques). In this paper we also attempt to address this issue by defining the application classes hierarchically; as a result, all the 470 applications are classified by Application ID, Application Name and Application Group Name, from lower to higher level.

In summary, the major contributions of this paper are: (1) based on port numbers and payload bit strings, we create a signature base for more than 470 applications that covers most existing network applications; (2) we propose a new algorithm to assign unknown flows to known applications, based on the n-gram (frequent characteristics) of the flow payload over a time period (temporal characteristics).

The rest of the paper is organized as follows. Section 2 introduces related work, in which we summarize existing application classification approaches in two categories, namely machine learning based and heuristics based. Section 3 presents our signatures based classifier and the definition of application classes. Section 4 describes the decision tree algorithm based on the temporal-frequent characteristics (i.e. the n-gram feature over a time period) for classifying unknown flows. Section 5 evaluates our hybrid classification model with over 100 million flows collected on a large-scale WiFi ISP network. Section 6 makes some concluding remarks and discusses future work.

II. RELATED WORK

Early common techniques for identifying network

978-1-4244-4148-8/09/$25.00 ©2009 This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE "GLOBECOM" 2009 proceedings.

applications rely on the association of a particular port with a particular protocol. Such port number based traffic classification has proved ineffective because of: (1) the constant emergence of new peer-to-peer networking applications for which IANA does not define port numbers, (2) the dynamic port number assignment of some applications (e.g. FTP for data transfer), and (3) the encapsulation of different services into the same application (e.g. chat or streaming can be encapsulated into the same HTTP protocol). As an alternative, recent studies have brought many good ideas to traffic classification.

In [1], Williams et al. conduct a preliminary performance comparison of 5 machine learning algorithms for practical IP flow classification. Their evaluation shows that feature reduction greatly reduces the number of features needed to identify traffic flows without severely reducing the classification accuracy. Given the same features and flow trace, they report that different machine learning algorithms provide very similar classification accuracy.

In [2], Crotti et al. present a flow classification mechanism based on three simple properties of the captured IP packets: packet size, inter-arrival time and arrival order. A new structure, called protocol fingerprints, is defined to express the three trace properties in an efficient way. Through an anomaly score, the protocol fingerprints measure "how far" an unknown flow is from the basic characteristics of each protocol. The evaluation with a traffic trace collected on a large campus network shows that the approach is effective for classifying a set of protocols. As stated in their evaluation, the limitation of the approach is that it can identify only 3 protocols, namely SMTP, POP3 and HTTP.

In [3], Roughan et al. propose an approach to classify traffic in terms of classes of service, such as interactive (e.g. telnet), bulk data transfer (e.g. Kazaa), corporate (e.g. database transactions, DNS), and real-time applications (voice, video streaming). A 3-stage process is used to achieve the Class of Service (CoS) mapping, namely statistics collection, classification and rule creation. The initial set of features used in the experiment is the average packet size, flow duration, bytes per flow, packets per flow, and Root Mean Square (RMS) packet size. According to the authors, a useful set of features allows the discrimination between traffic classes.

Although machine learning based approaches show some capability for traffic classification, the number of applications they can identify is limited. In addition, their definition of application classes is coarse and not precise enough to distinguish fine-grained applications, since most early work on traffic classification using machine learning focused on achieving QoS for different classes of service.

A heuristics based classifier usually combines multiple identification criteria to achieve the best result for application discovery. In [4], Karagiannis et al. present a heuristic technique for classifying traffic in the dark, called BLINC, which identifies network applications at three levels: (1) the social level, (2) the functional level and (3) the application level. At the social level, BLINC first examines the social behaviors of single hosts and then detects communities of hosts. At the functional level, BLINC observes the source port of a host. At the application level, a graphlet, which is composed of a set of connected nodes, is used to describe the signature of a specific network application. Each specific application is associated with a graphlet, and applications are identified by matching the graphlets. Finally, a heuristic method is used to refine the classification and to discriminate complex or similar graphlets.

Concurrent with [4], Xu et al. develop a general methodology for building comprehensive behavior profiles of Internet backbone traffic in terms of communication patterns of end-hosts and services [5]. Relying on data mining and information-theoretic techniques, the methodology consists of significant cluster extraction, automatic behavior classification and structural modeling for in-depth interpretive analyses.

The heuristics based classifiers obtain high identification accuracy and label more applications. Their biggest limitation, however, is that they cannot classify traffic in real time on large-scale communication networks (e.g. BLINC is an offline labeling method [4]).

III. SIGNATURES BASED CLASSIFIER

The payload signatures based classifier investigates the characteristics of bit strings in the packet payload. For most applications, the initial protocol handshake steps are different and can thus be used for classification. Moreover, the protocol signatures can be modeled either from public documents such as RFCs or through empirical analysis that derives the distinct bit strings in both TCP and UDP traffic. Each application signature is composed of 12 fields, namely application ID, application name, application group name, application description, protocol, srcip, srcport, dstip, dstport, commondstport, srccontent and dstcontent. The total number of application signatures is 471, belonging to 280 application names and 21 application groups.
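To make the 12-field signature record more concrete, the following is a minimal sketch of how such a record and a first-match lookup could look. Only the field names come from the text, and the ID, name and group of the example record come from Table I. The field types, the wildcard convention (`None` means "any"), the prefix matching of payload content, the `Flow` record, the first-match policy and the example payload patterns are all assumptions made for illustration, not the deployed classifier.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AppSignature:
    """One signature record; the 12 field names follow the text, the types are assumed."""
    app_id: int
    app_name: str
    app_group: str
    description: str
    protocol: str                  # e.g. "TCP" or "UDP"
    srcip: Optional[str]           # None is treated as a wildcard (assumption)
    srcport: Optional[int]
    dstip: Optional[str]
    dstport: Optional[int]
    commondstport: Optional[int]   # stored only; its matching semantics are not given in the text
    srccontent: Optional[bytes]    # bytes expected at the start of the source payload (assumption)
    dstcontent: Optional[bytes]    # bytes expected at the start of the destination payload (assumption)


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record holding just what the matcher needs."""
    protocol: str
    srcip: str
    srcport: int
    dstip: str
    dstport: int
    src_payload: bytes
    dst_payload: bytes


def _equal_or_any(expected, actual) -> bool:
    return expected is None or expected == actual


def _prefix_or_any(expected: Optional[bytes], payload: bytes) -> bool:
    return expected is None or payload.startswith(expected)


def matches(sig: AppSignature, flow: Flow) -> bool:
    """True when every non-wildcard field of the signature agrees with the flow."""
    return (
        sig.protocol == flow.protocol
        and _equal_or_any(sig.srcip, flow.srcip)
        and _equal_or_any(sig.srcport, flow.srcport)
        and _equal_or_any(sig.dstip, flow.dstip)
        and _equal_or_any(sig.dstport, flow.dstport)
        and _prefix_or_any(sig.srccontent, flow.src_payload)
        and _prefix_or_any(sig.dstcontent, flow.dst_payload)
    )


def classify(flow: Flow, signatures: Sequence[AppSignature]) -> Optional[AppSignature]:
    """Return the first matching signature, or None for an unknown flow."""
    for sig in signatures:
        if matches(sig, flow):
            return sig
    return None


# ID, name and group are taken from Table I; the description, port and
# payload patterns are hypothetical values chosen for the example.
HTTP_WEB = AppSignature(
    app_id=1010, app_name="HTTPWeb", app_group="WEB",
    description="Web browsing over HTTP", protocol="TCP",
    srcip=None, srcport=None, dstip=None, dstport=None, commondstport=80,
    srccontent=b"GET ", dstcontent=b"HTTP/1.",
)
```

A flow for which `classify` returns `None` corresponds to what the text calls an unknown flow. The record also carries the three levels of the class hierarchy (ID, name, group), so a match yields all three at once.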
The signatures based classifier is deployed on Fred-eZone, a free wireless fidelity (WiFi) network service provider operated by the City of Fredericton [7]. A general result is that about 40% of flows cannot be classified by the current payload signatures based classification method. In the next section we build a module that works in parallel with the signatures based application detection engine. The new module focuses only on those flows that the signature based detector could not identify and that therefore appear to it as unknown.

IV. DECISION TREE BASED CLASSIFIER

The n-gram byte distribution has proven effective for detecting network anomalies. Wang et al. examine the 1-gram byte distribution of the packet payload, represent each packet as a 256-dimensional vector describing the occurrence frequency of each of the 256 ASCII characters in the payload, and then construct the normal packet profile by calculating the statistical average and deviation of normal packets for a specific application service (e.g. HTTP) [8]. An anomaly is reported once the Mahalanobis distance between the testing data and the normal profile exceeds a predefined threshold.

Different from previous n-gram based approaches for network intrusion detection, we extend in this paper the n-gram frequency into the temporal domain and generate a set of 256-dimensional vectors representing the temporal-frequent characteristics of the 256 ASCII binary bytes in the payload over a predefined time interval. By observing and analyzing the known network traffic applications, labeled by the signatures based classifier, over a long period on a large-scale WiFi ISP network, we found that the n-gram (n = 1 in particular) over a one-second time interval, for both source flow payload and destination flow payload, is a strong enough feature to differentiate traffic applications. As an example, Figures 1 and 2 illustrate this novel temporal-frequent metric for the applications BitTorrent (P2P) and HTTPWeb (WEB), respectively. The X axis in both figures is the ASCII character value from 0 to 255 in the source flow payload. The Y axis is the frequency with which each ASCII character appears over the predefined time interval (i.e. 1 second).
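The temporal-frequent metric described above can be sketched in a few lines. The sketch assumes that one direction of a flow is available as a sequence of `(timestamp, payload)` pairs and that the frequency is the relative frequency of each byte value within the window. Both are assumptions: the text does not state the input format or whether counts are normalized. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 1.0  # the one-second interval mentioned in the text


def byte_frequencies(payload: bytes) -> List[float]:
    """Relative frequency of each of the 256 byte values (1-gram distribution)."""
    counts = [0] * 256
    for value in payload:          # iterating over bytes yields integers 0..255
        counts[value] += 1
    total = len(payload)
    if total == 0:
        return [0.0] * 256
    return [count / total for count in counts]


def temporal_frequent_vectors(
    packets: Iterable[Tuple[float, bytes]],
    window: float = WINDOW_SECONDS,
) -> Dict[int, List[float]]:
    """Group payload bytes by time window and return one 256-dim vector per window."""
    per_window: Dict[int, bytearray] = defaultdict(bytearray)
    for timestamp, payload in packets:
        per_window[int(timestamp // window)].extend(payload)
    return {
        index: byte_frequencies(bytes(data))
        for index, data in sorted(per_window.items())
    }
```

Each returned vector is one instance of the 256-dimensional feature. Computing it separately for the source and the destination payload of a flow gives the two inputs used by the two classifiers discussed later.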

Fig. 1 Temporal-frequent metric for source flow payload of BitTorrent application

Fig. 2 Temporal-frequent metric for source flow payload of HTTPWeb application

By comparing Figure 1 with Figure 2, we see that the temporal-frequent metrics of the flow payload are very different for the P2P application and the WEB application. We denote the 256-dimensional n-gram byte distribution as a vector $\langle f_1^{t_i}, f_2^{t_i}, \ldots, f_{256}^{t_i} \rangle$, where $f_j^{t_i}$ stands for the frequency of the $j$-th ASCII character in the flow payload over a time window $t_i$ ($j = 1, 2, \ldots, 256$; $i = 0, 1, 2, \ldots$), i.e. the temporal-frequent metric of the flow payload. Given $n$ historical known flows for each specific application, we define an $n \times 256$ matrix, $p^{app}$, for profiling the application, as follows:

$$
p^{app}_{n \times 256} =
\begin{bmatrix}
f_1^{t_1} & f_2^{t_1} & \cdots & f_{256}^{t_1} \\
f_1^{t_2} & f_2^{t_2} & \cdots & f_{256}^{t_2} \\
\vdots & \vdots & \ddots & \vdots \\
f_1^{t_n} & f_2^{t_n} & \cdots & f_{256}^{t_n}
\end{bmatrix}
$$

We create over 470 application profiling matrices, one for each application in the signature base. Unknown flows that cannot be identified by the signatures based classifier can therefore be labeled using the application profiling matrices: as long as an unknown flow carries payload, its temporal-frequent characteristics can always be modeled, even though no signature in the signature base matches it, and can thus be used for unknown traffic classification.

The decision tree technique is a good candidate for unknown traffic classification in this case because of its low computational complexity and its ability to be trained on large datasets. In a typical decision tree, each node is either a leaf node or a decision node. A leaf node indicates the value of the target class, such as Application = Gnutella, and a decision node specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. A decision tree classifies an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance. Suppose there is a decision tree for application classification trained on the 256-dimensional attribute vector $\langle f_1, f_2, \ldots, f_{256} \rangle$. An unknown flow with a new 256-dimensional vector is first tested at the root node on whether $f_1$ is bigger than 0.1; if the result is $f_1 \le 0.1$, then $f_5$ is tested against 0.3, and if $f_5$ is bigger than 0.3, the unknown flow is labeled as the Gnutella application.

The training of the decision tree is based on the historical 470 application profiling matrices, and each application profiling matrix includes at least 1,000 instances (i.e. the size of the matrix is 1000 × 256). The decision tree algorithm we apply is C4.5, proposed by Quinlan [9], since it is well known and has been frequently used over the years.

V. EXPERIMENTAL EVALUATION

The traffic traces used in the experimental evaluation were collected over three consecutive days on a large-scale WiFi ISP network, on which we deploy the signatures based classifier and achieve a 60% classification rate over more than 100 million flows. To create the training dataset for the decision tree based classifier, 11 typical applications belonging to 8 typical application groups are modeled from known labeled flows, as listed in Table I. The size of the input data for training the decision tree is 11000 × 256. To validate the decision tree model we conduct two evaluations: one is a 10-fold cross validation on the traffic trace collected over one day, and the other is a real-time classification evaluation in which the traffic traces collected over two days are used for training and the real-time traffic flows collected on the third day are used for testing. Next we discuss the results of the two experiments.

A. 10-fold Cross Validation for Decision Tree based Traffic Classifier

Most current machine learning based classifiers are evaluated and validated relying only on the results obtained from one run of the method on a dataset. Results obtained this way, however, are not reliable; the experiment should be repeated several times and the average performance reported. The most common way to do this is cross-validation, in which the dataset is divided into several portions and, during each run, one portion is used as the testing dataset and the rest as the training dataset. The final result is the mean value over all runs.

During the experiment we apply 10-fold cross validation to evaluate the learned decision tree: we randomly break the dataset of 11,000 instances into 10 sets of 1,100 instances each. In each run, nine of the sets are used for training and the remaining one for testing. We repeat the testing 10 times, so that each set serves once as the testing set, and obtain the mean classification accuracy. Since each flow is bi-directional, including both source flow content and destination flow content, the decision tree model we obtain includes two individual classifiers (i.e. the classifier based on source flow payload and the classifier based on destination flow payload), depending on which direction of the flows is used in testing. The general classification accuracy is summarized in Tables II and III, one table per classifier (source flow based and destination flow based). Tables IV and V give the detailed classification accuracy for each specific application.

B. Evaluation for Flow Classification

Although the decision tree based classifier achieves a high classification accuracy in the cross validation experiment, the results might be misleading because application traffic collected on the same day tends to be similar. We therefore conduct an online evaluation in which the decision tree based classifier is deployed on a large-scale WiFi ISP network and works in parallel with the signatures based classifier. More than 90,000 flows are collected on the network over the testing day and are forced to be treated as unknown; their real labels are shown in Table VI. Tables VII and VIII give the detailed classification accuracy for each specific application using the source flow based classifier and the destination flow based classifier, respectively. The general classification accuracy of both classifiers is given in Table IX. The online evaluation results show that the decision tree classifier based on destination flows achieves a classification accuracy of 92.6%, which is higher than the 89.4% obtained by the source flow based classifier. All unknown flows are assigned to specific applications and no flow is left unclassified, owing to the deterministic mechanism of the decision tree structure.
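The last point, that a decision tree never leaves a flow unlabeled, follows from the fact that every path from the root ends in a leaf. A minimal sketch with a toy tree illustrates this. Only the two tests `f1 <= 0.1` and `f5 > 0.3 -> Gnutella` come from the example in Section IV; the other two leaves, the class names `Leaf` and `Decision`, and the function `label_flow` are hypothetical, and the tree is hand-built rather than learned with C4.5.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    application: str


@dataclass
class Decision:
    feature: int                        # 1-based index j of the attribute f_j
    threshold: float
    if_le: Union["Decision", Leaf]      # branch taken when f_j <= threshold
    if_gt: Union["Decision", Leaf]      # branch taken when f_j > threshold


Node = Union[Decision, Leaf]

# Toy tree: the f1 and f5 tests and the Gnutella leaf follow the example in
# the text; the HTTPWeb and BitTorrent leaves are made up for illustration.
TOY_TREE: Node = Decision(
    feature=1, threshold=0.1,
    if_le=Decision(feature=5, threshold=0.3,
                   if_le=Leaf("HTTPWeb"),
                   if_gt=Leaf("Gnutella")),
    if_gt=Leaf("BitTorrent"),
)


def label_flow(vector: List[float], root: Node = TOY_TREE) -> str:
    """Walk from the root to a leaf; every input vector reaches exactly one leaf."""
    node = root
    while isinstance(node, Decision):
        value = vector[node.feature - 1]
        node = node.if_le if value <= node.threshold else node.if_gt
    return node.application
```

Because each decision node has a branch for both outcomes of its test, `label_flow` returns some application for any 256-dimensional input, including vectors that resemble none of the training applications. This is also why the classifier on its own cannot report a new application.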

VI. CONCLUSION

We propose in this paper a novel hybrid flow classification approach combining a payload signature based classifier and a machine learning based classifier. The signature based classifier holds bit string signatures for over 470 applications, covering most current network applications; the signatures belong to 280 application names and 21 application groups, organized in a hierarchical structure that gives a formal definition of application classes. The machine learning based classifier consists of a temporal-frequent metric calculated from the 1-gram of both source and destination flows and a decision tree engine trained with the 256-dimensional temporal-frequent feature vectors. The decision tree classification engine works in parallel with the signatures based classifier and thus identifies unknown network traffic on the fly. The offline and online evaluations on a large-scale WiFi ISP network show that our hybrid approach can classify flow applications in real time with high accuracy (92.6% and 89.4% for the destination flow based and source flow based classifiers in the online evaluation).

One of the biggest limitations of our hybrid approach is that it fails to find new (or unknown) applications. Unsupervised learning techniques have been studied recently to discover new applications on the Internet; they, however, suffer from a large number of false positives. To address this issue, in the immediate future we will extend our approach to discover new applications automatically. The basic idea is to apply the two decision tree classifiers, based on source flow and destination flow, in parallel. If both classifiers agree that an unknown flow is, for example, Gnutella, then the final classification output is Gnutella. If instead the two classifiers obtain contradictory results, say one labels the unknown flow as HTTPWeb while the other labels it as BitTorrent, we leave the flow as unknown. This idea for new application discovery is reasonable and is supported by our experimental results: during the evaluation, we found that the classification capability of the two individual classifiers is very similar, so a flow assigned to two totally different applications by two classifiers with similar capability might stand for a new application.
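The agreement rule proposed above as future work is simple enough to sketch directly. The sketch assumes each classifier is available as a function from a 256-dimensional vector to an application name; the names `combine_labels`, `classify_flow` and the `UNKNOWN` marker are hypothetical.

```python
from typing import Callable, List

UNKNOWN = "unknown"
Classifier = Callable[[List[float]], str]


def combine_labels(source_label: str, destination_label: str) -> str:
    """Keep the label only when both classifiers agree; otherwise leave it unknown."""
    return source_label if source_label == destination_label else UNKNOWN


def classify_flow(
    source_vector: List[float],
    destination_vector: List[float],
    source_classifier: Classifier,
    destination_classifier: Classifier,
) -> str:
    """Run both direction-specific classifiers and apply the agreement rule."""
    return combine_labels(
        source_classifier(source_vector),
        destination_classifier(destination_vector),
    )
```

With this rule, a flow that both classifiers label Gnutella stays Gnutella, while a flow labeled HTTPWeb by one and BitTorrent by the other is left as unknown and becomes a candidate for a new application.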
Moreover, in the near future we will release to the public a large dataset collected on a free access WiFi network, so that the performance of different classifiers can be compared in academia, addressing one of the open issues mentioned in [6].

ACKNOWLEDGMENT

The authors graciously acknowledge the funding from the Atlantic Canada Opportunities Agency (ACOA) through the Atlantic Innovation Fund (AIF) to Dr. Ghorbani.

REFERENCES

[1] Williams, N., Zander, S. and Armitage, G. 2006. A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. ACM SIGCOMM Computer Communication Review, Vol. 36, Issue 5, 5-16.

[2] Crotti, M., Dusi, M., Gringoli, F. and Salgarelli, L. 2007. Traffic classification through simple statistical fingerprinting. ACM SIGCOMM Computer Communication Review, Vol. 37, Issue 1, 5-16.

[3] Roughan, M., Sen, S., Spatscheck, O. and Duffield, N.G. 2004. Class of service mapping for QoS: a statistical signature based approach to IP traffic classification. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, Taormina, Sicily, Italy, 135-148.

[4] Karagiannis, T., Papagiannaki, K. and Faloutsos, M. 2005. BLINC: multilevel traffic classification in the dark. In Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, Pennsylvania, 229-240.

[5] Xu, K., Zhang, Z.L. and Bhattacharyya, S. 2005. Profiling Internet backbone traffic behavior models and applications. ACM SIGCOMM Computer Communication Review, Vol. 35, Issue 4, 169-180.

[6] Salgarelli, L., Gringoli, F. and Karagiannis, T. 2008. Comparing traffic classifiers. ACM SIGCOMM Computer Communication Review, Vol. 37, Issue 3, 65-68.

[7] Fred-eZone WiFi ISP, http://www.fred-ezone.ca/

[8] Wang, K. and Stolfo, S. 2005. Anomalous payload-based worm detection and signature generation. In Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection (RAID), Seattle, WA.

[9] Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

TABLE I. APPLICATIONS IN TRAINING DATASET

Application ID   Application Name       Application Group   Size of Matrix
2006             BitTorrent             P2P                 1000 × 256
2000             Gnutella               P2P                 1000 × 256
2008             LimeWire               P2P                 1000 × 256
1010             HTTPWeb                WEB                 1000 × 256
1011             SecureWeb              WEB                 1000 × 256
1008             POP                    MAIL                1000 × 256
1004             SMTP                   MAIL                1000 × 256
1002             FTP                    DataTransfer        1000 × 256
5672             MSN                    CHAT                1000 × 256
1005             SSH                    RemoteAccess        1000 × 256
5005             Windows MediaPlayer    Streaming           1000 × 256

TABLE II. GENERAL EVALUATION RESULTS FOR DESTINATION FLOW BASED DECISION TREE CLASSIFIER

Performance Metrics                 Evaluation Results
Correctly Classified Instances      10701
Classification Accuracy             97.28 %
Incorrectly Classified Instances    299
False Classification Rate           2.72 %
Kappa statistic                     0.9701
Mean absolute error                 0.0058
Root mean squared error             0.0688
Relative absolute error             3.5194 %
Root relative squared error         23.9481 %

TABLE III. GENERAL EVALUATION RESULTS FOR SOURCE FLOW BASED DECISION TREE CLASSIFIER

Performance Metrics                 Evaluation Results
Correctly Classified Instances      10694
Classification Accuracy             97.22 %
Incorrectly Classified Instances    306
False Classification Rate           2.78 %
Kappa statistic                     0.9694
Mean absolute error                 0.0056
Root mean squared error             0.0688
Relative absolute error             3.3674 %
Root relative squared error         23.9404 %

TABLE IV. DETAILED EVALUATION RESULTS FOR DESTINATION FLOW BASED DECISION TREE CLASSIFIER

Application            TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
BitTorrent             0.923     0.006     0.941       0.923    0.932       0.969
FTP                    0.952     0.003     0.968       0.952    0.96        0.976
Gnutella               0.988     0.001     0.99        0.988    0.989       0.995
HTTPWeb                0.964     0.003     0.967       0.964    0.965       0.99
LimeWire               0.953     0.008     0.927       0.953    0.94        0.978
MSN                    0.988     0.002     0.985       0.988    0.987       0.993
POP                    0.998     0.0       0.996       0.998    0.997       0.999
SecureWeb              0.967     0.004     0.965       0.967    0.966       0.987
SMTP                   0.994     0.001     0.995       0.994    0.994       0.998
SSH                    0.996     0.0       0.997       0.996    0.996       0.999
Windows MediaPlayer    0.978     0.003     0.97        0.978    0.974       0.99

TABLE V. DETAILED EVALUATION RESULTS FOR SOURCE FLOW BASED DECISION TREE CLASSIFIER

Application            TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
BitTorrent             0.914     0.006     0.942       0.914    0.928       0.967
FTP                    0.965     0.006     0.941       0.965    0.953       0.996
Gnutella               0.992     0.001     0.993       0.992    0.992       0.997
HTTPWeb                0.958     0.006     0.942       0.958    0.95        0.984
LimeWire               0.973     0.003     0.972       0.973    0.973       0.988
MSN                    0.995     0.001     0.994       0.995    0.995       0.998
POP                    1.0       0.0       0.997       1.0      0.999       1.0
SecureWeb              0.96      0.004     0.963       0.96     0.961       0.981
SMTP                   0.989     0.001     0.99        0.989    0.989       0.995
SSH                    0.998     0.0       1.0         0.998    0.999       0.999
Windows MediaPlayer    0.95      0.004     0.96        0.95     0.955       0.982

TABLE VI. DISTRIBUTION OF "UNKNOWN" APPLICATION FLOWS

Applications           Number of Flows
BitTorrent             29739
FTP                    224
Gnutella               15109
HTTPWeb                16216
LimeWire               141
MSN                    4049
POP                    26
SecureWeb              12886
SMTP                   11522
SSH                    2197
Windows MediaPlayer    722

TABLE VII. CLASSIFICATION RESULTS WITH SOURCE FLOW BASED DECISION TREE CLASSIFIER

Applications           Number of Unknown Flows   Number of Flows Correctly Labeled
BitTorrent             29739                     27777
FTP                    224                       193
Gnutella               15109                     11929
HTTPWeb                16216                     12635
LimeWire               141                       131
MSN                    4049                      4021
POP                    26                        26
SecureWeb              12886                     12097
SMTP                   11522                     11512
SSH                    2197                      2181
Windows MediaPlayer    722                       481

TABLE VIII. CLASSIFICATION RESULTS WITH DESTINATION FLOW BASED DECISION TREE CLASSIFIER

Applications           Number of Unknown Flows   Number of Flows Correctly Labeled
BitTorrent             29739                     27796
FTP                    224                       181
Gnutella               15109                     13992
HTTPWeb                16216                     13996
LimeWire               141                       108
MSN                    4049                      4012
POP                    26                        26
SecureWeb              12886                     11809
SMTP                   11522                     11424
SSH                    2197                      2170
Windows MediaPlayer    722                       81

TABLE IX. GENERAL CLASSIFICATION ACCURACY FOR BOTH CLASSIFIERS

Classifier                                            Total Number of Flows Correctly Identified   Classification Accuracy (%)
Decision Tree Classifier Based on Source Flows        82983                                        89.4
Decision Tree Classifier Based on Destination Flows   85995                                        92.6
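The overall accuracy reported in Table IX for the source flow based classifier can be re-derived from the per-application counts in Tables VI and VII. The short sketch below does this. The counts are copied from those tables, while the function names and the dictionary layout are choices made for this example.

```python
from typing import Dict, Tuple

# application -> (number of unknown flows, flows correctly labeled by the
# source flow based classifier), copied from Tables VI and VII
SOURCE_RESULTS: Dict[str, Tuple[int, int]] = {
    "BitTorrent": (29739, 27777),
    "FTP": (224, 193),
    "Gnutella": (15109, 11929),
    "HTTPWeb": (16216, 12635),
    "LimeWire": (141, 131),
    "MSN": (4049, 4021),
    "POP": (26, 26),
    "SecureWeb": (12886, 12097),
    "SMTP": (11522, 11512),
    "SSH": (2197, 2181),
    "Windows MediaPlayer": (722, 481),
}


def overall_accuracy(results: Dict[str, Tuple[int, int]]) -> Tuple[int, int, float]:
    """Return (total flows, correctly labeled flows, accuracy in percent)."""
    total = sum(flows for flows, _ in results.values())
    correct = sum(labeled for _, labeled in results.values())
    return total, correct, 100.0 * correct / total


def per_application_accuracy(results: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Accuracy in percent for each application separately."""
    return {app: 100.0 * labeled / flows for app, (flows, labeled) in results.items()}


if __name__ == "__main__":
    total, correct, accuracy = overall_accuracy(SOURCE_RESULTS)
    print(total, correct, round(accuracy, 1))  # 92831 82983 89.4
```

The total of 92,831 flows matches the "more than 90,000 flows" of the online evaluation, and dividing the Table IX count for the destination flow based classifier (85,995) by the same total gives the reported 92.6%. The per-application view shows that the overall figure hides large differences between applications, for instance between POP and Windows MediaPlayer.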
