2nd International Conference on Electronic Design (ICED), August 2014, Penang, Malaysia
Network Traffic Classification - A Comparative Study of Two Common Decision Tree Methods: C4.5 and Random Forest

Alhamza Munther
Rozmie R. Othman
School of Computer and Communication Engineering Universiti Malaysia Perlis Perlis, Malaysia
[email protected]
School of Computer and Communication Engineering Universiti Malaysia Perlis Perlis, Malaysia
[email protected]
Alabass Alalousi
Mohammed Anbar
Department of Computer, School of Computer and Information Systems University of Thamar Thamar, Yemen
[email protected]
National Advanced IPv6 Centre of Excellence Universiti Sains Malaysia Penang, Malaysia
[email protected]
Shahrul Nizam
School of Computer and Communication Engineering Universiti Malaysia Perlis Perlis, Malaysia
[email protected] network engineers to classify network traffic, the most common challenge are increasing application types and the huge size of data traffics [3]. Due to the above mentioned challenges , several studies have been conducted to classify network traffic, such as using well-known port numbers which are obtained by IANA [4]. Payload-based classification it’s another approach is used to classify traffic packets based on the field of payload [5, 6]. Signature- based approach is suggested to overcome the high cost in term of time and complexity for payload-based through looking for a specific pattern (bytes) called signature [7]. Behaviour-based studied the different discriminate features to distinguish application. Different machine learning methods are introduced to automate network traffic classification such as nearest neighbors (NN) presented by Roughan et al. [8], Moore and Zuev [9] proposed to apply the supervised Machine Learningn (ML) Naive Bayes, Bayesian neural network approach is explored by [10], kmeans suggested by [11], support vector machine (SVM) offered in [12] and Hidden Markov Model (HMM) was used by [13]. Decision tree is one of the most popular methods of supervised machine learning which is used in network traffic classification. Its follow divide-and-conquer approach to the problem of learning from a set of independent inputs (instances). Various types are descended from decision tree as ID3 [14], C4.5 [15], and Random Forest [16]. ID3 is used to generate the decision tree for different types of datasets; it builds the tree from the top down, with no backtracking. C4.5 is a widely-used classification method in different fields that is descended from an earlier method ID3 and is followed in turn
Abstract—Network traffic classification attracts continuous interest as many applications emerge on different kinds of networks, often using obfuscation techniques. The decision tree is a supervised machine learning method widely used to identify and classify network traffic. In this paper, we present a comparative study of two common decision tree methods, C4.5 and Random Forest. The study compares the two methods on two factors: classification accuracy and processing time. C4.5 achieved a classification accuracy of up to 99.67% for 24000 instances, while Random Forest was faster than C4.5 in terms of processing time.

Keywords—Traffic Classification; Machine Learning; Supervised Learning; Random Forest Algorithm
I. INTRODUCTION
Network traffic classification plays a significant role in network management and network security. It supports services such as identifying the applications that consume the most network resources, it forms the core of automated Intrusion Detection Systems (IDS), and it helps detect malicious traffic such as that produced by worms or Denial of Service (DoS) attacks. It also reveals which applications are widely used, which helps in offering new products [1, 2]. Generally, network traffic classification is the process of categorizing network traffic into a number of traffic classes according to various parameters (such as port number, arrival time, protocol type, and packet length). On the other hand, network traffic classification also faces several challenges.
978-1-4799-6103-0/14/$31.00 ©2014 IEEE
by C5.0. Random Forest is an ensemble learning method for classification and regression that operates by constructing a multitude of binary decision trees. This paper evaluates the performance, in terms of classification accuracy and processing time, of both the C4.5 and Random Forest methods in the area of network traffic classification. Section II details each method independently. Section III explores comparative results for both methods. Finally, Section IV concludes the paper.

II. BACKGROUND

The decision tree is one of the most commonly used supervised machine learning methods for network traffic classification [17]. This section provides details about decision tree elements, structure, and design. A decision tree uses a divide-and-conquer approach to solve the classification problem by asking a series of carefully chosen questions about the attributes of the test record (input). Each time an answer is received, a follow-up question is asked until a conclusion about the class label of the record is reached. These questions and answers are organized in a hierarchical structure of nodes and directed edges. A decision tree usually consists of three kinds of elements: a root node, which has no incoming edges and one or more outgoing edges; internal nodes, each with exactly one incoming edge and two or more outgoing edges; and leaf nodes, each with exactly one incoming edge and no outgoing edges. Each leaf node holds a class label, while the other nodes represent attribute test conditions (questions). Figure 1 depicts the structure and operation of a decision tree. Many decision trees can be constructed from a given dataset, some more accurate than others, but finding the optimal tree (classifier) is computationally infeasible because of the large size of the search space. Nonetheless, several algorithms have been developed to produce reasonably accurate trees. These algorithms adopt a greedy strategy that grows a decision tree through a series of locally optimum decisions about which attribute to use to partition the dataset. One such algorithm is Hunt's algorithm, which underlies decision tree algorithms such as ID3 [14], C4.5 [15], and CART [18]. Its basic idea is to grow the tree recursively by dividing the training records (inputs) into successively purer subsets (sub-trees).

However, Hunt's algorithm makes some stringent assumptions; for example, it works only if every combination of attribute values is present in the dataset and each combination has a unique class label. Two points must be considered when constructing a decision tree. First, a method is needed to select an attribute (feature) for splitting the records into smaller subsets, together with a way to evaluate the goodness of each test condition. Second, a stopping condition is needed to terminate tree growth. Attributes can be represented in a decision tree in four ways, as shown in Figure 2: binary, nominal, ordinal, and continuous. Binary attributes generate two potential outcomes. Nominal attributes have many values and can be represented in two ways: a multiway split, in which the number of outcomes equals the number of distinct values of the attribute, or multiple binary splits. Ordinal attributes can also be expressed by multiway or binary splits, provided the order property of the attribute values is not violated. For continuous attributes, the test condition compares the attribute value against a threshold, e.g. (A < v), with binary outcomes; for a multiway split the algorithm must consider all possible ranges of values. Several types of decision tree have been used for network traffic classification, such as CART, ID3, C4.5, C5.0, and Random Forest. The next subsections detail the methodology of the two most popular methods used in network traffic classification: C4.5 and Random Forest.
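Hunt's recursive partitioning described above can be sketched in a few lines. The following is an illustrative stdlib-only sketch, not the paper's Weka implementation: the function names and the tiny port-based dataset are hypothetical, and a real C4.5 adds gain-ratio attribute selection, pruning, and continuous-attribute handling.

```python
# Minimal sketch of Hunt's algorithm: recursively split the training
# records into purer subsets until every subset is pure (a leaf).
from collections import Counter

def majority_class(records):
    """Most frequent class label among (features, label) records."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def grow_tree(records, attributes):
    labels = {label for _, label in records}
    if len(labels) == 1:            # pure subset -> leaf node (class label)
        return labels.pop()
    if not attributes:              # no attributes left -> majority-class leaf
        return majority_class(records)
    attr = attributes[0]            # naive choice; C4.5 would pick by gain ratio
    children = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [r for r in records if r[0][attr] == value]
        children[value] = grow_tree(subset, attributes[1:])
    return (attr, children)         # internal node: test condition + branches

def classify(tree, feats):
    while isinstance(tree, tuple):  # walk internal nodes until a leaf
        attr, children = tree
        tree = children[feats[attr]]
    return tree

# Toy traffic records: each is ({attribute: value}, class label).
records = [({"port": "80"}, "WWW"), ({"port": "25"}, "MAIL"), ({"port": "53"}, "DNS")]
tree = grow_tree(records, ["port"])
print(classify(tree, {"port": "25"}))   # MAIL
```

Each recursive call asks one attribute-test "question", matching the hierarchical node-and-edge structure described above.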
Fig. 1. Simple Decision Tree
Fig. 2. Types of Attributes

ID3 was proposed by Ross Quinlan in 1975 at the University of Sydney [19]. The algorithm follows a top-down greedy approach, making the choice that looks best at the moment [20]. ID3 constructs decision trees from a fixed training dataset; these trees are later used to classify other records. In ID3, information gain is used to select the decision nodes (attributes) that split a set into subsets. Because information gain favours attributes with many values, it can cause high classification error and long processing times. C4.5 was proposed to overcome these issues by using the gain ratio, as in Equation 1, to select attributes. C5.0 is considered the successor of C4.5, but it was excluded from this study for two main reasons: first, published C5.0 work does not use the benchmark dataset employed in this study, relying instead on volunteer-collected data; second, the Weka simulation tool used in this study does not yet support C5.0, so an equivalent comparison is impossible because each algorithm would be implemented on a different dataset and testing environment. The Random Forest algorithm, by contrast, is supported by Weka. Moreover, Random Forest possesses two substantial aspects that motivated its inclusion in this study: it can classify network traffic in a parallel manner, and it can deal with high-dimensional traffic data. Hence, the present study is limited to C4.5 and Random Forest.

In other words, RF constructs multiple trees, each from a different random subset of the original training set, forming a collection of tree-structured classifiers (a forest of trees). These trees are grown by injecting random bootstrap data samples drawn with replacement from the original training data. About one-third of the instances (inputs) are not included in any given bootstrap sample; they are used to test the completed tree and evaluate its performance (out-of-bag data), which makes RF a self-testing algorithm. Trees are built with the standard binary decision tree algorithm. The attributes considered at the parent nodes of each tree are randomly chosen from the main set of attributes, and each node is split into no more than two children. An attribute with more than two classes can be selected more than once at different nodes, its classes being partitioned into two smaller groups each time the attribute is selected at a parent node.
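The bootstrap sampling with replacement and out-of-bag (OOB) testing described for Random Forest can be illustrated directly. This is a hedged sketch (the function name `bootstrap_split` is hypothetical, not Weka's API): drawing n times with replacement from n instances leaves on average (1 − 1/n)^n ≈ 1/e ≈ 36.8% of instances out of the bag, which matches the "one-third" figure quoted above.

```python
# Sketch of the bootstrap step in Random Forest: each tree trains on a
# sample drawn with replacement; the instances never drawn (out-of-bag)
# are free test data for that tree.
import random

def bootstrap_split(instances, rng):
    """Return (bootstrap sample, out-of-bag instances)."""
    n = len(instances)
    chosen = [rng.randrange(n) for _ in range(n)]   # indices drawn with replacement
    in_bag = set(chosen)
    sample = [instances[i] for i in chosen]
    oob = [x for i, x in enumerate(instances) if i not in in_bag]
    return sample, oob

rng = random.Random(42)                 # fixed seed for reproducibility
data = list(range(10000))
sample, oob = bootstrap_split(data, rng)
print(len(oob) / len(data))             # close to 1/e ~ 0.368, i.e. about one-third
```

Repeating this split per tree, each with its own random sample, is what makes the forest's trees diverse and enables the self-testing (OOB error) property.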
After the model is built, the forest is organized as a typical ensemble classifier. Each tree votes for the expected class, and the majority vote decides: the instance is classified into the class with the most votes over all trees in the forest. The generalization error PE* of the RF algorithm (and hence its accuracy) depends on two factors, namely the correlation ρ among trees and the strength S of each individual tree [16, 23]. The accuracy of RF can therefore be expressed as:
A. C4.5 Decision Tree

C4.5, developed by Ross Quinlan [15] to generate decision trees, is considered an extension of the ID3 algorithm. C4.5 can classify records carrying either continuous or discrete attributes, which is why it is used in the area of network traffic identification and classification. C4.5 classifies a record starting from the top (root) and passes iteratively down the tree until it reaches a class label (leaf node). C4.5 uses the information gain ratio as an alternative to the information gain used in ID3. The information gain ratio decides which feature becomes a test node based on the entropy of feature X and class Y, measuring the correlation between X and Y as expressed in the equations below [15, 21]. The feature with the highest information gain ratio is selected for the decision, and the splitting process continues until a leaf node (class label) is reached:
GainRatio(X, Y) = (H(X) − H(X|Y)) / H(X)    (1)
PE* = P_{X,Y}(mg(X, Y) < 0)
where mg(X, Y) is the margin function, measuring the distance between the average number of correct votes for a specific record X with true class Y and the average number of votes for any other class on the same record. The margin function is expressed as:

mg(X, Y) = av_k I(h_k(X) = Y) − max_{j≠Y} av_k I(h_k(X) = j)    (5)

where I(·) is the indicator function. The strength S of each tree is:

S = E_{X,Y} [mg(X, Y)]
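The margin and strength definitions above can be computed directly from tree votes. A short illustrative sketch (the vote lists and function names are hypothetical, not from the paper's experiments): for one instance, the margin is the fraction of trees voting for the true class minus the largest fraction voting for any other class, and the strength averages that margin over the data.

```python
# Sketch of the RF margin function mg(X, Y) and strength S = E[mg(X, Y)].
from collections import Counter

def margin(votes, true_class):
    """mg(X, Y) = av I(h_k(X) = Y) - max_{j != Y} av I(h_k(X) = j)."""
    counts = Counter(votes)
    n = len(votes)
    correct = counts[true_class] / n
    wrong = max((c / n for cls, c in counts.items() if cls != true_class),
                default=0.0)
    return correct - wrong

def strength(all_votes, true_classes):
    """Estimate S by averaging the margin over a set of instances."""
    margins = [margin(v, y) for v, y in zip(all_votes, true_classes)]
    return sum(margins) / len(margins)

votes = ["WWW", "WWW", "MAIL", "WWW", "P2P"]   # 5 trees vote on one instance
print(margin(votes, "WWW"))                    # 3/5 - 1/5, i.e. about 0.4
```

A positive margin means the forest classifies the instance correctly, which is why the generalization error PE* is the probability that the margin falls below zero.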
IG(X, Y) = H(X) − H(X|Y)    (2)
H(X) = −Σ_x p(x) log2 p(x)    (3)
III. COMPARATIVE RESULTS
This section explores the results obtained from Weka simulation for C4.5 and RF. The evaluations use a network traffic dataset [24] consisting of 24000 instances, 248 attributes, and 11 classes, including WWW, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, and INTERACTIVE. Two significant factors are discussed: overall classification accuracy and processing time.
where H(X|Y) = −Σ_y p(y) Σ_x p(x|y) log2 p(x|y)    (4)
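The entropy and gain-ratio formulas in Equations (1), (3), and (4) can be computed in a few lines. This is an illustrative stdlib sketch under stated assumptions: the toy feature/class values are hypothetical, and C4.5 itself (Weka's J48) adds further refinements such as split-information normalization and pruning.

```python
# Direct computation of H(X), H(X|Y), and the gain ratio from Eqs. (1), (3), (4).
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x)   -- Eq. (3)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)   -- Eq. (4)."""
    n = len(xs)
    by_y = {}
    for x, y in zip(xs, ys):
        by_y.setdefault(y, []).append(x)
    # p(y) weighted entropy of X within each Y-group.
    return sum((len(group) / n) * entropy(group) for group in by_y.values())

def gain_ratio(xs, ys):
    """GainRatio(X, Y) = (H(X) - H(X|Y)) / H(X)   -- Eq. (1)."""
    h = entropy(xs)
    return (h - conditional_entropy(xs, ys)) / h

classes = ["WWW", "WWW", "MAIL", "MAIL"]
feature = ["short", "short", "long", "long"]   # perfectly separates the classes
print(gain_ratio(classes, feature))            # 1.0: the feature removes all uncertainty
```

A feature that perfectly separates the classes drives H(X|Y) to zero and the gain ratio to 1, which is exactly why C4.5 selects the highest-gain-ratio feature as the next test node.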
A. Overall Accuracy of Classification

In this subsection, the overall classification accuracy is evaluated for both C4.5 and RF. The results reveal the ability of both methods to classify various applications correctly. The overall accuracy can be expressed by the following formula:
where p(x|y) is the conditional probability and p(x), p(y) are the marginal probabilities.

B. Random Forest Method

The Random Forest (RF) algorithm is a classification method developed by Leo Breiman [16]. RF is an ensemble of binary decision trees. The algorithm adopts the bagging technique [22].
OA = Number of correctly classified instances / Total number of instances

Fig. 3. Overall accuracy of classification for C4.5 and RF

Fig. 4. Build model time for C4.5 and RF

RF offered faster performance in classifying the dataset, outperforming C4.5 by 159 sec.
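The overall-accuracy formula above is a simple ratio of correct predictions to total instances. A minimal sketch with hypothetical predicted/actual labels (not the paper's actual Weka results):

```python
# Overall accuracy: correctly classified instances / total instances.
def overall_accuracy(predicted, actual):
    """Fraction of instances whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual    = ["WWW", "MAIL", "P2P", "WWW", "ATTACK"]
predicted = ["WWW", "MAIL", "WWW", "WWW", "ATTACK"]
print(overall_accuracy(predicted, actual))   # 0.8 -> 80% overall accuracy
```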
Figure 3 shows the overall accuracy results for both C4.5 and RF; C4.5 achieves the highest performance in inspecting and classifying the various applications. C4.5 reaches 100% overall accuracy for 1000, 2000, 4000, and 8000 instances. Afterwards, the overall accuracy decreases steadily to 99.67%. For RF, the overall accuracy is lower than C4.5: it starts at 100% for 1000 and 2000 instances and then decreases gradually to 98.64% for 24000 instances. As a result, C4.5 achieved better overall classification accuracy than RF on this dataset; the average accuracy over the different instance counts is 99.88% for C4.5 and 99.46% for RF.

B. Processing Time

Processing time is the second factor in the comparison between C4.5 and RF. Most studies have focused on classification accuracy and memory consumption while ignoring processing time, even though it is an important factor, especially for online network traffic classification. We measure both the time taken to construct the classification model and the total time to classify the entire dataset. Figure 4 shows the build model time for RF and C4.5. RF spent 0.11 sec building the model to classify 1000 instances, while C4.5 initially needed only 0.04 sec. Thereafter, C4.5 increased steeply, taking 0.26 sec to build the model for 2000 instances, while RF increased gently to 0.15 sec. At 24000 instances, the build model time had risen slightly for RF, to 8.41 sec, while C4.5 increased sharply, reaching 34.2 sec.
Fig. 5. Processing time for C4.5 and RF

IV. CONCLUSION
In this paper, we presented a comparative study of two common decision tree methods, C4.5 and Random Forest, in the area of network traffic classification. We first explored the decision tree methodology for classification, and then examined each method individually in some detail. Each method was evaluated in terms of classification accuracy and processing time. C4.5 achieved a high classification accuracy,
Figure 5 shows the total processing time for different numbers of instances for both C4.5 and RF. C4.5 required 1.5 sec to classify 1000 instances, while RF consumed 1 sec. The duration increased steadily, reaching 213 sec for RF versus 372 sec for C4.5 at 24000 instances.
reaching up to 99.67%, whilst RF realized faster performance than C4.5, with a difference of up to 159 sec.

V.
ACKNOWLEDGMENT

This research is partially funded by the fundamental grant "Fundamental study on the effectiveness measurement of partially executed t-way test suite for software testing" from the Ministry of Higher Education (MOHE), and by a UniMAP short term grant.

REFERENCES

[1] A. Callado, C. Kamienski, S. Fernandes, and D. Sadok, "A Survey on Internet Traffic Identification and Classification," 2009.
[2] T. T. Nguyen and G. Armitage, "A survey of techniques for internet traffic classification using machine learning," IEEE Communications Surveys & Tutorials, vol. 10, pp. 56-76, 2008.
[3] J. Park, H.-R. Tyan, and C.-C. Kuo, "Internet traffic classification for scalable QoS provision," in Multimedia and Expo, 2006 IEEE International Conference on, 2006, pp. 1221-1224.
[4] IANA: Internet Assigned Numbers Authority.
[5] T. Karagiannis, A. Broido, and M. Faloutsos, "Transport layer identification of P2P traffic," in Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, 2004, pp. 121-134.
[6] S. Sen, O. Spatscheck, and D. Wang, "Accurate, scalable in-network identification of P2P traffic using application signatures," in Proceedings of the 13th international conference on World Wide Web, 2004, pp. 512-521.
[7] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, "ACAS: automated construction of application signatures," in Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, 2005, pp. 197-202.
[8] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification," in Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, 2004, pp. 135-148.
[9] A. W. Moore and D. Zuev, "Internet traffic classification using bayesian analysis techniques," in ACM SIGMETRICS Performance Evaluation Review, 2005, pp. 50-60.
[10] T. Auld, A. W. Moore, and S. F. Gull, "Bayesian neural networks for internet traffic classification," IEEE Transactions on Neural Networks, vol. 18, pp. 223-239, 2007.
[11] L. Yingqiu, L. Wei, and L. Yunchun, "Network traffic classification using k-means clustering," in Computer and Computational Sciences, 2007. IMSCCS 2007. Second International Multi-Symposiums on, 2007, pp. 360-365.
[12] Z. Li, R. Yuan, and X. Guan, "Accurate classification of the internet traffic based on the SVM method," in Communications, 2007. ICC'07. IEEE International Conference on, 2007, pp. 1373-1378.
[13] L. Bernaille, R. Teixeira, and K. Salamatian, "Early application identification," in Proceedings of the 2006 ACM CoNEXT conference, 2006, p. 6.
[14] T. M. Mitchell, Machine Learning. Burr Ridge, IL: McGraw Hill, 1997.
[15] J. R. Quinlan, C4.5: Programs for Machine Learning, vol. 1. Morgan Kaufmann, 1993.
[16] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[17] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2001.
[18] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, "Machine learning, neural and statistical classification," 1994.
[19] D. Che, J. Zhao, L. Cai, and Y. Xu, "Operon prediction in microbial genomes using decision tree approach," in Computational Intelligence and Bioinformatics and Computational Biology, 2007. CIBCB'07. IEEE Symposium on, 2007, pp. 135-142.
[20] D. Che, Q. Liu, K. Rasheed, and X. Tao, "Decision tree and ensemble learning algorithms with their applications in bioinformatics," in Software Tools and Algorithms for Biological Systems. Springer, 2011, pp. 191-199.
[21] L. Fu, B. Tang, and D. Yuan, "The Study of Traffic Classification Methods Based on C4.5 Algorithm," 2012.
[22] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[23] O. Okun, G. Valentini, and M. Re, Ensembles in Machine Learning Applications, vol. 373. Springer, 2011.
[24] A. W. Moore and K. Papagiannaki, "Toward the accurate identification of network applications," in Passive and Active Network Measurement. Springer, 2005, pp. 41-54.