
Traffic Classification with On-Line Ensemble Method

Erico N. de Souza

Stan Matwin

Stenio Fernandes

Faculty of Computer Science, Dalhousie University, Canada. Email: [email protected]

Faculty of Computer Science, Dalhousie University, Canada; Institute of Computer Science, Polish Academy of Sciences, Poland. Email: [email protected]

Federal University of Pernambuco (UFPE), Center for Informatics (CIn), Recife, Pernambuco, Brazil. Email: [email protected]

Abstract—Traffic classification helps network managers to control the services and activities of their users. Traditionally, Machine Learning (ML) is a tool that helps managers to detect the most used applications and to offer different types of services to their clients. Most ML algorithms are designed to deal with limited amounts of data, which is a problem in the network context because of the large volume, speed and diversity of the data. More recent works try to solve this issue by using ML algorithms developed to work with data streams, but they tend to implement only Very Fast Decision Trees (VFDT). This work goes in a different direction by proposing the use of Ensemble Learners (EL), which, theoretically, offer more capability to deal with non-linear problems. The paper proposes a new EL called OzaBoost Dynamic (OzaDyn), and compares its performance with other ensemble methods designed to deal with data streams. Results indicate that the accuracy of OzaDyn is equal to that of other ensemble methods, while it helps reduce the memory consumption and the time needed to evaluate the models.

I. INTRODUCTION

Characterization of Internet traffic has become, over the past few years, one of the major challenges in telecommunication networks. The increasing volume, velocity and veracity of the data (the three V's of Big Data) profile this problem as a Big Data issue. In this sense, traditional Machine Learning (ML) algorithms cannot be used, because they are not able to update their models according to variations in the data set. Looking at network traffic classification from the Big Data perspective, velocity is a problem because network administrators have to deal with network speeds that produce terabytes of data. This creates a second problem, related to volume: managers are not able to store such large amounts of data for mining. Most ML algorithms work at the flow level, and the current challenge is to make them achieve higher levels of accuracy. Veracity is also a problem, because developers have started to hide applications' ports to evade firewall blocking (e.g., Peer-to-Peer (P2P) applications) [1].

Traditional ML algorithms cannot be used under the listed constraints, and this is the main motivation of this work: to present a different, general ensemble ML algorithm that offers promising results on data streams.

This paper is based on the results of a previous work [2], and tries to answer some of the issues raised there related to the speed of the learning process proposed in that work. We propose to use OzaBoost Dynamic (OzaDyn), introduced in [3], [4], as the ensemble on top of the pre-processing phase presented in [2], which helped reduce the ML algorithms' model size.

This paper is organized as follows. Section II presents the related work, Section III introduces the OzaBoost Dynamic algorithm and presents the main differences as compared to the original AdaBoost, Section IV shows the main results of the comparisons using two different data sets, and, finally, Section V presents the main conclusions and directions for future work.

II. RELATED WORK

The majority of works present ML solutions in the context of a separate training and testing stage. For instance, Callado et al. [5] present an evaluation of various ML algorithms that classify applications based on flow records. They argue that most algorithms perform well on one or two data distributions, but no algorithm has proven better than the others in the majority of scenarios. One interesting conclusion of that paper is that a combination of algorithms seems to offer better results than any single algorithm. This serves as a motivation to use ensemble learners for traffic classification.

A work that uses an ensemble learner to detect, in one network, labels of application types learned in another network is presented in [6]. Its results show that AdaBoost.M1 had performance similar to Decision Trees. Stolfo et al. [7] also used a variation of AdaBoost for a specific task: intrusion detection. Their chosen algorithm was AdaCost, a type of AdaBoost.M1 algorithm. The main problem with AdaCost is that the user must have in-depth knowledge of the data set to choose the correct weight values.

In [2], the authors gave two contributions: 1) a new pre-processing step that helps to reduce the size of the classifiers, by applying a discretization based on the class of the destination IP feature, and 2) the use of a variant of AdaBoost, called AdaBoost Dynamic with Logistic Loss (AB-DL) [8], [9]. Results indicate that AB-DL has accuracy comparable to a Decision Tree, but its training time is longer than the Decision Tree's.


The main advantage of AB-DL is its resistance to overfitting, as presented in [4]. More recently, [10] presented a solution that goes in the direction of on-line learning. That work proposes to use VFDTs, assuming that the traffic flow is a data stream. The main advantage is that VFDTs allow the system to be updated very fast, so their approach can keep being trained while remaining able to classify new traffic. Two more recent works, presented in [11], [12], go in the direction of implementing ML algorithms that exploit parallel computing (Graphics Processing Units - GPUs).

This work uses the same pre-processing strategy presented in [2], but with a different learning algorithm, built around the ensemble learning proposed in [3], [4], called OzaBoost Dynamic (OzaDyn). In general, the majority of works use VFDTs and avoid ensemble learners, because ensemble learners demand more memory to build their models than decision trees. Although this higher cost is real, ensemble learners have interesting properties that give more assurance about the quality of the results.

III. OZABOOST DYNAMIC: ADABOOST FOR DATA STREAMS

A natural extension of Very Fast Decision Trees (VFDTs) is to use ensemble methods that combine different algorithms to predict results from data streams. The issue with ensemble methods is that they are slower than decision trees, though they do offer better accuracy if the data stream cannot be linearly separated. The two main problems in adapting AdaBoost to data streams are: 1) the algorithm depends on multiple passes over the data set, and 2) it requires previous knowledge of the size of the data set to calculate the weight distribution. One way to solve these issues is to create a parallel boosting approach, which feeds new examples to the algorithm as they arrive and updates multiple models in parallel.

Oza and Russell [13] introduced OzaBoost, a parallel boosting strategy that follows the same approach as AdaBoost with the exception of the weight calculation. OzaBoost's first problem is its weight calculation for the misclassified instances. AdaBoost.M1 knows the data set size, but with data streams this is not possible. So, instead of normalizing the weights based on the "weak" learner's error rate, OzaBoost uses the Poisson distribution to generate the weights (see the sketch below). This has some theoretical issues [14], because the weights are based on a distribution that is not related to the original data set's distribution. OzaDyn's [3] main advantage relies on its proposed weight calculation, which has the theoretical guarantees presented in [4]. The second issue with OzaBoost is related to the number of possible learners to be used; OzaDyn allows the use of multiple learners during the training phase.
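To make the Poisson-based weighting concrete, the following minimal Python sketch illustrates one OzaBoost training step as described by Oza and Russell [13]; the incremental train/predict interface of the base models is an assumption made for illustration, not MOA's actual API.

```python
import math
import random


def poisson(lam):
    """Draw k ~ Poisson(lam) with Knuth's inversion method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1


def ozaboost_update(models, stats, x, y, n_seen):
    """One OzaBoost step: each base model sees the example k ~ Poisson(lam)
    times, and lam grows on the models that misclassify the example,
    mimicking AdaBoost's re-weighting without knowing the data set size."""
    lam = 1.0
    for t, model in enumerate(models):
        for _ in range(poisson(lam)):
            model.train(x, y)                     # assumed incremental API
        if model.predict(x) == y:
            stats[t]["sc"] += lam                 # weight mass seen as correct
            lam *= n_seen / (2 * stats[t]["sc"])
        else:
            stats[t]["sw"] += lam                 # weight mass seen as wrong
            lam *= n_seen / (2 * stats[t]["sw"])
```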

The proposed algorithm, presented in Table I, performs as follows. The first step initializes the array $h_t$, indexing all the learners and the variables that will store the number of correct classifications ($sc_t$) and incorrect classifications ($sw_t$). The algorithm computes the simplified weight $\frac{1(1-\gamma)}{1+e^{-(1-(\varepsilon(1-\varepsilon)))}}$, where $\varepsilon$ is the error of the "weak" learner and $\gamma$ is a learning rate defined by the user, with a value in $[0,1]$. The algorithm then checks whether the new $h_t$ correctly classifies the instance (line 7). If it does, $sc_t$ is updated with the value of $d$ (line 8), and $d$ is updated so that its value equals $d \cdot \frac{N}{2 sc_t}$ (line 9). When it is detected that $h_t$ misclassified the example, the algorithm updates $sw_t$ in the same way as in line 8, and $d$ becomes $d \cdot \frac{N}{2 sw_t}$ (lines 11 and 12). The algorithm's output is given by $H_{final} = \arg\max_{y \in Y} \sum_{t:h_t(x)=y} \beta_t$, where $\epsilon_t = \frac{sw_t}{sw_t + sc_t}$ and $\beta_t = 1 - (\epsilon_t(1-\epsilon_t))$. A minimal Python sketch of the full procedure is given after Table I.

The implementation provided by Massive Online Analysis (MOA) improves on this algorithm by adapting data windows into OzaBoost; in MOA this is known as OzaBoostAdwin. ADWIN is a statistical drift detector that keeps a window of length $W$ using $O(\log W)$ memory and $O(\log W)$ processing time per item. All the experiments in this work, and the modifications to the OzaBoost algorithm, were done in OzaBoostAdwin, because of these advantages for detecting variations in the data.

Input: Base models $h_t$, where $h_t \neq h_{t+1}$; $T$: number of base models; user-defined $\gamma \in [0,1]$; a sequence of $m$ examples $\langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$ with labels $y_i \in Y = \{1,\ldots,k\}$
01. Initialize base models $h_t$ for all $t \in \{1,2,\ldots,T\}$; $sc_t = 0$, $sw_t = 0$
02. for all training examples do
03.   Set "weight" of example $d = 1$
04.   Set current error $\varepsilon = 0.99$
05.   for all $t$ do
06.     Update $h_t$ with current example with weight $\frac{1(1-\gamma)}{1+e^{-(1-(\varepsilon(1-\varepsilon)))}}$
07.     if $h_t$ correctly classifies the example then
08.       $sc_t \leftarrow sc_t + d$
09.       $d \leftarrow d \cdot \frac{N}{2 sc_t}$
10.     else
11.       $sw_t \leftarrow sw_t + d$
12.       $d \leftarrow d \cdot \frac{N}{2 sw_t}$
13.       $\varepsilon \leftarrow \frac{sw_t}{sw_t + sc_t}$
14.     end if
15.   end for
16. end for
Anytime output:
Calculate $\epsilon_t = \frac{sw_t}{sw_t + sc_t}$ and $\beta_t = 1 - (\epsilon_t(1-\epsilon_t))$ for all $t$
Return: $H_{final} = \arg\max_{y \in Y} \sum_{t:h_t(x)=y} \beta_t$

TABLE I
OZABOOST DYNAMIC ALGORITHM THAT ADAPTS THE WEIGHT CALCULATION USED IN THE AD-AC ALGORITHM TO THE OZABOOST ALGORITHM. $N$ IS THE NUMBER OF EXAMPLES SEEN.
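As a companion to Table I, the sketch below gives one possible reading of OzaDyn in plain Python; the train(x, y, weight)/predict(x) interface of the base models is assumed for illustration and does not correspond to MOA's API.

```python
import math


class OzaBoostDynamic:
    """Minimal sketch of Table I; base models are assumed to expose an
    incremental train(x, y, weight) method and a predict(x) method."""

    def __init__(self, base_models, gamma=0.5):
        self.models = base_models            # h_1..h_T, possibly heterogeneous
        self.gamma = gamma                   # user-defined gamma in [0, 1]
        self.sc = [0.0] * len(base_models)   # weight of correct classifications
        self.sw = [0.0] * len(base_models)   # weight of incorrect classifications
        self.n_seen = 0                      # N, the number of examples seen

    def train(self, x, y):
        self.n_seen += 1
        d, eps = 1.0, 0.99                   # lines 03-04
        for t, h in enumerate(self.models):
            # Line 06: deterministic logistic weight instead of a Poisson draw.
            w = (1 - self.gamma) / (1 + math.exp(-(1 - eps * (1 - eps))))
            h.train(x, y, weight=w)
            if h.predict(x) == y:            # lines 07-09
                self.sc[t] += d
                d *= self.n_seen / (2 * self.sc[t])
            else:                            # lines 10-13
                self.sw[t] += d
                d *= self.n_seen / (2 * self.sw[t])
                eps = self.sw[t] / (self.sw[t] + self.sc[t])

    def predict(self, x):
        # Anytime output: each h_t votes with beta_t = 1 - eps_t(1 - eps_t).
        votes = {}
        for t, h in enumerate(self.models):
            total = self.sw[t] + self.sc[t]
            eps_t = self.sw[t] / total if total else 0.0
            label = h.predict(x)
            votes[label] = votes.get(label, 0.0) + 1 - eps_t * (1 - eps_t)
        return max(votes, key=votes.get)
```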

IV. OZABOOST DYNAMIC COMPARISON WITH A REAL DATA SET

This section shows the comparison results of OzaDyn on two network data sets using the Prequential approach [14], in which each arriving example is first used to test the current model and only then to train it. The problem in this case is predicting which application is generating a data flow.
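For reference, a minimal sketch of the Prequential (test-then-train) protocol is shown below; the stream iterator and the learner interface are illustrative placeholders, not the MOA evaluator actually used in the experiments.

```python
def prequential_accuracy(learner, stream):
    """Prequential evaluation: every example is first used to test the
    current model and only then to train it, so a data stream needs no
    separate hold-out set."""
    correct = total = 0
    for x, y in stream:          # stream yields (features, application label)
        if learner.predict(x) == y:
            correct += 1         # test on the example before learning it
        total += 1
        learner.train(x, y)      # then use the same example for training
    return correct / total if total else 0.0
```

This is the protocol behind the accuracy curves in Figure 1.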

Fig. 1. This figure presents the performance of OzaDyn with one classifier (OzaBoostDynamic - 1 Classifier) and with two classifiers (OzaBoostDynamic - 2 Classifiers) compared to the original OzaBoost on the network data sets. Figures (a) and (b) compare the accuracy of the algorithms, figures (c) and (d) compare the evaluation time, and (e) and (f) present the model size.

This information is captured with the Ground Truth (GT) [15] tool, which performs two main tasks: 1) it scans the network traffic, and 2) it associates this information with the applications executed on the end-hosts under observation. The GT developers made their data publicly available on their website (http://www.ing.unibs.it/ntw/tools/gt/), where it can be requested for testing. In this work two data sets were used: the GT data set, and data collected from a university network using the GT tool. The university data set is not public [16], because it contains real traffic data. The data sets share the following features: time stamp, source IP, destination IP, source transport port, destination transport port, Deep Packet Inspection (DPI) label(s), application name and transport protocol. The results of the signature-based analysis (known as DPI) and the application name were generated by the GT tool. The transport protocol only informs which protocol was used, Transmission Control Protocol (TCP) or User Datagram Protocol (UDP). Data from GT had both protocols, but the university data had only TCP connections.
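Purely as an illustration, the shared feature set can be pictured as the following record layout; the field names are our own and not the GT tool's actual schema.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """One flow record with the features common to both data sets."""
    timestamp: float     # time stamp of the flow
    src_ip: str          # source IP
    dst_ip: str          # destination IP
    src_port: int        # source transport port
    dst_port: int        # destination transport port
    dpi_labels: tuple    # Deep Packet Inspection label(s) from the GT tool
    application: str     # application name (the class to predict)
    protocol: str        # "TCP" or "UDP"
```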

Data from the GT website had 78,999 flow records, and data from the university had 7,669 records. These data sets cannot be considered true data stream sets, because they are very small compared to other data stream sets.

Figure 1 shows the performance of OzaBoostDynamic with one classifier (VFDT) and with two classifiers (VFDT and Naive Bayes Multinomial - NBM), compared with traditional OzaBoost using VFDT. Results indicate that the proposed approach has close to the same accuracy as traditional OzaBoost, with the advantage of reducing the memory footprint and the evaluation time.

V. CONCLUSION

This work presents the use of a new Machine Learning algorithm, OzaDyn, designed to work with data stream problems. OzaDyn was developed on top of the OzaBoost algorithm, which is itself an adaptation of AdaBoost. Unfortunately, one problem with OzaBoost is related to its theoretical underpinnings, since its weight calculation makes assumptions about the data distribution. These problems are solved with OzaDyn, as presented in [4].

Furthermore, this work also extends a previous work [2], in which a variation of AdaBoost Dynamic (AB-DL) was tested on network traffic. In [2], it was noticed that AB-DL was very slow to build the final model, although it offered an alternative to Decision Trees. Tests were executed on the same data sets used in [2], also using the same pre-processing step described in that paper. Results indicate that OzaDyn is as accurate as OzaBoost, but less memory demanding. It is also important to mention that the authors also conducted experiments with VFDT, not shown in this work; its general accuracy was the same as that of the proposed method, while VFDT also had lower memory consumption than OzaDyn. This is an expected result, since VFDT is an optimized Decision Tree, whereas OzaDyn is a combination of multiple models, which requires more memory to process. As future work, the authors wish to extend the implementation of OzaDyn further by taking advantage of parallel processing based on GPUs, which will allow further comparisons between OzaDyn and VFDT. As already mentioned in the text, the data sets used cannot be considered stream data because of their volume; our next priority is to collect more network data to further evaluate the algorithms.

REFERENCES

[1] A. Callado, C. Kamienski, G. Szabo, B. Gero, J. Kelner, S. Fernandes, and D. Sadok, "A survey on internet traffic identification," IEEE Communications Surveys & Tutorials, vol. 11, no. 3, pp. 37-52, 2009.
[2] E. N. de Souza, S. Matwin, and S. Fernandes, "Network traffic classification using AdaBoost Dynamic," in Communications Workshops (ICC), 2013 IEEE International Conference on, 2013, pp. 1319-1324.
[3] E. N. de Souza and S. Matwin, "Improvements to boosting with data streams," in Canadian Conference on AI, ser. Lecture Notes in Computer Science, O. R. Zaïane and S. Zilles, Eds., vol. 7884. Springer, 2013, pp. 248-255.
[4] E. N. de Souza, "Extending AdaBoost: varying the base learners and modifying the weight calculation," Ph.D. dissertation, University of Ottawa, May 2014.
[5] A. Callado, J. Kelner, D. Sadok, C. Alberto Kamienski, and S. Fernandes, "Better network traffic identification through the independent combination of techniques," J. Netw. Comput. Appl., vol. 33, pp. 433-446, July 2010.
[6] G. Zou, G. Kesidis, and D. J. Miller, "A flow classifier with tamper-resistant features and an evaluation of its portability to new domains," IEEE Journal on Selected Areas in Communications, vol. 29, no. 7, pp. 1449-1460, August 2011.
[7] S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan, "Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection," DARPA Information Survivability Conference and Exposition, 2000. DISCEX '00. Proceedings, vol. 2, pp. 130-144, 2000.
[8] E. N. de Souza and S. Matwin, "Improvements to AdaBoost Dynamic," in Canadian Conference on AI, ser. Lecture Notes in Computer Science, L. Kosseim and D. Inkpen, Eds., vol. 7310. Springer Berlin / Heidelberg, 2012, pp. 293-298.
[9] ——, "Extending AdaBoost to iteratively vary its base classifiers," in Advances in Artificial Intelligence, ser. Lecture Notes in Computer Science, C. Butz and P. Lingras, Eds. Springer Berlin / Heidelberg, 2011, vol. 6657, pp. 384-389.
[10] X. Tian, Q. Sun, X. Huang, and Y. Ma, "Dynamic online traffic classification using data stream mining," in MultiMedia and Information Technology, 2008. MMIT '08. International Conference on, Dec 2008, pp. 104-107.
[11] A. Feitoza Santos, S. F. de Lacerda Fernandes, P. Gomes Lopes Júnior, D. Fawzi Hadj Sadok, and G. Szabo, "Multi-gigabit traffic identification on GPU," in Proceedings of the First Edition Workshop on High Performance and Programmable Networking, ser. HPPN '13. New York, NY, USA: ACM, 2013, pp. 39-44. [Online]. Available: http://doi.acm.org/10.1145/2465839.2465845
[12] P. Lopes, S. Fernandes, W. Melo, and D. H. Sadok, "GPU-oriented stream data mining traffic classification," ISCC, IEEE, June 2014, to appear.
[13] N. C. Oza and S. Russell, "Online bagging and boosting," in Eighth International Workshop on Artificial Intelligence and Statistics, T. Jaakkola and T. Richardson, Eds. Key West, Florida, USA: Morgan Kaufmann, 2001, pp. 105-112.
[14] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "Data stream mining: a practical approach," University of Waikato, Tech. Rep., May 2011.
[15] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, and k. c. claffy, "GT: picking up the truth from the ground for internet traffic," SIGCOMM Comput. Commun. Rev., vol. 39, pp. 12-18, October 2009.
[16] S. Fernandes, private communication.