Network Data Classification Using Graph Partition - IEEE Xplore

Sahan L. Maldeniya∗, Ajantha S. Atukorale†, Wathsala W. Vithanage‡
University of Colombo School of Computing, No 35, Reid Avenue, Colombo 00700, Sri Lanka
∗ [email protected][email protected][email protected]

Abstract: Network classification is applied in many domains, ranging from preserving the quality of a network to analyzing the personal characteristics of network users. However, the methods currently applied to network data classification do not meet expectations. Networks are dynamic and prone to rapid change, while the methods used for classification have either been trained on examples or defined by heuristics. The World Wide Web is itself a large graph made of URLs connected to each other by hyperlinks. In this work we use this graph nature of the WWW and apply graph theory to partition the network in order to classify network data. We evaluate the performance and usability of the proposed method against results obtained by classifying the same network traffic with the k-means algorithm.

I. INTRODUCTION

The Internet changes rapidly as technology changes, and it in turn changes people, their behaviors and their attitudes. From a medium for delivering simple static web pages and email in its early years, the Internet has evolved into a weapon used to revolt against tyrants. The study of the Internet has become a science [1] that covers not only technological aspects but also the patterns, social effects and trends that the Internet has produced.

While managing a rapidly growing user base, ISPs¹ have to provide a safe and secure service, maintain a good QoS² and allocate satisfactory bandwidth to their users. Internet capacity is becoming a scarce resource, which has drawn the attention of even the UN³, which imposed new regulations to control unlimited usage of the Internet. To allocate this limited resource fairly and securely while maintaining a good QoS, ISPs and network administrators are turning to traffic analysis and network classification methods.

The origin of network classification goes back to the early 2000s, when the user base was much smaller than at present and the applications of the Internet were far more limited. One of the earliest methods was port-based traffic classification [2], in which network traffic is classified using the TCP/UDP port numbers of the data packets. It assumes that applications keep using the port numbers dedicated to them. Another early method was payload-based classification [2], which relies on application-specific data. It can be divided into protocol decoding, which uses the application protocol data, and signature-based identification, which searches the packet payload for an application-specific byte sequence. However, with the growth and evolution of the Internet these methods ran into problems of their own: an application no longer uses a single dedicated port, ports are allocated on demand, payload processing places a load on networking devices, encrypted data is difficult to process, and payload inspection can breach privacy policies. To overcome these difficulties, researchers turned to other methods such as machine learning, statistical and heuristic-based methods.

In this paper we introduce a novel approach that classifies network data using graph theory. The Internet (or World Wide Web) is by nature a large graph in which URLs or URIs act as nodes and hyperlinks act as edges that connect the nodes to each other in multiple ways. Because the World Wide Web has the properties of a graph, graph partition methods can be used to classify or cluster a graph created from data collected as network traffic. We used the Louvain algorithm [3] as the graph partition method. Network traffic collected from the network operation center of the University of Colombo School of Computing was used to create graphs, partition them and evaluate the method.

The rest of this paper is organized as follows. Section II discusses related work in the network traffic classification domain. Section III explains the Louvain algorithm, which is the graph partition algorithm used in this work. Section IV describes our graph creation and traffic classification approach. Section V discusses the results and Section VI concludes our work.

II. RELATED WORKS

1 Internet Service Providers
2 Quality of Service
3 United Nations

978-1-4799-2084-6/13/$31.00 ©2013 IEEE

ICON 2013

Network traffic classification has been an active research area for more than a decade. In this section we describe several studies in network traffic classification that are related to our work.

When traditional methods for network traffic classification were no longer usable, most researchers turned to machine learning, which was at that time an active and successful approach in other domains. Both supervised and unsupervised machine learning approaches have been applied to classify network traffic.

In 2004 Roughan et al. [4] published an analysis of the usability of the nearest neighbors (NN), linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) algorithms for classifying labeled traffic data. They found that average packet length and flow duration were the most significant of the features they considered. Their results showed that the lowest error rate was obtained by the classification that used three protocol classes, comprising FTP, telnet and real media flows. Moore and Zuev in 2005 proposed using the Naive Bayes technique to classify labeled traffic data [5] [6]. They improved their classifier by combining the Naive Bayes algorithm with kernel estimation and a fast correlation-based filter, which improved both performance and flow accuracy over the simple Naive Bayes classifier. In 2006, Park et al. [7] compared the Naive Bayesian classifier with Kernel Estimation (NBKE), the Decision Tree and the Reduced Error Pruning Tree to test their suitability for the traffic classification domain. Their results showed that the Decision Tree and the Reduced Error Pruning Tree can achieve higher accuracy than NBKE.

In 2006 Nguyen and Armitage [8] proposed a method to address timely and continuous classification of network traffic by using the most recent N packets of a flow, which they called a classification sliding window. The distinctive feature of this classifier is that it does not need to capture the start of a traffic flow, so classification can be initiated at any point, even when flows are already in progress. Kiviat graphs [9] have been used to visualize the resulting clusters because cluster meaning is easier to interpret from them than from inter-arrival time/packet size plots. In 2004, McGregor et al. [10] suggested a method that classifies network traffic by classifying packet headers with the expectation maximization algorithm.

In 2006 Erman et al. [11] compared three clustering algorithms and their use in network traffic classification. They evaluated the results of the k-means and DBSCAN algorithms against results previously published for the AutoClass algorithm by Zander et al. [12]. The authors stated that the k-means and AutoClass algorithms produce more evenly distributed clusters than DBSCAN. They found that DBSCAN does not produce evenly distributed clusters because it tries to include noisy data in the existing clusters. Their results also show that k-means clusters data faster than the other two algorithms. Based on these results, we used the k-means algorithm as the baseline against which to compare our classifier.

In 2011 Atukorale and Vithanage presented a classifier [13] that works on the intensity of requests made to each website domain in the short run and the long run. This classifier is capable of clustering four website domains. Kohonen's self-organizing map [14] was used to identify the most effective features after training with traffic flows collected from the access logs of HTTP proxy servers.

Although machine learning has become an active participant in the network traffic classification domain, statistical and heuristic-based methods have also contributed to it. In 2007, Crotti et al. [15] proposed a traffic classification method that uses statistical data, which they refer to as statistical fingerprints. The fingerprint for a given application-layer protocol is generated by evaluating a set of probability density functions estimated from a set of flows. Sengupta and Sil [16] in 2012 published work on classifying network traffic using rough set theory.

Classification methods based on graph theory are relatively new to the network traffic classification domain, because the efficiency of these methods was a concern in the past. Nevertheless, in 2003 Nobel et al. [16] proposed two methods to detect anomalies in graphs, one by detecting anomalous substructures and one by detecting anomalous sub-graphs. Siska et al. [17] in 2010 proposed a flow trace generator, built on graph-based traffic classification techniques, to test and evaluate intrusion detection systems. Mehr et al. [18] in 2011 suggested a hybrid method to identify the similarity of web pages using distributed learning automata and graph partitioning. Our work is based on an efficient and accurate graph partition method proposed by Blondel et al. [3] in 2008, which is explained in more detail in Section III.
We use the modularity classes that result from running the Louvain algorithm on our data set to partition a graph created from that data set.

III. LOUVAIN ALGORITHM

This algorithm, proposed by Blondel et al. [3] in 2008, finds high-modularity partitions of a large network, with a complete unfolding of the hierarchical community structure, in a short time. It consists of two phases that are repeated iteratively. At the beginning, every node is assigned its own modularity class. In the first phase the algorithm moves nodes of different classes into common classes, forming communities of nodes, guided by the modularity gain achieved by moving a node into such a class. A node joins the neighboring community that yields the highest positive modularity gain. The second phase builds a new network from the communities found in the first phase, treating each community as a node of the new network. An edge is placed between two new nodes when the communities they represent were connected in the previous step. The weight of each new edge is the sum of the weights of the links between the nodes of the two connected communities.

A. Modularity of a partition.

Modularity measures the quality of a partition. Consider a weighted graph $G$ in which $i$ and $j$ are nodes and $c_i$ denotes the community of node $i$. The modularity value $Q \in [-1, 1]$ is calculated by

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (1)$$

where $A_{ij}$ is the adjacency matrix representing the graph $G$, $\delta(c_i, c_j)$ is 1 when $c_i = c_j$ and 0 otherwise, $k_i$ is the degree of node $i$ and $m$ is the total weight, given by

$$k_i = \sum_{j} A_{ij}, \qquad m = \frac{1}{2} \sum_{i,j} A_{ij}$$
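As an illustration of Eq. (1), the following minimal sketch evaluates Q directly from an adjacency matrix. The helper names (`build_adjacency`, `modularity`) and the toy graph are assumptions made for this example only; the double loop is O(n²) and is meant for small graphs, not for the data sets used in this work.

```python
def build_adjacency(n, edges):
    """Return a symmetric n x n matrix A from (i, j, weight) triples."""
    adj = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        adj[i][j] += w
        adj[j][i] += w
    return adj


def modularity(adj, community):
    """Evaluate Eq. (1); community[i] is the label c_i of node i."""
    n = len(adj)
    degree = [sum(row) for row in adj]      # k_i = sum_j A_ij
    two_m = sum(degree)                     # 2m = sum_ij A_ij
    if two_m == 0:
        return 0.0                          # no edges: Q is taken as 0 here
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:        # delta(c_i, c_j) = 1
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 1)]
A = build_adjacency(6, EDGES)
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0] * 6)             # everything in one community
```

For this toy graph, placing each triangle in its own community gives Q = 5/14, while a single community containing every node gives Q = 0, which matches the intuition that modularity rewards dense groups that are sparsely connected to each other.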

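B. Modularity gain.

The modularity gain $\Delta Q$ is calculated by formula (2), where $\Sigma_{in}$ denotes the sum of the weights of the links inside community $C$, $\Sigma_{tot}$ the sum of the weights of the links incident to nodes in $C$, $k_i$ the sum of the weights of the links incident to node $i$, $k_{i,in}$ the sum of the weights of the links from $i$ to nodes in $C$, and $m$ the sum of the weights of all links in the network.

$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \qquad (2)$$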
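Formula (2) can be checked numerically with a short sketch. The function name and the toy values are hypothetical. One assumption matters for the factor 2 in front of $k_{i,in}$: $\Sigma_{in}$ is taken to be accumulated in the same way as the sum over $A_{ij}$ in Eq. (1), so that each internal link is counted once in each direction.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Evaluate formula (2) for moving an isolated node i into community C."""
    two_m = 2.0 * m
    # Modularity contribution of C after node i has joined it.
    after = (sigma_in + 2.0 * k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    # Contribution of C and of the isolated node i before the move.
    before = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return after - before


# Hypothetical toy values: C holds two nodes joined by one unit-weight link
# (sigma_in = 2, sigma_tot = 4), node i has degree 3 with two links into C,
# and the whole graph has total weight m = 7.
gain = modularity_gain(sigma_in=2, sigma_tot=4, k_i=3, k_i_in=2, m=7)

# Expanding the squares in formula (2) leaves k_i_in / m - sigma_tot * k_i / (2 m^2).
simplified = 2 / 7 - (4 * 3) / (2 * 7 ** 2)
```

The simplified expression is what makes the gain cheap to evaluate: it depends only on the links between node i and the candidate community and on two running totals.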

Using the above formulas, the modularity classes of the graph are calculated as in Algorithm 1.

Algorithm 1 Louvain algorithm
1: Assign each node in the graph to its own community.
2: For each node v in graph G, calculate the modularity gain ΔQ between v and its neighbour nodes.
3: If moving v into the community of a neighbour node gives a positive modularity gain, add v to the community of that neighbour node and move to the next node in the list.
4: Carry on until no more optimization can be achieved. When modularity cannot be increased any further within the node communities, stop the current process and move to step 5.
5: Merge the nodes of each newly created community into a single node that represents the community, and connect two such nodes with an edge whose weight equals the sum of the weights of the edges linking the two communities.
6: Go to step 2 and repeat the process iteratively until no communities are left to merge.

IV. NETWORK DATA CLASSIFICATION METHOD

A. Data collection and processing.

We used Bro IDS [19] to collect data from the university network. We modified the code of Bro to remove privacy-related details such as email addresses, passwords and credit card numbers from the data set. The collected data are in text format. Bro writes the data to log files hourly, places them in a folder named after the date on which the data were collected, and creates a separate log file for each common protocol type. Bro considers protocols such as HTTP, HTTPS, FTP and SMTP to be common protocol types. It saves connection-related data in the same way. These log files were the raw data for our work. Because the log files exceeded 100 megabytes in size, we analyzed them with the terminal tools that ship with Linux distributions.

The majority of the network traffic we collected belonged to the HTTP and HTTPS protocols. Since our approach is based on creating a graph from URLs and hyperlinks, we used the network traffic of these two protocol types. In the HTTP and HTTPS log files, each outgoing request is logged on a separate line. To create a graph from these network dumps we needed URLs and hyperlinks, which Bro stores in the log files as a URL and the referrer of that URL. The log files also contain many other fields, such as the timestamp of the HTTP/HTTPS request, source IP address, destination IP address, source port and destination port. Each request can be distinguished from another by its source IP address, URL and timestamp. We therefore wrote several Python scripts to extract the source IP address, the URL, the timestamp and the URL that initiated the request. The extracted data were stored in a MySQL database because SQL makes the data easy to process and analyze.
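The extraction step can be sketched as follows. This is not the authors' script: it is a minimal illustration that assumes Bro's tab-separated ASCII log layout, in which a `#fields` header line names the columns and `-` marks an unset value, and it assumes the column names `ts`, `id.orig_h`, `host`, `uri` and `referrer`. The function name and the sample lines are hypothetical.

```python
def parse_bro_http_log(lines):
    """Yield (source_ip, url, timestamp, referrer) from Bro-style TSV lines."""
    fields = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]      # column names follow the tag
            continue
        if not line or line.startswith("#") or fields is None:
            continue                           # skip other header lines
        row = dict(zip(fields, line.split("\t")))
        host, uri = row.get("host", "-"), row.get("uri", "-")
        if host == "-":
            continue                           # no host: cannot form a URL
        url = host + (uri if uri != "-" else "")
        referrer = row.get("referrer", "-")
        yield (row.get("id.orig_h"), url, float(row["ts"]),
               None if referrer == "-" else referrer)


# Hypothetical sample with only the columns used above.
SAMPLE = [
    "#fields\tts\tid.orig_h\thost\turi\treferrer\n",
    "1370000000.5\t10.0.0.7\texample.org\t/index.html\t-\n",
    "1370000001.0\t10.0.0.7\texample.org\t/a.html\thttp://example.org/index.html\n",
]
records = list(parse_bro_http_log(SAMPLE))
```

Each record carries the three fields that identify a request (source IP address, URL, timestamp) plus the referring URL, which later supplies the edge of the graph. Rows with no referrer yield `None` in the last position and contribute a node but no edge.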
We created several data chunks from the stored data, each containing the network traffic of two consecutive days. These data chunks were used to create several graphs, and the results are compared in Section V-A.

B. Graph of network data creation approach.

We used an adjacency matrix to create the graph from the collected network traffic. Before creating the graph, each data chunk was analyzed to obtain the number of distinct URLs in that chunk. This number gives the dimension of the square matrix created for the chunk. The adjacency matrix is of integer type because each entry holds a count. Since URLs are character sequences, they have to be mapped to indexes. We used a simple hash function to achieve this URL-to-index mapping. The hash value calculated for a particular URL is valid only for that particular data set, so the hash values have to be recalculated every time a new data set is used. We used the MD5 algorithm as our hashing algorithm.

Algorithm 2 Index calculating function
  S ← number of unique URLs in dataset
  function INDEXOF(URL u, S)
    h ← MD5(u)
    return h mod S
  end function

Using Algorithm 2 we were able to index the URLs. Algorithm 3 was then used to create the network matrix, which is an adjacency matrix of the network traffic.

Algorithm 3 Matrix of network
  Dataset D
  U ← allUniqueURLsInDataSet(D)
  M[size of U][size of U]    ▷ Integer square matrix initialized to 0
  for u ∈ U do
    V ← getReferringURLsBy(u, D)    ▷ All URLs referred by u
    u_h ← indexOf(u, size of U)
    for v ∈ V do
      v_h ← indexOf(v, size of U)
      M[u_h][v_h] ← M[u_h][v_h] + 1
    end for
  end for
  return M
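A minimal Python sketch of Algorithms 2 and 3 is given below, under stated assumptions. `md5_index` follows Algorithm 2 as written. Reducing an MD5 digest modulo S can map two different URLs to the same index, so `build_matrix` swaps in a different technique for the example: it enumerates the sorted unique URLs in a dictionary, which is collision free. The function names and the sample pairs are hypothetical.

```python
import hashlib


def md5_index(url, size):
    """Algorithm 2 as stated: the MD5 digest of the URL reduced modulo size."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def build_matrix(pairs):
    """Build the count matrix M from (referring_url, referred_url) pairs.

    Uses a dictionary index (an assumption of this sketch) instead of
    md5_index, so that distinct URLs never share a row or column.
    """
    pairs = list(pairs)
    urls = sorted({url for pair in pairs for url in pair})
    index = {url: i for i, url in enumerate(urls)}
    matrix = [[0] * len(urls) for _ in urls]     # integer matrix of zeros
    for u, v in pairs:
        matrix[index[u]][index[v]] += 1          # M[u_h][v_h] <- M[u_h][v_h] + 1
    return matrix, index


# Hypothetical referrer pairs: u refers to v.
PAIRS = [("a.org/", "b.org/x"), ("a.org/", "b.org/x"), ("b.org/x", "c.org/")]
M, INDEX = build_matrix(PAIRS)
```

A dense S x S matrix grows quadratically with the number of distinct URLs, so for graphs with tens of thousands of nodes a sparse representation such as a dictionary of edge counts may be preferable.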

C. Graph partitioning using Louvain algorithm.

We converted the resulting network matrix into a graph and stored it in an XML file format called “gexf”. Gephi [20] was used to visualize and analyze the graph. We then applied the Louvain algorithm to the graph. Gephi has tools to color the partitions produced by the Louvain method in different color schemes, which gives a clear picture of the partitions and of where they reside in the graph.
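In this work the partitioning itself was carried out by Gephi. Purely as an illustration of what the first phase of the Louvain method does, the sketch below implements the local moving step in plain Python. It is not Gephi's implementation, it omits the second (aggregation) phase, and it compares candidate communities with the quantity $k_{i,in} - \Sigma_{tot} k_i / 2m$, which is formula (2) after expanding the squares and scaling by m. The function names and the toy graph are hypothetical.

```python
def undirected(edges):
    """Build a symmetric {node: {neighbour: weight}} map from (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})
        adj.setdefault(v, {})
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    return adj


def local_moving(adj, tolerance=1e-12):
    """Phase 1 only: move nodes between communities while modularity improves."""
    degree = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    two_m = sum(degree.values())
    community = {v: v for v in adj}          # every node starts alone
    if two_m == 0:
        return community
    sigma_tot = dict(degree)                 # total degree of each community
    improved = True
    while improved:
        improved = False
        for v in adj:
            old = community[v]
            links = {}                       # weight from v to each community
            for u, w in adj[v].items():
                if u != v:
                    links[community[u]] = links.get(community[u], 0) + w
            sigma_tot[old] -= degree[v]      # take v out of its community
            best = old
            best_gain = links.get(old, 0) - sigma_tot[old] * degree[v] / two_m
            for c, k_in in links.items():
                gain = k_in - sigma_tot[c] * degree[v] / two_m
                if gain > best_gain + tolerance:
                    best, best_gain = c, gain
            sigma_tot[best] += degree[v]     # put v into the chosen community
            if best != old:
                community[v] = best
                improved = True
    return community


# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 1)]
labels = local_moving(undirected(EDGES))
```

On the toy graph the two triangles end up in separate communities. A node that never finds a move with a better gain than staying put remains in its singleton community.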

Fig. 1. Graph after partitioning.

We used several data sets and compared the resulting partitioned graphs.

V. RESULTS

Our test bed consisted of an Intel Core i7 processor with 8 MB of cache and 6 GB of physical memory. We used the Python programming language and the R statistical tool to analyze the results.

A. Representation of resulted partitions.

Our data set consisted of natural traffic flows collected from the users of the university network. As a result, the collected traffic spreads over a huge range of URL categories. To categorize these URLs properly we needed a standard labeling method, in other words a directory of URLs. Because no open standard for categorizing URLs and no web directory service was available, the resulting clusters were not labeled. We therefore focused on identifying the clusters by comparing the content of several partitioned graphs. We gained a rough idea of the partitions by manually analyzing the content of each one. A comparison of the partitions obtained from several data sets showed that, most of the time, a URL from a partition of one data set does not fall into a partition with the same neighborhood of URLs in another data set. However, we observed that the URL can be found in a subset of the neighborhood of one such partition. This can be explained by the Louvain algorithm, which forms partitions by accumulating several sub-partitions. In a different data set the properties needed to form an identical accumulated partition might not be present, so the sub-partitions accumulate with other sub-partitions and form an entirely new partition.

B. Comparison with k-means clustering.

Erman et al. [11] showed in 2006 that k-means is the most efficient unsupervised machine learning approach among the three algorithms they tested. Based on these results, we used the k-means algorithm as the baseline for comparing the results of our method.
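To make the k-means baseline concrete, the following is a minimal sketch of Lloyd's algorithm that reports the within-cluster sum of squares (WCSS) for a given number of clusters. It rests on assumptions: each data point is a numeric tuple (the feature encoding of the graph edges is not specified here), the initial centroids are sampled with a fixed seed, and the function names and toy points are hypothetical. The R tooling used for the analysis is not reproduced.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def assign(points, centroids):
    """Group the points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans_wcss(points, k, iterations=100, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break                            # assignments have stabilised
        centroids = updated
    clusters = assign(points, centroids)
    return sum(sq_dist(p, centroids[i])
               for i, c in enumerate(clusters) for p in c)


# Hypothetical toy data: two well separated groups of three points.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
wcss_by_k = {k: kmeans_wcss(POINTS, k) for k in (1, 2, 3)}
```

Since WCSS generally keeps falling as k grows, a common way to read such a curve is to look for the "elbow" where further clusters stop giving a large reduction. On the toy data the drop from k = 1 to k = 2 is large, which points to two clusters.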
The k-means clustering algorithm was applied to data extracted from the graph. Here we considered only the edges of our graph. The within-cluster sum of squares was plotted against the number of clusters for these data, as shown in Fig. 2(a). This plot was used to identify the appropriate number of clusters in our graph data set for the k-means algorithm. The number of clusters that minimizes the within-cluster sum of squares was chosen as the ideal number of clusters for the data set, and the k-means algorithm was applied to the data set with that number. We plotted the resulting clusters with a color scheme that distinguishes each cluster. Table I shows details of the clusters formed by the two algorithms for our data sets. We can observe that the Louvain algorithm always formed a larger number of partitions than the k-means

Fig. 2. (a) Within cluster sum of squares against number of clusters. (b) Plotted clusters from k-means.

Fig. 3. Eccentricity distributions of two distinct data sets. (a) Graph G1. (b) Graph G2.

No. of nodes   No. of edges   No. of partitions   No. of k-means clusters
11433          34830          2488                30
13686          48083          2792                24
5419           12898          1140                19

TABLE I
DETAILS OF DATASETS.

algorithm. The reason might be that the Louvain algorithm leaves alone the nodes that cannot be assigned to partitions. Since the algorithm starts by assigning each node to a separate cluster, the number of partitions is high when single nodes are left out without being accumulated into partitions.

C. Time complexities.

The Louvain method is known to reduce its run time after detecting several hierarchies of communities and has a time complexity of $O(n \log n)$. The k-means algorithm is known to have a time complexity of $O(n^{dk+1} \log n)$, where $d$ is the number of dimensions over which the data are scattered and $k$ is the number of centroids. Both algorithms address NP-hard problems, and the Louvain method takes a greedy optimization approach. Table II shows the time complexities of the proposed graph partitioning approach and of the clustering approach. According to Table II, the Louvain algorithm can cluster network traffic in less time than the k-means algorithm. Since running time matters in the traffic classification domain, we can achieve the best results by using the graph partition approach.

D. Eccentricity distributions.

After comparing the eccentricity plots of several networks, as shown in Fig. 3, we observed that they have identical distributions. This means that the nodes are positioned in a similar way: the frequency of nodes with a given distance value from one node to another is similar across the networks. Hence the number of nodes met when traveling from one node to another takes the same values across the collected data sets. A possible explanation is that only a portion of the World Wide Web (WWW) is visible to a given region, so the URLs that link with each other and are visible to that region change rarely. The factors that define the portion of the WWW visible to a given region range from the legal system of that region to cultural values and social beliefs, and may also vary with technology. Before coming to

                       k-means clustering                          Louvain method
Data filtering         $O(n)$                                      $O(n)$
Network creation       $O(n^2)$                                    $O(n^2)$
Clustering algorithm   $O(n^{dk+1} \log n)$                        $O(n \log n)$
Total                  $O(n^{dk+1} \log n) + O(n^2) + O(n)$        $O(n \log n) + O(n^2) + O(n)$

TABLE II
TIME COMPLEXITIES.

a final conclusion, more research should be carried out on the eccentricity graphs of network traffic, using data collected from different locations.

VI. CONCLUSION

Our work has illustrated the importance of using graph theory to classify network traffic. Since the World Wide Web is a large graph, graph theory is suitable for approximating the natural clusters that reside in a network with the resulting partitions. In this work we used only pre-collected traffic data to create graphs, so each graph is static for the period in which the data were collected. To obtain a better understanding, theories of dynamic graphs [21] should be used. These theories can also be used to classify network traffic obtained from live streaming sources. Our work can be used in the intrusion detection domain to find network anomalies. It can also be used with QoS services, where ISPs can find high-traffic partitions and use that information to solve problems that arise from high network usage.

We faced the lack of a standard web categorization method while carrying out our work. Although a few public web directories maintained through crowdsourcing exist, they are not updated frequently enough to cover the rapidly expanding World Wide Web. To use the full power of network partitioning there should be a frequently updated, standard directory service to label and categorize URLs.

REFERENCES

[1] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, and D. Weitzner, “Web Science: An interdisciplinary approach to understanding the World Wide Web,” Communications of the ACM - Web science, vol. 51, no. 7, pp. 60–69, 2008.
[2] T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56–76, 2008.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E.
Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[4] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement. Taormina, Sicily, Italy: ACM, 2004, pp. 135–148.
[5] D. Zuev and A. W. Moore, “Traffic Classification Using a Statistical Approach,” in Passive and Active Network Measurement, C. Dovrolis, Ed. Berlin / Heidelberg: Springer, 2005, pp. 321–324.
[6] A. W. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 50–60, 2005.

[7] J. Park, H.-R. Tyan, and C.-C. J. Kuo, “GA-Based Internet Traffic Classification Technique for QoS Provisioning,” in Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. IEEE Computer Society, 2006, pp. 251–254.
[8] T. T. T. Nguyen and G. Armitage, “Training on multiple sub-flows to optimise the use of Machine Learning classifiers in real-world IP networks,” in Proceedings of the IEEE 31st Conference on Local Computer Networks. Tampa, Florida, USA: IEEE Computer Society, December 2006, pp. 369–376.
[9] K. W. Kolence and P. J. Kiviat, “Software unit profiles & kiviat figures,” SIGMETRICS Perform. Eval. Rev., vol. 2, no. 3, pp. 2–12, Sep. 1973. [Online]. Available: http://doi.acm.org/10.1145/1041613.1041614
[10] A. McGregor, M. Hall, and P. Lorier, “Flow clustering using machine learning techniques,” in Passive and Active Network Measurement, 5th International Workshop, PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004, Proceedings, B. Chadi and I. Pratt, Eds. Springer, 2004, vol. 3015, pp. 205–214.
[11] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,” in Proceedings of the 2006 SIGCOMM workshop on Mining network data. New York, New York, USA: ACM Press, 2006, pp. 281–286.
[12] S. Zander, T. Nguyen, and G. Armitage, “Automated traffic classification and application identification using machine learning,” in Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary, ser. LCN ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 250–257.
[13] W. W. Vithanage and A. S. Atukorale, “A Novel Classifier for Engineering Web Traffic,” in 2011 IEEE Symposium on Computers and Communications ISCC. IEEE, 2011, pp. 1009–1016.
[14] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[15] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, “Traffic classification through simple statistical fingerprinting,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 1, p. 5, 2007.
[16] N. Sengupta and J. Sil, “Evaluation of Rough Set Theory Based Network Traffic Data Classifier Using Different Discretization Method,” International Journal of Information and Electronics Engineering, vol. 2, no. 3, pp. 338–341, 2012.
[17] P. Siska, M. P. Stoecklin, A. Kind, and T. Braun, “A flow trace generator using graph-based traffic classification techniques,” in Proceedings of the 6th International Wireless Communications and Mobile Computing Conference, ser. IWCMC ’10. New York, NY, USA: ACM, 2010, pp. 457–462. [Online]. Available: http://doi.acm.org/10.1145/1815396.1815503
[18] S. M. Mehr, M. Taran, A. B. Hashemi, and M. R. Meybodi, “Determining web pages similarity using distributed learning automata and graph partitioning,” 2011 International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 129–134, Jun. 2011. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5960971
[19] V. Paxson, “Bro: a System for Detecting Network Intruders in Real-Time,” Computer Networks, vol. 31, no. 23-24, pp. 2435–2463, 1999. [Online]. Available: http://www.icir.org/vern/papers/bro-CN99.pdf
[20] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software for exploring and manipulating networks,” in International AAAI Conference on Weblogs and Social Media, 2009. [Online]. Available: https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/154/1009
[21] C. C. Bilgin and B. Yener, “Dynamic Network Evolution: Models, Clustering, Anomaly Detection.”