A Peer-To-Peer Traffic Identification Method Using Machine Learning

Hui Liu, Wenfeng Feng, Yongfeng Huang, Xing Li
Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China

Abstract— The use of peer-to-peer (P2P) applications is growing dramatically, which results in serious problems such as network congestion and traffic hindrance. In this paper, a method based on machine learning is proposed to identify P2P traffic. The novelty of the proposed method is that it uses only the size of the packets exchanged between IPs within seconds. By investigating the ratio between the upload and download traffic volume of several P2P applications, a characteristic library is constructed. Unknown network traffic can then be recognized online using this library. The distinguishing features of the proposed method are fast computation, high identification accuracy, and low resource consumption. Finally, experimental results show the satisfactory performance of the proposed method.

1-4244-0328-6/06/$20.00 ©2006 IEEE. International Conference on Networking, Architecture, and Storage (NAS 2007) 0-7695-2927-5/07 $25.00 © 2007

I. INTRODUCTION

Nowadays, peer-to-peer (P2P) traffic accounts for about 70% of total Internet traffic, and the share is still growing. The rapid growth in the use of P2P applications has led to serious problems, such as network congestion and traffic hindrance, because these applications occupy an excessive share of the network bandwidth. The harm is especially obvious in corporate, institutional and government networks, where the bandwidth required by critical services has to be guaranteed. Another potential problem is security: legal issues must be considered because of the piracy associated with P2P applications. Therefore, how to identify P2P traffic has become an active topic in Internet research.

In the literature, port-based P2P traffic identification was first suggested by Sen [1]. Since many P2P applications have their own default service ports, such as Kazaa (1214) [2] and BitTorrent (6881-6889) [3], the corresponding application traffic can be identified easily by detecting these specific ports. However, since the closure of Napster, more and more P2P applications have begun to use port hopping to escape detection [4], and some recent P2P applications, such as Winny [5], do not use a default service port. For such applications the port-based identification method no longer works.

To identify P2P traffic better, a method based on the payload is presented in [6]. This algorithm identifies P2P traffic by recognizing specific character strings in the payload of packets. Because each packet needs to be analyzed, the method requires a large amount of computation. What is worse, it behaves badly on encrypted payloads and on new types of P2P traffic whose characteristic strings are unknown.

Because the port-based and payload-based P2P identification methods have a narrow field of application, Karagiannis proposes in [7] a new algorithm based on the behavioral characteristics of the transport layer. Using only a little information from transport layer packets, this method can accurately identify 99% of the P2P traffic; however, it can only be used offline. BLINC, proposed in [8], is another useful traffic identification method, which relies on patterns of host behavior at the transport layer. Instead of studying TCP (or UDP) flows individually, this scheme looks at all the flows generated by specific hosts and can accurately associate each host with the services it provides or uses (application server, web client, etc.). However, BLINC has to gather information from several flows of each host before it can decide the role of a host, which makes it time-consuming.

There is a new trend to classify traffic based on summarized flow information such as duration, number of packets and mean inter-arrival time [9], [10], [11], [12], but these methods are not mature and are all offline algorithms. In [13], the authors use some fundamental characteristics of P2P protocols to identify P2P applications, such as the large network diameter and the presence of many hosts acting as both servers and clients. The method uses only the transport layer header of every packet and can identify unknown P2P protocols; however, it is time-consuming too. Bernaille et al. [14] proposed a technique that relies on observing the first five packets of a TCP connection to identify the application. This method opens a challenging avenue for online traffic classification.

Motivated by the discussion above, a novel algorithm is proposed here to identify P2P traffic based on machine learning.
Unlike the methods above, the proposed algorithm uses only the size of the packets exchanged between IPs within seconds. By investigating the ratio between the upload and download traffic volume of several P2P applications, a characteristic library is constructed. Unknown network traffic can then be recognized online using this library. It takes only a few seconds from the moment a P2P peer begins to communicate with other peers until it is identified. Therefore, the proposed identification method is quick enough to allow



Fig. 1. The upload and download traffic of BitTorrent. (Plot: upload and download traffic against time.)

automatic blocking, filtering, or recording of P2P applications.

The rest of the paper is organized as follows. Section II describes the preliminaries of the proposed identifier. The proposed traffic identification method is described in Section III. In Section IV, an experiment is presented to verify the proposed method. The constraints and limitations of the proposed approach are discussed in Section V. Finally, Section VI concludes the paper.

II. PRELIMINARIES

This section discusses some basic assumptions and the two main insights of the proposed method: the analysis of each IP's upload traffic and download traffic, and the use of supervised machine learning to determine which application an IP runs.

A. Assumptions

Our goal is to build an identifier that runs online and accurately identifies the P2P application associated with an IP as quickly as possible. Our work relies on the following assumptions:

All the P2P applications considered run with their default settings. That is to say, the maximum upload and download bandwidths are not limited. Some users restrict the upload bandwidth to save their resources; this case is not covered by our experiment. Users who set asymmetric limits on upload and download bandwidth are not considered in this paper either.

Access to all the upload and download traffic of an IP. All the traffic from and to an IP is needed, so we assume that the network administrator monitors all the traffic on all the edge links.

Whitelisted ports cannot be used by P2P applications (except for port 80). The whitelisted ports are those with fixed applications; for example, port 53 is used by DNS. P2P applications that let the user choose ports usually choose a port in 1024-65535, so we consider all the ports in 0-1024 safe except for port 80. Port 80 is discussed later.

B. Intuition behind the method

Our goal is to identify the application associated with an IP as early as possible and with little resource consumption. We design an identifier that uses the traffic information of an IP. Specifically,


Fig. 2. The uds of BitTorrent and PPlive.

only the size of packets that do not come from well-known non-P2P ports is used. A P2P peer downloads content from its parent peers, which we call download traffic; at the same time, its child peers download content from it, which we call upload traffic. Therefore, there is a ratio between the upload traffic and the download traffic, and this ratio differs between P2P applications. To test this assumption, we take BitTorrent as an example. The upload and download traffic of a BitTorrent peer per minute is recorded and shown in Fig. 1. As the figure shows, the download traffic is a nearly fixed multiple of the upload traffic. For simplicity, the ratio between the upload traffic and the download traffic is denoted ud. Furthermore, the uds of PPlive and BitTorrent are presented in Fig. 2. We can see that the uds of PPlive are larger than those of BitTorrent, and there is hardly any overlap between them.

C. Supervised Machine Learning

To extract groups of IPs that share a common traffic characteristic, we employ techniques from machine learning. Machine learning generally consists of two parts: model building and classification. A model is first built from training data and is then passed to a classifier to classify a data set. Machine learning techniques can be divided into unsupervised and supervised categories [11]. In this paper, we first use supervised machine learning to obtain each P2P application's ud from the training data set. Once the identifier has been built, the identification process consists of the identifier calculating the distance between the stored uds and the testing data set. Here we use a simple heuristic: we compute the Euclidean distance between the new IP's ud and the pre-defined uds, and then choose the ud for which the distance is minimum.

D. Traffic Capture

A sniffer is placed between the LAN router and the WAN router, as Fig. 3 shows, so that all the packets from or to the stub network can be captured. The sniffer is based on the original libpcap library code, modified to enable long capture periods. For each IP packet, the sniffer reads only the first 40 bytes, which reduces the information required to follow the P2P flows and also protects user privacy. The 40 bytes cover the IP layer header and the transport layer header.
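To illustrate what the first 40 bytes of a packet can provide, the following is a minimal Python sketch, not the authors' libpcap-based sniffer. It assumes IPv4 with a TCP or UDP transport header; the function name parse_headers and the returned dictionary layout are hypothetical choices made for this example.

```python
import socket
import struct


def parse_headers(first40: bytes) -> dict:
    """Extract addresses, ports and packet size from the leading header bytes.

    Assumption for this sketch: the buffer starts at the IPv4 header.
    """
    if len(first40) < 20:
        raise ValueError("need at least a 20-byte IPv4 header")
    version_ihl, _tos, total_length = struct.unpack("!BBH", first40[:4])
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    proto = first40[9]                      # 6 = TCP, 17 = UDP
    src = socket.inet_ntoa(first40[12:16])
    dst = socket.inet_ntoa(first40[16:20])
    sport = dport = None
    # TCP and UDP both start with source port and destination port.
    if proto in (6, 17) and len(first40) >= header_len + 4:
        sport, dport = struct.unpack("!HH", first40[header_len:header_len + 4])
    return {"src": src, "dst": dst, "proto": proto,
            "sport": sport, "dport": dport, "size": total_length}
```

Only the total length field is needed for the traffic volumes used later; the ports are needed for the whitelist filter, and the payload is never read.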

We introduce a parameter, interval: the upload traffic and the download traffic each mean the total traffic sent or received within the interval. The interval is first set to 1 minute; later, the results with different intervals are compared.

Fig. 3. Traffic collecting point. (Diagram labels: Internet, WAN Router, LAN Router, Core Switch, Traffic Collecting point, Stub Network, peer1, peer2, peer3.)

Fig. 4. Identification flowchart. (Diagram labels: Training data set, Learning phase, Characteristic Library, Test data set, Online Identification phase, Results.)

Fig. 5. The distribution of PPlive's uds. (Plot: percentage against upload/download, with a linear fit.)
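The ud ratio and the minimum-distance heuristic of Section II-C can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names and the reference ud values in the usage note are hypothetical, and since ud is a single number the Euclidean distance reduces to an absolute difference.

```python
def ud_ratio(upload_bytes, download_bytes):
    """Ratio of upload to download traffic within one interval.

    Returns None when nothing was downloaded, since the ratio is undefined.
    """
    if download_bytes == 0:
        return None
    return upload_bytes / download_bytes


def nearest_application(ud, reference_uds):
    """Pick the application whose reference ud is closest to the observed ud.

    reference_uds maps an application name to one pre-defined ud value.
    For scalar uds the Euclidean distance is simply |a - b|.
    """
    return min(reference_uds, key=lambda app: abs(ud - reference_uds[app]))
```

For example, with the made-up references {"PPlive": 4.0, "BitTorrent": 0.5}, an observed ud of 3.5 would be assigned to PPlive, in line with the observation that the uds of the two applications hardly overlap.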

III. METHODOLOGY

Based on the observations from the previous section, we propose a P2P identification mechanism that works in two phases: an offline learning phase and an online P2P identification phase. As Fig. 4 shows, the learning phase first uses a set of training data to obtain each P2P application's ud characteristics. The traffic identification phase then uses these characteristics to determine the application associated with each IP. To verify that the behavior of applications does not vary with time, we use one data set for the learning phase and another for the evaluation of the derived identification method. These two traces were collected several days apart on the same link.

Considering their popularity on the Internet and the volume of traffic they exchange, in this paper we take only the following P2P applications into consideration: Maze [15], BitTorrent, PPlive [16], eDonkey [17] and Thunder [18]. The stub network in Fig. 3 has 50 computers. The five P2P applications are installed on five of them, one application per computer, to ensure that every P2P peer carries the traffic of only one P2P application. The peers run other non-P2P applications at the same time.

We measure the accuracy by comparing the P2P application labels given by our identifier with the real situation, which differs from the approach in [14]. The advantage is that we obtain the absolute accuracy, whereas [14] obtains an accuracy relative to payload analysis. The disadvantage is that our experimental environment is not general. By repeating the experiment many times and changing the environment each time (for example, we place the five P2P applications on a different five of the 50 computers each time, and we change the background traffic from the other computers), we can simulate an environment closer to the realistic link in [14].


A. Learning Phase

The learning phase is performed offline and consists of detecting the different uds in a set of P2P applications. Its aim is to find the uds for the above five P2P applications. Our study uses a six-hour trace collected at the traffic collecting point with interval = 1 minute. The uds of the five P2P applications are calculated for each minute, which gives 360 uds per application. Then the distribution of the uds is calculated and drawn in a chart. Taking PPlive as an example, we obtain the diagram in Fig. 5. Compared with the linear fit, the distribution curve shows that most of the uds lie between 2 and 6.7. Thus we set the characteristic of PPlive to (2, 6.7). For an unknown application, the ud with interval = 1 minute is calculated, and if the ud is in (2, 6.7), the application is likely to be PPlive. The same was done for the other four P2P applications to obtain their characteristics. By summarizing these characteristics, a characteristic library is obtained, which is used in the online identification process.

Another characteristic added to the characteristic library is the low threshold on total traffic: only those IPs that have enough traffic are processed. The reasons are as follows: 1) it saves resources and improves performance; 2) only 10% of the P2P peers produce 99% of the total P2P traffic [1]. The total traffic that each peer produces in 1 minute is recorded, and the minimum value found is used as the basis for the low threshold. Here, we set the low threshold to 300 bytes per second.

B. Online Identification

Fig. 6 presents the structure of the online identifier. This identifier can run either on a management host that has online access to packet headers or in a network processor at the router. This process is very light, since it only involves retrieving information from the IP layer and transport layer headers (the size of the packet and the flow id).
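The learning phase described above reads the characteristic range off the distribution chart. As a rough stand-in, the sketch below derives a range by trimming a fixed percentage of the smallest and largest training uds; this trimming rule, the function names and the parameter trim_percent are assumptions made for illustration and are not the authors' procedure. The low-threshold check uses the 300 bytes per second value stated in the text.

```python
def characteristic_range(uds, trim_percent=5):
    """Return (low, high) bounds covering the bulk of the training uds.

    Assumption: the outermost trim_percent of samples on each side are
    treated as outliers and dropped before taking the bounds.
    """
    if not uds:
        raise ValueError("no training uds")
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be in [0, 50)")
    ordered = sorted(uds)
    drop = len(ordered) * trim_percent // 100
    kept = ordered[drop:len(ordered) - drop]
    return kept[0], kept[-1]


def build_library(training, trim_percent=5):
    """Map each application name to its characteristic ud range."""
    return {app: characteristic_range(uds, trim_percent)
            for app, uds in training.items()}


def exceeds_low_threshold(total_bytes, interval_seconds,
                          threshold_bytes_per_s=300):
    """True when an IP carries enough traffic to be worth classifying."""
    return total_bytes / interval_seconds > threshold_bytes_per_s
```

With 360 training uds per application, a trim of 5 percent would discard the 18 smallest and 18 largest samples before the bounds are taken.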

Fig. 6. Design of online identifier. (Diagram modules: Supervisor; Capture, with NIC sniffer and file sniffer; Sampling; Filter; Extract; Update and Record; Flow Analyze; Calculate; Rule Execute; Query; HTTP Interface. Data stores: Flow database (source, dest, proto, size); Rule database. Other labels: Network, NIC, file, Packet Stream, Flow Stream, IPC/Socket, training Rules, new rules, Rules, Results, Request.)

    while( !interval ){
        for each packet:
            if sport or dport in (whitelisted ports)
                discard it
            else {
                get(sIP, dIP, total length)
                get sIP.upload
                get dIP.download
                add sIP in sIPTable
                update sIP.upload
                add dIP in dIPTable
                update dIP.download
            }
    }
    while((sIPTable != NULL) and (dIPTable != NULL)){
        if(sIP == dIP){
            traffic = sIP.upload + dIP.download
            if (traffic > low threshold){
                rate = sIP.upload / dIP.download
                check rate from rate-application table to get the application
            }
            else
                //do nothing
        }
    }

Fig. 7. Algorithm for online P2P traffic identification.

Fig. 8. The accuracy with different interval. (Plot: accuracy against interval.)

Our identifier takes as input packet headers either from a NIC or from a tcpdump file. A Capture module uses a NIC sniffer and a file sniffer to get the packet headers. If necessary, sampling is employed. Packets whose source port or destination port is in the whitelisted ports are picked out by the Filter module and discarded. The Extract module extracts the 5-tuple (protocol, source IP, destination IP, source port, destination port) and the packet size; at the same time, the source IP's upload traffic volume and the destination IP's download traffic volume are obtained. The newly obtained traffic volume is sent to the Update and Record module, which finds the corresponding source IP or destination IP and adds the new volume to the volume already recorded for it.
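The per-packet accumulation and the per-interval analysis of Fig. 7 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the Como-based prototype: packets are given as hypothetical (src, dst, sport, dport, size) tuples for a single interval, the whitelist is taken to be ports 0-1024 except 80, and the library maps each application to a (low, high) ud range with inclusive bounds.

```python
from collections import defaultdict


def is_whitelisted(port):
    """Ports 0-1024 are treated as non-P2P, except port 80."""
    return 0 <= port <= 1024 and port != 80


def accumulate(packets):
    """Sum upload bytes per source IP and download bytes per destination IP."""
    upload = defaultdict(int)
    download = defaultdict(int)
    for src, dst, sport, dport, size in packets:
        if is_whitelisted(sport) or is_whitelisted(dport):
            continue  # Filter module: discard whitelisted traffic
        upload[src] += size
        download[dst] += size
    return upload, download


def analyze(upload, download, library, interval_seconds,
            low_threshold_bytes_per_s=300):
    """Label every IP seen as both source and destination in the interval."""
    results = {}
    for ip in upload.keys() & download.keys():
        up, down = upload[ip], download[ip]
        if down == 0:
            continue  # ud undefined
        if (up + down) / interval_seconds <= low_threshold_bytes_per_s:
            continue  # too little traffic: discard this IP
        ud = up / down
        for app, (low, high) in library.items():
            if low <= ud <= high:
                results[ip] = app
                break
    return results
```

In a running identifier, accumulate would keep consuming packets for the next interval while analyze processes the tables of the interval that just ended.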


The above four modules are synchronous with the packet stream. When the interval expires, the data are processed in the Flow Analyze module while the above four modules carry on their work concurrently. In the Flow Analyze module, we have two arrays, one for the source IPs and the other for the destination IPs. IPs that appear as both source and destination are found, and the total traffic volume is calculated by adding the IP's upload traffic volume as a source and its download traffic volume as a destination. If this result is smaller than our predefined low threshold, the IP is discarded. Then we obtain the ud for each IP, and by looking up the ud in the characteristic library we can determine which P2P application this IP is running. The Query module receives client requests, performs the query and sends the results to the clients; based on feedback from clients, it sends new characteristics to the characteristic library. The Supervisor module manages and controls all the modules, including resource allocation, exception handling and socket communication. Pseudocode of our implementation is presented in Fig. 7.

IV. EXPERIMENT RESULTS

This section gives a proof of concept of our identification method. We build a prototype identifier by modifying Como [19]. First, a training data set is used for the learning phase. Based on the identifier descriptions and identifier compositions found in the learning phase, we perform the online identification using a trace collected several days afterwards at the same traffic collecting point.

We used interval = 1 minute before, and now we want to find a shorter interval to make the identification process faster. Fig. 8 illustrates how the accuracy changes with the interval. As the interval goes up from 1 second to 6 seconds, the accuracy rises to its highest value, 92%. After that the accuracy declines slightly. Considering both efficiency and accuracy, we take the 6-second interval as the most suitable.

Table I presents the accuracy of our identifier on the five P2P applications. Our identifier can identify more than 92% of Maze, PPlive, BitTorrent and Thunder. For eDonkey, the accuracy is 78.5%. The reason is that eDonkey is not very popular in China, so the application is constantly searching for parent nodes to download from. Both the upload and download traffic of the eDonkey peers are therefore very small, so our identifier

TABLE I. ACCURACY OF P2P APPLICATION IDENTIFICATION.

Software      Accuracy
Maze          99.8%
PPlive        93.5%
BitTorrent    96.1%
eDonkey       78.5%
thunder       98.3%

always discards these IPs. This error can be easily fixed if we consider extra information. In this experiment, if we had used the default port 4662 to label eDonkey packets, we could have achieved over 98% accuracy [20].

Comparing the payload-based identification method with ours, we can draw the following conclusions. P2P applications such as Maze, which do not have a known application layer signature, are identified well by our method but cannot be recognized by the payload-based identification method. The time we need to identify an application is 6 seconds, which is longer than that of the payload-based identification method; however, it is still fast enough for administrators to take action against the identified P2P application. We ran the two identification methods separately on a Dell workstation with a 2.8 GHz CPU, 512 MB of memory and a 100 Mbps NIC. The resource consumption of our method is as follows: CPU occupation is about 12%, and the total memory usage is 172 MB. For the payload-based identification method, CPU occupation is about 46%, and the total memory usage is 220 MB. The latter method thus consumes considerably more resources.

V. LIMITATIONS AND CHALLENGES

The initial results observed with our method on a small trace are encouraging. The method is promising as it allows early identification of P2P applications and is quite simple. However, the method has some limitations that we discuss below.

Higher bandwidth. In our experiment, the bandwidth is just 100 Mbps. How to deal with higher bandwidths such as 2.5 Gbps is left for future work.

Sampling. On high-speed links, monitors cannot collect all packets. The accuracy of our method will degrade fairly quickly under packet sampling. If the network instead adopts flow sampling [21], our method will work unaltered.

Only one P2P application on one computer. Our identifier would work badly if more than one P2P application were running on a single computer. This situation is infrequent; when it occurs, the ports should be taken into consideration, and the results can be satisfactory, at the cost of more time and more resources.

How to treat port 80. Port 80 is debatable, as some P2P applications such as BitTorrent can use this port to communicate with other peers. In our method, the traffic volume of port 80 is recorded; if it is running HTTP, then


the total traffic volume is probably under the low threshold, so it is discarded.

Use of this identifier on other P2P applications. In this paper, we consider only five P2P applications. What about others, such as QQ, SKYPE and MSN? For this kind of interactive P2P application, the ud characteristic is hard to obtain, and how to set up a characteristic library for these applications needs further research.

VI. CONCLUSION

This paper presented a supervised machine learning approach for P2P traffic identification. Through machine learning, a characteristic library for five P2P applications (Maze, BitTorrent, PPlive, Thunder, and eDonkey) is built, and unknown network traffic can then be recognized online using this library. The distinguishing features of the proposed method are fast computation, high identification accuracy, and low resource consumption. The experimental results show the satisfactory performance of the proposed method: with interval = 6 seconds, the average identification accuracy can reach over 90%. This efficiency and accuracy are sufficient to allow automatic blocking, filtering, or recording of P2P applications. Finally, we discussed the limitations and the challenges of our method. How to apply this identifier to other P2P applications and how to use it in higher-bandwidth networks are topics for future research.

ACKNOWLEDGMENT

This research was supported in part by the Hi-Tech R&D Program (863) of China (Grants 2006AA01Z444).

REFERENCES

[1] S. Sen and J. Wang, "Analyzing peer-to-peer traffic across large networks," IEEE/ACM Transactions on Networking, vol. 12, no. 2, pp. 219-232, 2004.
[2] Kazaa, "http://www.kazaa.com".
[3] BitTorrent Protocol, "http://bitconjurer.org/BitTorrent".
[4] T. Karagiannis, A. Broido, N. Brownlee, Kc Claffy, and M. Faloutsos, "Is p2p dying or just hiding?" Proceedings of the IEEE Globecom, Dallas, TX, USA, pp. 1532-1538, 2004.
[5] S. Ohzahata, Y. Hagiwara, M. Terada and K. Kawashima, "A traffic identification method and evaluations for a pure P2P application," Passive and Active Network Measurement, Lecture Notes in Computer Science, vol. 3431, pp. 55-68, 2005.
[6] S. Sen, O. Spatscheck and D. Wang, "Accurate, scalable in-network identification of P2P traffic using application signatures," Proceedings of Thirteenth International World Wide Web Conference, pp. 512-521, 2004.
[7] T. Karagiannis, A. Broido, M. Faloutsos and K. Claffy, "Transport layer identification of P2P traffic," Proceedings of the ACM SIGCOMM Internet Measurement Conference, pp. 121-134, 2004.
[8] T. Karagiannis, K. Papagiannaki and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2005.
[9] M. Roughan, S. Sen, O. Spatscheck and N. Duffield, "Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification," Proceedings of the ACM SIGCOMM Internet Measurement Conference, pp. 135-148, 2004.
[10] A. Moore and D. Zuev, "Internet traffic classification using bayesian analysis," Proceedings of International Conference on Measurement and Modeling of Computer Systems, pp. 50-60, 2005.

[11] S. Zander, T. Nguyen and G. Armitage, "Automated traffic classification and application identification using machine learning," Proceedings of the IEEE Conference on Local Computer Networks, pp. 250-257, 2005.
[12] D. Zuev and A. Moore, "Traffic classification using a statistical approach," Lecture Notes in Computer Science, vol. 3431, pp. 321-324, 2005.
[13] F. Constantinou and P. Mavrommatis, "Identifying Known and Unknown Peer-to-Peer Traffic," Proceedings of Fifth IEEE International Symposium on Network Computing and Applications, pp. 93-102, 2006.
[14] L. Bernaille, R. Teixeira and I. Akodkenou, "Traffic classification on the fly," Computer Communication Review, vol. 36, no. 2, pp. 23-26, 2006.
[15] H. Chen, M. Yang, J. Han, H. Deng and X. Li, "Maze: a social peer-to-peer network," Proceedings of the IEEE International Conference on E-Commerce Technology for Dynamic E-Business, pp. 290-293, 2004.
[16] PPlive, "http://www.imfirewall.com/protocols/ppLive.htm".
[17] EDonkey, "http://www.edonkey2000.com".
[18] Thunder, "http://www.seobbs.net/read.php?tid=11406".
[19] Como, "http://www.cambridge.intel-research.net/como".
[20] L. Plissonneau, J-L. Costeux and P. Brown, "Analysis of peer-to-peer traffic on ADSL," Lecture Notes in Computer Science, vol. 3431, pp. 69-82, 2005.
[21] N. Hohn and D. Veitch, "Inverting sampled traffic," IEEE/ACM Transactions on Networking, vol. 14, no. 1, pp. 68-80, 2006.
