Solving the App-Level Classification Problem of P2P Traffic Via Optimized Support Vector Machines

Rui Wang1), Yang Liu2), Yuexiang Yang3), Xiaoyong Zhou4)
1)2)3)4) National University of Defense Technology
[email protected], [email protected]

Abstract

Since the emergence of peer-to-peer (P2P) networking in the late 1990s, P2P traffic has become one of the most significant portions of network traffic. Accurate identification of P2P traffic is important for efficient network management and reasonable utilization of network resources. App-level classification of P2P traffic, especially without payload feature detection, is still a challenging problem. This paper proposes a new method for P2P traffic identification and app-level classification that uses only transport layer information. The method uses optimized Support Vector Machines to perform large learning tasks, which are common in network traffic identification. The experimental results show that the proposed method has high efficiency and promising accuracy.

1. Introduction

The traffic of peer-to-peer (P2P) file-sharing applications has become the main part of Internet traffic: according to a survey from CacheLogic [1] in June 2004, 60% of the Internet's traffic is P2P. Additionally, P2P multimedia applications, such as pplive and qqlive in China, are gradually becoming popular and contribute a great amount of P2P traffic to the Internet. Many P2P applications use dynamic or variable ports and even camouflage their traffic using ports designated for other traffic, such as ports 25 (email) and 80 (web). Such strategies make the traditional port-based identification method ineffective for a large amount of P2P traffic. The inability to detect P2P traffic accurately causes the following difficulties: first, it is hard to manage traffic efficiently in P2P-active networks such as campus networks; second, it is hard to implement QoS, which needs accurate traffic classification information; last, it is hard to prevent or stop network attacks camouflaged as P2P traffic.

Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06) 0-7695-2528-8/06 $20.00 © 2006

Many identification methods have been proposed to solve the P2P traffic classification problem. This paper groups them into three classes: payload-based methods [2], [3], [4], [5], statistic-based methods [6], [7], [8], and cross-layer methods [9]. Payload-based methods match packets against payload features of P2P traffic, while statistic-based methods do not access the payload and instead match other features such as port numbers and IP addresses, the so-called statistic features. Methods that use both payload and statistic information form the third class.

Payload-based methods are easy to implement and can classify P2P traffic at the app level, but they have difficulty identifying payload-encrypted traffic and unknown P2P applications. Moreover, in large-scale or security-sensitive networks, payload access is commonly prohibited because of its high cost and the privacy intrusion it implies, so payload-based methods cannot be applied. Compared with payload-based methods, statistic-based methods identify payload-encrypted traffic and unknown P2P applications more easily, and they are also convenient to apply to large-scale and security-sensitive networks. However, app-level classification is hard to implement without payload access. Cross-layer methods can identify payload-encrypted traffic and unknown P2P applications by matching statistic features, and can also classify P2P traffic at the app level with payload features. Like payload-based methods, however, cross-layer methods are not suitable for some specific networks.

The support vector machine (SVM) has been introduced as an excellent technique for solving various function estimation problems, including pattern recognition problems [10], [11], [12], [13], [14]. But when the dataset is very large and some types of errors are more important than others, the traditional binary SVM and the most popular multi-class SVMs cannot solve the classification problem efficiently. Since network traffic always comes in bulk, this paper proposes a new identification method, belonging to the statistic-based class, that uses a multi-class SVM [15], [16] and an unbalanced binary SVM [17] to perform large classification tasks. The criteria for this method are:

- Able to detect the major types of P2P traffic
- Immune to newborn P2P applications and to possible encryption of the payload
- Able to process a large amount of traffic
- Able to perform app-level classification of P2P traffic

The rest of this paper is organized as follows. In Section 2, we briefly review some basic work on SVMs for classification problems. The data and the proposed method are described in Section 3. In Section 4, experimental results are reported to illustrate the advantages of the proposed method. Finally, concluding remarks and future work are discussed in Section 5.

2. Support Vector Machines

Support Vector Machines (SVMs) are classification and regression methods derived from the basic principles of Vapnik's Statistical Learning Theory [13]. In this section we briefly review some basic work on SVMs for classification problems.

2.1. Two-class Support vector machines

Let n-dimensional inputs $x_i$ ($i = 1, \dots, N$) belong to Class 1 or 2, and let the associated labels be $y_i = 1$ for Class 1 and $y_i = -1$ for Class 2. The classification problem is to find a hyper-plane in a high-dimensional feature space $Z$ that divides the set of examples in the feature space such that all the points with the same label are on the same side of the hyper-plane [10], [11], [13]. An SVM constructs a map $z = \phi(x)$ from the input space $R^n$ to a high-dimensional feature space $Z$ and finds an "optimal" hyper-plane $w^T z + b = 0$ in $Z$ such that the separation margin between the positive and negative examples is maximized (Figure 1) [10], [14], [18], [19].

Figure 1. Two-class SVM

Figure 1 illustrates the SVM classification with a hyper-plane that maximizes the separating margin between the two classes (indicated by data points marked by "×"s and "○"s). Support vectors are elements of the training set that lie on the boundary hyper-planes of the two classes. A decision function of the classifier is then given by

\[
f_{w,b}(z) = \mathrm{sgn}[w^T z + b],
\]

where $w$ is a weight vector and $b$ is a threshold. Without loss of generality, we consider the case when the training set is not linearly separable. The SVM classification amounts to finding $w$ and $b$ satisfying:

\[
\begin{aligned}
\min\ & \frac{1}{2} w^T w + c \sum_{i=1}^{N} \varepsilon_i \\
\text{s.t.}\ & y_i [w^T \phi(x_i) + b] \ge 1 - \varepsilon_i, \quad i = 1, \dots, N \\
& \varepsilon_i \ge 0, \quad i = 1, \dots, N
\end{aligned}
\tag{1}
\]

where $c > 0$ is a regularization parameter for the tradeoff between model complexity and training error, and $\varepsilon_i$ measures the (absolute) difference between $w^T z + b$ and $y_i$. In many important applications, however, some kinds of errors are more important than others. An alternative formulation called ν-SVM [20] has therefore been proposed. It replaces $c$ in formulation (1) with a different parameter $\nu \in [0, 1]$ that serves as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. The ν-SVM has the following formulation:

\[
\begin{aligned}
\min\ & \frac{1}{2} w^T w - \nu \rho + \frac{1}{N} \sum_{i=1}^{N} \varepsilon_i \\
\text{s.t.}\ & y_i [w^T \phi(x_i) + b] \ge \rho - \varepsilon_i, \quad i = 1, \dots, N \\
& \varepsilon_i \ge 0, \quad \rho \ge 0, \quad i = 1, \dots, N
\end{aligned}
\tag{2}
\]
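To make formulation (1) concrete, the following minimal sketch evaluates the soft-margin objective for a fixed pair (w, b). It assumes a linear map φ(x) = x, takes each slack ε_i at its smallest feasible value max(0, 1 − y_i(wᵀx_i + b)), and treats sgn(0) as +1; the function names are illustrative and do not come from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def soft_margin_objective(w, b, samples, labels, c):
    """Objective of formulation (1) for fixed (w, b), assuming phi(x) = x.

    Each slack is the smallest value satisfying both constraints:
    eps_i >= 1 - y_i * (w.x_i + b) and eps_i >= 0.
    """
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b))
              for x, y in zip(samples, labels)]
    return 0.5 * dot(w, w) + c * sum(slacks), slacks


def decide(w, b, x):
    """Decision function sgn[w.x + b]; sgn(0) is taken as +1 (assumption)."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy data: the third point lies inside the margin, so its slack is 0.5.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, samples, labels, c=2.0)
```

A larger c penalizes the slack terms more heavily, which is the tradeoff between model complexity and training error described above.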

2.2. Multi-class Support vector machines

The previous subsection reviewed the basic theory of SVMs for two-class classification. In this subsection, we describe the all-together method of multi-class support vector machines. The all-together method [21] was proposed by Crammer and Singer. Unlike most approaches, which typically decompose a multi-class problem into multiple independent binary classification tasks, it solves a single optimization problem. The formulation is as follows:

\[
\begin{aligned}
\min\ & \frac{1}{2} \sum_{t=1}^{k} \| w_t \|^2 + C \sum_{j=1}^{l} \varepsilon_j \\
\text{s.t.}\ & w_{y_j}^T \phi(x_j) - w_t^T \phi(x_j) \ge e_j^t - \varepsilon_j, \quad j = 1, \dots, l
\end{aligned}
\tag{3}
\]

where $e_j^t = 1 - \delta_{y_j,t}$ and

\[
\delta_{y_j,t} \equiv
\begin{cases}
1, & y_j = t \\
0, & y_j \ne t
\end{cases}
\]

A point $x$ is in the class that corresponds to the largest value of the decision functions:

\[
\text{class of } x = \arg\max_{t = 1, \dots, k} w_t^T \phi(x)
\tag{4}
\]

Table 1. Data description

Set    Start        P2P   Duration  Interval  PCs  P2P type    Flows    Bytes
Set A  10:00 05/14  100%  12 hours  5 min     1    BitTorrent  30.68k   337.26M
Set B  10:00 05/15  100%  12 hours  5 min     1    eDonkey     2.72k    629.82M
Set C  10:00 05/16  100%  12 hours  5 min     1    Kazaa       4.43k    22.548M
Set D  10:00 05/17  100%  12 hours  5 min     1    pplive      58.15k   168.62M
Set E  10:00 05/18  0%    12 hours  5 min     10   --          460.24k  799.92M
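The decision rule (4) of Section 2.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the map is linear, φ(x) = x, the weight vectors are given rather than learned by solving (3), classes are numbered 1 to k, and ties go to the lowest class index. The name `predict_class` is hypothetical.

```python
def predict_class(weight_vectors, x):
    """Return the class t in 1..k that maximizes w_t . x (rule (4)).

    Assumes phi(x) = x; weight_vectors[t - 1] plays the role of w_t.
    Ties are resolved in favor of the lowest class index.
    """
    scores = [sum(wi * xi for wi, xi in zip(w_t, x)) for w_t in weight_vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1


# Three hypothetical weight vectors, one per class.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
first = predict_class(weights, [0.2, 0.9])     # scores 0.2, 0.9, -1.1
second = predict_class(weights, [-1.0, -1.0])  # scores -1.0, -1.0, 2.0
```

Because a single argmax over k scores is enough, prediction cost grows linearly with the number of P2P application types.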

3. The Proposed Method

3.1. Data description

The data for all experiments were collected from our campus network through a netflow collector. As Figure 2 shows, the netflow collector is attached to the border router, so we can easily collect all flows from the inner network to the Internet and from the Internet to the inner network. We collected 5 groups of data, all of which are described in Table 1. The first 4 groups of flows, called sets A, B, C, and D respectively, contain BitTorrent, eDonkey, Kazaa, and pplive traffic, which are popular P2P file-sharing or multimedia applications. The 5th group of flows contains a large amount of non-P2P traffic. Importantly, only TCP traffic between the inner network and the outside is collected as experimental data, because inner-to-inner traffic and UDP traffic are outside the scope of this paper.

Figure 2. Network topology

3.2. The proposed method

False Positive and False Negative are generally used to evaluate the accuracy of a P2P traffic identification method. False Positive measures how much non-P2P traffic is misidentified as P2P, and False Negative measures how much P2P traffic is not recognized correctly. A large False Positive makes an identification method unusable on many occasions. For instance, when a P2P identification method is applied to confine P2P traffic, a large False Positive means that non-P2P users in the network are badly affected. The deployment of QoS strategies is also sensitive to a large False Positive. A large False Negative, by contrast, has much less impact on the availability and applicability of the method. On the one hand, an identification method with a large False Negative does not threaten the interests of non-P2P users; on the other hand, identifying the main part of P2P traffic is generally enough for efficient management of P2P. So an acceptable P2P identification method should have a low False Positive and a moderate False Negative.

P2P traffic can be separated into two parts: signalling traffic and data traffic. Signalling traffic includes all control traffic between peers and superpeers (if there are any) and announcement traffic among peers. Data traffic, which includes all user data transferred among peers, constitutes the main part of P2P traffic. So, in order to identify most of the P2P traffic, efficient identification of data traffic is more important than identification of signalling traffic. Based on the discussion above, this paper proposes two design principles for our identification method:

- Low False Positive is more important than low False Negative;
- Accurately identifying data traffic is more important than identifying signalling traffic.
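The two error measures discussed in Section 3.2 can be computed as follows. This is a minimal sketch that assumes each measure is normalized by the size of its own class (non-P2P flows for False Positive, P2P flows for False Negative); the paper does not state the exact normalization, and the name `error_rates` is hypothetical.

```python
def error_rates(true_labels, predicted_labels, positive="P2P"):
    """Return (false_positive_rate, false_negative_rate).

    False Positive: non-P2P flows identified as P2P, divided by all
    non-P2P flows. False Negative: P2P flows not identified as P2P,
    divided by all P2P flows. (Normalization is an assumption.)
    """
    pairs = list(zip(true_labels, predicted_labels))
    negatives = sum(1 for t, _ in pairs if t != positive)
    positives = len(pairs) - negatives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp_rate = fp / negatives if negatives else 0.0
    fn_rate = fn / positives if positives else 0.0
    return fp_rate, fn_rate


truth = ["P2P", "P2P", "non-P2P", "non-P2P", "non-P2P", "non-P2P"]
guess = ["P2P", "non-P2P", "P2P", "non-P2P", "non-P2P", "non-P2P"]
fp_rate, fn_rate = error_rates(truth, guess)
```

Under the first design principle, a classifier would be tuned to push `fp_rate` down even at the cost of a somewhat higher `fn_rate`.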


According to these two principles, this paper proposes a P2P traffic identification method, illustrated in Figure 3.

Figure 3. The proposed method [diagram: input data feed the unbalanced binary train, which produces the unbalanced model, and the weighted multi-class train, which produces the multi-class model; unknown data pass through the binary predict and then the multi-class predict; output data are non-P2P, P2P 1, P2P 2, ..., P2P n]

In Figure 3, the whole detecting process can be regarded as a black box (the process in the dashed box). This black box has only one input (unknown traffic) and n+1 outputs (one for identified non-P2P traffic and n for the P2P types). The whole detecting process can be described as follows:

- An unbalanced binary SVM is trained on clean, labeled non-P2P traffic and n types of P2P traffic, biased toward reducing the amount of misclassified non-P2P traffic. This training process exports an unbalanced binary model.
- A weighted multi-class SVM is trained on the n types of P2P traffic, treating training data with different importance according to their weight values. The output of this training process is a multi-class P2P model.
- Unknown traffic, which is the input of this method, is first predicted by the unbalanced binary model and thereby identified as P2P or non-P2P.
- The traffic identified as P2P is then predicted by the multi-class P2P model in order to classify the application type.

In this method, the unbalanced binary training realizes the first design principle, namely decreasing False Positive as much as possible, and the weighted multi-class training implements the second design principle by giving data traffic a high weight and signalling traffic a low weight. Considering that most signalling flows have only a few packets, while most data flows have many more, this paper implements a simple weighting algorithm: the more packets a flow has, the larger its weight.

Raw flows cannot be used directly as the input of the proposed method, because certain information in flows, such as source IP and destination IP, is environment-specific, which makes the training results unusable in other environments. Moreover, raw traffic flows lack distinct features that can be used for classification. So pre-processing of flows is necessary.

(a) P2P    (b) non-P2P

Figure 4. Statistic features

Consider the differences between web traffic and P2P traffic. Suppose a node A becomes a peer of a certain P2P application. Throughout the lifetime of this peer, most of the time it should be connected to many other peers and superpeers. Moreover, the number of connections between any two peers should be low, because load balancing of data transfer demands that heavy load between two peers be avoided as far as possible. In a large-scale or middle-scale P2P overlay, a peer can commonly find many other peers that match its requirements, so the load balancing strategy can be applied in most situations and the number of connections between any two peers stays low. In contrast, a node with other traffic such as HTTP or FTP typically connects to fewer destination nodes and opens more connections to each of these destinations. We illustrate this distinction between P2P traffic and non-P2P traffic in Figures 4(a) and 4(b).

Another statistic feature is the dynamic assignment of the port numbers used in the communication between peers. For TCP communication between peers, we have found that a peer tends to use a dynamically assigned port to communicate with other peers, and the destination peer tends to listen on a dynamically assigned port.

Based on the above analysis and observation, we propose the following method. Suppose that a flow can be expressed by a vector of its attributes. It can be simplified to the fields used in the rest of this section: src, dst, spt, dpt, bytes, and pkts. We define a function:

\[
f(src, dst) = \frac{dif(src, DST)}{same(src, dst)}
\tag{5}
\]

Here dif(src, DST) returns the number of different destinations connected with src, and same(src, dst) returns the number of connections of src whose destination equals dst. When f(src, dst) is computed for a node within mixed traffic, dif(src, DST) is constant, while non-P2P flows tend to have a larger same(src, dst). By a similar analysis, replacing dst with spt or dpt, we conclude that some differences may also exist between P2P and non-P2P traffic. Accordingly, f(src, spt) and f(src, dpt) are defined in the same way. Finally, a pre-processing function can be given:

\[
\text{pre-process}(flow) = \langle f(src, dst),\ f(src, spt),\ f(src, dpt),\ bytes / pkts \rangle
\tag{6}
\]

After the pre-processing, we expect that the first three dimensions of flows can be used to differentiate P2P from non-P2P traffic efficiently, and that the combination of all four dimensions can be used for the app-level classification.
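The pre-processing of equations (5) and (6) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the `Flow` record and its field order are hypothetical, counts are taken over the whole batch of flows passed in, f(src, spt) and f(src, dpt) are read as the same ratio with source port or destination port in place of the destination, and every flow is assumed to have at least one packet.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical flow record holding the fields named in the text.
Flow = namedtuple("Flow", "src dst spt dpt bytes pkts")


def pre_process(flows):
    """Map each flow to <f(src,dst), f(src,spt), f(src,dpt), bytes/pkts>."""
    # counts[field][src][value] = number of flows from src with that value
    counts = {field: defaultdict(Counter) for field in ("dst", "spt", "dpt")}
    for flow in flows:
        for field in counts:
            counts[field][flow.src][getattr(flow, field)] += 1

    def f(flow, field):
        per_src = counts[field][flow.src]
        dif = len(per_src)                    # distinct values seen for src
        same = per_src[getattr(flow, field)]  # flows from src sharing this value
        return dif / same

    return [(f(fl, "dst"), f(fl, "spt"), f(fl, "dpt"), fl.bytes / fl.pkts)
            for fl in flows]


flows = [
    Flow("A", "X", 1000, 80, 3000, 3),
    Flow("A", "X", 1001, 80, 1000, 2),
    Flow("A", "Y", 1002, 6881, 500, 5),
]
features = pre_process(flows)
```

In this toy batch the two flows to destination X share a lower f(src, dst) than the single flow to Y, which is the direction of the distinction the text expects between repeated connections to one destination and connections spread over many.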

4. Experimental Results

First, this paper compares three methods: the Signature method [3] (a payload-based method), the Transport Layer method [7] (a statistic-based method), and the proposed method. The results of the comparison are shown in Table 2.

Table 2. The comparison of three methods

Method           App-level classification  No private information needed  Identifies encrypted or unknown P2P
Signature        Yes                       No                             No
Transport Layer  No                        Yes                            Yes
Proposed         Yes                       Yes                            Yes

Then, we test the identification accuracy and efficiency of the proposed method. The test proceeds as follows. The datasets A, B, C, D, and E are first pre-processed, and each is then split into 12 subsets, each containing one hour of flows. Next, we select one subset from each of the sets A, B, C, D, and E. The subset from E is used as non-P2P training data, and the subsets from A, B, C, and D are used as P2P 1, P2P 2, P2P 3, and P2P 4. The selected subsets are trained according to Figure 3. We then use all 60 subsets as the input of the method and record the output and the processing time.

Figure 5 shows the identification accuracy and efficiency of the unbalanced binary prediction. Figure 5(a) records the misclassification rate of the 5 datasets in each hour. The misclassification rate of set E is always the lowest in each hour, showing that the unbalanced principle is successfully realized. Figure 5(b) shows the CPU time of predicting the subsets of set E. This paper uses a machine with a 3.0 GHz Celeron CPU and 512 MB of memory. The largest CPU time is 0.07 s and the lowest is 0.01 s. More importantly, predicting the other sets costs less CPU time than predicting set E, because set E has many more flows (460.24k) than the others.
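The two-stage prediction exercised in this test (binary prediction first, multi-class prediction only for flows identified as P2P) can be sketched as a small cascade. The sketch assumes the two trained models are available as plain callables; `is_p2p` and `app_of` are hypothetical stand-ins for the unbalanced binary model and the multi-class P2P model, not an actual SVM library interface.

```python
def classify_flows(feature_vectors, is_p2p, app_of):
    """Cascade of Figure 3: one input stream, n + 1 possible outputs.

    is_p2p(features) -> bool   stands in for the unbalanced binary model
    app_of(features) -> str    stands in for the multi-class P2P model
    """
    labels = []
    for features in feature_vectors:
        if not is_p2p(features):
            # Flows rejected by the binary stage never reach stage two.
            labels.append("non-P2P")
        else:
            labels.append(app_of(features))
    return labels
```

Keeping the stages separate means the bias toward low False Positive is set only in the first stage, while the second stage only has to tell P2P applications apart.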

Figure 5. Binary Prediction: (a) accuracy, (b) efficiency

To evaluate the multi-class classification accuracy of this method, this paper compares the proposed method with the port-based method. In the port-based method, traffic using ports 6881-6889 is classified as BitTorrent, ports 4661-4665 as eDonkey, and port 1214 as Kazaa. There is no commonly used port for pplive. The comparison results are shown in Figure 6. Figures 6(a) and 6(c) show that the port-based method no longer works for BitTorrent and Kazaa, while for eDonkey it can still identify some traffic. The four figures show that the proposed method can classify at least 75% and at most 99% of the whole traffic at the app level.
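For reference, the port-based baseline used in this comparison amounts to a lookup over the port ranges listed above. The sketch below assumes a flow matches when either its source port or its destination port falls in a range; the paper does not say which port is checked, and the names `PORT_RANGES` and `port_based_label` are illustrative.

```python
# Port ranges quoted in the text; pplive has no commonly used port.
PORT_RANGES = {
    "BitTorrent": range(6881, 6890),  # 6881-6889 inclusive
    "eDonkey": range(4661, 4666),     # 4661-4665 inclusive
    "Kazaa": range(1214, 1215),       # 1214 only
}


def port_based_label(spt, dpt):
    """Return the application whose port range contains spt or dpt, else None."""
    for app, ports in PORT_RANGES.items():
        if spt in ports or dpt in ports:
            return app
    return None
```

A flow that uses dynamically assigned ports on both ends returns `None` here, which is why this baseline misses most traffic of applications that avoid their default ports.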

Figure 6. Multi-class Prediction Comparison: (a) BitTorrent, (b) eDonkey, (c) Kazaa, (d) pplive

In sum, the proposed method meets the criteria posed in the introduction of this paper. First, it successfully detects three popular file-sharing P2P applications (BitTorrent, eDonkey, and Kazaa) and one multimedia application that is popular in China (pplive). Second, pplive is a newborn P2P application, and it is detected accurately by the proposed method without port detection or payload decoding, which demonstrates that the proposed method is immune to newborn P2P applications and to possible encryption of the payload. Third, Figure 5(b) shows that the proposed method can process large amounts of traffic at high speed. Last but not least, the method successfully classifies 4 types of P2P traffic at the app level without accessing any payload information.

5. Conclusions

In this paper, a new method based on support vector machines (SVMs) is proposed to solve the problem of P2P traffic identification and app-level classification. Experimental results show that this method has high efficiency and promising accuracy. Since little P2P file-sharing traffic uses the UDP protocol, this paper does not cover P2P traffic identification for UDP; this is left for future work. Verifying the applicability of the proposed method in large-scale networks and security-sensitive networks is also future work.

6. References

[1] CacheLogic. http://www.cachelogic.com.
[2] Thomas Karagiannis, Andre Broido, Nevil Brownlee, Kc Claffy, Michalis Faloutsos, "Is P2P dying or just hiding?", In Globecom, Dallas, TX, USA, November 2004.
[3] S. Sen, O. Spatscheck, D.M. Wang, "Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures", WWW2004, New York, USA, 2004.
[4] Matthew Roughan, Subhabrata Sen, Oliver Spatscheck, Nick Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification", In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 135-148, New York, NY, USA, 2004. ACM Press.
[5] H. Bleul, E.P. Rathgeb, "A Simple, Efficient and Flexible Approach to Measure Multi-Protocol Peer-To-Peer Traffic", IEEE International Conference on Networking (ICN'05), 2005.
[6] Fivos Constantinou, Panayiotis Mavrommatis, "Identifying Known and Unknown Peer-to-Peer Traffic", 2005.
[7] T. Karagiannis, A. Broido, M. Faloutsos, "Transport Layer Identification of P2P Traffic", Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, ACM Press, New York, USA, 2004, pp. 121-134.
[8] M.S. Kim, H.J. Kang, J.W. Hong, "Towards Peer-to-Peer Traffic Analysis Using Flows", Lecture Notes in Computer Science, Springer, Heidelberg, Germany, 2003, pp. 55-67.
[9] Dedinski, I., De Meer, H., Han, L., Mathy, L., Pezaros, D. P., Sventek, J. S., Xiaoying, Z., "Cross-Layer Peer-to-Peer Traffic Identification and Optimization Based on Active Networking", in Proceedings of the Seventh Annual International Working Conference on Active and Programmable Networks (IWAN'05), CICA, Sophia Antipolis, French Riviera, La Cote d'Azur, France, November 21-23, 2005.
[10] N. Cristianini, J. Shawe-Taylor, "An introduction to support vector machines and other kernel-based learning methods", Cambridge University Press, Cambridge, UK, 2000.
[11] C. Cortes and V. Vapnik, "Support vector networks", Machine Learning, 20:273-297, 1995.
[12] H. Drucker, C. Burges, L. Kaufman, A. Smola, V. Vapnik, "Support Vector Regression Machines", in M. Mozer, M. Jordan, and T. Petsche (eds.): Neural Information Processing Systems, Vol. 9. MIT Press, Cambridge, MA, 1997 (in press).
[13] V. Vapnik, "The nature of statistical learning theory", New York: Springer-Verlag, 1995.
[14] V. Vapnik, S. Golowich, and A. Smola, "Support vector method for function approximation, regression estimation and signal processing", in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1997, vol. 9.
[15] K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multi-class SVMs", JMLR, 2001.
[16] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, "Support Vector Learning for Interdependent and Structured Output Spaces", ICML, 2004.
[17] T. Joachims, "Making large-Scale SVM Learning Practical", Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
[18] C. J. Burges, "A tutorial on support vector machines for pattern recognition", Knowledge Discovery and Data Mining, vol. 2, pp. 121-167, June 1998.
[19] M. N. Wernick, "Pattern classification by convex analysis", J. Opt. Soc. Amer. A, vol. 8, pp. 1874-1880, 1991.
[20] B. Schölkopf, A. J. Smola, R. Williams, and P. Bartlett, "New support vector algorithms", Neural Computation, vol. 12, pp. 1083-1121, 2000.
[21] K. Crammer, Y. Singer, "On the learnability and design of output codes for multi-class problems", In: Proceedings of the 13th Annual Conference on Computational Learning Theory, 28 June - 1 July 2000, pp. 35-46.
