A Methodology for P2P File-Sharing Traffic Detection ∗

Angelo Spognardi†, Alessandro Lucarelli, Roberto Di Pietro
Università di Roma “La Sapienza”, Dipartimento di Informatica
Via Salaria, 113, 00198-Roma, Italy
{spognardi, dipietro}@di.uniroma1.it, ale [email protected]

Abstract

Since the widespread adoption of peer-to-peer (P2P) networking during the late ’90s, P2P applications have multiplied. Their diffusion is witnessed by the fact that P2P traffic accounts for a significant fraction of Internet traffic. Further, there are concerns regarding the use of these applications, for instance when they are employed to share copyright-protected material. Hence, there are many situations in which detecting P2P traffic is desirable. In the late ’90s, P2P traffic was easily recognizable, since P2P protocols used application-specific TCP or UDP port numbers. However, P2P applications were quickly given the ability to use arbitrary ports in an attempt to go undetected, and nowadays they explicitly try to camouflage the traffic they originate. Despite the existence of rules to detect P2P traffic, no methodology exists to extract them from applications without the use of reverse engineering. In this paper we develop a methodology to detect P2P traffic. It is based on the analysis of the protocol used by a P2P application, the extraction of patterns unique to the protocol, the coding of such patterns in rules to be fed to an Intrusion Detection System (IDS), and the validation of the patterns via network traffic monitoring with SNORT (an open source IDS) fed with the devised rules. In particular, we present a characterization of the P2P traffic originated by the OpenNap and WPN protocols (implemented in the WinMx application) and by the FastTrack protocol (used by KaZaA), obtained using our methodology, which shows the viability of our proposal. Finally, we conclude the paper by presenting our ongoing efforts to extend the methodology to exploit differences between centralized and decentralized P2P protocols, as well as to characterize encrypted traffic, and we highlight a new research direction in the identification of P2P traffic.

∗ This work was partially funded by the PRIN 2003 Web-based Management and Representation of Spatial and Geographic Data project, supported by the Italian MIUR, and by the WEB-MINDS project, supported by the Italian MIUR under the FIRB program. Roberto Di Pietro is also with CNR-ISTI, WNLab-Pisa. Angelo Spognardi is the contact author.
† Authors are in reverse alphabetical order.
1 Introduction
P2P networking can be seen as a network of computers that does not use the client/server paradigm but is based on the notion of peers. Peers may differ in processing capabilities, connection speed, local network configuration or operating system. P2P networks can offer the functionalities required to implement a generic application, as in [3, 12]. The lack of centralized authorities in P2P networks results in a totally distributed configuration of directly connected peers. Some P2P networks also have a small set of special nodes, known as super nodes [9, 8], that perform special tasks, such as query handling, which typically require more resources.

One common application of P2P networks is file sharing among users. Download operations typically involve two phases:

Signaling phase: a peer searches for the content and determines which peers are eligible to provide it. In many protocols this phase does not involve any direct communication with the peer that will eventually provide the content.

Download phase: the requester contacts one or more of the eligible peers to directly download the desired content.

Detecting P2P file sharing traffic can be required in several contexts. For instance, in an enterprise network, administrators would like to provide a degraded service (via rate-limiting, service differentiation or blocking) to P2P traffic, to ensure good performance for enterprise-critical applications and/or to enforce corporate rules regulating P2P usage [17]. Broadband ISPs would like to limit P2P traffic to reduce the cost they are charged by upstream ISPs. All
Proceedings of the 2005 Second International Workshop on Hot Topics in Peer-to-Peer Systems (HOT-P2P'05) 0-7695-2417-6/05 $20.00 © 2005 IEEE 0-7695-2426-5/05 $20.00 © 2005 IEEE Authorized licensed use limited to: Universita degli Studi di Roma La Sapienza. Downloaded on July 15, 2009 at 13:29 from IEEE Xplore. Restrictions apply.
these activities require the capability to accurately identify P2P network traffic. Further, identifying the users performing file sharing inside a network can be useful to support forensic investigations. However, application identification inside IP networks can, in general, be difficult. First-generation P2P applications used well-defined port numbers to send file sharing traffic, hence the identification of P2P traffic was a relatively easy task. In response to this, P2P applications acquired the capability to use any port number. Furthermore, recent P2P networks tend to intentionally camouflage the traffic they generate [11], to circumvent filtering firewalls as well as possible litigation. Some P2P applications also support encryption, while others adopt file fragmentation: these applications split the file to be sent into chunks, where each chunk is eventually sent by a different peer.

There are some projects in the area of P2P traffic detection: the SNORT project group itself proposes some rules for the detection of P2P traffic, and there exist commercial applications (like p2pwatchdog [13]) whose only purpose is to catch and monitor P2P traffic. However, neither the SNORT community nor the p2pwatchdog developers explain how to write rules for all P2P file-sharing programs; p2pwatchdog, furthermore, is neither open source nor free. What is lacking, then, is a methodology to write IDS rules for P2P traffic detection or, more generally, a flexible methodology to identify any application-specific traffic.
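To make the limits of first-generation detection concrete, the following minimal Python sketch classifies a flow purely by port number. The port table is an assumption added for illustration (commonly cited defaults, not taken from this paper), and `classify_by_port` is a hypothetical helper name.

```python
# Illustrative default ports only (an assumption, not data from the paper);
# real applications used additional ports.
DEFAULT_P2P_PORTS = {
    1214: "FastTrack (KaZaA)",
    6346: "Gnutella",
    6699: "Napster/OpenNap",
}


def classify_by_port(src_port, dst_port):
    """Return a protocol name if either port is a known default, else None."""
    for port in (dst_port, src_port):
        if port in DEFAULT_P2P_PORTS:
            return DEFAULT_P2P_PORTS[port]
    return None
```

A client configured to listen on an arbitrary port (for example 80) is simply reported as `None` by such a classifier, which is why payload-level patterns are needed.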
1.1 Main Contributions and Road-map
In this paper, we provide a methodology to identify P2P traffic. The methodology is based on the following steps: analysis of the protocol of interest; identification of patterns specific to the P2P protocol that can be revealed by an IP packet level analysis; coding of these patterns in rules that can be fed to an IDS; network monitoring of the identified patterns with an effective IDS fed with the devised rules. Note that the IDS-like approach does not introduce any delay in the network, and requires only little overhead on the check-point where it is installed. Further, the proposed methodology is shown to be extensible to the analysis of P2P protocols that encrypt the traffic they generate, and to efficiently leverage characteristics introduced by decentralized P2P file sharing applications. Our P2P traffic detection tool has been successfully deployed and is currently running in a corporate LAN.

The remainder of this paper is organized as follows. Section 2 reports related work in the field. Section 3 describes the working hypothesis as well as the methodology to deal with P2P traffic detection. Section 4 highlights the technical issues involved in identifying P2P traffic in real time inside the network. The methodology is applied to the
OpenNap, WPN and FastTrack protocols, run by the WinMx and KaZaA applications. In this section we also report the rules for the SNORT IDS that catch the protocol signature patterns. Section 5 reports our conclusions and a few research directions.
2 Related work
Early research on P2P traffic characterization was based on default network ports [18], [16]. Recent work [7] uses application signatures to characterize the workload of KaZaA downloads, while in [17] signatures for a wide range of P2P applications are provided. However, these studies do not evaluate the accuracy, scalability or robustness of their signatures, or fail to describe the methodology adopted, or do not consider some interesting protocols. Signature-based traffic classification has mainly been performed in the context of network security, such as intrusion and anomaly detection (e.g. [2], [1]), where one typically seeks to find a signature for an attack. The research in [19], [1], [10] focuses on aggregated data traffic, to distinguish regular traffic from that originated by P2P applications. These works provide a view of local P2P usage, while [18] reports a complementary backbone view, that is, the analysis of data gathered from a tier-1 Internet Service Provider. Our approach is similar to that reported in [4], [17] in the sense that, as a final result, we provide a set of signatures to identify P2P file sharing traffic. Our approach differs from [4], [17] in that the proposed methodology is clearly described and combines both signature and intrusion detection techniques.
3 Methodology
In this section we provide the methodology employed to detect P2P file sharing traffic. Note that the proposed methodology is general enough to be easily adapted to any P2P file sharing protocol. To show its flexibility, we have applied the methodology to the following P2P protocols: OpenNap, WPN and FastTrack. Once specific patterns for the protocol of interest have been found, it is possible to feed any IDS with the appropriate rules to identify such patterns. In our specific case, we have expressed such patterns in terms of SNORT rules. Note that SNORT is only one of the possible IDSs: we chose SNORT because it is the most popular IDS, owing to its history, because it is open source, and because its rules are easy to understand. Moreover, SNORT has a large community of developers, it is extensible with plug-ins and add-ons, and it works on every operating system (Windows, Unix/Linux
and Solaris). However, other sniffers (like Bro [14]) can equally be used for the rule-writing phase. Also, the focus of this paper is not on writing SNORT rules, but on proposing a methodology for studying applications that leads to IDS rules, within different peer operating contexts such as encryption and decentralization.
3.1 Working Hypothesis
Our focus is on the protection of a company LAN from unauthorized network traffic, in particular network traffic originated by P2P file sharing applications. Our testbed was composed of different systems running the Linux and Windows operating systems. The network was partitioned using layer-3 switches, and Internet connections were filtered by a firewall. To simulate the network traffic, we set up two different systems. The first one was a computer running Linux Red Hat 7, equipped with a network adapter (Eth0) configured in promiscuous mode so as to intercept all the traffic on the portion of the LAN it was connected to. On this machine we installed WinDump [22] to analyze incoming data packets, as well as SNORT [20, 15] to check whether our protocol analysis had correctly identified patterns of P2P file sharing. The second system was running Windows XP, with WinMX [23], KaZaA [9] and Ethereal [5] installed. The data collected from this system was used to reveal the P2P protocol architecture.
3.2 Approach
There exist different approaches to characterizing the traffic of a protocol. One of these is to reverse engineer an application that uses the protocol. An attempt to reverse engineer the KaZaA application has been made by the giFT-FastTrack community [6], to identify the proprietary schemes of the FastTrack protocol. However, this approach can disclose details that are not relevant to traffic detection, while overlooking features that allow a straightforward characterization of P2P traffic. We maintain a high-level approach; that is, we focus on the interface provided by the client adopting the protocol of interest. In particular, we analyse the messages generated by user-triggered actions. By analysing the network traffic originated by actions on the client interface, it was possible (using the netstat command) to acquire information about the network protocols used, open connections, IP addresses and ports. In this way, we limited the search space to those protocols used by the client. With the use of Ethereal, WinDump and netstat, we observed, for example, that OpenNap uses TCP and, occasionally, UDP. At the application level, we observed a
modified version of HTTP/1.1, as well as the DNS protocol, used to resolve server names to IP addresses. Analysing the payloads, we could capture, for instance, the lists of shared files as well as the login and welcome messages triggered by the OpenNap protocol. Since some of these elements are recurrent and fixed, they were used to generate IDS rules that recognize traffic generated by file-sharing clients. However, recognition of clear text is not always possible. For instance, FastTrack and WPN use some techniques to encrypt messages. Nevertheless, the analysis of the generated traffic is still possible and effective, as will be seen in Section 4.2.2.

3.2.1 Objective of traffic generation
From the analysis of the generated traffic, we strove to understand how a client acquires knowledge of other peers in the network and which types of connections it establishes with them. We used “what-caused-what” relations: every action we requested the client to perform was followed by an analysis of the triggered messages. The message analysis was done step by step: first, we tried to understand the structure of every packet composing a message; then, we tried to discover the actual content of the whole message; as a last step (Section 4.1.2), we tried to understand the rationale behind its generation.

To trigger the client, we analysed the client's graphical interface. Then, with netstat, we looked for the established connections. Finally, we used the sniffer to analyse the operations on these connections: for every client action (like client start-up or query submission) we sniffed the traffic to identify packet formats and message payloads. With this type of analysis, we developed a model describing how the client works and studied the different protocols used by the same client (for example, WinMx uses both the OpenNap and WPN protocols). We were able to categorize the client actions into these recurrent phases:

Discovering and Booting: in this phase a starting client finds a network to log in to and searches for active peers on it. Information about active peers is obtained by consulting a pool of central servers.

Sharing: in this phase, clients send the list of their files.

Querying and Lookup: in this phase, a peer searches for a peer storing a file of interest. The result of a search is the address of those clients that share the requested file.

Downloading: in this phase, two clients exchange a file.
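The “what-caused-what” bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the capture is reduced to `(timestamp, payload_length)` pairs, user actions are logged by hand as `(timestamp, label)` pairs, and each packet is attributed to the most recent action. The function name `attribute_packets` is hypothetical and is not part of any tool used in the paper.

```python
from bisect import bisect_right


def attribute_packets(actions, packets):
    """Attribute each packet to the latest user action that preceded it.

    actions: list of (timestamp, label), sorted by timestamp.
    packets: list of (timestamp, payload_length).
    Returns a dict mapping each label to the payload lengths observed
    after that action and before the next one.
    """
    times = [t for t, _ in actions]
    result = {label: [] for _, label in actions}
    for ts, length in packets:
        i = bisect_right(times, ts) - 1  # index of the last action at or before ts
        if i >= 0:  # packets seen before the first action are ignored
            result[actions[i][1]].append(length)
    return result
```

Triggering one action at a time and leaving a pause between actions keeps this attribution unambiguous, which mirrors the step-by-step procedure followed in the analysis.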
3.3 Issues

3.3.1 Encryption
As mentioned above, file-sharing clients can use message encryption techniques. In this way, the clients can hide the shared or requested files. To overcome this difficulty, the analysis focused on the identification of recurrent sizes of TCP
and UDP payloads, bound to specific behaviors of the applications. Another technique was to change some user information and watch how the packet sizes changed. For instance, for the FastTrack protocol we noticed a recurrent string of 26 bytes when establishing a connection to the network. Further, modifying the username and observing the resulting change in packet size allows us to bind a specific packet, among those sent by the client, to the notification of the username. Note that a successful way to recognize this type of traffic is to catch sequences of recurrent strings of fixed sizes.
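The two techniques above can be sketched in a few lines of Python. The sketch assumes that each captured session has already been reduced to an ordered list of payload sizes, and that lengthening the username by a given number of characters grows the corresponding packet by the same number of bytes; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter


def recurrent_sizes(sessions, min_fraction=1.0):
    """Return the payload sizes seen in at least min_fraction of the sessions.

    sessions: list of lists of payload sizes, one list per captured session.
    """
    counts = Counter()
    for sizes in sessions:
        counts.update(set(sizes))  # count each size once per session
    threshold = min_fraction * len(sessions)
    return sorted(size for size, n in counts.items() if n >= threshold)


def size_shift_positions(baseline, modified, delta):
    """Return the positions whose payload size changed by exactly delta bytes.

    baseline, modified: payload sizes of the same action sequence, captured
    before and after changing a user setting such as the username.
    """
    return [i for i, (a, b) in enumerate(zip(baseline, modified)) if b - a == delta]
```

For example, a size of 26 bytes that shows up in every connection attempt would be reported by `recurrent_sizes`, while `size_shift_positions` points at the packet that most likely carries the username.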
3.3.2 Firewall

P2P protocols behave differently depending on whether or not the P2P client is protected by a firewall. The analysis therefore has to take both possibilities into account. That is what we did for the OpenNap, WPN and FastTrack protocols.

4 Experimental Results

In this section we show the results of our analysis; that is, we describe the OpenNap, WPN and FastTrack protocols during their execution. Moreover, we report how to identify specific network traffic patterns that reveal file-sharing activities originated by these protocols, and how to write IDS rules that detect such activities.

4.1 OpenNap

4.1.1 Protocol analysis

The OpenNap protocol is based on a pool of central servers: all the peers that want to join an OpenNap network establish a TCP connection with one of these servers. A central server maintains a list of all the files shared by users, but does not store any file. Following a client-server model, every user can ask the server which peers store the requested file, while the download is performed between peers (the requesting peer and the storing peer), via a direct TCP connection. All the following operations have been reconstructed step by step, triggering a single action on the client and analysing the generated traffic.

Client→Server: connection and login
Before starting a download session, the user has to specify some information, such as user name, password and, in particular, the central server list. After establishing a TCP connection with a server, the OpenNap client sends a login message; this message contains information about the user: user nick-name, password, listening port, client type and connection-line speed. An example of a captured fragment can be found in [21]. The traffic generated by this phase contains the name and the version of the software used. This information can be used to devise ad hoc IDS rules (Rule 2, Section 4.1.2).

Server→Client: answer to login message
To answer a client login request, the server replies with a message that contains the strings VERSION, SERVER and other information (like the string Welcome and some statistics on active users and shared files). This information is sent over several packets, because TCP over Ethernet limits the MSS (Maximum Segment Size) to 1460 bytes.

A fragment of traffic can be found in [21]. The first Ethernet packet of the answer has a well-defined structure and can be used by an IDS as a recurrent element, to identify an OpenNap connection over the network. In the next section, in fact, we report a rule for SNORT that searches the payload of the TCP messages for the two strings VERSION and SERVER.

Client→Server: list of shared files
After the reply of the server, the client sends the list of its own shared files, according to this simple format:

“statistics”

An example of this kind of traffic is reported in [21]. This kind of message can be used to write IDS rules meant to show the names of the files shared by a peer (see Rule 4, Section 4.1.2).

Client→Server: search request
To submit a query to the server, the user must fill in a form in the graphical interface of WinMx with a few words concerning the requested file: those words are the criteria used by the server to perform the lookup in the file list. The server, in fact, returns every file whose name includes the requested words (e.g. Vasco Rossi Generale). Other search criteria can be specified, like information about the performance of the storing peers. The structure of the query sent by the client to the server follows.

SPEED>

The structure of these messages (see [21] for an example) can be used to write IDS rules to detect OpenNap protocol traffic (see Rule 4, Section 4.1.2).

Server→Client: search response
The answer to a query is a list of all the known files that satisfy the search criteria. In addition to the file name, a list element also contains: the IP address of the storing peer, the complete remote path, the file format, its size and other information on the file type (for instance: bit-rate, frequency and duration for an mp3 file). The structure of the server response is the following:
$HOME_NET any (dsize:17; content:"|28|"; offset:0; depth:1; rawbytes; content:"|4b 61 5a 61 41 00|"; offset:11; depth:17; msg:" Supernode Response";)
This rule catches the UDP flood of answering supernodes. It alerts on catching the pong from an active supernode. The rule searches for the value 0x28 in the first byte and for the string “KaZaA” (followed by a zero byte) starting at byte offset 11.
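As a cross-check of what the rule above matches, the same test can be written directly over a raw UDP payload. This is a minimal sketch, not a replacement for the IDS rule: it assumes the payload has already been extracted from the packet, and the function name is hypothetical. Since the payload must be exactly 17 bytes long, the six-byte pattern searched from offset 11 can only occupy bytes 11 to 16.

```python
KAZAA_TAG = b"KaZaA\x00"  # bytes 4b 61 5a 61 41 00 from the rule


def looks_like_supernode_response(udp_payload: bytes) -> bool:
    """Mirror the rule: 17-byte payload, 0x28 first, "KaZaA\\0" at offset 11."""
    return (
        len(udp_payload) == 17
        and udp_payload[0] == 0x28
        and udp_payload[11:17] == KAZAA_TAG
    )
```

The ten bytes between the first byte and the tag are not inspected, exactly as in the rule.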
Rule 2
# alert on sending a positive response to a request for a shared file
alert tcp $EXTERNAL_NET any -> $HOME_NET any (flow:from_client; content:"|48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b|"; offset:0; depth:15; content:"KazaaClient"; session:printable; msg:"Request of a shared file with KaZaA";)
This rule alerts when a TCP connection carries a message starting with the string HTTP/1.1 200 OK, that is, when the peer is starting a download session. To reduce false positives, the string “KazaaClient” is also searched for.
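The two conditions checked by Rule 2 can be expressed over a reassembled TCP payload as in the following sketch. It is an illustration only: the function name is hypothetical, and the sample payloads in the usage note are made up rather than captured KaZaA traffic.

```python
HTTP_OK = b"HTTP/1.1 200 OK"  # 15 bytes, expected at the very start of the payload
CLIENT_TAG = b"KazaaClient"   # second pattern, searched anywhere in the payload


def looks_like_kazaa_download(tcp_payload: bytes) -> bool:
    """Return True if the payload starts with the HTTP 200 line and names the client."""
    return tcp_payload.startswith(HTTP_OK) and CLIENT_TAG in tcp_payload
```

With a made-up payload such as `b"HTTP/1.1 200 OK\r\nServer: KazaaClient\r\n\r\n"` the function returns `True`, while an ordinary web response without the client string returns `False`; this is the false-positive reduction described above.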
5 Conclusion and future work
In this paper we have presented a methodology to detect P2P file sharing traffic based on: analysis of the P2P protocol; identification of patterns specific to the P2P protocol that can be revealed by an IP packet level analysis; coding of these patterns in rules that can be fed to an IDS; verification of the identified patterns via network monitoring with the IDS fed with the devised rules. The preliminary results presented in this paper lead to a complete characterization of the traffic generated by the OpenNap, WPN and FastTrack protocols. The devised rules make it possible to identify the IP addresses of the systems inside a network that are performing file sharing. Note how this can be helpful in the accountability process required by judicial disputes or, better, to discourage unlawful behavior. Further, the identification of P2P traffic does not introduce any delay in the network. The proposed methodology has shown its flexibility: we have been able to analyse standard protocols (OpenNap) as well as protocols that encrypt their traffic and are fully decentralized (WPN and FastTrack). Finally, note that a new research area is still to be addressed: traffic detection in multipath protocols.
Acknowledgements

The authors would like to thank Prof. Luigi V. Mancini for his insightful comments and valuable discussions.
References

[1] P. Barford, J. Kline, D. Plonka, and A. Ron. A signal analysis of network traffic anomalies. In IMW ’02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, pages 71–82. ACM Press, 2002.
[2] P. Barford and D. Plonka. Characteristics of network traffic flow anomalies. In IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pages 69–73. ACM Press, 2001.
[3] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001.
[4] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of internet chat systems. In IMC ’03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51–64. ACM Press, 2003.
[5] http://www.ethereal.com/.
[6] http://gift-fasttrack.berlios.de/.
[7] K. P. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan. Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In SOSP ’03: Proceedings of the 19th ACM symposium on Operating systems principles, pages 314–329. ACM Press, 2003.
[8] A. Gupta, B. Liskov, and R. Rodrigues. One hop lookups for peer-to-peer overlays. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS-IX), pages 7–12, Lihue, Hawaii, May 2003.
[9] http://kazaa.com.
[10] A. Klemm, C. Lindemann, M. K. Vernon, and O. P. Waldhorst. Characterizing the query behavior in peer-to-peer file sharing systems. In IMC ’04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 55–67. ACM Press, 2004.
[11] N. Leibowitz, M. Ripeanu, and A. Wierzbicki. Deconstructing the kazaa network. In Proceedings of the 3rd IEEE Workshop on Internet Applications (WIAPP’03), June 2003.
[12] A. Mei, L. V. Mancini, and S. Jajodia. Secure dynamic fragment and replica allocation in large-scale distributed file systems. IEEE Trans. on Parallel and Distributed Systems, 14(9):885–896, 2003.
[13] http://www.p2pwatchdog.com/.
[14] V. Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks (Amsterdam, Netherlands: 1999), 31(23–24):2435–2463, 1999.
[15] R. Rehman. Intrusion Detection with SNORT: Advanced IDS Techniques Using SNORT, Apache, MySQL, PHP, and ACID. Prentice Hall, 2003.
[16] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy. An analysis of internet content delivery systems. SIGOPS Oper. Syst. Rev., 36(SI):315–327, 2002.
[17] S. Sen, O. Spatscheck, and D. Wang. Accurate, scalable in-network identification of p2p traffic using application signatures. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 512–521. ACM Press, 2004.
[18] S. Sen and J. Wang. Analyzing peer-to-peer traffic across large networks. IEEE/ACM Trans. Netw., 12(2):219–232, 2004.
[19] S. Sen and J. Wang. Analyzing peer-to-peer traffic across large networks. In ACM SIGCOMM Internet Measurement Workshop. Proceedings, November 2002.
[20] http://www.snort.org/.
[21] A. Spognardi, A. Lucarelli, and R. Di Pietro. TRWEBMINDS-46: A methodology for p2p file-sharing traffic detection. Technical report, Web-Minds, CINI-Unit of Rome, May 2005.
[22] http://windump.polito.it/.
[23] http://www.winmx.com/.