Exploiting Spatial Locality to Improve Peer-to-Peer System Performance
Zhiyong Xu and Yiming Hu
Department of Electrical & Computer Engineering and Computer Science, University of Cincinnati
E-mail: {zxu,yhu}@ececs.uc.edu
Abstract
Routing performance is critical for Peer-to-Peer (P2P) systems to achieve high performance. Current routing algorithms concentrate on creating a well-organized network architecture to boost routing performance. However, each routing procedure returns only the location information of the requested file, and the characteristics of the system workload are seldom considered. In this paper, we propose the UCP2P routing algorithm, which introduces the “superobject” to take advantage of spatial locality among file accesses. When a client retrieves a file, the UCP2P routing algorithm may also send the location information of several other correlated files within the same routing procedure. No further routing procedure is needed if the client later requests any of these files. This mechanism can greatly reduce the routing overhead. Our preliminary simulation results show that UCP2P reduces routing cost by up to 25% compared with Chord.
1 Introduction
Peer-to-Peer (P2P) file sharing systems [1] [2] [3] have become the fastest growing applications on the Internet since the appearance of Napster in 1999. This kind of application has drawn much attention from both industry and the research community. Its decentralized control, autonomy and load balancing make it very attractive in some environments. However, these appealing properties also make system design more difficult than in traditional Client/Server systems, especially in large-scale environments. Because routing procedures are the most frequently executed operations, an efficient routing algorithm is fundamental for a P2P system to achieve high performance.

In Napster [1], a group of central servers provides the routing service. A client that wants to retrieve a file must contact a server and get the location information before it can connect with the peer that has the desired file. The routing algorithm used in Napster clearly follows the Client/Server model, so it still suffers from the problems of Client/Server systems, such as “hot spots” and a “single point of failure”. Gnutella [2] takes another approach: its routing algorithm is based on a broadcasting mechanism, which makes it more robust than Napster, but such a scheme greatly increases system routing traffic. Chord [4], Pastry [5], Tapestry [6] and CAN [7] are second generation P2P routing algorithms with good scalability. All these algorithms share the same characteristics: each node stores only a small amount of system routing information and takes equal responsibility for system routing tasks. A routing request is fulfilled by the coordination of several peers and is guaranteed to finish within a small number of routing hops (normally O(log N), where N is the number of system nodes) even in a system with millions of peers.

Figure 1: Current routing problem and our solution

These second generation routing algorithms focus on designing a well-organized P2P network architecture to achieve high routing performance. The characteristics of P2P system workload are seldom considered. Several research papers [8] [9] pointed out that web traffic exhibits temporal and spatial locality: some web pages are accessed more frequently than others, and pages on the same server are likely to be accessed in succession. P2P system workload also has temporal and spatial locality. For example, in a music exchanging system, the most
popular songs are requested more frequently than others, and a client who retrieves a song by a specific musician is likely to retrieve other songs by the same musician in subsequent requests. In current P2P routing algorithms, however, the identifiers of these correlated files are quite different (because collision-free algorithms are used to generate them) and their location information is stored on widely spread peers. Each routing procedure returns the location information of the one requested file and nothing else, so an independent routing procedure is needed to locate each of these files. This problem is illustrated in Figure 1: A1, A2 and A3 are the location information of three files, stored on nodes P1, P2 and P3 respectively. Although the files are very likely to be retrieved in subsequent requests, three separate routing procedures R1, R2 and R3 must be executed with current routing algorithms.

In this paper, we propose UCP2P, a superobject-based routing algorithm that takes advantage of the spatial locality of file accesses. While the system is running, file access patterns are analyzed, and the location information of files that are likely to be retrieved in succession is grouped together in superobjects. A superobject identifier (superid) is generated for each superobject, and the system stores the superobject on the peer whose nodeid is numerically closest to this superid. In UCP2P, a single routing procedure can return the location information of multiple correlated files. If the same client retrieves any of these files in subsequent requests, no routing procedure is required and the client can get the desired file directly from the peer. In the previous example, the location information of all three files is now stored on node P’, and only one routing procedure R’ is executed for the client to get the location information of all three files.
2 Superobject
In the UCP2P system, we use superobjects to maintain spatial locality information about file accesses. Like a peer or a file, each superobject is assigned a superid in the same name space using the same collision-free algorithm. A superobject contains the location information of several files that share the same superid. The content of a superobject is dynamic. No superobject exists at the beginning; the system creates and modifies superobjects according to its file retrieval logs. A more detailed discussion is given in Section 3.4.

In the UCP2P system, all superobjects form a hierarchical architecture. A superobject in a higher layer may contain several superobjects in lower layers. A simple illustration is shown in Figure 2. A layer 1 superobject “Music” contains several layer 2 superobjects, such as “Music-classic”, “Music-popular”, “Music-country” and “Music-jazz”. The layer 2 superobject “Music-popular” in turn contains several layer 3 superobjects.

Figure 2: Superobjects hierarchy. Labels recovered from the figure: layer 1 (Software, Movie, Music); layer 2, category (Microsoft, Oracle, IBM; classic, popular, country, jazz); layer 3, musician (Madonna, Jackson, Huston); files, song (Erotica, Evita, True Blue).

A superobject in a high layer may contain thousands of files, and it is impractical to exploit spatial locality among such a huge number of files. In most cases, only a small fraction of these files are strongly correlated, so high-layer superobjects are of little practical use. In UCP2P, we use only the lowest-layer superobjects. For example, a layer 3 superobject like “Music-popular-Madonna” keeps the location information of the files “Erotica”, “Evita” and “True Blue”, which are likely to be retrieved in succession. Such a layer 3 superobject is suitable for our purpose.
3 UCP2P System Description
In this section, we describe the UCP2P system design. By taking advantage of spatial locality in the system workload, the UCP2P routing algorithm can achieve better performance than current P2P routing algorithms. It can be implemented as a module on top of any current routing algorithm, such as Chord, Pastry or CAN, and it uses that algorithm as its basic routing scheme with minor modifications to utilize the superobjects.
3.1 File Identifiers
Current P2P routing algorithms use collision-free algorithms such as SHA-1 to generate a nodeid for each peer and a fileid for each file, avoiding possible duplication. For these algorithms, a fileid is the only key necessary to store a file and perform a routing procedure. In the UCP2P system, however, to exploit spatial locality, several superids are also generated with the same collision-free algorithm when a new file is inserted into the system. For example, the file “Evita” has three superids, one per layer, corresponding to the superobjects “Music”, “Music-popular” and “Music-popular-Madonna”. The file “True Blue” has the same three superids as “Evita”. In the current UCP2P system, we use only the layer 3 superobject. Thus a file has two identifiers, a fileid and a superid. Like a file, a superobject is stored on the node whose nodeid is numerically closest to its superid.
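The identifier scheme above can be illustrated with a minimal sketch. It assumes that identifiers are SHA-1 digests read as 160-bit integers and that superobject names are built by joining layer names with hyphens, as in the “Music-popular-Madonna” example. The function names (`make_id`, `superobject_names`, `file_identifiers`, `closest_node`) are hypothetical, and `closest_node` uses plain absolute distance without ring wrap-around, which is a simplification rather than the exact placement rule of any particular base algorithm.

```python
from __future__ import annotations

import hashlib


def make_id(name: str) -> int:
    """Map a name to a 160-bit identifier using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def superobject_names(layers: list[str]) -> list[str]:
    """Return one superobject name per layer, e.g. Music, Music-popular, ..."""
    return ["-".join(layers[: i + 1]) for i in range(len(layers))]


def file_identifiers(filename: str, layers: list[str]) -> tuple[int, int]:
    """Return (fileid, superid); only the lowest-layer superid is kept."""
    lowest = superobject_names(layers)[-1]
    return make_id(filename), make_id(lowest)


def closest_node(nodeids: list[int], key: int) -> int:
    """Return the nodeid numerically closest to key (no wrap-around)."""
    return min(nodeids, key=lambda n: abs(n - key))
```

With these definitions, “Evita” and “True Blue” filed under the same three layers receive different fileids but the same superid, so their location information would be kept by the same superagent.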
3.2 Data Structure
In UCP2P, a small storage space is allocated on each peer as a superobject cache (SO-Cache). The structure of an SO-Cache is shown in Figure 3. It consists of two levels. The upper level is a hash table in which each entry contains a superobject; superobjects whose superids have the same hash value form a doubly linked list. In the lower level, each superobject contains several file descriptors. The file descriptors within a superobject represent highly correlated files and form a doubly linked LRU list. To utilize spatial locality effectively and limit the overhead, the maximum number of file descriptors within a superobject must be chosen carefully. Although UCP2P routing performance depends heavily on the workload trace used in each experiment, we found that 20 file descriptors per superobject are enough to capture most of the locality. The data structure of a file descriptor is also shown in Figure 3. It includes five fields: filename, fileid, superid, file location information and file access number.

The superobjects on a peer come from two sources: superobjects whose superids are numerically closest to the peer’s nodeid (the peer is responsible for maintaining these superobjects, and we call it their “superagent”) and superobjects obtained through routing requests (the peer is not responsible for maintaining these; they are kept for caching purposes only). The superagent field of a superobject indicates this state: a value of 1 means the peer is the superagent of the superobject and 0 means it is not.

The total number of superobjects stored in an SO-Cache is limited to reduce overhead. All superobjects in an SO-Cache form an LRU linked list ordered by access time. When a new file descriptor is added to a superobject and the SO-Cache is full, the least recently accessed superobject, at the tail of the superobject LRU list, is evicted and its storage space is freed for the new file descriptor. Each time a superobject is accessed or modified, it moves to the head of the LRU list.

Because superobjects are evenly distributed across all peers in the system, no peer is responsible for a large number of superobjects. In addition, a superobject occupies only hundreds of bytes, so even a 20MB SO-Cache can hold tens of thousands of superobjects. Given current standard computer storage configurations, storage space is not a significant concern.
Figure 3: Superobject Cache (SO-Cache) structure. A hash table with entries 0 to N−1 points to chains of superobjects; each superobject has a Superagent field and a Head pointer to its list of file descriptors; each file descriptor has the fields Filename, Fileid, Superid, Location and Access num.
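A compact way to picture the SO-Cache of Section 3.2 is the sketch below. It is not the authors' implementation: Python's `collections.OrderedDict` is swapped in for the hash table plus doubly linked LRU lists, and the class names (`FileDescriptor`, `Superobject`, `SOCache`) and the `capacity` parameter are assumptions for illustration. The limit of 20 descriptors per superobject is taken from the text. The sketch does not treat superagent-owned superobjects differently during eviction, since the text does not say whether they should be.

```python
from collections import OrderedDict
from dataclasses import dataclass, field

MAX_DESCRIPTORS = 20  # per-superobject limit reported in the text


@dataclass
class FileDescriptor:
    filename: str
    fileid: int
    superid: int
    location: str       # e.g. the address of the peer holding the file
    access_num: int = 0


@dataclass
class Superobject:
    superid: int
    superagent: bool    # True if this peer maintains the superobject
    # fileid -> FileDescriptor, most recently used first
    files: OrderedDict = field(default_factory=OrderedDict)

    def touch(self, fd: FileDescriptor) -> None:
        """Insert or refresh a descriptor at the head of the LRU list."""
        self.files[fd.fileid] = fd
        self.files.move_to_end(fd.fileid, last=False)
        while len(self.files) > MAX_DESCRIPTORS:
            self.files.popitem(last=True)  # drop the tail descriptor


class SOCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # maximum number of superobjects
        # superid -> Superobject, most recently used first
        self.table: OrderedDict = OrderedDict()

    def get(self, superid: int):
        """Return the superobject (moving it to the head) or None."""
        so = self.table.get(superid)
        if so is not None:
            self.table.move_to_end(superid, last=False)
        return so

    def put(self, so: Superobject) -> None:
        """Store a superobject, evicting from the LRU tail when full."""
        self.table[so.superid] = so
        self.table.move_to_end(so.superid, last=False)
        while len(self.table) > self.capacity:
            self.table.popitem(last=True)
```

Keeping both levels as ordered mappings gives constant-time lookup and constant-time LRU reordering, which is the same effect the hash table and linked lists are meant to provide.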
3.3 Routing Algorithm
UCP2P can use any of the existing routing algorithms as its base routing algorithm and functions as a module on top of it. Currently, we use Chord for its simplicity. In UCP2P, a routing request can be processed by one of three methods.

First, the client checks its SO-Cache. If it finds the location information of the requested file, no routing procedure is executed. UCP2P improves system routing performance mainly through this method: the higher the hit rate in the SO-Cache, the better the routing performance. Additional operations are needed for superobject maintenance: the client sends a message to the node that is the superagent of this superobject, and the superagent increases the file access number of the corresponding file descriptor and adjusts its LRU lists. This extra work can be done during system idle time, so the client does not notice it.

The second method is to use the fileid as the key and execute a normal Chord routing procedure. Here, the destination node is responsible for notifying the superagent to modify its SO-Cache. If the superagent judges that other files in this superobject are likely to be accessed by the same client in the near future, it sends the superobject directly to the original client.

The third method is to use the superid as the key and perform a Chord routing procedure. In this case, the superagent is the destination. It modifies its SO-Cache and sends the corresponding superobject back to the originator.

In the last two methods, only one routing procedure is performed and the client receives the location information of both the requested file and some other correlated files. If the system can ensure a strong correlation among the files in a superobject, system routing overhead can be greatly reduced. For example, if a client wants to download “Evita”, the UCP2P system assumes that the client is likely to request other songs by Madonna. It uses the superid of “Music-popular-Madonna” as the key and performs a normal Chord routing procedure. The corresponding superagent receives the request and sends back the location information of “Evita” along with the location information of other Madonna songs. The client keeps this superobject in its own SO-Cache; the next time the client requests another Madonna song, it can get the location information directly from its local SO-Cache.

Figure 4: UCP2P routing performance. Average number of routing hops versus number of nodes (1000 to 10000) for Chord and UCP2P.
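The three lookup methods of Section 3.3 can be sketched from the client's point of view as follows. This is a toy model under stated assumptions, not the authors' code: `ToyNetwork` stands in for the Chord overlay, each `route_*` call counts as one routing procedure, superobjects are keyed by name rather than by hashed superid for readability, and `push_threshold` is a hypothetical stand-in for the superagent's decision to send a superobject to the client.

```python
class ToyNetwork:
    """Stand-in for the overlay; each route_* call is one routing procedure."""

    def __init__(self, superobjects, push_threshold=5):
        self.superobjects = superobjects      # superid -> {filename: location}
        self.push_threshold = push_threshold  # assumed push policy

    def route_to_superagent(self, superid):
        """Method 3: the superagent is the destination."""
        return self.superobjects.get(superid, {})

    def route_to_file(self, filename, superid):
        """Method 2: route on the file key; the superagent may push."""
        so = self.superobjects.get(superid, {})
        pushed = so if len(so) > self.push_threshold else None
        return so.get(filename), pushed


class Client:
    def __init__(self, network):
        self.network = network
        self.so_cache = {}  # superid -> {filename: location}
        self.routing_procedures = 0

    def lookup(self, filename, superid, use_superid=True):
        # Method 1: a local SO-Cache hit needs no routing procedure.
        cached = self.so_cache.get(superid, {})
        if filename in cached:
            return cached[filename]

        self.routing_procedures += 1
        if use_superid:
            # Method 3: route on the superid and cache the reply.
            so = self.network.route_to_superagent(superid)
            self.so_cache[superid] = dict(so)
            return so.get(filename)

        # Method 2: route on the file key; cache the superobject if pushed.
        location, pushed = self.network.route_to_file(filename, superid)
        if pushed is not None:
            self.so_cache[superid] = dict(pushed)
        return location
```

In this model, a client that looks up “Evita” through the superid of “Music-popular-Madonna” performs one routing procedure, and later lookups of “Erotica” or “True Blue” are served from its local cache. The maintenance message that a cache hit sends to the superagent is omitted here.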
3.4 Superobject Maintenance
In the UCP2P system, the content of a superobject is dynamic. At system start time, no superobject exists. When a new file is inserted, its location information is stored on the peer whose nodeid is numerically closest to its fileid (as in other P2P routing algorithms). The UCP2P system also stores its location information on the superagent of the file’s superid. Each time a routing procedure finishes, the file access information is sent to the superagent, which modifies its SO-Cache accordingly. First, it searches the hash table. If a superobject with the same superid is found, the superagent searches the file descriptor list of this superobject. If a file descriptor matching the file is found, the superagent adds 1 to its file access number field and moves the descriptor to the head of the superobject’s list. The superobject LRU list is also updated: this superobject is moved to the head. If the superagent finds that the file is new, it creates a new file descriptor and adds it to the head of the corresponding superobject. If no such superobject exists, the system generates a new superobject for this file. If the superobject has more than 5 entries, which indicates that some correlated files exist, the superagent sends the superobject back to the original client. If the number of file descriptors in the superobject exceeds 20, the superagent discards the file descriptor at the tail of the linked list.
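The maintenance steps performed by a superagent can be sketched as below. The thresholds of 5 and 20 come from the text; everything else is an assumption for illustration: the `Superagent` class and `record_access` method are hypothetical names, `OrderedDict` replaces the linked lists, and a newly created descriptor is assumed to start with an access count of 1 (the text does not give the initial value).

```python
from collections import OrderedDict

SEND_THRESHOLD = 5    # more than 5 entries: send the superobject back
MAX_DESCRIPTORS = 20  # beyond 20 entries: drop the tail descriptor


class Superagent:
    def __init__(self):
        # superid -> OrderedDict(fileid -> [location, access_num]),
        # both levels ordered most recently used first
        self.superobjects = OrderedDict()

    def record_access(self, superid, fileid, location):
        """Update state after one routing procedure finishes.

        Returns the superobject to send to the client, or None.
        """
        # Create the superobject if it does not exist yet.
        so = self.superobjects.setdefault(superid, OrderedDict())

        # Find or create the file descriptor, then count this access.
        entry = so.get(fileid)
        if entry is None:
            entry = so[fileid] = [location, 0]
        entry[1] += 1

        # Move the descriptor and the superobject to the LRU heads.
        so.move_to_end(fileid, last=False)
        self.superobjects.move_to_end(superid, last=False)

        # Enforce the per-superobject size limit.
        while len(so) > MAX_DESCRIPTORS:
            so.popitem(last=True)

        return so if len(so) > SEND_THRESHOLD else None
```

A superobject is therefore only returned once it holds at least six descriptors, which is how the sketch reads "more than 5 entries".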
4 Performance Evaluation
To evaluate the benefits of using superobjects, we conduct some preliminary experiments that compare UCP2P routing performance with Chord.
4.1 Simulation Environment
Due to the lack of suitable P2P system workloads, we use a synthetic trace in our simulation. The synthetic trace is generated from the National Laboratory for Applied Network Research (NLANR) web traces. The trace file we use was obtained on May 6th, 2002, and we truncate it to 640000 requests. Each URL in the trace file is divided into 3 sections, with each section representing a superobject. We use only the 3rd (lowest) layer superobject. The emulated network model is the Transit-Stub model [10], a two-level hierarchical internetwork model. In our simulation, the latencies of intra-transit domain links, stub-transit links and intra-stub domain links are set to 20, 5 and 1 ms respectively (we also tried some other distributions, but our conclusions do not change). The number of nodes in the simulated networks varies from 1000 to 10000. Because we use Chord as the baseline routing algorithm, we compare its routing performance with that of the UCP2P system in terms of both routing hops and routing network latency.
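The text does not specify how a URL is split into three sections, so the sketch below shows only one plausible reading: the host name is taken as layer 1 and the first two directory components of the path as layers 2 and 3, with the final path component treated as the file. The function name `url_to_superobject` and this splitting rule are assumptions, not the procedure used in the experiments. URLs without a scheme would need extra handling, because `urlsplit` then reports an empty host.

```python
from urllib.parse import urlsplit


def url_to_superobject(url: str, layers: int = 3) -> str:
    """Derive a lowest-layer superobject name from a URL (assumed rule)."""
    parts = urlsplit(url)
    # Directory components only: the last path component is the file.
    dirs = [s for s in parts.path.split("/") if s][:-1]
    return "-".join(([parts.netloc] + dirs)[:layers])
```

Under this rule, two files in the same directory on the same server map to the same superobject, which mirrors the web-traffic observation that pages on the same server tend to be accessed together.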
4.2 Routing Performance
In the first simulation, we test UCP2P routing performance with different network sizes. The number of routing requests in all experiments is fixed at 200000, and the originator of each request is determined according to the original web trace. The simulation result is shown in Figure 4. In all emulated networks, UCP2P has a smaller average number of routing hops than Chord; the system routing cost is reduced by 11.2% to 23.6%. Because we use the same number of requests, the amount of spatial locality the UCP2P system can exploit is nearly the same in all experiments. However, as the total number of system nodes increases, the average number of routing hops also increases, so the relative difference in the average number of routing hops between UCP2P and Chord grows.

In the second simulation, we evaluate the effect of the total number of requests. The number of system nodes is set to 5000 and the number of routing requests varies from 10000 to 640000. Figure 5 shows the result. As we expected, as the number of requests increases, the amount of spatial locality rises too, and the UCP2P routing algorithm achieves a larger routing performance gain. For 10000 requests, UCP2P reduces the routing cost by only 2% because little locality can be exploited, while for 640000 requests, UCP2P improves routing performance by 25% because of the increased spatial locality. For Chord, the average number of routing hops and the average routing latency are constant in all experiments because the number of routing requests does not affect its efficiency.

Figure 5: Effects of the total number of routing requests. Average number of routing hops and average routing latency (ms) versus number of routing requests (10000 to 640000) for UCP2P and Chord.

5 Future Work
More work needs to be done to evaluate the efficiency of the UCP2P algorithm. First, real-world P2P workloads are required to evaluate the maximum amount of spatial locality that can be exploited in P2P systems. However, current P2P system trace generators do not satisfy our requirements, so we plan to write our own crawler program for the KaZaA and Gnutella systems to collect P2P system traces. Second, we need to do more research on the superobject mechanism: a more elaborate data structure is required to exploit maximal spatial locality with minimal overhead. Third, we currently use only the Chord algorithm as the baseline system; we plan to use other routing algorithms such as Pastry, Tapestry or CAN as the underlying routing algorithm and evaluate the efficiency of UCP2P. More work is also needed on the supernode creation and maintenance mechanism.

6 Conclusion
Current routing algorithms concentrate on modifying the network topology related structure to improve P2P system routing performance. These algorithms do not take the effects of P2P system workload into account. In this paper, we propose the UCP2P system to exploit spatial locality among file accesses. The UCP2P system groups the location information of correlated files together and stores it in superobjects, so a routing procedure may return the location information of several files. Routing performance is greatly improved by reducing the number of routing procedures. Our preliminary simulation results show that the UCP2P algorithm can significantly reduce the routing traffic in P2P systems.

Acknowledgments
This work is supported in part by the National Science Foundation Career Award CCR-9984852, and an Ohio Board of Regents Computer Science Collaboration Grant. Thanks to Sudhindra Rao and Juan Li for providing many suggestions during the writing of this paper.
References
[1] Napster, “http://www.napster.com.”
[2] Gnutella, “http://www.gnutella.wego.com.”
[3] KaZaA, “http://www.kazaa.com/.”
[4] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” Technical Report TR-819, MIT, Mar. 2001.
[5] A. I. T. Rowstron and P. Druschel, “Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems,” in Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pp. 329–350, Nov. 2001.
[6] B. Zhao, J. Kubiatowicz, and A. Joseph, “Tapestry: An infrastructure for fault-tolerant wide-area location and routing,” Technical Report UCB/CSD-01-1141, U.C. Berkeley, CA, 2001.
[7] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A scalable content addressable network,” Technical Report TR-00-010, U.C. Berkeley, CA, 2000.
[8] M. F. Arlitt and C. L. Williamson, “Web server workload characterization: The search for invariants,” in Measurement and Modeling of Computer Systems, pp. 126–137, 1996.
[9] A. Mahanti, “Web proxy workload characterisation and modelling,” 1999.
[10] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to model an internetwork,” in Proceedings of the IEEE Conference on Computer Communication, San Francisco, CA, pp. 594–602, Mar. 1996.