Janus: Build Gnutella-like file sharing system over structured overlay
Haitao Dong, Weimin Zheng, Dongsheng Wang
Computer Science and Technology Department, Tsinghua University, Beijing, China
[email protected], { zwm-dcs, wds}@tsinghua.edu.cn
Abstract. How to build an efficient and scalable P2P file sharing system is still an open question. Structured systems achieve an O(log N) upper bound on lookup hops by associating content with nodes, but they cannot support complex queries. Gnutella-like unstructured systems, on the other hand, support complex queries, but their random-graph topology and flooding-based content discovery prevent them from scaling to large networks. In this paper we present Janus, which builds an unstructured file sharing system over a structured overlay. Unlike previous approaches, Janus keeps bidirectional links in its routing table. Combined with one-hop replication and a biased random walk, this makes it possible to implement complex queries in a scalable manner. Experimental results indicate that, in a network of 10,000 peers, a query needs only 100 hops to search half of the system.
1. Introduction
In recent years, P2P overlays have become one of the most important classes of systems on the Internet, consuming more bandwidth than HTTP traffic. The rapid development of P2P systems owes much to their role as a substrate for large-scale data sharing and content distribution applications. In such a serverless system, a node looking for files must first find the peers that hold the desired content. Napster was one of the first peer-to-peer systems. It recognized that requests for popular content need not be sent to a central server but could instead be handled by the many hosts, or peers, that already possess the content. It adopts a hybrid design, in which a directory server acts as a centralized search facility based on the file lists provided by each peer. Because of the RIAA's lawsuit, however, such centralized systems have been replaced by decentralized systems that have no centralized search capability.

There are two kinds of decentralized P2P overlay networks: unstructured overlays and structured overlays. Unstructured overlays, such as Gnutella and KaZaA, distribute both download and search because they lack a centralized directory server. In these systems, peers form a randomly connected overlay network, and there are no constraints on the placement of files within it. Consequently, each query must be flooded (or randomly dispersed) throughout the overlay with a limited scope, set by the TTL of the query. When a peer receives such a query message, it checks locally whether it has a file matching the query. If it does, it sends the list of all matching content back to the originating node. If the TTL is not zero, the node decreases the TTL by 1 and resends the query to all its neighbors. The load on each node therefore grows linearly with the total number of queries, which in turn grows with system size. This approach supports arbitrarily complex queries but does not scale to large systems.

To solve the scalability problem of unstructured overlays, several approaches have been proposed simultaneously but independently, all of which provide distributed hash table (DHT) functionality; among them are Tapestry[10], Pastry[6], Chord[8], and Content-Addressable Networks (CAN)[5]. In these DHT systems, each file is associated with a key, produced, for instance, by hashing the file name, and each node in the system is responsible for storing a certain range of keys. There is one basic operation in these DHT systems, lookup(key), which returns the identity (e.g., the IP address) of the node responsible for storing the object with that key. By allowing nodes to put and get files by key through this operation, DHTs support a hash-table-like interface.

The core of these DHT systems is the routing algorithm. The DHT nodes form an overlay network in which each node has several other nodes as neighbors. When a lookup(key) operation is issued, the lookup is routed through the overlay network to the node responsible for that key. The scalability of these DHT algorithms therefore depends on the efficiency of their routing algorithms. Each of the DHT systems listed above (Tapestry, Pastry, Chord, and CAN) employs a different routing algorithm. Although their routing algorithms differ in many details, they share the property that every overlay node maintains O(log n) neighbors and routes within O(log n) hops (where n is the number of nodes in the system).
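To illustrate the hash-table-like interface described above, the following minimal Python sketch hashes file names into an m-bit key space and makes the first node whose identifier is at or after the key responsible for it. This successor rule follows Chord and is only one possible assignment. The names ToyDHT and key_of and the 16-bit identifier size are assumptions made for illustration, and the global sorted node list stands in for the hop-by-hop routing that a real DHT performs.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits (illustrative assumption; real systems use far more)


def key_of(name: str) -> int:
    """Hash a file name or node address into the m-bit identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)


class ToyDHT:
    """Global-view toy model; a real DHT routes lookup() hop by hop."""

    def __init__(self, node_addresses):
        # Nodes sorted by identifier form the ring.
        self.ring = sorted((key_of(a), a) for a in node_addresses)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {addr: {} for addr in node_addresses}

    def lookup(self, key: int) -> str:
        """Return the address of the node responsible for `key`:
        the first node identifier >= key, wrapping around the ring."""
        i = bisect_left(self.ids, key) % len(self.ids)
        return self.ring[i][1]

    def put(self, name: str, value) -> None:
        self.store[self.lookup(key_of(name))][name] = value

    def get(self, name: str):
        # Exact-match only: the name must hash to the same key.
        return self.store[self.lookup(key_of(name))].get(name)


dht = ToyDHT([f"10.0.0.{i}" for i in range(1, 9)])
dht.put("song.mp3", "held by 10.0.0.3")
print(dht.get("song.mp3"))
print(dht.get("song"))
```

Because get() hashes the exact name, the partial name "song" maps to an unrelated key and finds nothing even though "song.mp3" is stored. This is the exact-match limitation that the rest of the paper addresses.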
To achieve this routing performance, the graph formed by the peers is structured so that data items can be discovered efficiently given their keys, at the cost of not supporting complex queries. [2] points out that complex keyword-based searches are more prevalent, and more important, than exact-match queries. Can we design a system that supports complex queries scalably? [11] proposes the idea that it is plausible to build a Gnutella-like unstructured overlay on top of a structured overlay. The authors replace the random graph in Gnutella with a structured overlay while retaining the content placement and discovery mechanisms of unstructured overlays in order to support complex queries. Their system, Structella, uses either a form of flooding or random walks to discover content. In the flooding model, it takes advantage of structure to ensure that nodes are visited only once during a query and to control accurately the number of nodes visited. It does not fundamentally solve the scalability problem, however, because many nodes that are irrelevant to a query still have to forward the query message. In the random walk model, Structella simply walks along the ring formed by neighbouring nodes in the identifier space, so in the worst case its query path length is O(n).

In this paper, we propose Janus, a new file sharing system that builds an unstructured P2P system (Gnutella) on a DHT with bidirectional fingers. Our motivation for turning the unidirectional fingers of a conventional structured overlay into bidirectional ones is the observation that, although a conventional DHT overlay has balanced fan-out (typically O(log n) fingers per node), the number of fan-in fingers is not homogeneous at all. With a bidirectional routing table, peers have different numbers of neighbors, which makes a biased random walk strategy possible. The degree of imbalance between peers' routing table sizes largely determines the efficiency of the biased random walk. We also introduce one-hop replication (neighboring peers replicate file information with each other), which greatly improves query performance according to our simulation results.

The rest of this paper is organized as follows. Section 2 introduces related work on P2P file sharing overlay networks. Section 3 presents the design of the Janus protocol. Experimental results are presented in Section 4. Section 5 concludes.
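The fan-in imbalance that motivates bidirectional fingers can be checked with a small experiment. The sketch below is an illustration under assumptions, not the simulator used in this paper. It builds standard Chord finger tables (finger i of node n is the successor of n + 2^i) for randomly chosen identifiers, derives each node's fan-in, and forms a bidirectional table as the union of outgoing and incoming fingers. The function names and the parameters (m = 20, 1,000 nodes) are chosen only for illustration.

```python
import random
from bisect import bisect_left
from collections import defaultdict


def chord_fingers(ids, m):
    """Return {node: set of distinct finger targets} for Chord-style fingers."""
    ring = sorted(ids)

    def successor(k):
        # First node identifier >= k, wrapping around the circle.
        return ring[bisect_left(ring, k) % len(ring)]

    return {
        n: {successor((n + 2 ** i) % 2 ** m) for i in range(m)} - {n}
        for n in ring
    }


def fan_in(out):
    """Invert the finger tables: {node: set of nodes that point at it}."""
    incoming = defaultdict(set)
    for n, targets in out.items():
        for t in targets:
            incoming[t].add(n)
    return {n: set(incoming.get(n, ())) for n in out}


def bidirectional_tables(out):
    """Union of outgoing and incoming fingers (assumed Janus-style table)."""
    inc = fan_in(out)
    return {n: out[n] | inc[n] for n in out}


rng = random.Random(1)
ids = rng.sample(range(2 ** 20), 1000)
out = chord_fingers(ids, 20)
inc = fan_in(out)
table = bidirectional_tables(out)

for label, tables in (("fan-out", out), ("fan-in", inc), ("bidirectional", table)):
    sizes = [len(s) for s in tables.values()]
    print(label, "min", min(sizes), "max", max(sizes))
```

Comparing the printed minimum and maximum sizes for fan-out and fan-in shows how evenly each is distributed for a given set of identifiers. The bidirectional tables are symmetric by construction: if t appears in the table of n, then n appears in the table of t.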
2. Background
The Janus protocol builds on two ongoing research areas: unstructured file sharing and structured overlay design. In this section, we first give a general introduction to representative DHT systems and then briefly describe Gnutella, the most prevalent unstructured file sharing system.

DHTs: A DHT such as Chord takes a key as input and, in response, routes a message to the node responsible for that key. The keys are strings of digits of some length. Nodes have identifiers taken from the same space as the keys (i.e., with the same number of digits). Each node maintains a routing table consisting of a small subset of the nodes in the system. When a node receives a query for a key for which it is not responsible, it routes the query to the neighbor node that makes the most "progress" towards resolving the query. The notion of progress differs from algorithm to algorithm, but in general it is defined in terms of some distance between the identifier of the current node and the identifier of the queried key.

Gnutella: Here we mainly introduce the Gnutella 0.4 protocol[1]. Gnutella is based on a random graph. Each node in the overlay maintains a neighbour table with the network addresses of its neighbours in the graph. The neighbour tables are symmetric: if node x has node y in its neighbour table, then node y has node x in its neighbour table. The tables are designed to be symmetric in order to reduce maintenance load. There is an upper and a lower bound on the number of entries in each node's neighbour table; typically the lower bound is 4 and the upper bound is 8.

Structella: Structella replaces the random graph in Gnutella with a structured overlay, but it does not use the structure to organize the content. Like Gnutella, Structella supports complex queries using variants of flooding and random walks, but it takes advantage of structure to ensure that nodes are visited only once during a query and to control accurately the number of nodes visited. Structella also leverages the structured overlay to reduce maintenance overhead. It is, however, only a preliminary approach. Janus adopts some ideas from Structella, such as relaxing the DHT constraint that couples content to nodes according to their keys, but it introduces several mechanisms to improve on Structella's performance.
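The Gnutella-style query propagation summarized in this section (symmetric neighbour tables and TTL-scoped flooding) can be sketched as follows. This is a simplified model under stated assumptions, not the Gnutella 0.4 wire protocol: the function name flood, the in-memory neighbour and file tables, and the message accounting (one message per forwarded copy, duplicates dropped on receipt) are all illustrative choices.

```python
from collections import deque


def flood(nbrs, files, origin, predicate, ttl):
    """TTL-limited flooding over symmetric neighbour tables.

    nbrs:      {node: set of neighbour nodes}
    files:     {node: list of file names stored at that node}
    predicate: arbitrary function on a file name (a "complex query")
    Returns (hits, messages), where hits maps node -> matching file names.
    """
    seen = {origin}
    hits = {}
    messages = 0
    frontier = deque([(origin, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        # Every visited node evaluates the query against its local files.
        matches = [f for f in files.get(node, ()) if predicate(f)]
        if matches:
            hits[node] = matches
        if remaining == 0:
            continue  # TTL exhausted: do not forward further
        for nb in nbrs[node]:
            messages += 1  # each forwarded copy costs one message
            if nb not in seen:  # duplicate copies are dropped on receipt
                seen.add(nb)
                frontier.append((nb, remaining - 1))
    return hits, messages


# A four-node line 0 - 1 - 2 - 3 with symmetric neighbour tables.
nbrs = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
files = {3: ["janus-paper.pdf", "notes.txt"]}


def wants(name):
    # A substring match, which an exact-match lookup cannot express.
    return "janus" in name


print(flood(nbrs, files, origin=0, predicate=wants, ttl=2))  # ({}, 3)
print(flood(nbrs, files, origin=0, predicate=wants, ttl=3))  # ({3: ['janus-paper.pdf']}, 5)
```

The predicate can be any function of the file name, which is why flooding supports complex queries. The message count grows with every node the query reaches, which is the scalability cost discussed above.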
3. The Janus System
This section describes the Janus protocol. The protocol specifies how to construct the routing table, how to maintain the structure of the system, how to handle query operations, and how to use the k-hop replication strategy to improve system performance. Although Janus can be implemented on most existing O(log n) structured overlays, in the rest of this paper we discuss only Janus based on Chord (Janus on other DHTs gives similar results). Like Chord, Janus assigns each node an m-bit identifier using a base hash function such as SHA-1. Unlike Chord, and like Gnutella, Janus places no constraint on where content is stored. We now describe how nodes in Janus are organized. A node's identifier is chosen by hashing the node's IP address. The term “node” will refer to both the node and its identifier under the hash function. Consistent hashing assigns keys to nodes as follows. Nodes are ordered by identifier on an identifier circle modulo 2^m. A node (with identifier) K is called node L's successor node if and only if 1. K > L; 2. there is no such node M satisfying L