Content-oriented Topology Restructuring for Search in P2P Networks

Hans Friedrich Witschel*
[email protected]

* Supported by Deutsche Forschungsgemeinschaft (DFG)
Abstract

This article presents a new algorithm for content-oriented search in P2P networks that avoids flooding and thus ensures scalability. It is based on the concept of small worlds: peers are enabled to actively influence network structure by choosing their neighbours. Different strategies for neighbour selection are possible, the most important being the one called "cluster strategy", which consists in peers choosing other peers that offer content similar to their own. Thus, peers will organize into clusters of semantic similarity. This will (as has to be shown) result in a small world network structure. This structure can then be exploited for implementing an efficient search algorithm: each peer forwards incoming queries to just one of its neighbours (the one whose document profile best matches the query). Because paths are short in small worlds and because there are semantic clues for finding them, it will be possible to quickly redirect queries to the right clusters of peers. After giving a detailed overview of related ideas and introducing the exact algorithm, a model of a peer-to-peer network will be presented that makes some simplifying assumptions about the world and thus allows us to build a simulation of our algorithm. The experimental setup of this simulation will be explained in detail and simulation results will be given and thoroughly discussed.
1 Introduction
In recent years, there has been much research into intelligent search mechanisms in peer-to-peer (P2P) systems. In a typical P2P network, each node (or peer) is connected to a small number of other nodes (its neighbours). Nodes may also share data with the rest of the peers. In order to retrieve this data, a peer must issue a query that will be routed through the network and forwarded to a number of other peers. These may contribute answers to the query and send them back to the querying peer. Adding intelligence to P2P systems normally aims at minimizing the number of messages needed to locate data. Systems like Gnutella [7] that lack this intelligence resort to blind search mechanisms such as flooding. This creates a lot of unnecessary traffic and therefore scales very poorly.

Our idea for an intelligent search mechanism is inspired by the findings of Stanley Milgram [14]: he gave letters to 60 persons picked at random and asked them to forward the letter to a certain target person. The participants could only pass the letters (by hand) to personal acquaintances who they thought might be able to bring them closest to the target person. Note that "closeness" need not necessarily be understood geographically in this case. The fact that the letters that reached their target were only passed through six intermediaries has come to be known as the "six degrees of separation". The topology of the social network graph – where nodes are persons and edges are acquaintances connecting them – is now (since Milgram's experiments) called a "small world". This phenomenon can be found not only in social networks, but also in a great number of other self-organizing systems (e.g. neural networks, power grids etc.). Watts and Strogatz [21] define a small world as a (usually sparse) graph with short average distance L and a so-called clustering coefficient C that is significantly larger than in random graphs. This can be translated as follows: a small world – as modeled by Watts and Strogatz – consists of a number of clusters of nodes (i.e. highly connected
subgraphs) that are in turn very loosely connected. That is, people in social networks organize into clusters (e.g. groups of common interests). These clusters will then be loosely connected by people belonging to more than one group (e.g. having various interests) or other random links.
We believe that an intelligent search mechanism for P2P file sharing communities can be derived directly from Milgram's experiments: when a peer P receives a request for some information, it first checks whether it can satisfy this query itself (i.e. whether the target has already been reached). If it can, the answer to the query is returned directly to the querying peer; if not, P chooses the one of its neighbours that is deemed to be closest to the requested data and passes the query on to this peer. This continues until the target has been reached or the length of the message chain (TTL) has exceeded a certain value. Of course, all decisions about message forwarding must be made on the basis of locally available information only. According to Kleinberg [12], we must meet two requirements in order to make this algorithm work:

• First, there must be short paths in the P2P network graph: it must be possible for queries to quickly reach their "destination", i.e. a peer or a number of peers that have the most relevant files with respect to the given query.

• Second, peers must have access to some "latent navigational cues" (as Kleinberg calls them) that enable them to find short paths to the target node(s). Note that even if short paths exist in a network, there may be no local algorithm that is able to find them.

The first requirement is met by small world graphs, as we have seen above. In Milgram's experiments, the participants could (without even knowing it) rely on the short path lengths that exist in social networks. There is no reason why arbitrary P2P networks should have a small world topology. Therefore, we want to study P2P systems in which topology restructuring is possible: peers may decide individually which other peers to connect to or not. As all of a peer's decisions should only be based on local information, these decisions must follow some general strategies that can be guaranteed to work well globally (i.e. create a small world network structure) provided that all peers use the same local strategy. I will later propose a gossiping mechanism responsible for creating clusters of interests among peers, which will result (as has to be shown) in a small world network structure.

The second requirement suggests that it must be possible to find short paths using content-based clues: we believe that equipping each peer with a "profile", i.e. a summarization of its contents, and making peers organize into clusters of semantic similarity can give such hints and thus greatly facilitate query routing. If peers know the profiles of each of their neighbours, they can predict which of these neighbours is most likely to have an answer to a given query. More importantly, it allows for a hill-climbing approach to search: if we are searching for a particular topic, we have to find just one peer whose profile is at least a little similar to our query. There is then a good chance of this peer being part of, or at least having connections to, a cluster "covering" the topic we are searching for. Once a query has reached the right cluster of peers, it will quickly be passed on to the best peer(s) within this group.

Before we go into details about related ideas and the algorithm itself, there is one important issue to mention: our algorithm is intended to be a contribution to information retrieval: we are exclusively interested in locating files by specifying keywords that describe their content (as in an ordinary web search engine). Although files are not required to be text documents (they may be music files as well), they need to have a textual description, i.e. a number of keywords summarizing their content. We will later (for simulation purposes) abstract from real documents and keywords, but one of our project's important design issues is to make documents semantically analyzable (i.e. beyond a simple inverted index) and to use the results of these analyses for building compact document and peer profiles.
2 Related Work
In this section, we are interested in exploring approaches that apply some kind of "intelligence" to search in P2P systems. They can roughly be divided into three categories:

1. Distributed hash tables (DHTs): heavily "structured" approaches like Chord, CAN or Tapestry in which nodes are made responsible for storing certain data items.

2. Unstructured approaches without topology restructuring, where peers have a static neighbourhood that may not be changed or extended. "Unstructured" refers to the fact that there are no restrictions on data placement, i.e. each peer may keep arbitrary data.

3. Unstructured approaches with topology restructuring: peers are allowed to actively influence network topology by choosing their neighbours.

2.1 Structured approaches (DHTs)

The common feature of all DHT approaches is their deterministic data placement strategy: in systems like Chord [20], Tapestry [23] or CAN [16], the location of files is not arbitrary, but is usually determined by applying hash functions to file keys, i.e. peers are responsible for certain ranges of hash values and have to store all keys (and associated files) that yield these values when being fed into a hash function. Of course the same function can be used to find data, which makes routing very easy and allows data to be located quickly and efficiently. However, some serious drawbacks result from these rigid constructions:

• Since each peer is responsible for providing files with certain hash values, peers joining or leaving the network call for additional efforts, involving costly partitioning or replication of data on other nodes.

• Load balancing: a lot of traffic will be directed towards nodes that are responsible for hash values of very popular data items. Because of the rigid data placement strategy, it is very difficult to introduce any form of load balancing.

• DHTs were designed to provide search functionality for single keys (e.g. file names). Finding files associated with n > 1 keys requires n lookup operations (one for each key) and a subsequent intersection of the result sets (see the sketch below). Searching for more than one key is, however, a typical use case in information retrieval and should therefore be handled more efficiently.

Obviously, DHTs are not really suitable for information retrieval purposes (especially because of the last of the drawbacks just mentioned). Therefore, we will not examine the details of their functionality, but rather consider P2P architectures that allow for arbitrary data placement, transient peer populations and a search for more than one keyword.
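To make the multi-keyword drawback concrete, the following is a minimal, hypothetical sketch of a keyword search over a DHT-like key space. It is not modelled on Chord, CAN or Tapestry specifically; the node count, the hashing scheme and the per-node index layout are illustrative assumptions. The point is simply that every keyword needs its own lookup and that the result sets must be intersected afterwards.

```python
# A hypothetical multi-keyword lookup over a DHT-like key space; node count,
# hashing and index layout are assumptions made only for illustration.
import hashlib

NUM_NODES = 64   # assumed size of the identifier space

def responsible_node(key: str) -> int:
    """Map a key to the node responsible for its hash value."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % NUM_NODES

def lookup(key: str, node_indexes: dict) -> set:
    """One DHT lookup: ask the responsible node for the files indexed under key."""
    return node_indexes.get(responsible_node(key), {}).get(key, set())

def multi_key_search(keys, node_indexes: dict) -> set:
    """n keywords require n lookups; the result sets are intersected afterwards."""
    result_sets = [lookup(k, node_indexes) for k in keys]
    return set.intersection(*result_sets) if result_sets else set()
```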
2.2 Static neighbourhoods

The approaches we shall examine in this section focus on peers keeping descriptions (profiles) of other peers' contents in order to facilitate query routing. Because topology restructuring is not allowed, the structure of the network itself does not give any clues as to where data may be located (i.e. the immediate neighbours of a peer P may offer content that is completely different from P's). Therefore, these approaches tend to store information about both neighbouring and distant nodes. Thus, the amount of knowledge stored in each peer may reach a considerable size. Actually, all three approaches sketched below maintain knowledge bases covering the contents of the entire network, i.e. peers have global knowledge of the contents available in the whole system. Apart from the obvious problem of storage space, keeping this information up to date can require a large number of messages for gossiping, which makes scalability a serious problem. We will use gossiping and peer profiles in our algorithm in section 3, but there will be no global knowledge in our approach.

A typical example of a gossiping approach is given by Cuenca-Acuna et al. [5]: in their system PlanetP, a global term-to-peer index is kept by each peer. For every term t, this index contains a list of all peers in the network that have t in their local index. Two steps are needed to locate data in PlanetP: first, the querying peer uses its global index to find suitable peers. In a second step, the query is sent to all of these nodes, which will return matching documents using their local index. This means that no message forwarding mechanism is needed: queries need only travel one hop because the global knowledge available at each node is able to select the right peers straight away. Peers can be ranked with respect to a given query Q using IPF (inverse peer frequency), a measure derived from IDF (cf. [17]). This can help to minimize traffic by reducing the number of peers that Q has to be sent to. Gossiping is used in PlanetP to build and maintain the global index. Peers use push and pull operations to propagate and learn changes (e.g. the insertion of new documents), which creates a number of messages ("rumors") proportional to the network size. This does, of course, not scale very well. PlanetP explicitly aims at communities with thousands of peers. Although the authors consider some modifications for better scalability, networks with millions of peers obviously lie beyond the scope of PlanetP.

Müller et al. [15] use the same architecture as PlanetP but aim at more compact representations of a peer's contents: instead of a global term-to-peer mapping, they use distributed clustering techniques to derive categories that can be used in peer profiles. This is because they are interested in image retrieval, where "terms" are not available. These (semantic) categories for files can help to reduce the size of the global index but cannot avoid the large number of messages needed for gossiping.

The approach described in [4] offers a partial solution to this problem: Crespo and Garcia-Molina introduce routing indices for P2P systems. These contain information about how many documents on a given "topic" can be found along a certain path, i.e. in the subgraph reachable via one of a peer's neighbours. This is achieved by peers recursively propagating information about their and their neighbours' contents. Thus, it is possible for each peer to have complete knowledge of the data items that may be accessed via any of its neighbours. An improved version of these indices, called HRI (hop count routing index), also takes into account the number of hops needed to reach the data items stored in the index. As in PlanetP, the insertion of documents or peers requires a considerable number of messages for propagating this new information. In this case, however, the recursion may be restricted to a certain depth: Crespo and Garcia-Molina propose a limited "horizon" for routing indices, i.e. information is only available for nodes at a given distance. This allows them to trade message overhead for retrieval accuracy. The problem is that we do not know exactly how much information is lost when limiting our horizon: since peers are not organized into clusters of semantic similarity, a large "graph distance" between two nodes does not necessarily indicate a large semantic distance. Therefore, it might be desirable for a peer to keep information about peers at large "geographic" distances that are semantically interesting and dispose of others that are (geographically) closer. This calls for reorganization of neighbourhoods, as discussed in the next section.

2.3 Topology restructuring

We shall now study systems in which peers may actively influence network topology: by specializing in certain areas of interest and choosing neighbours that have similar interests, peers can create a network structure that in itself gives hints for efficient query routing. An important step in this direction is offered by the architecture of Freenet [3]: in Freenet, files can be searched using keys (obtained by applying hash functions) that must be known in advance. Each network node caches files that have been requested in the past and keeps a routing table associating request keys (i.e. queries) with addresses of peers that have answered these requests. Because storage space is limited, cached files and routing table entries may have to be replaced occasionally. This is done using an LRU (least recently used) strategy. Queries for a key k are routed by each node first inspecting its local document cache. If a file with the requested key is found, the query is returned to its origin. If nothing is found, the local routing table is searched for a key k' that is most similar to k. Then the query is routed to the node associated with k'. This is done for a limited number of hops and realizes a gradient ascent search. If a query runs into a dead end, backtracking is used to find better paths through the network. When queries have been answered, they travel back to their origin the same way they came along. This allows each node on the way to update its routing table with information about the key k (i.e. associating it with the address of the node that provided the answer). That leads to a specialization of nodes: if nodes are associated with a certain key, they will receive requests for similar keys, gain information about these keys (when the respective query gets answered) and so on. When files and their keys are inserted into the system, they are spread onto nodes specialized in keys similar to the new one. Note that the node where the file is actually inserted also stores the file but may never be queried for it (because it is not specialized in its key). So the specialization of Freenet nodes does not necessarily reflect user interests.

There are two approaches developing Freenet further:

• In [22], Freenet's replacement strategy for
routing tables is analyzed: the authors find that clusters of peers with similar keys tend to break up when there are many documents in the system. They argue that this is due to the LRU strategy that keeps popular entries but does not necessarily preserve semantic clusters. They aim at the creation of a small world structure by applying two new replacement strategies: replace entries that are farthest from a chosen seed with closer ones (which builds local similarity clusters) or, with a certain probability p, accept new entries that are very far from the seed (which creates random shortcuts). This is very similar to the strategies I shall introduce in section 3.
• Kronfol [13] enhances Freenet with information retrieval functionality: for each document, a vector of descriptive terms (Kronfol calls them "metadata keys") is created and stored as a separate file, containing a link to the actual document. The original Freenet keys are replaced by metadata keys and the cosine measure is introduced as a new similarity measure for comparing these keys. So now, instead of looking for just a file key, users can search for keywords, just as in an ordinary search engine. Queries will continue to be routed through the network even when first matches have been found. Thus, as an answer to an information request, several documents can be returned and ranked according to their similarity to the query.

Kalogeraki et al. [11] also address the problem of information retrieval for P2P systems. As in Freenet, each peer in their network keeps profiles of its neighbours, which consist of the most recent past queries (keywords) that have been answered by these neighbours. The cosine measure is used as a similarity function. Routing tables are updated each time a new query is routed through a node, and LRU is used as a replacement strategy. But instead of forwarding queries to just one neighbour, breadth first search (BFS) is used: every peer contacts the k best neighbours. This is, in turn, very similar to the approach used in Neurogrid [10]: Neurogrid also uses breadth first search and associates nodes with keyword requests that these nodes have provided answers for. However, in Neurogrid, there is a strong emphasis on user feedback: when two or more peers are associated with the same keyword, Neurogrid ranks them according to the number of times (valuable) data has been retrieved from their data stores, instead of applying a cosine measure. This means that when a user often downloads files from a certain peer, queries will tend to be routed to this peer in the future.

There are also two approaches that build network structure proactively, i.e. using gossiping: Haase et al. [2, 8] describe Bibster, a tool for exchanging bibliographic metadata among researchers. Bibster peers use a shared ontology to create advertisements describing their expertise. These advertisements are sent to the immediate neighbours of each peer, which may choose to cache them (e.g. if they are semantically similar to locally offered content) or discard them. A similarity measure (based on distance in the ontology) between queries and expertise advertisements is then used to route queries only to those peers whose expertise matches the query. Since advertisements are only allowed to travel one hop (i.e. only the immediate neighbours of a peer can learn of its expertise), this will not result in (large-scale) topology restructuring, i.e. Bibster could also be classified under static neighbourhoods in section 2.2.

Sia [19] presents DISCOVIR, a system for image retrieval that is based on Gnutella. It builds a semantic peer clustering on top of Gnutella by having peers broadcast their data characteristics (i.e. profiles). Each DISCOVIR node has a set of "attractive" connections to peers offering similar content and a set of random ones. Queries are initially routed just as in Gnutella (i.e. through flooding), using random links. This stops as soon as a query reaches a peer with a matching profile: this indicates that the query has arrived at the right cluster of peers. It will therefore subsequently only be forwarded through attractive links.

The problem with the last few approaches (except for those that are based on Freenet) is that they all rely on flooding (or at least limited BFS), which I think can, and should, be avoided for the sake of scalability.

2.4 Our contribution
We see our project’s contribution to this research area in combining many of the above ideas in a way that is guided by the concept of small world structures with a social search mechanism on top of them: • We use gossiping because we do not want to rely on users’ activities (i.e. queries) to build network structure. We thus hope to be
able to answer not only the popular queries: the fact that many of the systems mentioned above (e.g. Freenet) only cache information about resources when users ask for them, together with LRU replacement strategies, leads to serious problems when unpopular data is to be found. In Freenet, unpopular files may even vanish completely when rarely queried for. We try to overcome this problem without losing search efficiency.
• To achieve this, we combine the active spreading of peer profiles (gossiping) with topology restructuring. However, our gossiping approach is not designed to maintain information about the entire network in each peer, but only about locally clustered neighbourhoods.

• We try to avoid flooding completely and thus ensure scalability.

• We deliberately aim at creating small world structures and do not merely accept them as a "side-effect" (as [3] or [13] do).

3 The Algorithm

Our search algorithm (the idea of which was first sketched in [1]) consists of two parts:

1. a gossiping mechanism responsible for creating and maintaining a small world structure within the P2P network and

2. the search mechanism itself that uses this structure to locate requested data quickly and efficiently.

These ideas will be covered in the following sections. Although structure building is a prerequisite for the search mechanism to work efficiently, I will explain the actual search first, because it is also part of the structure-building process. From now on, we will assume that each peer consists of three components that entirely define its state:

• a set of documents which I will call its library. In this library, each document is represented by a vector of keywords, like in a vector space model for information retrieval (cf. [17]). The library can be thought of as the files a peer shares with the rest of the network. It consists of the files stored locally. Later, it may also contain vectors of documents that are stored on remote machines, together with a link referencing the physical location of the file.

• a profile that consists of a summarization of the peer's contents, i.e. a summarization of its library. I will later explain in detail how profiles can be calculated. Basically, a profile is a vector of keywords, just like a document.

• a set of neighbours: a peer stores the addresses of other peers together with their profile. These sets of neighbours represent the connections between peers and thus define the network topology. Note that the graph resulting from these connections is directed: if peer A knows about peer B, the reverse may not be true.

Additionally, we assume that there is a similarity function sim that allows us to calculate similarities between pairs of profiles, a query and a profile, and a query and a document. Since we are using a vector space model, the same function sim can be used in all three cases (queries will also be represented by vectors of keywords). Details of the similarity function are given in section 3.3.

3.1 Search

When a peer searches for a set of keywords, these keywords will be represented by a vector (with all keyword weights being 1) and attached to a message of type Query. Besides, a time to live (TTL) is specified, i.e. the maximum number of hops the query is allowed to travel before it has to be returned. This query is now passed to the query handling component of the peer, which performs the following steps (note that this procedure takes place in each peer that receives the query):

1. Compare the Query vector to my own library. Return any documents that are sufficiently similar to the query. There may be an explicit similarity threshold given (as in our simulation setup below) or a maximum number k of documents to be returned; the latter requires our search mechanism to return the k documents that best match the query.
2. Attach the keyword vectors of the returned documents to the Querylog of the original query, together with my address. At the end of the search, the Querylog contains the search results, i.e. a set of document keyword vectors together with the addresses of all the peers that have contributed an answer.
3. Attach my address and profile to the Log of the query. At the end of the search, the Log consists of the addresses and profiles of all the peers the query has reached.

4. Decrease the TTL of the Query by 1. Check if TTL = 0. If this is so, return the query directly to the querying peer. If not, go to step 5.

5. Now compare the profiles of each of my neighbours to the query vector. For the best match (according to our similarity function), check if the address of the corresponding peer is already contained in the Log of the message.

   • If not, forward the query to this peer.

   • If it is, select the next best neighbour, i.e. the peer whose profile is the second closest to the query, and repeat this step. This avoids circles in the path of the query. If no peer is left, i.e. if the query has reached all of my neighbours, send the query back to the querying peer (no backtracking is performed).

Note that there is only one message traversing the network. It may grow to a considerable size as there is a lot of information attached to it. We assume, of course, that document and profile vectors are fairly small: we aim at very compact representations. Besides, we can reduce message size by decreasing the TTL of messages if necessary.
To summarize the operating mode of our search mechanism: each peer that receives a query tries to contribute an answer from its library and then forwards the query to that neighbour whose profile best matches the query, i.e. that is most likely to have an answer to that query. This continues for a fixed number of hops, specified by the TTL of the query.
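The following sketch condenses steps 1-5 into code. It is a simplified illustration under stated assumptions, not the project's implementation: the names (Peer, Query, sim, handle) are invented here, the network is modelled as a plain dictionary of peers, and message passing, error handling and result de-duplication are omitted.

```python
# A minimal sketch of the per-peer query handling described in steps 1-5 above.
# All names and data structures are illustrative assumptions.
from dataclasses import dataclass, field

def sim(a: dict, b: dict) -> float:
    """Scalar product of two sparse keyword vectors (see section 3.3)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

@dataclass
class Query:
    vector: dict                                   # keyword -> weight (1 for user queries)
    ttl: int                                       # remaining number of hops
    querylog: list = field(default_factory=list)   # (peer address, document vector) answers
    log: list = field(default_factory=list)        # (peer address, profile) of visited peers

class Peer:
    def __init__(self, address, library, profile, neighbours, threshold=0.5):
        self.address = address
        self.library = library        # list of document keyword vectors
        self.profile = profile        # summarization of the library (section 3.3)
        self.neighbours = neighbours  # dict: neighbour address -> neighbour profile
        self.threshold = threshold

    def handle(self, query: Query, network: dict) -> Query:
        # Steps 1+2: answer from the local library, record answers in the Querylog.
        for doc in self.library:
            if sim(doc, query.vector) > self.threshold:
                query.querylog.append((self.address, doc))
        # Step 3: record this peer's address and profile in the Log.
        query.log.append((self.address, self.profile))
        # Step 4: decrease the TTL; stop when it is used up.
        query.ttl -= 1
        if query.ttl <= 0:
            return query
        # Step 5: forward to the best-matching neighbour not yet in the Log.
        visited = {addr for addr, _ in query.log}
        candidates = [(sim(profile, query.vector), addr)
                      for addr, profile in self.neighbours.items()
                      if addr not in visited]
        if not candidates:
            return query   # all neighbours already visited: return, no backtracking
        _, best = max(candidates)
        return network[best].handle(query, network)
```

Under these assumptions, a search would be started with something like network[start_address].handle(Query(vector={"databases": 1.0}, ttl=25), network), and the results read from the querylog of the returned query.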
3.2 Structure building

In order to create a small-world network structure, a gossiping mechanism is used: basically, it consists of peers issuing queries for their own profiles and analyzing the Log of answers and passing queries in order to find "better" neighbours. More precisely, each peer P periodically sends out a query containing its own profile. This query is then processed as described in section 3.1. When P receives an answer to its query, each entry in the Log of this answer is analyzed. The peers corresponding to these entries may be added to the set of P's neighbours. As this set is limited in size (we assume limited memory resources), other neighbours may have to be replaced. Peers may also inspect any passing queries that they have not issued themselves. This is needed to increase connectivity within the network: if the initial network is not strongly connected (which is quite probable in large networks), peers may not be able to discover new nodes in strongly connected components different from their own. If there is a path from peer A to peer B, but not vice versa, peer B may never learn of A's existence if it only inspects answers to its own queries (these will never reach A because there is no path). However, some of peer A's messages might reach peer B and by inspecting them, peer B can learn about peer A and possibly include it in its list of neighbours. Two strategies are used for the selection of new neighbours:

1. Cluster strategy: select new neighbours whose profile is similar to my own (according to the similarity function sim). If the set of neighbours is full, replace the neighbour that least matches my profile with a new one if and only if the profile of the new neighbour is more similar to mine than that of the old neighbour.

2. Intergroup strategy: select new neighbours whose profile is least similar to my own. Replacement is analogous to that for the cluster strategy, only with a reverse replacement condition.

To apply both strategies, the set of neighbours of each peer is divided into two parts: a "cluster" one and an "intergroup" one. By varying the size of the partitions, the influence of the two strategies can be tuned as needed.

Why do we assume that these strategies will eventually create a small-world network structure? Recall that a small world consists of some strongly connected subgraphs (clusters) that are in turn loosely connected. The cluster strategy is responsible for creating clusters of peers that offer similar content. The intergroup strategy inserts edges between these clusters.

This roughly corresponds to the small world model designed by Kleinberg [12]: he modeled a small world by introducing a two-dimensional grid where each node is connected to its immediate neighbours within that grid. This corresponds to clusters created by the cluster strategy. He then inserts long-range contacts for each node, connecting it to another peer at a larger distance. A long-range contact at distance d is chosen with probability p = d^{-r}. This means that – e.g. for r = 2 (which Kleinberg proves to be optimal with respect to a local search algorithm) – long-range contacts that are very far away become quite improbable. In our case, instead of preferring long-range contacts at smaller (similarity) distances, we chose to look for nodes at a maximal distance, because a peer P may only inspect queries that are routed through it. These contain addresses of other peers (in the Log) that were also deemed likely to contribute to the same query. It is therefore probable that these peers have profiles similar to P's, which makes it difficult to find any long-range contacts at all. So we had to maximize the semantic distance for long-range contacts in order to find any links between different clusters. Because of this problem, we cannot be absolutely sure that the structure building mechanism does indeed create a small-world network topology. This will later be investigated in our experiments and it will be shown that in our simulation small-world structures evolve all the same.
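A compact sketch of the two replacement rules follows. The partition sizes (14 cluster and 6 intergroup neighbours) are the ones used in runs 2 and 3 of the simulation below; the dictionary representation of the neighbour lists and the order in which a newly discovered peer is offered to the two partitions (cluster first, then intergroup) are assumptions made only for illustration.

```python
# Sketch of the neighbour replacement rules under the assumptions stated above.
def sim(a: dict, b: dict) -> float:
    """Scalar product of two sparse profile vectors (section 3.3)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def try_insert(partition, max_size, own_profile, addr, profile, prefer_similar):
    """Insert into a bounded partition, replacing its worst entry if necessary."""
    if addr in partition:
        return True
    if len(partition) < max_size:
        partition[addr] = profile
        return True
    # pick the entry that fits this partition's criterion least well
    worst = (min if prefer_similar else max)(
        partition, key=lambda a: sim(own_profile, partition[a]))
    s_new, s_old = sim(own_profile, profile), sim(own_profile, partition[worst])
    if (s_new > s_old) if prefer_similar else (s_new < s_old):
        del partition[worst]
        partition[addr] = profile
        return True
    return False

def consider_candidate(own_profile, cluster, intergroup, addr, profile,
                       max_cluster=14, max_intergroup=6):
    """Cluster strategy first; if the candidate is not adopted, try intergroup."""
    if not try_insert(cluster, max_cluster, own_profile, addr, profile, True):
        try_insert(intergroup, max_intergroup, own_profile, addr, profile, False)
```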
3.2.1 Caching of documents

If disk space is available and peers are willing to be cooperative, the gossiping mechanism can be extended to spread information about documents more widely. This has two advantages:

1. By caching vectors of documents that match their profile, peers can serve more requests about the topics they are specialized in. That means that retrieval effectiveness (recall) is increased.

2. Furthermore, users might be interested in expanding their own knowledge of certain topics without actively issuing queries. By initializing his shared directory with some documents of his interest and enabling document caching for his peer, a user can exploit the gossiping mechanism to collect more and more documents about the same topic.

Note that the gossiping mechanism is not allowed to physically download any documents: there is no need to waste disk space (unless for redundancy reasons when dealing with transient peer populations, see section 6). Instead, only document vectors, i.e. metadata, are collected, together with a link referencing the physical location of the file. These links can later be used to give users direct access to the corresponding file when the metadata is found by one of their queries. In our simulation experiments, we will see that document caching can have a tremendous influence on retrieval effectiveness.

3.3 Peer profiles and similarity measure

Let us now briefly consider how peer profiles and similarities are computed. For the creation of peer profiles, recall that each document is represented by a vector of terms. More precisely, this vector consists of weights, the entry at position i indicating how important term i is for the respective document (the weight probably but not necessarily being 0 if term i does not occur in the document). So if there are n terms occurring in all of the documents, a document d is (formally) represented by its vector \vec{d} = (w_1, ..., w_n). We assume these vectors to be normalized according to a sum norm, i.e. \sum_{i=1}^{n} w_i = 1. The profile of peer P is now simply calculated by adding up all the document vectors in P's library L:

\mathrm{Profile}(P) = \sum_{d \in L} \vec{d}    (1)

Because of space optimizations, this formula may have to be slightly modified: we may sometimes need to prune profiles, leaving over only the most prominent keywords (i.e. those with the highest weights) in order to reduce message sizes. We also plan to develop other techniques and algorithms that improve the expressiveness and compactness of peer profiles, but this is beyond the scope of this paper.

Note that the profile vector will not be normalized. This can be motivated by the following consideration: imagine two peers, A and B, possessing 10 and 100 documents respectively. If all of
peer A's documents (i.e. 10) are about "databases" and if peer B has 50 documents about "philosophy" and 50 about "databases", then we would intuitively prefer to send queries about "databases" to peer B. Normalizing profile vectors, however, will yield an entry of 1 for "databases" in peer A's profile vector (because all its documents are about databases), whereas the same entry in peer B's profile will be 0.5. By choosing not to normalize profile vectors, we ignore the "purity" of a peer's library (with respect to the topics covered in documents). Instead, we are interested merely in the number of documents a peer can contribute when asked for a given topic or keyword.

We will now see that profiles without normalization have a curious effect on the codomain of the similarity measure. The similarity measure is defined as a simple scalar product between two vectors \vec{a} and \vec{b}:

\mathrm{sim}(\vec{a}, \vec{b}) = \sum_{i=1}^{n} a_i b_i    (2)

If \vec{a} and \vec{b} are normalized according to the sum norm, all values of sim(\vec{a}, \vec{b}) fall into the interval [0, 1]. This is the case when (normalized) queries are compared to documents or when documents are compared with each other. Profile vectors, however, that are not normalized, yield arbitrary values (greater than zero) when being compared to queries or documents. But since we are – in almost all cases – interested in creating similarity rankings, the actual similarity values are of little interest.
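Equations (1) and (2) translate directly into code. The sketch below assumes sparse keyword vectors represented as dictionaries; the pruning size of 20 keywords is an assumed value. The small example at the end reproduces the two-peer "databases"/"philosophy" argument from above.

```python
# Sketch of profile construction (equation 1) and similarity (equation 2);
# vector representation and pruning size are illustrative assumptions.
def normalize(vec: dict) -> dict:
    """Sum-normalize a keyword vector so that its weights add up to 1."""
    total = sum(vec.values())
    return {t: w / total for t, w in vec.items()} if total else dict(vec)

def build_profile(library, keep=20):
    """Add up the (sum-normalized) document vectors; optionally prune to the
    `keep` most prominent keywords. The profile is deliberately not normalized."""
    profile = {}
    for doc in library:
        for term, weight in doc.items():
            profile[term] = profile.get(term, 0.0) + weight
    pruned = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(pruned)

def sim(a: dict, b: dict) -> float:
    """Equation (2): simple scalar product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Example: a peer with 10 'databases' documents vs. one with 50 + 50 documents.
peer_a = build_profile([{"databases": 1.0}] * 10)
peer_b = build_profile([{"databases": 1.0}] * 50 + [{"philosophy": 1.0}] * 50)
query = {"databases": 1.0}
assert sim(peer_b, query) > sim(peer_a, query)   # B is preferred, as argued above
```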
4 Experimental setup

4.1 Simplifying assumptions

The goal of our simulation was to gain an approximate understanding of whether the algorithm can work and how it will perform in comparison to searching in a random graph. This means that some simplifying assumptions had to be made in order to reduce the complexity of the problem while, at the same time, modelling the real world as realistically as possible.

The first simplification concerns documents: we chose not to work with real documents, but assumed document vectors to consist of semantic categories rather than keywords (categories were just integers in our simulation): we presume that each document can be classified according to the topics it covers and – for our simulation – we assume this classification to be available for all documents. This means that each document is represented by a category vector \vec{D} = (c_1, ..., c_n), the weight c_i indicating how important topic i is for this document.

As a second simplification, we neglected the influence of certain conditions that can be encountered in most real-life P2P systems:

• Transient peer populations: we assumed all peers to be on-line during the entire simulation.

• Bandwidth, computing power: these factors were not taken into account, i.e. we assumed that all peers were equipped with the same connection speed and computing power.

• There were no free riders in our simulation, i.e. all peers shared at least one document with the rest of the network.

We plan to continue our experiments by giving up these simplifications in order to finally arrive at a truly realistic scenario (see section 6 for details).

4.2 Simulation setup

The simulation was run in two modes: first, gossiping was performed until network structure and library contents stabilized (structure mode), then this structure was used by the recall mode: each of 100 randomly chosen peers searched the network for documents covering the different semantic categories (there was one query for each category per peer). In this section, I will concentrate on describing how the network was initialized. Table 1 shows some basic parameters of the networks we experimented with. We performed three different simulation runs; the differences between these runs can also be seen from table 1.

Some of these parameters need an extra explanation: we equipped every peer with limited storage space for neighbours and documents. In all three runs, a peer could store addresses and profiles of just 20 neighbours. This is a much smaller value than in [3] or [13] (i.e. it is somewhat pessimistic), but we believe that it should be sufficient
to keep the network connected. In run 1, all of these 20 neighbours are chosen according to the cluster strategy from section 3; in runs 2 and 3, peers can keep 14 "cluster" neighbours and 6 "intergroup" ones. This serves to study the effect of long-distance contacts on network connectivity and the average distance between nodes. A second important parameter that we varied was the "storage factor": in addition to their initial documents, peers are allowed to cache a limited number of document vectors that match their profile (i.e. extend their libraries). The cache size is proportional to the number k of initial documents, i.e. it is obtained by multiplying k by a constant, the storage factor SF. We introduced the storage factor because peers with a large library (i.e. large k) are likely to have more (spare) disk space than those that share few documents. We allowed document caching only in run 3 (SF = 5) to study its effects on recall.

Table 1: Simulation parameters

Parameter                                               Run 1    Run 2    Run 3    Random
# peers                                                 8000     8000     8000     8000
Max. # neighbours per peer                              20       20       20       20
# (distinct) documents                                  10,000   10,000   10,000   10,000
# semantic categories                                   50       50       50       50
Time to live (TTL) for gossiping messages               25       25       25       no gossiping
Percentage of neighbours chosen by cluster strategy     1        0.7      0.7      no choice
Percentage of neighbours chosen by intergroup strategy  0        0.3      0.3      no choice
Storage factor                                          1        1        5        1
Initializing the network essentially consists of three tasks:

1. Build document vectors: assign semantic categories to documents

2. Initialize peer libraries: assign documents to peers

3. Initialize peer neighbourhoods: assign addresses of neighbours to peers

To solve the first task, we assumed that in most cases documents cover only very few topics. We implemented a Zipf distribution for the number of topics per document: most document vectors consist of just one category whereas a few documents cover many (up to 10) categories. Document vectors may be replicated, i.e. there might (initially) be more than one copy of the same document in the system. The replication process is controlled by the assignment of documents to peers (see next step).

Following the proposal made by Müller et al. [15], we assume that peers are run by users that have certain interests. This is normally reflected by the documents in a peer's library, i.e. the peer's profile (when being calculated from the category vectors of the documents in the peer's library) should consist only of a limited number of semantic categories. On the other hand, the findings of Saroiu et al. [18] suggest that the number of files per peer is significantly skewed in typical P2P networks: there are a few peers that hold a large number of documents whereas the majority of the peers share few files. We chose to model this by another Zipf distribution. These considerations lead to our implementation of the second task: in a first step, each peer is assigned 1-3 "interests", i.e. semantic categories the user of the peer is deemed to be interested in. Documents are then classified according to the categories they contain: there is one set D_i for each semantic category c_i; if a document vector contains at least one non-zero entry for category c_i, it belongs to D_i. Next, the library size |L_p| is computed for each peer p such that we obtain our Zipf distribution. Finally, each peer is assigned |L_p| documents drawn from those sets D_i that correspond to its "interests" calculated before. The same document vector may be chosen more than once in this process, leading to document replication as mentioned above.

The third task is solved by simply assigning three randomly chosen addresses of peers to each node: these addresses serve as a first neighbourhood that has to be extended and improved during the structure mode of the simulation. Note that our initial graph was not strongly connected: there were 445 strongly connected components, the largest of which comprised 7556 nodes (i.e. there were 444 isolated nodes). We also "built" a random graph in which each peer was equipped with the same documents as in our three simulation runs and with 20 randomly chosen addresses of other peers as a neighbourhood. This random graph will serve as a benchmark for our gossiping mechanism in the next section: we will examine the properties of the network obtained by gossiping as compared to the random network and will evaluate which of the two graphs allows a better search.

We used the software package OMNeT++ to perform the actual simulation. OMNeT++ is an open source network simulator that is based on the paradigm of discrete event simulation. Thus, time synchronization between nodes, exchange of messages and administration of the underlying network could be handled very easily.
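The initialization just described can be sketched as follows. The network sizes are taken from Table 1; the Zipf exponent, the maximum library size and the sampling details are assumptions made only for illustration, and the real simulation was run in OMNeT++, not in Python.

```python
# Sketch of the three initialization tasks (categories are just integers).
# Zipf exponent and maximum library size are assumed values.
import random

NUM_PEERS, NUM_DOCS, NUM_CATEGORIES = 8000, 10000, 50

def zipf_int(n_max: int, exponent: float = 1.5) -> int:
    """Draw an integer in [1, n_max] with Zipf-like probabilities."""
    weights = [1.0 / (k ** exponent) for k in range(1, n_max + 1)]
    return random.choices(range(1, n_max + 1), weights=weights)[0]

# Task 1: assign semantic categories to documents (most cover a single topic).
documents = []
for _ in range(NUM_DOCS):
    topics = random.sample(range(NUM_CATEGORIES), zipf_int(10))
    documents.append({t: 1.0 / len(topics) for t in topics})

docs_by_category = {c: [d for d in documents if c in d] for c in range(NUM_CATEGORIES)}

# Task 2: 1-3 "interests" per peer, Zipf-distributed library sizes; documents
# may be chosen more than once, which yields the intended replication.
libraries = []
for _ in range(NUM_PEERS):
    interests = random.sample(range(NUM_CATEGORIES), random.randint(1, 3))
    size = zipf_int(200)            # assumed maximum library size
    pool = [d for c in interests for d in docs_by_category[c]]
    libraries.append([random.choice(pool) for _ in range(size)])

# Task 3: three randomly chosen peers as the initial neighbourhood of each node
# (self-links are ignored here for brevity).
neighbourhoods = [random.sample(range(NUM_PEERS), 3) for _ in range(NUM_PEERS)]
```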
5 Results

In this section, I shall present the results obtained with our simulation. They will be subdivided into two parts: first, the graph structure of the networks obtained by applying our gossiping mechanism will be analyzed and compared to the random graph. Second, searches for all available categories will be performed on some randomly chosen peers and retrieval effectiveness will be examined.

5.1 Graph analysis

Each P2P network can be seen as a directed graph G = (V, E) where V is the set of peers (vertices) and E is the set of (overlay) connections between peers (edges). In our case, we are interested in discovering whether the graph created during the simulation runs exhibits small world structures. Let us recall the definition of a small world as given by Watts and Strogatz [21]: we are looking for small average distances between nodes and a large average clustering coefficient C. [21] defines the latter – for single vertices – as the ratio of the number of edges that are present between a node's neighbours and the number of edges that could exist. More precisely, if we define the neighbourhood N_v of a node v as N_v = \{u \in V \mid (v, u) \in E\}, then we get

C_v = \frac{|\{(a, b) \in E \mid a, b \in N_v\}|}{|N_v| (|N_v| - 1)}    (3)

for a single vertex and

C = \frac{1}{|V|} \sum_{v \in V} C_v    (4)

as an average value for the whole graph. Note that this has already been translated to a directed graph (Watts and Strogatz only consider undirected graphs).
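Equations (3) and (4) can be computed directly from an adjacency structure. The sketch below assumes the directed graph is given as a dictionary mapping each node to the set of its successors; this representation is an assumption for illustration, not taken from the simulation code.

```python
# Direct implementation of equations (3) and (4) for a directed graph given
# as an adjacency dict {node: set(successors)}; a small sketch, not optimized.
def clustering_coefficient(node, adj) -> float:
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    # edges that are actually present between the node's neighbours
    present = sum(1 for a in nbrs for b in nbrs if a != b and b in adj.get(a, set()))
    return present / (k * (k - 1))

def average_clustering(adj) -> float:
    """Equation (4): average of the per-vertex coefficients over all nodes."""
    return sum(clustering_coefficient(v, adj) for v in adj) / len(adj)
```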
We are also interested in the number and size of strongly connected components in the network. Table 2 shows the clustering coefficients, the number of strongly connected components and the size of the largest component for the graphs obtained by gossiping in all three simulation runs and for the random graph.

Table 2: Clustering coefficients and strongly connected components

         Clustering coefficient   Number of components   Size of biggest component
Run 1    0.56                     6829                   1168
Run 2    0.34                     135                    7865
Run 3    0.31                     135                    7865
Random   0.0024                   1                      8000

The figures for the random graph are precisely what we expected: the graph is strongly connected and has a very low clustering coefficient. The latter is expected to be equal to the density of the graph, i.e. \frac{|E|}{|V|(|V| - 1)} (this yields 0.0025 in our case), which is confirmed by table 2.

We can see that connectivity increases in runs 2 and 3: as mentioned above, there were 444 isolated nodes at the beginning of the simulation, which can be decreased to 134 in both cases. In run 1, however, the system breaks up: the majority of the nodes is isolated at the end of the structure-building process. This means that choosing neighbours solely using a cluster strategy tends to destroy the network's connectivity. Obviously,
choosing some ”intergroup” neighbours can circumvent this problem and even increase connectivity. Figure 1 shows a histogram of the distances between 10,000 randomly chosen pairs of nodes in the random graph.
Figure 1: Distances between nodes in our random graph

According to [9], the average path length of a random graph is given by

l_r = \frac{\log(|V|)}{\log(|E|/|V|)}    (5)

which yields 3 in our case (|V| = 8000, |E| = 160,000). Again, this is quite exactly what we see in figure 1. Figures 2 and 3 show the same for runs 1 and 2. A distance value of -1 means that there was no path connecting the two nodes.

Figure 2: Distances between nodes in the graph produced by run 1

It can be seen that there is no connecting path between the majority of node pairs in the graph produced by run 1. This reflects our previous findings: most nodes are isolated, i.e. they cannot be reached from within the large strongly connected component. The picture is quite different for run 2 (distances in run 3 were very similar, so they will not be shown here): most node pairs are connected and the average distance between connected vertices is 4.3.

Figure 3: Distances between nodes in the graph produced by run 2

Another interesting aspect of our graphs is shown in figure 4: the distribution of indegrees among peers. The figure plots the rank of a peer (with respect to indegree) against its indegree.

Figure 4: Distribution of indegrees for all runs: peers are ranked according to their indegree.

This is close to a Zipf distribution for runs 1-3, although it does not quite match it. We suspected a correlation between the number of documents a peer possesses (which was initialized to be Zipf-distributed, see section 4.2) and its indegree. Figure 5 shows the results of our analysis: obviously, there is no strong correlation, which would have resulted in a straight line. On the other hand, the picture certainly reveals a strong tendency for peers with large libraries to have high indegrees.
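For completeness, here is a small sketch of the indegree analysis behind figures 4 and 5. It uses the same assumed adjacency representation as in the clustering-coefficient sketch above, and it presumes that nodes are integer indices into a list of libraries; both are assumptions for illustration.

```python
# Sketch of the indegree analysis underlying figures 4 and 5.
from collections import Counter

def indegree_ranking(adj):
    """Indegrees sorted in decreasing order: rank 1 = highest indegree (figure 4)."""
    indeg = Counter()
    for successors in adj.values():
        indeg.update(successors)
    return sorted(indeg.values(), reverse=True)

def indegree_vs_library_size(adj, libraries):
    """Pairs (library size, indegree) underlying the scatter plot in figure 5."""
    indeg = Counter()
    for successors in adj.values():
        indeg.update(successors)
    return [(len(libraries[p]), indeg.get(p, 0)) for p in adj]
```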
Figure 5: The number of documents in a peer's library plotted against its indegree

Summarizing the results obtained by our graph analysis, we can say that apparently random shortcuts – as created by an intergroup strategy – are necessary in order to keep the network connected. When using both strategies (i.e. cluster and intergroup strategy as in runs 2 and 3), small world structures do emerge: paths are short (although they are slightly shorter in a random graph) and we have a clustering coefficient that is significantly higher than its counterpart in a random graph. Induced by our power-law distribution of documents on peers, the distribution of indegrees in the network is significantly skewed in a Zipf-like fashion, with a tendency towards peers with large libraries having high indegrees.

5.2 Recall

In a second series of experiments, we selected 100 peers at random from each of our networks. Then, each of these peers generated 50 search messages, one for each semantic category. These queries were processed by the search algorithm outlined in section 3, using the three different networks that were created before by the gossiping mechanism and the random graph (note that the same search mechanism was used in all cases!). Peers provided answers to a given query by returning documents similar to that query. We used a similarity threshold of 0.5, i.e. a document D was returned as an answer to a query Q iff sim(D, Q) > 0.5 (this can be seen as an objective criterion replacing a user's judgement about relevance). Using a centralized index, we were able to compare the results found in the P2P network to the set of documents that should have been found, i.e. to all the documents in the entire collection that fulfilled sim(D, Q) > 0.5. Because we replaced the user's relevance judgement with a similarity threshold, the system would never return any "non-relevant" documents (i.e. precision was always 100%). However, we were interested in recall, which – for a given query Q – can be defined as follows:

R = \frac{\#\,\text{documents found in P2P network}}{\#\,\text{documents found in centralized index}}

So, actually, we compare the retrieval effectiveness of our P2P search algorithm to that of a centralized index. The same evaluation method is used in [13].
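The recall measure can be computed as sketched below. The sketch assumes that the centralized index is a dictionary from document ids to keyword vectors, that the documents found in the P2P network are identified by the same ids (collected from the Querylog of a returned query), and that sim is the scalar product from section 3.3; these representation choices are assumptions for illustration.

```python
# Sketch of the recall computation under the assumptions stated above.
def sim(a: dict, b: dict) -> float:
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def recall(query_vector, found_doc_ids, centralized_index, threshold=0.5):
    relevant = {doc_id for doc_id, vec in centralized_index.items()
                if sim(vec, query_vector) > threshold}
    if not relevant:
        return 1.0   # nothing to find for this query
    return len(relevant & set(found_doc_ids)) / len(relevant)
```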
Figure 6 shows recall as a function of the time-to-live of the queries, averaged over all 100 peers that generated queries and over all 50 semantic categories that they asked for.

Figure 6: Recall as a function of TTL, averaged over all queries

We can see that recall converges quite quickly to about 75% for run 3 and to about 45-50% for runs 1 and 2. There are some really interesting observations to be made:

• First, in all three scenarios obtained by gossiping, recall does not improve significantly further when messages make more than 10 hops. Or, put another way: after visiting 10 peers, queries find hardly any more relevant documents. This is different in a random graph: here recall increases linearly with the number of hops a query is allowed to travel. However, as we expected, searching in a random graph is far less effective than using the network structure generated by our gossiping mechanism.

• The effect of caching documents during gossiping is tremendous: the recall values
achieved by run 3 are far better than in runs 1 or 2. Aggressive caching of documents is, among other things, what made systems like Freenet so successful. Here, we can see why.
• Strikingly, the recall curves for runs 1 and 2 are almost identical. This means that although the graph in run 1 mainly consists of isolated nodes, data can still be located just as effectively as in a well-connected graph such as that of run 2. This can be explained as follows: what remains as a strongly connected component (consisting of 1168 nodes in run 1) is a set of peers that act as authorities: they have large libraries and high indegrees and can thus serve most of the incoming queries. The isolated nodes, on the other hand, have little to contribute and can be ignored by the rest of the network with little loss of information.

The third of these observations shows that there are "authorities", i.e. nodes that are neighbours to almost every peer in the network. They receive many queries because they have large libraries containing interesting information about many topics. Again, this is a consequence of our modeling the distribution of documents by a power law. But we believe that this is quite realistic, i.e. it occurs in many real-life P2P systems (cf. [18]) or in the web (see [6]) and, although it somehow violates the basic P2P paradigm of all peers being equally important, it must be accepted as a matter of fact. Unfortunately, peers with large libraries may not always be equipped with high bandwidth and computing power, which might call for load balancing strategies (see section 6).

Finally, I present a recall histogram for TTL = 25 in run 3 (figure 7). This shows that recall is between 70 and 80% for most queries, with small deviations to either side. We measured the standard deviation σ of recall values: the overall deviation was σ = 10.6% for TTL = 25 in run 3. When averaging over peers (i.e. looking at 50 different values, one for each semantic category), we also got σ = 10.6%. Averaging over semantic categories (i.e. looking at 100 values, one for each peer) resulted in σ = 0.2%. This is interesting because it means that finding data is equally easy (or difficult) for all the peers in the network, whereas recall may vary considerably depending on the semantic category one asks for: for some categories, it is much easier to find corresponding data than for others.

Figure 7: Recall histogram for TTL = 25, run 3

All in all, recall is not perfect. But the figures achieved in run 3 seem acceptable to us; the fact that most results can be found by making only 10 hops suggests that we need not worry about big latencies. Besides, recall is much better in networks formed by gossiping than in a random network, which indicates that the network structure we created does indeed give hints about where data can be found. Finally, we have seen that caching of documents can help to greatly improve recall. Since storing metadata is not very costly (as far as disk space is concerned), we can assume that most real-life peers will be willing to cooperate in document caching, perhaps even to a much greater extent than we assumed in our simulation.

6 Future Work

Within our project, efforts are underway to make some of our simulation's features more realistic. These include:

• Working with real data, i.e. real documents: this implies that technologies are needed for automatically indexing documents with few but meaningful keywords.

• Transient peer populations: peers joining and leaving the network is a natural phenomenon in real P2P networks. We plan to take this into account and see whether the algorithms can still work. We believe that there is a good chance that they might. However, a third strategy will probably be needed for
neighbour selection: one will have to consider on-line availability of peers when looking for ”good” neighbours.
• Load balancing: we have seen that peers with large libraries have high indegree and are therefore likely to receive more data requests than others. We will have to consider bandwidth and computing power limitations and – as a consequence – strategies for load balancing or message flow control (MFC).

• Scalability: due to limited resources, experiments could only be conducted with a network of 8,000 peers. We would like to extend this size significantly in future simulations in order to examine the scalability of our approach. The findings of [13, 3] suggest that the TTL needed to find requested data scales logarithmically with the size of the network. We would expect a similar behaviour for our scenario.

7 Conclusion

In this paper, I outlined the idea of a content-based search algorithm for distributed systems that was inspired by the concept of small worlds. We designed a gossiping mechanism responsible for creating and maintaining a small world structure in P2P systems and exploited this structure with our search algorithm. This algorithm makes use of navigational clues offered by the semantic clustering of peers (as induced by the gossiping) and thus needs a minimal number of messages to locate requested data. We verified our assumptions about the feasibility of these strategies by implementing a simulation tool. The results obtained from our analyses show that it is indeed possible to create small world structures in P2P networks by letting peers organize into clusters of semantic similarity. Furthermore, we found that this does indeed greatly facilitate query routing (as opposed to a search in random networks) and, while not perfect, yields good recall values when compared to a centralized index, even when using a TTL as small as 10.

Acknowledgements

Many thanks to Florian Holz and Sven Teresniak for helping with the setup of the simulation environment. I would also like to thank Andras Varga for his help and advice on how to use OMNeT++ (and Jean-Alexander Müller for recommending it) and Herwig Unger, Thomas Böhme and Gerhard Heyer for many fruitful ideas and discussions.

References

[1] S. Bordag, G. Heyer, and U. Quasthoff. Small worlds of concepts and other principles of semantic search. In Proc. of Innovative Internet Computing Systems, Second International Workshop (IICS), 2003.

[2] Jeen Broekstra, Marc Ehrig, Peter Haase, Frank van Harmelen, Maarten Menken, Peter Mika, Björn Schnizler, and Ronny Siebes. Bibster - A Semantics-Based Bibliographic Peer-to-Peer System. In Proceedings of SemPGRID '04, 2nd Workshop on Semantics in Peer-to-Peer and Grid Computing, pages 3-22, New York, USA, May 2004.

[3] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A Distributed Anonymous Information Storage and Retrieval System. Lecture Notes in Computer Science, 2009:46+, 2001.

[4] A. Crespo and H. Garcia-Molina. Routing indices for peer-to-peer systems. In Proc. of the 28th Conference on Distributed Computing Systems, 2002.

[5] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In 12th International Symposium on High Performance Distributed Computing (HPDC), 2003.

[6] N. Deo and P. Gupta. World Wide Web: A graph-theoretic perspective. Technical Report CS-TR-01-001, University of Central Florida, 2001.

[7] Gnutella. www.gnutella.com.

[8] P. Haase, R. Siebes, and F. van Harmelen. Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In Mokrane Bouzeghoub, editor, Proceedings of the International Conference on Semantics in a Networked World (ICNSW'04), volume 3226 of LNCS, pages 108-125, Paris, June 2004. Springer Verlag.
[9] A. Iamnitchi, M. Ripeanu, and I. Foster. Small-World File-Sharing Communities. In Proceedings of Infocom, 2004.

[10] S. Joseph. NeuroGrid: Semantically Routing Queries in Peer-to-Peer Networks. In Proceedings of the International Workshop on Peer-to-Peer Computing, 2002.

[11] V. Kalogeraki, D. Gunopulos, and D. Zeinalipour-Yazti. A Local Search Mechanism for Peer-to-Peer Networks. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 300-307, 2002.

[12] J. Kleinberg. The Small-World Phenomenon: An Algorithmic Perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing, 2000.

[13] A. Z. Kronfol. FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine, 2002.

[14] S. Milgram. The small world problem. Psychology Today, 1(1):60-67, 1967.

[15] W. Müller, M. Eisenhardt, and A. Henrich. Efficient content-based P2P image retrieval using peer content descriptions. In SPIE Electronic Imaging, 2004.

[16] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. In Proceedings of the ACM SIGCOMM, 2001.

[17] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613-620, 1975.

[18] S. Saroiu, P. Gummadi, and S. Gribble. A Measurement Study of Peer-to-Peer File Sharing Systems. In Proceedings of Multimedia Computing and Networking, 2002.

[19] Ka Cheung Sia. P2P information retrieval: A self-organizing paradigm. Technical report, 2002.

[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149-160, 2001.

[21] D. Watts and S. Strogatz. Collective Dynamics of 'Small-World' Networks. Nature, 393(6):440-442, 1998.

[22] H. Zhang, A. Goel, and R. Govindan. Using the Small-World Model to Improve Freenet Performance. In Proc. of IEEE Infocom, 2002.

[23] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-resilient wide-area location and routing. Technical Report UCB//CSD-01-1141, U. C. Berkeley, 2001.