
J. Parallel Distrib. Comput. 70 (2010) 282–295


An adaptive overlay network inspired by social behaviour

Vincenza Carchiolo a, Michele Malgeri a, Giuseppe Mangioni a,∗, Vincenzo Nicosia b

a Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria, Università di Catania, V.le A. Doria 6, 95125 Catania, Italy

b Laboratorio sui Sistemi Complessi, Scuola Superiore di Catania, Via S. Nullo 5/i, 95123 Catania, Italy

Article info

Article history: Received 21 October 2008; received in revised form 11 February 2009; accepted 27 May 2009; available online 10 June 2009.

Keywords: Peer-to-peer systems; Overlay networks; Socially inspired systems

Abstract

Nature is a great source of inspiration for scientists, because natural systems seem to be able to find the best way to solve a given problem by using simple and robust mechanisms. Studying complex natural systems, scientists usually find that simple local dynamics lead to sophisticated macroscopic structures and behaviour. It seems that some kinds of local interaction rules naturally allow the system to self-organize into an efficient and robust structure, which can easily solve different tasks. Examples of such complex systems are social networks, where a small set of basic interaction rules leads to a relatively robust and efficient communication structure. In this paper, we present PROSA, a semantic peer-to-peer (P2P) overlay network inspired by social dynamics. The way queries are forwarded and links among peers are established in PROSA resembles the way people ask other people for collaboration, help or information. Behaving as a social network of peers, PROSA naturally evolves into a small world, where all peers can be reached in a fast and efficient way. The underlying algorithm used for query forwarding, based only on local choices, is both reliable and effective: peers sharing similar resources are eventually connected with each other, allowing queries to be successfully answered in a very short time. The resulting emergent structure can guarantee fast responses and good query recall. © 2009 Elsevier Inc. All rights reserved.

1. Introduction

In the last decades, sociologists have focused on studying social networks in order to understand why the collaboration of several people with different behaviour can result in an organized community. This field is becoming even more interesting for physicists, because many structural similarities between social networks and discrete matter organization have been discovered. Studies performed by Newman, Albert, Barabasi [23–25,45,2], and others underline the fact that almost all networks of cooperating elements, even if cooperation is based on really simple rules, naturally evolve into a small world. The existence of small worlds in social networks was empirically demonstrated by Milgram [22]. Through a famous experiment, he showed that two randomly chosen American people are connected by a very short chain of relationships. This result is surprising: how is it possible that two hundred million people are all connected by just “six degrees of separation”? A first model of the small-world effect was proposed by Watts and Strogatz [44]: they proposed adding random links to an ordered mesh of nodes, and discovered that the average path length among nodes was dramatically lowered by the addition of just a few long-distance links.

On the other hand, not only networks of people and collaborations, but also some artificial networks, such as the Internet or the World Wide Web (WWW), are small worlds. The most valuable characteristic of a small world is that messages from one node of the network to any other one can be delivered in a few steps, thanks to long-distance links. Since one of the main issues of many peer-to-peer (P2P) overlay networks is that searching and retrieving documents is slow and inefficient, we propose a novel P2P overlay structure called PROSA (P2P Resource Organization by Social Acquaintances) [5,6], that tries to mimic the way social links among peers are established and evolve, in order to build an efficient and self-organizing P2P network for resource sharing.

The paper is organized as follows. Section 2 is a brief overview of recent studies in the field of semantic-driven P2P networks; Section 3 explains the basic ideas PROSA is inspired by; in Section 4 we give a formal description of the algorithms involved; Section 5 describes the simulation framework used to test PROSA features; in Section 6 topological aspects of the network are discussed, along with simulation results; Section 7 reports PROSA performance in resource retrieval; Section 8 reports some results about PROSA robustness and Section 9 summarizes the results obtained.

∗ Corresponding author. E-mail addresses: [email protected] (V. Carchiolo), [email protected] (M. Malgeri), [email protected] (G. Mangioni), [email protected] (V. Nicosia). 0743-7315/$ – see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2009.05.004
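As an illustration of the small-world effect recalled in the Introduction, the following Python sketch builds a ring lattice, measures its average path length, adds a handful of random long-distance links and measures again. It is not taken from the paper or from [44]: the function names, the lattice size (200 nodes, 2 neighbours per side), the number of shortcuts (10) and the random seed are all assumptions chosen for the example, and links are added rather than rewired.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ordered mesh: each node is linked to its k nearest neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS from each node)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def add_shortcuts(adj, m, rng):
    """Add m random links between nodes that are not yet directly connected."""
    nodes = list(adj)
    added = 0
    while added < m:
        a, b = rng.sample(nodes, 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1


rng = random.Random(42)          # fixed seed, only for reproducibility
lattice = ring_lattice(200, 2)
before = average_path_length(lattice)
add_shortcuts(lattice, 10, rng)
after = average_path_length(lattice)
print(f"average path length before: {before:.2f}, after: {after:.2f}")
```

Since every added link joins two nodes that were at distance two or more, the average path length can only decrease; how large the drop is depends on where the random shortcuts happen to fall.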

V. Carchiolo et al. / J. Parallel Distrib. Comput. 70 (2010) 282–295

2. Related works

P2P networks have gained popularity and interest in the last ten years mainly because cheaper and faster access to the Internet is nowadays available for the public at large. In this new scenario people are relatively free from price and bandwidth constraints, and sharing documents and multimedia files has become more attractive and effective. The problem is that traditional protocols for file transfer and resource publishing (such as File Transfer Protocol and Hypertext Transfer Protocol) are intrinsically asymmetric: on the one side there are servers, where resources are put and can be accessed; on the other side there are clients, which can just download resources from servers. If a server is down, all resources it provides are unreachable, even if many clients want to download them. This is a rather unnatural way of sharing resources. Users usually want to send to friends or colleagues resources downloaded from servers, and want their friends to do the same, avoiding being bound to a given “provider”.

P2P overlay networks were born to address this problem. In a P2P network, each “node” acts as a client if it is asking for a resource, while it becomes a server whenever it is asked for a resource. Nodes in a P2P network are functionally equivalent, and can collaborate in order to reach the common target: making their resources available in a fast and reliable way to a large number of nodes in the network, while being able to retrieve wanted resources themselves, even if some other nodes are down or unreachable. For this reason, researchers have proposed many models for P2P overlay networks, the most common being “Distributed Hash Tables” (DHTs) [39] and “Unstructured Overlays”.

DHT-based P2P networks are able to give a certain internal structure to the network. Nodes are usually assigned unique IDs and are connected to a certain number of nodes that are “near” them according to a given hashing function and to a similarity index. Queries are usually routed using the best available link and the underlying structure allows forwarding queries in a really efficient way, usually in O(log n) hops, where n is the number of nodes in the network. The main problem with DHTs is that resources have to be queried using their ID, which is not a human-readable string.

Unstructured Overlay networks are the simplest P2P networks. Each node is directly connected to a relatively small number of other nodes. One of the early examples of such a network is Gnutella [10]. In this overlay network, queries for resources are based on limited flooding, i.e. the query is forwarded to all connected nodes, except the one that sent it, provided that the same query is not forwarded twice from the same node, and that queries live just for a limited number of “hops”. The worst aspect of the Gnutella overlay network is that flooding is not efficient, and limited flooding does not guarantee finding matching results. To address the problems of the Gnutella searching algorithm, many alternative schemes have been proposed. These include iterative deepening [46], the k-walker random walk [21], modified random Breadth First Search (BFS) [16], the two-level k-walker random walk [12], directed BFS [46], intelligent search [16], local indices based search [46], routing indices based search [8], attenuated bloom filter based search [33], adaptive probabilistic search [42], and dominating set based search [47]. Many of them are variations of the BFS, while others are Depth First Search (DFS) based. The interested reader can refer to [17] for a survey of major searching techniques in P2P networks.
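To make the limited-flooding idea concrete, here is a minimal Python sketch of a hop-limited flood over an unstructured overlay. It is only an illustration of the rules stated above (forward to every neighbour except the sender, never forward the same query twice, stop after a fixed number of hops); it does not reproduce the actual Gnutella protocol, and the function name, the `adj`/`resources` dictionaries and the message counter are assumptions made for the example.

```python
from collections import deque


def limited_flood(adj, resources, start, wanted, ttl):
    """Flood a query for `wanted` from `start`, at most `ttl` hops away.

    adj:       dict mapping each peer to an iterable of its neighbours
    resources: dict mapping each peer to the collection of resources it hosts
    Returns (peers holding the resource, number of query messages sent).
    """
    hits = set()
    seen = {start}                       # peers that already handled this query
    queue = deque([(start, None, ttl)])  # (peer, sender, hops left)
    messages = 0
    while queue:
        peer, sender, hops_left = queue.popleft()
        if wanted in resources.get(peer, ()):
            hits.add(peer)
        if hops_left == 0:
            continue                     # the query dies here
        for nxt in adj[peer]:
            if nxt == sender:
                continue                 # never send the query back to the sender
            messages += 1
            if nxt not in seen:          # duplicates are received but not re-forwarded
                seen.add(nxt)
                queue.append((nxt, peer, hops_left - 1))
    return hits, messages


# A chain of five peers; only the last one hosts the wanted resource.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
hosted = {4: {"song.mp3"}}
print(limited_flood(chain, hosted, start=0, wanted="song.mp3", ttl=2))
print(limited_flood(chain, hosted, start=0, wanted="song.mp3", ttl=4))
```

With a hop limit of 2 the query never reaches the peer that hosts the resource, which is the sense in which limited flooding does not guarantee finding matching results.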
Some recent works [3,50,11,18,49,48,34] focused their interest on introducing a certain amount of semantics in P2P overlays, allowing a query to become more readable and understandable by users, while trying to maintain good performance in terms of resource availability, recall and robustness.


In particular, SETS [3] uses a hybrid P2P network where all nodes are spread into a predefined number of clusters (topic segments), depending on the kind of resources they are sharing, and a network manager (a super-peer) is responsible for periodically reassigning nodes to each cluster in order to maintain consistency. The main issue of SETS is that the reliability and performance of the network depend on a single super-peer, which is a single point of failure. Moreover, the number of clusters is an arbitrary value, and it is not clear how it affects performance.

A completely different approach is used in GES [50]. No super-peer is in charge of creating clusters of similar nodes and network management is completely distributed. Each peer decides to link to some other peer depending on a similarity index, after a handshake phase. Queries are forwarded using a hybrid algorithm: if a peer does not have links to relevant nodes, it forwards the query along a random path. Otherwise, the most relevant neighbour is selected as the next hop. The problem with GES is that network management requires additional messages to be exchanged among neighbours. As in SETS, in GES resources are also represented by a Vector Space Model, in terms of vectors of TF–IDF [36] coefficients.

A similar approach is used in INGA [11,18]. The selection of the next peer to forward a query is based upon the probability that a node could answer it. INGA uses an ontological approach to model resources and the relevance between queries and resources. Semantic similarity among nodes is mapped onto topological relationships: peers are usually linked to neighbours that share similar resources. The main issue of INGA is that resources need an ontological description, and this description is usually different for different classes of resources. Moreover, the semantic cluster in INGA does not change as a consequence of query routing, resulting in an almost static structure.

Looking at the issues of these approaches, we tried to find a suitable model for a P2P network that could solve the problems introduced by organization needs and by semantics. We found that such a network does exist in nature: it is the network of social relationships. In fact, as explained in the next section, social relationships allow us to find good solutions to many problems, from collecting resources to gaining collaborations or finding help. Moreover, it has been widely observed [27,20,29,37,43] that social networks are usually small worlds, and they also allow queries to be routed in a really fast and effective way in very large networks. From here came the idea of copying natural human behaviour, implementing the mechanisms involved in acquiring, modifying and cutting social links, so that the resulting P2P overlay could evolve into an effective and reliable small-world network.

PROSA tries to face some important issues of P2P systems. First of all, query routing is based on a local evaluation of relevance between nodes and queries themselves: messages are not flooded to a great number of nodes, as in Gnutella, but just to nodes that can probably answer them. Second, the network organization is entirely distributed and unsupervised: nodes naturally link to other nodes that share similar resources, once they “meet” them as a consequence of searching resources. It is not necessary to have super-peers in charge of deciding where to put each node, as in SETS, and all nodes participate in building the structure of the network. Third, no overhead messages are needed in order to build or remove links among nodes: query messages are used to establish new links and to renew them, with no extra messages for network management, unlike in GES.
Finally, since routing and searching algorithms are based on social behaviour, simulations show that PROSA naturally evolves into a small-world network of peers, where each node can find the required resources just a few hops away. Peers naturally get divided into “semantic groups”, i.e. emergent groups of nodes that are interested in a topic, that share resources in that topic and, thanks to the underlying link management algorithm, usually try to link to each other. The structure of PROSA can dynamically change, following changes in peer preferences and attitudes: if a peer gets involved in different topics, it will link to other peers which can provide resources in those topics, just by making queries and waiting for responses.

3. Learning from social networks

Social networks are an abstract model for the structure and dynamics of groups of connected or cooperating people. The most widely used way to represent networks is by graphs: each actor of a social network is associated with a “node” of the graph, while relations among people are represented by directed or undirected “links” between nodes. But representing a society by means of nodes and arcs is not enough to guarantee that the network could ever self-organize and evolve into a complex system. It is clear that the emergent structures and dynamics of complex systems are mainly due to the overall effects of local interactions among connected nodes. Our target is to understand how social networks work, in a simplified way, in order to use this model as a reference to build a self-organizing P2P system.

3.1. The social metaphor

In social systems, local interactions among people consist of communication acts: two nodes get in touch when they need help, collaboration, resources, information, knowledge, and so on. But communication is also necessary to establish new relationships with other people and to make them evolve. The first time two people meet they usually do not share any knowledge: each of them simply knows that the other is alive. The same situation holds when a baby comes to life: he does not have relationships with anybody except his parents, and those relationships are really weak. He does not have any information about his father's or mother's interests, job, culture and knowledge. He simply knows that the parents are alive, and he will “use” those relationships whenever he needs any kind of resource, including food and water.
Relationships of a baby evolve when he gets in touch with other people, for example when he goes to school. While interacting with classmates, he could find that some of them are fond of football, while some others are good at maths, and some others like to play chess. The most interesting thing in social networks is that queries for resources are routed in a fast and efficient way, naturally flowing to “peers” that can successfully answer them. In the case of our child, for example, he knows that it is better to ask his teacher questions about maths, whilst it is better to ask one of his classmates for information about his favourite football player. This is what we call “semantic routing”: in real life, humans are able to forward queries to other humans directly connected to them, choosing those that could probably give an answer back.

3.2. Reproducing social dynamics

In order to build a peer-to-peer network having some desirable properties of social networks, it is important to use mechanisms of node linking and query forwarding similar to those observed in natural social networks. For this reason we decided to model relationships among peers in PROSA by means of directed links, since acquaintances generally are asymmetric relations. For example, the prime minister of a country is at least “known” by almost all citizens, even if he directly knows just a couple of thousand of them. Another aspect of relationships is that not all have the same strength. This fact obliges us to consider different kinds of links among peers, reflecting the possible different values or strength of the relations. For example, the relation of a baby to his parents is, at least initially, really weak. The baby simply knows that his parents exist, and no more. We call this kind of weak link an Acquaintance Link (AL) in PROSA.

Acquaintance links usually evolve and become stronger, mainly because people get in touch, communicate, and ask for help or collaboration. For example, if a friend asked us something about “golf” and we were not able to answer him, we would usually remember that he is involved with golf. If we were to need information about a golf club at some later time, and if we did not know anybody in that field, we could ask our old friend, assuming that he eventually found information about “golf” and could help us find similar resources. The link we had with that friend is not a simple AL: it is a sort of “hint” about his interests and knowledge, based on a “query” he forwarded to us in the past, and on the assumption that he eventually got in touch with people in the field of golf. This kind of link is called a Temporary Semantic Link (TSL) in PROSA, since its strength lies in assumptions about peer knowledge, based on past queries forwarded by that peer.

Finally, a third kind of link is introduced in PROSA, the so-called Fully Semantic Link (FSL). FSLs are the strongest links in PROSA: an FSL from a source peer a to a target peer b arises whenever b is able to answer a query originated in a. Providing a given resource is considered as a “meeting” among the peers, and we assume that two peers that get in touch by sharing resources will know each other better than before. FSLs model friendship and strong relationships in general.

4. Conceptual model for PROSA

Generally, an overlay network can be modelled as a directed graph, G = (P, L). P denotes the set of peers (i.e. vertices). L is the set of links l = (s, t) (i.e. arcs), where t is a neighbour of s, s is the source and t is the target peer.
However, a directed graph alone is not enough to model PROSA: we need to introduce some other structures, mainly to cope with knowledge management and query processing. Therefore, we define a model for knowledge to represent resources hosted by peers, and a model for query messages used to search for information.

4.1. Modelling knowledge

In order to define PROSA it is necessary to describe how knowledge (e.g. resources or documents) is represented in each peer. The behaviour of PROSA and the algorithms governing its dynamics are independent of the knowledge model adopted, i.e. changing the model of knowledge does not affect the mechanism upon which PROSA works. For this reason, in this section we introduce a very general framework aimed at describing the behaviour that a knowledge model should have. In this framework, each resource hosted by a peer is represented as an element of the Resource Space R. The mapping P_r : P → 2^R associates each peer with some resources. Given a set of resources, we define a function R_c : 2^R → C that provides a compact description of resources; the result of this function is an element of C, a generic space where a sum operation (+) is defined. The knowledge of a peer is represented by the resources it contains: for the sake of simplicity we define a function P_k : P → C, that is a compact description of the peer knowledge for a given peer s ∈ P. It can also be obtained as P_k(s) = R_c(P_r(s)).

4.2. Query representation

As in the previous section, we do not define a specific model for query representation, but a general framework suitable to implement any query model. In this framework, each query is represented by an element of the Query Space Q. We define the relevance of a resource r ∈ R with respect to a given query q ∈ Q as a function R_rq : R × Q → K (where K is a space where an order relation ≥ is defined). Similarly, the relevance of the knowledge of a given peer s ∈ P (computed as P_k(s)) with respect to a query q ∈ Q is worked out by a function R_pq : C × Q → K. The relevance will be used by the PROSA query routing algorithm. It is worth noting that a high relevance between q and P_k(s) means that peer s has documents that can probably match q. When a peer s ∈ P receives a query q ∈ Q, we need to know which resources hosted by the peer have a relevance, with respect to the query, greater than or equal to a given threshold Th ∈ K. For this purpose we define the function R_{pq>Th} : P × Q × K → 2^R, such that

\[ R_{pq>Th}(s, q, Th) = \{\, r \in P_r(s) : R_{rq}(r, q) \ge Th \,\} \qquad (1) \]
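Since the framework of Sections 4.1 and 4.2 is deliberately abstract, a concrete toy instance may help. The Python sketch below is only one possible choice and is not prescribed by the model: it assumes that a resource is a bag of terms (a `Counter`), that C is the space of such bags with the usual sum, that queries are bags of terms too, and that both R_rq and R_pq are the cosine similarity, so that K is the set of real numbers. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def rc(resources):
    """R_c: compact description of a set of resources (assumed: summed term counts)."""
    total = Counter()
    for resource in resources:
        total += resource
    return total


def relevance(description, query):
    """Stand-in for both R_rq and R_pq (assumed: cosine similarity of term counts)."""
    dot = sum(weight * query[term] for term, weight in description.items())
    norm = math.sqrt(sum(w * w for w in description.values())) * \
        math.sqrt(sum(w * w for w in query.values()))
    return dot / norm if norm else 0.0


def relevant_resources(peer_resources, query, threshold):
    """R_{pq>Th}: resources of a peer whose relevance to the query is >= threshold."""
    return [r for r in peer_resources if relevance(r, query) >= threshold]


# P_r(s): a peer hosting one document about golf and one about chess.
peer_resources = [Counter(golf=2, club=1), Counter(chess=3)]
query = Counter(golf=1)

peer_knowledge = rc(peer_resources)                    # P_k(s) = R_c(P_r(s))
matches = relevant_resources(peer_resources, query, 0.5)
print(peer_knowledge, relevance(peer_knowledge, query), matches)
```

Any other pair of compact description and relevance function would fit the same framework, provided that C supports a sum and K an order relation.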

In PROSA, a peer which receives a query forwarded by an unknown peer (as will be shown later) can obtain from the query itself some information about the knowledge of the source peer, which can be used to establish a new link with the source peer. This is modelled by a function Q2C : Q → C, which represents a query q ∈ Q as an element of C.

4.3. Mapping social relationships

As explained in Section 3, relationships among people evolve from simple “acquaintance-links” to what we called “semantic-links”. To implement this behaviour three types of link have been introduced:

• Acquaintance-Link (AL)
• Temporary Semantic-Link (TSL)
• Full Semantic-Link (FSL)

TSLs represent relationships based on a partial knowledge of a peer. They are usually stronger than ALs and weaker than FSLs. In PROSA, if a given link is a simple AL, this means that the source peer does not know anything about the target peer. If the link is an FSL, the source peer knows the full knowledge of the target peer (i.e. it knows P_k(t)). Finally, if the link is a TSL, the peer does not know the full P_k(t) of the linked peer, but a Temporary Peer Knowledge (TP_k) is built based on the knowledge carried by the queries received in the past. Let ℒ = {AL, TSL, FSL} be the set of all link labels. The behaviour of links is modelled through a labelling function Label : L → ℒ × C. For a given link l = (a, b) ∈ L, Label(l) is a vector of two elements [e, w]: the former is the link label and the latter is an element of C, computed as follows:

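\[ w = \begin{cases} \emptyset & \text{if } e = AL, \\ TP_k^{i} & \text{if } e = TSL, \\ P_k(b) & \text{if } e = FSL, \end{cases} \qquad (2) \]

where b is the target peer of l = (a, b) and

\[ TP_k^{i} = \begin{cases} Q2C(q) & \text{if } i = 0 \text{ (} q \text{ is the first query } a \text{ received from } b \text{)}, \\ TP_k^{i-1} + Q2C(q) & \text{otherwise.} \end{cases} \qquad (3) \]

Given a vector [e, w] = Label(l), we use the common indexing method to get or set the vector elements; for example, Label(l)[0] returns the first element e and Label(l)[1] = w sets the second element of Label(l) to w.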
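As a small worked example of Eq. (3), assume (purely for illustration) that C is ℝ³ with the componentwise sum, that the three coordinates count the terms (golf, club, chess), and that Q2C maps a query to its vector of term counts. Suppose peer a receives from peer b first the query q₀ = “golf” and later the query q₁ = “golf club”. Under these assumptions the Temporary Peer Knowledge stored on the link (a, b) evolves as follows:

```latex
\begin{align*}
TP_k^{0} &= Q2C(q_0) = (1, 0, 0), \\
TP_k^{1} &= TP_k^{0} + Q2C(q_1) = (1, 0, 0) + (1, 1, 0) = (2, 1, 0), \\
Label\big((a, b)\big) &= [\,TSL,\ (2, 1, 0)\,].
\end{align*}
```

The weight of the TSL therefore accumulates the topics on which b has asked questions, which is what makes it usable as a hint about b's interests.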


4.4. Modelling PROSA

We define PROSA as the quadruple:

\[ PROSA = (P, L, P_r, Label) \qquad (4) \]

We define a neighbourhood relationship N : P → 2^P, such that, for a given peer a, N(a) is the set of peers a is connected to, i.e. there is a direct link (a, b) in L for each b ∈ N(a), or equivalently:

∀ b ∈ N(a) ∃ (a, b) ∈ L.

The out-degree D_o(a) of a peer a is defined as the number of neighbours it has: D_o(a) = |N(a)|. The in-degree D_i(a) is defined as the number of peers that have a as a neighbour: D_i(a) = |{b ∈ P : (b, a) ∈ L}|. The degree D(a) of a node a is defined as the sum of D_o(a) and D_i(a). Finally, we define the function boot : P × ℕ → P^N used by a peer to select N peers of PROSA during the joining phase. How peers are selected is not relevant in PROSA, since any well-known bootstrap technique for P2P architectures can be implemented (e.g. beacon servers, peer caching, etc.) [9].

4.5. PROSA management algorithm

The life of a peer in PROSA is divided into two phases: the Joining phase, when the peer first connects to the network, and the Resource Searching phase, when a connected peer issues queries in order to retrieve resources. These phases are explained below.

4.5.1. Joining

The joining of a new peer in PROSA happens according to the steps shown in Algorithm 1, which is obtained by observing how social relationships are established by a child. At the beginning of his/her life only a few acquaintance relationships are available, which are basically due to chance meetings. Similarly, when a peer joins PROSA, it does not know much about the other peers of the network, so it only establishes AL relationships with some bootstrap peers. To be more specific, in Algorithm 1 instruction #1 chooses N peers of PROSA and stores them in the set RP. Note that N is a parameter of the joining algorithm. Instruction #2 adds the new peer s to P, and instruction #3 links it with the N peers selected in instruction #1. Finally instruction #4 labels all links from s to the peers in RP as [AL, ∅] (see also Section 4.3).
Algorithm 1 JOIN: Peer s joining PROSA(P, L, P_r, Label)
Require: PROSA(P, L, P_r, Label), Peer s
1: RP ← boot(P, N)
2: P ← P ∪ {s}
3: L ← L ∪ {(s, t), ∀t ∈ RP}
4: ∀t ∈ RP ⇒ Label(s, t) ← [AL, ∅]

The complexity of the joining algorithm is negligible, since it is O(1) in the number of nodes and it is executed only occasionally.

4.5.2. Resource searching

In order to show PROSA's resource searching strategy, we need to define a model of query messages. Each query message is a quadruple:

\[ QM = (qid, q, s, nr) \qquad (5) \]

where qid is a unique query identifier to ensure that a peer does not answer a query more than once; q ∈ Q is the query; s ∈ P is the source peer, and nr is the number of required results, needed to stop query forwarding when enough documents have been found.
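The joining steps of Algorithm 1 and the query message of Eq. (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: peers are identified by strings, the link set L is represented implicitly by the keys of a `labels` dictionary, `None` stands for the empty weight ∅, and `boot` is a hypothetical bootstrap function that simply picks peers uniformly at random (the paper leaves the bootstrap technique open).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryMessage:
    """Query message QM = (qid, q, s, nr) of Eq. (5)."""
    qid: str   # unique query identifier
    q: str     # the query itself (assumed here to be a plain string)
    s: str     # source peer
    nr: int    # number of required results


def boot(peers, n, rng):
    """Hypothetical bootstrap: up to n existing peers chosen uniformly at random."""
    return rng.sample(sorted(peers), min(n, len(peers)))


def join(peers, labels, s, n, rng):
    """Algorithm 1: peer s joins and gets AL links to n bootstrap peers.

    peers:  set of peer identifiers (P)
    labels: dict mapping (source, target) to [label, weight]; its keys play the role of L
    """
    rp = boot(peers, n, rng)          # instruction 1
    peers.add(s)                      # instruction 2
    for t in rp:                      # instructions 3 and 4
        labels[(s, t)] = ["AL", None]
    return rp


network_peers = {"a", "b", "c"}
network_labels = {}
bootstrap = join(network_peers, network_labels, "new", 2, random.Random(0))
first_query = QueryMessage(qid="q-1", q="golf club", s="new", nr=5)
print(bootstrap, network_labels, first_query)
```

Selecting the bootstrap peers before adding s to the peer set mirrors the order of instructions #1 and #2, and guarantees that a peer never bootstraps from itself.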


A response message is the message peer t sends to the source of the query; it is represented by the tuple

\[ RM = (qid, r, t) \qquad (6) \]

where qid is the query identifier and r ∈ 2^R is the set of resources found.

Algorithm 2 implements ExecQuery, which models PROSA's dynamic behaviour. Function ExecQuery calls two functions: UpdateLink (described in Algorithm 3), which updates the link between the current peer and the peer where the query originated, and SelectNextPeer (described in Algorithm 4), used to select another peer to which to forward query q. When a user of PROSA asks for a resource on a peer s_q, the peer builds up a query message qm = (qid, q, s_q, nr) specifying the query q, the source peer s_q and the number nr of results the user wants to obtain: that is modelled by ExecQuery(PROSA, s_q, qm). As shown in Algorithm 2, initially cur is equal to s_q, and this avoids execution of instruction #3. When any other peer receives the query, UpdateLink (described below) updates the link between cur and s_q. The relevance of query q with respect to resources hosted by the current peer is evaluated (instruction #5) using the function R_{pq>Th} (see Eq. (1)). Two cases can hold:

• (Instructions #7–#11). None of the hosted resources has sufficient relevance (i.e. numRes = 0). Then the query is forwarded to another peer f, selected by SelectNextPeer (see Algorithm 4). Subsequent forwards are modelled by ExecQuery, where f becomes the current peer.
• (Instructions #13–#26). cur hosts resources which are relevant with respect to q (i.e. numRes > 0). A response message (rm) is sent to s_q, specifying the query identifier qid, the resources matching the query (Result) and the current node (instructions #13–#14), and an FSL with label [FSL, P_k(cur)] is established between s_q and cur (instructions #15–#16). Then, two subcases are possible:
  – cur has some relevant resources, but they are not enough to fulfil the request (i.e. numRes < nr). In this case (instructions #17–#26) the query message is forwarded to all peers in the neighbourhood whose relevance with respect to the query is higher than a given threshold (semantic flooding). The number of matched resources is subtracted from the number of total requested documents before each forward step.
  – cur is able to fulfil the request. In this case the query is not forwarded.

When s_q receives a response message it presents the results to the user. If the user chooses to download some resources, s_q directly contacts the peer owning each resource and, if the download request is accepted, the resource is eventually made available to s_q, and to the user.

In the following we show how UpdateLink works (Algorithm 3). If s_q is an unknown peer for cur or Label(cur, s_q)[0] == AL (instruction #1), a new TSL link from cur to s_q is added (instructions #2–#5) with a probability P_TSL according to a given distribution function (instruction #2), in order to reduce the size of the TSL list. As said above (see Eq. (2)), the weight of the TSL link is related to the Temporary Peer Knowledge (TP_k), which is based on the received query message. Note that a TP_k can be considered as a “good hint” for the current peer, in order to gain links to other remote peers. When a query is finally answered, s_q could download all resources that matched it. For this reason it is useful to record a link to s_q, which will probably contain the requested resources, just in case those resources are requested in the future by other peers. If cur already has a TSL link to s_q, the corresponding TP_k is updated (instructions #7–#9) using Eq. (3). If the link from cur to s_q is an FSL, no updates are performed.

Finally, the function SelectNextPeer (see Algorithm 4) selects, among the neighbours of cur, the peer (named fwPeer) to which to forward the query q using the following procedure:

Algorithm 2 ExecQuery: query q originating from s_q executed on cur
Require: PROSA(P, L, P_r, Label), cur ∈ P
Require: (qid, q, s_q, nr) ∈ QM
1: Result ← ∅
2: if cur ≠ s_q then
3:   UpdateLink(PROSA, cur, s_q, q)
4: end if
5: Result ← R_{pq>Th}(cur, q, Threshold1)
6: numRes ← |Result|
7: if numRes == 0 then
8:   f ← SelectNextPeer(PROSA, cur, q)
9:   if f ≠ null then
10:     ExecQuery(PROSA, f, (qid, q, s_q, nr))
11:   end if
12: else
13:   RM rm ← (qid, Result, cur)
14:   SendMessage(s_q, rm)
15:   L ← L ∪ {(s_q, cur)}
16:   Label(s_q, cur) ← [FSL, P_k(cur)]
17:   if numRes < nr then
18:     {– Semantic Flooding –}
19:     for all t ∈ N(cur) do
20:       rel ← R_pq(P_k(t), q)
21:       if rel > Threshold2 then
22:         qm ← (qid, q, s_q, nr − numRes)
23:         ExecQuery(PROSA, t, qm)
24:       end if
25:     end for
26:   end if
27: end if

Algorithm 3 UpdateLink: updates links between peer cur and query source peer s_q
Require: PROSA(P, L, P_r, Label)
Require: cur, s_q ∈ P, q ∈ Q
1: if (Label(cur, s_q)[0] == AL) ∨ ((cur, s_q) ∉ L) then
2:   if RandomNumberBetween(0, 1) < P_TSL then
3:     L ← L ∪ {(cur, s_q)}
4:     Label(cur, s_q) ← [TSL, Q2C(q)]
5:   end if
6: else
7:   if Label(cur, s_q)[0] == TSL then
8:     Label(cur, s_q)[1] += Q2C(q)
9:   end if
10: end if

• The relevance between query q and the neighbours of cur with TSL or FSL links is computed (instructions #1–#12).
• The peer connected by the link having the highest relevance is selected as the next peer (fwPeer).
• If the peer cur has only ALs, the next peer is selected at random (instructions #13–#21).

The complexity of the ExecQuery algorithm is O(D_o^max · C(R_pq)), where D_o^max is the maximum out-degree in the network, while C(R_pq) is the computational complexity of the relevance evaluation function R_pq. In general, C(R_pq) depends on the knowledge model used and on the size of the data structure which represents the knowledge itself. As for the amount of memory required on each node to store routing tables (i.e. lists of node neighbours with their link labels), it is O(D_o^max). In order to avoid an excessive number of links, the number of neighbours allowed to each node is limited by a parameter (thus implicitly limiting the size of the routing table). We tried different policies for neighbour pruning, whenever a node


Algorithm 4 SelectNextPeer(PROSA, cur, q)
Require: PROSA(P, L, Pr, Label), q ∈ Q, cur ∈ P
 1: maxRel ← 0
 2: fwPeer ← null
 3: for all t ∈ N(cur) do
 4:     lbl = Label(cur, t)[0]
 5:     if (lbl == TSL) ∨ (lbl == FSL) then
 6:         rel ← Rpq(Label(cur, t)[1], q)
 7:         if (rel > Threshold) ∧ (rel > maxRel) then
 8:             maxRel ← rel
 9:             fwPeer ← t
10:         end if
11:     end if
12: end for
13: if maxRel == 0 then
14:     ALPeers ← ∅
15:     for all t ∈ N(cur) do
16:         if Label(cur, t)[0] == AL then
17:             ALPeers ← ALPeers ∪ t
18:         end if
19:     end for
20:     fwPeer ← random(ALPeers)
21: end if
22: return fwPeer
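The selection rule of Algorithm 4 could be sketched in Python roughly as below. The `neighbours` dictionary, the `{term: weight}` vectors and the `relevance` helper are illustrative assumptions and not the simulator's actual data structures; the control flow follows the listing, including the random fallback over ALs whenever no semantic link exceeds the threshold.

```python
import random

AL, TSL, FSL = "AL", "TSL", "FSL"


def relevance(vec, query):
    """Sum of weight products over the terms shared by vec and query."""
    return sum(w * query[t] for t, w in vec.items() if t in query)


def select_next_peer(neighbours, query, threshold, rng=random):
    """Sketch of Algorithm 4.

    neighbours maps peer id -> (link_type, knowledge_vector);
    the vector of an AL entry is not used and may be None.
    Returns the id of the peer to forward to, or None.
    """
    max_rel, fw_peer = 0.0, None
    for peer, (kind, vec) in neighbours.items():
        if kind in (TSL, FSL):
            rel = relevance(vec, query)
            if rel > threshold and rel > max_rel:
                max_rel, fw_peer = rel, peer
    if max_rel == 0:
        # No semantic link was relevant enough: pick an AL at random.
        al_peers = [p for p, (kind, _) in neighbours.items() if kind == AL]
        fw_peer = rng.choice(al_peers) if al_peers else None
    return fw_peer


neighbours = {
    "p1": (TSL, {"graph": 0.9}),
    "p2": (FSL, {"graph": 0.2, "peer": 0.8}),
    "p3": (AL, None),
}
query = {"peer": 1.0}
chosen = select_next_peer(neighbours, query, threshold=0.1)
```

Here only `p2` shares a term with the query, so it is the peer selected; with no TSL or FSL neighbours the function falls back to a random AL.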


needed to add a new neighbour and no space was left in the routing table. Simulations confirmed our conjecture that a simple node replacement strategy, such as Least Recently Used, can be used successfully, since nodes in PROSA tend to use links pointing to similar nodes more often than other links.

5. Evaluation setting

All results reported in the following sections have been obtained using a functional event-driven PROSA simulator, written in Python [40]. The knowledge base used for simulations is composed of scientific articles in the fields of maths and philosophy. Articles about maths come from ‘‘Journal of American Mathematical Society’’, ‘‘Transactions of the American Mathematical Society’’ and ‘‘Proceedings of the American Mathematical Society’’ [15,41,32], making a total of 740 articles. On the other hand, articles in the field of philosophy come from ‘‘Journal of Social Philosophy’’, ‘‘Journal of Political Philosophy’’, ‘‘Philosophical Issues’’ and ‘‘Philosophical Perspectives’’ [14,13,30,31], making a total of 750 articles. Each node contains, on average, 20 ± 5 articles, 80% of them belonging to the same topic. Nodes perform 80% of queries in the same topic as the hosted resources and the remaining 20% in the other topic, according to the literature on query distribution in a Gnutella P2P system [19] and with real social communities in mind, where the majority of requests for resources are focused on a really small number of topics.

5.1. Vector space model

In PROSA, knowledge is modelled using the framework described in Section 4.1. As said above, this is an abstract model and, in order to simulate PROSA’s behaviour, it is necessary to provide a concrete implementation of the knowledge model. For this reason, in this section we present the Vector Space Model (VSM) [36] adopted, without loss of generality, for modelling documents hosted by each peer. The VSM represents a document using a state-vector of (stemmed) terms called the Document Vector (DV) (in this case the Resource Space R ≡ ℝ^n, where n is the size of the DV). Each term t in the vector is assigned a weight based on the relevance of the term itself inside the document d. This weight is computed using a modified version of the TF–IDF [35] schema, as follows:

\[ w_{t,d} = 1 + \log(f_t) \]

where f_t is the term frequency in the document. It has been proved [36] that this way of computing relevance is a good approximation of the TF–IDF ranking schema. The VSM representation of a document is necessary to compute the relevance of a document with respect to a certain query. We model a query by means of a so-called Query Vector (QV), that is the VSM representation of the query itself. In this case, the Query Space Q (as defined in Section 4.2) is ℝ^n, where n is the size of the QV. Since both documents and queries are represented by state-vectors, we define the relevance of a document d ∈ ℝ^n with respect to a given query q ∈ ℝ^n as follows:

\[ R_{rq}(d, q) = \sum_{t \in d \cap q} w_{t,d} \cdot w_{t,q} \tag{7} \]

where K ≡ R. Using the VSM we also obtain a compact description of peer knowledge called the Peer Knowledge Vector (Pk(s), s ∈ P). It is computed as follows:

• For each document hosted by the peer, the frequency of each term it contains is computed (F_{t,d}).

• The term frequencies for different documents are summed together, obtaining the overall frequency for each term: F_t = \sum_{d} F_{t,d}.

• Then a weight is computed for each term, using w_{t,s} = 1 + log(F_t).

• Finally all weights are put into a state-vector and the vector is normalized.

The obtained Pk(s), s ∈ P is a sort of ‘‘snapshot’’ of the peer knowledge, since it contains information about the relevant terms of the documents the peer shares. The relevance of the peer knowledge Pk(s) with respect to a given query q is computed as follows:

\[ R_{pq}(P_k(s), q) = \sum_{t \in P_k(s) \cap q} w_{t,P_k(s)} \cdot w_{t,q} \]
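The construction of the peer knowledge vector and the relevance sum above can be sketched in Python as follows. The sketch assumes documents are already tokenized and stemmed, uses the natural logarithm, and normalizes with the Euclidean norm; the text only says the vector ‘‘is normalized’’, so the last two choices are assumptions, as are all function names.

```python
import math
from collections import Counter


def weights(freqs):
    """Apply w = 1 + log(f) to every term with frequency f > 0."""
    return {t: 1.0 + math.log(f) for t, f in freqs.items() if f > 0}


def peer_knowledge(documents):
    """Build a normalized peer knowledge vector from tokenized documents."""
    total = Counter()
    for doc in documents:
        total.update(doc)  # sum term frequencies over all documents
    w = weights(total)
    norm = math.sqrt(sum(v * v for v in w.values()))  # assumed: Euclidean norm
    return {t: v / norm for t, v in w.items()} if norm else {}


def relevance(vec, query):
    """Sum of weight products over the terms shared by vec and query."""
    return sum(w * query[t] for t, w in vec.items() if t in query)


docs = [["graph", "peer", "graph"], ["peer", "overlay"]]
pk = peer_knowledge(docs)
qv = weights(Counter(["peer", "network"]))
score = relevance(pk, qv)
```

Only the term `peer` is shared between the query and the peer knowledge in this toy input, so `score` reduces to the normalized weight of that single term.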

This relevance is used by the PROSA query routing algorithm: a high relevance between a QV and a Pk(s) means that the given peer probably has documents that can match the query.

6. Topological properties

The main target of P2P resource organization algorithms is obtaining fast, reliable and efficient resource retrieval. It is easy to understand that achieving good results, in terms of speed and document relevance, requires the overlay network to have a convenient topological structure. Studies performed in the last three decades in the field of social networks show that many natural networks are ‘‘small worlds’’ [44,4,24,2]. The small-world property is very desirable in a P2P network, since resource retrieval in small worlds is really efficient. This is mainly due to the fact that small-world networks have a short Average Path Length (APL) and a high Clustering Coefficient (CC). The APL is defined as the average number of hops required to reach any other node in the network: if the APL is small, all nodes of the network can be easily reached in a few steps, starting from whichever other node. The CC can be defined in several ways, depending on the kind of ‘‘clustering’’ you are referring to. We used the definition given in [44], where the clustering coefficient of a node is defined as follows:

\[ CC_n = \frac{E_{n,real}}{E_{n,tot}} \tag{8} \]

where n’s neighbours are all the peers to which n is linked, E_{n,real} is the number of links among n’s neighbours and E_{n,tot} is the maximum number of possible edges between n’s neighbours. The clustering coefficient of a network G(V, E) having |V| nodes and |E| links is defined as

\[ CC = \frac{1}{|V|} \sum_{n \in V} CC_n \tag{9} \]

i.e. the average clustering coefficient over all nodes. The CC is an estimate of how strongly nodes are connected to each other and to their neighbourhood.
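A minimal Python sketch of the per-node and network clustering coefficients defined above is given here. It assumes a directed adjacency map and counts links among neighbours as directed edges, so that the maximum is k(k − 1) for a node with k neighbours; nodes with fewer than two neighbours are taken to contribute 0. Both conventions are assumptions of this sketch, since the definition in [44] was originally stated for undirected graphs.

```python
def clustering_coefficient(adj):
    """Average per-node clustering coefficient.

    adj maps each node to the set of nodes it links to (directed).
    Assumption: E_tot = k * (k - 1) for a node with k neighbours,
    and nodes with fewer than two neighbours contribute 0.
    """
    total = 0.0
    for n, nbrs in adj.items():
        nbrs = set(nbrs) - {n}
        k = len(nbrs)
        if k < 2:
            continue
        # Count directed links whose endpoints are both neighbours of n.
        e_real = sum(
            1 for u in nbrs for v in adj.get(u, ()) if v in nbrs and v != u
        )
        total += e_real / (k * (k - 1))
    return total / len(adj) if adj else 0.0


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
cc_triangle = clustering_coefficient(triangle)
cc_star = clustering_coefficient(star)
```

The fully connected triangle gives the maximum value, while in the star no two neighbours of the hub are linked to each other.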

Fig. 1. APL for PROSA and a random network.

6.1. Average path length

Formally, estimating the real APL for a directed graph requires measuring the length of all possible paths in the graph and then dividing their total length by the number of paths found. This is computationally hard, since the number of possible paths grows exponentially with the number of links among nodes. Since we are focusing on the topological properties of a PROSA network to show that it is a small world (i.e. that queries in PROSA are served in a small number of steps), we estimate the APL as the average length of the path traversed by a query. Note that this is an ‘‘effective’’ APL, since it is computed taking into account only the paths actually used. It is interesting to compare the APL of PROSA with the APL of a corresponding random graph. A random graph [28] is a graph G(V_rnd, E_rnd) where links among nodes are distributed according to a given linking probability p. Given a graph (network) G(V, E), the corresponding random graph is a graph G_rnd(V_rnd, E_rnd) which has the same number of nodes and the same number of links as G(V, E), and where the out-degree of nodes is a stochastic variable with an expected value of |E|/|V|. Note that the APL of a random graph can be calculated using Eq. (10), as reported in [26].

\[ APL_{rnd} = \frac{\log |V|}{\log(|E|/|V|)} \tag{10} \]
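As a quick worked example, the random-graph baseline can be evaluated with a few lines of Python. The sketch takes the expected out-degree |E|/|V| as the argument of the logarithm in the denominator, which is the standard form of this estimate for random graphs; the network size and link count used below are hypothetical values chosen only to exercise the formula, not figures from the simulations.

```python
import math


def apl_random(num_nodes, num_links):
    """Expected APL of a random graph: log|V| / log(mean out-degree)."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    mean_degree = num_links / num_nodes
    if mean_degree <= 1:
        raise ValueError("formula needs a mean out-degree above 1")
    return math.log(num_nodes) / math.log(mean_degree)


# Hypothetical network: 400 nodes with a mean out-degree of 20.
example = apl_random(400, 8000)
```

Since 400 = 20², the example evaluates to an APL of 2 hops, which is the order of magnitude reported for the random graphs in this section.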

Fig. 1 shows the APL for PROSA and the corresponding random graph for different numbers of nodes in the case of 15 performed queries per node. The APL for PROSA is about 3.0, for all network sizes, while the APL for the corresponding random graph is between 1.75 and 2.0: the average distance among peers in PROSA seems to depend logarithmically on the size of the network, as observed in real small-world networks. It is worth noting that the APL of a small-world network is usually comparable with that of a random graph of the same size, and the same is valid also for PROSA. This characteristic has been extensively explained by Watts and Strogatz [44]: it is mainly due to the fact that nodes in a small-world network usually have a large number of links to ‘‘nearer’’ peers and a certain number of long-distance links to far away nodes. The former guarantee high connectivity among similar peers; the latter are shortcuts to remote regions of the network, actually lowering the APL among nodes. It is also interesting to analyse how the APL changes when the total number of queries performed increases. Results are reported in Fig. 2, where the APL is calculated for windows of 300 queries, with an overlap of 50 queries. Note that the APL for PROSA decreases with the number of queries performed. This behaviour

depends heavily on the fact that new links among nodes arise whenever a new query is performed (TSLs) and fulfilled (FSLs). The higher the number of queries performed, the higher the probability that a distant link between two nodes exists. On the other hand, the dependence of the APL on the number of queries has a bad effect on the clustering coefficient, as extensively discussed in Section 6.2. It is also important to highlight that the APL does not depend on the network size but on the average number of queries performed by each node. In other words, PROSA networks have a ‘‘structural’’ APL that presumably depends only on the algorithm used to search resources and to establish links among peers. However, in order to show that PROSA evolves to a small-world network, it is necessary also to analyse the network clustering coefficient.

Fig. 2. Running averages of the APL for PROSA with different network sizes. (Plot: average path length versus number of performed queries, 0 to 20000; series: 200, 400 and 600 nodes.)

6.2. Clustering coefficient

In this section we report some results about the CC in PROSA, compared with those observed in a corresponding random graph, as defined in Section 6.1. While the APL of a small-world network is quite similar (a little higher, indeed) to that of the corresponding random graph, the CC of small worlds is usually many times higher than that of a random graph. This means that in a small-world network groups of strongly connected peers can exist (this is true and very important in real social networks), while the APL remains really short. This behaviour is mainly due to ‘‘long-distance’’ links, which naturally arise in small-world networks. The TSLs are the algorithmic representation of such weak links, and in this section we will show their importance in order to guarantee a high CC in PROSA. In Fig. 3 the CC of PROSA for different numbers of queries performed is reported for a network of 200 nodes. Note that the

CC of the network increases when more queries are performed. This means that nodes in PROSA usually forward queries to a small number of other peers, so that their aggregation level naturally gets stronger when more queries are performed.

Fig. 3. Clustering coefficient for PROSA. (Plot: clustering coefficient versus number of queries, 0 to 30000.)

Fig. 4. Clustering coefficient for PROSA and the corresponding random graph. (Plot: clustering coefficient versus network size, 200 to 1000 nodes; series: PROSA and RND.)

Fig. 5. Degree distribution for PROSA. (Plot: number of nodes versus degree, 0 to 300; series: Outdegree and Indegree.)

It is interesting to compare PROSA’s clustering coefficient with that of a corresponding random graph. The clustering coefficient of a random graph with |V| nodes and |E| links can be computed using Eq. (11) [2].

\[ CC_{rnd} = \frac{|E|}{|V| \cdot (|V| - 1)} \tag{11} \]
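Eq. (11) is simply the fraction of all possible directed links that are actually present, i.e. the linking probability of the corresponding random graph. A small Python sketch makes this explicit; the node and link counts are hypothetical and serve only as an arithmetic check, and the `cc_ratio` helper is an illustrative addition for comparing an observed CC against the baseline.

```python
def cc_random(num_nodes, num_links):
    """Clustering coefficient of a random directed graph, as in Eq. (11)."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    return num_links / (num_nodes * (num_nodes - 1))


def cc_ratio(cc_observed, num_nodes, num_links):
    """How many times an observed CC exceeds the random-graph baseline."""
    return cc_observed / cc_random(num_nodes, num_links)


# Hypothetical network: 200 nodes and 3980 directed links.
baseline = cc_random(200, 3980)
ratio = cc_ratio(0.4, 200, 3980)
```

With these illustrative numbers one link in ten is present, so an observed CC of 0.4 would be four times the random baseline.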

Fig. 4 shows the CC for PROSA and a corresponding random graph for different network sizes, in the case of 15 queries performed per node. The CC for PROSA is from 2.5 to 6 times that of a corresponding random graph, in accordance with the CC observed in real small-world networks. This result is simple to explain, since nodes in PROSA are mainly linked to similar peers, i.e. to peers that share the same kind of resources, while being linked to other peers at random. Due to the linking strategy used in PROSA, it is highly probable that neighbours of a peer are also linked together, and this increases the clustering coefficient.

6.3. Degree distribution

Sometimes small-world networks are also scale-free networks, with respect to the incoming degree Di(p) of each node. A scale-free network has the peculiar property that the distribution of Di(p) (i.e. the number of incoming links of a node) follows a power law. It has not yet been proved whether there is any relation between scale-free networks and small worlds: many social networks are small-world ones and some of them are also scale-free, but as far as we know there is no scientific evidence of a strong relation between these two properties.

In the last few years all studies of small worlds [25] (such as those found in co-authorship networks, the Internet, the WWW, and so on [2]) have confirmed that small-world networks are divided into two different ‘‘flavours’’. The first flavour was modelled in the work of Watts and Strogatz [44]. These small worlds are artificially obtained from a regular ordered mesh of nodes by adding a certain (small) number of random links among distant nodes. These are called ‘‘democratic’’ small worlds, and are observed in real biological networks [4]. The remarkable feature of such networks is that the degree distribution is almost the same as that of a pure mesh, while the APL and CC are those of a small world. The other ‘‘flavour’’ of small-world networks was empirically discovered by Albert and Barabasi [2] while studying the structure of the Internet and WWW networks. These networks are characterized by a power-law degree distribution. A power-law degree distribution indicates that a few nodes are connected to a large number of nodes, while the majority of nodes are connected to a small number of neighbours. The few highly connected nodes are called ‘‘hubs’’, and the majority of queries issued usually pass through one of these nodes during the routing phase. These networks are called ‘‘aristocratic’’ networks.

As shown in Fig. 5, both Di(p) and Do(p) have a kind of χ² distribution. This means that there are no ‘‘hubs’’ in PROSA and that almost all nodes have the same degree, so the removal of a single group of nodes could not heavily affect the network performance. From this point of view PROSA seems to be a democratic small-world network. The last considerations about degree distribution suggest that PROSA is similar to the model of small-world network proposed by Watts and Strogatz, i.e. groups of strongly connected components, with approximately the same degree and a small number of long-distance links to other communities.

However, we think that simply looking at the number of links among peers does not really give an idea of how important a link is. In Fig. 6, the distribution of link usage (i.e. the number of links that are used a certain number of times) is reported. It is clear that this distribution follows a power law: many links are used less than once, while just a few links are used many times. We call the number of times a given link is used to forward a query the ‘‘link weight’’. The exponent of the link usage distribution seems to be around 1.8, as shown in Fig. 7, where we fit the link usage distribution on a log–log scale by means of logarithmic binning [1]. The obtained exponent for the power-law distribution is quite similar to exponents usually observed in real scale-free networks, which are in the range [1.7:3] [4]. Since the link usage distribution is a power-law one while the degree distribution is a χ² one, it is interesting to look at the distribution of outgoing flow for each node, expressed as the sum of

the weights of its outgoing links. In fact we suppose that in PROSA the number of links to other nodes is not the most important parameter to take into account. The choice of ‘‘the most useful link’’ also has a great impact on how quickly and efficiently each query is answered. Given the distribution of link usage, it is neither stated nor proved that having a large number of links implies a higher probability of fulfilling a request. As shown in Fig. 8, the distribution of output flow is a χ² one, and this means that nodes in PROSA usually forward queries to the most convenient links, and the probability of finding such a link is not related to the number of links to other nodes, since most of the links are never or rarely used (see Fig. 6).

Fig. 6. Link usage distribution for PROSA. (Plot: number of links versus usage frequency.)

Fig. 7. Logarithmic binning and interpolation of link usage distribution. (Log–log plot: number of links versus usage frequency; series: link usage distribution and interpolation.)

Fig. 8. Output flow distribution. (Plot: number of nodes versus node flow, 0 to 1000.)

6.4. Topological properties: Concluding remarks

Fig. 9 shows a comparison of the structural properties (APL and CC) of PROSA with those of two already discussed (see Section 2) semantic P2P overlay networks, GES [50] and SETS [3]. Moreover, in Fig. 9 we have included the average path length APLrnd and the clustering coefficient CCrnd of the corresponding random graph. The GES network of 400 nodes reveals an APL and a CC that are very similar to those of the corresponding random graph. In other words, from the structural properties point of view GES seems to behave like a random graph. SETS displays a behaviour very similar to a small world, having a small APL and a CC about five times greater than that of the corresponding random graph. Note that this kind of behaviour is induced by the network management algorithm, which is based on the presence of a super-peer responsible for clustering nodes in topic segments; this can partially justify the value of CC, which cannot be considered as an emerging property of the network. Looking at results on topological properties, it becomes clear that PROSA evolves to a small world, as happens for many real networks. Note that the clustering coefficient for a PROSA network is ten times higher than that of a corresponding random graph, while the average path length is just a bit longer than in a random graph, in accordance with results observed in other real graphs that are small worlds. The most interesting feature of PROSA is that those values for topological parameters have been obtained using the same dynamics involved in social structures, avoiding the use of a complex analytical method to generate a network with those desirable properties. PROSA seems to catch at least a bit of the natural way of building complex and efficient networks. The latter considerations about flow distribution suggest that PROSA is not a pure democratic small world: even if the distribution of links is a χ² one, not all links are useful or convenient and the flow distribution resembles that of an aristocratic network.
This aspect reveals how complex a small world could be and how difficult it is to find a common criterion to assert which kind of small world a network is like.

Fig. 9. APL and CC of PROSA compared to those of several real networks.

7. Information retrieval in PROSA

In the above sections we showed the topological properties emerging in PROSA, finding that the network naturally evolves to a small-world network, with a really high clustering coefficient, a relatively small average path length between peers and a scale-free distribution of link usage. In this section we show that resource searching in PROSA is massively improved by its topological structure. The fact that all peers in PROSA are connected by a small number of hops does not guarantee anything about searching efficiency. In this section we show that searching resources in PROSA is really fast and successful, mainly because peers that share resources in the same topic usually end up strongly connected with similar peers. Unless otherwise specified, the simulation results reported in the following subsections were obtained with networks of different sizes, where each node performs 15 queries on average, requiring a maximum number of documents nr = 10.

7.1. Number of retrieved documents

One of the most relevant quality measures of a resource searching algorithm is the number of documents retrieved by each query. In this paragraph we examine results obtained with PROSA,

using the query mechanism described in previous sections. We also compare PROSA to other searching strategies, such as random walk and flooding. In the case of a random walk, each query is forwarded to one neighbour, chosen at random. On the other hand, flooding consists of forwarding a query to all neighbours. It is clear that flooding is able to retrieve almost all matching documents, since almost the whole network is reached by an issued query, the only exception being isolated components. Fig. 10 shows a comparison of the average number of documents retrieved in a PROSA network for different numbers of nodes. The best performance is obviously obtained by flooding, since the average number of documents retrieved per query is about 10, which is the maximum number of documents required by each query (nr). If more than nr documents are available, they are counted and reported in the graph as nr, since a query is no longer forwarded by PROSA if a sufficient number of documents has been retrieved, as explained in Section 4, while it is still possible to obtain a larger number of results using bare flooding. PROSA is able to retrieve about 4 documents per query, on average, and this result is still better than that obtained by a random walk, which usually retrieves only 2.8 documents per query. This suggests that the query routing algorithm, based on local link ranking, is really efficient and usually lets queries ‘‘flow’’ in the direction of nodes that can probably answer them. We note that PROSA is also able to retrieve a relatively high number of documents if compared with a simple flooding. This is a good result, since flooding is known as being the optimal searching strategy with respect to resource reachability: queries are actually forwarded to all nodes, so all existing and matching documents are retrieved, even if the number of required documents has already been obtained.

Fig. 10. Average number of documents retrieved per query. (Plot: average number of retrieved documents versus network size, 200 to 800 nodes; series: PROSA, Flooding and Random.)

In Fig. 11, the average number of documents retrieved per successful query is reported.
The best performance is once again obtained by flooding, while PROSA retrieves an average of 4.2 documents for each successful query out of the 10 documents required. Random walk has, once again, the worst performance. Looking only at the number of documents retrieved could be misleading: it is not important to have a small number of queries answered with a high number of documents. It is desirable to have almost all feasible queries¹ answered by a sufficient number of documents. Fig. 12 shows the percentage of answered queries for PROSA, flooding and random walk, on the same PROSA network with different network sizes. Note that in every case the average amount of unfeasible queries is around 6%.

¹ A query is feasible if there exist matching documents to answer it. Otherwise it is considered unfeasible.

Fig. 11. Average number of documents retrieved per successful query. (Plot: average number of retrieved documents versus network size, 200 to 800 nodes; series: PROSA, Flooding and Random.)

Fig. 12. Percentage of answered queries. (Plot: percentage of answered queries versus network size, 200 to 800 nodes; series: PROSA, Flooding and Random.)

The highest percentage of answered queries is obtained by flooding the network, since about 94% of queries have an answer. This means that practically all the queries are answered, if we exclude those that have no matching documents. A valuable result is also obtained by PROSA: 84% to 92% of all queries are answered, while random walk usually returns results for less than 80% of issued queries². The percentage of answered queries increases with network size, for all searching strategies, because all nodes have an average number of 20 documents: more nodes means more documents, i.e. a higher probability of finding matching documents.

² If a query eventually enters an unconnected component, it cannot be further forwarded.

7.2. Query recall

Although it is an important parameter for a resource searching and retrieval strategy, the raw number of documents retrieved is not the best measure of retrieval effectiveness. Since not all queries match the same number of documents, it is better to measure the percentage of documents retrieved over all matching documents. A valuable measure is the so-called ‘‘recall’’, i.e. the percentage of distinct documents retrieved over the total number of distinct documents existing that match a query. In Fig. 13, we show the recall distribution for PROSA, flooding and random walk when each node performs 15 queries on average. The best performance is obtained, once again, by flooding the network: about 60% of queries have a recall of 100%, and about 80% of queries have a recall of at least 50%. Searching by flooding could

not return all documents because PROSA is a directed graph, and unconnected components could still exist. PROSA also has high recall: about 20% of queries obtain all matching documents, while 45% of queries are answered with at least one half of the total number of matching documents. Random walk is the worst case: about 80% of queries have a recall of less than 50% and only 8% of queries obtain all matching documents. A comparison with recall values obtained by other algorithms based on the VSM, such as SETS [3] or GES [50], is possible, extrapolating recall values from graphs reported in those papers. The problem is that measures of recall, both in GES and in SETS, are reported as a function of the processing cost³ and not as a cumulative distribution over the total number of queries answered. Nevertheless, a recall higher than 50% is obtained only with a processing cost higher than 25% in both SETS and GES: this means that a quarter of the network should be probed to achieve a significant recall. On the other hand, in this section we showed that PROSA has a recall of 50% for more than 75% of queries; considering that the average number of visited nodes for a PROSA query is about 3.5 in a network of 400 nodes (3.5/400 ≈ 0.9% of the network), we obtain an equivalent 50% of recall with a processing cost less than 1%, and this is a really interesting improvement. The better performances in recall achieved by PROSA are mainly due to the emerging topological structure and to the similarity-based routing strategy used.

³ Processing cost is defined as the average percentage of nodes traversed by a query during routing.

Fig. 13. Query recall distribution. (Plot: cumulative normalised distribution versus query recall, 0 to 1.)

Fig. 14. Recall distribution for rare queries. (Plot: cumulative normalised distribution versus query recall, 0 to 1.)

7.3. Rare and common documents

Recall measured as the simple percentage of documents retrieved over the total number of matching documents does not take into account the fact that in PROSA queries are requested to retrieve nr documents and no more. This could practically influence the recall measure for PROSA networks, since queries are no longer forwarded if a sufficient number of documents have been retrieved. On the other hand, it is important to analyse the recall in the case of ‘‘rare’’ queries: we consider a query as being ‘‘rare’’ when the total number of matching documents is lower than the number of documents requested; similarly, a query is considered ‘‘common’’ if it matches more than nr documents. Fig. 14 shows the cumulative normalized distribution of recall for rare queries, while Fig. 15 reports the cumulative distribution for common queries. The results reported in Fig. 14 are really interesting: PROSA answers 35% of rare queries by retrieving all matching documents,

while 75% of queries retrieve at least 50% of the total number of matching documents; less than 10% of queries obtain less than 30% of matching documents. The performance of a random walk is worse than that obtained by PROSA: only 20% of queries obtain all matching documents, while more than 30% of them obtain less than 30% of matching results. Flooding obtains the best recall, even if 10% of queries still have a recall below 95%: this is due to the fact that the network considered has directed links, and some nodes can be unreachable, depending on which node flooding starts from. The situation is slightly different for common queries. As reported in Fig. 15, PROSA is able to retrieve at least 10 documents for 20% of queries issued and, in every case, at least one document is found for 99% of queries, and at least 3 documents for 85% of queries. We think that this behaviour is also affected by the chosen value of nr.

Fig. 15. Recall distribution for common queries. (Plot: cumulative normalised distribution versus query recall, 0 to 1; series: PROSA, Flooding and Random.)

7.4. Query deepness

In order to better understand the benefits of using PROSA, it is interesting to look also at other measures that could clarify some of PROSA’s characteristics. For instance, recall results are of poor relevance without a measure of how quickly answers are obtained. A feasible measure of speed could be the average query deepness, defined as the average number of ‘‘levels’’ a query is forwarded away from the source node. In Fig. 16 we show the average deepness of successful queries for PROSA, flooding and random walk on the same PROSA network for different numbers of peers. The query deepness for PROSA is around 3, and it is not heavily affected by the network size, while that of flooding and random walk is much higher (from 30 to 60 and from 120 to

V. Carchiolo et al. / J. Parallel Distrib. Comput. 70 (2010) 282–295

293

6

100

10

1

NLR 5% NLR 10% NLR 20% NLR 30% NLR = 0

5 Average Query Deepness

Query deepness

PROSA Flooding Random

4 3 2 1

200

300

400 500 600 Network Size (nodes)

700

800

Fig. 16. Average query deepness.

0 8000

10000

12000 14000 Number of Queries

16000

Fig. 17. Query deepness (NLR ≥ 0).

600, respectively). The better results in resource retrieval obtained by PROSA if compared with flooding and random walk cannot simply be explained by the network clustering coefficient, since all simulation are performed on the same network. We think that it is mainly due to the searching algorithm implemented by PROSA itself: it is able to exploit the emerging small-world structure and to find a convenient and efficient route to forward queries along, avoiding a large number of forwards to non-relevant nodes. 8. Robustness The robustness of a network can be studied looking both to the modifications in the topological structure of the network induced by failures and to their impact on network performance. However, when we consider communication networks, such as P2P overlays, it is important to analyse not only the structural changes due to node (or link) removal, but principally how the routing algorithm reacts to these changes and how performances (i.e. network recall, query deepness, etc.) are affected by them. PROSA exhibits a good structural resilience against node removals, as shown in [7]. Here we report some interesting results about its functional robustness. An important characteristic of P2P networks is the fact that the network topology changes in a continuous manner: nodes are not connected to the network all the time, but rather they join the network, then use it for a certain amount of time in order to acquire needed resources, and then they disconnect from the network. This means that there is no such thing as a fixed ‘‘network topology’’ in the case of P2P overlays, since a given network configuration lasts for a really brief amount of time, because of joins and leaves. For this reason it is important to estimate the impact of joining and leaving on network performance, i.e. how the ability of PROSA to retrieve resources in a fast and reliable way is affected by a continuously changing topology. 
In classical papers about network robustness, node departures or temporary link failures are considered “network errors”, while for P2P overlays they are commonly referred to as the “peer churning problem”. Even if it is easy to imagine an IP router failure as a network error, it is somewhat harder to define errors for a P2P network, because nodes usually join and leave the network with a probability much higher than that of a router crash. For example, the measured leaving rate in a Gnutella network [38] is about 10% of nodes per hour, i.e. one tenth of peers leave the network after less than an hour. This is absolutely not the case for Internet routers: for those “nodes” the average time interval between faults can be measured in months, or even years. So what is an “error” for classical studies on networks becomes a normal routine in a P2P system. For this reason, the analysis of robustness to errors can be reduced to the analysis of the network during normal operation, where a certain number of nodes leave the network with a given probability while other nodes join the network at a certain other rate.

Since rates should be defined as a function of time, and our simulations of PROSA are discrete-time event-driven ones, we defined the Normalized Leave Ratio (NLR) as the percentage of nodes leaving the network over the average number of issued queries per node. Similarly, the Normalized Joining Ratio (NJR) is defined as the percentage of nodes which join the network when one query per node has been performed. In this way, all measures are normalized to the same time unit, namely the “query-time”, i.e. the average time needed for the network to process a query. Note that an NLR (NJR) of 1.0 means that, on average, 100% of nodes leave (join) the network after each node has performed one query. Similarly, an NLR (NJR) of 0.1 indicates that one tenth of nodes leave (join) the network when each node has performed one query on average. To give an idea of typical NLR values observed in real P2P systems, consider that measures on Gnutella [38] reveal that a number of queries equal to the number of nodes is performed in less than 17 min; this, combined with the typical leave frequency of Gnutella nodes (about 10% of peers/h), leads to an average NLR of 2.8%.

In order to “stress” PROSA a bit, we simulated four different scenarios, where a network of 500 nodes experiences different NLRs and NJRs. In the initial phase (the first 10,000 queries) NJR and NLR are set to zero, and the number of connections among nodes grows; then NLR and NJR are set to a given value in the range [0.0; 0.3]. In all simulations, we imposed NJR = NLR. As shown in Fig. 17, the query deepness (QD) is not heavily affected by an NLR below 20%.
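The Gnutella-based NLR estimate quoted above can be reproduced by scaling the hourly leave fraction to one query-time. The Python sketch below is illustrative only: the function names are hypothetical and not part of PROSA, and the 17 min figure is used as the upper bound given in the text.

```python
def normalized_leave_ratio(leave_fraction_per_hour, minutes_per_query_round):
    """Fraction of nodes leaving during one query-time.

    One query-time is taken here as the time in which, on average,
    every node issues one query (an assumption based on the text).
    """
    return leave_fraction_per_hour * minutes_per_query_round / 60.0


def nodes_leaving_per_round(network_size, nlr):
    """Expected number of nodes leaving while each node issues one query."""
    return network_size * nlr


# Gnutella figures quoted in the text: about 10% of peers leave per hour,
# and one query per node is issued in at most 17 minutes.
gnutella_nlr = normalized_leave_ratio(0.10, 17)
print(f"Gnutella NLR: {gnutella_nlr:.1%}")

# Simulated setting in the text: 500 nodes, with the NLR values of Fig. 17.
for nlr in (0.05, 0.10, 0.20, 0.30):
    leaving = nodes_leaving_per_round(500, nlr)
    print(f"NLR {nlr:.0%}: about {leaving:.0f} nodes leave per query round")
```

Under these assumptions the simulated churn rates are well above the Gnutella estimate, which is consistent with the stated intention to stress the network.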
The average deepness is in the range [2; 3] when NLR is less than or equal to 10%, and this also guarantees that queries do not flood the network if a significant number of nodes leave the network continuously. On the other hand, when NLR becomes greater than 20%, the average query deepness decreases below 2.0. This effect could be misleading, since a massive node removal should cause a higher QD; however, when a large number of nodes leave the network, many nodes become isolated and queries cannot be forwarded very far. Measures of QD become even more interesting when compared with Fig. 18, which reports the average recall for different values of NLR. Note that when nodes start leaving the network (after 10,000 queries), the recall value degrades slightly for NLR ≤ 10%, but as the NLR increases further, the recall value decreases dramatically. This can be explained bearing in mind the way links in PROSA are created and updated: the most important links among nodes (FSLs) are established only when matching resources are found, and if nodes leave the network at high rates, a high number of links disappear before new links can be created or new paths can be discovered and used. This leads to the emergence of unconnected components, where nodes are unable to answer a significant number of queries.

These considerations become really interesting if we look at the actual leave rate measured in Gnutella: with an NLR of just 2.8% PROSA is able to react effectively to leaves, answering a large number of queries with a really high average recall, and visiting just a few nodes, at most a couple of hops away from the query source. Peers have enough “time” to rewire their connections, adapting their neighbourhood when any other peer suddenly leaves the network. Queries are forwarded through sub-optimal paths as long as new semantic links are not available, leading to a really robust self-adaptive structure. We can conclude that socially-inspired P2P systems such as PROSA are more reliable and efficient than other unstructured decentralized P2P technologies proposed so far. More details about PROSA’s robustness, both from a topological and from a functional point of view, can be found in [7].

Fig. 18. Recall (NLR ≥ 0) (x-axis: number of queries; y-axis: average recall; curves: NLR = 0, 5%, 10%, 20%, 30%).

9. Conclusions

Nature has inspired many improvements to human artefacts, actions, and behaviour. Current peer-to-peer networks suffer from many drawbacks, such as network overhead, inefficient or suboptimal routing protocols, and lack of semantics. In this paper we have described PROSA, a peer-to-peer architecture inspired by social networks, which naturally evolves to a small world by linking “similar” peers together. The proposed model is quite independent from the knowledge model used to represent resources, and the given formalization allows us to use different models by just redefining the functions involved in query routing and relevance computing.
Network management and organization are fully distributed and decentralized, leading to a robust and effective structure. Links among peers are established as a consequence of query forwarding and answering, using no more messages than those needed to route queries throughout the network and to collect answers. The resulting network has a really small average path length, a significant clustering coefficient and an “almost-democratic” distribution of links, while link usage has a typical power-law shape. Resources can be efficiently searched and retrieved, using natural language and involving a small number of nodes while obtaining high recall, especially for rare and uncommon documents. The overall quality of algorithms and routing techniques resembles a real social network, where messages flow in a fast and efficient way through convenient paths, chosen step-by-step by each peer involved using only local information.

References

[1] Lada A. Adamic, Zipf, power-laws, and Pareto - a ranking tutorial.
[2] Reka Albert, Albert-Laszlo Barabasi, Statistical mechanics of complex networks, Reviews of Modern Physics 74 (2002) 47.
[3] Mayank Bawa, Gurmeet Singh Manku, Prabhakar Raghavan, SETS: Search enhanced by topic segmentation, in: SIGIR ’03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, USA, 2003, pp. 306–313.
[4] Mark Buchanan, Nexus: Small Worlds and the Groundbreaking Theory of Networks, 2003.
[5] V. Carchiolo, M. Malgeri, G. Mangioni, V. Nicosia, Social behaviours applied to p2p systems: An efficient algorithm for resources organisation, in: 2nd International Workshop on Collaborative P2P Information Systems, COPS 2006, Manchester, 2006.
[6] V. Carchiolo, M. Malgeri, G. Mangioni, V. Nicosia, Efficient searching and retrieval of documents in PROSA, in: Fourth International Workshop on Databases, Information Systems and P2P Computing, DBISP2P06 – AAMAS 2006, 2006.
[7] V. Carchiolo, M. Malgeri, G. Mangioni, V. Nicosia, On robustness and self-adaptiveness of a socially inspired p2p network, in: Computer and Information Sciences, ISCIS 2007, 22nd International Symposium on, 2007, pp. 1–6.
[8] Arturo Crespo, Hector Garcia-Molina, Routing indices for peer-to-peer systems, in: ICDCS ’02: Proceedings of the 22nd International Conference on Distributed Computing Systems, IEEE Computer Society, Washington, DC, USA, 2002, p. 23.
[9] Chris GauthierDickey, Christian Grothoff, Bootstrapping of peer-to-peer networks, Turku, Finland, 1/8/2008 2008, IEEE.
[10] Gnutella website, World Wide Web. http://www.gnutella.com.
[11] Peter Haase, Jeen Broekstra, Marc Ehrig, Maarten Menken, Peter Mika, Michal Plechawski, Pawel Pyszlak, Björn Schnizler, Ronny Siebes, Steffen Staab, Christoph Tempich, Bibster - A semantics-based bibliographic peer-to-peer system, in: Proceedings of the International Semantic Web Conference, ISWC2004, November 2004.
[12] I. Jawhar, J. Wu, A two-level random walk search protocol for peer-to-peer networks, in: Proc. of the 8th World Multi-Conference on Systemics, Cybernetics and Informatics, 2004.
[13] Journal of Political Philosophy (1998–2006).
[14] Journal of Social Philosophy (1998–2006).
[15] Journal of the American Mathematical Society (1998–2006).
[16] Vana Kalogeraki, Dimitrios Gunopulos, D. Zeinalipour-Yazti, A local search mechanism for peer-to-peer networks, in: CIKM ’02: Proceedings of the Eleventh International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2002, pp. 300–307.
[17] Xiuqi Li, Jie Wu, Searching techniques in peer-to-peer networks, in: Handbook of Theoretical and Algorithmic Aspects of Ad Hoc, Sensor, and Peer-to-Peer Networks, CRC Press, 2006, pp. 613–642.
[18] Alexander Löser, Steffen Staab, Christoph Tempich, Semantic social overlay networks, IEEE JSAC - Journal on Selected Areas in Communication 25 (1) (2007).
[19] B.T. Loo, R. Huebsch, I. Stoica, J.M. Hellerstein, The case for a hybrid p2p search infrastructure, in: Proceedings of the 3rd International Workshop on Peer-to-Peer Systems, IPTPS, February 2004.
[20] David Lusseau, M.E.J. Newman, Identifying the role that individual animals play in their social network, Proceedings of the Royal Society of London, Series B 271 (2004) S477.
[21] Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker, Search and replication in unstructured peer-to-peer networks, in: ICS ’02: Proceedings of the 16th International Conference on Supercomputing, ACM, New York, NY, USA, 2002, pp. 84–95.
[22] S. Milgram, The small world problem, Psychol Today 2 (1967) 60–67.
[23] M.E.J. Newman, Models of the small world: A review, Journal of Statistical Physics 101 (2000) 819.
[24] M.E.J. Newman, The structure of scientific collaboration networks, Proceedings of the National Academy of Sciences of the United States of America 98 (2001) 404.
[25] M.E.J. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167.
[26] M.E.J. Newman, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality, Physical Review E 64 (2001) 016132.
[27] M.E.J. Newman, Juyong Park, Why social networks are different from other types of networks, Physical Review E 68 (2003) 036122.
[28] M.E.J. Newman, S.H. Strogatz, D.J. Watts, Random graphs with arbitrary degree distributions and their applications, Physical Review E 64 (2001) 026118.
[29] Juyong Park, Oscar Celma, Markus Koppenberger, Pedro Cano, Javier M. Buldu, The social network of contemporary popular musicians, 2006.
[30] Philosophical Issues (1998–2006).
[31] Philosophical Perspectives (1998–2006).
[32] Proceedings of the American Mathematical Society (1998–2006).
[33] S.C. Rhea, J. Kubiatowicz, Probabilistic location and routing, in: INFOCOM 2002, Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings, vol. 3, IEEE, 2002, pp. 1248–1257.
[34] Habib Rostami, Jafar Habibi, Emad Livani, Semantic routing of search queries in p2p networks, Journal of Parallel and Distributed Computing 68 (12) (2008) 1590–1602.
[35] Gerard Salton, Chris Buckley, Term weighting approaches in automatic text retrieval, Technical Report, Ithaca, NY, USA, 1987.
[36] H. Schutze, C. Silverstein, A comparison of projections for efficient document clustering, in: Proceedings of ACM SIGIR, Philadelphia, PA, July 1997, pp. 74–81.
[37] Parongama Sen, Complexities of social networks: A physicist’s perspective, 2006.
[38] Kunwadee Sripanidkulchai, The popularity of Gnutella queries and its implications on scalability, World Wide Web. http://www.cs.cmu.edu/~kunwadee/research/p2p/gnutella.html.
[39] R. Steinmetz, K. Wehrle (Eds.), Peer-to-peer systems and applications, in: Lecture Notes in Computer Science, vol. 3485, Springer Verlag, 2005.
[40] The Python programming language, http://www.python.org.
[41] Transactions of the American Mathematical Society (1998–2006).
[42] D. Tsoumakos, N. Roussopoulos, Adaptive probabilistic search in peer-to-peer networks, in: Proc. of the 2nd International Workshop on Peer-to-Peer Systems, IPTPS 03, 2003.
[43] Fang Wang, Yamir Moreno, Yaoru Sun, Structure of peer-to-peer social networks, Physical Review E 73 (2006) 036123.
[44] Duncan J. Watts, Steven H. Strogatz, Collective dynamics of small-world networks, Nature 393 (1998) 440–442.
[45] R.J. Williams, E.L. Berlow, J.A. Dunne, A.-L. Barabasi, N.D. Martinez, Two degrees of separation in complex food webs, Proceedings of the National Academy of Sciences 99 (2002) 12913–12916.
[46] Beverly Yang, Hector Garcia-Molina, Improving search in peer-to-peer networks, in: ICDCS ’02: Proceedings of the 22nd International Conference on Distributed Computing Systems, IEEE Computer Society, Washington, DC, USA, 2002, p. 5.
[47] C. Yang, J. Wu, A dominating-set-based routing in peer-to-peer networks, in: Proc. of the 2nd International Workshop on Grid and Cooperative Computing Workshop, GCC’03, 2003.
[48] Hai Zhuge, Resource space model, its design method and applications, Journal of System Software 72 (1) (2004) 71–81.
[49] H. Zhuge, X. Li, Rsm-based gossip on p2p network, in: Algorithms and Architectures for Parallel Processing, in: Lecture Notes in Computer Science, vol. 4494, Springer, Berlin, 2007, pp. 1–12.
[50] Yingwu Zhu, Xiaoyu Yang, Yiming Hu, Making search efficient on Gnutella-like p2p systems, in: Parallel and Distributed Processing Symposium, 2005, Proceedings, 19th IEEE International, IEEE Computer Society, 2005, 56a–56a.


Vincenza Carchiolo is a Full Professor of Computer Science in the Department of Informatics and Telecommunications at the University of Catania, Italy. She received her degree in Electrical Engineering from the University of Catania in 1983. Her research interests include information retrieval, query languages, distributed systems, and formal language.

Michele Malgeri is an Associate Professor in the Department of Informatics and Telecommunications at the University of Catania, Italy. He received his degree in Electrical Engineering from the University of Catania in 1983. His research interests include distributed systems, information retrieval, query languages, and formal language.

Giuseppe Mangioni is an Assistant Professor in the Department of Informatics and Telecommunications at the University of Catania, Italy. He received his degree in Computer Engineering (1995) and Ph.D. (2000) from the University of Catania. He currently teaches the courses of Computer Networks and Fundamental of Computer Science at the Faculty of Engineering of Catania. His research interests include peer-to-peer systems, trust and reputation systems, self-organizing and self-adaptive systems, and complex networks.

Vincenzo Nicosia received his degree in Computing Engineering (2004) and his Ph.D. from the University of Catania. His research interests include distributed systems, security, peer-to-peer networks, and complex systems.
