
Evaluating the Dynamic Behaviour of PROSA P2P Network

Vincenza Carchiolo, Michele Malgeri, Giuseppe Mangioni, and Vincenzo Nicosia

Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria, Università di Catania
Viale A. Doria 6, 95100 Catania (ITALY)
{car,malgeri,gmangioni,vnicosia}@diit.unict.it

Abstract. In this paper we present and simulate a new self-organising algorithm for unstructured P2P networks inspired by human relationships. This algorithm, called PROSA, tries to simulate the evolution of human relationships from simple acquaintance to friendship and partnership. Our target is to obtain a self-reconfiguring P2P system which possesses some of the desirable features of social communities. The most useful property of many natural social communities is that of being “small-worlds”, since in a small-world network queries usually require a small number of “hops” to travel from a source to a destination peer. We show that PROSA naturally evolves into a small-world network, with a high clustering coefficient and a really short average path length.

1 Introduction

The way social contacts and relationships are arranged, how they evolve and how they end is a matter of research for psychologists and social scientists. Nevertheless, some studies of social groups and their connections reveal that a “social network”, i.e. the network of relationships among people from simple acquaintance to friendship, has many interesting properties that can be exploited in a real-world P2P structure.

The Milgram experiment of 1966 [6] showed that a message from a “source” to a “destination” person can be delivered by forwarding it step by step to just one related person at a time, in the direction of the destination. In practice, Milgram asked sixty people located in Kansas to send a letter to a specified person located in Cambridge. The participants could only pass the letter to personal acquaintances, by hand. About one quarter of the letters were delivered to the destination person, and Milgram found that the mean number of “hops”, i.e. the number of persons involved in each delivery, was about six. This experiment opened the research field of “small-world” networks [9].

The small-world property seems to be a characteristic of many human communities, such as mathematicians, actors and scientists. A small-world arises almost naturally whenever social contacts among people are involved, and many researchers are trying to understand the reasons for this behaviour. In this work we are not interested in answering this question. Our target is just to develop a P2P system using rules and concepts inspired by human behaviour and the dynamics of relationships.

In a social network there are several kinds of social links among people. We can identify “acquaintance-links” and “semantic-links”: the former express a simple acquaintance among people; the latter require at least an acquaintance-link plus some additional information about interests, culture, abilities, knowledge etc. In our life semantic-links arise almost naturally. No great effort is needed to establish a semantic-link with somebody: you just have to share a knowledge field, a passion or simply an interest with a person, meet him in some circumstance and have a talk with him, and no more. Once you know somebody shares a certain knowledge or passion with you, a semantic-link in that field with that person is established and you are ready to use that link the next time you need information, help, assistance or collaboration in that field.

In real life we massively use semantic-links to speed up information retrieval. For example, if a car vendor wants information about Linux, he asks his nephew, who is studying Computer Science at University, but doesn’t ask his wife, since she is a biologist and does not like computers. Note that both of them (his nephew and his wife) are “semantic-links”, but they belong to two different semantic fields. His nephew, who doesn’t know anything about Linux, will ask one of his colleagues from the Operating Systems course, who is famous for being a Linux guru and can give him the required information: using semantic-links, a car vendor reaches a Linux guru in just two hops. If the car vendor in our example doesn’t have a nephew studying Computer Science, he asks a friend (acquaintance-link) at random, hoping somebody knows what Linux is. Our daily experience says that, in the end, he will find somebody who can help him gather information about Linux. The same mechanism that allowed Milgram’s letter to be correctly delivered from Kansas to Cambridge in just six hops is exploited in this example: the small-world characteristic of a network allows efficient information retrieval. This is how small-world networks work.

In this paper we introduce a P2P structure, named PROSA [2], in which semantic proximity of resources is mapped onto topological proximity of peers. PROSA is inspired by social relationships and their dynamics, since the characteristics of social networks can be exploited to optimise query forwarding and answering. PROSA uses a self-organising algorithm that dynamically links peers sharing similar knowledge and resources, putting them into highly clustered and self-structured “semantic groups”. To validate the proposed algorithm we developed a functional simulator and used it to show that PROSA really evolves into a small-world network.

The paper is organised as follows: Section 2 is a short survey of current work in the field of P2P resource retrieval; in Section 3 we discuss our proposal; in Section 4 we show simulation results; finally, Section 5 presents a plan for future work.

2 Related work

In recent years the interest in overlay networks has increased, mainly because the bandwidth, computing power and low cost of personal computers make it possible to implement this kind of “logical” network. Examples of overlay networks include Gnutella, Freenet [3], CAN [5] and Tapestry [10]. Each of them focuses on a particular aspect of P2P computing: Gnutella is totally unstructured, Freenet is practically anonymous, CAN is search-efficient and so on. Few of the P2P structures proposed so far address the problem of efficient resource retrieval. In particular, one of the most desirable features of a P2P network is the possibility of performing queries based on a semantic description of resources. Semantic queries are interesting because they are similar to the natural way a user describes concepts.

In unstructured networks, such as Gnutella, semantic queries for resources can be performed, but for each request a large part of the network is flooded, and there is no guarantee of a response even if the requested resource is present [4]. In networks organised as Distributed Hash Tables (DHT) [3][5][10] semantic queries are not allowed, since resources are described by a hash of their content or description, so no “semantic proximity” can be defined or used to discover them.

Some recent works [1][11] proposed organising a P2P network in semantic groups of “similar” peers, to facilitate resource search and retrieval based on semantic queries. In particular, in SETS [1] the network is split into semantic areas by a super-peer which also maintains a table of group centroids; a centroid represents the “topic” of a given area. The main drawback of SETS is the introduction of a network manager, which represents a single point of failure. In GES [11] peers maintain two sets of links to other peers: semantic-links and random-links. Queries for resources are first forwarded to a so-called “semantic-target”, which is the first peer that can answer the query, and then flooded to this peer’s neighbours (the semantic group).

3 PROSA

Our target is to create a P2P network based on acquaintance- and semantic-links, where peers join the network in a way similar to a “birth”, then acquire more links to other peers according to the social model, i.e. by linking (semantically) with peers which have similar interests, culture, hobbies, jobs and so on, while maintaining a certain number of “random” acquaintances. In P2P networks the culture or knowledge of a peer is represented by the resources (documents) it shares with other peers, while different types of “links” among peers simulate acquaintances and semantic-links. To implement such a model it is necessary to have:
- A system to model knowledge, culture, interests etc.
- A self-organising network management algorithm

3.1 Modelling Knowledge

In PROSA, knowledge (each resource) is represented using the Vector Space Model (VSM) [8]. In this approach each document is represented by a state-vector of (stemmed) terms called Document Vector (DV); each term in the vector is assigned a weight based on the relevance of the term itself inside the document. This weight is calculated using a modified version of the TF-IDF [7] scheme, as follows:

$$w_t = 1 + \log(f_t)$$

where $f_t$ is the frequency of term $t$ in the document. It has been shown [8] that this way of calculating relevance is a good approximation of the TF-IDF ranking scheme. The VSM representation of a document is necessary to calculate the relevance of a document with respect to a certain query. We model a query by means of a so-called Query Vector (QV), that is the VSM representation of the query itself. Since both documents and queries are represented by state-vectors, we define the relevance of a document (D) with respect to a given query (Q) as follows:

$$r(D, Q) = \sum_{t \in D \cap Q} w_{t,D} \cdot w_{t,Q} \tag{1}$$
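The weighting scheme and equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are stored as plain dictionaries, the logarithm is taken as the natural logarithm (the base is not specified in the text), and the function names `document_vector` and `relevance` are hypothetical. The 100-term cap mirrors the simulation setup of Section 4.

```python
import math
from collections import Counter

def document_vector(terms, max_terms=100):
    """Map stemmed terms to weights w_t = 1 + log(f_t).

    Only the `max_terms` most frequent terms are kept (assumed natural log).
    """
    counts = Counter(terms)
    return {t: 1.0 + math.log(f) for t, f in counts.most_common(max_terms)}

def relevance(doc_vec, query_vec):
    """Equation (1): sum of w_{t,D} * w_{t,Q} over the terms shared by D and Q."""
    shared = doc_vec.keys() & query_vec.keys()
    return sum(doc_vec[t] * query_vec[t] for t in shared)

# Toy example: "graph" occurs twice in the document, so its weight is 1 + log(2).
doc = document_vector(["graph", "graph", "peer", "network"])
query = document_vector(["graph", "peer"])
score = relevance(doc, query)
```

A query is built with the same function as a document, since a QV is simply the VSM representation of the query terms.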

Using VSM we also obtain a compact description of a peer’s knowledge. This description is called the “Peer Vector” (PV), and is computed as follows:
- For each document hosted by the peer, the frequencies of the terms it contains are computed ($F_{t,D}$).
- Term frequencies for the different documents are summed together, obtaining the overall frequency of each term, $F_t = \sum_{D} F_{t,D}$, where the sum runs over the documents D hosted by the peer.
- Then a weight is computed for each term, using $w_t = 1 + \log(F_t)$.
- Finally all weights are put into a state-vector and the vector is normalised.

The obtained PV is a sort of “snapshot” of the peer’s knowledge, since it contains information about the relevant terms of the documents it shares. The relevance of a peer (P) with respect to a given query (Q) is defined as follows:

$$r(P, Q) = \sum_{t \in P \cap Q} w_{t,P} \cdot w_{t,Q}$$
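The four steps that build a Peer Vector can be sketched as follows. This is a minimal illustration, not the simulator code: the text does not say which norm is used, so Euclidean (L2) normalisation and the natural logarithm are assumptions, and the names `peer_vector` and `peer_relevance` are hypothetical.

```python
import math
from collections import Counter

def peer_vector(documents):
    """Build a normalised Peer Vector from a list of documents.

    Each document is a list of stemmed terms. Assumes natural log and L2 norm.
    """
    totals = Counter()
    for terms in documents:
        totals.update(terms)                      # F_t = sum over documents of F_{t,D}
    weights = {t: 1.0 + math.log(f) for t, f in totals.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def peer_relevance(pv, qv):
    """r(P, Q): sum of w_{t,P} * w_{t,Q} over the terms shared by P and Q."""
    return sum(pv[t] * qv[t] for t in pv.keys() & qv.keys())

# A peer hosting two small documents; "graph" appears in both.
pv = peer_vector([["graph", "peer"], ["graph", "network"]])
```

With this normalisation every PV has unit length, so peers that host many documents do not automatically score higher than peers that host few.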

This relevance is used by the PROSA query routing algorithm. It is worth noting that a high relevance between a QV and a PV means that the given peer probably has documents that can match the query.

3.2 Network Management Algorithm

As stated above, relationships among people are usually based on similarities in interests, culture, hobbies, knowledge and so on, and these kinds of links usually evolve from simple “acquaintance-links” to what we called “semantic-links”. To implement this behaviour three types of links have been introduced:
- Acquaintance-Link (AL)
- Temporary Semantic-Link (TSL)
- Full Semantic-Link (FSL)

TSLs represent relationships based on a partial knowledge of a peer. They are usually stronger than ALs and weaker than FSLs. Since relationships are usually not symmetric, it is necessary to specify which are the source peer (SP) and the destination peer (DP) of a link. Figure 1 shows the representations of the three different types of links.

[Figure 1: each of the three link types (Acquaintance Link, Temporary Semantic Link, Full Semantic Link) drawn from a source peer (SP) to a destination peer (DP).]

Fig. 1. Link types

To efficiently use the appropriate link in any given situation, each peer maintains a list of known peers, which we call the Peer List (PL). It is a finite list of links divided into three parts: the first contains FSLs, the second contains TSLs and the third contains ALs. The size of each portion is an algorithm parameter. Note that FSLs represent “stable” connections (parents, relatives), TSLs are similar to FSLs but not as strong (friends, colleagues), and ALs are really weak links.

Joining. The case of a node that wants to join an existing PROSA network is similar to the birth of a child. At the beginning of his life a child “knows” just a couple of people (his parents). A new peer which wants to join simply searches for other peers (for example using broadcasting, or by selecting them from a list of peers that are supposed to be up, as in Freenet or Gnutella) and adds some of them to its PL as ALs. These are ALs because a new peer doesn’t know anything about its “relatives” until it queries them for resources. This behaviour is quite easy to understand: when a baby comes to life he doesn’t know anything about his parents. He knows neither his father’s job nor that his mother is a biologist. The joining phase is represented in figure 2, where “N” is the new peer; N chooses some other peers (P) at random as initial ALs.

[Figure 2: the new peer N connected by Acquaintance Links (AL) to three peers P of the PROSA network.]

Fig. 2. A new node joining PROSA
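The three-part Peer List and the joining step can be sketched as a small data structure. This is an illustrative sketch only: the class name `PeerList`, the function `join` and the parameter `n_links` are hypothetical, peers are identified by plain strings, and the per-section size limits mentioned in the text are left out for brevity.

```python
import random
from dataclasses import dataclass, field

@dataclass
class PeerList:
    """Peer List split into its three sections."""
    fsl: dict = field(default_factory=dict)  # peer id -> Peer Vector
    tsl: dict = field(default_factory=dict)  # peer id -> Temporary Peer Vector
    al: set = field(default_factory=set)     # peer ids; nothing else is known yet

def join(known_peers, n_links=3, rng=random):
    """A new peer picks some peers that are supposed to be up as its initial ALs."""
    candidates = sorted(known_peers)          # sorted for reproducible sampling
    pl = PeerList()
    pl.al.update(rng.sample(candidates, min(n_links, len(candidates))))
    return pl

# The new peer starts with acquaintances only: no semantic information yet.
new_peer = join({"p1", "p2", "p3", "p4"}, n_links=2, rng=random.Random(0))
```

Keeping the vectors inside the FSL and TSL sections, and only identifiers in the AL section, reflects the fact that a peer knows nothing about an acquaintance beyond its existence.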

Updating. In PROSA, FSL dynamics are strictly related to queries. When a user of PROSA requires a resource, he performs a query and specifies the number of results he wants to obtain. The relevance of the query with respect to the resources hosted by the user’s peer is first evaluated, using equation 1. If none of the hosted resources has a sufficient relevance with respect to the query, the query has to be forwarded to other peers. The mechanism is quite simple:
- A query message containing the QV, a (possibly) unique QueryID, the source address and the required number of results is built.
- If the peer has neither FSLs nor TSLs, i.e. it has just ALs, the query message is forwarded to one link chosen at random.
- Otherwise, the peer computes the relevance between the query and each entry of its Peer List.
- It selects the link with the highest relevance, if it exists, and forwards the query message to it.

When a peer receives a query forwarded by another peer, it first updates its PL. If the requesting peer is unknown, a new TSL to that peer is added to the PL, and the QV becomes the corresponding Temporary Peer Vector (TPV). If the requesting peer is a TSL for the peer that receives the query, the corresponding TPV in the list is updated by adding the received QV and normalising the result. If the requesting peer is an FSL, its PV is already in the PL, and no update is necessary. After the PL update, the relevance of the query with respect to the peer’s resources is computed. There are three possible cases:
- None of the hosted documents has a sufficient relevance. In this case the query is forwarded to another peer, using the same mechanism used by the forwarding peer. The query message is not modified.
- The peer has a certain number of relevant documents, but they are not enough to fulfil the request. In this case a response message is sent to the requesting peer, specifying the number of matching documents and the corresponding relevance. The query message is forwarded to all the links in the PL whose relevance with respect to the query is higher than a given threshold (semantic flooding). The number of matched resources is subtracted from the total number of requested documents before forwarding.
- The peer has enough relevant documents to fulfil the request. In this case a result message is sent to the requesting peer and the query is not forwarded any further.
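The forwarding and update rules above can be sketched as follows. This is a hedged illustration, not the PROSA simulator: all function names are hypothetical, vectors are dictionaries, Euclidean normalisation is assumed, and two behaviours that the text leaves open are filled in as labelled assumptions (falling back to a random AL when no semantic link has positive relevance, and excluding ALs from semantic flooding because they carry no vector).

```python
import random

def dot(u, v):
    """Relevance between two sparse vectors: sum of products over shared terms."""
    return sum(u[t] * v[t] for t in u.keys() & v.keys())

def normalise(v):
    """Scale a sparse vector to unit Euclidean length (assumed norm)."""
    norm = sum(w * w for w in v.values()) ** 0.5
    return {t: w / norm for t, w in v.items()} if norm else dict(v)

def update_peer_list(fsl, tsl, sender, qv):
    """Update the receiver's PL when a query arrives from `sender`."""
    if sender in fsl:
        return                      # PV already in the PL: nothing to update
    if sender in tsl:
        merged = dict(tsl[sender])  # add the QV to the stored TPV, then normalise
        for term, weight in qv.items():
            merged[term] = merged.get(term, 0.0) + weight
        tsl[sender] = normalise(merged)
    else:
        tsl[sender] = dict(qv)      # unknown peer: the QV becomes its TPV

def choose_next_hop(fsl, tsl, al, qv, rng=random):
    """Random AL if there are no semantic links, else the most relevant link."""
    semantic = {**tsl, **fsl}
    if semantic:
        best = max(semantic, key=lambda peer: dot(semantic[peer], qv))
        if dot(semantic[best], qv) > 0:
            return best
    # Assumption: fall back to a random AL when no semantic link is relevant.
    return rng.choice(sorted(al)) if al else None

def classify(n_relevant, n_requested):
    """Return which of the three cases applies to the receiving peer."""
    if n_relevant == 0:
        return "forward"            # pass the unmodified query on
    if n_relevant < n_requested:
        return "partial"            # answer, then semantic flooding for the rest
    return "complete"               # answer and stop forwarding

def flood_targets(fsl, tsl, qv, threshold):
    """Semantic flooding: links whose relevance to the query exceeds the threshold."""
    semantic = {**tsl, **fsl}
    return sorted(p for p, vec in semantic.items() if dot(vec, qv) > threshold)

fsl = {"a": {"graph": 1.0}}
tsl = {"b": {"ethic": 1.0}}
query = {"graph": 1.0}
next_hop = choose_next_hop(fsl, tsl, {"c"}, query)
update_peer_list(fsl, tsl, "n", query)           # unknown peer "n" becomes a TSL
update_peer_list(fsl, tsl, "n", {"peer": 1.0})   # its TPV is merged and normalised
```

In the toy data, the link to "a" is chosen because its vector shares the term "graph" with the query, while "b" has no overlap with it.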

[Figure 3: peer N forwards a query into the PROSA network through peer P1; peers P and Pr are also shown.]

Fig. 3. Query forwarding: a new TSL arises

This situation is shown in figure 3, where peer “N” forwards a query to one of its ALs, chosen at random, since it has neither TSLs nor FSLs. In our example the chosen peer is “P1”. As soon as P1 receives the QV, it automatically establishes a TSL with N (see figure 3) and then forwards the query if needed. When the requesting peer receives a response message it presents the results to the user. If the user decides to download a certain resource from another peer, the requesting peer contacts the peer owning that resource and asks it for the download. If the download is accepted, the resource is sent to the requesting peer, together with the Peer Vector of the serving peer. This case is illustrated in figure 4, where peer “N” received a response from peer “Pr” and decided to download the corresponding resource. Note that Pr established a TSL with N, because it received a QV from it, and N established an FSL with Pr, because it successfully received a resource from it.
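The last step, where a successful download turns the serving peer into an FSL, can be sketched as below. The function name `on_download_success` is hypothetical, and removing the peer from the TSL and AL sections is an assumption (the text does not say whether a peer may appear in more than one section of the PL).

```python
def on_download_success(fsl, tsl, al, server, server_pv):
    """Record `server` as an FSL, storing the Peer Vector sent with the resource."""
    tsl.pop(server, None)   # assumption: a peer occupies a single PL section
    al.discard(server)
    fsl[server] = dict(server_pv)

# Peer N knew Pr only as an acquaintance before the download.
fsl, tsl, al = {}, {}, {"Pr", "P1"}
on_download_success(fsl, tsl, al, "Pr", {"linux": 0.8, "kernel": 0.6})
```

After the call, N can compute the relevance of future queries against the real PV of Pr instead of choosing a link at random.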

[Figure 4: peer N downloads a resource from peer Pr inside the PROSA network.]

Fig. 4. Query forwarding: a new FSL arises

4 PROSA simulations and results

In order to validate PROSA dynamics and effectiveness, we developed a simple functional simulator in Python. The three “phases” of a peer’s life (joining, querying and leaving PROSA) are represented by different “behaviours” and triggered by events generated off-line by an event generator. The choice of data sets is of the utmost importance in order to obtain consistent and relevant simulation results. To simulate PROSA functionalities, we chose a data set composed of scientific articles from two different fields: Maths and Philosophy. Maths articles come from “Journal of American Mathematical Society”, “Transactions of the American Mathematical Society” and “Proceedings of the American Mathematical Society”, for a total of 740 articles. Philosophy articles come from “Journal of Social Philosophy”, “Journal of Political Philosophy”, “Philosophical Issues” and “Philosophical Perspectives”, for a total of 750 articles. We chose documents from these two fields in order to test whether a PROSA network is able to dynamically adapt its structure, allowing “similar” peers to link together by means of Full Semantic Links or Temporary Semantic Links, and adaptively form “semantic-groups”. We define a “semantic-group” as a group of semantically-linked peers that host documents belonging to a certain shared field. Article terms have been stemmed and stored in a database. For each article, its DV was computed using only the 100 most frequent terms of the document. We chose to limit the number of terms of the DV because in [11] it has been shown that a larger DV does not give better results.

4.1 Simulation results

The main target of this work is to show that a relationships-inspired network naturally evolves into a small-world. For this reason, preliminary simulations of PROSA have focused on topological characteristics, such as clustering coefficient and average path length, because small-world graphs have a high clustering coefficient and a small average path length. Since links between peers in PROSA are not symmetric, it is possible to represent a PROSA network as a directed graph G(V,E).
The clustering coefficient of a node n in a directed graph can be defined as follows [9]:

$$CC_n = \frac{E_{n,real}}{E_{n,tot}} \tag{2}$$

where $E_{n,real}$ is the number of edges between n’s neighbours and $E_{n,tot}$ is the maximum number of possible edges between n’s neighbours. Note that if k is in the neighbourhood of n, the converse is not guaranteed, because links are directed. The clustering coefficient of a graph is defined as the mean of the clustering coefficients of all the vertices (nodes) in the graph:

$$CC = \frac{1}{|V|} \sum_{n \in V} CC_n \tag{3}$$
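Equations (2) and (3) can be sketched in Python as follows. The sketch rests on two assumptions that the text does not state explicitly: the neighbourhood of n is the set of its out-neighbours, and the maximum number of directed edges among k neighbours is k(k-1). Nodes with fewer than two neighbours are given a coefficient of 0, and the function names are hypothetical.

```python
def clustering_coefficient(adj, n):
    """CC_n for node n; `adj` maps each node to the set of its out-neighbours."""
    nbrs = set(adj.get(n, ())) - {n}
    k = len(nbrs)
    if k < 2:
        return 0.0                       # assumption: undefined cases count as 0
    # E_{n,real}: directed edges u -> v with both endpoints among n's neighbours.
    real = sum(1 for u in nbrs for v in adj.get(u, ()) if v in nbrs and v != u)
    return real / (k * (k - 1))          # E_{n,tot} = k(k-1) for a directed graph

def graph_clustering(adj):
    """Equation (3): mean of CC_n over all nodes of the graph."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    if not nodes:
        return 0.0
    return sum(clustering_coefficient(adj, n) for n in nodes) / len(nodes)

# Node "a" links to "b" and "c"; only one of the two possible edges (b -> c) exists.
adj = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
cc_a = clustering_coefficient(adj, "a")
cc_graph = graph_clustering(adj)
```

In the toy graph only "a" has two neighbours, so it is the only node with a non-zero coefficient.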

In figure 5 the clustering coefficient (CC) and average path length (APL) of PROSA are compared to those of an “equivalent” random graph, defined as a graph with the same number of nodes and edges as the PROSA network and randomly chosen edges. The clustering coefficient and the average path length of a random graph with |V| vertices and |E| edges have been computed using equations (4) and (5) [9].

$$CC_{rnd} = \frac{|E|}{|V| \cdot (|V| - 1)} \tag{4}$$

$$APL_{rnd} = \frac{\log |V|}{\log (|E|/|V|)} \tag{5}$$

# nodes   # edges   CC prosa   APL prosa   CC rnd   APL rnd   CC prosa/CC rnd
400       15200     0.26       2.91        0.095    1.65      2.7
600       14422     0.19       2.97        0.04     2.01      4.75
800       14653     0.17       2.92        0.02     2.29      8.5
1000      14429     0.15       2.90        0.014    2.58      10.7
3000      15957     0.11       2.41        0.002    4.8       55
5000      19901     0.06       2.23        0.0008   6.17      75

Fig. 5. Clustering coefficients and APL for different network sizes

These measurements refer to PROSA networks where each peer starts with 20 documents on average. The CC and APL are computed after 10,000 queries. Each query contains 4 terms on average. The results show that PROSA networks always present a higher clustering coefficient than the corresponding random graphs. This means that each peer tends to link with a strongly connected neighbourhood, which represents (a part of) the “semantic group” joined by the peer. This behaviour is mainly due to the fact that links are mostly “semantic links” (both FSLs and TSLs) with nodes that provided (or requested) resources belonging to a given field. Note also that the APL of a PROSA network tends to decrease when the number of nodes increases, while it seems to depend linearly on the network size for the corresponding random graph. As shown in figure 6, the clustering coefficient of both PROSA and random networks decreases, but the ratio between CC prosa and CC rnd increases almost exponentially with the number of nodes. We think this is due to the fact that in PROSA the clustering degree of the network is strictly related to the number of queries performed by nodes, especially in this case, where the PL has unlimited length. Figure 7 reports the ratio between the number of edges and the number of nodes (divided by 100) for PROSA networks with 400 to 5000 nodes, and the corresponding clustering coefficient.
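As a quick check, equations (4) and (5) can be evaluated on the node and edge counts listed in figure 5. The function names below are hypothetical and the natural logarithm is used; since equation (5) is a ratio of two logarithms, the base does not affect the result.

```python
import math

def cc_random(n_nodes, n_edges):
    """Equation (4): clustering coefficient of an equivalent random graph."""
    return n_edges / (n_nodes * (n_nodes - 1))

def apl_random(n_nodes, n_edges):
    """Equation (5): average path length of an equivalent random graph."""
    return math.log(n_nodes) / math.log(n_edges / n_nodes)

# (nodes, edges) pairs taken from the table in figure 5.
sizes = [(400, 15200), (600, 14422), (800, 14653),
         (1000, 14429), (3000, 15957), (5000, 19901)]
rows = [(n, e, cc_random(n, e), apl_random(n, e)) for n, e in sizes]

for n, e, cc, apl in rows:
    print(f"{n:5d} {e:6d} {cc:.4f} {apl:.2f}")
```

For 400 nodes and 15200 edges this gives a clustering coefficient of about 0.095 and an average path length of about 1.65, in line with the CC rnd and APL rnd columns; the other rows agree with the tabulated values up to rounding.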

[Figure 6: CC_prosa and CC_rnd (vertical axis, 0 to 0.3) plotted against the number of nodes (horizontal axis, 0 to 5000).]

Fig. 6. Clustering coefficients for PROSA and random graph, 10,000 queries

[Figure 7: Links/nodes ratio and CC (vertical axis, 0 to 0.4) plotted against the network size (horizontal axis, 0 to 5000).]

Fig. 7. Ratio between # of edges and # of nodes for different network sizes

The clustering coefficient appears to depend on the average number of links per node. To verify this conjecture, we simulated the behaviour of PROSA networks for different numbers of queries, from 5000 to 20000. Results are shown in the four subfigures of figure 8. These plots show that a PROSA network has a higher clustering coefficient than the corresponding random graph for networks that have more than 200 nodes.

[Figure 8: four panels, (a) 5000 queries, (b) 10000 queries, (c) 15000 queries and (d) 20000 queries, each plotting the clustering coefficient of PROSA and RND for networks of 200 to 1000 nodes.]

Fig. 8. Clustering coefficient for PROSA and random graph for different network sizes and # of queries

The ratio between the clustering coefficient of PROSA networks and that of the corresponding random graphs (shown in figure 9) indicates that the PROSA clustering coefficient is always 2 to 15 times higher than that of a random graph. From these results we can deduce that a PROSA network evolves into a “small-world” after a number of queries which depends on the number of peers, because of the really short average path length and the relatively high clustering coefficient we obtain.

5 Conclusions and future work

In this paper a novel P2P self-organising algorithm for resource searching and retrieval has been presented. The algorithm emulates the way social relationships among people naturally arise and evolve, and eventually produces a small-world network topology, as confirmed by simulation results. The next step is to prove that a PROSA network is internally organised into semantic-groups, i.e. highly clustered groups of peers formed by nodes that share knowledge in a certain field.

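[Figure 9: ratio between PROSA and random graph clustering coefficients (vertical axis, 0 to 20) for networks of 200 to 1000 nodes, with one curve each for 5000, 10000, 15000 and 20000 queries.]

Fig. 9. Ratio between PROSA and random graph clustering coefficients, for different sizes and numbers of queries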

References

1. Mayank Bawa, Gurmeet Singh Manku, and Prabhakar Raghavan. SETS: search enhanced by topic segmentation. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 306-313, New York, NY, USA, 2003. ACM Press.
2. V. Carchiolo, M. Malgeri, G. Mangioni, and V. Nicosia. PROSA: P2P resource organisation by social acquaintances. 2006. To be presented at AP2PC, Agents and P2P Computing Workshop at AAMAS 2006.
3. Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: A distributed anonymous information storage and retrieval system. Lecture Notes in Computer Science, 2009:46, 2001.
4. B.T. Loo, R. Huebsch, I. Stoica, and J.M. Hellerstein. The case for a hybrid P2P search infrastructure. In Proceedings of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS), February 2004.
5. Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content addressable network. Technical Report TR-00-010, Berkeley, CA, 2000.
6. S. Milgram. The small world problem. Psychology Today, 2:60-67, 1967.
7. Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA, 1987.
8. H. Schütze and C. Silverstein. A comparison of projections for efficient document clustering. In Proceedings of ACM SIGIR, pages 74-81, Philadelphia, PA, July 1997.
9. Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393:440-442, 1998.
10. B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, April 2001.
11. Yingwu Zhu, Xiaoyu Yang, and Yiming Hu. Making search efficient on Gnutella-like P2P systems. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pages 56a-56a. IEEE Computer Society, April 2005.