Distributed Optimization Strategies for Mining on Peer-to-Peer Networks

Haimonti Dutta
Center for Computational Learning Systems, Columbia University, New York, NY 10115
[email protected]

Ananda Matthur
Department of Computer Science, Columbia University, New York, NY 10115
[email protected]

Abstract

Peer-to-peer (P2P) networks are distributed systems in which nodes of equal roles and capabilities exchange information and services directly with each other. In recent years, they have become a popular way to share large amounts of data. However, such an architecture adds a new dimension to the process of knowledge discovery and data mining: the challenge of mining distributed, and often dynamic, sources of data and computation. Furthermore, effective utilization of the distributed resources must be carefully analyzed. In this paper, we study the problem of optimizing resources to enable efficient and scalable mining on a peer-to-peer (P2P) network. We develop a crawler based on the Gnutella protocol and use it to simulate a P2P network on which we run a classification task. Our results from the case study indicate not only that resources are utilized effectively, but also that the accuracy of the distributed mining algorithm is likely to be close to the hypothetical scenario in which all data in the network is stored at a central location.

1 Introduction

Peer-to-peer (P2P) systems such as Gnutella [6] and Kazaa [9] are likely to play a very important role in the next generation of data-driven collaborative systems [5]. Their popularity stems from the fact that they can function, scale and self-organize in the presence of a highly transient population of nodes, without the overhead of a central server for co-ordination. In addition, they are collectively capable of storing a huge amount of data of different modalities. Extracting patterns from this data is a challenging task. Most of the data is inherently distributed, and merging remote data at a central site to perform data mining would result in unnecessary communication overhead. An alternate approach, which can provide a cheaper solution, is to mine data locally on individual peers and combine the models or results. In this paper, we study the fundamental trade-off between the cost of communication and computation on one hand and the accuracy of the data mining results on the other in a P2P network. The goal is to find an optimal combination of communication and allocation of data among the sites currently in the network. The strategy of transferring data aims to minimize the error of the data mining model while satisfying restrictions on data transfer in the network. One way to do this is to pose an optimization problem [20] whose solution is the strategy that gives the best balance between communication cost and accuracy. However, such an approach assumes that there is a "global" co-ordinator site that can solve the optimization problem, which is typically not the case in a peer-to-peer network. We present a distributed algorithm for optimization of resources and compare its performance to a centralized approach on a simulated peer-to-peer system. The rest of the paper is organized as follows: Section 2 discusses related work on peer-to-peer crawlers and data mining on peer-to-peer networks, Section 3 presents the static network model with an example, Section 4 describes the Distributed Simplex Algorithm and Section 5 presents a case study. Finally, Section 6 concludes the paper.

2 Related Work

2.1 Peer-to-Peer Crawlers

Studying the trade-off between communication cost and accuracy of data mining models on a peer-to-peer network requires collecting information on network topology, bandwidth and the files shared at each node. Earlier efforts involved analyzing messages passed in the P2P network, such as ping and pong messages in the Gnutella network [7]. This approach is also adopted by gnuDC [22], which provides a distributed architecture for crawling the Gnutella network. Another successful effort is Cruiser [19], which adopts a master-slave architecture in which several slave processes crawl the network in parallel and the master co-ordinates them. IRWire [13] modifies the source code of Limewire [11] and collects network information. It also provides information on queries in the network and statistics regarding peers, and suggests methods for analyzing the collected data. To the best of our knowledge, none of the existing crawlers provide all of the information we need to frame the optimization problem described in Section 4.

[Figure 1. An example of a static P2P Network: five nodes with compute capacities δ (300 GB for Nodes 1, 3, 4 and 5; 600 GB for Node 2) and edges labeled with the per-record transfer costs µij (µ12 = 3.8, µ23 = 6.1, µ14 = 6.5, µ15 = 2.5, µ25 = 10.4, µ34 = 7.8, µ45 = 8.3).]

2.2 Data Mining on Peer-to-Peer Networks

P2P systems have been studied in many domains beyond data transfer and mining, for example computation [2] and infrastructure systems [18]. However, search mechanisms and data mining algorithms developed for P2P systems are most relevant to this paper. Typically, nodes in a P2P system form an overlay structure over a fully connected network to retrieve information requested by a user [3]. The search technique proposed in [1] allows nodes to index the content of other nodes in the system, so that queries can be processed over a large fraction of the content in the network while actually visiting just a small number of nodes. Other strategies allow nodes to maintain metadata that provide "hints" for answering the current query; such hints may be formed by building summaries or learning user preferences [8]. Systems such as Chord [18] and CAN [15] are quite similar in concept but differ slightly in their algorithmic implementations. Extracting patterns from data stored in P2P networks is a relatively new area of research [5]. Only a limited set of problems, such as association rule mining [16], eigen monitoring [21] and facility location [10], have been studied in some detail; however, these do not consider the problem of optimal resource allocation for effective data mining.

3 The P2P Network Model

A peer-to-peer network can be conceptualized as a weighted graph G = (V, E), where V represents the node set and E the edge set connecting pairs of nodes. For our purposes, the graph is assumed to be undirected and to have a fixed topology. We also assume that communication between nodes is completely reliable and that accessing the local memory of a node is less expensive than accessing data across the network. Formally, we assume that there are n different nodes in the network. Node i has a dataset Di residing on it. (The data may be either homogeneously or heterogeneously partitioned; this does not affect our analysis.) The cost of processing data at the ith node into a data mining model is νi per record. The cost of moving data from node i to its neighbor node j in the network is µij per record. Let (1) xij be the amount of data Di transferred from node i to node j for processing, i.e. 0 ≤ xij ≤ Di; (2) X = [xij], i, j = 1, ..., n, be the matrix containing all the data transfers in the P2P network, referred to as a strategy [20]; and (3) δi be the amount of data that can be processed by the ith compute node at the current time t. Note that as more and more jobs are processed on the network, the value of δi changes and is always less than the total amount of data that can be processed by a compute element.

Table 1. ν values for the P2P network shown in Figure 1

Node   ν
1      1.23
2      2.23
3      2.94
4      1.78
5      4.02

The overall cost function for building the data mining model for a strategy X can be obtained as follows:

C(X) = Σij (µij xij + νj xij) = Σij cij xij    (1)

where cij = µij + νj. The constraints for the optimization problem described by the above equation are derived from the fact that the compute element of each node can process at most δi amount of data. Next, we illustrate our optimization problem with an example.

Example 1. Consider the P2P network in Figure 1. The weights on the edges represent the cost of moving data from node i to its neighbor node j, i.e. µij dollars per record. The values of δi for each node are also indicated in the figure, e.g. δ1 = 300, δ2 = 600, etc. Assume that the ν values for each node are as indicated in Table 1. Then, for Figure 1, the following objective function can be written: z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45, where z is a user-defined cost. The corresponding constraints for this objective function are as follows: x12 + x14 + x15 ≤ 300, x12 + x25 + x23 ≤ 600, x15 + x25 + x45 ≤ 300, x14 + x34 + x45 ≤ 300, x23 + x34 ≤ 300, 0 ≤ x12 ≤ D1, 0 ≤ x23 ≤ D2, 0 ≤ x15 ≤ D1, 0 ≤ x14 ≤ D1, 0 ≤ x25 ≤ D2, 0 ≤ x34 ≤ D3, 0 ≤ x45 ≤ D4.

Note that in addition to the above constraints, nodes may develop other local constraints amongst themselves. For example, Nodes 1, 2 and 5 may agree that the amount of data transferred from Node 2 to Node 5 is twice the sum of the data transferred from Node 1 to Node 2 and from Node 1 to Node 5. This constraint may be written as x25 = 2x12 + 2x15. If the linear programming problem described above is fully constrained, it can be solved by direct or iterative techniques. However, if the system is underconstrained, as is often the case on a P2P network, different solution approaches are needed. Historically, simplex optimization [4] has been used to solve underconstrained linear systems, and we use it to design the distributed algorithm described in the next section.
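As a quick consistency check, the objective coefficients in Example 1 can be recomputed from the edge costs in Figure 1 and the ν values in Table 1; each coefficient is cij = µij + νj:

```python
# Recomputing the objective coefficients of Example 1.
# mu holds the edge costs from Figure 1; nu holds the per-record
# processing costs from Table 1.
mu = {(1, 2): 3.8, (2, 3): 6.1, (1, 5): 2.5, (1, 4): 6.5,
      (2, 5): 10.4, (3, 4): 7.8, (4, 5): 8.3}
nu = {1: 1.23, 2: 2.23, 3: 2.94, 4: 1.78, 5: 4.02}

# c_ij = mu_ij + nu_j: the per-record cost of shipping data from i to j
# plus the per-record cost of processing it at j.
c = {(i, j): round(mu[(i, j)] + nu[j], 2) for (i, j) in mu}
print(c[(1, 2)], c[(2, 3)], c[(2, 5)])  # 6.03 9.04 14.42
```

These match the coefficients of x12, x23 and x25 in the objective function above.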

4 Distributed Simplex Algorithm

One of the objectives of Distributed Data Mining (DDM) algorithms is minimization of communication cost. This can be achieved by posing the cost of data transfer as a constrained optimization problem and solving it using standard optimization techniques such as the simplex algorithm. However, this requires the constraint matrix and the objective function to be observed at a central location, which does not hold for large P2P networks, where maintaining the entire constraint matrix at a central site is infeasible. We now describe the steps involved in the distributed simplex algorithm.

4.1 Distributed Canonical Representation of the Linear System

A pre-processing step for running the simplex algorithm in a distributed manner is an algorithm for obtaining the canonical representation of the linear system. (We refrain from giving a detailed description of the simplex algorithm due to space restrictions; the interested reader is referred to [4] for a detailed review.) We propose the following convergecast-based approach. Let s be an initiator node in the network. It builds a minimum spanning tree over all nodes in the network. Following this, a message is sent by s to all its neighbors asking how many local constraints each node has. A neighbor, on receiving this message, either forwards it to its neighbors (if there are any) or sends back a reply. At the end of this procedure, node s has the correct value of the total number of constraints in the system. This is the number of basic variables each node has to add (see [4] for a definition of basic variables). Next, each node must determine which of its basic variables to set to 1 so that the simplex tableau is consistent among all nodes in the network. To do this, node s traverses the minimum spanning tree and informs each node visited of the number of constraints seen so far. Let ni be the node currently visited and Tc the number of constraints known so far by node s. Then node ni must set the (Tc + 1)th basic variable to 1 while all the others remain 0. At this point, we are ready to describe the distributed simplex optimization algorithm.
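The convergecast step above can be sketched as a recursive aggregation up the spanning tree; the tree and per-node constraint counts below are illustrative, not taken from the paper:

```python
# Sketch of the convergecast: the initiator s learns the total number of
# constraints by aggregating counts up a spanning tree. The tree and the
# per-node constraint counts are illustrative values.
tree = {                       # spanning tree as adjacency lists, rooted at 's'
    's': ['a', 'b'],
    'a': ['c', 'd'],
    'b': [],
    'c': [], 'd': [],
}
local_constraints = {'s': 2, 'a': 1, 'b': 3, 'c': 2, 'd': 1}

def convergecast_count(node):
    """Leaves reply with their local count; internal nodes forward the query
    and add their children's replies to their own count."""
    return local_constraints[node] + sum(convergecast_count(ch)
                                         for ch in tree[node])

total = convergecast_count('s')
print(total)  # 2 + 1 + 3 + 2 + 1 = 9
```

In a real network each recursive call would be a message down the tree and each return value a reply back up it.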

4.2 Notation and Preliminaries

Let P1, P2, ..., Pη be a set of nodes connected to one another via an underlying communication tree such that each node Pi knows its neighbors Ni. Each node Pi has its own local constraints, which may change from time to time depending on the resources available at that node. The constraints at node i have the form Ai X^i = b^i, where Ai is an m × n matrix, X^i is an n × 1 vector and b^i is an m × 1 vector. Thus at each node we are interested in solving the following linear programming problem: find X^i ≥ 0 and min z^i satisfying c1 x1 + c2 x2 + ... + cn xn = z^i subject to the constraints Ai X^i = b^i. The global linear program (if all the constraint matrices could be centralized) can be written as follows: find X ≥ 0 and min z satisfying c1 x1 + c2 x2 + ... + cn xn = z subject to the constraints AX = B, where A = ∪_{i=1}^{η} Ai and B = ∪_{i=1}^{η} b^i. Next, we present an exact distributed algorithm for solving linear optimization using the simplex method. Our assumption is that each node contains a different set of constraints but has knowledge of the global objective function.
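For reference, a single node's local problem, min c·x subject to A x = b and x ≥ 0, is an ordinary linear program; a minimal sketch using SciPy's `linprog` as a stand-in solver follows. The one-constraint system and the cost vector are toy values, not taken from the paper:

```python
# Sketch of one node's local linear program: min z = c.x subject to
# A_eq x = b_eq, x >= 0. The numbers are illustrative toy values.
from scipy.optimize import linprog

c = [6.03, 9.04]            # objective coefficients for two transfer variables
A_eq = [[1.0, 1.0]]         # one local constraint: x1 + x2 = 10
b_eq = [10.0]

res = linprog(c, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)], method="highs")
print(res.x, res.fun)       # all mass goes to the cheaper variable
```

The distributed algorithm of Section 4.3 achieves the same effect without ever assembling the constraint rows at one site.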

4.3 The Algorithm

At the beginning of iteration l, a node Pi has its own constraint matrix and the objective function. The column pivot, henceforth referred to as col-pivot_i, is the column of the tableau corresponding to the most negative indicator among c1, c2, ..., cn. Each node forms the row ratios r_j^i, 1 ≤ j ≤ m, for each row, i.e. it divides b_j^i, 1 ≤ j ≤ m, by the corresponding entry in the pivot column of that row. The minimum of the r_j^i's is denoted row-pivot_i; it is stored in the history table of node Pi, indexed by iteration l. Now the node must participate in a distributed algorithm for determining the minimum row ratio, i.e. min(row-pivot_i), i ∈ Ni. We describe a simple protocol called Push-Min for computing this. At all times t, each node maintains a minimum m_{t,i}. At time t = 0, m_{t,i} = row-pivot_i. Thereafter, each node follows the protocol given in Algorithm 4.3.1. When the Push-Min protocol terminates, each node will have the exact value of the minimum row-pivot_i in the network, as shown by Bawa et al. [12].

Algorithm 4.3.1 Protocol Push-Min
1. Let {m̂_r} be all the values sent to node i at round t − 1.
2. Let m_{t,i} = min({m̂_r}, row-pivot_i).
3. Send m_{t,i} to all the neighbors.
4. m_{t,i} is the estimate of the minimum at step t.
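The rounds of Push-Min can be simulated directly; on a small illustrative topology (a path of four nodes, with made-up row-pivot values), every node's estimate reaches the global minimum within a number of rounds bounded by the network diameter:

```python
# Simulation of the Push-Min protocol: in every round, each node sends its
# current estimate to its neighbors and keeps the minimum of everything it
# has seen. The topology and row-pivot values below are illustrative.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path of four nodes
row_pivot = {0: 7.5, 1: 3.2, 2: 9.1, 3: 0.8}         # local minimum row ratios

m = dict(row_pivot)                                  # m[i] = node i's estimate
for _ in range(len(neighbors)):                      # diameter rounds suffice
    received = {i: [m[j] for j in neighbors[i]] for i in neighbors}
    m = {i: min([m[i]] + received[i]) for i in neighbors}

print(m)  # every node ends with the global minimum, 0.8
```

Note that all estimates are updated simultaneously in each round, mirroring the synchronous rounds of Algorithm 4.3.1.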

Once the Push-Min protocol converges, the node containing the minimum row-pivot_i (say P_min) sends its row of the simplex tableau to all other nodes in the peer-to-peer network. Each node Pi then updates its local tableau with respect to the extra row received from node P_min. This Constraint Sharing Protocol is described in Algorithm 4.3.2. Completion of one round of the CS-Protocol completes one iteration of the distributed simplex algorithm.

Algorithm 4.3.2 Constraint Sharing Protocol (CS-Protocol)
1. Node Pi performs the Push-Min protocol until no more messages are passed.
2. On convergence to the exact minimum, the minimum row-pivot_i is known to all nodes in the network.
3. All nodes use the row obtained in Step 2 to perform Gauss-Jordan elimination on their local tableaus.
4. At the end of Step 3, each node locally has the updated tableau and completes the current iteration of the simplex algorithm.

Termination: In a termination state, two things should hold: (1) no more messages traverse the peer-to-peer network, and (2) each local node has all its ci ≥ 0. Thus the state of the network can be described by the information possessed by each node. In particular, each node will have a solution to the linear programming problem, stored in X^i. Note that this solution is exactly the solution that would be obtained if all the constraints were centralized.

5 Case Study

Before describing the crawler we developed for our experiments, we first present a brief review of the structure of the Gnutella network and its protocol. Gnutella uses a two-tier architecture: a subset of nodes in the network are identified as ultrapeers and the rest are leaf nodes. Ultrapeers support large data transfers and remain in the network for a longer period of time compared to leaf nodes. Nodes communicate with each other via messages. The Gnutella handshake allows the exchange of information about the data shared between nodes, the names and sizes of the file(s) being shared, and a list of current neighbors. We studied the structure of the Gnutella network and the memory sharing among its nodes. Figure 2(b) shows the degree distribution of ultrapeers in the Gnutella network as collected by our crawler. Even though there is a lot of free riding in the network, we observed that the memory sharing behavior among ultrapeers (shown in Figure 2(a)) is very stable.

5.1 Crawler Description

Our crawler does a breadth-first exploration of the Gnutella nodes. At each node, it carries out a complete Gnutella handshake and reads the header information to retrieve the data of interest to us. The network was crawled on two days, March 28th and April 2nd, 2008, and we stored topology information for 424 nodes. Gnutella is a file-sharing network storing mainly music and video files; mining files of different modalities on a peer-to-peer network presents its own unique challenges, but this is beyond the scope of this work. We are primarily concerned with the process of resource optimization on a P2P network and its effect on data mining algorithms. The network structure information we gathered using our crawler was therefore fed into the Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County and downloadable from http://www.umbc.edu/ddm/Software/DDMT/, which was used to analyze the optimization problem we want to study. In the following section we present a description of the data files stored at each node.
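The crawler's breadth-first exploration can be sketched as follows; the `get_neighbors` stub over a hard-coded toy topology stands in for the real Gnutella handshake and header parsing:

```python
# Sketch of the crawler's breadth-first exploration. A real crawler would
# perform the Gnutella handshake in get_neighbors; here it is a stub over
# a hard-coded toy topology.
from collections import deque

toy_topology = {'n1': ['n2', 'n3'], 'n2': ['n1', 'n4'],
                'n3': ['n1'], 'n4': ['n2']}

def get_neighbors(node):
    """Stand-in for the handshake: return the node's neighbor list."""
    return toy_topology[node]

def crawl(seed):
    visited, queue, edges = {seed}, deque([seed]), []
    while queue:
        node = queue.popleft()
        for nb in get_neighbors(node):
            edges.append((node, nb))     # record topology information
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    return visited, edges

nodes, edges = crawl('n1')
print(sorted(nodes))  # ['n1', 'n2', 'n3', 'n4']
```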

5.2 The Dataset

The dataset at each node was downloaded from the Sloan Digital Sky Survey [17] and contains a large number of spectroscopic observations of objects in the sky. (P2P networks are being considered for use in the astronomy domain for distributing the load on centralized servers and giving astronomers access to their own personalized databases, called MyDB [14]; this motivated us to test our technique on an astronomy dataset.) For every observation we have four features, petroMag_u, petroMag_g, petroMag_r and petroMag_i, which are the Petrosian flux parameters in the u, g, r and i bands. Each object is labeled as either primary or secondary, and our objective is to classify previously unseen test data at each node into these two categories. The training set, consisting of 86,542 astronomical objects, is randomly partitioned among the 424 nodes in the network in proportion to the memory distribution collected from crawling the Gnutella network. The test set consists of 849 stars which need to be classified into

[Figure 2. Memory Sharing and Degree Distribution of Ultrapeers in a Gnutella network: (a) memory sharing among ultrapeers (number of nodes vs. memory shared, in kilobytes); (b) degree distribution (number of nodes with x neighbors vs. number of neighbors).]

one of the two classes mentioned above. Each node in the network builds a local model, a C4.5 decision tree (we used the C4.5 implementation available in the WEKA Toolkit 3.4.12), with its training data and tests the performance of the model on the test set.

5.3 Results

We present results produced in three different scenarios: (1) Centralized: all the data in the network is assumed to be at a central site. (2) Distributed: each node builds a local model and tests its accuracy on the test set described above; the mean accuracy of classification over all nodes in the network is reported. (3) Distributed (with simplex): the distributed resource optimization algorithm is run on the nodes and then each node builds a local model. Table 2 presents the results of classification on the test data set. Figures 3(a) and 3(b) present the distributions of the accuracy obtained by the nodes in the network. We notice that after executing simplex optimization, a larger number of nodes in the network have an accuracy close to the centralized case than if no optimization were run in the network.

Table 2. Accuracy of classification

Algorithm Type              Accuracy (%)
Centralized                 54.8881
Distributed                 51.63
Distributed (with Simplex)  51.87
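The centralized and distributed scenarios above can be mimicked on synthetic data; in this sketch, scikit-learn's CART trees stand in for C4.5 and even random splits stand in for the memory-proportional partitioning, so the numbers will not match Table 2:

```python
# Sketch of the centralized vs. distributed evaluation on synthetic data.
# scikit-learn's CART implementation stands in for C4.5, and the data,
# labels and 10-node partitioning are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))               # four flux-like features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # two synthetic classes
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Centralized: one model over all training data.
central = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc_central = central.score(X_test, y_test)

# Distributed: each "node" trains only on its local shard; report the mean
# test accuracy over all nodes.
shards = np.array_split(np.arange(len(X_train)), 10)
accs = [DecisionTreeClassifier(random_state=0)
        .fit(X_train[idx], y_train[idx]).score(X_test, y_test)
        for idx in shards]
acc_distributed = float(np.mean(accs))

print(round(acc_central, 3), round(acc_distributed, 3))
```

The gap between the two numbers plays the role of the gap between the first two rows of Table 2; the resource optimization step aims to narrow it.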

6 Conclusions

Peer-to-peer networks are expected to play an important role in the next generation of distributed systems. They are capable of storing large amounts of data of different modalities; however, resource management and extracting patterns from this data present formidable challenges. In this paper, we present a distributed algorithm for resource management on a static P2P network and test its effect on a classification task. This is significantly different from query routing protocols, which are mainly geared towards reducing the traffic caused by popular searches. Our results indicate that the accuracy of distributed algorithms after resource optimization can be comparable to mining on centralized data. Future directions of research involve developing resource optimization algorithms for dynamic networks and studying other data mining tasks, such as clustering and association rule mining, on P2P networks.

7 Acknowledgments

The authors would like to thank Dr. Hillol Kargupta, Dr. David Waltz and Dr. Ansaf Salleb-Aouissi for their valuable contributions during different parts of this project.

[Figure 3. Accuracy of classification in the P2P network: (a) histogram of accuracy of classification in the P2P network before simplex optimization; (b) histogram of accuracy of classification in the P2P network after simplex optimization.]

References

[1] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Search in power-law networks. Physical Review E, 64:046135, 2001.
[2] D. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer. SETI@home: An experiment in public-resource computing. Communications of the ACM, 45:56–61, November 2002.
[3] B. F. Cooper and H. Garcia-Molina. Ad hoc, self-supervising peer-to-peer search networks. ACM Transactions on Information Systems, 23(2):169–200, April 2005.
[4] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1963.
[5] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, special issue on distributed data mining, 10:18–26, 2006.
[6] http://www.gnutella.com/, Website.
[7] www9.limewire.com/developer/gnutella protocol 0.4.pdf, Website.
[8] S. Joseph and T. Hoshiai. Decentralized meta-data strategies: Effective peer-to-peer search. IEICE Transactions on Communications, E86-B(6):1740–1753, 2003.
[9] http://www.kazaa.com/us/index.htm, Website.
[10] D. Krivitski, A. Schuster, and R. Wolff. A local facility location algorithm for sensor networks. In Proceedings of the International Conference on Distributed Computing in Sensor Systems (DCOSS'05), Marina del Rey, CA, June–July 2005.
[11] http://www.limewire.com/, Website.
[12] M. Bawa, H. Garcia-Molina, A. Gionis, and R. Motwani. Estimating aggregates on a peer-to-peer network. Technical report, Stanford University, 2004.
[13] L. T. Nguyen, W. G. Yee, D. Jia, and O. Frieder. A tool for information retrieval research in peer-to-peer file sharing systems. In ICDE 2007: IEEE 23rd International Conference on Data Engineering, pages 1525–1526, Istanbul, Turkey, 2007.
[14] W. O'Mullane, N. Li, M. A. Nieto-Santisteban, A. S. Szalay, A. R. Thakar, and J. Gray. Batch is back: CasJobs, serving multi-TB data on the Web. Technical Report MSR-TR-2005-19, 2005.
[15] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A scalable content-addressable network. In SIGCOMM '01: Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 161–172, New York, NY, USA, 2001. ACM.
[16] A. Schuster, R. Wolff, and D. Trock. A high-performance distributed algorithm for mining association rules. In Third IEEE International Conference on Data Mining, Florida, USA, November 2003.
[17] Sloan Digital Sky Survey. http://www.sdss.org, Website.
[18] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In ACM SIGCOMM, pages 149–160, August 2001.
[19] D. Stutzbach, R. Rejaie, and S. Sen. Characterizing unstructured overlay topologies in modern P2P file-sharing systems. In IMC '05: Proceedings of the Internet Measurement Conference 2005, Berkeley, CA, USA, 2005. USENIX Association.
[20] A. L. Turinsky. Balancing Cost and Accuracy in Distributed Data Mining. PhD thesis, University of Illinois at Chicago, 2002.
[21] R. Wolff, K. Bhaduri, and H. Kargupta. Local L2 thresholding based data mining in peer-to-peer systems. In Proceedings of the 2006 SIAM Conference on Data Mining, Bethesda, MD, April 2006.
[22] D. Zeinalipour-Yatzi and T. Folias. A quantitative analysis of the Gnutella network traffic, 2002.
