Fuzzy Cost Modeling for Peer-to-Peer Systems

Wee Siong Ng¹, YanFeng Shu¹, Bo Ling²

¹ Department of Computer Science, National University of Singapore, Singapore
{ngws, shuyanfe}@comp.nus.edu.sg

² Department of Computer Science and Engineering, Fudan University, China
[email protected]

Abstract. The objective of a query optimizer is to select a good execution plan for a given query. In a distributed system, it is crucial for a query optimizer to have effective remote cost estimation in order to make satisfactory decisions. There are two categories of cost estimation models: static cost models and dynamic cost models. Unfortunately, both suffer from several limitations. First, a static cost model is not capable of reflecting real-time conditions. Second, a dynamic cost model is not scalable because of its extensive probe queries. Third, neither is designed for ad-hoc systems such as P2P, since the dynamism of peers is not taken into consideration. In this paper, we first propose a progressive "push-based" approach to remote cost monitoring. We derive a generic static cost model from the conventional static approach. Agents are sent to remote hosts together with the generic cost model and epsilons (ε) indicating the tolerated magnitude of cost change, i.e., the percentage of coefficient change. An update is sent (pushed) to the originating host once the magnitude of the cost change exceeds ε. Second, we introduce a fuzzy cost evaluation metric, in addition to traditional evaluation criteria, for handling the dynamism of P2P systems. This metric gives a confidence measure of a peer's reliability. Our contributions address several of the issues concerning cost modeling in environments of dynamic peers.

1 Introduction

Peer-to-Peer (P2P) computing has opened up a new area of research in networking and distributed computing. It has been studied extensively in recent years, partly due to the popularity of Napster [2], which caught the attention of millions of Internet users. Such systems are inexpensive, easy to use, highly scalable and do not require central administration. Despite the advantages offered by P2P technology, it poses many novel challenges for the research community. Estimating remote cost in order to produce an effective query plan is one of the crucial challenges. Query optimization is vital in any distributed system because executing a query at remote sites may be very expensive due to high connection overhead or heavy workload at the remote hosts. Different decomposition strategies may significantly influence the performance of query processing in such environments.

Intuitively, a P2P system is a program that integrates different data sources from multiple remote nodes and forms a virtual, resource-rich community. In Breadth-First-Traversal (BFT) systems such as Gnutella [1] and BestPeer [2], each peer that receives a query propagates it to all of its neighbors up to a maximal number of hops. These types of systems do not require query optimization; the only control criterion is the TTL (Time-to-Live) that determines the termination of queries. PeerOLAP [11] extends BestPeer to support complex queries, i.e., OLAP [10] processing. PeerOLAP acts as a large distributed cache for OLAP results by exploiting underutilized peers. Its query processing is composed of two phases: resource location and query routing. In the resource location phase, the initiating peer decomposes a query into chunks and broadcasts the request for the chunks in a fashion similar to Gnutella. Unlike Gnutella and BestPeer, however, PeerOLAP employs query optimization in the second phase. It defines a cost model that optimizes the cost of routing a query to the next peer so as to minimize response time and bandwidth consumption. Its cost model is nevertheless static, and thus cannot accurately reflect the ad-hoc characteristics of a P2P environment.

A number of works in the literature focus on cost-based query optimization over distributed information sources. Broadly, they can be categorized into static cost models and dynamic cost models. A static cost model derives a generic model based on calibration [6][8], sampling [5] or a statistical approach [7]. Such a model (i.e., its coefficients) is seldom changed once it has been derived, and all parameters used for cost estimation are known a priori, before the execution of the plans. The underlying assumption is a static environment in which the workload of remote hosts does not change dynamically. In a heterogeneous and dynamic environment such as the Internet or a P2P network, it is unrealistic to assume that every site has the same query execution capabilities. Moreover, the workload of a host may increase dynamically as more concurrent tasks are supported, and as a result the execution cost increases significantly. Hence, cost estimation techniques designed for static environments cannot be migrated directly to dynamic environments without major modification. A dynamic cost model using a probe-based optimization strategy is proposed in [3]; it is capable of reflecting more accurate real-time cost statistics of remote hosts. However, this approach is not scalable due to the extensive increase in the number of probe queries.

Furthermore, existing cost models employ a discrete function as their cost measurement metric, which we claim is not appropriate for ad-hoc P2P systems. As an example, assume there are two peers in the system, Peer1 and Peer2, both sharing similar resources, and let PeerQ be the peer that initiates a query. Assume the cost of executing a query q at Peer1 and Peer2 is 5 and 6 units, respectively. Given these discrete cost values, PeerQ expects the query to be executed more cheaply at Peer1 than at Peer2. In reality, however, Peer1 might be unreliable, e.g., it might disconnect from the network more frequently than Peer2, and this is not reflected in the discrete cost model. As a result, PeerQ may pay a higher cost due to the disconnection of Peer1, i.e., it must resubmit the query to Peer2 when it notices the failure of Peer1.
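To make this example concrete, the sketch below uses hypothetical disconnection probabilities and a simple resubmission model (both are illustrative assumptions, not part of our cost model) to show how the peer that looks cheaper under a discrete cost metric can turn out to be the more expensive choice once reliability is considered.

# Hypothetical illustration: discrete cost vs. cost adjusted for peer reliability.
# The figures (5 and 6 cost units, disconnection probabilities) are assumptions
# chosen to mirror the Peer1/Peer2 example in the text, not measured values.

def expected_cost(exec_cost, p_disconnect, fallback_cost):
    """Expected cost when a disconnection forces resubmission to a fallback peer."""
    return (1 - p_disconnect) * exec_cost + p_disconnect * (exec_cost + fallback_cost)

peer1 = {"cost": 5, "p_disconnect": 0.4}   # cheaper, but frequently offline
peer2 = {"cost": 6, "p_disconnect": 0.05}  # slightly dearer, but stable

# If Peer1 fails, PeerQ resubmits the query to Peer2 (and vice versa).
e1 = expected_cost(peer1["cost"], peer1["p_disconnect"], peer2["cost"])
e2 = expected_cost(peer2["cost"], peer2["p_disconnect"], peer1["cost"])

print(f"Peer1: discrete cost 5, expected cost {e1:.2f}")  # 0.6*5 + 0.4*(5+6) = 7.40
print(f"Peer2: discrete cost 6, expected cost {e2:.2f}")  # 0.95*6 + 0.05*(6+5) = 6.25

Under the discrete model PeerQ would choose Peer1, yet the expected cost including resubmission favors Peer2; this is precisely the gap that the fuzzy Reliability metric introduced later is intended to close.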

Based on the above observations, we propose solutions to the problem of acquiring remote cost statistics from multiple data sources in a dynamic and heterogeneous P2P environment. First, we follow a different approach centered on a "push-based" mechanism. The approach relies on agents that monitor cost at the remote data sources and perform progressive cost updating. We first derive a generic static cost model from the conventional approaches [6][8]. Each agent is sent to a remote host together with the generic cost model and an epsilon (ε) indicating the tolerated magnitude of cost change, i.e., the percentage of coefficient change. An update is sent (pushed) to the originating host once the magnitude of the cost change exceeds ε. Consequently, the query optimizer has near-real-time cost information (i.e., an estimate based on the coefficients and ε) about remote peers without heavily probing for it; a minimal sketch of this mechanism is given at the end of this section. Second, we propose a fuzzy cost evaluation metric in addition to traditional evaluation criteria such as CPU, I/O and communication costs. Unlike the conventional discrete cost model, which uses absolute cost values such as 5 units of processing cost, the fuzzy cost is described in terms of the possibilities of occurrences of events. Since it is possible to estimate a maximal processing time for a query based on historical statistics or a sampling technique, the query optimizer needs to ensure that it chooses only peers with a certain degree of reliability, i.e., those that remain accessible during the period of query processing. In this regard, we propose a fuzzy Reliability metric to describe the reliability of remote peers. The objective of the query optimizer is therefore to find the best plan that minimizes the processing cost while maximizing reliability. As a result, it minimizes the performance degradation caused by remote peer failures, which would otherwise lead to query resubmission.

The rest of the paper is organized as follows. Section 2 gives a review of related work. In Section 3, we propose our model for processing-cost estimation with ε, and introduce the TM-H structure and the formulation of the Reliability fuzzy set to estimate the costs associated with the dynamism of peers. Several issues concerning the implementation of the model and optimization solutions are discussed in Section 4. Finally, we conclude in Section 5.
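The following minimal sketch illustrates the push-based monitoring idea. The class name CostAgent, the coefficient names and the push_update callback are our illustrative choices, not a prescribed interface.

# Minimal sketch of an epsilon-triggered, push-based cost monitoring agent.
# The coefficient names and the callback interface are illustrative assumptions.

from typing import Callable, Dict

class CostAgent:
    """Runs at a remote peer; pushes coefficient updates to the originating host
    only when the relative change of some coefficient exceeds epsilon."""

    def __init__(self, coefficients: Dict[str, float], epsilon: float,
                 push_update: Callable[[Dict[str, float]], None]):
        self.reported = dict(coefficients)   # last values known to the origin host
        self.epsilon = epsilon               # e.g. 0.10 for a 10% change threshold
        self.push_update = push_update       # e.g. a network send back to the origin

    def observe(self, current: Dict[str, float]) -> None:
        """Called after local recalibration; decides whether to push an update."""
        drift = max(abs(current[k] - self.reported[k]) / abs(self.reported[k])
                    for k in self.reported)
        if drift > self.epsilon:
            self.push_update(current)        # origin refreshes its remote cost model
            self.reported = dict(current)

# Example: coefficients of a generic cost model (CPU, I/O, communication terms).
agent = CostAgent({"c_cpu": 0.02, "c_io": 0.15, "c_msg": 1.2},
                  epsilon=0.10,
                  push_update=lambda c: print("push", c))
agent.observe({"c_cpu": 0.021, "c_io": 0.155, "c_msg": 1.25})  # < 10% drift: no push
agent.observe({"c_cpu": 0.03, "c_io": 0.15, "c_msg": 1.2})     # 50% drift: push sent

Between pushes, the originating host keeps estimating with the last received coefficients, so the estimation error stays bounded by ε until the next update arrives.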

2 Related Work

Early work on cost-based query optimization over distributed information sources [6][8] is based on a calibration approach in which the coefficients of a generic cost model are deduced for each local site by a calibrating procedure. The generic cost model can therefore be customized to a specific class of systems through different coefficients. Zhu et al. [5] derive local cost models based on sampling queries run against the underlying database systems. Naacke et al. [9] suggest combining a generic cost model with specific cost information exported by the local database system wrapper. Adali et al. [7] propose a cost estimation strategy based on a statistics cache: statistics are cached for each actual call to the sources, and the cost of possible plans can then be estimated from the previously cached statistics. Shahabi et al. [3] propose a probe-based optimization strategy: in order to obtain costs from k participating sites, k × (k − 1) probe queries are generated among the k sites. All of these systems except [3] assume that they operate in a static environment, where the cost or workload of a participating host does not change dramatically over time. The approach proposed in [3] is able to reflect more real-time cost statistics of remote hosts, but it is not scalable due to the extensive increase in the number of probe queries. In addition, none of these models takes the dynamism of peers into consideration.

C. Olston and J. Widom have proposed a precision-performance tradeoff architecture (TRAPP) for aggregation over numeric data [14]. The algorithm works as follows: a cache stores a bounded value [LA, HA] in which the precise answer is guaranteed to fall. Suppose a bounded answer to an aggregation query is computed from cached values, but the answer does not satisfy the user's precision constraint, i.e., the answer bound is too wide. In this case, some data must be refreshed from the sources to improve precision. Although TRAPP is the work most similar to our progressive remote cost monitoring, the fundamental architectures are different: ours adopts a "push-based" mechanism, whereas TRAPP employs a "pull-based" architecture.

Complex P2P query systems have been proposed in the literature recently, employing different query location schemes. PeerOLAP [11] and PeerDB [16] are Breadth-First-Traversal based systems that support OLAP and other database applications. Chord [15] and CAN [17] propose implicit-binary-tree and d-dimensional-space schemes, respectively, for query location. Even though these systems differ in their query location strategies, our model is scheme-independent and can be applied directly to implement query optimization on top of these systems.
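For illustration only, the sketch below captures the pull-based flavor of such a bounded-answer refresh. The cache layout, the widest-bound-first refresh choice and the refresh_from_source helper are our simplifications and do not reproduce TRAPP's actual refresh-selection algorithm.

# Simplified sketch of a pull-based bounded-answer check in the spirit of TRAPP.
# The cache layout and refresh heuristic are illustrative assumptions only.

def answer_bound(cached_bounds):
    """Bound on a SUM aggregate computed purely from cached [low, high] values."""
    low = sum(lo for lo, hi in cached_bounds)
    high = sum(hi for lo, hi in cached_bounds)
    return low, high

EXACT_VALUES = {0: 12.0, 1: 7.5, 2: 20.0}          # hypothetical source data

def refresh_from_source(i):
    """Placeholder for pulling the exact value of item i from its source."""
    exact = EXACT_VALUES[i]
    return (exact, exact)

cache = [(10.0, 15.0), (7.0, 8.0), (18.0, 25.0)]   # cached bounds per item
precision = 5.0                                    # user's maximum bound width

low, high = answer_bound(cache)
# Pull (refresh) the widest cached bound until the answer bound is tight enough.
while high - low > precision:
    i = max(range(len(cache)), key=lambda j: cache[j][1] - cache[j][0])
    cache[i] = refresh_from_source(i)
    low, high = answer_bound(cache)

print(low, high)   # the bound now satisfies the precision constraint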

3 Cost Model

In this section, we begin by defining the problem of query optimization in the P2P environment. We then describe a mechanism for progressive remote cost monitoring. Finally, we propose our Reliability fuzzy metric for handling the dynamism of peers.

3.1 Problem Statement

Suppose there is a peer P that has n identifiers in its buddy list¹. These n peers are dynamic, i.e., they are allowed to leave and rejoin the network at any time. Consider a query q that requires a resource R to be processed, and assume there are k peers that share resource R, where k ≤ n. The objective is to select the one peer among the k that provides the minimal query response time and offers the highest confidence that it will not be offline during the period of query processing.

¹ We use buddy list to denote a list of remote peers' identifiers that can be contacted directly for resource acquisition, as in applications such as ICQ [3]. The identifiers can be obtained from auxiliary sources such as the web or through e-mail.

The solution to the problem is two-fold, which is the focus of this paper. First, we would like to perform a quick filtering on the k peers; select the top-i (i
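Under one possible reading of this objective, the peer selection can be sketched as follows. The scoring rule that divides estimated cost by the fuzzy Reliability membership is our simplification for illustration, not the formulation developed in the remainder of this section.

# Sketch of the selection objective: among the k peers holding resource R,
# prefer low estimated cost and high fuzzy Reliability membership.
# The cost/reliability ratio is our simplification, not the paper's exact model.

def pick_peer(candidates):
    """candidates: list of (peer_id, estimated_cost, reliability), where
    reliability is a fuzzy membership value in (0, 1]."""
    return min(candidates, key=lambda c: c[1] / c[2])

peers = [
    ("Peer1", 5.0, 0.55),   # cheap but frequently disconnected
    ("Peer2", 6.0, 0.95),   # slightly more costly, rarely disconnected
]
best = pick_peer(peers)
print(best[0])  # -> Peer2: 6/0.95 ≈ 6.32 beats 5/0.55 ≈ 9.09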
