
SmurfPDMS: A Platform for Query Processing in Large-Scale PDMS

Katja Hose, Christian Lemke, Jana Quasebarth, Kai-Uwe Sattler
Department of Computer Science and Automation, TU Ilmenau
P.O. Box 100565, D-98684 Ilmenau, Germany

Abstract: As Peer Data Management Systems (PDMS) are a focus of current research, many approaches to problems such as query processing and routing have to be evaluated. Since there is no common platform, approaches are evaluated separately. This is disadvantageous for research groups in two ways. First, building a simulation environment from scratch takes a huge effort. Second, it makes a direct comparison of approaches more difficult. In this paper, we present SmurfPDMS, an extensible system that aims to provide a common platform into which researchers can easily integrate their approaches and that allows for running large simulation experiments in distributed environments such as workstation clusters or even PlanetLab.

1 Introduction

Peer Data Management Systems (PDMS), also known as schema-based P2P systems, are an important area of recent and current research. Emerging from federated database systems and applying the P2P paradigm, PDMS have to cope with the challenges that come with peer autonomy. This means that all peers are equal in terms of issuing and processing queries, each peer owns its private local data, and each peer might have a local schema that is unique in the whole network. Furthermore, each peer can only communicate with those neighbor peers to which mappings exist. This, together with the fact that we consider unstructured P2P systems as the basis for PDMS, requires efficient distributed query processing strategies that do not need any kind of global knowledge. In contrast to structured P2P systems like Chord [SMK+ 01], PDMS do not have global indexes or hash functions that could help us find the data we are looking for. Since we are not allowed to rearrange the peers’ data either, we have to find other ways to route queries efficiently through the network. A common means of doing this are routing indexes [CGM02], which can be used to identify which neighbors hold data that matches the query. Although aspects like routing indexes, query processing strategies, and dynamic behavior strongly influence each other, they are usually considered independently by different research groups. The use of different platforms, implementations, and assumptions hampers a direct comparison of similar concepts. Additionally, many existing simulators like ns-2 (http://www.isi.edu/nsnam/ns/) are too low-level for simulating PDMS appropriately. Most environments have another severe drawback: their lack of documentation and extensibility. Furthermore, they often do not have an intuitive user interface, let alone a graphical one, that might allow outsiders to configure and use the system.
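To make the idea of a routing index more concrete, the following is a minimal sketch in Python (chosen only for brevity; it is not SmurfPDMS code). It assumes a keyword-count summary per neighbor, which is a simplification of the routing indexes in [CGM02]; the class name `RoutingIndex` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingIndex:
    """Per-peer summary of which keywords are reachable via each neighbor.

    Hypothetical structure for illustration only; real routing indexes
    summarize data in more compact and more varied ways.
    """
    # neighbor id -> {keyword -> number of matching items reachable}
    summaries: dict[str, dict[str, int]] = field(default_factory=dict)

    def add(self, neighbor: str, keyword: str, count: int) -> None:
        """Record that `count` items for `keyword` are reachable via `neighbor`."""
        per_neighbor = self.summaries.setdefault(neighbor, {})
        per_neighbor[keyword] = per_neighbor.get(keyword, 0) + count

    def rank_neighbors(self, query_keywords: list[str]) -> list[tuple[str, int]]:
        """Return neighbors that may hold matches, best candidates first."""
        scored = []
        for neighbor, summary in self.summaries.items():
            # A neighbor is relevant only if every query keyword is reachable.
            counts = [summary.get(k, 0) for k in query_keywords]
            if counts and min(counts) > 0:
                scored.append((neighbor, min(counts)))
        # Sort by descending score, then by name for deterministic output.
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A peer would forward a query only to the neighbors returned by `rank_neighbors`, so no global knowledge is needed; neighbors whose summary lacks one of the query keywords are pruned.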
In this paper, we do not present yet another PDMS alongside systems like Piazza [TIM+ 03]; instead, we present SmurfPDMS (SiMUlation enviRonment For PDMS), a common platform for simulating PDMS. Thanks to its architecture, it can easily be extended, which enables a direct comparison of approaches in the same environment with the same setup. The remainder of this paper is structured as follows. Section 2 discusses architectural issues, and Section 3 points out SmurfPDMS’ features and usage. Finally, Section 4 outlines what we are planning to show at the conference.

2 Architecture

Running SmurfPDMS as a distributed simulator enables the user to combine the computational power of several computers. For this purpose, each participating computer (e.g., a number of PlanetLab nodes (http://planet-lab.org)) has to run its own instance of SmurfPDMS. Hence, we have to distinguish between (i) the simulated PDMS network and (ii) the network that is formed by the participating computers. Each SmurfPDMS instance, whether running on the local or on a remote computer, has the same capabilities and can act either as coordinator or as participant; the coordinator is always the instance at which the user starts the simulation. The simulated PDMS network is determined at startup by the coordinator according to the user-defined configuration. The peers of the simulated network are then divided into partitions. Each partition is assigned to one of the participants to distribute the load among the computers. In summary, the most important tasks of a coordinator are:
• Determine the setup, calculate a network topology, assign the peers to the participants, calculate data partitions, etc.
• Simulate communication and processing delays
• Choose queries and determine peers to initiate them
• Have peers update their local data
• Determine peers to leave or join the network
• Synchronize all participants
• Log events, messages, results, etc.
• Collect statistics and create diagrams

SmurfPDMS implements several central concepts: managers that control the simulation and the communication, messages for information exchange, class hierarchies for straightforward extensibility, and configuration files and logging to achieve repeatability. SmurfPDMS has three logical layers that we sketch in the following.

Figure 1 component labels: Union, Select, QueryRewriting Algorithm, BucketAlgorithm, TopN, Skyline, Mapping Definition, Neighbor, Query, PathIdx, PeerObject, RoutingIndex, QueryProcessing Strategy, Message Shipping, Incremental Message Shipping, QTree, LoggingOn, EventStrategy, DataUpdate
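The assignment of simulated peers to participating computers can be illustrated with a short Python sketch. The paper does not state how partitions are formed, so the round-robin policy below is purely an assumption, and the function name `partition_peers` is hypothetical.

```python
def partition_peers(peer_ids, participants):
    """Assign simulated peers to participating computers round-robin.

    Round-robin is an assumed policy for illustration; it keeps partition
    sizes within one peer of each other but ignores the network topology.
    """
    if not participants:
        raise ValueError("at least one participant is required")
    partitions = {name: [] for name in participants}
    for position, peer_id in enumerate(peer_ids):
        owner = participants[position % len(participants)]
        partitions[owner].append(peer_id)
    return partitions
```

For example, `partition_peers(range(7), ["coordinator", "node-a", "node-b"])` places peers 0, 3, and 6 on the coordinator, 1 and 4 on `node-a`, and 2 and 5 on `node-b`. A topology-aware policy that keeps neighboring peers on the same computer would reduce traffic between instances, at the cost of a more involved assignment step.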

Network Layer. All communication takes place via this layer; every message is sent using TCP/IP. The configuration of an instance, such as its listening port, its instance name, and the list of known computers, is held by the Network Manager. From the list of available computers, the user selects those that will participate in the simulation.

Simulation Layer. After the simulation has been started, the coordinator’s Startup Manager determines the initial setup by calculating a network topology, peer objects (representing peers of the simulated network), etc. The simulator currently provides 8 different algorithms for topology generation (including one that reads the topology from existing config files for the sake of repeatability); others can be added easily. Local and Remote Peer Managers provide an appropriate interface for the communication between peers so that the simulation engine does not have to care on which computer the receiving peer object is actually located. The statistics component gathers statistics during the simulation. After the simulation has finished, all participants send their local statistics to the coordinator, which calculates the total statistics, serializes them, and creates diagrams (Figure 3). The Simulation Layer in particular is affected when new approaches are to be integrated into the system. Figure 1 illustrates the Simulation Core with the entry points for query operators, query processing strategies, query rewriting algorithms, routing indexes, and event processing strategies.

Figure 1: Simulation Core Architecture

GUI and Configuration Layer. This layer operates on top of the other two and makes working with the simulator comfortable (Figure 2). It visualizes the selection of participating computers and can be used to set all necessary configuration parameters. It also visualizes the simulation itself by displaying the network with peers, links, and messages, and reveals detailed information when the user clicks on the corresponding symbols. The same window enables the user to control the simulation (issue a query, start, halt, or stop the simulation). Finally, the GUI Layer illustrates the simulation statistics (Figure 3).
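The location transparency that the Local and Remote Peer Managers give the simulation engine can be sketched as follows. This is an illustrative Python outline, not the SmurfPDMS interface: the class and method names are hypothetical, and the remote side only records what would be sent instead of using the Network Layer.

```python
class LocalPeerManager:
    """Delivers messages to peer objects hosted by this instance."""

    def __init__(self):
        self.inboxes = {}  # peer id -> list of delivered messages

    def add_peer(self, peer_id):
        self.inboxes[peer_id] = []

    def hosts(self, peer_id):
        return peer_id in self.inboxes

    def deliver(self, peer_id, message):
        self.inboxes[peer_id].append(message)


class RemotePeerManager:
    """Stands in for peers hosted elsewhere; records what would be sent."""

    def __init__(self, locations):
        self.locations = dict(locations)  # peer id -> participant name
        self.outgoing = []                # (participant, peer id, message)

    def deliver(self, peer_id, message):
        # A real implementation would serialize the message and hand it to
        # the network layer; in this sketch it is only queued.
        self.outgoing.append((self.locations[peer_id], peer_id, message))


class MessageRouter:
    """Lets the simulation engine send without knowing where a peer lives."""

    def __init__(self, local, remote):
        self.local = local
        self.remote = remote

    def send(self, peer_id, message):
        manager = self.local if self.local.hosts(peer_id) else self.remote
        manager.deliver(peer_id, message)
```

The engine only ever calls `send`; whether the receiving peer object lives in the same instance or on another computer is decided inside the router.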

Figure 2: Simulation Window (labeled elements: Control Panel, Peer, Link, Message)

Figure 3: Statistics Window and Diagrams Resulting from Test Series: (a) Statistics Window, (b) Example Result Diagrams

3 Distinguishing Characteristics

As mentioned above, SmurfPDMS aims to provide a common platform that multiple research groups can use to evaluate their approaches and, with little effort, to compare them with other approaches and study how they interact. The following aspects can currently be evaluated using SmurfPDMS: (i) Query Routing, (ii) Query Processing Strategies, (iii) Routing Indexes, (iv) Mapping Definitions, and (v) Query Rewriting. In [HJKS06] we have already demonstrated the first three aspects. Recently, we have completed the last two. Additionally, we enhanced the simulation of data updates, enabled SmurfPDMS to run test series automatically, and integrated the generation of gnuplot files. Of course, the parameters of these features can be adapted to the user’s needs. The simulator could also be extended to support the evaluation of strategies that change the network topology, e.g., in order to build semantic clusters. Furthermore, schema matching techniques could also be integrated. In summary, the characteristics that distinguish SmurfPDMS from other systems are:
• extensibility in a variety of aspects,
• evaluation of approaches not only separately but also in interaction with others,
• comfortable configuration and control thanks to the graphical component,
• a whole variety of algorithms for repeating tests and for generating initial setups,
• a powerful evaluation component that allows for running whole test series and creating the corresponding diagrams in gnuplot format, which can be converted into various formats (fig, eps, pdf, etc.),
• platform independence by using Java as programming language,
• scalability using PlanetLab as platform.

Finally, let us emphasize that SmurfPDMS was especially designed for the needs and particularities of PDMS. With XML as native data format, SmurfPDMS integrates another interesting aspect of current research.
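As an illustration of what generating gnuplot files from a test series might involve, the following Python sketch builds the text of a data file and a matching gnuplot script. It is an assumption about the general shape of such output, not the format SmurfPDMS actually writes; the function name `gnuplot_files` and the file names `results.dat` and `results.eps` are hypothetical.

```python
def gnuplot_files(title, x_label, y_label, series):
    """Build the text of a gnuplot data file and a matching script.

    `series` maps a strategy name to a list of (x, y) measurements; all
    strategies are assumed to share the same x values in the same order.
    """
    names = sorted(series)
    xs = [x for x, _ in series[names[0]]]
    # First line is a gnuplot comment naming the columns.
    data_lines = ["# " + " ".join([x_label] + names)]
    for row, x in enumerate(xs):
        values = [str(series[name][row][1]) for name in names]
        data_lines.append(" ".join([str(x)] + values))

    # Column 1 holds x; columns 2.. hold one strategy each.
    plots = [
        f'"results.dat" using 1:{col} with linespoints title "{name}"'
        for col, name in enumerate(names, start=2)
    ]
    script_lines = [
        "set terminal postscript eps",
        'set output "results.eps"',
        f'set title "{title}"',
        f'set xlabel "{x_label}"',
        f'set ylabel "{y_label}"',
        "plot " + ", \\\n     ".join(plots),
    ]
    return "\n".join(data_lines) + "\n", "\n".join(script_lines) + "\n"
```

Writing the two returned strings to `results.dat` and a script file and running gnuplot on the script would produce an EPS diagram with one curve per strategy; other output formats only require a different `set terminal` line.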

4 Demonstration

At the demonstration site we want to show how to work with SmurfPDMS and how it can help research groups conduct experiments. This includes all steps, starting with configuration and ending with the evaluation of statistics. In contrast to [HJKS06], we want to emphasize the distinction between (i) running predefined test series and (ii) using SmurfPDMS interactively in single-step mode, where the user can monitor the execution of each message, each rewriting step, and each event. We will also show how to set up a configuration, how to compare different strategies (query processing, routing indexes, etc.), and how to evaluate them properly. Finally, we will show how to create evaluation diagrams for test series that can serve as evaluation results for publications without further adaptations.
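The difference between single-step mode and running a whole simulation can be pictured with a small discrete-event loop. The Python sketch below is only an illustration under the assumption that messages are delivered in order of their simulated arrival time; the class `StepSimulator` and its methods are hypothetical and do not reflect the SmurfPDMS implementation.

```python
import heapq
import itertools


class StepSimulator:
    """Delivers scheduled messages in timestamp order, one per step."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._order = itertools.count()  # tie-breaker for equal timestamps
        self.log = []

    def schedule(self, delay, sender, receiver, payload):
        """Queue a message that arrives `delay` time units from now."""
        arrival = self.now + delay
        heapq.heappush(
            self._queue, (arrival, next(self._order), sender, receiver, payload)
        )

    def step(self):
        """Deliver the next message; return it, or None if nothing is left."""
        if not self._queue:
            return None
        arrival, _, sender, receiver, payload = heapq.heappop(self._queue)
        self.now = arrival
        event = (arrival, sender, receiver, payload)
        self.log.append(event)
        return event

    def run(self):
        """Process all remaining messages without pausing."""
        while self.step() is not None:
            pass
        return self.log
```

Calling `step()` repeatedly corresponds to monitoring one message at a time, while `run()` corresponds to executing a test series to completion; the simulated delay passed to `schedule` stands in for communication and processing delays.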

References

[CGM02] A. Crespo and H. Garcia-Molina. Routing Indices For Peer-to-Peer Systems. In ICDCS ’02, page 23, 2002.
[HJKS06] Katja Hose, Andreas Job, Marcel Karnstedt, and Kai-Uwe Sattler. An Extensible, Distributed Simulation Environment for Peer Data Management Systems. In EDBT 2006, pages 1198–1202, 2006.
[SMK+ 01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM 2001, pages 149–160, 2001.
[TIM+ 03] I. Tatarinov, Z. Ives, J. Madhavan, A. Halevy, D. Suciu, N. Dalvi, X. Dong, Y. Kadiyska, G. Miklau, and P. Mork. The Piazza Peer Data Management Project. SIGMOD Record, 32(3):47–52, 2003.