
Data Storage Requirements for the Service Oriented Computing

Thomas Risse, Predrag Knežević
Fraunhofer IPSI, Integrated Publication and Information Systems Institute
Dolivostrasse 15, 64293 Darmstadt, Germany
{risse|knezevic}@ipsi.fhg.de

Abstract
Service oriented computing and peer-to-peer (P2P) computing are emerging technologies which provide, from the architectural point of view, scalability and flexibility. From their properties both architectures are quite similar, e.g. in both architectures nobody has a complete system overview and no central administration exists. Besides other challenges, this leads to the question how data can be stored in a reliable way in the P2P and service oriented environments. In this paper we analyze and discuss the requirements as well as the challenges and give an outlook on our research activities which address the identified problems.

Keywords: Services, Peer-to-Peer Computing, Data Management, Storage, XML

1. Introduction
Service oriented computing is seen as the enabling technology for future applications in different areas, e.g., E-Business and GRID computing. The benefit of services is that they are self-contained and self-described [35]. Hence they provide a high degree of flexibility in their usage. Applications built on services do not necessarily need to rely on specific components. They can choose among several service providers with the required functionality and select the one with, e.g., the best price, the highest reliability or other parameters. The possibility to select services on demand makes them interesting, e.g., for building E-Business applications with dynamic supply chains or for distributed problem solving, which requires computational resources on demand. GRID and peer-to-peer (P2P) computing are also based on services. In the GRID computing domain, services are provided for resource sharing, e.g., to share CPU cycles or storage capacities [26]. Current developments in the GRID domain will also provide databases as a resource.

The idea of P2P computing is based on the service aspect. Peers provide some services to the community such that they can be consumed by other peers. Popular examples are P2P file sharing systems [6, 4, 5]. Peers in file sharing systems provide the content of a local directory as a service to the public, and any other peer is able to access it. Providing a service does not implicitly mean that the service results are stored on the side of the service. If an application calls a service function, the service returns a result object. This object has to be passed to the next service or has to be stored somewhere. A possible approach is to store result objects locally. In this case, the application needs to have a local database, and it must control the object processing and maintain the database system. Why not use a database service instead? In this paper, we discuss this question, the resulting challenges and possible solutions in more detail.

The paper is organized as follows. In Section 2 we first describe services and their relationship to P2P computing. Afterwards, P2P data stores are introduced in Section 3. Section 4 gives an overview of related research activities. The current research at Fraunhofer IPSI is described in Section 5. Finally, conclusions and future work are presented.

2. Overview of Service-oriented and P2P Computing

2.1. Service oriented computing
Several service definitions exist in the literature, e.g., [35][24]. In general, services provide one or more functionalities (e.g., calculator, route planning) to consumers. A consumer can be anybody that needs a specific functionality: a client application requesting some external support, or another service. The client can be of any system type, from a small PDA or digital camera to a large mainframe. Important service properties are self-containment and self-description. In other words, services are highly independent and flexible. They describe their functionalities, interfaces and other properties by themselves, and this meta-information is published to a service directory. When a consumer wants to find an appropriate service, it contacts the service directory. The service directory discovers a service that matches the consumer's request. With the exact service description, the consumer is able to adapt its access to the service according to the interface description.

From the self-containment of services some implicit properties can be derived. Services are autonomous in their decisions. They are free to refuse requests. Furthermore, they are able to suspend their functionality without notice. Such behavior makes them unreliable to consumers. The administration is done by the service owner. There is no central administration. In addition, nobody has a complete overview of all provided services. Each consumer has only a partial view of the complete system. Also, services do not provide any fault-tolerance mechanism, e.g., for the case that a communication channel is down. So the usage of transaction monitors is necessary.

Prominent examples of services are Web services. Web services use the available infrastructure of the Internet as the communication medium. They are built on a wide range of standards or proposed standards for service description and high-level communication protocols. All data formats use the XML [3] encoding. Services are described by WSDL [8] and published with the help of UDDI [7]. The individual service calls use SOAP [2]. For the implementation of distributed workflows, BPEL [10] has been proposed as a new standard. With the introduction of a transaction protocol [25], some work on the reliability of Web service architectures has also been done. The strict focus on open Internet standards is an important difference to other approaches to distributed computing like CORBA [11] or DCE [19].
Since only the strict interface details are published, Web services are neutral to the programming language, the programming model and the underlying operating system. Another emerging service framework is the Open Grid Services Architecture (OGSA) [26], which itself is based on the previously described Web services. The main goal of the OGSA is to provide system resources like CPU cycles or storage capacity to other GRID users. But it is also planned to use the OGSA to provide high-level resources like databases [36].
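The publish, discover and select cycle described in this section can be illustrated with a minimal sketch. It is not an implementation of UDDI or WSDL: the classes ServiceDescription and ServiceDirectory, the function select_cheapest, and the fields they use are hypothetical names introduced only for this example, and the directory is a simple in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDescription:
    """Self-description that a service publishes (hypothetical structure)."""
    name: str
    endpoint: str
    operations: frozenset
    price: float


@dataclass
class ServiceDirectory:
    """In-memory stand-in for a service directory; not a real UDDI client."""
    entries: list = field(default_factory=list)

    def publish(self, description):
        # A service registers its own description with the directory.
        self.entries.append(description)

    def discover(self, operation):
        # Return every published service that offers the requested operation.
        return [d for d in self.entries if operation in d.operations]


def select_cheapest(directory, operation):
    """Consumer-side choice: the matching service with the lowest price."""
    candidates = directory.discover(operation)
    if not candidates:
        return None
    return min(candidates, key=lambda d: d.price)
```

Price is only one possible selection criterion; reliability or any other published parameter could be used as the key in the same way.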

2.2. P2P Computing
In P2P environments, systems are no longer divided into thin clients and thick servers. In a P2P system, every node (peer in P2P terminology) has, a priori, an equal status. This means that a peer offers services or resources to the community, but at the same time, it can consume services or resources from others in the system. One central property of P2P systems is that they do not have a central administration. Furthermore, P2P architectures are highly dynamic. Peers can join and leave the system at any time. Hence, none of the peers has a global system view or can rely on any particular peer [28].

The benefits of P2P architectures are scalability, reliability, flexibility and low administration costs. P2P architectures can be scalable because applications have to be developed for a highly distributed environment. Reliability is achieved by a reasonable distribution of data or functions. Flexibility results from the service aspect of P2P computing. Finally, the low administration costs result from the necessary self-organization of P2P systems.

Based on the previously described properties, P2P systems are similar to service oriented systems. Differences can be found regarding service discovery. Due to the nature of P2P systems, centralized discovery services are not practical. Furthermore, compared to Web services, P2P systems currently lack standards for interface description and communication protocols. A step towards standardization has been taken by Sun Microsystems with the development of the JXTA framework [16], which is used by several applications. Given these shared general properties, we treat P2P and service-oriented systems as equivalent in the following sections.

3. P2P Data Stores
Current development in service oriented computing focuses on standard descriptions of interfaces, communication protocols and processes. But applications, even if they are based on services, deal with data. With the development of mobile information systems, service consumers will more often be less powerful systems like mobile devices that have restricted storage capacities. So it is possible that large service results cannot be stored locally. Furthermore, mobile devices often use unreliable and low-bandwidth communication channels. It would be very costly to transfer large temporary processing results back to the mobile device. A better solution would be to use a reliable data storage service.

As already mentioned in the previous section, service oriented architectures are similar to P2P systems. Hence data storage technologies from P2P systems could enrich the functionality of service oriented systems, e.g., by the development of distributed data stores. In P2P computing much work has been done to handle the specific properties of P2P in data storage systems [30, 33]. An overview of P2P data stores is given in Section 4. In Section 3.1, we motivate the necessity for data stores based on service oriented or P2P computing rather than on traditional technologies. Afterwards, in Sections 3.2 and 3.3, the requirements and challenges for such systems are described.

3.1. Applications
In P2P and service oriented systems, the following general categories of applications can be identified, each with its own requirements on data management:

3.1.1 Workflows and processes
Workflows and processes are well-known application domains which gain from the introduction of Web services. Workflows can now be built up and instantiated dynamically. Required functionalities can be used on demand. No static bindings are necessary. This makes it possible to optimize for cost or performance. Several services work together within a workflow. In addition, services are not restricted to one workflow. Hence it is possible that function calls have to be queued, i.e., the results of the previous processing step have to be stored at a reliable location. The temporary storage should be accessible with low latency and high bandwidth from the next processing step.

3.1.2 Content sharing
Content sharing is the most popular P2P application, known from Napster [6] or Gnutella [5]. The basic idea is to share atomic objects like files (e.g., music, video, pictures) within the community. Users can search for objects and get their location as a result. There is no guarantee that the location is live or that the object is available. Currently, content sharing systems lack the functionality to search for content within the atomic objects [28]. Furthermore, the systems provide only little meta-information, e.g., checksums, file type, title, author.

3.1.3 Resource sharing
Resource sharing became popular with SETI@home [1]. In SETI@home, peers provide their computational power to the SETI project to search for extraterrestrial intelligence. But the usage of the resource was restricted by the installed client software. The reliability of the system was increased by performing the same search in parallel on different peers. With the development of GRID environments, e.g., Globus [14][27] and Condor [12], resource usage has become much more flexible. With this flexibility, the demand for database functions grew, because up to now proprietary local database solutions have been used. First proposals for database services have been developed [36][31].

3.1.4 Collaboration
In collaborative applications, users work together on the same data, e.g., users edit the same document or exchange instant messages. The applications work in a P2P manner, because changes on one peer are directly distributed to all other peers. Other use cases are collaborative work on spreadsheets, calendars or CAD drawings. Groove [15] provides a framework for such applications. The difference between collaboration and content sharing is that collaborative applications actively distribute data, while content sharing applications passively provide data to the public. The active distribution and the collaboration require concurrency control for data access and update.

3.2. Requirements
As seen before, the different application areas have different requirements on data management. Workflows need a reliable place for the storage of temporary or permanent processing results. Users of content sharing applications need sophisticated querying functions and short access times. Applications based on resource sharing require standardized, reliable database access. Finally, collaborative applications require concurrency control. In the service oriented and P2P computing field, the classical requirements on databases are still valid:
• Durability: Data need to be stored for a longer time. In P2P systems data have to be available even if storage peers disappear.
• Consistency: Data always have to be consistent. This is challenging in P2P systems, as data can be changed on every storage peer and changes have to be propagated to other peers.
• Reliability: The reliability of P2P data stores is accomplished by the distributed storage of data. Hence the reliability of a P2P storage is related to the durability property.
• Concurrency: The architecture of P2P systems implies a high level of concurrent operations on data. Changes can be made on several peers in parallel. Hence updates have to be done in a more controlled way.
• Scalability: Scalability is a basic property of P2P systems, which has to be respected by the data management.

3.3. Challenges
The previously identified requirements lead to the following challenges for the development of a P2P storage system.

3.3.1 Durability
Durability is the starting point for the development of a P2P data store. All other tasks like data access and data update depend on the way data are stored, organized and distributed among the peers. This leads directly to the question of how many replicas of a data item are necessary to guarantee a certain availability. Here the dynamics of the peers and other quality of service parameters have to be taken into account. The granularity of data also has to be decided, e.g., whether full objects or only fragments are distributed. Furthermore, the whole distribution process must be self-organized due to the dynamic nature of the architecture [28].
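The replica question can be made concrete under a strong simplifying assumption that is not part of the paper: each storage peer is online independently with the same probability p, and an object counts as available if at least one of its k replicas is online. The availability is then 1 - (1 - p)^k, and the smallest sufficient k follows from a logarithm. The function names below are hypothetical, and the sketch ignores the peer dynamics and quality of service parameters mentioned above.

```python
import math


def availability(p_online, replicas):
    """Probability that at least one of `replicas` independent peers is online."""
    return 1.0 - (1.0 - p_online) ** replicas


def replicas_needed(p_online, target):
    """Smallest replica count whose availability reaches `target`.

    Assumes independent peers with identical online probability.
    """
    if not (0.0 < p_online < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p_online and target must lie strictly between 0 and 1")
    k = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_online))
    k = max(k, 1)
    # Guard against floating-point rounding in the logarithm quotient.
    while availability(p_online, k) < target:
        k += 1
    return k
```

For example, if peers are online half of the time, replicas_needed(0.5, 0.99) yields 7, since six replicas reach only about 98.4 percent availability. Correlated failures, which are likely in real P2P networks, would require more replicas than this estimate.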

3.3.2 Data access
Data access in a P2P storage system means locating a known data object within the storage system, e.g., by its unique object ID, and accessing it afterwards. For efficient access, distributed indices are used, which must always be up-to-date. With the help of indices, the access path to an individual object can be created efficiently. A distributed index is a challenge due to the dynamic nature of peers. As peers appear and disappear at will, data must be able to move among the peers. Such data is also called nomadic data. Nomadic data is a challenge for data access, as the distributed indices always have to be updated when data moves.
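The interplay between a distributed index and nomadic data can be sketched as follows. This is a toy model under stated assumptions, not the design of any system cited here: the class DistributedIndex and its methods are hypothetical, index entries are assigned to index peers by a plain hash modulo the number of peers, and the set of index peers is fixed. Real systems would use a scheme that tolerates index peers joining and leaving.

```python
import hashlib


class DistributedIndex:
    """Toy index: each index peer holds location entries for a slice of object IDs."""

    def __init__(self, index_peers):
        if not index_peers:
            raise ValueError("at least one index peer is required")
        self.index_peers = sorted(index_peers)
        # One partition (object ID -> storage peer) per index peer.
        self.partitions = {peer: {} for peer in self.index_peers}

    def responsible_peer(self, object_id):
        # Hash the object ID and map it onto one of the index peers.
        digest = hashlib.sha1(object_id.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % len(self.index_peers)
        return self.index_peers[slot]

    def register(self, object_id, storage_peer):
        self.partitions[self.responsible_peer(object_id)][object_id] = storage_peer

    def locate(self, object_id):
        # Returns the storage peer, or None if the object is unknown.
        return self.partitions[self.responsible_peer(object_id)].get(object_id)

    def move(self, object_id, new_storage_peer):
        """Nomadic data: the object changed its storage peer, so the entry must follow."""
        if self.locate(object_id) is None:
            raise KeyError(object_id)
        self.register(object_id, new_storage_peer)
```

Every move of an object costs one index update on the responsible index peer, which is exactly the maintenance burden that nomadic data imposes.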

3.3.3 Querying
Querying in P2P data stores goes a step beyond accessing a single object. The result of a query is a number of access paths or object IDs that refer to individual objects. Querying can also be supported by distributed indices, which have to deal with nomadic objects.

3.3.4 Data update
Adding or updating objects in P2P data stores raises the question of how the object replicas are updated. The replicas have to be located and changed. In traditional distributed databases, transaction management guarantees data consistency after update operations. P2P data stores do not have a central administration like a DBMS. With that in mind, other solutions have to be developed. Furthermore, the indices need to be updated.
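One simple way to reason about replica updates without a central transaction manager is to attach a version number to every replica. The following sketch is only an illustration of that idea and is not a solution proposed in this paper: the class Replica and the functions update_replicas, read_latest and repair are hypothetical, and the sketch assumes a single writer at a time.

```python
class Replica:
    """One copy of an object on a storage peer (hypothetical structure)."""

    def __init__(self, peer):
        self.peer = peer
        self.online = True
        self.version = 0
        self.value = None


def _reachable(replicas):
    return [r for r in replicas if r.online]


def update_replicas(replicas, new_value):
    """Write new_value to every reachable replica under a fresh version number."""
    reachable = _reachable(replicas)
    if not reachable:
        raise RuntimeError("no replica reachable")
    next_version = max(r.version for r in reachable) + 1
    for r in reachable:
        r.version = next_version
        r.value = new_value
    return next_version


def read_latest(replicas):
    """Read from the reachable replica with the highest version."""
    reachable = _reachable(replicas)
    if not reachable:
        raise RuntimeError("no replica reachable")
    return max(reachable, key=lambda r: r.version).value


def repair(replicas):
    """Copy the newest reachable version onto stale reachable replicas."""
    reachable = _reachable(replicas)
    if not reachable:
        return 0
    newest = max(reachable, key=lambda r: r.version)
    repaired = 0
    for r in reachable:
        if r.version < newest.version:
            r.version, r.value = newest.version, newest.value
            repaired += 1
    return repaired
```

A replica that was offline during an update keeps its old version and is brought up to date by repair once it returns. Concurrent writers on disjoint sets of replicas could still produce the same version number with different values, which is why the concurrency control discussed above remains necessary.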

4. Related Work
P2P systems became popular with the sharing of multimedia files. So most of the current systems cover the content sharing aspect. The system that brought P2P to a broader community of regular users was Napster [6]. But Napster has a weak point: a centralized index of all shared files. This was a trade-off between having a pure P2P system and having better performance. If the index is down, the whole network is useless, as no one can perform a search. The systems that came after Napster try to be completely distributed and avoid central places in their design. Good examples are Gnutella [5] and KaZaA [17]. An analysis of these systems regarding their search performance can be found in [37]. All existing systems have a common drawback: content querying (e.g., free-text search) is not possible.

The most representative examples of P2P data stores are OceanStore [30], Past [33] and FreeNet [4]. The basic idea behind these systems is to separate data from their location. Objects stored in the system are replicated in the network in order to provide easier access from different locations, or they are archived in order to survive corruption. Also, data are encrypted, so data access can be controlled. The sharing unit is restricted to atomic objects. Sharing of smaller units, e.g., the elements of a document, is not possible.

In [29], P2P technology has been used to build up a distributed Web service directory. The system is able to store and query WSDL [8] documents. But the system has no active data placement algorithms like OceanStore [30] or Past [33]. Active data placement is necessary to build up a reliable data store. Otherwise, less popular objects will disappear from the system, as they will not be replicated very often.

5. Current research activities
We are working on a generic architecture for sharing XML documents among peers. Since every XML document has a tree representation, the problem is equivalent to tree sharing. Peers are responsible only for parts of an XML tree, but they have access to the whole structure. During the lifetime of the shared XML document, document parts can be modified, removed, or added. Our working data granularity is very fine and equal to a tree element. But users should not see any difference compared to having the whole structure on a single machine.

The large area of distributed and cluster computing tries to hide distribution and make it transparent to the programmer, so that the development process does not differ from development for a single machine. The goal is that the programmer writes code without knowing the target (single machine, cluster, or distributed system); how to deal with specific problems is decided at compile time or run time. Many frameworks have been developed to support the development and execution of distributed applications [9, 11, 22, 21, 18, 20, 23]. These distributed environments are stable and each node has a global system overview. Those are the major differences from the P2P domain, where connections are established in an ad-hoc manner and peers leave or join the community at any time. In order to establish transparent usage of P2P networks, similar frameworks need to be developed.

The major benefits of our approach are the following:
• Sharing XML documents inherits all benefits of the XML data model (hierarchical and semi-structured data representation, data types with XML schemas, query languages).
• Sharing complete documents, which is different from sharing keys or keywords about documents in the existing systems.
• Many P2P issues are hidden from the user, so document access should not be different from accessing a local document. A final aim would be that a programmer has access to an interface similar to the Document Object Model (DOM) [13].

When a document is shared, an arbitrary search (like free-text search) can be performed because the document structure is known. The proposed architecture allows us to build on top of any P2P application. An example of the distributed storage of an XML document is shown in Figure 1. Searching for some content is equal to querying shared meta-data. Changing the meta-data structure is quite easy and affects only the application built on top. For all other applications we will share the document we work on. Since we have access to all document parts, the realization of free-text searching is a matter of applying the right query (by using XPath, XQuery, or any other XML query language).

Figure 1. Distributed XML Storage

In order to build the proposed architecture, many research issues need to be solved:
• Data consistency
• Transactions
• Permissions
• Concurrency
• Queries

We do not start building the proposed architecture from scratch. Our intention is to re-use existing P2P storage architectures that deal with smaller primitives like (key, value) pairs. The common name for them is Distributed Hash Tables (DHT), and existing examples are FreeNet [4], Tapestry [38], Pastry [33], CAN [32], and Chord [34]. We explore the possibility of building our primitives for sharing an XML tree using existing DHT systems. We will compare possible mappings, their trade-offs, the scalability of each mapping, the bandwidth used, etc. Based on these parameters, our prototype will use one of these underlying architectures.
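One conceivable mapping of an XML tree onto (key, value) pairs is sketched below. It is an illustration only and not the mapping chosen for the prototype, which the text leaves open: the functions tree_to_pairs and free_text_search, the key scheme based on child positions, and the use of a plain dictionary in place of a DHT are all assumptions made for this example. Text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def tree_to_pairs(xml_text):
    """Flatten an XML document into {key: value} pairs, one pair per element."""
    root = ET.fromstring(xml_text)
    pairs = {}

    def visit(element, key):
        child_keys = []
        for position, child in enumerate(element):
            # The key of a child encodes its path from the root.
            child_key = f"{key}/{position}"
            child_keys.append(child_key)
            visit(child, child_key)
        pairs[key] = {
            "tag": element.tag,
            "attributes": dict(element.attrib),
            "text": (element.text or "").strip(),
            "children": child_keys,
        }

    visit(root, "doc")
    return pairs


def free_text_search(pairs, word):
    """Return the keys of all elements whose text contains word (case-insensitive)."""
    needle = word.lower()
    return sorted(k for k, v in pairs.items() if needle in v["text"].lower())
```

With one pair per tree element, each pair could be handed to a put(key, value) operation of an underlying DHT, and the list of child keys lets a peer that holds only part of the tree navigate to the rest. The free-text search here scans all pairs, which would be costly in a distributed setting; this is one of the trade-offs such a mapping has to be judged on.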

6. Conclusions
In this paper we compared the general properties of service oriented computing and P2P systems and came to the conclusion that many similarities exist. Service based systems need more activity regarding data management. In particular, reliable data stores with content query facilities are necessary. After analyzing the application requirements, it turns out that the requirements are the same as for classical database systems. The requirements and the system architecture analysis lead to several challenges, which have to be solved if real applications are to trust open and highly distributed systems like P2P and service oriented computing. Finally, we gave an outlook on our current research activities, which should lead to a reliable distributed XML storage.

References
[1] SETI@home, 200. http://setiathome.ssl.berkeley.edu/.
[2] Simple Object Access Protocol (SOAP) 1.1, 200. http://www.w3.org/TR/SOAP/.
[3] Extensible Markup Language (XML) 1.0, October 2000. http://www.w3.org/TR/REC-xml.
[4] FreeNet Homepage, 2001. http://www.freenetproject.org/.
[5] Gnutella Homepage, 2001. http://www.gnutella.com/.
[6] Napster Homepage, 2001. http://www.napster.com/.
[7] Universal Description, Discovery and Integration (UDDI), 2001. http://www.uddi.org/.
[8] Web Services Description Language (WSDL) 1.1, March 2001. http://www.w3.org/TR/wsdl.
[9] The Beowulf Project, 2002. http://www.beowulf.org.
[10] Business Process Execution Language for Web Services (BPEL), 2002. http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/.
[11] Common Object Request Broker Architecture, 2002. http://www.omg.org/.
[12] Condor Project Homepage, 2002. http://www.cs.wisc.edu/condor/.
[13] Document Object Model, 2002. http://www.w3.org/DOM/.
[14] Globus Project Homepage, 2002. http://www.globus.org/.
[15] Groove, 2002. http://www.groove.net/.
[16] JXTA Project, 2002. http://www.jxta.org/.
[17] Kazaa Homepage, 2002. http://www.kazaa.com/.
[18] Message Passing Interface Forum, 2002. http://www.mpi-forum.org.
[19] OSF Distributed Computing Environment, 2002. http://www.opengroup.org/dce/.
[20] Parallel Virtual Machine, 2002. http://www.epm.ornl.gov/pvm/.
[21] PARASLAX, 2002. http://www.paraslax.com.
[22] The Distributed Component Object Model (DCOM), 2002. http://www.microsoft.com/com/tech/dcom.asp.
[23] The Oxford BSP Toolset, 2002. http://www.bsp-worldwide.org/implmnts/oxtool/.
[24] Web Services Architecture Requirements, October 2002. http://www.w3.org/TR/wsa-reqs.
[25] Web Services Transaction, August 2002. http://www-106.ibm.com/developerworks/library/ws-transpec/.
[26] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration. Global Grid Forum, Open Grid Service Infrastructure WG, 2002.
[27] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organization. The International Journal of High Performance Computing Applications, 15(3):200–222, Fall 2001.
[28] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can peer-to-peer do for databases, and vice versa? In Fourth International Workshop on the Web and Databases (WebDB '2001), 2001.
[29] W. Hoschek. Peer-to-peer grid databases for web service discovery. To appear in Grid Computing: Making the Global Infrastructure a Reality, 2002.
[30] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. ACM, November 2000.
[31] N. Paton, M. Atkinson, V. Dialani, D. Pearson, T. Storey, and P. Watson. Database access and integration services on the grid. Technical Report UKeS-2002-03, UK e-Science Programme Technical Report Series, 2002.
[32] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Computer Communication Review, volume 31, pages 161–172. Dept. of Elec. Eng. and Comp. Sci., University of California, Berkeley, 2001.
[33] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. Lecture Notes in Computer Science, 2218, 2001.
[34] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149–160, 2001.
[35] D. Tidwell. Web Services - The Web's next revolution. https://www6.software.ibm.com/developerworks/education/wsbasics/wsbasic%s-ltr.pdf.
[36] P. Watson. Databases and the grid. Technical Report UKeS-2002-01, UK e-Science Programme Technical Report Series, 2002.
[37] B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, Proceedings of the Twenty-seventh International Conference on Very Large Data Bases: Roma, Italy, 11–14th September, 2001, pages 561–570, Los Altos, CA 94022, USA, 2001. Morgan Kaufmann Publishers.
[38] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, Apr. 2001.
