Policy Driven Data Management in PL-Grid Virtual Organizations

Dariusz Król, Renata Słota, Bartosz Kryza, Darin Nikolow, Włodzimierz Funika, and Jacek Kitowski

Abstract In this chapter, we introduce a novel approach to data management within the Grid environment based on user-defined storage policies. This approach aims at enabling Grid application developers to specify requirements concerning the storage elements which will be exploited by an application during runtime. Most of the existing Grid middleware focuses on unifying access to the available storage elements, e.g. by applying various virtualization techniques. While this is suitable for many Grid applications, there is a category of applications, namely data-intensive ones, which often have more specific needs. The chapter outlines research and development work carried out in the PL-Grid and OntoStor projects to address this issue within the PL-Grid infrastructure.

1 Introduction

Modern scientific as well as business-oriented applications are becoming more and more complex, involving a multitude of parties and their heterogeneous resources, including hardware, services, and data. This poses several problems when trying to deploy such applications in Grid environments, such as the problem of defining and managing a Virtual Organization which spans several different IT infrastructures. Currently, one of the main bottlenecks hindering the adoption of Grid environments in scientific and business applications is the complex process of configuring all the necessary middleware for existing applications, including the configuration of

D. Król • B. Kryza • D. Nikolow • J. Kitowski
ACC CYFRONET AGH, ul. Nawojki 11, 30-950 Kraków, Poland
e-mail: [email protected]; [email protected]; [email protected]; [email protected]

R. Słota • W. Funika • J. Kitowski
Department of Computer Science, AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland
e-mail: [email protected]; [email protected]; [email protected]

F. Davoli et al. (eds.), Remote Instrumentation for eScience and Related Aspects, DOI 10.1007/978-1-4614-0508-5_17, © Springer Science+Business Media, LLC 2012

Fig. 1 Virtual Organizations in PL-Grid

monitoring, data management, as well as security services with respect to the requirements of particular applications. In the Grid, such applications are developed within custom Virtual Organizations (Fig. 1), deployed in order to limit the set of resources available to particular applications and users, as well as to control the access to these resources by the users of the application. Within the PL-Grid project, these issues are addressed in order to allow end-users to create their applications on the Grid by defining requirements which will be mapped to a proper VO configuration. In particular, the VO-specific requirements should be taken into account by the data management layer of the Grid middleware. This includes such issues as the preferred storage type, the required access latency, or replication for the purpose of increased data security. In order to accomplish this, the VO management framework must provide means for defining high-level requirements related to data management, and the data management layer must be able to infer from these requirements the particular actions which should be performed during data storage and retrieval. In this chapter, we present our results on developing a framework for data management in Grid environments in the context of Virtual Organizations. The novelty of our approach is that users can specify Quality of Service (QoS) requirements for the storage resources in a given VO. Our framework uses those requirements to automatically create and manage the VO. The framework consists of several components. First of all, a component called FiVO is responsible for the definition of the VO and the users' requirements, as well as for the automatic deployment and management
of the VO according to the defined policy. Data management components use the predefined policies in order to optimize data access and best meet the users' expectations, using proper storage monitoring and data access estimation methods. The rest of the chapter is organized as follows: the next section presents related work in the field of grid data management; the third section gives some details about VO management within PL-Grid; the fourth section discusses the data management challenges regarding the QoS of data access in PL-Grid, the available relevant technologies, and example use cases; implementation notes are included in the fifth section and the last section concludes the chapter.

2 Related Work

The presented research intends to develop an approach to data management within the Grid environment driven by a user-defined policy. Most of the existing solutions, however, try to simplify the data management process by choosing storage elements on behalf of the user, rather than supporting users in selecting the storage elements which best suit their requirements. This section outlines various projects related to the described issue in different environments.

The dCache system [1] consolidates various types of data sources, e.g. hierarchical storage management systems or distributed file systems, in order to provide users with a single virtual namespace. Moreover, dCache replicates existing data to different data sources to increase throughput and data availability. Mechanisms for moving data to the location where it is actually used are also intensively developed within the project. All of these features are transparent from the user's point of view, although some aspects of the system behavior, e.g. the division of data sources into pools or the replica manager, can be configured.

The autonomic, agent-based approach presented in [2] exploits features such as autonomicity and self-awareness known from other agent-based systems. Therefore, it is suitable for dynamic environments such as the grid, where storage elements can be added or removed at runtime. The agent-based approach ensures that such a change will be quickly discovered and that adapting actions, e.g. adjusting a load-balancer agent, will be performed. Each agent is assigned a different role, e.g. data source accessor or storage element chooser; thus, the responsibility is well divided between loosely coupled objects. It is worth mentioning that in this approach metadata can be assigned to each storage element, which can then be exploited when finding the most suitable storage for application data. However, such attributes are static and contain only information which does not change over time.

Many challenges in the data management area have been revisited in the Cloud computing era. The most important ones include: storing and retrieving enormous amounts of data, accessing data by different users from geographically distributed
locations, synchronization between different types of devices, or consolidating various storage types, among others [3]. While there are no effective solutions available yet for each of the mentioned problems, there is some very interesting ongoing work in this area. One of the interesting observations about applying Cloud computing is the necessity of shifting the approach to storing data from relational to less structured storage, which scales better in a highly distributed environment [4]. Reference [5] describes a production-ready cloud architecture deployed by Yahoo!, along with explanations of several decisions related to data management that were taken, e.g. the necessity of applying the map-reduce paradigm to perform data-intensive jobs or replacing relational databases with more schema-flexible counterparts.

3 Virtual Organization Management in PL-Grid

One of the main objectives of the PL-Grid project [7] is to support Polish scientists by providing a hardware and software infrastructure which will meet the requirements of scientific applications. The infrastructure will be accessible in the form of a Grid environment that encompasses various computational and storage resources located in several computer centers in Poland. The proposed system for VO management, called FiVO, will support PL-Grid users in defining the requirements for their applications in a unified, semantic way, which will then be translated by our system into the configuration of particular middleware components (e.g. VOMS, replica manager) in order to make the process of VO creation as automatic as possible (Fig. 2). Such an approach will enable abstraction over heterogeneous resources belonging to a particular domain of a Virtual Organization (e.g. services or data), as well as abstraction over the heterogeneous types of middleware used by different sites of the Grid infrastructure, including gLite and UNICORE. This process assumes that the resources available to a VO are described in a semantic way. Since this is usually a very strong requirement, we are developing a system, X2R, which will allow semi-automatic translation of legacy metadata sources such as RDBMS, LDAP, or XML databases into an ontological knowledge base founded on a provided mapping [6]. This will overcome the problem of interoperability of descriptions of the resources available to a Virtual Organization. In particular, an important aspect of the work involves optimization of data access on the VO level based on the semantic agreement reached by the partners during the VO inception. This will involve custom monitoring of data access statistics within a VO, as well as development of extensions to existing monitoring systems capable of monitoring data access metrics and verifying them against the particular Service Level Agreement (SLA) present within a VO.

Fig. 2 Virtual Organizations management architecture (components shown: User Interface, FiVO, X2R, Resource description layer, Security Configurator, Monitoring Configurator; PL-Grid Infrastructure with Security layer, Monitoring layer, and Data management layer including Access time estimators and Replica manager)

4 Data Management Challenges in PL-Grid

For users and applications with specific data access quality of service requirements, an appropriate VO should be created that respects the data storage performance demands specified in the SLA. There are a couple of challenges which need to be addressed when creating and running such VOs. The first one is the automatic configuration of the VO and the assignment of storage resources to it. We assume that heterogeneous storage resources are available within the grid environment. The resources and storage policies which best match the user requirements need to be selected. More than one storage resource can be selected, depending on the storage policy (e.g. replication). The second one is controlling storage performance and storage resource allocation in order to guarantee the fulfillment of the user requirements with respect to storage performance within the already created VO. One technique which can be used in this case is replication of data in order to avoid storage node overload for popular data sets. Another technique, for increasing the performance of a single data access, is distributing (striping) the data and accessing the stripes in parallel. Relevant storage resource selection and scheduling based on a given user
performance profile is needed to keep track of the resource usage and to keep the performance within the specified limits. Therefore, our research aims at providing a mechanism which lets grid application developers define non-functional requirements for storage, e.g. a required throughput or a desired availability level. The requirements can be divided into two categories: soft and hard. The former category represents constraints which should be maintained at runtime, e.g. a maximum access time; if some of them are violated, the application should not be stopped, but it will most probably run longer due to unoptimized storage management. A hard requirement, on the other hand, is one whose violation results in an application failure, e.g. a requirement related to storage capacity: an application which tries to dump data on a storage element that does not provide enough free space will, in most situations, be interrupted and an error code will be returned. It is important to mention that some properties of storage elements can change over time, e.g. the available capacity or the mean access time; thus, a monitoring system has to be exploited in order to obtain information about the current state of the available storage resources. In order to build a system providing the functionalities described above, low-level storage monitoring services and relevant performance prediction methods are needed. Such technology is being developed as part of another ongoing research project codenamed "OntoStor" [10], which is briefly presented in the following sections.
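A minimal sketch of how such soft and hard requirements could be grouped into a policy object is shown below; all type and field names are illustrative assumptions rather than the actual PL-Grid API.

```cpp
#include <string>
#include <vector>

// Soft requirements may be violated at runtime (the application only runs
// slower); hard requirements must hold, otherwise the application fails.
enum class Severity { Soft, Hard };

// One storage-related, non-functional requirement (names are illustrative).
struct Requirement {
    std::string metric;    // e.g. "max_access_time_ms", "min_free_space_gb"
    double      threshold; // limit for the metric
    Severity    severity;  // soft or hard
};

// A storage policy: a named set of requirements attached to a VO or application.
struct StoragePolicy {
    std::string name;
    std::vector<Requirement> requirements;
};

// Example: tolerate slow access (soft), but never run out of space (hard).
StoragePolicy makeExamplePolicy() {
    return StoragePolicy{
        "simulation-output",
        {
            {"max_access_time_ms", 50.0,   Severity::Soft},
            {"min_free_space_gb",  2048.0, Severity::Hard},
        }
    };
}
```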

4.1 Storage Monitoring

The OntoStor project aims at developing an ontology-based data access methodology [11] for heterogeneous Mass Storage Systems (MSS) in grid environments. Within the project, a library of storage monitoring methods for various types of storage systems has been developed. MSSs differ from each other; thus, dedicated monitoring methods are necessary for each supported system. In order to make the monitoring of heterogeneous storage systems easier, a common set of performance-related parameters has been specified and an appropriate CIM-based model (C2SM) has been developed. Since the model is based on standard parameters, it allows for easier integration and development of monitoring software. Currently, three types of storage systems are supported by the model: HSM systems, disk arrays, and local disks. Specific monitoring software for each storage system needs to be developed, since such software is storage-system dependent. This software has been implemented in Java as a library package and can be included as monitoring sensors in more general monitoring systems like Nagios, Ganglia, or Gemini. The monitoring systems provide, according to the C2SM model, static (e.g. total storage capacity) and dynamic (e.g. MSS load) performance-related parameters, allowing performance prediction for a given moment.
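The actual OntoStor sensors are implemented in Java; the C++ sketch below is only meant to make the distinction between static and dynamic parameters concrete, and the interface and parameter names are assumptions for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical sensor interface for one mass storage system (MSS).
// Static parameters change rarely (e.g. total capacity); dynamic ones reflect
// the current state (e.g. load) and are re-sampled before each prediction.
class StorageSensor {
public:
    virtual ~StorageSensor() = default;
    virtual std::map<std::string, double> staticParameters()  const = 0;
    virtual std::map<std::string, double> dynamicParameters() const = 0;
};

// Example sensor for a local disk; the values are placeholders that a real
// sensor would obtain from the operating system or device statistics.
class LocalDiskSensor : public StorageSensor {
public:
    std::map<std::string, double> staticParameters() const override {
        return {{"total_capacity_gb", 2000.0}, {"max_transfer_rate_mbps", 600.0}};
    }
    std::map<std::string, double> dynamicParameters() const override {
        return {{"free_capacity_gb", 450.0}, {"current_load", 0.35}};
    }
};
```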

4.2 Methods for Storage Performance Prediction

Storage system performance is manifested by two main parameters: data transfer rate and data access latency. The performance prediction concerns a given single request, not the average for the given MSS. Within the OntoStor project, three types of methods for performance prediction are taken into consideration: statistical prediction, rule-based prediction, and prediction based on MSS simulation. The simulation-based method is the most advanced and accurate, but it needs a more detailed description of the current state of the MSS. The statistical prediction is the simplest and fastest method. These methods are implemented as grid services. Depending on the user profile and the goal of prediction (replica selection or container selection), an appropriate service is chosen for the given request. By using the mentioned prediction services it is possible to manage data more intelligently and to assign the best storage elements according to the user requirements. During the prediction process, the appropriate prediction service is selected based on the ontological description of the service.
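As an illustration of how the three prediction methods might be organized behind a common interface and selected per request, consider the following sketch; the class names, placeholder prediction values, and the selection rule are assumptions for illustration only and do not reflect the actual OntoStor services.

```cpp
#include <cstddef>
#include <memory>

// Goal of a prediction request, following the two cases mentioned in the text.
enum class PredictionGoal { ReplicaSelection, ContainerSelection };

// Every method predicts transfer rate and latency for a single request.
struct PredictionResult { double transferRateMBps; double latencyMs; };

class Predictor {
public:
    virtual ~Predictor() = default;
    virtual PredictionResult predict(std::size_t requestSizeBytes) const = 0;
};

// Placeholder implementations of the three method families.
class StatisticalPredictor : public Predictor {   // simplest and fastest
public:
    PredictionResult predict(std::size_t) const override { return {100.0, 10.0}; }
};
class RuleBasedPredictor : public Predictor {      // rule set per MSS type
public:
    PredictionResult predict(std::size_t) const override { return {120.0, 8.0}; }
};
class SimulationPredictor : public Predictor {     // needs detailed MSS state
public:
    PredictionResult predict(std::size_t) const override { return {130.0, 7.0}; }
};

// Illustrative selection rule: prefer simulation when a detailed description
// of the MSS state is available, otherwise fall back to cheaper methods.
std::unique_ptr<Predictor> selectPredictor(PredictionGoal goal, bool detailedStateAvailable) {
    if (detailedStateAvailable)
        return std::make_unique<SimulationPredictor>();
    if (goal == PredictionGoal::ReplicaSelection)
        return std::make_unique<RuleBasedPredictor>();
    return std::make_unique<StatisticalPredictor>();
}
```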

4.3 Sample Use Case

As the PL-Grid project is focused on the user, it is crucial to gather requirements and needs directly from potential users. The result of the requirements analysis stage is a description of two typical use cases which are the most important ones from the user point of view. We present them in this section to better demonstrate how the presented system (in particular the programming library element) works in typical scenarios. The first use case describes a scenario in which functions provided by the programming library are explicitly called from an application. It encompasses all situations where a new data-intensive application is created and the application developer wants to apply a special storage strategy based on his/her experience and the nature of the generated data. This type of application often occurs in science, where solving complex problems requires running complicated algorithms on many very large data sets. Another field of science where a large amount of data is generated at runtime is simulation: depending on the problem scale and the desired level of detail, a simulation can generate hundreds of terabytes of data per day. The second use case is intended to handle scenarios where an application already exists, e.g. in a binary form, and therefore cannot be modified. In order to apply our storage management mechanism to this use case, a proxy object has to be exploited. In this case, a modified version of the C++ standard library serves as the proxy which delegates all file creation requests to our programming library; therefore, the application need not be aware of any additional element. The necessary information about the storage policy which should be applied is retrieved from a VO knowledge base. The knowledge base contains properties of the default storage strategy which
Fig. 3 Architecture of data management system enhanced by user-defined requirements

can be applied to applications of any user that belongs to the VO. At runtime, the programming library determines to which VO the user belongs and then retrieves the necessary information. The rest of this use case is the same as in the previous one. Figure 3 depicts an overview of the system schema. The central point of the system is a module (called the Data storage manager) which handles data creation requests from an application according to the defined use cases. The VO knowledge base element is exploited only when no storage policy is given explicitly. The Monitoring system module provides information about the current state of the environment, allowing the library to better select, from the available storage elements, the ones where the data will actually be stored.
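The chapter does not detail how the modified C++ standard library forwards requests. As a hedged illustration of the second use case, the sketch below uses symbol interposition via LD_PRELOAD (one common proxying technique, not necessarily the one used in PL-Grid) to route file creation through a hypothetical management library entry point before calling the real fopen.

```cpp
// Build as a shared object and preload it before the unmodified application:
//   g++ -shared -fPIC -o libstorage_proxy.so storage_proxy.cpp -ldl
//   LD_PRELOAD=./libstorage_proxy.so ./legacy_application
#include <dlfcn.h>
#include <cstdio>

// Hypothetical entry point of the data management library; it applies the
// default storage policy retrieved from the VO knowledge base.
extern "C" int apply_default_storage_policy(const char *path);

extern "C" FILE *fopen(const char *path, const char *mode) {
    // Locate the fopen implementation provided by the C library.
    using fopen_t = FILE *(*)(const char *, const char *);
    static fopen_t real_fopen =
        reinterpret_cast<fopen_t>(dlsym(RTLD_NEXT, "fopen"));

    // For files opened for writing, let the management library decide
    // placement and striping before delegating to the real call.
    if (mode != nullptr && (mode[0] == 'w' || mode[0] == 'a'))
        apply_default_storage_policy(path);

    return real_fopen(path, mode);
}
```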

5 Implementation Notes

An important choice from the data management point of view is the deployment of the Lustre file system [8] as the primary distributed file system on Storage Elements. Lustre provides a file striping capability which can be used to distribute data between different storage resources, e.g. single hard drives or disk arrays, that are gathered to provide a single, logical namespace to the user. Another interesting feature is the pool mechanism, which is a way of grouping different storage elements, e.g. RAID arrays or partitions from several hard drives, into a single category. Once a pool is defined, the user can explicitly request that some data be stored on this pool only.
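As a hedged illustration of the pool mechanism, the snippet below uses Lustre's user-level API (liblustreapi) to create a file striped over four targets taken from a named pool; the pool name and path are placeholders, and header and function details may differ between Lustre versions.

```cpp
// Requires the Lustre user-level API (liblustreapi); header and function
// details may differ between Lustre releases.
#include <lustre/lustreapi.h>
#include <cstdio>

int main() {
    char pool[] = "fastpool";  // placeholder pool name defined by the site admin

    // Create a file striped over 4 OSTs chosen from the pool "fastpool".
    int rc = llapi_file_create_pool(
        "/mnt/lustre/project/output.dat", // placeholder path inside the Lustre mount
        1 << 20,                          // stripe size: 1 MiB
        -1,                               // stripe offset: let Lustre pick the first OST
        4,                                // stripe count
        LOV_PATTERN_RAID0,                // default RAID0 striping pattern
        pool);
    if (rc != 0)
        std::fprintf(stderr, "llapi_file_create_pool failed: %d\n", rc);
    return rc;
}
```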

Apart from the functionality provided by a programming library, the exposed API is very important from the application developer's point of view. The API should be as clean, easy to use, and intuitive as possible. In the presented library, the context of use was relevant while designing the API: the library is intended to be used to create files in the same way as the standard library does. The difference is in the physical location of the created files. While the standard library creates files on the local hard drive or, in the best case, within a distributed file system with default properties, our programming library dynamically adjusts, e.g., the striping strategy while creating files, according to the given storage policy and the actual state of the Storage Elements. In addition, the library provides functions to retrieve information about the currently applied strategy, along with the possibility to change the policy (and thus the striping strategy) of a specified file. The most important functions of the API are as follows:

– int createFile(char *filename, StoragePolicy *policy) – the main function of the API, which creates a new file in the Lustre file system according to the given StoragePolicy object. The policy object contains storage-related requirements defined by the user. At runtime, these requirements are mapped to the striping strategy applied by the Lustre system. It is worth mentioning that besides the static information about user-defined requirements, the actual state of the runtime environment retrieved from the monitoring system (see Sect. 4.1) is taken into account. A file descriptor related to the given name is returned.
– int openFile(char *filename) / void closeFile(char *filename) – counterparts of the standard open()/close() functions from the standard library. The only reason to include these functions is to maintain the consistency of the API.
– void changeStoragePolicy(char *filename, StoragePolicy *newPolicy) – can be executed when one wants to change the storage policy of an existing file. As a result, the file can be transferred to a different storage element if that is more suitable for the new policy or for the current state of the environment.
– int getStripeCounter(char *filename) / void setStripeCounter(int newCounter, char *filename) – the first function returns the number of stripes into which a file is divided; the second function changes the number of stripes. Exposing these two functions is dictated by the importance of the striping mechanism: it is the most important feature provided by the Lustre file system and is exploited to reduce file access time as well as to increase the availability of a file.

By replacing the standard functions for creating files with the presented ones, application developers can locate application data more precisely. Also, the learning curve of the API should be very gentle due to the analogy with the standard library. Implementation work is currently in progress. We have chosen the C++ language to provide high performance and to be able to efficiently exploit the native Lustre library, which is written in pure C. To communicate with FiVO [9] and
the storage monitoring system, the Web Services technology was selected. This set of development tools ensures that the final implementation of the presented library can be deployed on the production PL-Grid infrastructure.
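As a rough sketch of what createFile() might do internally, the following fragment derives a stripe count from a simplified, hypothetical storage policy and the throughput currently observed by the monitoring layer, and then delegates file creation to liblustreapi; this is an assumption about a possible implementation, not the project's actual code.

```cpp
#include <lustre/lustreapi.h>
#include <algorithm>

// Simplified, hypothetical policy and monitoring inputs for this sketch only.
struct StoragePolicy { double requiredThroughputMBps; };
double monitoredPerOstThroughputMBps();  // assumed to be supplied by the monitoring layer

// One possible shape of createFile(): derive a stripe count from the required
// throughput and the currently observed per-OST throughput, then delegate
// file creation to Lustre.
int createFile(char *filename, StoragePolicy *policy) {
    double perOst = std::max(1.0, monitoredPerOstThroughputMBps());
    int stripes = static_cast<int>(policy->requiredThroughputMBps / perOst);
    stripes = std::min(std::max(stripes, 1), 32);  // illustrative bounds

    return llapi_file_create(filename,
                             1 << 20,             // stripe size: 1 MiB
                             -1,                  // stripe offset: any OST
                             stripes,             // derived stripe count
                             LOV_PATTERN_RAID0);  // default striping pattern
}
```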

6 Conclusions

In this chapter, we have presented our approach to data management driven by specific user requirements in grid-based Virtual Organizations. The system is composed of several components, including VO management, data management, and data access estimation layers. Owing to their integration, it is possible to enhance data management within the Grid not only on the global scale, but especially with respect to particular applications running within the context of Virtual Organizations. Future work will include evaluation of the system within the framework of the PL-Grid project.

Acknowledgements The research presented in this paper has been partially supported by the European Union within the European Regional Development Fund program no. POIG.02.03.00-00-007/08-00 as part of the PL-Grid Project (www.plgrid.pl) and by ACC Cyfronet AGH grant 500-08. MSWiN grant no. N N516 405535 is also acknowledged.

References

1. G. Behrmann, P. Fuhrmann, M. Gronager and J. Kleist, A distributed storage system with dCache, Journal of Physics: Conference Series, 2008
2. Zhaobin Liu, Autonomic Agent-based Storage Management for Grid Data Resource, in: Second International Conference on Semantics, Knowledge and Grid (SKG 2006), 2006
3. Daniel J. Abadi, Data Management in the Cloud: Limitations and Opportunities, IEEE Data Engineering Bulletin, 2009, pp. 3-12
4. Robert L. Grossman and Yunhong Gu, On the Varieties of Clouds for Data Intensive Computing, IEEE Data Engineering Bulletin, 2009, pp. 44-50
5. B. Cooper, E. Baldeschwieler, R. Fonseca, J. Kistler, P. Narayan, C. Neerdaels, T. Negrin, R. Ramakrishnan, A. Silberstein, U. Srivastava and R. Stata, Building a Cloud for Yahoo!, IEEE Data Engineering Bulletin, 2009, pp. 36-43
6. A. Mylka, A. Swiderska, B. Kryza and J. Kitowski, Supporting Grid metadata management and resource matchmaking with OWL, Computing and Informatics, in preparation
7. PL-Grid project page, http://www.plgrid.pl
8. Lustre file system project wiki, http://wiki.lustre.org
9. B. Kryza, L. Dutka, R. Slota and J. Kitowski, Dynamic VO Establishment in Distributed Heterogeneous Business Environment, in: G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. Dongarra and P. M. A. Sloot (Eds.), Computational Science – ICCS 2009, 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part II, LNCS 5545, Springer, 2009, pp. 709-718
10. The OntoStor project, http://www.icsr.agh.edu.pl/ontostor/
11. D. Nikolow, R. Slota and J. Kitowski, Knowledge Supported Data Access in Distributed Environment, in: M. Bubak, M. Turala and K. Wiatr (Eds.), Proceedings of Cracow Grid Workshop – CGW'08, October 13-15, 2008, ACC-Cyfronet AGH, Krakow, 2009, pp. 320-325