
Providing Autonomic Features to a Data Grid

María S. Pérez, Alberto Sánchez, Ramiro Aparicio, Pilar Herrero and Manuel Salvadores
Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain

Abstract. Autonomic and grid computing are complementary: complex grid environments can take advantage of the features provided by autonomic computing and, conversely, autonomic processes can be properly deployed by using grid technology. Moreover, one of the most active fields in grid computing is the area of data grids, which focuses on data access. This paper proposes a grid framework that provides autonomic characteristics in order to enhance the performance of data access by predicting the future behaviour of the corresponding I/O system. The use of the autonomic system is transparent to the user. The paper also presents a case study of such a system.

Keywords: Autonomic computing, Grid computing, Data grids, future behaviour prediction

1 Introduction

Much of the progress made in Computer Science has arisen as a result of some specific crisis, which triggers a revolution in the corresponding research area. For instance, the well-known software crisis [2], a notion that emerged at the end of the 1960s and that is characterized by the inability of software developers to deliver good-quality software products on schedule and within budget, gave rise to software engineering [15]. In the same way, the terms I/O crisis and, more recently, software complexity crisis have been used to name situations that current technology has not been able to resolve. The I/O crisis stems from the gap between computation capacity and I/O capacity, which turns the I/O system into the “bottleneck” of today’s systems [8]. On the other hand, a new software complexity crisis has been detected in current software systems [5]: these systems are so complex that their administration is becoming increasingly unmanageable.

Neither crisis has been properly solved, although some initiatives address them. In the first area, many different proposals have been made, and parallel I/O systems are one of the most active fields in this sense. In the second, the growing adoption of autonomic computing can help software developers and administrators to manage and administer complex software systems more easily.

Autonomic computing [12] [4] describes the set of technologies that enable applications to become more self-managing. Self-management involves self-configuring, self-healing, self-optimising and self-protecting capabilities. The word autonomic has been borrowed from physiology: just as a human body knows when it needs to breathe, software is being developed to enable a computer system to know when it needs to reconfigure itself. Autonomic computing deployment is one of the most promising areas in computer science. If the environment in which this discipline is used is a grid [3], the advantages are even greater, due to the complexity of such environments.

Our work combines solutions from parallel I/O systems and autonomic computing in order to optimize the performance of the I/O phase (which is critical) in data grids [1]. Autonomic computing provides self-management, which in this case corresponds to the self-configuration and self-optimisation capabilities. To this end, we propose MAPFS-Grid [10], whose autonomic system takes decisions about data location based on monitored data. As our goal is to increase system performance not at a single point in time but for future actions over a period of time, decisions are made according to a statistical prediction algorithm [14].

The outline of this paper is as follows. Section 2 defines the foundations and characteristics of autonomic computing. Section 3 describes our proposal, an autonomic architecture for a data grid, which provides autonomic features to an I/O system. Section 4 shows a case study of the autonomic part of our proposal. Finally, Section 5 explains the main conclusions and outlines ongoing and future work.

2 Autonomic Computing

The increasing complexity of current software infrastructures can slow down the progress of technology development. As we move from the information age to the knowledge era, it seems clear that the need for data processing capabilities will continue to grow exponentially. The huge amount of data we have to deal with every day cannot be handled by current management systems. Today’s applications require both a huge amount of computing capability and tools that ease their configuration and deployment. These two aspects are two sides of the same coin, which involves two innovative fields, namely grid computing [3] and autonomic computing [7]. Indeed, the intended goals of both areas can be seen as instances of a more general goal, that is, the usability of computing elements in a virtual environment. The main metaphor for this feature is the use of telephony or electricity: these scenarios provide automated and standardized ways of using services whose complexity is hidden from end users.

In [5], the urgency is emphasized to “. . . design and build computing systems capable of running themselves, adjusting to varying circumstances, and preparing their resources to handle most efficiently the workloads we put upon them. These autonomic systems must anticipate needs and allow users to concentrate on what they want to accomplish rather than figuring how to rig the computing systems to get them there . . . ”

Autonomic computing tries to emulate the autonomic nervous system of the human body. The autonomic nervous system is responsible for body tasks such as controlling the heartbeat, checking the blood’s sugar and oxygen levels, monitoring the body’s temperature, managing digestion and so on. All these tasks are carried out unconsciously. In fact, this is the key feature that autonomic computing aims to achieve: self-configuration is performed without any conscious intervention by the user or developer.

In order to focus on this paradigm, it is important to understand the nature of autonomic computing. In [5], IBM, one of the most active supporters of autonomic computing (it coined the term), defines the following eight key elements of this discipline:

1. “To be autonomic, a computing system needs to ‘know itself’ - and comprise components that also possess a system identity”. An autonomic computing system is aware of all its components and their status.
2. “An autonomic computing system must configure and reconfigure itself under varying and unpredictable conditions”. The environment in which an autonomic computing system works is dynamic, and the system must be able to reconfigure itself according to these dynamic conditions. Although the conditions are unpredictable, it is possible and desirable to use a system that can predict, in some sense, future behaviour. In this way, reconfiguration makes performance enhancement feasible.
3. “An autonomic computing system never settles for the status quo - it always looks for ways to optimize its workings”. An autonomic computing system monitors the overall status of the system and decides, according to an optimization plan, which parameters to change.
4. “An autonomic computing system must perform something akin to healing - it must be able to recover from routine and extraordinary events that might cause some of its parts to malfunction”. An important feature of an autonomic computing system is its ability for self-healing: the system must be able to identify the causes of problems and solve them.
5. “A virtual world is no less dangerous than the physical one, so an autonomic computing system must be an expert in self-protection”. An autonomic computing system must protect itself from attacks, detecting them and alerting the system administrator in case of danger.
6. “An autonomic computing system knows its environment and the context surrounding its activity, and acts accordingly”. An autonomic computing system must be able to discover resources and obtain information about them. Furthermore, the system takes decisions according to information about its neighbours.
7. “An autonomic computing system cannot exist in a hermetic environment”. An autonomic computing system interacts in an open and heterogeneous environment with other elements by means of open standards. This feature is especially compatible with the grid philosophy.
8. “Perhaps most critical for the user, an autonomic computing system will anticipate the optimized resources needed while keeping its complexity hidden”. An autonomic computing system must be able to act in advance, in an optimized fashion, in order to increase the performance of the system. This must be done in a transparent way.

To achieve all these features, four generic principles are embedded in the autonomic computing strategy [7]:
– self-configuration, that is, the ability of the system to configure itself according to high-level policies;
– self-optimisation, that is, the capacity to seek ways of enhancing performance;
– self-healing, that is, the feature that allows the system to detect, diagnose and repair hardware and software problems;
– self-protection, that is, the ability to protect the system against possible attacks.

3 MAPFS-Grid: An Autonomic Data Grid Architecture

The difficulty and size of current problems pose a challenge for researchers, who need complex solutions and architectures for their domain-specific problems. Grid computing has become a key piece in the development and deployment of these infrastructures. Often, the added complexity is due to the huge amount of data involved in such processes. In these scenarios, the I/O access stage limits the overall performance of the system. Furthermore, finding the optimum configuration of these environments is not usually straightforward.

MAPFS-Grid is a grid-based framework whose main goal is to enhance the performance of data grid applications. MAPFS-Grid includes an autonomic system, which is in charge of providing autonomic capabilities to the framework. Our optimization depends on the kind of I/O operation. In the case of write operations, MAPFS-Grid uses a prediction algorithm, based on logs and historical data, together with a decision policy to find the “best” target cluster (that is, the best one according to the heuristics used). This is possible because we use replicated data, in order to provide both fault tolerance and performance enhancement; thus, even if the data being written are already present in the system, we can choose an alternative location. A coherence protocol is used for updating all the data copies. In the case of read operations, on the other hand, MAPFS-Grid optimizes the data access depending on the current performance of all the locations where the data are stored.

Before analysing every component of MAPFS-Grid, it is important to emphasize several aspects of the framework:
– MAPFS-Grid resources are clusters of workstations/servers or individual nodes. In general, any computation element with disk capacity can be considered a resource in our environment, which makes the definition of resources in MAPFS-Grid flexible.
– MAPFS-Grid is based on a multiagent parallel file system named MAPFS [9], whose main contribution is the conceptual use of agents to provide applications with new properties, with the aim of improving their adaptation to dynamic and complex environments. MAPFS offers features such as data acquisition, caching, prefetching and use of hints. MAPFS is intended for use in a cluster of workstations.

[Figure 1: User X → MAPFS Interface → MAPFS-Grid Autonomic System (Request Translation Module, System Prediction Module, Monitored Data Retrieval System, System Monitoring Module, Taking Decision Module, Logs & Monitored data; arrows labelled retrieve, store, trigger, query and access; steps 1–7) → MAPFSGrid services with Monitoring and Access PortTypes on Cluster 1, Cluster 2, …, Node n, each with its own data.]

Fig. 1. MAPFS-Grid Architecture

– In MAPFS, data are striped across the nodes of the clusters in order to take advantage of the inherent parallelism of such a layout.
– Associated with every resource (cluster or computing element) there is a grid service with two main portTypes: access and monitoring. The first is the portType used for the main operations of the file system and is explained in detail in [11]. The second is explained later.

Figure 1 shows the MAPFS-Grid architecture, focusing on the internal design of the MAPFS-Grid Autonomic System. The main components of MAPFS-Grid are:
– MAPFS Interface: MAPFS-Grid shares its interface with MAPFS. Unlike MAPFS, which is used in a cluster, MAPFS-Grid is used in a grid composed of clusters and/or individual nodes.
– MAPFS-Grid Autonomic System: This is the autonomic part of MAPFS-Grid. It is composed of five modules, whose main responsibility is to provide autonomic features to the I/O management. Mainly, this component is in charge of taking decisions about the target cluster for an I/O operation and about aspects related to data layout and management. The five modules of the autonomic system are:

1. Request Translation Module: This module translates a MAPFS I/O operation into a request to the System Prediction Module (steps 1 and 2). Basically, it analyses the kind of operation and, accordingly, asks the System Prediction Module for the optimum storage element. In some operations, such as read routines, the System Prediction Module only provides the “best” storage element among those in which the data are stored, without performing any prediction task. In create and write operations, the prediction task must be invoked.
2. System Prediction Module: In order to calculate the optimum storage element, the System Prediction Module queries monitored data (step 3) and uses a prediction method based on Markov chains, which is described in [14]. Basically, we use a probabilistic model based on past behaviour, which takes into account two types of parameters, measured at storage element level (cluster or individual node): (i) Basic parameters, such as the capacity of the hard disk of each server, network load (busy rate of the network), workload or disk bandwidth; (ii) Advanced parameters, which are configuration parameters that affect the performance of the autonomic system. One of the most significant advanced parameters is the time window (T), that is, the time period over which the system monitors its performance.
3. Monitored Data Retrieval System: This module retrieves the data queried by the System Prediction Module from the logs and monitored data storage. If this information is not available (this happens every T seconds), this module triggers the execution of the System Monitoring Module (step 4).
4. System Monitoring Module: Basic parameters are monitored by this module with the aim of improving the decisions taken by the autonomic system. This is done by querying the Monitoring portType of every grid service (step 5).
5. Taking Decision Module: According to the predicted system state (step 6), this module decides the target cluster and the action to be taken (step 7).

– MAPFS-Grid Service, with two portTypes: the access portType and the monitoring portType. As mentioned previously, the access portType is used for performing the file system operations, whereas the monitoring portType is used to obtain measures related to the storage element. A basic monitoring portType, whose functionality is querying performance parameters, is shown in Figure 2. The access portType is described in [11]. The Monitoring portType implementation uses MonALISA [6] and Ganglia [13]. MonALISA is a distributed monitoring system based on JINI/Java and WSDL/SOAP, whose main goal is to provide information about large distributed systems. MonALISA provides a web service interface and allows existing monitoring tools to be integrated. We have used Ganglia as a scalable monitoring tool for high-performance distributed systems, such as clusters or grids.
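The interplay of the five modules can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions, not the MAPFS-Grid implementation: every identifier (`MonitoredDataStore`, `choose_target`, `predict_load`, the `cpu_load` metric key) is hypothetical, each Monitoring portType is stood in for by a plain callable, and the prediction step is a placeholder that simply echoes the current load rather than the Markov-based method of [14].

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical names throughout: none of these identifiers come from MAPFS-Grid.
Metrics = Dict[str, float]        # e.g. {"cpu_load": 37.0}
Monitor = Callable[[], Metrics]   # stands in for a call to a Monitoring portType


@dataclass
class MonitoredDataStore:
    """Monitored Data Retrieval System plus System Monitoring Module (steps 3-5)."""
    monitors: Dict[str, Monitor]
    window_s: float                               # time window T, in seconds
    clock: Callable[[], float] = time.monotonic
    _data: Dict[str, Metrics] = field(default_factory=dict)
    _stamp: Optional[float] = None

    def query(self) -> Dict[str, Metrics]:
        now = self.clock()
        stale = self._stamp is None or now - self._stamp >= self.window_s
        if stale:
            # Step 4: trigger monitoring; step 5: query every storage element.
            self._data = {name: probe() for name, probe in self.monitors.items()}
            self._stamp = now
        return self._data


def predict_load(metrics: Metrics) -> float:
    """Placeholder for the System Prediction Module: echoes the current CPU load."""
    return metrics["cpu_load"]


def choose_target(operation: str, replicas: List[str], store: MonitoredDataStore) -> str:
    """Request translation (steps 1-2), prediction (steps 3, 6) and decision (step 7)."""
    data = store.query()
    if operation in ("create", "write"):
        # Writes may go to any storage element, ranked by predicted load.
        return min(data, key=lambda name: predict_load(data[name]))
    if operation == "read":
        # Reads are limited to elements holding a copy, ranked by current load.
        return min(replicas, key=lambda name: data[name]["cpu_load"])
    raise ValueError(f"unsupported operation: {operation}")
```

In this sketch the monitored data are refreshed at most once per time window T, which mirrors the role of the Monitored Data Retrieval System; a real deployment would replace the callables with calls to the grid services.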

Fig. 2. A very basic Monitoring PortType

4 Case study

This section shows some results obtained by our autonomic system, which allow us to draw some interesting conclusions that support our proposal. Through this analysis, our aim is to predict the future behaviour of our system in order to take the decision that best enhances system performance. Although the MAPFS-Grid autonomic system is able to measure several parameters from several clusters belonging to a grid environment, for the sake of simplicity we have used only one parameter, the CPU load, and two Intel Xeon 2.40 GHz nodes with 1 GB of RAM, interconnected by means of a 2 Gigabit network. The CPU load is measured under a normal workload. We consider the CPU load a crucial parameter in our data grid, since the server processes the requests. These results can be extrapolated to a more complex environment with different parameters (disk bandwidth, network load, etc.), depending on the specific requirements of the system.

[Figure 3: plot of CPU load, as CPU percentage (%) from 0 to 100, against period of time.]

Fig. 3. CPU load in node 1 during a time interval of 4 hours

[Figure 4: plot of CPU load, as CPU percentage (%) from 0 to 100, against period of time.]

Fig. 4. CPU load in node 2 during a time interval of 4 hours

Figures 3 and 4 show the workload of the nodes, measured by means of the System Monitoring Module. At first sight, we cannot decide which is the best node to write to in order to improve the performance of the next I/O requests.

            0–25%    25–50%   50–75%   75–100%
0–25%       42.0%    1.0%     0.0%     0.0%
25–50%      1.0%     7.0%     1.0%     0.0%
50–75%      0.0%     0.0%     0.0%     2.0%
75–100%     0.0%     1.0%     1.0%     55.0%

Table 1. Initial matrix of probabilities of transition between states for node 1
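A matrix such as Table 1 can be tabulated from a series of monitored CPU-load samples taken once per time window. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a sample that falls exactly on a band boundary (25, 50 or 75%) is assigned to the upper band, and each cell is expressed as a share of all observed transitions. The paper does not state how boundaries or normalisation are handled, so both choices are assumptions.

```python
from typing import List, Sequence

# Boundaries between the four CPU-load bands: 0-25, 25-50, 50-75 and 75-100 %.
BAND_EDGES = (25.0, 50.0, 75.0)


def band(cpu_load: float) -> int:
    """Map a CPU-load percentage to a state index 0..3 (boundaries go to the upper band)."""
    return sum(cpu_load >= edge for edge in BAND_EDGES)


def transition_counts(samples: Sequence[float]) -> List[List[int]]:
    """Count transitions between the bands of consecutive samples."""
    counts = [[0] * 4 for _ in range(4)]
    states = [band(s) for s in samples]
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    return counts


def as_percentages(counts: List[List[int]]) -> List[List[float]]:
    """Express every cell as a percentage of all observed transitions (an assumption)."""
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("at least two samples are required")
    return [[100.0 * c / total for c in row] for row in counts]
```

For example, `as_percentages(transition_counts(samples))` on a list of per-window CPU loads yields a 4×4 matrix with the same row and column layout as the table above.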

Our System Prediction Module uses a Markovian approach, as explained in [14]. With the aim of simplifying the resolution of the Markovian problem, we have defined four different states according to the CPU load percentage (0–25%, 25–50%, 50–75% and 75–100%). The system is in a given state when the computer workload lies between the two corresponding bounds. The initial matrices of probabilities of transition between states (Tables 1 and 2) are obtained from the collected data shown in Figures 3 and 4. The time window has been set to 2 minutes. Thus, we can obtain the probability of moving between states, or of remaining in the same state, every 2 minutes.

            0–25%    25–50%   50–75%   75–100%
0–25%       4.0%     1.0%     0.0%     0.0%
25–50%      0.0%     71.0%    2.0%     0.0%
50–75%      0.0%     0.0%     24.0%    1.0%
75–100%     1.0%     1.0%     0.0%     5.0%

Table 2. Initial matrix of probabilities of transition between states for node 2

We have solved this problem by means of the proposed Markovian approach, obtaining the corresponding solution vectors (Tables 3 and 4).

0–25%    25–50%   50–75%   75–100%
38.76%   8.1%     1.8%     51.3%

Table 3. Solution vector for node 1

0–25%    25–50%   50–75%   75–100%
4.31%    62.90%   22.40%   10.35%

Table 4. Solution vector for node 2

By comparing both vectors, we can select the resource that maximizes the expected reward in the longer term. This reward depends on the policy used. If we adopt a defensive attitude, we should write to the second node in order to improve future I/O requests, because node 1 has a higher probability of staying in a bad state (51.3% against 10.35% for the 75–100% band). However, if we use an aggressive policy, we could select node 1, because its probability of staying in the best state (0–25%) is higher (38.76% against 4.31%). Nevertheless, since we are dealing with heterogeneous environments, it is necessary to take into account that different computers can have different characteristics. In short, it would be advisable to define some rules for comparing different nodes. For instance, when measuring workloads, it could be useful to multiply the CPU speed by the probability of staying in a given state. This value would represent the expected CPU speed in that state.
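The comparison of solution vectors can be illustrated with a short Python sketch. It rests on assumptions and should not be read as the algorithm of [14]: the long-run vector is obtained here as the stationary distribution of the row-normalised transition matrix, computed by power iteration, and the two policies are one possible reading of "defensive" (minimise the probability of the 75–100% band) and "aggressive" (maximise the probability of the 0–25% band). All function names are hypothetical. With Table 1 as input the sketch gives roughly (0.387, 0.081, 0.018, 0.514), which is close to Table 3; it is not guaranteed to reproduce the paper's tables in general.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def row_normalise(matrix: Matrix) -> List[List[float]]:
    """Turn a matrix of transition frequencies into row-stochastic form."""
    result = []
    for row in matrix:
        total = sum(row)
        if total <= 0:
            raise ValueError("every state needs at least one outgoing transition")
        result.append([value / total for value in row])
    return result


def stationary(matrix: Matrix, tol: float = 1e-12, max_iter: int = 100_000) -> List[float]:
    """Long-run state probabilities by power iteration (pi <- pi P)."""
    p = row_normalise(matrix)
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


def select_node(vectors: Dict[str, Sequence[float]], policy: str) -> str:
    """Defensive: avoid the worst band. Aggressive: favour the best band."""
    if policy == "defensive":
        return min(vectors, key=lambda node: vectors[node][-1])
    if policy == "aggressive":
        return max(vectors, key=lambda node: vectors[node][0])
    raise ValueError(f"unknown policy: {policy}")


def expected_speed(cpu_ghz: float, vector: Sequence[float], state: int) -> float:
    """CPU speed weighted by the probability of being in the given state."""
    return cpu_ghz * vector[state]


# Values copied from Tables 1, 3 and 4 (percentages written as fractions).
TABLE_1 = [[42, 1, 0, 0], [1, 7, 1, 0], [0, 0, 0, 2], [0, 1, 1, 55]]
TABLE_3 = [0.3876, 0.081, 0.018, 0.513]      # solution vector for node 1
TABLE_4 = [0.0431, 0.6290, 0.2240, 0.1035]   # solution vector for node 2
```

Calling `select_node({"node1": TABLE_3, "node2": TABLE_4}, "defensive")` picks node 2 and the aggressive policy picks node 1, matching the discussion above; `expected_speed` implements the suggested weighting for heterogeneous nodes.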

5 Conclusions and future work

This paper has presented the MAPFS-Grid autonomic system, whose main goal is to provide autonomic features to data applications in grid environments. The system is composed of several modules: (i) the Request Translation Module, which translates a MAPFS I/O operation into a request to the System Prediction Module; (ii) the System Prediction Module, which calculates the optimum storage element and whose basis, a Markov-based prediction algorithm, is explained in detail in [14]; (iii) the Monitored Data Retrieval System, which retrieves the data queried by the System Prediction Module from the logs and monitored data storage; (iv) the System Monitoring Module, which is in charge of monitoring information and storing it in the corresponding database; and (v) the Taking Decision Module, which decides the target cluster and the action to be taken.

An analysis of the results obtained by our autonomic system has also been presented. For the sake of simplicity, these results are based on only one of the parameters that influence data applications. As future work, we plan to introduce both new parameters and new rules for taking decisions. Furthermore, it would be desirable to define several policies that allow us to take more or less aggressive decisions.

6 Acknowledgements

This research has been partially supported by Universidad Politécnica de Madrid under the project titled “MAPFS-Grid, a new Multiagent and Autonomic I/O Infrastructure for Grid Environments”.

References

1. A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23(3):187–200, 2000.
2. Edsger W. Dijkstra. The humble programmer. Commun. ACM, 15(10):859–866, 1972.
3. I. Foster and C. Kesselman, editors. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004.
4. IBM and autonomic computing. An architectural blueprint for autonomic computing, April 2003.
5. IBM’s Perspective on the state of information technology. http://www-1.ibm.com/industries/government/doc/content/resource/thought/278606109.html.
6. H.B. Newman, I.C. Legrand, P. Galvez, R. Voicu, and C. Cirstoiu. MonALISA: A distributed monitoring service architecture. In Proceedings of CHEP, La Jolla, California, March 2003.
7. Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
8. D. A. Patterson, G. Gibson, and R. H. Katz. A case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of ACM SIGMOD, pages 109–116, June 1988.
9. María S. Pérez, Jesús Carretero, Félix García, José M. Peña Sánchez, and Víctor Robles. A flexible multiagent parallel file system for clusters. In Peter M. A. Sloot, David Abramson, Alexander V. Bogdanov, Jack Dongarra, Albert Y. Zomaya, and Yuri E. Gorbachev, editors, International Conference on Computational Science, volume 2660 of Lecture Notes in Computer Science, pages 248–256. Springer, 2003.
10. María S. Pérez, Jesús Carretero, Félix García, José M. Peña Sánchez, and Víctor Robles. MAPFS-Grid: A flexible architecture for data-intensive grid applications. In F. Fernández Rivera, Marian Bubak, A. Gómez Tato, and Ramon Doallo, editors, European Across Grids Conference, volume 2970 of Lecture Notes in Computer Science, pages 111–118. Springer, 2003.
11. María S. Pérez, Alberto Sánchez, Pilar Herrero, and Víctor Robles. Engineering the Grid, chapter A new Approach for overcoming the I/O crisis in Grid environments. American Scientific Publishers, 2005.
12. IBM Research Autonomic Computing. http://www.research.ibm.com/autonomic/.
13. F. D. Sacerdoti, M. J. Katz, M. L. Massie, and D. E. Culler. Wide area cluster monitoring with Ganglia. In Proceedings of the IEEE Cluster 2003 Conference, 2003.
14. Alberto Sánchez and María S. Pérez. A mathematical predictive model for an autonomic system to grid environments. In Osvaldo Gervasi et al., editors, ICCSA (3), volume 3482 of Lecture Notes in Computer Science, pages 109–117. Springer, 2005.
15. Anthony Ira Wasserman. A top-down view of software engineering. SIGSOFT Softw. Eng. Notes, 1(1):8–14, 1976.