An Efficient and Low Cost Monitoring System to Improve Availability and Reliability of Grid Services
Mohammad Javad Hosseini, Bahman Arasteh
Department of Computer, Sofian Branch, Islamic Azad University, Sofian, Iran
Abstract
In grid and cloud computing systems, resources are shared among a large number of applications and users. These systems provide an efficient environment for executing long-running distributed applications. A failure during a critical long-running application can waste considerable time and cost. This paper proposes an efficient job and resource monitoring service that attains the degree of availability and reliability required by long-life applications. In addition to quality of service, this work aims to minimize resource consumption and the cost of requested services in the economic grid. The dynamic nature of the proposed monitoring service improves the availability and reliability of grid resources and services with low resource consumption. An analytical (Markov) approach is used to analyze the effect of the service on the availability and reliability of grid services and resources in the presence of permanent and transient faults.
Keywords: Grid computing; long-life application; availability; reliability; resource consumption
1. Introduction
The grid system is a distributed infrastructure for sharing a large number of heterogeneous resources for cooperative problem solving [2, 3]. A node can be a desktop PC, a server, a cluster or even a supercomputer, and every node has its own resources. Grid computing has recently come to be regarded as an efficient environment for processing large-scale applications and is increasingly used for applications requiring higher levels of performance and dependability [1]. As an efficient distributed system with parallel processing capability, the grid is also a suitable framework for executing critical long-running applications that need high availability and reliability. These are applications in which failures are not acceptable and may waste considerable time and cost. In addition to timing constraints, critical long-mission applications need a high level of availability and reliability. Timely results and a high level of reliability are the main constraints in mission-oriented applications. Responsiveness and dependability are the main measures in high-availability applications; availability and safety are the main factors in long-running applications.

On the other hand, some features of grid computing, such as the heterogeneity of remote resources, network details and geographical distribution, can cause many transient and permanent faults. Failure of a component in a grid environment is therefore the rule, not the exception. The unavailability or unreliability of grid resources and services can lead to the failure of the associated applications, and the failure of a long-running application can require considerable time and cost for repair and recovery. The execution of these long-running applications therefore needs to be managed and monitored by the grid resource management system (GRMS).

We propose a monitoring service designed around the requirements of long-running applications. Together with the other basic services of Globus, this service can detect and recover from a high percentage of timing and content failures, such as Byzantine failures (coverage). Using this service in the Globus toolkit reduces the resource/service failure probability and consequently increases the mean time to failure (MTTF). The cost of the requested services, from both the user's and the resource provider's points of view in the economic grid, is another criterion considered in this work.

Section 2 reviews previously proposed work. Section 3 presents the proposed method, and Section 4 presents the evaluation results.
2. Related Works
There are several significant grid middleware frameworks, such as GrADS [5], Cactus [6] and GridWay [7], that support reliability and availability aspects of long-running jobs. In Globus [12], customers describe required resources in a resource specification language (RSL) based on a pre-defined schema of the resource database. Krauter and Buyya have proposed an abstract model of grid resource management systems [11]. In this model the RMS has several internal units and four connection ports to establish connections between RMSs and resources [11]. Prior works address some aspects of the problem. Some recent works present frameworks for the execution of traditional long-running applications [8, 9]. To maximize reliability, a heuristic technique for grid resource allocation has been proposed in [10]; that work focuses on optimal task partitioning and distribution in a grid service system. Thema provides a server wrapper for creating replicated web services and an external service for accessing an external web service safely [13]. Fry and Reiter [14] present a replication mechanism to improve service reliability that allows replicated objects to invoke operations on other, potentially replicated, objects. A dynamic and flexible scheduling model is proposed in [15] to improve the reliability of grid services and resources.

The technique proposed in this work improves the MTTF and consequently increases the availability and reliability of the services delivered by the grid. Given the dynamic infrastructure of the grid, the proposed failure detection method is aware of grid reconfiguration and can adapt to it. The cost of the requested services, from both the user's and the resource provider's points of view in the economic, market-based grid, is also considered. Finally, the technique reduces the probability of false alarms (false positives and false negatives).
3. Proposed Method
3.1. System Design
The grid infrastructure consists of layered software components deployed on different nodes [15]. A grid can be defined as a layer of networked services that allows users to access a distributed collection of computing, data, communication and application resources from any location. The Open Grid Services Architecture (OGSA) represents an evolution towards a grid system architecture based on Web services concepts and technologies. As a newer architecture for grid middleware, OGSA provides a more unified and simplified approach to grid applications. The Globus toolkit includes software services and libraries for resource monitoring, discovery and management, plus security and file management [2, 5]. Grid Resource Allocation Management (GRAM), the Monitoring and Discovery System (MDS) and the Grid Security Infrastructure (GSI) are the main components of the Globus toolkit. The heartbeat monitor (HBM) technique is used to handle faults and failures [3, 6, 8]. GRAM is responsible for managing local resources and comprises a set of Web services to locate, submit, monitor and cancel jobs on grid computing resources. MDS is the information services component of the Globus toolkit and provides information about the available resources on the grid and their status. GSI is a set of tools, libraries and protocols used in Globus to allow users and applications to access resources securely. MDS consists of two services: the Grid Resource Information Service (GRIS) and the Grid Index Information Service (GIIS). Together, GIIS and GRIS constitute the discovery service currently provided in OGSA and Globus.
Fig.1 Abstract model of Grid system with the corresponding service in each layer
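The heartbeat monitor (HBM) idea mentioned in Section 3.1 can be illustrated with a minimal Python sketch. It is not the Globus implementation: the class `HeartbeatMonitor`, its `timeout` parameter and the node names are hypothetical, and time stamps are passed in explicitly so that the example is deterministic.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per node (hypothetical helper, not a Globus API)."""
    timeout: float  # seconds of silence after which a node is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def beat(self, node: str, now: Optional[float] = None) -> None:
        """Record a heartbeat received from `node`."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected(self, now: Optional[float] = None) -> List[str]:
        """Return the nodes whose last heartbeat is older than `timeout`."""
        current = time.monotonic() if now is None else now
        return sorted(n for n, t in self.last_seen.items()
                      if current - t > self.timeout)


# Illustrative run with explicit time stamps (assumed values).
monitor = HeartbeatMonitor(timeout=5.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=4.0)
suspects = monitor.suspected(now=7.0)  # node-a silent for 7 s, node-b for 3 s
```

A node that is only suspected may still be alive behind a slow link, so the choice of `timeout` trades detection latency against false alarms.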
3.2. Efficient Monitoring System
Grid services are web services that are designed to operate in a grid environment and meet the requirements of grid computing; consequently, the grid computing platform can be defined as a collection of grid services. After a long-running application is submitted through a client, the broker invokes the middleware and core services to execute it. Service/resource discovery involves searching local and remote registries to discover services and resources that match the particular criteria of long-running applications. Long-running applications have quality metrics that must be considered during resource/service discovery and during execution. The required degree of availability and reliability and the remaining deadline of the long-mission application are important for discovering the candidate resources and services. Hence, the availability and reliability of resources and services directly affect the quality of the corresponding long-running applications. By means of MDS and GRAM, the GRMS tries to find and allocate sites that deliver dependable services satisfying the QoS requirements. But given the dynamic structure of grid systems, in which every resource can enter and leave the grid at any time, this is a complex task. Moreover, waiting for an optimal resource allocation leads to starvation and increases the total finish time. To satisfy the dependability and quality of grid services over an error-prone platform, we propose a redundancy technique that satisfies the required QoS. As a result, availability, reliability and workload must be considered in resource/service discovery for long-running applications. The availability of a resource or service refers to the ability of the application to access it, and reliability is defined as continuity of correct service [1]. Resources with a high degree of locality impose low communication and performance overhead.

\[ \text{Service Availability} = \frac{T_{\text{Operation}}}{T_{\text{Operation}} + T_{\text{Repair}}} \tag{1} \]

\[ \text{Service Availability} = \frac{MTTF}{MTTF + MTTR} \tag{2} \]

The mean time to failure (MTTF) is the average time that a service can perform its agreed function without interruption, and the mean time to repair (MTTR) is the mean elapsed time from the occurrence of an incident to the restoration of the service. To improve grid service/resource availability, the expression in Equation (2) must be maximized. We assume that the failure rate of each resource a service depends on is constant and that failures are independent. $\lambda_i$ is the failure rate and $\mu_i$ the repair rate of resource $i$ on which the corresponding service depends.

\[ \lambda = \sum_{i=1}^{n} \lambda_i , \qquad \mu = \sum_{i=1}^{n} \mu_i \tag{3} \]

\[ MTTF = \frac{1}{\lambda} = \frac{1}{\sum_{i=1}^{n} \lambda_i} , \qquad MTTR = \frac{1}{\mu} = \frac{1}{\sum_{i=1}^{n} \mu_i} \tag{4} \]

\[ \text{Service Availability} = \frac{1}{1 + \lambda / \mu} = \frac{1}{1 + \dfrac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{n} \mu_i}} \tag{5} \]

\[ MTTF \text{ of Service} = \int_{0}^{\infty} R(t)\,dt = \int_{0}^{\infty} e^{-\lambda t}\,dt = \int_{0}^{\infty} e^{-\sum_{i=1}^{n} \lambda_i t}\,dt = \frac{1}{\sum_{i=1}^{n} \lambda_i} \tag{6} \]

Given the dynamic structure of grid computing, each resource used by the running application can leave the grid environment at any time. Moreover, the probability of a fault, error or failure in each remote resource and in the network is not negligible. The unavailability or failure of a resource can halt the running application or cause it to fail. Hence, the main focus of this work is on monitoring the corresponding resources and services while the application is running. Many transient and permanent hardware faults can lead to failure of the application in the grid system.
When a grid resource or service fails, every service and application that depends on it fails too. The failure probability of a running application is therefore equivalent to the failure probability of the corresponding services and resources. In this case, the proposed monitoring service must detect the failure of operational resources and services during the execution of a long-running application.
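The relations for the aggregate rates, the MTTF and the service availability in Equations (3) to (5) can be checked numerically. The following Python sketch is only an illustration: the function names and the example rates (per hour) are assumptions, not values from the evaluation.

```python
from typing import Sequence


def service_mttf(failure_rates: Sequence[float]) -> float:
    """MTTF of a service that fails when any resource fails: 1 / sum(lambda_i)."""
    return 1.0 / sum(failure_rates)


def service_mttr(repair_rates: Sequence[float]) -> float:
    """MTTR with the aggregate repair rate used in the text: 1 / sum(mu_i)."""
    return 1.0 / sum(repair_rates)


def service_availability(failure_rates: Sequence[float],
                         repair_rates: Sequence[float]) -> float:
    """Availability 1 / (1 + lambda / mu) with lambda and mu summed over resources."""
    lam = sum(failure_rates)
    mu = sum(repair_rates)
    return 1.0 / (1.0 + lam / mu)


# Assumed example: two resources, rates expressed per hour.
failure_rates = [0.001, 0.003]
repair_rates = [0.1, 0.3]

mttf = service_mttf(failure_rates)
mttr = service_mttr(repair_rates)
avail = service_availability(failure_rates, repair_rates)
```

With these assumed rates the MTTF is about 250 hours and the MTTR about 2.5 hours, so the availability computed from the rates agrees with MTTF / (MTTF + MTTR).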
Fig.2 An overview of the monitoring service, which monitors the allocated resources and services during long-running applications and detects their unavailability and failure.
The availability and reliability of an application or service is a function of the availability and reliability of the allocated resources and services. Hardening the resources and services hardens the running applications. To improve the availability and reliability of resources we use a dynamic redundancy technique. In full redundancy, by contrast, the GRMS provides an alternate resource as a backup for each allocated resource. After a resource fails, the corresponding backup starts to run from the most recently saved checkpoint. This technique can detect most resource unavailability and failures, but it increases resource consumption and consequently the cost of the services requested by the user in the market-based grid. Moreover, the GRMS may spend more time allocating optimal and duplicated resources to the application, which leads to starvation and increases the total finish time. To trade off availability, reliability and the cost of the requested service, we use a dynamic and minimal redundancy technique. Based on the nature of the application, we define a criticality threshold for the allocated resources.
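One way to picture threshold-based partial redundancy is the Python sketch below. It is an illustration under stated assumptions: each resource carries a criticality weight normalized to [0, 1], a backup is allocated when the weight is at or above the threshold, and the names `Resource` and `select_for_redundancy` as well as the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resource:
    name: str
    criticality: float  # assumed weight in [0, 1]; higher means more critical


def select_for_redundancy(resources: List[Resource], threshold: float) -> List[str]:
    """Return the names of resources whose criticality reaches the threshold."""
    return [r.name for r in resources if r.criticality >= threshold]


# Assumed allocation for one application.
allocated = [
    Resource("cpu-cluster", 0.9),
    Resource("scratch-disk", 0.3),
    Resource("db-node", 0.7),
]

backed_up = select_for_redundancy(allocated, threshold=0.6)
extra_partial = len(backed_up)  # backups needed with partial redundancy
extra_full = len(allocated)     # backups needed with full redundancy
```

In this assumed allocation, partial redundancy needs two backups instead of three, which is the kind of saving in resource consumption that the threshold is meant to give.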
$W_i$ is the criticality of the functions assigned to resource $i$ in the corresponding application. The technique identifies the essential and critical functions of the application and quantifies them, which requires an application-level analysis. The failure of a critical function is more likely to change the program result. Hence, only the resources with a higher criticality factor are made redundant. Similarly, to minimize resource consumption the technique uses partial service hardening: only critical services are hardened. The service criticality factor is determined from the role of each service in the application. The unavailability or failure of a critical service is more likely to change the program result. The technique provides a mechanism to monitor the availability and failure of the grid services that play a critical role in the running application. After the monitoring service detects the failure or unavailability of a service during job execution, the GRMS automatically restarts the failed service on the same resource or, when necessary, on an alternate one. To improve the availability and reliability of critical services, the technique uses redundancy or hardening. With redundancy, there is a backup version of the service; when the master service becomes unavailable, the backup starts to serve the application from the last checkpoint.

The proposed scheme covers unavailability, timing failures and content failures of resources and services. In a timing failure, the deadline set for computing the result or delivering the requested service is not met; the detection service monitors time thresholds and checks the time-out and deadline constraints of the running application. Checking deadline constraints is an important part of failure detection, especially in real-time applications. Acceptance tests (AT) [16] are software modules executed on the outcome of the relevant independent services to confirm whether the results are reasonable. During the execution of the distributed functions of the running application, the monitoring system invokes the AT periodically to check the state and results of execution. If the output of the AT is true, the result of the corresponding function is considered valid and the monitoring system saves these intermediate results as checkpoints by means of other grid services such as MDS.
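The three failure classes described above (unavailability, timing failure and content failure) and the checkpoint rule can be sketched in Python as follows. This is a simplified illustration, not the proposed service itself: `monitored_step` and its parameters are hypothetical names, the deadline is checked only after the step returns rather than by interrupting it, and the checkpoint is kept in memory instead of being stored through a grid service.

```python
import time
from typing import Any, Callable, Tuple


def monitored_step(step: Callable[[], Any],
                   acceptance_test: Callable[[Any], bool],
                   deadline_s: float,
                   last_checkpoint: Any) -> Tuple[str, Any]:
    """Run one step and classify the outcome.

    Returns (status, checkpoint). The checkpoint is replaced only when the
    step finishes in time and its result passes the acceptance test.
    """
    start = time.monotonic()
    try:
        result = step()
    except Exception:
        # The service could not be reached or crashed.
        return "unavailable", last_checkpoint
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        # The result arrived after the deadline.
        return "timing_failure", last_checkpoint
    if not acceptance_test(result):
        # The result arrived in time but is not reasonable.
        return "content_failure", last_checkpoint
    return "ok", result


def non_negative(value: Any) -> bool:
    """Assumed acceptance test: a valid result is a non-negative number."""
    return isinstance(value, (int, float)) and value >= 0


status, checkpoint = monitored_step(lambda: 2 + 2, non_negative,
                                    deadline_s=1.0, last_checkpoint=None)
```

On any status other than `"ok"` the caller still holds the previous checkpoint, from which a backup service could resume.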
4. Evaluation Results
In this section we use an analytical approach to evaluate the quality of the proposed resource management system in the grid. We use a Markov chain to evaluate the availability, reliability and efficiency of the proposed technique [16].
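The Markov chain used in the evaluation is not given in the text. As a hedged illustration only, the simplest chain consistent with the constant failure rate λ and repair rate μ assumed in Section 3.2 is a two-state (up/down) model. The Python sketch below evaluates its standard closed-form point availability and reliability; the function names and the rates are assumptions and do not reproduce the figures.

```python
import math


def availability_at(t: float, lam: float, mu: float) -> float:
    """Point availability A(t) of a two-state Markov model that starts in 'up'.

    A(t) = mu / (lam + mu) + lam / (lam + mu) * exp(-(lam + mu) * t)
    """
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def reliability_at(t: float, lam: float) -> float:
    """Reliability R(t) = exp(-lam * t) for a constant failure rate and no repair."""
    return math.exp(-lam * t)


# Assumed rates per hour.
lam, mu = 0.004, 0.4
steady_state = mu / (lam + mu)  # limit of A(t) for large t, equal to 1 / (1 + lam / mu)
a_start = availability_at(0.0, lam, mu)
a_late = availability_at(1000.0, lam, mu)
r_100 = reliability_at(100.0, lam)
```

For large t the point availability settles at μ / (λ + μ), which is the same quantity as 1 / (1 + λ/μ), while the reliability keeps decaying because it does not credit repairs.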
Fig.3 MTTF of grid services with and without exploiting proposed monitoring technique
Fig.4 Availability of grid services with and without exploiting proposed monitoring technique
Fig.5 Reliability of grid services with and without exploiting proposed monitoring technique
This technique improves the MTTF of grid resources with a minimal degree of redundancy and cost. A higher MTTF improves the availability and reliability of the grid resources and services. The dynamic architecture of the proposed model reduces total service time and improves resource efficiency. The cost of the requested services, one of the main factors from the user's point of view, is minimized in this work.
5. Conclusion
This paper focuses on permanent and transient hardware faults in grid resources and services, which can lead to unavailability or failure of the corresponding running applications. Furthermore, any resource or service in a grid or cloud environment can become unavailable at any time. To improve the availability and reliability of grid resources and services, the proposed monitoring service exploits a minimal degree of redundancy. The technique improves the MTTF and consequently increases the availability and reliability of the services delivered by the grid. It is aware of grid reconfiguration and can adapt to it. The cost of the requested services, from both the user's and the resource provider's points of view in the market-based grid, is also considered in this work. The service can be embedded into the monitoring service of the Globus toolkit.
6. References
[1] A. Avizienis, J. Laprie, B. Randell, C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, pp. 11-33, 2004.
[2] I. Foster and C. Kesselman, "The Globus project: A progress report," in Proceedings of the Heterogeneous Computing Workshop, 1998, to appear.
[3] I. Foster, C. Kesselman, S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International J. Supercomputer Applications 15 (2001).
[4] I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke, "Grid Services for Distributed System Integration," Computer, vol. 35, no. 6, pp. 37-46, June 2002.
[5] S. S. Vadhiyar and J. J. Dongarra, "Self adaptivity in grid computing," Concurrency & Computation: Practice & Experience, vol. 2005, 2005.
[6] G. Allen, D. Angulo, I. Foster, G. Lanfermann, C. Liu, T. Radke, E. Seidel, and J. Shalf, "The cactus worm: Experiments with dynamic resource discovery and allocation in a grid environment," International Journal of High Performance Computing Applications, vol. 15, p. 2001, 2001.
[7] E. Huedo, R. S. Montero, and I. M. Llorente, "A framework for adaptive execution in grids," Softw. Pract. Exper., vol. 34, no. 7, pp. 631-651, 2004.
[8] L. Du, Y. Wu, and C. Wang, "Component based legacy program executing over grid," in GCC '07: Proceedings of the Sixth International Conference on Grid and Cooperative Computing. Washington, DC, USA: IEEE Computer Society, 2007, pp. 558-565.
[9] N. Markatchev, C. Kiddle, and R. Simmonds, "A framework for executing long running jobs in grid environments," in HPCS '08: Proceedings of the 2008 22nd International Symposium on High Performance Computing Systems and Applications. Washington, DC, USA: IEEE Computer Society, 2008, pp. 69-75.
[10] Y.S. Dai, X.L. Wang, "Optimal resource allocation on grid systems for maximizing service reliability using a genetic algorithm," Reliability Engineering and System Safety 91 (2006) 1071–1082.
[11] K. Krauter, R. Buyya, M. Maheswaran, "A taxonomy and survey of grid resource management systems for distributed computing," Software: Practice and Experience 2002; 32:135–164 (DOI: 10.1002/spe.432).
[12] OGSA, http://www.globus.org/ogsa/
[13] M. G. Merideth, A. Iyengar, T. Mikalsen, S. Tai, I. Rouvellou, and P. Narasimhan, "Thema: Byzantine Fault-Tolerant Middleware for Web-Service Applications," in Proc. 24th Symp. on Reliable Distributed Systems, pages 131–140, 2005.
[14] C. Fry and M. Reiter, "Nested Objects in a Byzantine Quorum-Replicated System," in Proc. 23rd Intl. Symp. on Reliable Distributed Systems, pages 79–89, 2004.
[15] B. Arasteh, M.J. Hosseini, "A Dependable and Efficient Scheduling Model for Critical Applications on Grid Systems," parelec, pp. 79-86, 2011 Sixth International Symposium on Parallel Computing in Electrical Engineering, Luton, United Kingdom, April 03-April 07.
[16] L. Pullum, "Software Fault Tolerance Techniques and Implementation," 2001, Artech House, Inc., 658 Norwood, MA 02062, ISBN: 1-58053-137-7.