Application Availability Measurement in Computational Grid
Chunjiang LI, Nong XIAO, and Xuejun YANG
School of Computer, National University of Defense Technology, Changsha, 410073 China, +86 731 4575984
[email protected]
Abstract. A computational grid built on wide-area distributed computing systems is a highly variable and unreliable computing environment, so analyzing its availability is important. In our opinion, the availability of such an open computing environment should be analyzed from the applications’ perspective. In this paper, we propose an application-specific availability analysis method for the computational grid and then present a measurement model for application availability. This model abstracts the key factors that affect the availability of applications in the computational grid.
1
Introduction
Computational grids [1, 2] enable the coupling and coordinated use of geographically distributed resources for purposes such as large-scale computation, distributed data analysis, and remote visualization. It is necessary to predict how well a computational grid serves its applications, and availability analysis is one method for doing so. For computational grids, availability analysis should be done from the point of view of the applications. Because the computational grid is an open computing environment without boundaries, and no application uses all of the resources in it, an application’s availability depends entirely on the resources it obtains and on the recovery service provided by the grid service layer. In this paper, we propose a measurement model for application availability in the computational grid. This paper is organized as follows. Application availability is defined in Section 2. Section 3 reviews availability models in practice. Section 4 presents our application availability measurement model. Section 5 presents the conclusion and future work.
This work is supported by the National Science Foundation of China under Grant No.60203016 and No.69933030; the National High Technology Development 863 Program of China under Grant No.2002AA131010.
2
Application Availability
Traditionally, availability is defined as:

$$p = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$$

This works well for defining the availability of a single processor, disk, or other device, where the device is either "up" or "down". Application availability, in our opinion, can be defined as:

$$A = \frac{t_{run}}{t_{run} + t_{stall}}$$

where $t_{run}$ is the time that the application keeps running and $t_{stall}$ is the time spent recovering from resource failures.
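As a minimal sketch of the application availability ratio, the following Python function computes $A$ from two measured durations. The function name and the sample durations are assumptions made for illustration; they do not come from any measurement in this paper.

```python
def application_availability(t_run: float, t_stall: float) -> float:
    """Return A = t_run / (t_run + t_stall) for non-negative durations."""
    if t_run < 0 or t_stall < 0:
        raise ValueError("durations must be non-negative")
    total = t_run + t_stall
    if total == 0:
        raise ValueError("at least one duration must be positive")
    return t_run / total


# Hypothetical run: 95 hours of progress, 5 hours stalled in recovery.
print(application_availability(95.0, 5.0))  # 0.95
```

Both durations must use the same time unit; the ratio itself is dimensionless.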
3
Availability Models
Techniques for evaluating a system’s availability can be broadly categorized as measurement-based and model-based. Measurement-based evaluation requires building a real system, taking measurements, and then analyzing the data statistically, which is expensive. Model-based evaluation, on the other hand, is inexpensive and relatively easy to perform [3]. Model-based evaluation can use discrete-event simulation, analytic models, or hybrid models combining simulation and analytic parts. The main benefit of discrete-event simulation is the ability to depict detailed system behavior in the models. Its drawback is the long execution time, particularly when tight confidence bounds are required on the solutions obtained. Analytic models are more of an abstraction of the real system than a discrete-event simulation model. They are easier to develop and faster to solve. Their main drawback is the set of assumptions that are often necessary to make them tractable.

Analytic availability models can be categorized into non-state-space models and state-space models. The former include reliability block diagrams (RBDs) and fault trees. The latter include Markov models, stochastic Petri nets, and reward nets [4]. The two main assumptions used by non-state-space models are statistically independent failures and independent repair units for components. Although these two assumptions make non-state-space models less exact, such models are easier to develop and solve, and they can be evaluated to compute measures such as system availability, reliability, and system mean time to failure. In state-space models, by contrast, the state space explodes as the number of system components increases. For example, the number of states in the exact Markov chain for an n-processor VAXcluster is $O(n^3)$ [5]. The computational complexity of the exact analysis depends on the solution method used. For instance, if a full-storage iterative method is used, the time complexity is $O(n^6)$ [5]. Approximate methods are therefore used to reduce the complexity of state-space models. The probabilistic model is a good approximate method for availability modeling [6]. In the next section, we propose an approximate measurement model for application availability in the computational grid.
4
Application Availability Model
Suppose a task set $GA = \{GT_1, GT_2, \cdots, GT_m\}$ composes a grid application. The tasks of $GA$ are scheduled to $k$ grid sites; the number of tasks in grid site $i$ is $TN_i$, and $\sum_{i=1}^{k} TN_i = m$. The availability of application $GA$ can then be described by a serial RBD built from $C_i$ and $GN_i$, where $GN_i$ is a grid site and $C_i$ is the network connection between grid sites. The grid application $GA$ can run smoothly only when all $C_i$ and $GN_i$ are available. So the availability of $GA$ is:

$$A(GA) = \prod_{i=1}^{k} A(C_i)\,A(GN_i)$$
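The serial RBD product can be sketched in a few lines of Python. The function name and the availability values below are hypothetical; in this sketch the per-link and per-site availabilities are simply given as numbers.

```python
from math import prod


def serial_availability(link_avail, site_avail):
    """Series RBD: the application is up only if every C_i and GN_i is up."""
    if len(link_avail) != len(site_avail):
        raise ValueError("need one link availability per grid site")
    return prod(c * g for c, g in zip(link_avail, site_avail))


# Hypothetical values for an application spread over k = 3 grid sites.
links = [0.99, 0.98, 0.999]   # A(C_i)
sites = [0.95, 0.97, 0.90]    # A(GN_i)
print(f"A(GA) = {serial_availability(links, sites):.4f}")
```

Because every factor lies in [0, 1], the product can never exceed the weakest single factor, so spreading an application over more sites in series can only lower $A(GA)$.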
$A(C_i)$ is the availability of $C_i$. Suppose its failure rate is $\gamma_i$ and its repair rate is $\tau_i$; then $A(C_i) = \frac{\tau_i}{\gamma_i + \tau_i}$. $A(GN_i)$ is the availability of the $TN_i$ tasks in grid site $GN_i$; it can be measured by the probabilistic model. Suppose the total number of resources in grid site $GN_i$ is $RTN_i$ and the number of resources allocated to the application is $RN_i$, with each task mapped to one resource. The probability that a resource is in the "up" state is $Pw_i$, the probability of permanent failure for a single resource is $Pe_i$, and $Pw_i + Pe_i = 1$. $RE_i$ is the number of resources that are available. In the computational grid, the service layer supports migrating tasks between grid sites. When $RE_i < TN_i$, the tasks that cannot be served must migrate to other resources. $Pt_i$ is the availability of such a task. Suppose the total runtime of a task without failure is $t_{run}$ and, if its resource fails, the time to recover is $t_{rec}$; then $Pt_i$ can be defined as $\frac{t_{run}}{t_{rec} + t_{run}}$. In the computational grid, the most costly recovery method for a failed task is to restart it on a newly allocated resource. Here, for simplicity, we suppose that each task in a grid site has the same $Pt_i$; in practice, this value must be calculated for each failed task according to the recovery mechanism. Then the availability of the $TN_i$ tasks in grid site $GN_i$ is:

$$
\begin{aligned}
A(GN_i) &\equiv \Pr\{RE_i \ge TN_i\} + \Pr\{RE_i < TN_i \text{ and the failed tasks migrated successfully}\} \\
&= \sum_{k=TN_i}^{RN_i} \binom{RN_i}{k} Pw_i^{k}\, Pe_i^{RN_i-k} + \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}\, Pt_i^{TN_i-j} \\
&= (Pw_i + Pe_i)^{RN_i} - \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j} + \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}\, Pt_i^{TN_i-j} \\
&= 1 - \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j} + \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}\, Pt_i^{TN_i-j}
\end{aligned}
$$
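The per-site availability can be checked numerically. The sketch below evaluates both the direct two-sum form and the closed form that starts from 1, and compares them. The function names and the parameter values are assumptions chosen for illustration, and the sketch assumes $1 \le TN_i \le RN_i$.

```python
from math import comb


def _exactly_up(rn: int, j: int, pw: float) -> float:
    """Binomial probability that exactly j of rn resources are up."""
    return comb(rn, j) * pw**j * (1.0 - pw) ** (rn - j)


def site_availability(rn: int, tn: int, pw: float, pt: float) -> float:
    """Direct form: Pr{RE >= TN} plus the migration term for RE < TN."""
    if not 0 < tn <= rn:
        raise ValueError("this sketch assumes 1 <= TN_i <= RN_i")
    enough = sum(_exactly_up(rn, k, pw) for k in range(tn, rn + 1))
    migrated = sum(_exactly_up(rn, j, pw) * pt ** (tn - j) for j in range(tn))
    return enough + migrated


def site_availability_closed(rn: int, tn: int, pw: float, pt: float) -> float:
    """Closed form: 1 minus the shortfall probability plus the migration term."""
    shortfall = sum(_exactly_up(rn, j, pw) for j in range(tn))
    migrated = sum(_exactly_up(rn, j, pw) * pt ** (tn - j) for j in range(tn))
    return 1.0 - shortfall + migrated


# Hypothetical site: 6 allocated resources, 4 tasks, Pw = 0.9, Pt = 0.8.
print(site_availability(6, 4, 0.9, 0.8))
print(site_availability_closed(6, 4, 0.9, 0.8))
```

The two printed values should agree up to floating-point rounding, since the binomial probabilities over all $RN_i + 1$ outcomes sum to $(Pw_i + Pe_i)^{RN_i} = 1$.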
In conclusion, the availability of application $GA$ can be described as follows:

$$
\begin{aligned}
A(GA) &= \prod_{i=1}^{k} A(C_i) \times A(GN_i) \\
&= \prod_{i=1}^{k} A(C_i) \times \left(1 - \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j} + \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}\, Pt_i^{TN_i-j}\right) \\
&= \prod_{i=1}^{k} A(C_i) \times \left(A_{ra}(GN_i) + A_t(GN_i)\right)
\end{aligned}
$$

where

$$A_{ra}(GN_i) = 1 - \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}$$

$$A_t(GN_i) = \sum_{j=0}^{TN_i-1} \binom{RN_i}{j} Pw_i^{j}\, Pe_i^{RN_i-j}\, Pt_i^{TN_i-j}$$
$A_{ra}(GN_i)$ is determined by resource allocation, so we call it the resource allocation availability; $A_t(GN_i)$ describes the availability of the tasks that suffered resource failure, so we call it the task migration availability.
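To show how the pieces of the model combine, the following Python sketch evaluates the link availability, the resource allocation availability, and the task migration availability for each site, and then multiplies the per-site results. The `Site` record, the function names, and all numeric values are hypothetical and serve only as an illustration of the formulas.

```python
from dataclasses import dataclass
from math import comb, prod


@dataclass
class Site:
    gamma: float  # failure rate of the connection C_i (assumed, per hour)
    tau: float    # repair rate of the connection C_i (assumed, per hour)
    rn: int       # RN_i: resources allocated to the application
    tn: int       # TN_i: tasks scheduled to this site
    pw: float     # Pw_i: probability that a resource is up
    pt: float     # Pt_i: availability of a migrated task


def link_availability(s: Site) -> float:
    """A(C_i) = tau_i / (gamma_i + tau_i)."""
    return s.tau / (s.gamma + s.tau)


def shortfall_terms(s: Site) -> list:
    """Probability that exactly j < TN_i resources are up, for each j."""
    pe = 1.0 - s.pw
    return [comb(s.rn, j) * s.pw**j * pe ** (s.rn - j) for j in range(s.tn)]


def a_ra(s: Site) -> float:
    """Resource allocation availability A_ra(GN_i)."""
    return 1.0 - sum(shortfall_terms(s))


def a_t(s: Site) -> float:
    """Task migration availability A_t(GN_i)."""
    return sum(p * s.pt ** (s.tn - j) for j, p in enumerate(shortfall_terms(s)))


def grid_application_availability(sites) -> float:
    """A(GA) as the product over sites of A(C_i) * (A_ra + A_t)."""
    return prod(link_availability(s) * (a_ra(s) + a_t(s)) for s in sites)


sites = [
    Site(gamma=0.01, tau=0.99, rn=6, tn=4, pw=0.90, pt=0.80),
    Site(gamma=0.02, tau=0.98, rn=5, tn=5, pw=0.95, pt=0.70),
]
for s in sites:
    print(f"A(C)={link_availability(s):.4f}  A_ra={a_ra(s):.4f}  A_t={a_t(s):.4f}")
print(f"A(GA)={grid_application_availability(sites):.4f}")
```

Printing the two terms separately makes it visible whether a site's availability is limited mainly by how many resources were allocated or by how well failed tasks recover through migration.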
5
Conclusion and Future Work
In a computational grid, a measurement model for application availability is necessary in order to provide a high-availability service for applications. This paper presents such a model. We are now concentrating on constructing a high-availability service architecture for the grid platform, to make the computational grid more available to users. Much work remains to be done to implement multiple recovery mechanisms in this architecture.
References
1. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers (1999)
2. Foster, I.: The grid: A new infrastructure for 21st century science. Physics Today 54 (2002)
3. Archana Sathaye, S.R., Trivedi, K.: Availability models in practice. In: Proceedings of Int. Workshop on Fault-Tolerant Control and Computing (FTCC-1) (2000)
4. R. Sahner, A.P., Trivedi, K.S.: Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers (1995)
5. Ibe, O.: Validation of the approximate availability analysis of VAXclusters. Internal Report, Digital Equipment Corporation (1988)
6. O. Ibe, R.H., Trivedi, K.S.: Approximate availability analysis of VAXcluster systems. IEEE Transactions on Reliability 38 (1989) 146–152