
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 38, NO. 1, JANUARY 2008

Availability Modeling and Cost Optimization for the Grid Resource Management System

Yuan-Shun Dai, Min Xie, Fellow, IEEE, and Kim-Leng Poh

Abstract: Grid computing is a recently developed technique for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. Although the development tools and techniques for the grid have been extensively investigated, the availability of the grid Resource Management System (RMS) has not been comprehensively studied. To help fill this gap, this paper first models the grid RMS availability by considering both the failures of Resource Management (RM) servers and the length limitation of request queues. A hierarchical Markov reward model is implemented to evaluate the grid RMS availability. Based on the availability model, an optimization problem for designing the grid RMS is studied in order to minimize the cost by determining the best number of RM servers. A sensitivity analysis is then conducted, and a dynamic switching scheduling method is presented based on it.

Index Terms: Availability, grid computing, Markov models, optimization, reliability, Resource Management System (RMS).

Manuscript received March 29, 2005; revised November 27, 2005 and June 5, 2006. This paper was recommended by Associate Editor H. Pham.
Y.-S. Dai is with the Department of Industrial and Information Engineering, and with the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996 USA (e-mail: [email protected]).
M. Xie and K.-L. Poh are with the Department of Industrial and System Engineering, National University of Singapore, Singapore 117576 (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TSMCA.2007.909546

I. INTRODUCTION

Grid computing [1] is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration (see, e.g., [2]–[6]). The real and specific problem that underlies the grid concept is coordinated resource sharing and problem solving in dynamic multi-institutional virtual organizations [3]. The sharing we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources. This is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is highly controlled by the Resource Management System (RMS), with the resource providers and consumers defining what is shared, who is allowed to share, and the conditions under which the sharing occurs.

The basic structure and functions of the RMS in the grid have been introduced in detail [7], [8]. Krauter et al. [9] classified the existing RMSs into different categories based on their control property and applications. Most kinds of RMSs share a common service process, described as follows. First, the grid jobs/programs submit their requests for some resources to the RMS. The RMS adds these requests into the request queue [7]. Then, the RMS schedules and allocates those requests to different RM servers for matchmaking. In the matchmaking, the RM server discovers the requested resource [10] and then builds the connection between the request and the resource. Finally, the grid jobs/programs can access their requested resources through the constructed links.

The availability of the grid RMS is very important because it acts as the heart of grid computing [7], [8]. However, the grid RMS availability is different from the availability and reliability of conventional distributed/parallel systems [11]–[17]. Those conventional availability models cannot simply be applied to the analysis of grid RMS availability. This is because the availability models for those small-scale distributed systems usually do not consider the unavailability caused by the limited request queue. The number of job requests in a global grid may be much greater, particularly when the Open Grid Services Architecture (OGSA) is applied [18]. The OGSA enables the integration of services/resources by various organizations, and therefore, the requests for all these services/resources are numerous. The length limitation of the request queue is necessary because services delayed beyond their due time are often useless. In fact, the unavailability of the RMS caused by an overflowing request queue is sometimes much greater than that caused by failed servers, especially during the peak time. Thus, a new model that is suitable for the availability analysis of the grid RMS is required. To fulfill this requirement, this paper first systematically models and analyzes the grid RMS availability, which is defined as the probability for the grid RMS to be available in providing services for the grid jobs.
Two cases usually make the RMS unavailable: 1) all the Resource Management (RM) servers are down, and 2) the request queue is full so that no new requests can be added. Some initial studies have analyzed grid computing reliability. Dai et al. [19] first studied the grid computing reliability, which is defined as the probability for the grid to successfully execute the grid programs. Nevertheless, that work [19] considered only network failures and did not address the RMS availability. Later, Dai et al. [20] studied the service reliability for a wide-area distributed system that is one of the ancestors of the grid system. The function of the control center in that model is similar to that of the RMS for grid computing. However, the reliability analysis of the control center considered only a virtual machine whose reliability was set to a constant given the running time, and then fed this reliability directly into the distributed system reliability analysis. Different from the grid RMS for detecting

1083-4427/$25.00 © 2007 IEEE

DAI et al.: AVAILABILITY MODELING AND COST OPTIMIZATION FOR THE GRID RMS


Fig. 2. Markov model for the RM servers.

Fig. 1. Serving process of the RMS.

and matchmaking the requests with resources, this virtual machine itself participates in the computational work and immediately serves any request without a queue. As a result, that model is inappropriate for the grid RMS, which includes the matchmaking and the request queue. Thus, we extend the previous research on the grid and present a new availability model that is specifically suited to the grid RMS.

The organization of this paper is as follows. Section II describes the structure and functions of the general grid RMS, and then presents the RMS availability model. An example is also given to numerically derive the RMS availability. Section III studies one of the applications of the RMS availability, which is to determine the optimal number of RM servers. Furthermore, sensitivity analysis and a dynamic switching system are also studied. Section IV concludes this paper and discusses some possible improvements for future research.

II. AVAILABILITY OF THE GRID RMS

A. Description of the Grid RMS

The RMS is the “brain” of grid computing. It manages the pool of shared resources of the grid and matches the requests of different jobs to the resources. The serving process of the RMS is represented in general terms in Fig. 1. Grid jobs arrive at the RMS and request their needed resources. Translating the jobs is the first step, which abstracts resource requests out of the jobs and puts those translated requests into the request queue. The RMS then allocates those requests in turn to different RM servers for the matchmaking. Finally, the jobs can access those resources through the network.

The RMS is supported by multiple RM servers running in parallel (see, e.g., [9, Fig. 2]). Krauter et al. provide the basic functions of translating the service requests, scheduling the request queue, matchmaking requests and offers, and accessing resources. In the translation, the RM servers abstract the requests

for resources out of the jobs and identify their requirements (such as the deadline, budget, and other quality-of-service attributes), and then use a certain standard language or internal expressions to redescribe them. After the translation, those requests can be understood inside the RMS and then added into the request queue. The multiple RM servers collaborate to store and schedule those requests in the queue. They use some protocols to determine the priority of the requests according to their importance, emergency, profit, relationship, etc. [7]. The requests with higher priority queue in front of those with lower priority and are served earlier for matchmaking. The RM servers implement certain protocols to discover the shared resources and to match the requests to them, such as dissemination and discovery protocols [9], resource trading protocols [21], etc. After matchmaking, the sites of the requested resources are known. Given the site information, the access control of the RM servers can construct communication channels between them. Finally, the grid jobs/programs can reach and use those remote resources through the constructed links.

Without the RMS, the request queue has no medium in which to be stored, no scheduler runs to control the queue, and the jobs in the grid lose their destinations and cannot be completed. Thus, the availability of the RMS is crucial to grid computing. The grid RMS availability is defined as the proportion of time that the system is available to the users. There are mainly two conditions that cause the RMS to be unavailable to the jobs.

1) Since the RM servers may be down due to certain failures, if all the RM servers are down, the RMS is not available to receive or serve any grid job/program, and the request queue will no longer work.
2) The request queue is full, i.e., the new job’s request cannot be added to the queue.

Because the RM servers are always working or kept in hot standby while the grid system is in use, it is common for them to go down owing to software/hardware faults. If either the software or the hardware fails, the RM server is down. Then, maintenance personnel debug and repair the failed RM servers. Multiple RM servers running in parallel are used to improve the effectiveness and tolerate failures. These RM servers are usually homogeneous, with identical software and hardware equipment. If all RM servers are down, the functions of translating, receiving, scheduling, matchmaking, and accessing are no longer available, and as a result, no new requests can be received during the period.

The second type of unavailability usually takes place during the peak time, when many job requests arrive in a short time period. One might suggest not limiting the length of the request queue. However, most grid jobs have completion deadlines because a service delivered after its deadline is useless [21]. If the length of the request queue is not limited, during the peak time, most jobs cannot be completed in time after they have waited


for a long time in the queue. Thus, it is reasonable and common to limit the request queue length, which may cause the RMS to be unavailable to grid jobs when the queue is full. The availability of the grid RMS should therefore be analyzed based on the above two conditions, which differs from the conventional distributed system availability [15], [22]. The next subsection presents a new availability model for the grid RMS.

B. RMS Availability Model

Model Assumptions: The assumptions of the availability model are given as follows.

1) There are $N$ homogeneous RM servers running in parallel, and the length of the request queue cannot exceed $M$.
2) The RM servers may be down because of certain failures. The failure occurrence on each RM server follows a Poisson process with a failure rate $\lambda_s$. The failures of the different RM servers are independent of one another.
3) If any RM server is down, the repair process begins. The repair time follows the exponential distribution with the repair rate $\mu_s$ for each RM server.
4) The arrival of the grid jobs to the RMS follows the Poisson process with the arrival rate $\lambda_a$. Different grid jobs or programs may contain different numbers of resource requests, so the number of requests of an unknown job is a discrete random variable denoted by $X$. Denote the probability mass function for $X$ to take the value $x$ by

$$p(x) = \Pr(X = x), \qquad x = 1, 2, 3, \ldots \tag{1}$$

5) The service time for each RM server to complete a request is exponentially distributed with the parameter $\mu_r$.
6) If all the $N$ RM servers are down, no grid job can be served by the RMS. If the request queue would exceed its limitation of $M$ after adding the $X$ requests of a new grid job, the new job also cannot be completely served by the RMS. Then, the RMS availability is the proportion of the time when the above two cases do not occur.

For the second assumption, the failures of the RM servers are assumed to follow a Poisson process, which can be justified either in the operational phase [23] or in a steady state after a long run [24]. The assumption of independent failures in the RM servers is a good approximation to reality since the requests served by the different RM servers are uncorrelated. The third assumption, that the repair time follows the exponential distribution, has also been widely accepted [15], [25]. Assumption 4, that the arrival of grid jobs follows the Poisson process, can be justified as a memoryless process in which unknown users submit jobs. Allowing a job to contain requests for different resources is more general than allowing only a single resource. The fifth assumption, a random service time with exponential distribution, is also commonly accepted and is more general than assuming a fixed/constant service time. The reason is that matching different resources may take an RM server different service times, especially when real-time detection of resources is employed.

Hierarchical Markov Reward Model (MRM): The MRM differs from conventional Markov models in that it introduces an additional parameter for each state, called the reward. An MRM consists of a continuous-time Markov chain $X = \{X_t, t \ge 0\}$ on a finite state space $\chi$ and a reward function $r : \chi \to \mathbb{R}$. $X$ is completely described by its generator matrix and the initial probability vector. For every state $i \in \chi$, the reward $r(i)$ signifies the gain or the reward that will be obtained. For more details of the MRM, please refer to [25].

In order to analyze the RMS availability, a hierarchical MRM is implemented here. The first level of the Markov model is built for the RM servers, and the second level is constructed for the request queue. The interface between the two levels is the reward values of states.

For $N$ homogeneous RM servers, the birth–death process is modeled in Fig. 2 with failure rate $\lambda_s$ and repair rate $\mu_s$. State $k$ represents that the number of available RM servers is $k$ ($k = 0, 1, 2, \ldots, N$), and thus, the number of failed RM servers is $N - k$. Denote by $p_k$ the steady-state probability that the system stays at state $k$ ($k = 0, 1, \ldots, N$). It is simple to derive $p_k$ as

$$p_0 = \frac{1}{1 + \sum_{k=1}^{N} \dfrac{N!\,\mu_s^k}{k!\,(N-k)!\,\lambda_s^k}}, \qquad p_k = \frac{N!\,\mu_s^k}{k!\,(N-k)!\,\lambda_s^k} \cdot p_0 \quad (k = 1, 2, \ldots, N). \tag{2}$$
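As an illustration of (2), the following is a minimal Python sketch rather than the authors' implementation. The function name `server_state_probs` is hypothetical, and the parameter values are borrowed from the numerical example in Section II-C.

```python
from math import comb


def server_state_probs(n_servers, fail_rate, repair_rate):
    """Steady-state probabilities p_0..p_N of eq. (2).

    Index k is the number of working RM servers.
    """
    ratio = repair_rate / fail_rate
    # Unnormalised weights N! / (k! (N-k)!) * (mu_s / lambda_s)^k; the k = 0 weight is 1.
    weights = [comb(n_servers, k) * ratio ** k for k in range(n_servers + 1)]
    total = sum(weights)  # equals 1 + sum over k >= 1, i.e. 1 / p_0
    return [w / total for w in weights]


# Values taken from Section II-C: N = 15, lambda_s = 0.0008 /s, mu_s = 0.003 /s.
p = server_state_probs(15, 0.0008, 0.003)
print(f"p_0 = {p[0]:.3e}, p_15 = {p[15]:.4f}")
```

Because the weights are binomial coefficients times a power of $\mu_s/\lambda_s$, the result coincides with a binomial distribution in which each server is up with probability $\mu_s/(\lambda_s + \mu_s)$, which gives a quick sanity check on the output.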

In addition to the state probability, each state $k$ also has a reward value denoted by $v_k$ ($k = 0, 1, \ldots, N$), which is the probability for the request queue to be available for a new job given that $k$ RM servers are running. The values of the reward $v_k$ ($k = 1, 2, \ldots, N$) can be computed by the second level of the hierarchical model. Note that state $k = 0$ is special: all the RM servers are down, so its reward value is set to $v_0 = 0$, meaning that the request queue must be unavailable because no new requests can be received by the RM servers that support the various RM functions.

The second level models the process of the request queue given that $k$ RM servers are working. The Markov process is shown in Fig. 3 based on Assumptions 4 and 5. State $m$ ($m = 0, 1, \ldots, M$) represents the number of requests in the queue. According to Assumption 4, the grid job arrival rate is $\lambda_a$, and the number of requests in a job equals $x$ with probability $p(x)$. Then, the transition rate from state $m$ to state $m + x$ is $p(x)\lambda_a$ ($x = 1, 2, \ldots, M - m$). If at state $m$ a new job carries $x > M - m$ requests, adding all of them would push the request queue over its length limitation, so the RMS is unavailable to such new jobs, and the queue remains at state $m$ with rate $\left(1 - \sum_{x=1}^{M-m} p(x)\right)\lambda_a$, which is not marked in Fig. 3. According to Assumption 5, the completion rate of a request is $\mu_r$. If $m \le k$, the $m$ requests can be served immediately by the $k$ available RM servers, so the rate at which a request leaves the queue is $m \cdot \mu_r$. If $m > k$, only $k$ requests are being served by the $k$ available RM servers, so the leaving rate is $k \cdot \mu_r$.

Fig. 3. Markov model for the request queue.

Denote by $q_k(m)$ the steady-state probability that the queue stays at state $m$ ($m = 0, 1, \ldots, M$) given $k$ available RM servers. It is easy to derive $q_k(m)$ by solving the following Chapman–Kolmogorov equations:

$$\sum_{x=1}^{M} p(x)\lambda_a\, q_k(0) = \mu_r\, q_k(1) \tag{3}$$

$$\left( m \cdot \mu_r + \sum_{x=1}^{M-m} p(x)\lambda_a \right) q_k(m) = (m+1) \cdot \mu_r\, q_k(m+1) + \sum_{y=0}^{m-1} p(m-y)\lambda_a\, q_k(y), \quad (m = 1, \ldots, k-1) \tag{4}$$

$$\left( k \cdot \mu_r + \sum_{x=1}^{M-m} p(x)\lambda_a \right) q_k(m) = k \cdot \mu_r\, q_k(m+1) + \sum_{y=0}^{m-1} p(m-y)\lambda_a\, q_k(y), \quad (m = k, \ldots, M-1) \tag{5}$$

$$k \cdot \mu_r \cdot q_k(M) = \sum_{y=0}^{M-1} p(M-y)\lambda_a \cdot q_k(y) \tag{6}$$

$$\sum_{m=0}^{M} q_k(m) = 1. \tag{7}$$

For each state $m$, there is also a reward value denoted by $w_m$ that represents the probability that the remaining positions in the request queue are able to contain all the requests of an unknown new job. Given the $m$ requests in the queue, the number of remaining positions is $M - m$. If the number of requests of an unknown new job is not more than $M - m$, the RMS is available to the new job. As in Assumption 4, the number of resource requests of an unknown grid job is a discrete random variable $X$ with distribution $p(x)$. Then, the reward value at state $m$ in Fig. 3 can be obtained by

$$w_m = \Pr(X \le M - m) = \sum_{x=1}^{M-m} p(x). \tag{8}$$

Then the reward values of the first-level Markov model in Fig. 2 can be obtained by

$$v_k = \sum_{m=0}^{M} q_k(m) \cdot w_m \tag{9}$$

which represents the probability for the RMS to be available to the grid jobs given $k$ working RM servers. As in Fig. 2, the expected RMS availability given $N$ RM servers can therefore be obtained by

$$A(N) = \sum_{k=1}^{N} p_k v_k. \tag{10}$$

C. Numerical Example

A grid RMS has 15 homogeneous RM servers working simultaneously, i.e., $N = 15$. The failure rate of each RM server is $\lambda_s = 0.0008\ \mathrm{s}^{-1}$, and the repair rate is $\mu_s = 0.003\ \mathrm{s}^{-1}$. Suppose that at most 200 requests are allowed to wait in the request queue (i.e., $M = 200$), the arrival rate of the grid jobs to the RMS is $\lambda_a = 1.4\ \mathrm{s}^{-1}$, and the completion rate of each request by an RM server is $\mu_r = 0.8\ \mathrm{s}^{-1}$. As in Assumption 4, we suppose that the discrete random variable $X$ follows a uniform distribution

$$p(x) = \Pr(X = x) = \frac{1}{b}, \qquad x \in \{1, 2, 3, \ldots, b\} \tag{11}$$

where $b$ is an integer that represents the maximal number of requests in a grid job. Then the reward value for state $m$ can be computed by (8) to obtain

$$w_m = \Pr(X \le M - m) = \sum_{x=1}^{M-m} p(x) = \begin{cases} \dfrac{M-m}{b}, & (M - m \le b) \\ 1, & (M - m > b). \end{cases} \tag{12}$$

Please note that the uniform distribution used here is only for illustration, and other distributions can also be implemented in a similar way based on real conditions. Here, we suppose that $b = 10$. Then, substituting the above numerical values into (2)–(10), the grid RMS availability is calculated as $A(15) = 0.9934$, which means the expected probability is 0.9934 for the grid RMS to be available in serving the grid jobs.

The availability model and analysis are useful not only for evaluating the grid RMS availability but also for other problems, such as system design and optimization. The next section presents one such application of the availability evaluation, which is to optimally design the grid RMS.
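The two-level computation of (2)–(10) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' program: the function names are hypothetical, the service rate in queue state $m$ is taken as $\min(m, k)\cdot\mu_r$, and instead of solving (3)–(7) as a linear system the code uses the level-crossing (cut) balance equations. The swap relies on requests leaving the queue one at a time, so the flow down across each boundary must equal the flow of accepted jobs jumping up across it.

```python
from math import comb


def server_state_probs(n, lam_s, mu_s):
    """Eq. (2): probabilities p_0..p_n, index k = number of working servers."""
    weights = [comb(n, k) * (mu_s / lam_s) ** k for k in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def request_cdf(x, b):
    """Pr(X <= x) for the uniform job size of eq. (11)."""
    return min(max(x, 0), b) / b


def queue_reward(k, M, lam_a, mu_r, b):
    """v_k of eq. (9) for k >= 1 working servers."""
    q = [1.0]  # unnormalised q_k(0); normalised at the end, eq. (7)
    for m in range(M):
        # Up-flow across the cut {0..m} | {m+1..M}: a job arriving in state y
        # crosses it when m - y < X <= M - y (larger jobs are rejected).
        # States with m - y >= b contribute nothing, so they are skipped.
        up_flow = lam_a * sum(
            q[y] * (request_cdf(M - y, b) - request_cdf(m - y, b))
            for y in range(max(0, m - b + 1), m + 1)
        )
        # Down-flow from state m + 1 is min(m + 1, k) * mu_r * q(m + 1).
        q.append(up_flow / (min(m + 1, k) * mu_r))
    total = sum(q)
    # Reward w_m = Pr(X <= M - m), eqs. (8) and (12).
    return sum(q[m] / total * request_cdf(M - m, b) for m in range(M + 1))


def rms_availability(n, M, lam_s, mu_s, lam_a, mu_r, b):
    """A(N) of eq. (10); v_0 = 0, so the sum starts at k = 1."""
    p = server_state_probs(n, lam_s, mu_s)
    return sum(p[k] * queue_reward(k, M, lam_a, mu_r, b) for k in range(1, n + 1))


# Parameters of the example above: N = 15, M = 200, b = 10.
A15 = rms_availability(15, 200, 0.0008, 0.003, 1.4, 0.8, 10)
print(f"A(15) = {A15:.4f}")
```

The recursion only adds positive terms, which avoids the cancellation that a forward substitution through (4) and (5) could suffer when the queue is heavily loaded.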


III. OPTIMIZATION FOR DESIGNING THE GRID RMS

The RM servers are the heart of a grid system. Increasing the number of RM servers can easily improve the RMS availability and efficiency, but it is very costly, and sometimes, too many redundant servers are not necessary. On the other hand, too few RM servers may make the RMS often unavailable and leave the users unsatisfied, which results in a decrease in the number of customers and the amount of profit. Thus, in designing/developing the grid, there must exist an optimal number of RM servers. This section presents a way to optimally determine the number of RM servers in a grid RMS from both the availability and cost viewpoints.

A. Availability and Cost Analysis

The grid RMS provides services to the grid jobs for resources. Completing the service of a grid job brings some profit to the grid RMS. The expected profit for serving a job is denoted by $c_p$. If the RMS is unavailable for a job, the RMS loses the profit $c_p$, which is defined here as the unavailability cost. Given a period of time $\Delta t$, the expected number of jobs is $\lambda_a \cdot \Delta t$, where $\lambda_a$ is the arrival rate of the grid jobs, the same as depicted in Fig. 3. Then the expected unavailability cost during this time period can be computed by

$$C_1 = \{1 - A(N)\} \cdot \lambda_a \cdot \Delta t \cdot c_p \tag{13}$$

where $A(N)$ is the RMS availability to the jobs, and thus $\{1 - A(N)\} \cdot \lambda_a \cdot \Delta t$ is the expected number of jobs that are lost due to the unavailability of the RMS.

In addition to the unavailability cost, there is another cost that is also related to the number of RM servers. It is the server cost for preparing/running/maintaining the RM servers [26]. The expected cost of running each RM server per unit of time is denoted by $c_s$, which includes the cost of wear, power usage, maintenance, and so on. Since the RM servers are assumed to be homogeneous here, during the same period of time $\Delta t$, the total server cost for the $N$ RM servers can be simply obtained by

$$C_2 = N \cdot c_s \cdot \Delta t. \tag{14}$$

Thus, the total cost is the sum of the unavailability cost (13) and the server cost (14), which is a function of $N$ (the number of RM servers), i.e.,

$$C(N) = \{1 - A(N)\} \cdot \lambda_a \cdot \Delta t \cdot c_p + N \cdot c_s \cdot \Delta t. \tag{15}$$

The RMS availability is an increasing function of the number of RM servers, so the first part of (15) decreases with $N$, and the second part obviously increases with $N$. Hence, $N$ should be optimally selected in order to minimize the total cost of (15). Usually, there is a limitation on the number of RM servers (denoted by $N_{\max}$) due to the budget bound, facility limit, and so forth. Thus, the optimization model for the design problem can be built as

$$\text{Objective: Minimize } C(N) = \{1 - A(N)\} \cdot \lambda_a \cdot \Delta t \cdot c_p + N \cdot c_s \cdot \Delta t \tag{16}$$

$$\text{Subject to: } N = 1, 2, 3, \ldots, N_{\max}. \tag{17}$$

This optimization model is numerically solvable. Usually, the maximal number of RM servers ($N_{\max}$) is not too large, so the complexity of an exhaustive search is tolerable, and it can be applied to find the optimal solution for this problem. However, $N_{\max}$ may be very large or even infinite (such as in the extreme case of no budget bound). To improve the computational efficiency, we present a fast searching algorithm that does not need to search all the values of $N$ from 1 to $N_{\max}$. The following are the steps of the fast searching algorithm:

Algorithm 1:
Step 1: Let $N$ start from 1 and then increase by one at each iteration.
Step 2: Use (10) to compute the availability function $A(N)$ and then substitute it into (16) to obtain the cost $C(N)$.
Step 3: If

$$A(N) > 1 - \frac{c_s}{\lambda_a \cdot c_p} \tag{18}$$

then continue to Step 4. Otherwise, increase $N$ by one and repeat Steps 2 and 3 until $N$ reaches $N_{\max}$.
Step 4: Output the minimal value of the cost out of all the computed $C(N)$, i.e., $\min(C(N))$, and the corresponding optimal number of RM servers ($N^*$). This is the optimal solution; stop.

To validate Algorithm 1, the convergence and the existence of the optimum are proven in Lemmas 1 and 2. We prove them under the most stringent condition, in which the number of hosts has no upper bound, because otherwise Algorithm 1 naturally terminates without an infinite loop (it reaches $N_{\max}$ at most).

Lemma 1: There exists a finite integer $N_0$ that satisfies the inequality (18) in Algorithm 1.
Proof: See Appendix A.

Lemma 2: The minimum cost (i.e., the optimum) must occur before the termination criterion of inequality (18) in Algorithm 1 is met.
Proof: See Appendix B.

Therefore, Lemma 1 guarantees that the fast algorithm terminates within a finite number of iterations, and Lemma 2 guarantees that the output of the fast algorithm must be the optimum. A numerical example using this fast searching algorithm is given in the next subsection. In addition, if the function $C(N)$ in (16) is observed to be a monotonically increasing function of $N$, then the optimal solution can be directly obtained as $N^* = 1$.
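The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' program: `fast_search` is a hypothetical name, the availability model is passed in as a callable standing for (10), and the cost is evaluated per unit time ($\Delta t = 1$), which does not change the minimizer. The availability curve `toy_availability` is invented purely to exercise the search and is not taken from the paper.

```python
def fast_search(availability, lam_a, c_p, c_s, n_max):
    """Return (N*, C(N*) per unit time) following Steps 1-4 of Algorithm 1."""
    threshold = 1.0 - c_s / (lam_a * c_p)  # right-hand side of (18)
    best_n, best_cost = None, float("inf")
    for n in range(1, n_max + 1):                      # Step 1
        a_n = availability(n)                          # Step 2: A(N), eq. (10)
        cost = (1.0 - a_n) * lam_a * c_p + n * c_s     # eq. (16) with delta t = 1
        if cost < best_cost:
            best_n, best_cost = n, cost
        if a_n > threshold:                            # Step 3: stop criterion (18)
            break
    return best_n, best_cost                           # Step 4


def toy_availability(n):
    """Hypothetical increasing availability curve, for illustration only."""
    return 1.0 - 0.5 ** n


n_star, min_cost = fast_search(toy_availability, lam_a=1.4, c_p=0.55, c_s=0.01, n_max=1000)
print(n_star, round(min_cost, 6))
```

With this toy curve the loop stops at $N = 7$, the first value satisfying (18), and returns $N^* = 6$. The stopping rule is plausible on its face: once (18) holds, the most that any further server can recover is $\{1 - A(N)\}\cdot\lambda_a\cdot c_p$ per unit time, which is already smaller than the cost $c_s$ of one more server.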


B. Numerical Example

Suppose that a grid needs to build an RMS. After investigation, the following parameters are obtained: The expected server cost per second of an RM server is $c_s = 0.01$ dollars/s, and the expected profit for serving a job is $c_p = 0.55$ dollars/job. Due to the resource and budget limits, the maximal number of RM servers is $N_{\max} = 1000$. The expected deadline for a grid job to be completed is $T_d = 10$ s. The failure rate of each RM server is $\lambda_s = 0.0008\ \mathrm{s}^{-1}$, and the repair rate is $\mu_s = 0.003\ \mathrm{s}^{-1}$. The arrival rate of the jobs to the RMS is $\lambda_a = 1.4\ \mathrm{s}^{-1}$, and the completion rate of each request by an RM server is $\mu_r = 0.8\ \mathrm{s}^{-1}$. The parameter of the uniform distribution in (11) is assumed to be $b = 10$.

As explained in Section II, the length of the request queue is limited so that most jobs can be completed in time. Supposing that the expected deadline for most grid jobs is $T_d$, we therefore set a constraint that the expected waiting time $T_w$ for the last request in the full queue is no longer than the expected deadline, i.e.,

$$T_w = \frac{M}{k \cdot \mu_r} \le T_d \tag{19}$$

where $k$ is the number of working RM servers ($k = 1, 2, 3, \ldots, N$). Then, the limitation of the queue length $M$ should satisfy

$$M \le T_d \cdot k \cdot \mu_r. \tag{20}$$

The larger the queue length limit, the higher the RMS availability. Hence, in order to maximize the RMS availability without violating (19) or (20), the value of $M$ can be dynamically set as

$$M = [T_d \cdot k \cdot \mu_r] \quad (k = 1, 2, \ldots, N) \tag{21}$$

where $[X]$ represents the largest integer that is no more than $X$. Please note that although we adopt the dynamic relationship of (21) in this example, other relationships can also be implemented according to the requirements of real conditions. For example, $M$ may be static, i.e., a fixed number preset by the RMS, as in the example in Section II-C. Thus, substituting the numerical values into (21), we get the length limitation of the request queue as

$$M = [T_d \cdot k \cdot \mu_r] = 8 \cdot k \quad (k = 1, 2, \ldots, N).$$

Substituting the above values of the respective parameters into (15), we get the total expected cost per second as

$$C(N) = 0.77\,\{1 - A(N)\} + 0.01N.$$

In order to minimize the total expected cost, the optimization model [(16) and (17)] is applied, and a program implementing the fast searching algorithm is written to solve this optimization problem. The fast searching algorithm stops at $N = 15$ using the stop criterion of the inequality (18). The optimal solution is $N^* = 14$, i.e., the most economical number of RM servers is 14, with the minimal cost of 0.1536 dollars/s.
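The two substitutions used in this example, the dynamic queue limit (21) and the per-second cost (15), can be checked with a short Python sketch. The names `queue_limit` and `cost_per_second` are hypothetical helpers introduced here for illustration; the constants are the parameter values listed above.

```python
import math

T_D, MU_R = 10.0, 0.8               # deadline (s) and per-server completion rate (1/s)
LAM_A, C_P, C_S = 1.4, 0.55, 0.01   # arrival rate (1/s), profit per job, server cost per second


def queue_limit(k):
    """Eq. (21): largest integer M with M / (k * mu_r) <= T_d."""
    # The small epsilon guards against floating-point round-off just below an integer.
    return math.floor(T_D * k * MU_R + 1e-9)


def cost_per_second(availability, n):
    """Eq. (15) divided by delta t: unavailability cost plus server cost."""
    return (1.0 - availability) * LAM_A * C_P + n * C_S


print([queue_limit(k) for k in (1, 2, 14)])
print(round(LAM_A * C_P, 2))
```

Read the other way, the reported minimum of 0.1536 dollars/s at $N = 14$ corresponds under this formula to $A(14) \approx 1 - (0.1536 - 0.14)/0.77 \approx 0.982$, which lies below the stop threshold $1 - 0.01/0.77 \approx 0.987$ of (18) and is therefore consistent with the search continuing to $N = 15$.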

Fig. 4. Cost functions to the different number of RM servers.

To show the trend, Fig. 4 depicts the cost function for different numbers of RM servers from 1 to 40. It can be observed that the cost function decreases very sharply at first and then increases after $N^*$. It can also be observed that when $N$ is large enough (e.g., about 20 here), the cost function $C(N)$ increases linearly with $N$. This is because, given a large number of RM servers, the availability $A(N)$ is very close to 1, so that the unavailability cost (13) is negligible compared to the server cost (14). If the server cost dominates the unavailability cost, the total cost function (15) is well approximated by a simple linear function of the number of RM servers.

C. Sensitivity Analysis and Dynamic Switching Scheduling

In the above decision problem, one parameter may fluctuate more than the others in reality: the arrival rate of the jobs $\lambda_a$. Although using the mean value of the arrival rate in the decision problem can be justified, we cannot guarantee that the static configuration (the number of RM servers) is always optimal. Hence, a sensitivity analysis of the optimal solution with respect to the parameter $\lambda_a$ is first conducted in this subsection, and then a dynamic switching system that can guarantee the optimum most of the time is presented.

Sensitivity Analysis: The goal of the sensitivity analysis is to show the robustness of the optimal solution to deviations of the parameter $\lambda_a$. Hence, after obtaining the optimal solution $N^*$, we now present a way to get the marginal values of the parameter $\lambda_a$ at which the optimal solution changes to another one. A marginal value is a value of $\lambda_a$ for which both $N^*$ and one of its neighbors, either $N^* - 1$ or $N^* + 1$, are optimal, i.e., they have the same total cost. Thus, the two marginal values of $\lambda_a$ can be obtained by solving the following two equations:

$$C(\lambda_a^+ \mid N^*) = C(\lambda_a^+ \mid N^* + 1) \tag{22}$$

$$C(\lambda_a^- \mid N^*) = C(\lambda_a^- \mid N^* - 1) \tag{23}$$

where $\lambda_a^+$ is the upper bound of the marginal values, and $\lambda_a^-$ is the lower bound. Both equations are numerically solvable. Thus, if the parameter $\lambda_a$ changes within the range $(\lambda_a^-, \lambda_a^+)$, the optimal number of RM servers remains at $N^*$.

We continue the above example, which has an optimal solution of $N^* = 14$. Using (22) and (23), we obtain the range of $\lambda_a$ that keeps $N^*$ unchanged as $\lambda_a \in (1.357, 1.466)$. Now, we let $\lambda_a$ change from 0.8 to 2. The optimal solutions $N^*$ are depicted in Fig. 5. From Fig. 5, we can see that if $\lambda_a$ does not deviate too much, the initial configuration is mostly optimal or near optimal. Nevertheless, $\lambda_a$ may change too much and too frequently; to keep the configuration optimal most of the time in that case, a dynamic switching scheduling based on the sensitivity analysis is presented next.

Fig. 5. Sensitivity of $N^*$ to the arrival rate $\lambda_a$.

Dynamic Switching Scheduling: In reality, during different time periods, the arrival rates of jobs vary, so the deviation of $\lambda_a$ could be very large. Hence, the number of running RM servers should not be fixed if the grid RMS is to be optimally configured. Based on the above sensitivity analysis, a dynamic switching scheduling is suggested, which can keep the configuration optimal most of the time.

In the system, there are two types of RM servers, i.e., 1) hot running RM servers and 2) cold standby RM servers. The hot running RM servers are those that are working to serve the requests. The cold standby RM servers are those that are not working but waiting to be switched on. If the switching system switches some cold standby RM servers on, those servers become hot running RM servers. On the other hand, if some hot running servers are switched off, they become cold standby RM servers. A switching system controls the switching on or off among the hot/cold RM servers.

Initially, we need to determine the total number of RM servers, including both hot running and cold standby servers. Suppose that $\lambda_a$ is a random variable that follows a certain distribution

$$F(x) = \Pr(\lambda_a < x). \tag{24}$$

Denote by $\alpha$ the confidence probability that the total number of RM servers is enough for optimization given a random $\lambda_a$. Then, given the confidence level $\alpha$, we can obtain the upper bound of the arrival rate $\lambda_{\max}$ by solving

$$F(\lambda_{\max}) = \alpha \tag{25}$$

and then substitute the solution $\lambda_{\max}$ into (16) to obtain $N_T$, the total number of RM servers that guarantees the optimum with a confidence level of $\alpha$. The switching system operates on these $N_T$ RM servers. It periodically monitors the arrival rate $\lambda_a$ and, according to the sensitivity map (like Fig. 5), dynamically schedules the number of hot running RM servers. Thus, such a switching system can guarantee the optimal configuration of the RMS with the confidence level $\alpha$.

For example, continuing the above case, suppose that the arrival rate $\lambda_a$ follows a normal distribution with mean $\mu = 1.4$ and standard deviation $\sigma = 0.39$. If the required confidence level is at least $\alpha = 90\%$, $\lambda_{\max}$ is obtained by solving (25), where $F(x)$ is the normal distribution $N(\mu, \sigma)$, which gives $\lambda_{\max} = 1.8998$. Substituting this value into (16) gives the total number of RM servers as $N_T = 18$. Thus, with 90% confidence, the 18 RM servers can keep the RMS configuration optimal under dynamic switching scheduling. As in the example in Section III-B, when the instantaneous arrival rate is $\lambda_a = 1.4$, the switching system schedules 14 RM servers as hot running and the other four as cold standby.

IV. COMPARISONS AND DISCUSSIONS

A. Significance and Comparisons

Grid computing is a recently emerging technique for large-scale resource sharing, in which the RMS plays an important role. Although the development tools and techniques for the grid have been extensively investigated, the availability of the grid RMS has not been comprehensively studied. To fill this gap, this paper is the first to comprehensively and systematically model and analyze the grid RMS availability.

Many prior models [11]–[17] have been presented for system availability but not specifically for grid RMS availability. They consider the system unavailable only at the moment when the servers or other software/hardware components malfunction.
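The sizing and scheduling steps of the dynamic switching example in Section III can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the normal quantile is computed with Python's standard `statistics.NormalDist`, the value $N_T = 18$ is taken from the text because (16) is not reproduced here, and the sensitivity map contains only the single interval stated in the text (the remaining intervals would have to be read off Fig. 5).

```python
from statistics import NormalDist


def arrival_rate_upper_bound(mean, std_dev, alpha):
    """Solve F(lambda_max) = alpha, as in (25), for a normal arrival rate."""
    return NormalDist(mu=mean, sigma=std_dev).inv_cdf(alpha)


def hot_servers(arrival_rate, sensitivity_map, total_servers):
    """Return the number of hot running servers for a monitored arrival rate.

    sensitivity_map is a list of (lower, upper, n_star) intervals; the
    result is capped at the total number of available servers N_T.
    """
    for lower, upper, n_star in sensitivity_map:
        if lower < arrival_rate < upper:
            return min(n_star, total_servers)
    raise LookupError("arrival rate outside the known sensitivity map")


# Values from the worked example in the text.
lambda_max = arrival_rate_upper_bound(mean=1.4, std_dev=0.39, alpha=0.90)
TOTAL_SERVERS = 18  # N_T from the text; obtained there by substituting into (16)

# Only this interval is stated in the text; other rows are left out on purpose.
SENSITIVITY_MAP = [(1.357, 1.466, 14)]

hot = hot_servers(1.4, SENSITIVITY_MAP, TOTAL_SERVERS)
standby = TOTAL_SERVERS - hot
print(f"lambda_max = {lambda_max:.4f}, hot = {hot}, cold standby = {standby}")
```

With the example's parameters, the quantile agrees with the value $\lambda_{\max} = 1.8998$ quoted in the text, and the lookup reproduces the split into 14 hot running and four cold standby servers.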
However, these models do not consider that, even when every component is in a good state, the system can still be unavailable to users because there are too many requests, i.e., more than the system is capable of handling. Our availability model considers both RM server failures and request queue limitations. It is more suitable for the grid computing service due to the large and open nature of the grid. In addition, this model can similarly be applied to analyze the unavailability caused by denial-of-service attacks that make the request queue overflow (such as SYN attacks that overflow the buffer in order to block other valid users).

To analyze both factors, the prior single-level Markov models [15] cannot be directly applied for the modeling and evaluation. Therefore, we extended them to a hierarchical MRM for analyzing the grid RMS availability. There are two levels in this hierarchical model. The first level models the failure and repair processes of the RM servers, and the second level models the arrival and completion processes of the requests. The connection between the two levels is the reward value on each state.

As another contribution of this paper, an optimization problem of the grid system design, namely determining the number of RM servers that minimizes the total cost, was solved. A fast search algorithm was further presented to obtain the optimal solution efficiently, instead of an exhaustive search. In reality, the arrival rate of jobs may fluctuate, so the sensitivity of the optimal solution to changes in the arrival rate was analyzed. For the case in which this parameter changes too frequently or too much, a dynamic switching scheduling system was suggested in order to keep the configuration optimal most of the time.

B. Discussions

In this availability model, the RMS that serves the requests of grid services uses one common request queue scheduled for multiple RM servers. It is also possible that each RM server has its own request queue. This condition is covered by the general structure of our RMS model. If the request queue of an RM server does not interact with other RM servers' request queues, then the grid service requests that reach this RM server's queue can be analyzed by our model assuming $N = 1$ (i.e., one RM server), which is a reduced case of our general RMS model.
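To make the comparison between one common queue and separate per-server queues concrete, the following sketch computes the probability that an arriving request finds the queue full in both arrangements. It is only an illustration under assumptions that are not taken from the paper: Poisson arrivals split evenly among the separate queues, exponential service, no server failures, a capacity that counts requests in service as well as waiting, and hypothetical parameter values. The function name is likewise made up for this example, and the formula is the standard birth-death result for a finite-capacity multi-server queue rather than the paper's hierarchical model.

```python
from math import factorial


def blocking_probability(arrival_rate, service_rate, servers, capacity):
    """Probability that a finite-capacity multi-server queue is full.

    capacity counts the requests in service plus those waiting.
    """
    load = arrival_rate / service_rate
    weights = []
    for n in range(capacity + 1):
        if n <= servers:
            # All n requests are being served in parallel.
            weights.append(load**n / factorial(n))
        else:
            # Every server is busy; n - servers requests are waiting.
            weights.append(
                load**n / (factorial(servers) * servers ** (n - servers))
            )
    return weights[-1] / sum(weights)


# Hypothetical parameters, chosen only for illustration.
SERVERS = 2
PER_SERVER_CAPACITY = 3
ARRIVAL_RATE = 1.6
SERVICE_RATE = 1.0

# Separate queues: each server sees half of the arrivals and its own bound.
separate = blocking_probability(
    ARRIVAL_RATE / SERVERS, SERVICE_RATE, 1, PER_SERVER_CAPACITY
)
# Common queue: all servers share one queue whose bound is the sum of the
# separate bounds.
common = blocking_probability(
    ARRIVAL_RATE, SERVICE_RATE, SERVERS, SERVERS * PER_SERVER_CAPACITY
)
print(f"separate queues: {separate:.4f}, common queue: {common:.4f}")
```

Under these assumed values the common queue blocks fewer requests than the separate queues, which is the direction of the effect described in this subsection.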
Furthermore, it is preferable to permit interaction among the different request queues; e.g., if some requests of a grid service are blocked by one full queue, they can be transferred to another RM server's queue that is not full. Our model already has this advantage because it uses one common queue, whose upper bound can be set to the sum of the upper bounds of the separate queues of all the RM servers. In addition, a common queue balances the load across the RM servers, i.e., it avoids the unbalanced case in which one RM server is idle while another has many requests waiting for service.

In deriving the grid availability, we also assumed that the different RM servers are independent. This assumption can be justified because the grid is a wide-area system: different servers are located far away from one another, or the requests come from different, uncorrelated users, so their interaction is slight enough to be negligible. Although this assumption is a good approximation to reality, under certain circumstances failure correlation among different parts may still exist, and its influence could cause evaluation errors [27], [28]. Thus, future research can further consider failure correlation among the different RM servers.


In addition to the applications mentioned in this paper, the RMS availability can be applied to many other problems, such as allocating testing time and manpower to different RM servers, optimally protecting the grid system [29], etc.

APPENDIX A

Lemma 1: There exists a finite integer $N$ that satisfies inequality (18) in Algorithm 1.

Proof: Since $c_s/(\lambda_a \cdot c_p) > 0$, Lemma 1 is proved if we can prove $\lim_{N\to\infty} A(N) = 1$. As in (10), we have

$$
\begin{aligned}
\lim_{N\to\infty} A(N) &= \lim_{N\to\infty} \sum_{k=1}^{N} p_k v_k \\
&= \lim_{N\to\infty} \left( \sum_{k=M+m}^{N} p_k v_k + \sum_{k=1}^{M+m-1} p_k v_k \right) \\
&\ge \lim_{N\to\infty} \sum_{k=M+m}^{N} p_k v_k \\
&= \lim_{N\to\infty} \sum_{k=M+m}^{N} p_k
\end{aligned}
\tag{26}
$$

because $v_k = 1$ when $k \ge M + m$, which means that the servers are numerous enough to exceed the whole request queue, so a new job can be served immediately. Then, from the Markov model in Fig. 2, we have

$$\sum_{k=M+m}^{N} p_k = 1 - \sum_{k=0}^{M+m-1} p_k = 1 - P(T) \tag{27}$$

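where $P(T)$ is the probability of staying at state $T$, which combines all the states from 0 to $M + m - 1$ of the Markov model in Fig. 2. It is easy to derive $P(T)$ in a similar way as $P(0)$ is derived in (2), which gives

$$
\begin{aligned}
P(T) &= \frac{1}{1 + \displaystyle\sum_{k=M+m}^{N} \frac{(N-M-m)\cdots(N-k+1)\cdot(N-k)\cdot\mu_s^{\,k-M-m+1}}{(M+m)\cdots(k-1)\cdot k\cdot\lambda_s^{\,k-M-m+1}}} \\
&= \frac{1}{1 + \dfrac{(N-M-m)\cdot\mu_s}{(M+m)\cdot\lambda_s} + \displaystyle\sum_{k=M+m+1}^{N} \frac{(N-M-m)\cdots(N-k)\cdot\mu_s^{\,k-M-m+1}}{(M+m)\cdots(k-1)\cdot k\cdot\lambda_s^{\,k-M-m+1}}} \\
&\le \frac{1}{1 + \dfrac{(N-M-m)\cdot\mu_s}{(M+m)\cdot\lambda_s}}.
\end{aligned}
$$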

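We continue by using (26) to obtain

$$
\begin{aligned}
\lim_{N\to\infty} A(N) &\ge \lim_{N\to\infty} \sum_{k=M+m}^{N} p_k \\
&= \lim_{N\to\infty} \bigl(1 - P(T)\bigr) \\
&\ge \lim_{N\to\infty} \left(1 - \frac{1}{1 + \dfrac{(N-M-m)\cdot\mu_s}{(M+m)\cdot\lambda_s}}\right) \\
&= 1 - \lim_{N\to\infty} \frac{1}{1 + \dfrac{(N-M-m)\cdot\mu_s}{(M+m)\cdot\lambda_s}} \\
&= 1.
\end{aligned}
$$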
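The bound used in this derivation can be checked numerically. The sketch below evaluates the lower bound $1 - 1/\bigl(1 + (N-M-m)\mu_s/((M+m)\lambda_s)\bigr)$ on $A(N)$ for increasing $N$, and finds the smallest $N$ for which this bound alone is enough to guarantee $\{1 - A(N)\}\,\lambda_a \cdot c_p < c_s$, the form of inequality (18) given in (28). All numerical values and function names are hypothetical and chosen only for illustration. Because the bound is loose, the resulting $N$ is merely sufficient and is not claimed to be the optimum or the stopping point of Algorithm 1.

```python
def availability_lower_bound(n, queue_states, mu_s, lam_s):
    """Lower bound on A(N) from the derivation above.

    queue_states stands for M + m; the bound is meant for n >= queue_states.
    """
    ratio = (n - queue_states) * mu_s / (queue_states * lam_s)
    return 1.0 - 1.0 / (1.0 + ratio)


def sufficient_servers(queue_states, mu_s, lam_s, c_s, c_p, lam_a):
    """Smallest N >= M + m whose bound guarantees (1 - A(N)) * lam_a * c_p < c_s."""
    n = queue_states
    while True:
        unavailability_bound = 1.0 - availability_lower_bound(
            n, queue_states, mu_s, lam_s
        )
        if unavailability_bound * lam_a * c_p < c_s:
            return n
        n += 1  # the bound tends to 1, so the loop ends for positive c_s


# Hypothetical parameters, not taken from the paper's examples.
QUEUE_STATES = 5  # plays the role of M + m
MU_S, LAM_S = 2.0, 0.5
C_S, C_P, LAM_A = 1.0, 10.0, 1.4

for n in (10, 100, 1000):
    bound = availability_lower_bound(n, QUEUE_STATES, MU_S, LAM_S)
    print(f"N = {n:4d}: A(N) >= {bound:.5f}")

n_sufficient = sufficient_servers(QUEUE_STATES, MU_S, LAM_S, C_S, C_P, LAM_A)
print(f"bound guarantees the criterion from N = {n_sufficient}")
```

The printed bounds increase toward 1 as $N$ grows, which is the behavior the limit argument relies on.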


Thus, $1 \le \lim_{N\to\infty} A(N) \le 1$, which is equivalent to $\lim_{N\to\infty} A(N) = 1$. Thus, Lemma 1 is proved.

APPENDIX B

Lemma 2: The minimum cost (i.e., the optimum) must occur before the termination criterion of inequality (18) in Algorithm 1 is met.

Proof: Adding one RM server causes an additional server cost of $c_s$ per unit time, and inequality (18) is equivalent to

$$\{1 - A(N)\}\,\lambda_a \cdot c_p < c_s \tag{28}$$

which means that, when another RM server is added, the maximal saving from the improved availability is less than the additional cost of that RM server, i.e., the additional cost exceeds the saving. Thus, there is no need to add more RM servers under condition (18), because the total cost (15) must increase afterward. Therefore, the minimum cost never occurs afterward [i.e., at any $N$ greater than the $N$ that satisfies inequality (18)]. Thus, Lemma 2 is proved.

REFERENCES

[1] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure. San Mateo, CA: Morgan Kaufmann, 2003.
[2] A. Kumar, “An efficient SuperGrid protocol for high availability and load balancing,” IEEE Trans. Comput., vol. 49, no. 10, pp. 1126–1133, Oct. 2000.
[3] I. Foster, C. Kesselman, and S. Tuecke, “The anatomy of the grid: Enabling scalable virtual organizations,” Int. J. High Perform. Comput. Appl., vol. 15, no. 3, pp. 200–222, 2001.
[4] F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, and D. Zagorodnov, “Adaptive computing on the grid using AppLeS,” IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 4, pp. 369–382, Apr. 2003.
[5] M. Cannataro, A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio, “Distributed data mining on grids: Services, tools, and applications,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 6, pp. 2451–2465, Dec. 2004.
[6] N. D. Doulamis, A. D. Doulamis, A. Panagakis, K. Dolkas, T. A. Varvarigou, and E. Varvarigos, “A combined fuzzy-neural network model for non-linear prediction of 3-D rendering workload in grid computing,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 2, pp. 1235–1247, Apr. 2004.
[7] M. Livny and R. Raman, “High-throughput resource management,” in The Grid: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann, 1998, pp. 311–338.
[8] J. Nabrzyski, J. M. Schopf, and J. Weglarz, Grid Resource Management. New York: Kluwer, 2003.
[9] K. Krauter, R. Buyya, and M. Maheswaran, “A taxonomy and survey of grid resource management systems for distributed computing,” Softw.—Pract. Exp., vol. 32, no. 2, pp. 135–164, Feb. 2002.
[10] Q. Ding, G. L. Chen, and J. Gu, “A unified resource mapping strategy in computational grid environments,” J. Softw., vol. 13, no. 7, pp. 1303–1308, 2002.
[11] A. Goyal and S. S. Lavenberg, “Modelling and analysis of computer system availability,” IBM J. Res. Develop., vol. 31, no. 6, pp. 651–664, Dec. 1987.
[12] J. C. Laprie and K. Kanoun, “X-ware reliability and availability modeling,” IEEE Trans. Softw. Eng., vol. 18, no. 2, pp. 130–147, Feb. 1992.
[13] S. Hariri and H. Mutlu, “Hierarchical modeling of availability in distributed systems,” IEEE Trans. Softw. Eng., vol. 21, no. 1, pp. 50–56, Jan. 1995.
[14] L. Nordmann and H. Pham, “Reliability of decision making in human-organizations,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 27, no. 4, pp. 543–549, Jul. 1997.
[15] C. D. Lai, M. Xie, K. L. Poh, Y. S. Dai, and P. Yang, “A model for availability analysis of distributed software/hardware systems,” Inf. Softw. Technol., vol. 44, no. 6, pp. 343–350, Apr. 2002.
[16] H. N. Wu, “Reliable LQ fuzzy control for continuous-time nonlinear systems with actuator faults,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 4, pp. 1743–1752, Aug. 2004.
[17] J. H. Lo, C. Y. Huang, I. Y. Chen, S. Y. Kuo, and M. R. Lyu, “Reliability assessment and sensitivity analysis of software reliability growth modeling based on software module structure,” J. Syst. Softw., vol. 76, no. 1, pp. 3–13, Apr. 2005.
[18] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke, “Grid services for distributed system integration,” Computer, vol. 35, no. 6, pp. 37–46, Jun. 2002.
[19] Y. S. Dai, M. Xie, and K. L. Poh, “Reliability analysis of grid computing systems,” in Proc. 10th IEEE Pac. Rim Int. Symp. Dependable Comput., 2002, pp. 97–104.
[20] Y. S. Dai, M. Xie, K. L. Poh, and G. Q. Liu, “A study of service reliability and availability for distributed systems,” Reliab. Eng. Syst. Saf., vol. 79, no. 1, pp. 103–112, Jan. 2003.
[21] D. Abramson, R. Buyya, and J. Giddy, “A computational economy for grid computing and its implementation in the Nimrod-G resource broker,” Future Gener. Comput. Syst., vol. 18, no. 8, pp. 1061–1074, Oct. 2002.
[22] D. R. Jeske and X. Zhang, “Some successful approaches to software reliability modeling in industry,” J. Syst. Softw., vol. 74, no. 1, pp. 85–99, Jan. 2005.
[23] B. Yang and M. Xie, “A study of operational and testing reliability in software reliability analysis,” Reliab. Eng. Syst. Saf., vol. 70, no. 3, pp. 323–329, Dec. 2000.
[24] M. Xie, Y. S. Dai, and K. L. Poh, Computing Systems Reliability: Models and Analysis. New York: Kluwer, 2004.
[25] K. S. Trivedi, Probability and Statistics With Reliability, Queuing, and Computer Science Applications. New York: Wiley, 2001.
[26] H. Pham and X. Zhang, “NHPP software reliability and cost models with testing coverage,” Eur. J. Oper. Res., vol. 145, no. 2, pp. 443–454, Mar. 2003.
[27] Y. S. Dai, M. Xie, K. L. Poh, and S. H. Ng, “A model for correlated failures in N-version programming,” IIE Trans., vol. 36, no. 12, pp. 1183–1192, Dec. 2004.
[28] Y. S. Dai, M. Xie, and K. L. Poh, “Modeling and analysis of correlated software failures of multiple types,” IEEE Trans. Rel., vol. 54, no. 1, pp. 100–106, Mar. 2005.
[29] G. Levitin, Y. S. Dai, M. Xie, and K. L. Poh, “Optimizing survivability of multi-state systems with multi-level protection by multi-processor genetic algorithm,” Reliab. Eng. Syst. Saf., vol. 82, no. 1, pp. 93–104, Oct. 2003.

Yuan-Shun Dai received the B.S. degree from Tsinghua University, Beijing, China, and the Ph.D. degree from the National University of Singapore, Singapore. He is currently with the Department of Industrial and Information Engineering, and with the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville. He has published four books and over 60 articles. His research interest is in dependability, grid computing, security, and autonomic computing. His research has been featured in Industrial Engineer Magazine (December 2004). Dr. Dai was the Program Chair for the 12th IEEE Pacific Rim Symposium on Dependable Computing and the General Chair for the 2nd and 3rd IEEE Symposia on Dependable Autonomic and Secure Computing (DASC06 and DASC07). He also chairs many conferences and is on the editorial board of some journals. He has been a Guest Editor for the IEEE TRANSACTIONS ON RELIABILITY, Lecture Notes in Computer Science, Journal of Computer Science, and International Journal of Autonomic and Trusted Computing.


Min Xie (A’90–M’91–SM’94–F’06) received the Ph.D. degree from Linköping University, Linköping, Sweden, in 1987. He is currently a Professor with the National University of Singapore, Singapore. He has been active in research in reliability and quality engineering. He is the author of six books and numerous articles in the area of reliability and quality. Prof. Xie served on the editorial board of several international journals, including as Associate Editor of the IEEE TRANSACTIONS ON RELIABILITY and as Department Editor of the Institute of Industrial Engineers Transactions.


Kim-Leng Poh received the Ph.D. degree from Stanford University, Stanford, CA, in 1993. He is currently an Associate Professor and the Deputy Head of the Department of Industrial and Systems Engineering, National University of Singapore, Singapore. He has published many papers in his areas of research. His research is in system modeling and decision analysis. Dr. Poh is the Past President of the Operational Research Society of Singapore.
