Paper Title (use style: paper title)

Reliability-Aware Task Allocation in Distributed Computing Systems using Hybrid Simulated Annealing and Tabu Search

Hamid Reza Faragardi, Reza Shojaee and Nasser Yazdani Router Lab., School of Electrical and Computer Engineering University of Tehran Tehran, Iran {h.faragardi, r.shojaee, yazdani}@ut.ac.ir Abstract — Reliability is one of the important issues in the design of distributed computing systems (DCSs). This paper deals with the problem of task allocation in heterogeneous DCSs for maximizing system reliability with several resource constraints. Memory capacity, processing load and communication rate are major constraints in the problem. Reliability oriented task allocation problem is NP-hard, thus many algorithms were presented to find a near optimal solution. This paper presents a Hybrid of Simulated Annealing and Tabu Search (HSATS) that uses a non-monotonic cooling schedule to find a near optimal solution within reasonable time. The HSATS algorithm was implemented and evaluated through experimental studies on a large number of randomly generated instances. Results have shown that the algorithm can obtain optimal solution in most cases. When it fails to produce optimal solution, deviation is less than 0.2 percent. Therefore in terms of solution quality, HSATS is significantly better than pure Simulated Annealing. Keywords-distributed computing system; reliability; task allocation;simulated annealing; tabu search.

I.

INTRODUCTION

Nowadays, the required computation power for most applications cannot be achieved with a single processor. One of the traditional approaches for tackling this problem is to employ Distributed Computing System (DCS). DCS is a collection of independent computers (nodes) that appears to its users as a single coherent system. In heterogeneous DCS each node has different computation power and memory capacity. Also communication links which are connecting the nodes may have different bandwidths. Such systems provide many advantages over centralized ones, such as improving performance, availability, reliability, resource sharing and extensibility [1]. Reliability is an important issue in DCSs. Furthermore it is critical requirement for some systems (e.g., military system, nuclear plant, etc.). A parallel application can be divided into a number of tasks and executed concurrently on different nodes. System reliability is defined as probability that all tasks run successfully [3]. Due to complexity of such systems, failure of nodes and communication links are inevitable. Hence we need an effective approach for improving reliability. Redundancy and diversity are popular methods to attain high reliability

[4][5][6][7][8][9][10], but they impose extra hardware or software costs. Another alternative is optimal task allocation. This method does not require additional hardware or software and improves system reliability just by using proper allocation of the tasks among dissimilar nodes [8] [11][12] [13] [14]. The problem can be stated as follows: There is a heterogeneous DCS with N nodes which are connected through a communication network. The network topology is cycle free such as star, tree and bus. The parallel application is divided into a number of tasks. These tasks must be distributed among N nodes. The tasks which are assigned to a pair of nodes can communicate to each other. Several resource constraints are considered such as memory capacity, processing load and communication rate for each path. This problem can be formulated as an optimization problem consists of a cost function representing the unreliability caused by the execution of tasks on nodes and the unreliability caused by inter-processor communication time. Also penalty functions are considered to satisfy the mentioned constraints. In this paper our goal is maximizing system reliability by optimal task allocation without extra hardware or software costs. This problem is NP-Hard [8], thus exact methods cannot produce optimal solution in reasonable time for large scale inputs. We propose a Hybrid of Simulated Annealing and Tabu Search (HSATS) that uses a non-monotonic cooling schedule to find a near optimal solution within reasonable time. For evaluating the algorithm, pure simulated annealing, branch & bound and HSATS were implemented. Then we compute their reliability and execution time for various numbers of processors and tasks. The results have shown that HSATS produces more accurate solutions rather than pure simulated annealing. Branch & bound (BB) gives optimal solution but it needs extreme execution time and space. In most cases HSATS generates the results same as BB within acceptable time and space. The reminder of this paper is organized as follows. Related work is presented in section II. In section III, problem statement is defined in details. In sections IV and V, two well-known meta-heuristic algorithms for finding near optimal solution of combinatorial optimization problems are introduced. New algorithm for the problem is described in

section VI. Simulation and performance evaluation of the new algorithm are presented in section VII. Finally, concluding remarks and future work are presented in section VIII. II.

RELATED WORKS

In 1992 Shatz, et al. defined a model for the problem where failure of processor or communication links is timedependent. In this scenario a task with longer execution time will increase failure probability [3]. Many algorithms were proposed base on this model to find optimal or near optimal solution. Exact algorithms can attain optimal solutions and they are usually based on branch and bound idea. Kartik and Murthy (1995, 1997) used the branch and bound with underestimates and reorder the tasks according to task independence for reducing the computations required [8][27]. Also they proved that reliability oriented task allocation in distributed computing systems is NP-Hard [8]. Thus exact algorithms only work in small and moderate problem size. In recent years, most researchers focused on developing heuristic and meta-heuristic algorithms to solve the problem. For example, in 2001 Vidyarthi and Tripathi presented a solution using a simple genetic algorithm to find a near optimal allocation quickly [13]. In 2006 Attiya and Hamam developed a simulated annealing algorithm for the problem and evaluated its performance in comparison with branch-and-bound technique with acceptable results [11]. In 2007 Yin et al. proposed a hybrid algorithm that combines particle swarm optimization with a hill climbing heuristic [12]. In 2010 Kang, et al. solved the problem using honeybee mating optimization technique [14]. III.

PROBLEM STATEMENT

In a heterogeneous DCS, each node may have different processing speeds, memory sizes, and failure rates. In addition, the communication links may have different bandwidths and failure rates. Another important issue for this problem is network topology. The network topology for our problem is cycle free such as star, tree and bus. Also we can consider fully connected topology by assume that each pair of nodes (p, q) communicate just through a link between p and q (direct path). Each component of the distributed computing system (node or communication link) has two states: operational or failed. If a component fails during an idle period, it will be replaced by a spare. We do not consider this to be a critical failure. The failure of a component follows a Poisson process (constant failure rate). Failures of components are statistically independent, this assumption has been widely used in the community of computing system’s reliability analysis [18][19][20][21]. Under these assumptions, the reliability of a DCS depends on both the number of computing servers composing the system and their individual likelihoods of failure. Obviously

number of tasks is another important factor which affects to results. Tasks of the given application require certain computer resources such as computational load and memory capacity. They also communicate at a given rate. We are given a set of M tasks representing a parallel application to be executed on a distributed system of N processors [11]. A. Notations The notations used in problem formulation are listed as follows:  M represents number of tasks.  N represents number of processors.  Ti is an ith task.  Pi is an ith processor.  CLpq is a path between node p and q.  xij equals one, if and only if Ti is assigned to Pj in the assignment represented by X. Otherwise Xij = 0.  PHRi is hazard rate for ith processor.  Eij is execution time of Ti on Pj.  CBWij is communication bandwidth for CLpq.  CRij is communication rate between task i and j.  PLij is path load between Pi and Pj.  CHRpq shows communication hazard rate for CLpq.  Memi is memory amount for Pi.  memi represents essential memory for Ti.  Li is processing load for Pi.  li shows that essential processing load for Ti.  Rs(X) is system reliability for assignment X.  Rs’(X) is system reliability without considering failure of links.  Rs”(X) is system reliability without considering failure of nodes.  C(X) is cost of assignment X  TC(X) is total cost of assignment X B. Principle constraints The principle constraints for the problem are outlined in this section.  Memory: Memory of each processor is no less than the total amount of memory requirements for its all assigned tasks. This constraint is formulated by Eq. 1.  Processing load: Load of each processor is no less than the total amount of processing load requirements for its all assigned tasks. This constraint is formulated by Eq. 2.  Path load: Load of each path is no less than the total amount of communication rate requirements for the all tasks which communicate through this path. This constraint is formulated by Eq. 3. ∑ ∑

≤ Memk ≤ Lk

For all k, 1≤ k ≤ N.

(1)

For all k, 1≤ k ≤ N.

(2)

∑

∑

For all paths pq, 1≤ p 1;

Repeat Iteration=0; Repeat Select a random neighbor solution Xn; Compute the cost of next solution CXn; ΔC = CXn - CXc; If (ΔC 0 Generate a uniform random value r in the range (0,1); If r < exp(-ΔC /T ) Xc = Xn CXc = CXn Iteration = Iteration +1 Until Iteration = RE T = µ ∗ T; RE = β∗RE; Until (T < Tf)

V.

TABU SEARCH

Tabu search (TS) is a meta-heuristic approach which is designed to find a near-optimal solution of combinatorial optimization problems. This method was introduced by Glover et al [22][23][24]. TS has been applied to several problems such as timetabling [25], routing [26], and scheduling [15]. TS can be briefly stated as follows. Move is a fundamental notion in TS. The move is a function which transforms a solution into another solution. A subset of moves is applicable to any solution. This subset generates a subset of solutions called the neighborhood. TS starts from an initial solution. At each step, the neighborhood of a given solution is searched in order to find the best neighbor. As TS should search all of neighbors, when size of neighborhood is large, its performance dramatically decreases. To overcome this issue, TS can visit only a subset of neighbors which is called candidate list. For avoiding cycle path and exploring new region of the solution space, the search history is kept in the memory and will be employed in future. There are at least two classes of the memory: short-term memory for the very recent history and a long-term memory for old history. A class of the short-term memory is called the Tabu list and plays a basic role in TS. This list does not allow search algorithm to turn back the solutions visited in the previous steps. In practice for decreasing memory usage, Tabu list stores forbidden moves instead of forbidden solutions. Usually the Tabu list is implemented as a limited queue containing forbidden moves. Once a move from a solution to its neighbor is made, we insert the inverted move at the end of queue and remove the first item if the queue is full. However, it is possible that TS reselects a move which is in the Tabu list. In order to perform such a move, an aspiration function must be defined. This function evaluates the cost of forbidden move. If this cost is acceptable, then the move can be performed. Normally the move is accepted if it generates the best solution.

The choice rule of the TS method is to select the move which is non Tabu with the lowest cost or satisfies the aspiration criterion. Nevertheless, a situation may arise where all possible moves are in Tabu list and none of them can satisfy the aspiration criterion. In such a case, one might use the oldest Tabu move or randomly select a Tabu move or even the algorithm is terminated. Empirically, the second strategy that randomly selects a move among the possible moves works better. Some stopping conditions of TS are suggested as follows:  There is no feasible solution in the neighborhood of current solution.  Number of iterations to be greater than the maximum value.  The number of iterations without any improvement to be larger than a specified number.  An optimum solution has been obtained (if we know optimum solution). Usually, researchers used the second case for their algorithms [25][26]. The performance and convergence speed of TS depend on basic definition (neighborhood, move, etc.) and tuning parameters (Tabu size, candidate list size, number of iterations). VI.

HSATS ALGORITHM

In this section a new algorithm for solving the problem was presented. HSATS combines the advantage of SA and TS. The main feature of SA is stochastic nature. This attribute helps it to escape from local optima. SA works elaborately. In primary iteration due to large value of T, weak solution is accepted but gradually acceptance probability decreases. It is possible that SA falls in loop and visit a solution twice or more times, this is the main drawback of SA. In contrast TS hasn’t stochastic nature and using Tabu list to prevent cycling. Thus the hybrid of SA and TS can create an efficient algorithm. The performance of the SA algorithm depends on its cooling schedule. There are two types of cooling schedule: monotonic and non-monotonic. Monotonic cooling schedule is composed of two schemes: stepwise temperature reduction and continuous temperature reduction [2]. Fig. 1 describes the first scheme which temperature is reduced after each iteration. Fig. 2 describes the second one which for specified number of iterations, T is constant, after that T is slumped. In contrast non-monotonic cooling schedule reduces the temperature value after each attempted move with occasional increase. Fig. 3 describes a non-monotonic cooling schedule. The SA algorithm which is presented in previous section uses monotonic cooling schedule with stepwise temperature reduction. The HSATS works similar to SA with some differences: 1.

Neighboring solutions are randomly selected in SA as opposed to systematic selection from candidate list in the HSATS.

2. 3.

4.

HSATS uses Tabu list to prevent cycling and a tabu move will be accepted if aspiration criterion is satisfied. Due to non-monotonic cooling schedule in HSATS, if the number of no-improvementmoves is greater than MNI then the current temperature is returned to Tb, this action is called restart. The temperature which gives the best solution till now called Tb. Convergence of SA extremely depends on initial values of its parameters while HSATS dependency to initialization is comparatively low.

Algorithm 2 HSATS Randomly select an initial solution Xs Compute the cost of initial solution CXs Xc=Xs // assign initial solution to current solution CXc=CXs Xbest=Xs // assign initial solution to best solution CXbest = CXs Determine an initial temperature Ts Determine the final temperature Tf Determine Q as the size of Tabu list Tb = Ts // assign T start to T best Determine cooling factor µ< 1 Determine η as coefficient of penalty functions NI=0 // no improvement counter Determine the maximum number of no improvement MNI R = 0 // number of restart Determine the maximum number of restarts MR Repeat NI = NI + 1 Generate a candidate list for Xc Select the best move from the candidate list as X n Compute the cost of next solution CXn ΔC = CXn - CXc If Xn is in Tabu list If aspiration criterion satisfied // Xbest>Xn Xc = Xn CXc = CXn Update the Tabu list Xbest=Xc CXbest =CXc Tb = T NI = 0 Else // Xbest MNI T = Tb // return T to best temperature R = R+1 NI = 0 Until (R > MR or T < Tf)

determined in Tab. I. In a DCS many parameters should be determined in order to system reliability can be computed, such as hazard rate of each node and path, memory and processing load of each node and etc. System parameters are tabulated in Tab. II. These values are similar to the ones used in [3][12][27]. Table II System parameters and the corresponding value ranges

Figure 1. Monotonic cooling schedule with continuous temperature reduction.

System parameters E l L mem Mem CR PL PHR CHR CBW

Figure 2. Monotonic cooling schedule with stepwise temperature reduction.

Value ranges [15, 25] [1, 50] [100, 200] [1, 50] [100, 200] [0, 25] [100, 200] [0.00005, 0.00010] [0.00015, 0.00030] [1, 4]

VII. PERFORMANCE EVALUATION Simulated Annealing, Branch & Bound and HSATS were implemented in C++ language. The system specifications are: Fedora version 14 as operating system, 2.13 GHz Core i3 Intel Processor and 4 Gigabyte Dual Channel RAM. The inputs of the programs are M (Number of tasks) and N (Number of nodes). For evaluating the HSATS, four DCS configurations are considered:

Figure 3. Non-monotonic cooling schedule

HSATS is used for finding optimal reliability oriented task allocation problem which is modeled in section III under following structures: 

 

Initial solution: It can be generated randomly or other heuristic methods. The better initial solution might provide better results. In our algorithm we generate it randomly. Aspiration criterion: If the tabu move produces a solution better than the best solution aspiration criterion is satisfied. Stopping condition: There are two stopping conditions in HSATS: T value to be less than Tf or number of restarts to be greater than MR. The second condition is considered to prevent fall into infinite loop. Table I Initial Values for HSATS parameters Algorithm Parameters Ts Tf MR MNI Q μ ɳ

Initial values 2.5 0.000001 80 100 10 0.95 0.01

The pseudo code of HSATS is provided in Alg. 2. There are several parameters in the algorithm which should be initialized correctly. The initial values of HSATS are

1. 2. 3. 4.

Four nodes with fully connected topology. Six nodes with fully connected topology. Eight nodes with star topology. Twelve nodes with star topology.

Although network topology is considered fully connected, but we assume that each pair of nodes communicate just through a direct link between them. For each problem size (N, M), 20 simulation runs are conducted by SA and HSATS. The average values of reliability and execution time are computed and tabulated in Tab. III. Also when the problem size was less than (8, 20), BB was executed and produced optimal solution. We compare HSATS and SA results with optimal solution for determining the quality of solutions and deviations. Results have shown that when the size of problem is small, SA has a short execution time and HSATS is close to SA and although execution time of BB is large but it is still reasonable. Fig. 4 and Fig. 5 show SA and HSATS reliability in small size of problem is very close to optimal solution which is produced by BB. In large size of problem, due to exponential execution time of BB, it is infeasible. In contrast, SA and HSATS are desirable whereas SA has better execution time. HSATS generate high quality solution rather than SA in such situation. Fig. 6 and Fig. 7 show when the size of problem becomes large, HSATS produces better solutions. Furthermore, the HSATS can quickly find near optimal solution whit average deviation not exceeding 0.2% from the global optimal.

Figure 6. Reliability pattern for 8 nodes in terms of SA, HSATS and BB.


Figure 7. Reliability pattern for 12 nodes in terms of SA, HSATS.


Table III Experimental results for various numbers of nodes and tasks in terms of SA, HSATS and BB Problem Size N M 4 6 4 8 4 10 4 12 4 16 6 10 6 12 6 14 6 18 6 20 8 12 8 16 8 20 8 24 8 28 12 20 12 24 12 26 12 30

Simulated Annealing Reliability Execution Time (sec) 0.9938 0.047 0.9910 0.063 0.9883 0.109 0.9851 0.156 0.9756 0.281 0.9893 0.234 0.9843 0.359 0.9785 0.499 0.9743 0.811 0.9689 1.012 0.9829 0.655 0.9780 1.170 0.9690 1.840 0.9594 2.652 0.9585 3.603 0.9754 4.243 0.9491 6.053 0.9262 7.062 0.9025 9.422

Reliability 0.9938 0.9910 0.9886 0.9854 0.9778 0.9893 0.9855 0.9816 0.9765 0.9730 0.9859 0.9803 0.9744 0.9671 0.9618 0.9776 0.9720 0.9695 0.9621

VIII. CONCLUSION In this paper our goal is maximizing system reliability by optimal task allocation without extra hardware or software costs. The problem is formulated as an optimization problem that consists of a cost function representing the unreliability caused by the execution of tasks on nodes and the unreliability caused by inter-processor communication

HSATS Execution Time (sec) 0.093 0.156 0.249 0.312 0.526 0.764 1.154 1.716 2.369 1.412 2.547 2.895 4.722 5.034 7.310 12.412 17.214 22.121 27.448

Branch & Bound Reliability Execution Time (sec) 0.9938 0.015 0.9910 0.062 0.9886 0.234 0.9854 1.014 0.9779 10.109 0.9893 4.430 0.9855 42.198 0.9821 238.931 0.9765 2512.552 0.9733 9903.926 0.9859 496.427 0.9810 10940.304 0.9747 238296.867 -

time. Also penalty functions are considered to satisfy application and system constraints. A meta-heuristics algorithm is presented to find near optimal solution. HSATS combines the advantages of SA and TS elaborately. In spite of SA, HSATS has restricted repetitive exploration of neighbors. The results of extensive computational tests indicated that the HSATS is both fast and precise for a wide variety of problem sizes. HSATS can generate optimal

solution in many instances and when it fails to achieve optimal solution, deviation is not exceeding 0.2% from global optimal. The next step of this work is to present a more accurate algorithm rather than HSATS in acceptable execution time. ACKNOWLEDGMENT The authors would like to thank Prof. M. Kargahi and router lab. researchers at university of Tehran for their valuable comments. REFERENCES K. K. Aggarwal and S. Rai, “Reliability evaluation in computer communication networks”, IEEE transaction on reliability, vol. R-30, pp. 32-35, June 1981. [2] I. H. Osman and N. Christofides, “Capacitated Clustering Problems by Hybrid Simulated Annealing and Tabu Search”, International transactions in Operation Research, 1:317-336, 1994. [3] S.M. Shatz, J.P. Wang and M. Goto, “Task allocation for maximizing reliability of distributed computer systems”, IEEE Transactions on Computers, 41, pp.1156–1168, 1992. [4] A.O. Charles Elegbede, C. Chu, K.H. Adjallah and F. Yalaoui, “Reliability allocation through cost minimization”, IEEE Trans. Reliab. 52, pp.106–111, 2003. [5] C.C. Chiu, Y.-S. Yeh and J.-S. Chou, “A fast algorithm for reliabilityoriented task assignment in a distributed system”, Comput. Commun. 25, pp.1622–1630, 2002. [6] C.C. Hsieh, “Optimal task allocation and hardware redundancy policies in distributed computing systems”, European J. Oper. Res. 147, pp.430–447, 2003. [7] C.C. Hsieh and Y.C. Hsieh, “Reliability and cost optimization in distributed computing systems”, Comput. Oper. Res. 30, pp.1103– 1119, 2003. [8] S. Kartik and C.S.R. Murthy, “Improved task-allocation algorithms to maximize reliability of redundant distributed computing systems”, IEEE Trans. Reliab. 44, pp.575–586, 1995. [9] A. Kumar and D.P. Agrawal, “A generalized algorithm for evaluating distributed-program reliability”, IEEE Trans. Reliab. 42, pp.416–426, 1993. [10] P.A. Tom and C. Murthy, “Algorithms for reliability-oriented module allocation in distributed computing systems”, J. Systems Software 40 pp.125–138, 1998. [11] G. Attiya and Y. Hamam, “Task allocation for maximizing reliability of distributed systems: a simulated annealing approach” Journal of Parallel and Distributed Computing 66, pp.1259–1266, 2006. [1]

[12] P.Y. Yin, S.S. Yu, P.P. Wang and Y.T. Wang, “Task allocation for maximizing reliability of a distributed system using hybrid particle swarm optimization”. Journal of System and Software 80, pp.724– 735, 2007. [13] D.P. Vidyarthi and A.K. Tripathi, “Maximizing reliability of distributed computing systems with task allocation using simple genetic algorithm”, Journal of Systems Architecture 47, pp.549–554, 2001. [14] Q.M. Kang, H. Hong, H.M. Song and R. Deng, “Task allocation for maximizing reliability of distributed computing systems using honeybee mating optimization”. Journal of Systems and Software 83, pp.2165–2174, 2010. [15] P. Fattahi, M. Saidi-Mehrabad and F. Jolai, “Mathematical modeling and heuristic approaches to flexible job shop scheduling problems”, J. Intell. Manuf., 18, pp. 331–342, 2007. [16] S. Kirkpatrick, C.D. Gelatt Jr. and M.P. Vecchi, "Optimization by Simulated Annealing",Science, 220, 4598, pp.671-680, 1983. [17] V. Cerny, "Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm", J. Opt. Theory Appl., 45, 1, pp.41-51, 1985. [18] C. H. Sauer and K. M. Chandy, “Computer Systems Performance Modeling”. Englewood Cliffs, NJ: Prentice-Hall, 1981. [19] J. F. Lawless, “Statistical Models and Methods for Lifetime Data” New York: Wiley, 1982. [20] C. Singh, “Calculating the time-specific frequency of system failures”, IEEE Trans. Reliability, vol. 28, no. 2, pp. 124-126, 1979. [21] C. S. Raghavendra and S. V. Maram, “Reliability modeling and analysis of computer networks”, IEEE Trans. Reliability, vol. 35, no. 2, pp. 156-160, 1986. [22] F. Glover and C. McMillan, “The general employee scheduling problem: an integration of MS and AI”, Comput. & Ops. Res. Vol. 13, No.5, pp. 563-573, 1986. [23] F. Glover, "Tabu search, part 1”, ORSA Journal on Computing 1, pp. 190-206, 1989. [24] F. Glover, "Tabu search, part 2”, ORSA Journal on Computing 2, pp. 190-206, 1989. [25] E. Nowicki and C. Smutnicki, “A fast Taboo search algorithm for the job-shop problem”, Management Science, Vol. 42, No. 6, pp. 797813, 1996. [26] J. Brandao, "A new tabu search algorithm for the vehicle routing problem with backhauls,", European Journal of Operational Research, vol. 173,pp.540-555, 2006. [27] S. Kartik and C.S.R. Murthy, “Task allocation algorithms for maximizing reliability of distributed computing systems”, IEEE Trans. On computers Vol. 46, No.6, pp. 719-724, 1997. [28] N. Azad and H. Davoudpour, “A hybrid Tabu-SA algorithm for location-inventory model with considering capacity levels and uncertain demands,” Science, vol. 3, no. 4, pp. 290-304, 2008.

Paper Title (use style: paper title)

Paper Title (use style: paper title)

Suggest Documents