Optimal Task Allocation for Maximizing Reliability in Distributed Real ...

Hamid Reza Faragardi, Reza Shojaee, Mohammad Amin Keshtkar and Hamid Tabani. School of Electrical and Computer Engineering. University of Tehran.
Optimal Task Allocation for Maximizing Reliability in Distributed Real-time Systems

Hamid Reza Faragardi, Reza Shojaee, Mohammad Amin Keshtkar and Hamid Tabani
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
{h.faragardi, r.shojaee, m.keshtkar, h.tabani}@ut.ac.ir

Abstract: Distributed systems have been developed as platforms for large-scale computation. Reliability is one of the prominent issues in such systems. Many recent studies improve reliability through proper task allocation in distributed systems, but they consider only some system constraints, such as processing load, memory capacity, and communication rate. In this paper, we add a time constraint, in the form of task deadlines, to the above-mentioned constraints in order to model and analyze reliability in distributed real-time systems. To maximize reliability while satisfying the constraints, we propose a new offline task allocation algorithm, Systematic Memory-based Simulated Annealing (SMSA), which uses a monotonic cooling schedule and a limited memory of recently visited solutions to prevent cycling. In addition, an effective greedy heuristic intensifies SMSA. To evaluate the algorithm, SMSA is compared with a Genetic Algorithm (GA) and Simulated Annealing (SA). The results show that, in contrast to SA and GA, SMSA obtains satisfactory reliability in reasonable execution time. Meanwhile, SMSA meets all deadlines, as do SA and GA. Furthermore, SMSA results have low deviation from the average reliability.

Keywords: distributed system; reliability; real-time system; task allocation; simulated annealing.

I. INTRODUCTION

Hard real-time systems are increasingly employed in safety-critical applications where a system failure can cause a catastrophe. In such systems, timing correctness is as important as logical correctness: to operate successfully, the application must not only be logically correct but must also complete its work before the specified deadline. At the same time, the increased complexity of computing systems has led to a need for more powerful computation resources. Distributed Systems (DSs) have emerged as a promising solution to this issue. A DS is a collection of independent computers (nodes), where a parallel application can be divided into a number of tasks that are executed concurrently on distinct nodes. In particular, a heterogeneous DS comprises nodes with different computation power and memory capacity. Furthermore, the communication links that connect the nodes may provide different bandwidths. Nowadays, Grid and Cloud are the most important types of distributed systems. Major advantages of such systems over centralized ones are better performance, availability, reliability, resource sharing and extensibility [1].

While reliability is a critical requirement for most hard real-time systems (e.g., military systems, nuclear plants, etc.), it is also an important issue in DSs. Formally, system reliability is defined as the probability that all tasks run successfully [3]. However, because of the complexity of DSs, failures of nodes and communication links are inevitable. Hence we need an effective approach to achieve a system with high reliability. While redundancy and diversity have been employed as popular methods to attain better reliability [4][5][6][7][8][9][10], they impose extra hardware or software costs. An alternative for improving system reliability is optimal task allocation. This method does not require additional hardware or software and improves system reliability solely through a proper allocation of the tasks among the nodes [8][11][12][13][14].

In this paper, we consider a heterogeneous DS that runs a hard real-time application (all the tasks must meet their deadlines). The system is composed of several nodes. The nodes are connected by communication links that are configured in a cycle-free topology such as a star, tree or bus. The application is divided into several tasks, where each task has its own deadline and can be run on each node with a specific execution time. The problem is to find a task allocation under which the overall reliability of the system is maximized, while the real-time requirements and the other resource constraints are satisfied. This problem can be formulated as an optimization problem with a cost function representing the unreliability caused by two factors: the execution time of tasks on nodes and the interprocessor communication time. As shown in [8], this problem is NP-Hard, so exact methods cannot produce an optimal solution in reasonable time for large-scale inputs.

Previously, there have been some efforts to address similar problems in non-real-time systems. In 1992, Shatz et al. defined a model for the problem under the assumption of time-dependent processor and communication link failures. According to this model, a task with a longer execution time increases the failure probability [3]. Later, several algorithms based on this model were proposed to find optimal or near-optimal solutions. Optimal solutions are usually based on the branch-and-bound idea. Kartik and Murthy (1995, 1997) used the branch-and-bound technique with underestimates and with reordering of the tasks according to task independence to reduce the required computations [8][25]. They also proved that reliability-oriented task allocation in distributed systems is NP-Hard [8]. In recent years, several heuristic and meta-heuristic algorithms have been developed to solve the problem without considering real-time constraints. For example, in 2001 Vidyarthi and Tripathi presented a simple genetic algorithm to find a near-optimal allocation in reasonable time [13]. In 2006 Attiya and Hamam developed a simulated annealing algorithm for the problem and evaluated its performance by comparing it with the branch-and-bound technique [11]. Furthermore, a hybrid algorithm that combines particle swarm optimization with a hill-climbing heuristic was proposed in 2007 by Yin et al. [12]. In 2010 the honeybee mating optimization technique was used by Kang et al. to solve the problem [14]. Recently, a hybrid of simulated annealing and tabu search that uses a non-monotonic cooling schedule was presented to solve the problem [22]. Moreover, in 2012 a new swarm intelligence approach based on Cat Swarm Optimization (CSO) was proposed by Shojaee et al. [27]. The reported results show that CSO performs better than GA and PSO. A new mathematical model to analyze reliability in RDSs with hard periodic real-time tasks was presented in 2012 by Faragardi et al. [26], who used ant colony optimization to solve the problem. In addition, some papers consider task deadlines in a DS, but their target is maximizing the probability of meeting the task deadlines [24].

In this paper, our goal is to maximize reliability in distributed real-time systems by optimal task allocation, without redundancy and without task precedence constraints. To find a near-optimal solution, we propose a new meta-heuristic algorithm called Systematic Memory-based Simulated Annealing (SMSA). SMSA uses a monotonic cooling schedule and a systematic search of the neighborhood. Furthermore, to prevent cycling, SMSA uses a memory that keeps recently visited solutions. To evaluate the algorithm, GA, SA and SMSA were implemented, and their reliability and execution time were compared for various numbers of processors and tasks. The results show that, in contrast to SA and GA, SMSA obtains satisfactory reliability in reasonable execution time. Meanwhile, SMSA, like SA and GA, meets all deadlines.

The rest of this paper is organized as follows. In Section II, the system model and the problem statement are defined in detail. Section III proposes the solution approach. Section IV presents the simulation and performance evaluation of SMSA. Finally, concluding remarks and future work are presented in Section V.

II. PROBLEM STATEMENT

We consider a heterogeneous DS where each processing node may have a different memory size and failure rate. In addition, the communication links between the nodes may have different bandwidths and failure rates. We assume a cycle-free network topology, such as a star, tree or bus. Each component of this system (node or communication link) can be in one of two states: operational or failed. Failure of a component during an idle period is not considered a critical failure, since the component can be replaced by a spare. The failure of a component follows a Poisson process with a constant failure rate, and failures of components are statistically independent. This assumption has been widely used in the reliability analysis of computing systems [18][19][20][21]. The reliability of the considered DS depends on:
• The number of computing nodes composing the system and their individual likelihoods of failure.
• The likelihood of failure of each path between a pair of nodes.

A parallel application is divided into a set of M tasks [11] and executed on the system, which is assumed to contain N processors. The tasks of the application require resources, including computational load, memory space, and a specific communication bandwidth. Each pair of tasks communicates at a known rate. In this system, task execution times are processor dependent, meaning that the execution time of a task varies across processors. Furthermore, each task has a deadline before which its execution must be finished. As we assume a hard real-time application, resource allocation must be performed in such a way that all the tasks meet their deadlines. As EDF is an optimal scheduling algorithm [15], we use it to schedule the tasks on each processor. The goal is to find a task-to-processor assignment under which the overall reliability of the system (the probability that all tasks run successfully) is maximized, while the memory, communication bandwidth, and processing load constraints, as well as the real-time requirements, are met. In the following we formally define this problem.

A. Notations
The notation used throughout this paper is summarized as follows:
• N is the number of nodes (processors).
• M is the number of tasks.
• P = {P1, …, PN} is the set of nodes.
• … is the hardware hazard rate of node Pk.
• … is the operating system hazard rate of node Pk.
• … is the total hazard rate of node Pk.
• T = {T1, …, TM} is the task set, ordered by ascending task deadline.
• Di is the deadline of task Ti.
• Eij is the execution time of task Ti on node Pj.
• CLij is the path between nodes Pi and Pj.
• X = [xij] is the task-to-node assignment matrix, where xij = 1 if and only if Ti is assigned to Pj; otherwise xij = 0.
• CBWij is the communication bandwidth of CLij.
• CRij is the amount of data transmitted between Ti and Tj.

B. Constraints
In this section we outline the principal constraints of the system and formally define them.
• Memory constraint: The total memory requirement of the tasks assigned to a processor should not exceed the available memory of that processor. This constraint is formulated in Eq. 1.
• Processing load constraint: The total processing load requirement of the tasks assigned to a processor should not exceed that processor's maximum allowed load. This constraint is formulated in Eq. 2.
• Path load constraint: The total communication rate requirement of the tasks that communicate through a specific path should not exceed the maximum allowed load of that path. This constraint is formulated in Eq. 3.
• Deadline constraint: The execution of each task should be completed before its deadline. Eq. 4 formulates this constraint.

$\sum_{i=1}^{M} x_{ik}\, mem_i \le Mem_k$ for all k, 1 ≤ k ≤ N. (1)
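The deadline constraint can be illustrated with a small check. This is only a sketch under assumptions that the text does not state: all tasks are released at time 0, each node runs its tasks back to back in EDF order, and the assignment is encoded as a list giving the node index of each task. The function name `meets_deadlines` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def meets_deadlines(assignment: Sequence[int],
                    exec_time: Sequence[Sequence[float]],
                    deadline: Sequence[float],
                    num_nodes: int) -> bool:
    """Return True if every task finishes by its deadline under per-node EDF.

    assignment[i]   -- index of the node that runs task i (hypothetical encoding of X)
    exec_time[i][j] -- E_ij, execution time of task i on node j
    deadline[i]     -- D_i, deadline of task i
    Assumption of this sketch: all tasks are released at time 0.
    """
    # Group the tasks by the node they are assigned to.
    per_node: List[List[int]] = [[] for _ in range(num_nodes)]
    for task, node in enumerate(assignment):
        per_node[node].append(task)

    for node, tasks in enumerate(per_node):
        finish = 0.0
        # EDF: execute the tasks of this node in order of increasing deadline.
        for task in sorted(tasks, key=lambda i: deadline[i]):
            finish += exec_time[task][node]
            if finish > deadline[task]:
                return False
    return True


# Three tasks, two nodes (illustrative numbers only).
E = [[2.0, 3.0], [4.0, 2.0], [3.0, 5.0]]
D = [3.0, 6.0, 8.0]
print(meets_deadlines([0, 1, 0], E, D, 2))  # tasks 0 and 2 on node 0, task 1 on node 1
print(meets_deadlines([0, 0, 0], E, D, 2))  # all tasks on node 0
```

With these numbers the first assignment is feasible, while placing all three tasks on node 0 makes the last task finish at time 9, past its deadline of 8.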

memory and other parts of a node are perfect). Therefore, the reliability of processing node Pk in the time interval [0, t] can be obtained by:
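As a hedged sketch of the kind of expression this sentence introduces: under the system model's assumption that component failures follow a Poisson process with a constant rate, the probability that a node does not fail in [0, t] takes the standard exponential form. The symbol $\lambda_k$ for the total hazard rate of node Pk is an assumption of this sketch, not notation taken from the source.

```latex
R_k(t) = e^{-\lambda_k t}
```

For instance, with an assumed $\lambda_k = 10^{-4}$ failures per time unit and $t = 100$ time units, $R_k(t) = e^{-0.01} \approx 0.990$.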

• PLij is the maximum allowed communication load for CLij.
• CHRij is the communication hazard rate of CLij.
• … is the cost of executing Ti on node Pj.
• Memi is the memory capacity of node Pi.
• memi is the memory needed by task Ti.
• Li is the maximum allowed processing load of node Pi.
• li is the processing load required by task Ti.
• Rs(X) is the system reliability for assignment X.
• Rs'(X) is the system reliability without considering failure of links.
• Rs''(X) is the system reliability without considering failure of nodes.
• C(X) is the cost of assignment X (defined shortly).
• TC(X) is the total cost of assignment X (defined shortly).

$\sum_{i=1}^{M} x_{ik}\, l_i \le L_k$ for all k, 1 ≤ k ≤ N. (2)
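The memory and processing load constraints (Eq. 1 and Eq. 2) can be checked for a candidate assignment with a few lines of code. The following is a minimal sketch, not the authors' implementation; the function name `satisfies_resource_constraints`, the argument names and the sample values are hypothetical.

```python
from typing import Sequence


def satisfies_resource_constraints(x: Sequence[Sequence[int]],
                                   mem: Sequence[float],
                                   node_mem: Sequence[float],
                                   load: Sequence[float],
                                   node_load: Sequence[float]) -> bool:
    """Check Eq. 1 (memory) and Eq. 2 (processing load) for assignment x.

    x[i][k]      -- 1 if task i is assigned to node k, else 0
    mem[i]       -- mem_i, memory needed by task i
    node_mem[k]  -- Mem_k, memory available on node k
    load[i]      -- l_i, processing load required by task i
    node_load[k] -- L_k, maximum allowed processing load of node k
    """
    num_tasks = len(x)
    for k in range(len(node_mem)):
        used_mem = sum(x[i][k] * mem[i] for i in range(num_tasks))
        used_load = sum(x[i][k] * load[i] for i in range(num_tasks))
        if used_mem > node_mem[k] or used_load > node_load[k]:
            return False
    return True


# Three tasks, two nodes (illustrative numbers only).
mem = [4, 2, 3]
load = [0.3, 0.5, 0.4]
node_mem = [8, 4]
node_load = [1.0, 1.0]

spread = [[1, 0], [0, 1], [1, 0]]   # tasks 0 and 2 on node 0, task 1 on node 1
packed = [[1, 0], [1, 0], [1, 0]]   # all tasks on node 0

print(satisfies_resource_constraints(spread, mem, node_mem, load, node_load))
print(satisfies_resource_constraints(packed, mem, node_mem, load, node_load))
```

In this example the first assignment uses 7 of 8 memory units on node 0 and passes, while packing all tasks onto node 0 needs 9 memory units and violates Eq. 1.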

1≤ p
