
FTH-B&B: a Fault-Tolerant Hierarchical Branch and Bound for Large Scale Unreliable Environments

A. Bendjoudi, CERIST Research Center, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria, and Université Abderrahmane Mira de Béjaïa, Route de Targa Ouzemmour, 06000 Béjaïa, Algeria. Email: [email protected]
N. Melab and E-G. Talbi, Université Lille 1, LIFL/UMR CNRS 8022, 59655 Villeneuve d'Ascq cedex, France. Email: {nouredine.melab,el-ghazali.talbi}@lifl.fr

Abstract—Solving large instances of combinatorial optimization problems to optimality using Branch and Bound (B&B) algorithms requires a huge amount of computing resources. In this paper, we investigate the design and implementation of such algorithms on computational grids. Most existing grid-based B&B algorithms are based on the Master-Worker paradigm, so their scalability is limited. In addition, even though the volatility of resources is a major issue in grids, fault tolerance is rarely addressed in these works. We therefore propose FTH-B&B, a fault-tolerant hierarchical B&B. FTH-B&B is based on several new mechanisms enabling it to efficiently build and keep the hierarchy balanced, and to store and recover work units (sub-problems). FTH-B&B has been implemented on top of the ProActive grid middleware and programming environment and applied to the Flow-Shop scheduling problem. Very often, the validation of existing grid-based B&B works is performed either through simulation or on a very small real grid. In this paper, we experimented FTH-B&B on the Grid'5000 real French nation-wide computational grid using up to 1900 processor cores distributed over 6 sites. The reported results show that the overhead induced by the proposed mechanisms is very low and that an efficiency close to 100% can be achieved on some Taillard benchmarks of the Flow-Shop problem. In addition, the results demonstrate the robustness of the proposed mechanisms even in extreme failure situations.

Index Terms—Fault tolerance, Hierarchical Master-Worker, Grid Computing, Parallel Branch and Bound, Large Scale Experiments.

1 INTRODUCTION

Solving combinatorial optimization problems (COPs) exactly consists in finding the optimal solution among a large set of feasible solutions (the search space). These problems are NP-hard [18]: they are CPU time-intensive and require a huge amount of computing resources to be solved optimally. Branch and Bound (B&B) algorithms are the best known methods for the exact resolution of COPs. They perform an implicit enumeration of the search space instead of an exhaustive one. Based on a pruning technique, they considerably reduce the computation time required to explore the whole search space. However, such a technique is not sufficient for very large problem instances. Therefore, high performance computing, such as grid-based parallel computing, is required. Nowadays, computational grids offer a huge amount of computing resources that can provide the power required by parallel B&B algorithms to solve large COP instances. Nevertheless, these computing resources are highly unreliable, volatile, and heterogeneous. Therefore, traditional parallel B&B algorithms must be rethought to meet these characteristics. The presence of node failures in the computational grid requires dealing with the Fault Tolerance (FT) issue in order to correctly perform the computation and ensure reliable execution. The heterogeneous and dynamic nature of grids requires balancing the work load in order to maximize the use of the grid resources during the exploration process. Finally, parallel algorithms must be adapted to large scale environments, where a huge amount of spawned work units must be executed and/or communicated; consequently, bottlenecks may be created on some parts of the network or on some computational resources.

Traditional Master/Worker-based parallel B&B (MW-B&B) algorithms have been proved inefficient [2][1] in dealing with the scalability issue. In such algorithms, a single master decomposes the initial problem to be solved into multiple smaller sub-problems and distributes them among multiple workers. The workers, on their side, perform the exploration of the different sub-problems. However, this approach is strongly limited regarding scalability in large scale environments.


Indeed, the central master process is subject to bottlenecks caused by the large number of requests submitted by the different workers. A higher scalability can be reached using a hierarchical design: a Hierarchical Master/Worker (HMW) design allows one to overcome the limits of MW-B&B. However, designing such hierarchical parallel B&B algorithms is not straightforward since they must also deal with the unreliability of the computing resources. FT can be achieved at two levels: application-level or middleware-level. Several middlewares offer FT mechanisms for the reliable execution of applications: ProActive [29][3][10], XtremWeb [34], and Condor [7][20][30]. Nevertheless, they are generally costly in terms of CPU execution time and slow down the application. The second strategy is application-level and consists in including FT mechanism(s) into the algorithms themselves [26][27][24][16][21][22][12].

In this paper, we propose FTH-B&B (Fault-Tolerant Hierarchical Branch and Bound), designed for volatile computational grids. FTH-B&B is an application-level distributed FT mechanism composed of several FT MW-B&Bs organized hierarchically into groups. A fault recovery mechanism is introduced to avoid the loss of work units and to improve efficiency in terms of execution time. Indeed, work units are stored so that the sub-problems assigned to dead processes can be recovered at any instant using a 3-phase recovery mechanism (presented in Section 4.1). Moreover, our approach maintains a balanced and safe hierarchy during the lifetime of the algorithm in order to guarantee a valid functioning. Finally, an efficient restart of the application in case of failure is ensured by a distributed checkpointing mechanism. ProActive includes an FT mechanism, but it is not exploited in the present work; the FT issue is dealt with using our own FT mechanism proposed in this work. FTH-B&B has been applied to the Flow-Shop scheduling problem (FSP) and experimented using the Grid'5000 French nation-wide grid [23]. The experiments show the ability of FTH-B&B to deal efficiently with FT in large scale environments. Indeed, the overall efficiency of the algorithm easily reaches 98,90% using the FT mechanism. Moreover, the recovery and rescheduling of sub-problems minimize the redundant work, which improves the execution performance. Finally, our approach has proved its ability to maintain and balance the hierarchy during its lifetime.

The rest of this paper is organized as follows: Section 2 presents an overview of B&B algorithms and their parallelization. Section 3 presents existing FT mechanisms of parallel B&B algorithms. The proposed FTH-B&B and its associated mechanisms for work management, fault recovery, maintenance of the hierarchy, and distributed checkpointing are detailed in Section 4. Section 5 reports the experimental results obtained on the Grid'5000 real nation-wide grid infrastructure. In Section 6, the concluding remarks of the paper are drawn and some perspectives are presented.

2 PARALLEL BRANCH AND BOUND

B&B algorithms are well-known techniques for the exact resolution of COPs. They make an implicit enumeration of all solutions of the search space instead of an exhaustive enumeration, which is obviously impractical due to the exponential growth of the number of potential solutions. B&B algorithms are characterized by four operators: branching, bounding, selection and elimination. Using the branching operator, the search space of a given problem is partitioned into a set of smaller sub-problems. The bounding operator is used to compute a lower bound of the optimal solution of a sub-problem. When a new upper bound is identified, it is compared to this lower bound in order to decide whether or not it is necessary to branch the sub-problem. The elimination operator identifies the nodes which cannot lead to the optimal solution and eliminates them. Sub-problems are explored according to a given selection strategy, which can be depth-first search, breadth-first search, best-bound search, etc. A serial implementation of the algorithm consists of a sequential execution of the above four operations. The recursive application of the branching operator creates a dynamic tree of sub-problems. The root of the tree designates the original problem, the leaves are solutions, and the inner nodes are intermediate sub-problems. The value of the best upper bound found so far is used to eliminate branches and prune the tree: if the lower bound of the current sub-problem is not better than the upper bound, the sub-problem will not be explored; otherwise it is considered an active problem for further branching. The upper bound is updated as soon as a new better solution is found.

The sub-trees generated when executing a B&B algorithm can easily be explored independently because the only global information in the algorithm is the value of the upper bound. Its parallelization may be designed according to the architecture of the processing machine, synchronization, granularity of generated tasks, communication between the different processes, and the number of computing processors [19]. Gendron et al. [19] classified the parallelization strategies into three categories according to the degree of parallelization. Parallelism of type 1 introduces parallelism when performing the operations (generally the bounding operation) on generated sub-problems, e.g. executing the bounding operation in parallel for each sub-problem. In type 2, the search tree is built in parallel by performing operations on several sub-problems simultaneously. Type 3 implies building several trees in parallel; the information generated when building a given tree can be used for the construction of another. Therefore, the tree is explored concurrently.

The processes involved in the computation of the parallel algorithm select their tasks from a work pool, which is a memory from which the processes select, and in which they store, their work units (generated and unexplored sub-problems) [19]. Two types of work pool can be distinguished: single work pool and multiple work pool. The first type is more adequate for applications based on the master-worker paradigm. In the second type, there are several memory locations where processes find and store their work units. The multiple work pool organization is divided into three sub-classes: collegial, grouped and mixed. In a collegial algorithm, each work pool is associated with exactly one process. In a grouped organization, processes are partitioned into groups, and each group of processes has its own work pool. In a mixed organization, each process has its associated work pool, but there is also a global work pool shared by all processes.
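To make the four operators and the work pool concrete, here is a minimal serial B&B sketch in Java; the Problem interface, its method names and the depth-first pool are illustrative assumptions, not tied to any particular middleware or to the paper's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical problem interface: branching and bounding are problem-specific.
interface Problem {
    boolean isLeaf();        // complete solution (leaf of the B&B tree)
    int cost();              // objective value of a complete solution
    int lowerBound();        // bounding operator
    List<Problem> branch();  // branching operator
}

final class SerialBB {
    static int upperBound = Integer.MAX_VALUE;       // best solution found so far

    static void solve(Problem root) {
        Deque<Problem> pool = new ArrayDeque<>();     // single work pool, depth-first selection
        pool.push(root);
        while (!pool.isEmpty()) {
            Problem p = pool.pop();                   // selection operator
            if (p.lowerBound() >= upperBound) {
                continue;                             // elimination operator: prune this sub-tree
            }
            if (p.isLeaf()) {
                upperBound = Math.min(upperBound, p.cost()); // update the global upper bound
            } else {
                for (Problem child : p.branch()) {    // branching operator
                    pool.push(child);
                }
            }
        }
    }
}
```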

3 RELATED WORK

No loss of data is tolerated in parallel exact methods designed to solve COPs. A loss of one or several subproblem(s) can cause the loss of one or several optimal solution(s). Therefore, each parallel B&B designed to be executed in a volatile environment must take into account FT issue. FT can be achieved at the middleware level [29][34][7] or at the application level [26][27][24] [16][21][22][12]. FT mechanisms introduced at the middleware level induce additional overheads to the execution time of the algorithms. For instance, the ProActive middleware provides an FT mechanism based on the replication of the processes involved in the computation. This forces the application developers to reserve nodes dedicated to the replicated processes which leads to the loss of computing power. In our case, we focus on FT at the application-level. In the following, we present the state-of-the-art approaches for application-level FT. Iamnitchi et al. [24] have proposed a fully decentralized parallel B&B algorithm using a multiple pool collegial strategy. Each process maintains its local work pool and sends requests to others when this pool is empty. The process receiving a work request and having enough work in its pool, sends a part of its work to the requester. FT mechanism does not attempt to detect failures of processes and to restore their data, but rather focuses on detecting not yet completed problems knowing completed ones. The approach is based on the fact that the generated sub-problems of a B&B are tree nodes. Therefore, sub-problems are represented only by their position in the tree. Given a set of tree nodes, its complement can be easily found. Each process maintains a list of new locally completed sub-problems and a table of the completed problems. When a problem is completed, it is included in the local list. After a period of time or after processing a fixed number of sub-problems, the list is sent to a set of other processes, selected randomly, as a work report message. When a process receives a work report, it stores it. When a process runs out of work, it chooses an uncompleted problem and solves it. Finkel et al. [16] have proposed DIB (Distributed Implementation of Backtracking) which is an FT hierarchical

algorithm designed for tree-based algorithms like B&B. The algorithm is based on a multiple pool collegial strategy. Its failure recovery mechanism is based on keeping track of machines responsible of each unsolved subproblem. Each machine memorizes the sub-problems for which it is responsible, as well as the machines to which it sends sub-problems or from which it receives subproblems. The completion of a sub-problem is reported to the sub-problem source machine. Hence, each machine can determine whether its work is still unsolved, and can redo it in case of failure. These two approaches have not proposed any mechanism to minimize redundancy of work. Indeed, if a process fails, the entire sub-problem assigned to it is processed again from scratch inducing more redundant work. Dai et al. [12] have proposed a single-level hierarchical master-worker designed for divisible tasks which is similar to divide and conquer paradigm. In their framework, a main master only communicates with some sub-masters, and each sub-master manages a set of workers using multiple pool collegial strategy. Nodes of the hierarchy are organized into groups. Both the middleware-level and application-level FT mechanisms are used. The middleware-level mechanism proposed in ProActive [29] is used only by the main-master and the authors do not specify if it is also used by sub-masters and workers. When an inner master fails, the challenge is to manage its orphan workers. Two techniques are proposed to address this issue: the election of a submaster from the remaining sub-workers and the use of redundant sub-workers. The selected sub-workers are a replica of sub-masters, when a sub-master fails they are used to generate a new sub-master. The main drawback of this approach resides in the use of middleware-level FT and then the definition of redundant processes which will replace the failed ones. Therefore, this approach leads to the loss of computing power. Moreover, no solution is proposed to minimize the redundant work in case of failures. Drummond et al. [9] have proposed an FT hierarchical B&B designed to be run on grid infrastructure applied to the Steiner problem in graphs. The branching process of a steiner problem spawns exactly two sub-problems. The algorithm is organized as follows: a root master is launched on a process of a given cluster. The root master launches and manages a set of leaders on the first processor of each cluster. Each leader represents the processors of the cluster on which it is launched. A leader process branches a given problem into left and right sub-problems and assigns the right sub-problem to a leader according to a specific assignment algorithm. The process is repeated until no leader is available, then the branched sub-problems are assigned to the processes of the same cluster. FT is ensured using uncoordinated checkpoints [15] where each process keeps the unsolved sub-problems and sends checkpoint messages to its neighbor (the


neighborhood is defined according to a specific formula). The leaders do the same but they only send checkpoints to the other leaders. The main drawback of the algorithm resides in the number of tolerated failures. Only one failure per cluster can be handled and beyond one failure, the whole cluster is considered as failed. Moreover, if a leader fails, the entire cluster is also considered as failed and no flexibility in this sense is tolerated. The scalability of their algorithm cannot be proven since they only perform experiments using between 16 and 33 computing resources. Mezmaz et al. [26][27] have proposed B&B@Grid for large scale B&B algorithm using the master-worker paradigm. The authors rethought the representation of the search space and then minimized the quantity of flowing information in the network in order to minimize bottlenecks on the master process. Their approach is based on an efficient coding of work units. A list of subproblems is represented by a unique interval defined by only two integers. Fold and unfold operators are defined to deduce a search sub-space from an interval and vice versa. Therefore, the transferred and stored information in the grid is an interval instead of a list of sub-problems. A single work pool strategy is used for work distribution. In order to overcome the limits of B&B@Grid in terms of scalability, Djamai et al. [8] designed a pure P2P approach for the algorithm. It provides fully distributed algorithms to deal with B&B mechanisms like work sharing, best upper bound sharing and termination detection. FT is ensured by a checkpointing mechanism. The optimal solution and the unexplored sub-problems are stored as intervals in a backup file. When the master fails, the file is checked and then the unexplored sub-problems are deduced using the folding operator. Workers update the intervals regularly and inform the master of any new solution. When a worker fails, the last copy of its interval is either fully allocated to another B&B process or shared between several B&B processes. However, this FT mechanism has been only applied to the original B&B@Grid and has not been extended to the P2P distributed version. In a previous work, we have developed a MW-based framework called P2P-B&B [4] allowing P2P communication between workers. It provides a library of routines that simplifies the implementation of MW-based B&B application. However, the master process becomes rapidly a bottleneck when the number of considered workers increases. To overcome this drawback, we have developed a hierarchical approach AHMW [6] using P2P-B&B. Unlike, the state-of-the-art hierarchical approaches, it presents several levels of masters pushing the scalability limits of the MW-based approach. To deal with FT issue, FT mechanisms have been proposed in [5] in which a mechanism for fault recovery and two strategies for hierarchy maintenance are presented. In this paper, we present an extension of this work: first, the fault recovery mechanisms are presented in detail. Second, two new

Fig. 1. General design of FTH-B&B. Several FT sub-B&Bs are organized hierarchically where inside each sub-B&B one master manages several workers.

reorganization strategies are proposed (Master Election, ME, and Hybrid Strategy, HS) that tolerate the failure of the root-master (see Section 4.2 for more details). Third, distributed checkpointing and reconstitution mechanisms are proposed, allowing the unexplored sub-problems to be reconstituted after a global failure (see Section 4.3). In addition, a realistic failure generator approximating a real unreliable environment is proposed.

4 THE FTH-B&B ALGORITHM

The proposed FTH-B&B is an FT parallel B&B algorithm based on the Hierarchical Master/Worker paradigm in order to deal with the FT issue while ensuring scalability in large scale environments. The proposed approach is composed of several FT Master/Worker-based subB&B algorithms (see Figure 1), where inside each subB&B one master manages several workers and performs failure recovery. The sub-B&Bs are launched in parallel and act independently on different sub-problems. They are organized hierarchically into several levels. The root node and the inner nodes of the hierarchy designate masters and the leaves designate workers. The masters perform a parallel recursive branching in order to decrease the size of sub-problems until reaching sufficiently fine-grained sub-problems which are explored sequentially by the workers. In addition, they locally store the branched and assigned sub-problems for further rescheduling. Each sub-B&B is composed of one master and a group of workers that can also be masters for another sub-B&B at the lower level. Each sub-B&B is developed using the framework (P2P-B&B) presented in [4] where one master manages a dynamic group of communicating workers. Therefore, the algorithm is composed of communicating groups of FT Master/Worker-based sub-B&B processes. FTH-B&B is designed to run on computational Grids providing a large amount of computing resources which are highly unreliable. Therefore, for a safe execution of the algorithm, any failure must be detected and handled.
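As a rough illustration of this hierarchical organization, the following sketch groups a flat set of processes into sub-B&Bs of size g, level by level; the HierarchyNode class and the breadth-first grouping rule are assumptions made for the example and do not reproduce the actual ProActive-based deployment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node of the hierarchy: a master with at most 'groupSize' children,
// each child being either a worker (leaf) or the master of a lower-level sub-B&B.
final class HierarchyNode {
    final int id;
    final List<HierarchyNode> children = new ArrayList<>();
    HierarchyNode(int id) { this.id = id; }
}

final class HierarchyBuilder {
    // Groups processes level by level: the first becomes the root master, the next
    // 'groupSize' become its children, and so on until all processes are placed.
    static HierarchyNode build(int processCount, int groupSize) {
        List<HierarchyNode> currentLevel = new ArrayList<>();
        HierarchyNode root = new HierarchyNode(0);
        currentLevel.add(root);
        int nextId = 1;
        while (nextId < processCount) {
            List<HierarchyNode> nextLevel = new ArrayList<>();
            for (HierarchyNode master : currentLevel) {
                for (int k = 0; k < groupSize && nextId < processCount; k++) {
                    HierarchyNode child = new HierarchyNode(nextId++);
                    master.children.add(child);   // child joins this sub-B&B group
                    nextLevel.add(child);         // it may become a master at the next level
                }
            }
            currentLevel = nextLevel;
        }
        return root;
    }
}
```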


Fault detection is the responsibility of all the processes of the algorithm. Both masters and workers are responsible for detecting failures of the processes they depend on or which depend on them. Masters send heartbeats to their children every Heartbeat Period HP. Workers also send heartbeats to their masters, but every HP × G, G being the size of the group forming one sub-B&B; the heartbeat period is increased G times to avoid flooding the masters with heartbeats. When executing FTH-B&B in a faulty environment, one must take into account three major points. First, one must ensure fault recovery to avoid the loss of work units and to gain in terms of execution time by minimizing the redundant work. Second, one must maintain the same topology during the lifetime of the algorithm in order to guarantee a valid functioning of FTH-B&B. Third, one must ensure an efficient restart of a great number of FTH-B&B processes.

4.1 Work management with task recovery

Each master has its local work pool, obtained by branching the sub-problem assigned to it by its parent. A collegial strategy is adopted since multiple work pools are considered and each inner master has its own work pool. The task distribution is performed according to a fault recovery mechanism involving the current master, its children and its grandchildren.

4.1.1 Fault recovery

Since no loss of work is tolerated in the exact resolution of COPs, the masters manage a list of assigned sub-problems; each element of the list is a mapping between an assigned sub-problem and the process assigned to it. In order to minimize the overall execution time and avoid redundant exploration of sub-problems when facing failures, the masters also manage a list of branched sub-problems (LBS). This list represents the sub-problems being explored by their grandchildren (see Fig. 2). It is updated each time a grandchild finishes the exploration of its assigned sub-problem. When the process exploring a sub-problem fails, the master identifies the sub-problem assigned to it and the branched sub-problems in the LBS, and reschedules only the unexplored part. For example, in Fig. 2, the sub-problems PE, ..., PJ are the result of the branching of the problem PB by the process B. B has already explored PE and PF (in bold in Fig. 2) but not yet PG, PH, PI and PJ. Therefore, when B fails, A will only reschedule PG, PH, PI, and PJ to the newly connected process C.
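A minimal sketch of the heartbeat bookkeeping described at the beginning of this subsection; the data structure, the timeout rule (a process is suspected after missing a few heartbeat periods) and the method names are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical detector: each process records the last heartbeat received from the
// processes it monitors and declares a failure when a heartbeat is overdue.
final class HeartbeatDetector {
    private final long periodMillis;            // HP for masters, HP * G for workers
    private final Map<Integer, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatDetector(long periodMillis) { this.periodMillis = periodMillis; }

    void onHeartbeat(int processId) {           // called when a heartbeat message arrives
        lastSeen.put(processId, System.currentTimeMillis());
    }

    boolean isSuspected(int processId) {        // overdue by more than 3 periods => suspected
        Long t = lastSeen.get(processId);
        return t == null || System.currentTimeMillis() - t > 3 * periodMillis;
    }
}
```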

Fig. 2. Work management with fault recovery. The parent of a failed worker only reschedules the unexplored sub-problems in order to minimize redundancy.

Fig. 3. Sequence diagram of the 3-phase recovery mechanism.

4.1.2 3-Phase Recovery Mechanism

In order to deal with the FT issue, the work distribution and management are revisited using a 3-phase recovery mechanism. The three phases are described in the following (see the sequence diagram in Fig. 3); a small bookkeeping sketch is given after the list:

• Phase 1 (between a master and its children): Phase 1 allows a master to assign problems to its children and to receive back the branched sub-problems. In the sequence diagram, Mi assigns a problem Pij to its child Mij. The latter performs a branching and sends back the branched sub-problems Bij[1..n] to its parent Mi. After that, Mij can assign the sub-problems Bij[1..n] to its available children Cij[1..g].

• Phase 2 (between a master, its children and its parent): Phase 2 serves to update the already explored sub-problems. Each time a child Cijk finishes the exploration of its sub-problem Bij[k], Mij informs Mi, which marks Bij[k] among the already explored sub-problems of Cijk. Therefore, the master Mi knows at any time the unexplored parts of a given sub-problem.

• Phase 3 (between a master and a new free process): Phase 3 allows the master to reschedule the unexplored sub-problems to another free safe worker if the process handling them has failed. For example, suppose that after a period of time t, the children Cij[1..g] have finished the exploration of the sub-problems Bij[1..m], m < n. After t + 1, even if Mij fails, Mi will only reschedule Bij]m..n] to another free safe process.
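The bookkeeping a master could keep to support these three phases is sketched below: it records the sub-problems branched by each child (the LBS), marks them as explored when Phase-2 updates arrive, and returns the unexplored remainder to be rescheduled in Phase 3. The data structures and method names are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grandparent-side view of one child Mij and its branched sub-problems Bij[1..n].
final class ThreePhaseLedger {
    // child id -> (sub-problem id -> explored?) : the List of Branched Sub-problems (LBS)
    private final Map<Integer, Map<String, Boolean>> lbs = new HashMap<>();

    // Phase 1: the child sends back the sub-problems it branched from its assigned problem.
    void registerBranched(int childId, List<String> branched) {
        Map<String, Boolean> entry = new LinkedHashMap<>();
        for (String sp : branched) entry.put(sp, Boolean.FALSE);
        lbs.put(childId, entry);
    }

    // Phase 2: a grandchild finished sub-problem Bij[k]; the child forwards the update.
    void markExplored(int childId, String subProblem) {
        Map<String, Boolean> entry = lbs.get(childId);
        if (entry != null) entry.put(subProblem, Boolean.TRUE);
    }

    // Phase 3: the child failed; only the sub-problems not yet explored are rescheduled.
    List<String> unexploredOf(int failedChildId) {
        List<String> remainder = new ArrayList<>();
        Map<String, Boolean> entry = lbs.getOrDefault(failedChildId, new LinkedHashMap<>());
        for (Map.Entry<String, Boolean> e : entry.entrySet()) {
            if (!e.getValue()) remainder.add(e.getKey());
        }
        return remainder;
    }
}
```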


The 3-phase recovery mechanism allows the masters to minimize the execution of redundant work units, to speed up the exploration, and thus to improve the overall performance. In order to obtain more precision about the already explored sub-problems and to store more branched sub-problems, the 3-phase recovery mechanism can be extended to include more levels of the descendants of a process, i.e. more than two levels. We then obtain communications between a master Mi and all its descendants down to a given level. For instance, if j levels are included, Mi will communicate with all its descendants Cij down to the j-th level. Nevertheless, additional communication must be established between a master and its whole descendant tree, at the expense of communication overhead. Indeed, the amount of transferred and stored sub-problems grows exponentially as j increases. Therefore, a trade-off must be found between the number of levels participating in the 3-phase recovery mechanism and the number of stored sub-problems, and consequently the amount of redundant work. In this work, only one level of masters participates in the 3-phase mechanism.

4.2 Maintenance of the hierarchy

It is important to maintain the same topology during the lifetime of the algorithm in order to get a reliable execution. Indeed, all communication and work management are based on the underlying topology (in our case the hierarchy). In addition, failures can isolate some processes from the rest of the hierarchy, leading to a loss of computing power and/or data. Moreover, at each instant, the hierarchy must be balanced to avoid the creation of new bottlenecks during the execution of the algorithm. Masters and workers are both subject to failures. The failure of a worker does not have a great impact because it is located at a leaf of the hierarchy: no other process depends on it and its task can be partially rescheduled by its parent using the 3-phase mechanism. However, a master failure can isolate some parts of the hierarchy because the inner masters represent intermediary links. Indeed, when an inner master fails, the sub-B&B it manages crashes and the link between its descendants and the rest of the hierarchy is lost. Therefore, orphan branches may be created, leading to the failure of the algorithm. Hence, it is necessary to rebuild the hierarchy by creating new links between the descendants and the ascendants of the failed master. Three reorganization strategies are considered: Master Election (ME), Balanced Hierarchy (BH), and Hybrid Strategy (HS). In these strategies, every process pi holds a list of all its ascendants Ai[1..(l − 1)], l being the level of pi in the hierarchy. Ai contains at most logg(N) elements, N being the size of the computational pool and g the size of a sub-B&B group. As shown in [5], the use of a simple strategy to reorganize the hierarchy, where the orphan processes connect to the closest ascendant master, creates new bottlenecks. In the following, we present the proposed strategies, which aim at keeping the hierarchy balanced at all times and tolerant to the failure of the root-master.


Fig. 4. Maintenance of the hierarchy using ME. When a master fails, the orphan workers elect a new master, which connects to its closest ascendant and considers its electors as its children.

4.2.1 Master Election (ME)

In this strategy, when a master m fails, the concerned orphan processes pi ∈ Cm elect a new master among them using the bully algorithm [17]. Each process has a unique identifier assigned to it at the moment of its creation. In this algorithm, when the master fails, the process with the highest identifier is selected as follows. When a process pi detects the failure of its master, it initiates an election by sending an election message to all its neighbors with a higher identifier. If no process responds, pi becomes the master and announces its success to the other processes. If one of the processes answers, it means that there is at least one safe process which can be a master, and pi ends its election. When a process pj receives an election message from a process pi with a lower identifier, it answers and initiates a new election. If the newly chosen master process pi is a worker, it changes its behavior and becomes a master, considering its old neighbors as its children: Ci = Ci ∪ Ni, Ni being the list of its neighbors. Ci will then contain both its old children and its old neighbors. Each pj ∈ Ni and pk ∈ Ci update their neighbors Nj/k = Nj/k ∪ Ci. After that, pi connects to its closest safe ascendant pa ∈ Ai. pa informs pi about its new neighbors Ni = Ca and about the list of the processes it will contact in case of a future election session, Ei = Ca. For example, in Fig. 4, if P fails, E and G select F as a master and F considers them as its children CF = {K, L, E, G}. After that, F connects to its ascendant T, which informs it about its new neighbors NF = {B, I} and the new electors EF = NF. To avoid the same process always being selected and consequently overloaded with multiple orphan processes, we actually consider the lower identifier in the bully algorithm, and each newly elected master is assigned the highest identifier in its neighborhood. This method tolerates the failure of the root-master of the hierarchy: when the root-master fails, the masters of the first level select a master among themselves and this newly selected master behaves as the root-master.
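A compact sketch of the election step as adapted here (lowest identifier wins, and the winner is then given the highest identifier so that it is not chosen again); message passing and failure detection are abstracted away, and the class and method names are assumptions.

```java
import java.util.List;

// Hypothetical process descriptor used during an election among orphan siblings.
final class Peer {
    int id;                 // election identifier (re-ranked after a win, see below)
    boolean alive = true;
    Peer(int id) { this.id = id; }
}

final class BullyElection {
    // Variant used in FTH-B&B: the live orphan with the LOWEST identifier becomes the
    // new master; it is then given the highest identifier of the group so that it is
    // not chosen again in the next election session.
    static Peer elect(List<Peer> orphans) {
        Peer winner = null;
        int maxId = Integer.MIN_VALUE;
        for (Peer p : orphans) {
            if (p.alive && (winner == null || p.id < winner.id)) winner = p;
            maxId = Math.max(maxId, p.id);
        }
        if (winner != null) winner.id = maxId + 1;   // re-rank the elected master
        return winner;                               // null if every orphan has failed
    }
}
```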


4.2.2 Balanced Hierarchy (BH)

In this strategy, when a master m crashes, the orphan processes pi ∈ Cm migrate to another safe sub-B&B. Their migration is done according to a migration mechanism which assigns the new orphan processes to a safe non-full sub-B&B. A sub-B&B is full if the number of processes composing it reaches the threshold group size a master can manage. The migration mechanism aims at avoiding the convergence of FTH-B&B to a traditional MW-B&B and at keeping the hierarchy balanced at each instant. This is achieved by readjusting the sizes of the sub-B&B groups each time a master fails. An orphan process pi connects to its closest safe ascendant pa ∈ Ai (which then holds pi) only if the group pa manages is not full. A process receiving orphan processes from its parent also respects the constraint of the group size: it dispatches the orphan processes among its children if the group it manages is full. In Fig. 5, if D fails, its children E, F, and G connect to T, which keeps only one node, E, and dispatches the others, F and G, respectively to its children B and P. Therefore, the hierarchy of FTH-B&B changes over time as masters fail. This avoids creating new bottlenecks after several failures and prevents the risk of convergence to a standard MW-B&B during the lifetime of the application. Moreover, at each instant the hierarchy is balanced whatever the number of failures, since the new orphan processes are dispatched over all the hierarchy.

Fig. 5. Maintenance of the hierarchy using BH. Orphan processes migrate to safe non-full sub-B&Bs, satisfying the constraint of the group size.
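A sketch of the BH migration rule, reusing the hypothetical HierarchyNode class from the sketch at the beginning of Section 4; attaching orphans to the least-loaded child when a group is full is an assumption made for the example, the text only requires that a full ascendant dispatches them among its children.

```java
import java.util.ArrayList;
import java.util.List;

final class BalancedHierarchy {
    static final int GROUP_SIZE = 10;   // threshold size a master can manage (assumption)

    // Attach one orphan under 'target' if its group is not full; otherwise push it down
    // to the least-loaded child so that the hierarchy stays balanced.
    static void adopt(HierarchyNode target, HierarchyNode orphan) {
        if (target.children.size() < GROUP_SIZE) {
            target.children.add(orphan);
            return;
        }
        HierarchyNode leastLoaded = target.children.get(0);
        for (HierarchyNode c : target.children) {
            if (c.children.size() < leastLoaded.children.size()) leastLoaded = c;
        }
        adopt(leastLoaded, orphan);
    }

    // When a master fails, every orphan child migrates to the closest safe ascendant.
    static void handleMasterFailure(HierarchyNode closestSafeAscendant, List<HierarchyNode> orphans) {
        for (HierarchyNode orphan : new ArrayList<>(orphans)) {
            adopt(closestSafeAscendant, orphan);
        }
    }
}
```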

4.2.3 Hybrid Strategy (HS)

The main drawback of the Balanced Hierarchy method is that it does not tolerate the failure of the root-master. To circumvent this drawback, we have developed the Hybrid Strategy, which benefits from the advantages of the two previous methods, BH and ME. HS uses BH to keep the hierarchy balanced and, when the root-master fails, it uses ME to select a new root-master. The newly selected root-master then applies BH again in order to respect the group size constraint and balance the hierarchy.

4.3 Distributed checkpointing

In a traditional MW-B&B, reliability can be achieved through checkpointing operations performed by the master process. However, this approach assumes that the master is reliable. In our approach, there exist several levels of checkpointing. Each master in the hierarchy performs distributed asynchronous checkpointing operations independently from the others. This makes the whole algorithm more reliable even if some inner masters fail. Each master process uses a back-up file to store a sequence of unexplored branches, represented by a vector of partial solutions. When a master fails, the newly designated master reads the back-up file, reconstitutes the locally unexplored sub-problems, and the exploration process then carries on from the last consistent local state of the sub-B&B. After a global failure, a new instance of the algorithm must be launched. The newly formed hierarchy is different from the hierarchy formed before the global failure because of the arrival order of the nodes of the newly launched instance. The new processes must reconstitute the unexplored sub-problems from the different back-up files so that the newly formed B&B tree is similar to the one before the failure. Therefore, the exploration process cannot begin until a consistent global state is achieved. A consistent global state in our case is obtained when the B&B tree formed by the whole set of sub-trees handled by the different processes of the hierarchy is valid. The reconstitution of the sub-problems is illustrated in the following.

4.3.1 Reconstitution of the B&B tree

Even though a consistent local state of a sub-B&B is easy to achieve, achieving a global one is not trivial. The processes must collaborate to determine a consistent global state of the system. At each instant of the execution of FTH-B&B, the graph formed by the dependencies between the sub-problems being explored must match the one formed by the processes (the masters and workers). A consistent global state of the system means that the processes must reconstitute the sub-problems according to their dependencies before the crash. Therefore, the same graph of dependencies (hierarchy) must be reconstituted. Let I be the set of processes of FTH-B&B, pi a problem being explored by the process i, Ci the children of i, and δpi the set of sub-problems obtained by the branching of the problem pi. At each instant, a consistent global state means that the processes of the algorithm and the dependencies of the sub-problems satisfy the implication:

\forall\, i, j \in I,\; j \in C_i \Rightarrow p_j \in \delta_{p_i} \qquad (1)

This implication means that whenever a process j is a child of i, it explores sub-problems obtained by the branching of problems belonging to the work pool of its parent i. After a failure of all processes, a new instance of the system is launched and a new hierarchy is formed. The masters retrieve the unexplored sub-problems from the distributed back-up files. Some of the new processes (i and their children j) are launched on processors different from those on which they were launched before the failure. They retrieve sub-problems belonging to other failed processes. Consequently, the dependencies between the newly retrieved sub-problems pj form a random graph and do not match the initial hierarchy. Therefore, (1) cannot be satisfied, because ∃ i, j ∈ I, j ∈ Ci | pj ∉ δpi. Moreover, there is no global information about the retrieved work. Therefore, the global set of unexplored sub-problems cannot be known before reconstituting the sub-problems according to their dependencies. To address this issue, we define three operators: Gather, Reduce, and Deduce.

4.3.2 Reconstitution operators

Gather allows one to gather the distributed sub-problems and to aggregate them into several centralized points. Each process sends the retrieved sub-problems to its parent, which collects all its children's sub-problems. Let yi be the sub-problem retrieved by the process i and Ci the set of its children. The set of unexplored sub-problems of i can be formulated as:

G_i = \Big(\bigcup_{j \in C_i} y_j\Big) \cup y_i \qquad (2)

Reduce allows one to reduce several sub-problems into their smallest root problem p ≠ P (P being the original problem) such that they can be obtained by the branching of p, according to (3). The reduction of two sub-problems q and r is achieved using the equation:

R_i(q, r) =
\begin{cases}
p & \text{if } \exists\, p \mid q, r \in \delta_p \;\wedge\; \nexists\, p' \mid p' \in \delta_p \wedge q, r \in \delta_{p'} \\
r & \text{if } q = \phi \\
q & \text{if } r = \phi \\
\{q, r\} & \text{otherwise}
\end{cases}
\qquad (3)

Each process uses Deduce, depicted in (4), to deduce the set of the unexplored sub-problems of each sub-B&B:

D_i = \prod_{p,\, q \in G_i} R_i(p, q) \qquad (4)

Fig. 6. Consistent global state. Masters perform their checkpoints after they receive all the couples (p, φp).



The product (∏) represents the application of the operator Reduce to all couples of sub-problems (p, q), including the newly obtained root problems. Using these three operators, the processes perform the reverse operation of the 3-phase recovery mechanism described in Section 4.1.2. In the 3-phase recovery mechanism, a master assigns a problem to its child and the child then sends back the branched sub-problems. Here, a child sends sub-problems to its parent and the parent then reconstitutes their root problem using the Gather, Reduce, and Deduce operators. After that, it reconstitutes its local work pool and the list of branched sub-problems. The sub-problems are sent as a list of couples (p, φp) such that p ∈ Di and φp = {Gi ∩ δp}. Each couple represents a root problem p in the set Di and its corresponding set δp of retrieved sub-problems obtained by the branching of p. From the point of view of the 3-phase recovery mechanism, Di represents the local work pool of the process i and Gi represents the list of the branched sub-problems (LBS).

4.3.3 Consistent global state

The three operators allow a consistent global state to be built gradually from the local ones. At each instant, after a process i has applied Gather, Reduce, and Deduce in this order:

\forall j \in C_i,\; p_j \in pool(j) \;\Rightarrow\; p_j \in G_i \;\Rightarrow\; \exists\, p_i \in D_i,\ \exists\, p'_i \in G_i \mid p_i = R_i(p_j, p'_i) \;\Rightarrow\; p_j \in \delta_{p_i} \qquad (5)

These operators are applied to the processes from the leaves to the root master. All the reduced sub-problems (roots of the retrieved sub-problems) are obtained and the global unexplored work is deduced. The consistent global state can be built gradually (see Fig. 6) until reaching the root master.


Fig. 7. An example of sub-problems recovery and reduction mechanism. Masters apply the three operators to deduce the global unexplored sub-problems.

Fig. 8. Lifetime distribution of the launched processes (number of processes per lifetime interval, in seconds).

Therefore, for the root master, ∀ i, j ∈ I, j ∈ Ci ⇒ pj ∈ δpi, i.e. Equation (1) is satisfied. In Fig. 6, the checkpoints are represented by gray squares. They are performed when a master has received the couples (p, φp) from all its children. A complete example of the use of the three operators is presented in Fig. 7, where a Flow-Shop scheduling problem is considered. A sub-problem is represented by a vector V[1..n] of scheduled jobs. The root problem of two sub-problems V1[1..l] and V2[1..m] is a vector V[1..k], k ≤ l, m, such that V1 = V·V′ and V2 = V·V′′, with V′ = V1]k..l] and V′′ = V2]k..m]. In Fig. 7, the workers check their back-up files and then send the sub-problems to their parents. For each master, the operators are applied and the couples (p ∈ Di, {Gi ∩ δp}) are computed. The global unexplored sub-problems are located at the level of the root master. For example, the workers W100 and W101 retrieve respectively the sub-problems sp100 = [7, 8, 2] and sp101 = [7, 8, 4]. sp100 and sp101 are sent to the parent master M10 using Gather. M10 reduces them using Reduce and their root problem rp10 = [7, 8] is found. After that, the locally unexplored sub-problems are deduced using Deduce and the couple of the root problem and the branched sub-problems, D10 = ([7, 8], {[7, 8, 4], [7, 8, 2]}), is found.

The main drawback of this approach is the large number of generated backup files when the number of launched sub-B&Bs increases, which can degrade the overall performance of the approach. Indeed, when many masters are launched on the same grid cluster, the multiple disk accesses can slow down the cluster. One can remedy this drawback by uniformly launching sub-B&Bs on different grid clusters, or by limiting the number of masters that can perform checkpointing. Therefore, two types of masters can be distinguished: masters that perform checkpointing (active masters) and those that do not (passive masters). The active masters are located in the upper levels of the hierarchy and the passive ones in the lower levels. We can decide dynamically whether a master participates in the checkpointing process or not.
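For this vector representation, the Reduce operator essentially computes the longest common prefix of two job vectors, as in the [7, 8, 2] / [7, 8, 4] → [7, 8] example above. A minimal sketch follows; the int-array encoding and method names are assumptions, and the full operator of Eq. (3) also handles empty arguments and the case where the only common root is the original problem P.

```java
import java.util.Arrays;

final class FspReduce {
    // Root problem of two sub-problems V1 and V2: their longest common prefix V,
    // so that V1 = V·V' and V2 = V·V'' as described for Fig. 7.
    // (The complete Reduce of Eq. (3) also covers the empty-argument cases.)
    static int[] reduce(int[] v1, int[] v2) {
        int k = 0;
        while (k < v1.length && k < v2.length && v1[k] == v2[k]) k++;
        return Arrays.copyOfRange(v1, 0, k);    // e.g. [7,8,2] and [7,8,4] reduce to [7,8]
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(reduce(new int[]{7, 8, 2}, new int[]{7, 8, 4}))); // [7, 8]
    }
}
```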

5 PERFORMANCE EVALUATION

FTH-B&B has been experimented on the Flow-Shop scheduling problem (FSP) which consists in scheduling n jobs on m machines with respect to two constraints: (a) any two jobs can not be assigned simultaneously to the same machine; (b) all the jobs are scheduled in the same order on all the machines. The cost function to be optimized is the total completion time i.e. makespan (CM ax ) cost function. The objective is to find the optimal solution which minimizes the makespan. The root of the tree generated by the B&B algorithm represents a configuration where no job is assigned to any machine. An inner tree node with depth d has a partial solution with d assigned jobs. The effectiveness of B&B algorithms resides in the use of a good estimation of the optimal solution. We have opted for the best known lower bound for the FSP proposed by Lenstra et al. [25]. FTH-B&B has been implemented on top of the ProActive middleware [29][3][10]. ProActive is an open source Java library for the programming of multi-threaded, parallel and distributed applications for Grids, multi-core systems, clusters and data-centers. ProActive allows concurrent and parallel programming and offers distributed and asynchronous communication, a deployment framework, and an FT mechanism. However, FTH-B&B does not use the FT mechanism provided in ProActive. FTH-B&B has been experimented on Grid’5000 which is composed of a set of clusters distributed over 9 sites located in 9 different towns in France. The different sites are interconnected using a dedicated fiber of the 10Gbps Renater [31] network infrastructure. In our experiments, 6 sites have been involved. The use of Grid’5000 is done through the OAR [28] reservation system. Once reserved, the machines of the Grid are exclusively owned by the user (dedicated machines). However, the network is not dedicated, it is shared by the different users of the Grid. Therefore, up to 1900 grid nodes are involved in the experimentations according to their availability. In order to obtain deeper hierarchy, for some experiments


multiple processes are launched on the same processor. Therefore, between 1900 and 8900 FTH-B&B processes are deployed in the different experiments. To make our experiments more realistic in an unreliable environment, we consider that the lifetime distribution function of the processes follows an exponential distribution [33]. The reliability of a process i is measured as the probability that it performs its function for a specified period of time. The reliability function Ri(t) is the probability that the process i will be successfully operating without failure within the interval [0, t]: Ri(t) = P(T > t), t ≥ 0, where T is a random variable representing the failure time. Therefore, the failure probability is given by F(t) = 1 − R(t) = P(T ≤ t). If the lifetime distribution function follows an exponential distribution with parameter λ, then F(t) = 1 − e^(−λt). In our experiments, we only consider the failures of computational nodes. Given constant failure rates of resources, one can obtain the conditional probability of a process success as p(t) = e^(−λt) [11], where λ is the failure rate of computational nodes. In all the experimentations we fixed λ = 0,5 in order to expose the processes to extreme failure situations and to evaluate the robustness of the application. The lifetime distribution of the processes involved in the experimentations is presented in Fig. 8. The height of the bars represents the number of processes and their width represents intervals of lifetime duration; together, the bars show the number of processes that remain alive during a given time duration. In accordance with the exponential distribution, the number of processes decreases as their lifetime increases.

In the following, we experimentally evaluate the ability of FTH-B&B to deal with the FT issue in large scale unreliable environments. We measure the efficiency of FTH-B&B, i.e. the effective CPU time spent by the workers solving their tasks, taking FT into account (the time spent storing, updating and recovering sub-problems). We also evaluate the benefit of the task recovery using the 3-phase mechanism in terms of the number of recovered and rescheduled sub-problems. In addition, we measure the impact of the different checkpointing levels on the performance of our approach. Finally, we measure the impact of failures on the overall topology of FTH-B&B using the proposed reorganization strategies. Several FSP instances are considered in the performed experiments, using the Taillard instances published in [32]. According to the size of the FSP instances, three classes are considered: small instances (20 × 20), medium (50 × 10 and 50 × 20), and large (100 × 10 and 100 × 20). The total number of tested instances is 50: 10 instances of each instance size are considered and the average values over each range of instances are computed.
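A small sketch of how such failures can be injected: each process draws its lifetime from an exponential distribution by inverse-transform sampling and is killed when it elapses. Only the distribution and the value λ = 0,5 come from the text; the sampling code and the time unit are assumptions.

```java
import java.util.Random;

final class FailureInjector {
    // Inverse-transform sampling of an exponential lifetime: T = -ln(U) / lambda.
    static double sampleLifetime(double lambda, Random rng) {
        return -Math.log(1.0 - rng.nextDouble()) / lambda;
    }

    public static void main(String[] args) {
        double lambda = 0.5;                       // failure rate used in the experiments
        Random rng = new Random(42);
        for (int i = 0; i < 5; i++) {
            // Each FTH-B&B process would be scheduled to fail after this many time units.
            System.out.printf("process %d dies after %.2f time units%n", i, sampleLifetime(lambda, rng));
        }
    }
}
```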

In the first experiment, reported in Table 1, we evaluate the impact of the delay induced by the proposed FT mechanisms (the 3-phase recovery mechanism and the checkpointing mechanism) on the performance of the algorithm. We calculate the ratio R between the effective execution time tExec and the idle time tI recorded on the workers. The idle time includes: (1) the communication time tC; (2) the additional time of internal management tM; (3) the checkpointing time tCheck, including the time the processes spend storing unexplored sub-problems; (4) the time the masters take to perform the 3-phase recovery mechanism, t3-Phase, which includes the branching time, the time for storing the sub-problems received from children, and the time for updating the sub-problems explored by the grandchildren:

R = 100 \times \frac{t_{Exec}}{t_{Exec} + t_I}, \qquad t_I = t_{3\text{-}Phase} + t_C + t_M + t_{Check}

The columns of Table 1 designate, respectively, the size of the Taillard instance (N jobs × M machines), the range names of the 10 instances per size class, the cumulative effective execution time, the 3-phase time, the communication and local work management time, and the parallel efficiency. All the times are expressed in seconds. As shown in Table 1, the use of the 3-phase recovery mechanism is not costly in terms of execution time. Indeed, the masters and workers spend on average only 0,55% of their total execution time recovering sub-problems of failed processes. Moreover, the average waiting time of the workers is negligible (only 0,03% of the total time), thanks to the short time the masters take to handle worker failures. The parallel efficiency in the last column shows that the workers spend on average 98,03% of their time solving sub-problems. Nevertheless, it varies from one instance to another (98% for small instances versus 96% for large ones). This is mainly due to the variation of the amount of data to recover and store when performing checkpoints, implying a variation in recovery and checkpointing times. The rest of the idle time (on average 1,40%) is taken by the checkpointing mechanism, which we present in detail below, where the checkpointing times differ from one level to another.

Even though the computing resources of computational grids are geographically distributed, there are some shared resources such as the network and the storage devices. These resources are shared by several users and by the different launched processes of the same application. In the current work, we are interested in the storage devices, which are a critical resource within each grid cluster. Therefore, when many masters are launched on the same cluster, the multiple disk accesses can slow down the cluster. As mentioned in Section 4.3.3, the checkpointing mechanism is tunable, and the number of masters performing the checkpointing can be tuned to avoid generating a large number of backup files when the number of masters increases. To measure the impact of this mechanism on the performance of the algorithm, additional experiments have been conducted. Table 2 reports the amount of stored and recovered work units, the checkpointing time


TABLE 1
Measurement of the efficiency of FTH-B&B on FSP instances of different sizes. The efficiency is on average 98,03%.

Size of Instance | Range of Instances | Execution Time TExec (s) | 3-Phase Time T3-Phase (s) | Waiting Time TM + TC (s) | Efficiency
20 × 20   | Ta021 - Ta030 | 508404,05 | 1271,01          | 3,85            | 98,90 %
50 × 10   | Ta031 - Ta040 | 530620,39 | 1340,08          | 3,48            | 98,76 %
50 × 20   | Ta041 - Ta050 | 533474,89 | 1349,54          | 10,85           | 98,83 %
100 × 10  | Ta051 - Ta060 | 478576,32 | 1227,28          | 58,47           | 97,48 %
100 × 20  | Ta061 - Ta070 | 496698,69 | 1269,27          | 674,85          | 96,14 %
Average   |               | 509842,57 | 1062,27 (0,55 %) | 150,30 (0,03 %) | 98,03 %

TABLE 2
Measurement of the impact of the different checkpointing levels on the performance of FTH-B&B.

One checkpointing level:
Size of Instance | Range of Instances | Saved Subpbs (×10^9) | Check Time (s) | Freq (c/s) | Size of files (KB)
20 × 20  | Ta021 - Ta030 | 02,91 | 481,92 | 0,80 | 0015,85
50 × 10  | Ta031 - Ta040 | 13,68 | 502,73 | 0,84 | 0282,12
50 × 20  | Ta041 - Ta050 | 02,96 | 495,33 | 0,83 | 0323,04
100 × 10 | Ta051 - Ta060 | 07,08 | 484,56 | 0,81 | 1284,60
100 × 20 | Ta061 - Ta070 | 00,90 | 492,88 | 0,82 | 1081,00
Average  |               | 05,51 | 491,48 | 0,82 | 0597,32

Two checkpointing levels:
Size of Instance | Range of Instances | Saved Subpbs (×10^9) | Check Time (s) | Freq (c/s) | Size of files (KB)
20 × 20  | Ta021 - Ta030 | 028,25 | 4863,01 | 8,11 | 0120,95
50 × 10  | Ta031 - Ta040 | 132,82 | 5072,98 | 8,45 | 3067,40
50 × 20  | Ta041 - Ta050 | 030,00 | 4998,29 | 8,33 | 4091,36
100 × 10 | Ta051 - Ta060 | 068,87 | 4889,62 | 8,15 | 17668,6
100 × 20 | Ta061 - Ta070 | 008,67 | 4973,58 | 8,29 | 17260,2
Average  |               | 053,72 | 4959,50 | 8,27 | 8441,70

Three checkpointing levels:
Size of Instance | Range of Instances | Saved Subpbs (×10^9) | Check Time (s) | Freq (c/s) | Size of files (KB)
20 × 20  | Ta021 - Ta030 | 093,94 | 43416,6 | 72,36 | 00318,49
50 × 10  | Ta031 - Ta040 | 432,86 | 45291,2 | 75,49 | 06555,56
50 × 20  | Ta041 - Ta050 | 094,42 | 44624,4 | 74,37 | 09595,16
100 × 10 | Ta051 - Ta060 | 198,86 | 43654,2 | 72,76 | 53508,60
100 × 20 | Ta061 - Ta070 | 020,37 | 44403,8 | 74,01 | 49700,52
Average  |               | 168,09 | 44278,0 | 73,80 | 23935,67

and frequency, and the total size of the backup files using different checkpointing levels. In each checkpointing level, all the masters belonging to the same level of the hierarchy are involved in the checkpointing process (active masters) and the others are passive masters, i.e. they only perform B&B branching and exploration operations and they do not care about checkpointing. Table 2 is organized as follows: the first and second columns represent respectively the size of the Taillard’s instances and their names. The next 3 sub-tables represent the number of levels involved in the checkpointing mechanism (1, 2, and 3 levels). Within each sub-table, the number of stored sub-problems (in billions), the checkpointing time (in seconds), the checkpointing frequency (checkpoint/second), and the cumulative size of the generated files (kilo byte) are represented. In this experiment each master performs a checkpointing operation every 10 seconds, but this value can be calibrated to achieve a better performance. We notice that the number of stored sub-problems grows as the number of checkpointing levels grows. Moreover, it grows with a larger scale: on average 05,51 billion nodes are stored when using one checkpointing level, 53,72 billion using two levels (more than 10 times the number of sub-problems) and more than 168 when using 3 levels (more than 33 times). Thanks to the number of involved masters on each level which grows exponentially as the level grows. It is important to increase the number of checkpointing processes because the larger the number of checkpointing masters is, the smaller the amount of redundant work is. However, this large amount of sub-problems is obtained to the detriment of the cumulative size of back-up files, the checkpointing frequency, and the checkpointing time.
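A sketch of what the periodic checkpointing loop of an active master could look like, serializing the unexplored partial-solution vectors to its back-up file every checkpoint period (10 seconds in these experiments); the file format, the use of Java serialization and the class names are assumptions, not the actual FTH-B&B code.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class CheckpointWriter implements Runnable {
    private final Path backupFile;
    private final long periodMillis;        // e.g. 10 000 ms as in these experiments
    private final List<int[]> unexplored;   // vectors of partial solutions (the local work pool)
    private volatile boolean active = true; // passive masters simply never start this thread

    CheckpointWriter(Path backupFile, long periodMillis, List<int[]> unexplored) {
        this.backupFile = backupFile;
        this.periodMillis = periodMillis;
        this.unexplored = unexplored;
    }

    @Override
    public void run() {
        while (active) {
            try {
                try (ObjectOutputStream out =
                         new ObjectOutputStream(Files.newOutputStream(backupFile))) {
                    synchronized (unexplored) {
                        out.writeObject(new ArrayList<>(unexplored)); // snapshot of the unexplored branches
                    }
                }
                Thread.sleep(periodMillis);
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void stop() { active = false; }
}
```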

TABLE 3
Measurement of the total reconstitution time after a global failure.

Size of Instance | Range of Instances | Execution Time (s) | Restored work (×10^9) | Reconst Time (s) | % Exec time
20 × 20  | Ta021 - Ta030 | 508404,05 | 93,94  | 52,43  | 0,02
50 × 10  | Ta031 - Ta040 | 530620,39 | 432,86 | 116,65 | 0,02
50 × 20  | Ta041 - Ta050 | 533474,89 | 94,42  | 115,17 | 0,02
100 × 10 | Ta051 - Ta060 | 478576,32 | 198,86 | 372,20 | 0,08
100 × 20 | Ta061 - Ta070 | 496698,69 | 20,37  | 772,33 | 0,59
Average  |               | 509554,87 | 168,09 | 285,76 | 0,15

The size of the back-up files grows according to two main parameters: the size of the instance and the number of checkpointing levels. Fortunately, as shown in the table the size of back-up files is not really embarrassing because the obtained cumulative size of back-up files is not huge compared to the number of involved checkpointing masters. Indeed, on average the cumulative size varies between 597 KB when using one checkpointing level and 23 MB for 3 checkpointing levels. The second parameter is the checkpointing frequency, it varies between 0,82 c/s for one checkpointing level and 73,80 c/s when involving 3 checkpointing levels. According to the architecture of the considered grid and the number of storage devices, this parameter must be chosen carefully. It must be sufficiently high to avoid loosing too large number of sub-problems and sufficiently small to avoid overloading both the storage device(s) and the shared network. The checkpointing frequency also influences the checkpointing time cost. When high frequency is considered, the processes spend more time in storing sub-problems and vice versa when


A trade-off must therefore be found for this parameter.

Table 3 shows the time the newly launched processes spend reconstituting the unexplored work after a global failure (100% of dead processes) using the reconstitution mechanism. The columns represent respectively the size and the name of the instance, the cumulative execution time (in seconds), the amount of sub-problems restored from the backup files (in billions), the reconstitution time, and the ratio between the reconstitution time and the total execution time. We notice that the reconstitution time grows with the size of the instance but, on average, it represents only 0,15% of the execution time. This is because all the processes participate in this operation, which is performed in a distributed way and thus decreases the total reconstitution time.

The proposed task recovery mechanism aims at minimizing the redundant work (rescheduled sub-problems) when the processes handling it fail. Table 4 reports the amount of rescheduled sub-problems for different failure rates; the hierarchy is reorganized using BH. The table is composed of 3 sub-tables presenting respectively the amount of rescheduled work using the 3-Phase recovery mechanism, the amount without using it, and the resulting redundant work. Within each sub-table, 3 different failure rates are considered: 1/6 (16%), 1/3 (33%), and 1/2 (50%) of the total number of processes. Within the second sub-table, the ratio between the rescheduled work without the 3-Phase recovery mechanism and the rescheduled work with it is given in parentheses. We notice that the failure rate strongly influences the amount of rescheduled work both with and without the 3-Phase recovery mechanism, and this amount is larger for larger instances. Consequently, masters that do not use the recovery mechanism perform more redundant work as the failure rate increases. Moreover, the approach without the 3-Phase recovery mechanism reschedules between 1,09 and 110,30 times more work than the approach using it (see the worked example after Table 4).

As mentioned in Section 4.2, it is important to keep the hierarchy balanced during the lifetime of the algorithm in order to obtain a valid execution and to avoid isolating parts of the hierarchy when inner masters fail. In the following, we experiment with an approach that uses neither a recovery mechanism nor a method to maintain the hierarchy in an unreliable environment, and we measure the number of isolated processes and the induced lost work. In this scenario, 4000 grid processes are launched and each master manages a group of size 10. The resulting hierarchy then contains 3 levels (the 4th level is only composed of workers). In Table 5 only the 3 levels of masters are represented because the failure of a worker does not induce any orphan process, and therefore no part of the hierarchy is isolated. The considered rate of failures is 1/3 of the total number of processes. The table is organized as follows: the first and second columns represent respectively the size of the Taillard's instances and their names.


Fig. 9. Average number of children (degree) over time for the four most loaded masters using ME. Masters are fairly loaded.

The table then contains 3 sub-tables giving the results for the 1st, 2nd, and 3rd levels. Within each sub-table, the number of failed masters, the number of isolated processes, and the amount of lost work are reported. The results highlight how critical the failure of inner masters is. Indeed, the number of isolated processes is multiplied hierarchically by the group size; it therefore grows exponentially the closer the failed master is to the root of the hierarchy. For instance, the loss of only 3 masters belonging to the first level of the hierarchy isolates more than 1000 processes, which makes the amount of lost work huge.

In the following, we evaluate the ability of the proposed approaches to maintain the hierarchy and thus to avoid isolating parts of it. During the experiments, we measured the degree (number of children) of the different masters. The degree informs us both about the shape of the hierarchy and about whether nodes risk becoming bottlenecks. Indeed, when the average degrees remain constant over all the masters, the hierarchy remains balanced; when the degrees differ and diverge from one master to another, the hierarchy is unbalanced. Moreover, if the degree of a master is high, that master risks becoming a bottleneck. To obtain a deeper hierarchy, the initial size of a group is fixed to 10 and the number of processes launched on the same processor is doubled to reach a larger scale in terms of processes. For each strategy, the algorithm is executed 10 times, each run lasting 60 minutes.

Fig. 9 shows the evolution of the degrees of the four most loaded masters using ME. We note that the masters are fairly loaded: when a master is designated, it receives the highest identifier and is therefore not designated again in the next election session. Nevertheless, the degrees double and reach their maximum value after 25% of the processes have been killed. This is due to the fact that when an inner master fails, the newly designated master manages the children of the failed master in addition to its own children. To evaluate BH and HS, we compute the average degree of the masters on each level instead of the most loaded masters. The degrees are computed as follows:


Rescheduled work using the 3-Phase mechanism (×10^9)

Size of Instance   Instance name   16%       33%        50%
20 x 20            Ta021-Ta030     16,87     33,62      37,29
50 x 10            Ta031-Ta040     41,78     111,03     130,75
50 x 20            Ta041-Ta050     23,77     40,43      435,30
100 x 10           Ta051-Ta060     2,65      22,11      175,75
100 x 20           Ta061-Ta070     5,47      6,99       21,87
Average                            18,10     42,83      160,19

Rescheduled work without the 3-Phase mechanism (×10^9); the ratio to the 3-Phase approach is given in parentheses

Size of Instance   Instance name   16%                33%                 50%
20 x 20            Ta021-Ta030     18,44 (×01,09)     52,36 (×01,56)      322,48 (×08,65)
50 x 10            Ta031-Ta040     88,97 (×02,13)     271,17 (×02,44)     297,79 (×02,28)
50 x 20            Ta041-Ta050     74,02 (×03,11)     1259,78 (×31,16)    1842,06 (×04,23)
100 x 10           Ta051-Ta060     154,36 (×58,01)    1784,79 (×80,73)    1874,68 (×10,67)
100 x 20           Ta061-Ta070     179,28 (×32,70)    340,88 (×48,51)     2416,22 (×110,30)
Average                            103,01 (×05,68)    741,79 (×17,31)     1350,64 (×08,43)

Redundant work (×10^9)

Size of Instance   Instance name   16%       33%        50%
20 x 20            Ta021-Ta030     1,57      18,74      285,19
50 x 10            Ta031-Ta040     47,19     160,14     167,05
50 x 20            Ta041-Ta050     50,25     1219,35    1406,76
100 x 10           Ta051-Ta060     151,71    1762,68    1699,01
100 x 20           Ta061-Ta070     173,82    333,89     2394,36
Average                            84,90     698,90     1190,47

TABLE 4 Measurement of the impact of the Task Recovery mechanism on the minimization of the work redundancy.
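To make the relation between the three sub-tables explicit (our own restatement of how the last sub-table is obtained), the redundant work is the difference between the work rescheduled without the 3-Phase mechanism and the work rescheduled with it, and the ratio in parentheses is their quotient. For instance, for Ta021-Ta030 with 50% of failed processes:

redundant work = 322,48 - 37,29 = 285,19 (×10^9 sub-problems),   ratio = 322,48 / 37,29 ≈ ×8,65.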

1st level

Size of Instance   Instance name   # Dead masters   Isolated processes   Lost work (×10^9)
20 x 20            Ta021-Ta030     3,60             1100,00              0,75
50 x 10            Ta031-Ta040     3,20             1237,50              4,35
50 x 20            Ta041-Ta050     3,00             1320,00              21,54
100 x 10           Ta051-Ta060     3,00             1320,00              305,17
100 x 20           Ta061-Ta070     3,40             1164,71              514,28
Average                            3,24             1228,44              169,22

2nd level

Size of Instance   Instance name   # Dead masters   Isolated processes   Lost work (×10^9)
20 x 20            Ta021-Ta030     31,00            796,92               0,54
50 x 10            Ta031-Ta040     34,40            884,32               3,11
50 x 20            Ta041-Ta050     31,00            796,92               13,00
100 x 10           Ta051-Ta060     32,40            832,90               192,56
100 x 20           Ta061-Ta070     35,40            910,03               401,83
Average                            32,84            844,22               122,21

3rd level

Size of Instance   Instance name   # Dead masters   Isolated processes   Lost work (×10^9)
20 x 20            Ta021-Ta030     315,60           811,31               0,55
50 x 10            Ta031-Ta040     330,60           849,87               2,99
50 x 20            Ta041-Ta050     330,00           848,33               13,84
100 x 10           Ta051-Ta060     325,20           835,99               193,27
100 x 20           Ta061-Ta070     334,80           860,67               380,03
Average                            327,24           841,23               118,14

TABLE 5 Measurement of the isolated processes and the induced loss of work for an approach that uses neither hierarchy maintenance nor any recovery mechanism.

\[
D_t^l \;=\; \frac{\sum_{i=1}^{|\Gamma_t^l|} d_i^t}{|\Gamma_t^l|}
\]

where D_t^l is the average degree of the masters located at level l of the hierarchy at instant t, Γ_t^l is the set of masters located at level l at instant t, and d_i^t is the degree of master m_i at instant t.

Fig. 10 shows the computed average degrees of the masters grouped by level. First, we note that the masters withstand failures well and do not become bottlenecks during the experiments. Second, the root master keeps the same degree throughout the lifetime of the algorithm. Finally, the average degree decreases gently for the first three levels and more rapidly for the fourth one. This can be explained as follows: at the beginning, all sub-B&Bs are full because of the balanced construction of the hierarchy. After the first failures, no master has room to acquire orphan processes directly; a master receiving orphans therefore dispatches them uniformly among its children, and so on until the leaves of the hierarchy (the workers) are reached. The adaptability of FTH-B&B allows a worker receiving an orphan worker to change its behavior and become a master. Such a newly converted master has only one child (the migrated worker), so its degree is equal to 1, which lowers the overall degree of the masters at the lower levels of the hierarchy. After many failures, these new masters acquire more and more migrated orphan processes; the size of the sub-B&B groups therefore converges towards the maximum group size, which stabilizes the global degree.
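For illustration, a small Java sketch of this computation is given below; the Master record and its accessors are hypothetical and only serve to show how the per-level averages plotted in Fig. 10 can be obtained from a snapshot of the hierarchy.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of the metric D_t^l; the Master type is hypothetical.
record Master(int level, int degree) {}   // degree = current number of children

class DegreeMonitor {
    /** Average degree of the masters of each hierarchy level at a given instant t. */
    static Map<Integer, Double> averageDegreeByLevel(List<Master> mastersAliveAtT) {
        return mastersAliveAtT.stream()
                .collect(Collectors.groupingBy(Master::level,            // Γ_t^l
                         Collectors.averagingInt(Master::degree)));      // (Σ d_i^t) / |Γ_t^l|
    }

    public static void main(String[] args) {
        List<Master> snapshot = List.of(
                new Master(0, 9), new Master(1, 10), new Master(1, 8), new Master(2, 7));
        System.out.println(averageDegreeByLevel(snapshot)); // {0=9.0, 1=9.0, 2=7.0}
    }
}
```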


Fig. 10. Average degree by level over time using BH and HS. The masters withstand failures well and do not become bottlenecks.

6 CONCLUSIONS

Solving combinatorial optimization problems exactly using parallel B&B algorithms requires a huge amount of computing resources, which can be obtained by executing them at large scale on computational grids. Scalability can be ensured by a hierarchical Master/Worker B&B, which overcomes the limits of the traditional Master/Worker paradigm. However, the resources offered by such grids are volatile and highly unreliable. Fault tolerance must therefore be taken into account to avoid any loss of work during the execution, since such a loss is intolerable for the exact resolution of COPs. In this paper, we have proposed FTH-B&B (Fault-Tolerant Hierarchical B&B) to deal with the fault-tolerance issue of grids at the application level. FTH-B&B is composed of several fault-tolerant Master/Worker-based sub-B&Bs launched in parallel and organized into a hierarchy.


Each sub-B&B is composed of a unique master managing several workers and performing locally a set of fault-tolerance mechanisms. A 3-phase recovery mechanism between a master, its children, and its grandchildren is proposed to distribute, store, and recover work units in case of failures. This protocol reduces the execution time because the master does not reschedule the sub-problems entirely but only their unexplored parts. In addition, three strategies are proposed to maintain the hierarchy. Finally, a distributed checkpointing mechanism is proposed in which each master performs its checkpointing independently of the others, together with three operators, Gather, Reduce, and Deduce, allowing the reconstitution of the unexplored sub-problems.

We have implemented our approach on top of ProActive and experimented it on the Grid'5000 real nation-wide grid environment. To make our experiments in an unreliable environment more realistic, failures are generated according to an exponential lifetime distribution of the processes. The different experiments demonstrate the ability of FTH-B&B to ensure fault tolerance while maintaining high execution efficiency: the approach achieves an efficiency of 98,03%. Moreover, the 3-phase recovery mechanism minimizes the redundant work, and the experiments show that the hierarchy can be kept safe and balanced thanks to the balanced hierarchy (BH) and the Hybrid Strategy (HS).

As perspectives, we intend to develop an enhanced version that integrates a best-effort mechanism allowing a much larger number of resources of the computational grid to be exploited. We also intend to run our application on real production grids such as EGI/EGEE [14][13], which exhibit a high rate of node failures. Finally, the approach will be investigated on other combinatorial optimization problems, mainly those handling more data such as QAP and Q3AP.
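As an illustration of this failure-injection methodology (a sketch under our own assumptions; the actual generator and its mean-lifetime parameter are not detailed here), process lifetimes following an exponential distribution can be drawn by inverse-transform sampling:

```java
import java.util.Random;

// Illustrative sketch: sampling process lifetimes from an exponential distribution.
// The mean lifetime used below is an arbitrary, assumed value.
class FailureInjector {
    private final Random rng = new Random();
    private final double meanLifetimeSeconds;

    FailureInjector(double meanLifetimeSeconds) {
        this.meanLifetimeSeconds = meanLifetimeSeconds;
    }

    /** Draw a lifetime T ~ Exp(1/mean) by inverse-transform sampling. */
    double sampleLifetime() {
        double u = rng.nextDouble();                       // uniform in [0, 1)
        return -meanLifetimeSeconds * Math.log(1.0 - u);   // inverse CDF of the exponential law
    }

    public static void main(String[] args) {
        FailureInjector inj = new FailureInjector(600.0);  // e.g. mean lifetime of 10 minutes (assumed)
        for (int i = 0; i < 3; i++) {
            System.out.printf("kill process after %.1f s%n", inj.sampleLifetime());
        }
    }
}
```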

REFERENCES

[1] K. Aida, Y. Futakata, and T. Osumi. Parallel Branch and Bound Algorithm with the Hierarchical Master-Worker Paradigm on the Grid. IPSJ Trans. on High Performance Computing Systems, 47(12):193-206, 2006.
[2] K. Aida, W. Natsume, and Y. Futakata. Distributed Computing with Hierarchical Master-Worker Paradigm for Parallel Branch and Bound Algorithm. In IEEE International Symposium on Cluster Computing and the Grid, page 156, 2003.
[3] L. Baduel, F. Baude, D. Caromel, A. Contes, F. Huet, M. Morel, and R. Quilici. Programming, Composing, Deploying for the Grid. In Grid Computing: Software Environments and Tools. Springer Verlag, 2006.
[4] A. Bendjoudi, N. Melab, and E-G. Talbi. P2P design and implementation of a parallel branch and bound algorithm for grids. International Journal of Grid and Utility Computing, 1(2):159-168, 2009.
[5] A. Bendjoudi, N. Melab, and E-G. Talbi. Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm. In IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), pages 1806-1814, 2011.
[6] A. Bendjoudi, N. Melab, and E-G. Talbi. An adaptive hierarchical master-worker (AHMW) framework for grids - Application to B&B algorithms. Journal of Parallel and Distributed Computing, 72(2):120-131, February 2012.
[7] Condor, http://www.cs.wisc.edu/condor/.
[8] M. Djamai, B. Derbel, and N. Melab. Distributed B&B: A Pure Peer-to-Peer Approach. In Proc. of the 25th IEEE LSPP/IPDPS, Anchorage, Alaska, USA, May 16th-20th, 2011.
[9] L.M.A. Drummond, E. Uchoa, A.D. Gonçalves, J.M.N. Silva, M.C.P. Santos, and M.C.S. de Castro. A grid-enabled distributed branch-and-bound algorithm with application on the Steiner problem in graphs. Parallel Computing, 32:629-642, October 2006.
[10] D. Caromel, C. Delbé, A. di Costanzo, and M. Leyton. ProActive: an integrated platform for programming and running applications on Grids and P2P systems. Computational Methods in Science and Technology, 12(1):69-77, 2006.
[11] Y.S. Dai, G. Levitin, and K.S. Trivedi. Performance and reliability of tree-structured grid services considering data dependence and failure correlation. IEEE Transactions on Computers, 56:925-936, 2007.
[12] Z. Dai, F. Viale, X. Chi, D. Caromel, and Z. Lu. A Task-Based Fault-Tolerance Mechanism to Hierarchical Master/Worker with Divisible Tasks. In 10th IEEE International Conference on High Performance Computing and Communications, pages 672-677, 2009.
[13] EGEE, http://www.eu-egee.org.
[14] EGI, http://www.egi.eu.
[15] M. Elnozahy, L. Alvisi, Y. Wang, and D.B. Johnson. A survey of rollback-recovery protocols in message-passing systems. Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1996.
[16] R. Finkel and U. Manber. DIB - a distributed implementation of backtracking. ACM Trans. Program. Lang. Syst., 9(2):235-256, 1987.
[17] H. Garcia-Molina. Elections in a Distributed Computing System. IEEE Transactions on Computers, C-31(1):48-59, 1982.
[18] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, 1979.
[19] B. Gendron and T.G. Crainic. Parallel Branch-and-Bound Algorithms: Survey and Synthesis. Operations Research, 42(6):1042-1066, 1994.
[20] M. Ghasemi-Gol, M. Sabzekar, H. Deldari, and A-H. Bahmani. A Linda-based Hierarchical Master-Worker Model. International Journal of Computer Theory and Engineering, 1(5):1793-8201, December 2009.
[21] J. Goux, S. Kulkarni, J. Linderoth, and M. Yoder. An enabling framework for master-worker applications on the computational grid. In 9th IEEE Symposium on High Performance Distributed Computing (HPDC9), page 43, August 2000.
[22] J-P. Goux, J. Linderoth, and M. Yoder. Metacomputing and the Master-Worker Paradigm. Preprint MCS/ANL-P792-0200, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, 2000.
[23] Grid'5000, https://www.grid5000.fr.
[24] A. Iamnitchi. A problem-specific fault-tolerance mechanism for asynchronous, distributed systems. In Proceedings of the International Conference on Parallel Processing, pages 4-14, 2000.
[25] J.K. Lenstra, B.J. Lageweg, and A.H.G.R. Kan. A general bounding scheme for the permutation flow-shop problem. Operations Research, 26(1):53-67, 1978.
[26] M. Mezmaz, N. Melab, and E-G. Talbi. A Grid-based Parallel Approach of the Multi-Objective Branch and Bound. In 15th Euromicro Conference on Parallel, Distributed and Network-based Processing, Naples, Italy, February 7-9, 2007. IEEE Computer Society Press.
[27] M. Mezmaz, N. Melab, and E-G. Talbi. A Grid-enabled Branch and Bound Algorithm for Solving Challenging Combinatorial Optimization Problems. In Proc. of the 21st IEEE Intl. Parallel and Distributed Processing Symposium, Long Beach, California, March 26th-30th, 2007.
[28] OAR, http://oar.imag.fr/.
[29] ProActive, http://proactive.objectweb.org.
[30] J. Pruyne and M. Livny. Interfacing Condor and PVM to harness the cycles of workstation clusters. Journal on Future Generations of Computer Systems, (12):56-61, 1996.
[31] Renater, http://www.renater.fr/.
[32] E. Taillard. Benchmarks for basic scheduling problems. European Journal of Operational Research, 64:278-285, 1993.
[33] M. Xie, Y.S. Dai, and K.L. Poh. Computing Systems Reliability: Models and Analysis. Kluwer Academic Publishers, New York, NY, USA, 2004.
[34] XtremWeb, http://www.xtremweb.net.
