Efficient Solutions for Mapping Parallel Programs

P. Bouvry, J. Chassin and D. Trystram
LMC-IMAG, 46 avenue Félix Viallet, 38031 Grenoble cedex, France


Abstract

Keywords: Parallel environment, Distributed-memory machines, Load-balancing, Mapping.

CWI - Center for Mathematics and Computer Science, Kruislaan 413, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands


1 Introduction

Efficient use of distributed-memory parallel systems requires specific tools, among which mapping tools are some of the most important. The aim of mapping tools is to minimize the execution time of parallel programs on distributed-memory machines by controlling the use of computation and communication resources. A parallel program can be represented as a graph of interconnected tasks which can be executed in parallel. This article describes a toolbox providing an assignment of the tasks to the available processors so as to obtain the shortest possible execution time for the entire program. Mapping algorithms therefore aim at maximizing the (useful) occupation of the processors without increasing communication costs too much.

The task execution times and inter-task communication costs of some regular programs can be entirely determined at compile time. In this case, it is possible to perform static task allocation in advance. This is known as the mapping operation, whose complexity is exponential in the general case: the problem is NP-complete even in its most elementary form, which consists simply of allocating independent tasks to a set of processors, as in bin-packing [GJ79]. Because of this complexity, it is difficult to compute an optimal mapping, and numerous heuristic solutions have been proposed, representing different trade-offs between computation cost and quality of the mapping (for more information, see the survey of Norman and Thanish [NT93]).

This paper presents a mapping toolbox implementing several "classical" mapping algorithms. This toolbox was used to assess different cost functions by computing the relation between the values of these functions, optimized by the mapping algorithms, and the actual execution times of parallel programs. The implemented mapping algorithms were also evaluated by comparing the execution times of a set of representative parallel programs mapped using the algorithms of the toolbox. The results of a large number of experiments were also used to assess the cost functions, by comparing their values to the actual execution times of the benchmarks. The paper also describes a new mapping method using monitoring tools to refine the predictions of task execution times and inter-task communication costs.

This work is part of the APACHE research project, whose aim is the design and development of a general programming environment to balance automatically the load of parallel applications, resulting in reduced development time and increased portability of parallel applications [Pla94]. The load-balancing environment is designed to support the testing of various techniques and strategies, both static and dynamic, taking into account information pertaining to the execution behavior of these programs. Generic resource management tools for scheduling, communications and memory management will be studied under both their mathematical aspects (searching for optimal solutions in specific contexts) and their heuristic aspects.

The organization of this article is the following. This introduction is followed by a presentation of the mapping problem (section 2) and a classification of classical mapping algorithms (section 3). Section 4 describes the mapping toolbox ALTO, and section 5 reports validation experiments. Section 6 describes a new mapping method based on monitoring, and section 7 concludes.


2 The mapping problem

2.1 Models of machines and programs

A distributed-memory parallel computer is composed of a set of nodes connected via an interconnection network. Each node has some computation facilities (for instance a scalar processor coupled with some vector processing units) and its own memory. Most commercially available parallel machines, such as the CM5, the Paragon, the SP1, the CS2, etc., include hardware routing mechanisms [RO93]. For a given number of processors, the length of any path between processors is the same in multi-stage communication networks. In other interconnection networks, efficient routing mechanisms (wormhole routing for instance) decrease the importance of path lengths between processors. However, a communication between two processors is much more time consuming than a communication between two tasks allocated to the same processor, which amounts to a local memory access. An external communication is more expensive because the message must be sent over slow (in latency and throughput) communication links.

The easiest way to use these machines is to place the same executable code on each node and to split the data between the nodes. But not all problems can be implemented efficiently using this data-parallel model. The MIMD (Multiple Instructions Multiple Data) model maps different executable codes, called tasks, onto the processors. Designing a program such that exactly one task is allocated to each processor of the target machine would lead to architecture-dependent and non-scalable code. Conversely, too large a number of tasks is difficult to manage efficiently. The size of tasks that can be efficiently managed by a given parallel computer is usually called the exploitable granularity of this computer. One important parameter for the efficiency of a parallel program is therefore the ratio between computation and communication.

Most parallel programs can be described using a graph formalism (see figure 1). In most representations, each vertex represents a task and each edge a communication link. Following existing solutions, we consider a model where a task can be allocated to only one processor; any processor can perform communications and computations. We add to this basic model the computation costs of the tasks (execution times) and the amount of information communicated on the links. Sometimes the user cannot determine the exact values of these program parameters but can only approximate them. In the following, we will denote by:
- T, the set of tasks, and n their number;
- P, the set of processors, and m their number;
- ex(t), the computation time of task t;
- comm(t, t'), the total communication time between tasks t and t';
- alloc(t), the processor on which task t is allocated.

The parameters of this model are not sufficient to compute the total execution time, because the idle times of tasks (waiting for communications) and of processors are unknown. This model should be compared with the usual Directed Acyclic Graph (DAG) representation [CD73] [SS89], which consists of several tasks interconnected by precedence relations. In this model, each task receives (if needed) some data, performs some computations and then sends some results.


Figure 1: Instances of a processor network and of a task graph

Each vertex of the DAG is weighted by a computation cost and each arc by a communication cost. For DAGs, the best schedule, giving the smallest total execution time (also called makespan), can be determined if communications are neglected and the number of processors is unbounded. However, finding the best schedule for this model remains NP-hard in the general case [CD73]. This paper is based, like a large number of related works [NT93], on task graphs without precedence constraints. This model is closely related to the programming model of the transputer-based parallel system used for our experiments, where undirected task graphs are often explicitly described in a separate configuration file.
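For illustration only, the model above can be held in very simple data structures. The sketch below (ours, with hypothetical task names and values, not part of any toolbox described here) stores ex, comm and alloc as plain Python dictionaries; the same layout is assumed by the examples in the following sections.

```python
# Illustrative layout for an undirected task graph (hypothetical values).
# ex[t]       : computation time of task t
# comm[(t,u)] : communication volume between tasks t and u (one entry per edge)
# alloc[t]    : processor on which task t is allocated

ex = {"t0": 40.0, "t1": 25.0, "t2": 60.0, "t3": 15.0}
comm = {("t0", "t1"): 8.0, ("t1", "t2"): 12.0, ("t2", "t3"): 5.0}
alloc = {"t0": 0, "t1": 0, "t2": 1, "t3": 1}

def comm_between(t, u, comm=comm):
    """Symmetric lookup: each edge is stored once, in either orientation."""
    return comm.get((t, u), comm.get((u, t), 0.0))
```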

2.2 Description of the Problem

The parallelization process first requires distributing data among the different processors. The main goal of a good mapping is to minimize the execution time of the whole program. Other objectives can also be pursued; for instance, on a multi-user processor network, it is interesting to minimize the average use of the different resources. Formally, a mapping is a function alloc from T to P which associates with each task t a unique processor q = alloc(t). It is well known that the number of possible mappings is $m^n$. Even if the actual number of different mappings is lower, due to symmetry considerations, it remains far too large for all mappings of "real" parallel programs to be enumerated. The mapping operation is part of programming environments (see figure 2). Environments such as Hypertool, Taskgrapher, Pyrros, Linda, Hence, Tosca, etc. [Gui95] include some mapping facilities.

2.3 Quality of the solution

Most solutions to the mapping problem are based on the optimization of cost functions. Let us denote this function by z. Many choices for z exist in the literature (qualitative or quantitative criteria, structural criteria, etc.) [Bok81]. Norman and Thanish propose a classification of the parameters which influence the cost of a mapping [NT93]. Under the assumptions of our machine model, two opposite criteria have to be taken into account: the minimization of inter-processor communications and the load balancing of computations among processors.


Figure 2: Place of the mapping operation in parallel programming development (from source code, with or without parallel directives, through detection/extraction of parallelism or a parallel compiler, then clustering, scheduling/mapping, code generation, and execution/monitoring).

Ideally, the user of a parallel machine would simply use a parallel compiler, which distributes the data among the processors and organizes automatically (implicitly) the communications induced by local computations. However, this approach is not usable in practice for large application problems. The alternative today is the following: from the source code, including parallelization directives, a task graph is generated by determining computation and communication costs and analyzing data dependences. This is usually followed by a clustering operation parameterized by the granularity of the target machine. Then scheduling and/or mapping is performed, depending on the task graph model: task graphs with precedence constraints require a scheduling phase to assign an execution date to each task.


The cost functions given below correspond respectively to these two criteria.

Minimum communications factor:
$$z = \sum_{t \in T} \;\sum_{t' \mid alloc(t') \neq alloc(t)} comm(t, t')$$

Load imbalance factor:
$$z = \sum_{p \in P} \left| load(p) - \frac{\sum_{p' \in P} load(p')}{|P|} \right|$$

where $load(p) = \sum_{t \mid alloc(t) = p} ex(t)$ gives the load of processor p in terms of computation cost.

The criterion we have chosen to consider is a trade-off between these two criteria. It corresponds to minimizing the load of the most loaded processor:
$$z = \max_{p \in P} \;\sum_{t \mid alloc(t) = p} \Bigl( ex(t) + \sum_{t' \mid alloc(t') \neq p} comm(t, t') \Bigr)$$

This function does not take into account the fact that communications can be overlapped by computations. If, for a given parallel machine and optimized parallel programs, most communications can be overlapped by computations, this cost function has to be adapted by taking the maximum of computation and communication times instead of their sum. The communication part of the cost function can be computed by:
- an average communication cost, on a fully connected communication network, estimated by an average time of communication between two connected tasks;
- the distance between communicating processors, on a point-to-point network, where some routing mechanism is needed. If fixed routing tables exist (no random routing), a linear model of communications $\beta + \delta \tau L$ can be considered, where $\beta$ is the startup cost, $\delta$ represents the distance (number of hops between the two processors), $\tau$ is the time to communicate one byte between two adjacent nodes (the transmission rate factor) and $L$ is the length in bytes of the message. This can still be improved by taking into account phenomena like network congestion, perturbations on the nodes belonging to the communication paths, load of the network, etc.

3 Classification of classical mapping algorithms

Many solutions can be found in the literature for solving the mapping problem. We can roughly distinguish two classes of methods, namely exact algorithms and heuristics [CK88]. Exact algorithms can only be used when the space of configurations is small enough, for instance when only a few tasks have to be allocated to a machine with a small number of processors. Exact algorithms give optimal solutions, but in practical cases they cannot be used because of the combinatorial explosion of the number of solutions. The goal of heuristic algorithms is to give good solutions in reasonable time. Two sub-classes of heuristic algorithms have mainly been explored: greedy algorithms,


which construct the solution progressively, and iterative algorithms, whose principle is to improve an existing solution. Obviously, the cost of the mapping algorithm itself must not be too large; it must be related to the use made of the solution: the more a given mapping is used, the more time may be spent computing it. We now review the most classical mapping strategies.

3.1 Exact algorithms

Exact algorithms used for solving the mapping problem have an exponential complexity. Several strategies can be developed for obtaining exact algorithms. Many of them come from artificial intelligence or operations research, like the exploration of a search tree: branch-and-bound algorithms, breadth-first search, best-first search, A*, etc. [Sin87] [ST85]. When designing exact algorithms, the main difficulty is to reduce the space of solutions by trying to eliminate obviously bad solutions. We briefly describe a method based on this idea. The idea behind the A* algorithm is to explore all the possible solutions organized as a tree (the search tree), but not in an exhaustive way. It consists of a best-first search guided by an appropriate heuristic whose goal is to limit the number of configurations to explore (recall that the tree contains $m^n$ leaves). Each node N of the search tree represents a partial configuration. The algorithm computes the cost of each partial mapping, g(N), to which a sub-estimation h(N) of the cost of all realizable mappings based on this partial solution is added. Then the node N with the lowest estimated cost f(N) = g(N) + h(N) is expanded. The sub-estimation of the cost is based on a heuristic which must be as close as possible to the cost function and as cheap as possible in terms of computation time. This allows the search to quickly explore only the most promising branches of the tree. For instance, any sub-optimal algorithm for which a worst-case bound is known can be used as the heuristic: to find a sub-estimation of the optimal solution, we just have to divide the value of the cost function by this bound. Many other strategies can be used for finding exact solutions. For instance, classical graph theory methods like maximum flow or cuts [Lo88] [Sto77] can be used, or techniques from mathematical programming (combinatorial optimization, dynamic programming, etc.). Note that such methods can also be used to derive good heuristic functions for A*.
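As an illustration of this search scheme, here is our own sketch of an A* over partial mappings (the admissible bound used for h is our choice, not the heuristic of the toolbox; `tasks` is a list, and ex/comm follow the dictionary layout of section 2.1):

```python
import heapq
from itertools import count

def a_star_mapping(tasks, m, ex, comm):
    """Illustrative A*: g(N) is the cost of the partial mapping, and
    f(N) = g(N) + h(N) never over-estimates the final cost (admissible h)."""
    total_ex = sum(ex.values())
    tie = count()                        # tie-breaker so the heap never compares dicts

    def g(partial):
        loads = [0.0] * m
        for t, p in partial.items():
            loads[p] += ex[t]
        for (t, u), v in comm.items():
            if t in partial and u in partial and partial[t] != partial[u]:
                loads[partial[t]] += v   # both endpoints pay the external transfer
                loads[partial[u]] += v
        return max(loads)

    def f(partial):
        # The final cost is at least g, and at least the perfectly balanced
        # computation load total_ex / m (communications ignored): admissible.
        return max(g(partial), total_ex / m)

    heap = [(f({}), next(tie), {})]
    while heap:
        _, _, partial = heapq.heappop(heap)
        if len(partial) == len(tasks):
            return partial               # first complete node popped is optimal
        t = tasks[len(partial)]          # fixed task order avoids duplicate states
        for p in range(m):
            child = {**partial, t: p}
            heapq.heappush(heap, (f(child), next(tie), child))
```

Even with a good heuristic, the tree still contains $m^n$ leaves, so such a search is only practical for small instances, as noted above.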

3.2 Greedy algorithms

In a greedy algorithm, the mapping is done without backtracking (a choice already made is never reconsidered). The allocation of the i-th task is based on a criterion depending on the mapping of the first (i-1) tasks. Two kinds of greedy algorithms can be envisaged for the mapping problem: either they are based on empirical methods, or they come from the relaxation of classical graph theory algorithms which are optimal in some restricted cases. Some examples of such criteria are given below:


- The modulo algorithm is one of the simplest possible methods. It is a greedy algorithm allocating the i-th task to processor i modulo m. Theoretically, with a large number of tasks, this algorithm behaves like a random mapping algorithm. The only modification made to this basic scheme is to skip processors that do not have enough memory left to hold a task.
- LPTF (Largest Processing Time First) is a heuristic whose criterion is restricted to load balancing. It is well known that its worst-case performance is within a factor 4/3 of the optimal when considering independent tasks [Lee91].

Greedy algorithms are easy to implement and have polynomial complexity. List algorithms, like LPTF, are the most widely used greedy algorithms. In these algorithms, the tasks are first sorted according to a given criterion (decreasing computation cost in the case of LPTF) and then mapped in this order onto the processors (when a processor is free, the first ready task in the list is chosen).
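A minimal sketch of LPTF under the independent-task assumption (dictionary layout as before; the memory constraints of the toolbox version are omitted here):

```python
def lptf(tasks, m, ex):
    """Largest Processing Time First: sort by decreasing computation cost,
    then always place the next task on the least loaded processor."""
    loads = [0.0] * m
    alloc = {}
    for t in sorted(tasks, key=lambda t: ex[t], reverse=True):
        p = min(range(m), key=lambda q: loads[q])   # least loaded processor
        alloc[t] = p
        loads[p] += ex[t]
    return alloc, max(loads)
```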

3.3 Iterative algorithms

All iterative algorithms start from an initial solution and try to improve it. Usually, the initial solution is found using a greedy algorithm. Most iterative algorithms exchange tasks between processors to improve a solution locally, and most of them use random perturbations to leave local minima and obtain better solutions. A well-known iterative algorithm was proposed by Bokhari [Bok81]. Its cost function (called cardinality) takes into account the number of tasks correctly mapped on the processor network, and pairwise exchanges of tasks are used to improve it. The basic hypothesis is that the number of tasks equals the number of processors; grouping methods must be used to handle a greater number of tasks.

3.3.1 Hill climbing

The basic iterative algorithm, called hill climbing, consists of improving the current solution by moving to its best "neighbor" (the neighborhood being defined, for instance, by exchanging a pair of tasks). The quality of a solution is defined by a cost function z. This strategy leads directly to a local optimum.
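A sketch of this scheme with a pair-exchange neighborhood might look as follows (the `cost` argument is any complete-mapping evaluation, e.g. the z of section 2.3):

```python
def hill_climb(alloc, tasks, cost):
    """At each step, move to the best neighbor obtained by swapping the
    processors of two tasks; stop when no neighbor improves the cost."""
    best = cost(alloc)
    while True:
        best_swap, best_z = None, best
        for i, t in enumerate(tasks):
            for u in tasks[i + 1:]:
                if alloc[t] == alloc[u]:
                    continue
                alloc[t], alloc[u] = alloc[u], alloc[t]      # trial swap
                z = cost(alloc)
                alloc[t], alloc[u] = alloc[u], alloc[t]      # undo
                if z < best_z:
                    best_swap, best_z = (t, u), z
        if best_swap is None:
            return alloc, best                               # local optimum
        t, u = best_swap
        alloc[t], alloc[u] = alloc[u], alloc[t]              # commit best move
        best = best_z
```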

3.3.2 Simulated annealing

Simulated annealing is a well-known iterative method [BM88], based on an analogy with statistical physics. The annealing technique is used to obtain a metal with as regular a structure as possible: the metal is heated, then cooled slowly so that it stays in equilibrium. When the temperature is low enough, the metal reaches an equilibrium state corresponding to minimal energy. At high temperature there is a lot of thermal agitation, which can locally increase the energy of the system.


This phenomenon occurs with a given probability, decreasing with the temperature. Mathematically, it corresponds to giving the search a chance to leave a local minimum of the function to optimize. The corresponding algorithm improves a mapping by elementary operations involving task exchanges. The percentage of bad exchanges (leading to a worse solution) accepted is high at the beginning and decreases to zero during the execution of the algorithm. Theoretical studies have proved the convergence of the continuous version to the optimal solution, provided some properties are verified, such as a very slow decrease of the temperature that is not practical for real problems: the theoretical temperature at step k follows the curve $\frac{1}{\log(k)}$. It is well known that this algorithm is very hard to tune: finding the starting temperature or the temperature decrease steps is very difficult in practice and must be done after many experiments.
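The following sketch illustrates the scheme with a geometric cooling schedule; the parameter values and the random-move neighborhood are illustrative, not the tuned values discussed in section 4.2:

```python
import math
import random

def simulated_annealing(alloc, tasks, m, cost, k0, steps=10_000, plateau=100):
    """Sketch of simulated annealing for the mapping problem. k0 can be set
    to mean(|delta f|) / ln(1/0.8), so that a typical degrading move is
    initially accepted with probability 0.8 (see section 4.2)."""
    current = cost(alloc)
    best, best_alloc = current, dict(alloc)
    k = k0
    for step in range(steps):
        t = random.choice(tasks)
        old_p = alloc[t]
        alloc[t] = random.randrange(m)              # random elementary move
        z = cost(alloc)
        if z <= current or random.random() < math.exp(-(z - current) / k):
            current = z                             # accept, possibly uphill
            if z < best:
                best, best_alloc = z, dict(alloc)
        else:
            alloc[t] = old_p                        # reject and undo
        if (step + 1) % plateau == 0:
            k *= 0.98                               # geometric cooling
    return best_alloc, best
```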

3.3.3 Tabu search

Tabu search is an iterative meta-heuristic [GL92]. Contrary to simulated annealing, it is deterministic; as for simulated annealing, many parameters have to be tuned or defined [BBTW95]. Tabu search iteratively improves a cost function: it starts from a given solution and tries to improve it by local moves (e.g. pairwise exchanges). The solutions accessible by one local move are called the neighbors of the current solution. At a given step, the best unexplored neighbor is chosen. This implies that some information about the last explored moves must be recorded; this record is called the tabu list. Only the most recent moves are recorded, in order to limit memory and time costs; this recording also reduces the possibility of cycling. Some people compare this mechanism to human short-term memory. The tabu list length is fixed empirically for each implementation of a tabu search. Aspiration criteria can be used to override the tabu list (e.g. if the proposed move leads to the best value of the cost function found so far). If too much time is spent in given areas without significant improvement of the solution, diversification factors can be used to move to other areas; conversely, intensification factors can be used if some areas seem very promising.
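A compact sketch of the technique, with a simplified single-task-move neighborhood and a move-based tabu list (the implementation described in section 4.2 records computing costs instead):

```python
from collections import deque

def tabu_search(alloc, tasks, m, cost, list_len, max_steps=500):
    """Deterministic best-neighbor moves; the last `list_len` moves are
    forbidden unless they beat the best known solution (aspiration)."""
    best, best_alloc = cost(alloc), dict(alloc)
    tabu = deque(maxlen=list_len)                    # short-term memory
    for _ in range(max_steps):
        best_move, best_z = None, float("inf")
        for t in tasks:
            for p in range(m):
                if p == alloc[t]:
                    continue
                old_p = alloc[t]
                alloc[t] = p                         # trial move
                z = cost(alloc)
                alloc[t] = old_p                     # undo
                aspired = z < best                   # aspiration criterion
                if ((t, p) not in tabu or aspired) and z < best_z:
                    best_move, best_z = (t, p), z
        if best_move is None:
            break                                    # all neighbors are tabu
        t, p = best_move
        tabu.append((t, alloc[t]))                   # forbid moving t back
        alloc[t] = p
        if best_z < best:
            best, best_alloc = best_z, dict(alloc)
    return best_alloc, best
```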

3.4 Mixed strategies

Other solutions can be imagined for solving the mapping problem. We can define many new heuristics by mixing some of the preceding algorithms. For instance, we can compute an initial solution with simulated annealing and then try to improve it using a tabu search. Other mixed methods are based on exact algorithms, like branch-and-bound algorithms [KN84] in which the exploration of some branches of the search tree is arbitrarily stopped. As will be detailed further, some of the previous criteria can also be used simultaneously in greedy algorithms. In the next section, we detail the set of mapping heuristics implemented in the mapping toolbox of the APACHE environment.


4 Description of the Mapping Toolbox (ALTO)

4.1 Basic hypotheses

The mapping toolbox ALTO (for ALlocation TOolbox) was designed to help programmers map their programs easily and efficiently onto a parallel machine. The proposed solution is not too demanding for the programmer; the following assumptions recall the conditions of use. For parallel C programs, a description of the different tasks (in C files) and a configuration description (in a configuration file) must be supplied by the user. The configuration file describes the task interconnection graph, the processor network and the mapping of the different tasks onto the processors. In our experiments, the configuration file is extended to support cost values (computation and communication costs). Note the following considerations:
1. All processors have the same processing capabilities.
2. Processors may have different memory sizes.
3. Communication costs between tasks allocated on the same processor are considered negligible.

4.2 Implementation details of mapping heuristics

Modulo: the modulo greedy algorithm was implemented to serve as a reference. The only modification made to this basic algorithm is to skip processors that do not have enough memory left to hold a task.

Largest Processing Time First (LPTF): tasks are first sorted by decreasing computation cost, then allocated to the least loaded processor having enough memory.

Largest Global Cost First (LGCF): this greedy algorithm aims to balance the global load. Tasks are first sorted according to their global cost, then allocated to the least globally loaded processor (communication and computation costs taken into account) having enough memory.

Struc-quanti: this algorithm first uses a mixed criterion, i.e. qualitative (the number of links of each task) and quantitative (communication and computation costs), to sort the tasks. The tasks are then allocated to the least globally loaded processor (communication and computation costs taken into account) having enough memory.

In all the following iterative algorithms, two kinds of neighborhoods are used. The first consists of transferring a task from the most loaded processor to another one; the second consists of exchanging a task of the most loaded processor with another task communicating with it:


Figure 3: Typical execution of simulated annealing (cost function versus iteration step).

Hill climbing: the implemented hill climbing algorithm successively transforms each mapping into its best neighbor (restricted to neighbors improving the solution). The neighbors are defined by exchanging any task with one task of the most globally loaded processor. The cost function determines the best transformation of the current mapping. Given that the neighborhood is restricted, this leads to the nearest local optimum.

Simulated annealing: all the parameters were determined according to the literature [Ber89] [BM88] [HB88] and a lot of experiments. An estimate $\overline{\Delta f}$ of the average variation of the cost function between moves is first computed. Then, using a starting acceptance probability for bad solutions of $\tau = 0.8$, the starting heat is determined as $k_0 = \overline{\Delta f} / \log(1/\tau)$. In this implementation, the geometric descent $k_{n+1} = 0.98\,k_n$ is used as the practical heat schedule. A bad solution is accepted if $e^{-\Delta f / k}$ is greater than a random number uniformly chosen in $(0, 1)$ (see figure 3). Figures 3, 4 and 5 reflect the search behaviour of simulated annealing and tabu search on the same example: the mapping of a randomly generated task graph of 100 tasks onto 10 processors.
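For illustration, the starting-heat rule just described could be implemented as follows (a sketch, under the assumption that a random task move is a representative elementary perturbation):

```python
import math
import random

def starting_heat(alloc, tasks, m, cost, samples=100, tau=0.8):
    """Estimate the mean cost variation over random moves, then pick k0 so
    that a typical bad move is accepted with probability tau."""
    z0, deltas = cost(alloc), []
    for _ in range(samples):
        t = random.choice(tasks)
        old_p = alloc[t]
        alloc[t] = random.randrange(m)            # trial perturbation
        deltas.append(abs(cost(alloc) - z0))
        alloc[t] = old_p                          # undo the trial move
    mean_delta = sum(deltas) / len(deltas)
    return mean_delta / math.log(1.0 / tau)       # k0 = mean |delta f| / ln(1/tau)
```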


Figure 4: Comparison between two tabu list lengths (cost function versus iteration step, for tabu list lengths of 10 and 100).

Tabu search: the implemented tabu search [BBTW95] has the following characteristics:
- Each tabu list element records a description of a move. To restrict the possibilities of search, each tabu element records only the computing cost moved (to take into account the large number of equivalent tasks in a parallel program).
- The tabu list length is equal to the number of jobs. Figure 4 shows a tabu search cycling with a tabu list of length 10, and the improvement obtained with a tabu list of length 100.
- As aspiration criterion, an improvement of the best solution found so far (in terms of cost function) is considered.
- Three stopping criteria are used: when all neighbors are in the tabu list and the aspiration criterion cannot be applied; after a given number of steps without improvement of the cost function; and when a user-given time bound is reached (the algorithm then provides the best solution found so far). See figure 5 for a typical execution of a tabu search.

Figures 3 and 5 depict the behaviours of simulated annealing and tabu search. They indicate that simulated annealing becomes good only after a long latency, whereas tabu search is good at the beginning but converges slowly. Note that a basic iteration of tabu search costs more than one of simulated annealing: at each step, tabu search explores all the neighbors to find the best one, while simulated annealing only chooses one satisfying its acceptance condition among a few random ones. The user (or the system manager) of any iterative algorithm can fix a time limit for the exploration.


Figure 5: Typical execution of tabu search (cost function versus iteration step).

Figure 6: Set of mapping strategies implemented in ALTO

Exact algorithm: the quality of a mapping is estimated relative to our cost functions. The heuristic chosen as part of our A* algorithm [ST85] is LPTF.

Figure 6 summarizes the strategies implemented in ALTO.

4.3 Use of ALTO

Several situations can occur while designing parallel programs:
1. If few tasks have to be mapped, if the parallel program will run many times on the same network, and if good estimates of the different computation and communication costs are known, exact algorithms can be used.
2. A greedy algorithm can be used when little time can be spent mapping the tasks and the quality of the mapping (efficiency of the mapped program) is not crucial.
3. Another solution is to give the user the choice among several mapping algorithms and cost functions, with some guidance to help choose the ones most suited to the problem at hand.


5 Validation and experiments

Two kinds of validation were performed: first, we simulated parallel programs using a random task generator; then, the mapping algorithms of ALTO were tested on a set of benchmarks representative of well-known parallel numerical algorithms.

5.1 Task graphs generated randomly

To tune the iterative algorithms and to show the behaviour of the different mapping algorithms, many task graphs were randomly generated according to the following parameters:
- the number of tasks,
- the maximal computation cost of a task,
- the maximal communication cost,
- the maximal degree of a task (its number of neighbors).

The different task graphs were generated uniformly using these parameters; a generator along these lines is sketched below. Figure 7 presents the average improvement (in %) given by the different algorithms over the modulo algorithm. Each value is based on 100 random task graphs. The parameters used in the simulation are: 100 tasks, each task communicating with at most 4 other tasks, 16 processors, and a maximal computation cost of 1000 seconds; r denotes the communication/computation ratio. The cost function used takes the sum of communication and computation (all processors are supposed to be at the same distance). These tests mostly confirm the expectations:
- the most sophisticated mapping algorithms (simulated annealing and tabu search) are the most efficient ones;
- LPTF performs well only for parallel programs with low communication costs;
- the higher the communication/computation ratio, the more interesting the iterative algorithms become;
- obviously, the results of LGCF are better than those of struc-quanti, which takes into account a qualitative criterion while the quality of results is estimated in terms of a quantitative criterion (the cost function); any interest of struc-quanti would only become apparent when comparing execution times.

Table 1 presents the loss of optimality (in % versus an exact search) of the different mapping algorithms: 4 tasks, 2 processors, a maximum computation cost of 100, a maximum communication cost of 50 and a maximum out-degree of 3. These results are encouraging but not sufficient, since we cannot be sure that random task graphs are representative of existing parallel programs.
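A generator along these lines could look like the following sketch (the exact sampling distributions used in our experiments are not specified here, so uniform draws are assumed):

```python
import random

def random_task_graph(n, max_ex, max_comm, max_degree, seed=None):
    """Random task graph driven by the four parameters listed above."""
    rng = random.Random(seed)
    tasks = [f"t{i}" for i in range(n)]
    ex = {t: rng.uniform(1.0, max_ex) for t in tasks}
    degree = {t: 0 for t in tasks}
    comm = {}
    for i, t in enumerate(tasks):
        candidates = [u for u in tasks[i + 1:] if degree[u] < max_degree]
        rng.shuffle(candidates)
        for u in candidates:
            if degree[t] >= max_degree:
                break
            comm[(t, u)] = rng.uniform(1.0, max_comm)   # one entry per edge
            degree[t] += 1
            degree[u] += 1
    return tasks, ex, comm
```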


Figure 7: Percentage improvement of the mapping algorithms relative to modulo, as a function of the communication/computation ratio (strategies: tabu, sim-an, struc-quanti, lgcf, lptf).

Table 1: Relative loss with respect to the optimal solution (in %)

  algorithm   modulo    lptf    lgcf   struc-quanti   sim. ann.   tabu search
  r = 0.5     -46.26  -33.32  -19.36     -21.60         -8.14        -8.66


5.2 Experiments with real programs executed on a real machine

The aim of these experiments is to test the mapping strategies on a distributed-memory parallel system. Extensive tests were done using the ALPES performance evaluation environment of the APACHE project [BKPT94, KP92, Kit94]. ALPES provides a modeling language called ANDES, allowing the generation of synthetic programs from the description of an application as an oriented task graph. Synthetic programs consume resources like the actual programs they model, although they do not produce any results. Performance measurements are obtained by executing these synthetic programs on an actual parallel computer. Parameters of the synthetic programs (mainly computation/communication ratios) can be changed easily to emulate various parallel target architectures. The ALPES environment also includes monitoring tools. Synthetic programs as well as monitoring tools can be run on a 128-transputer MegaNode network [HJMWS86], using the VCR software router (Virtual Channel Router, developed by Southampton University). ALPES is currently being ported to the IBM SP2. A benchmark for mapping tools was designed using ANDES: it includes the description of classical parallel programs (Fast Fourier Transform, matrix-vector multiplication, Gaussian elimination, matrix product using the Strassen algorithm, divide and conquer, a PDE solver, etc.) [BKPT94]. ALTO was coupled with PYRROS (a complete scheduling platform designed at Rutgers by Gerasoulis and Yang [YG92]) so that the proposed mapping tools can be used on directed task graphs after a clustering performed by PYRROS. We also tested some hand-made clusterings against PYRROS; the differences are not significant.

5.3 Adequacy of the cost functions

Several mapping experiments were performed using various cost functions, in order to determine the best one, i.e. the one whose values are closest to the execution times of the benchmarks. The cost functions tested in our experiments were:
- a maximum between computation and communication costs, all processors being supposed at the same distance. The load of a processor p in this model is:
$$load(p) = \sum_{t \mid alloc(t) = p} \Bigl( ex(t) + \max_{t' \mid alloc(t') \neq p} comm(t, t') \Bigr)$$
- a sum of computation and communication costs, all processors again being considered at the same distance. The load of a processor p in this model is:
$$load(p) = \sum_{t \mid alloc(t) = p} \Bigl( ex(t) + \sum_{t' \mid alloc(t') \neq p} comm(t, t') \Bigr)$$
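These two load models could be coded as follows (a sketch using the dictionary layout of section 2.1; the max model reflects one plausible reading of the reconstructed formula above):

```python
def external_volumes(t, p, alloc, comm):
    """Communication volumes between task t (on processor p) and tasks elsewhere."""
    return [v for (a, b), v in comm.items()
            if (a == t and alloc[b] != p) or (b == t and alloc[a] != p)]

def load_sum(p, alloc, ex, comm):
    """Sum model: computation plus all external communications of p's tasks."""
    return sum(ex[t] + sum(external_volumes(t, p, alloc, comm))
               for t in ex if alloc[t] == p)

def load_max(p, alloc, ex, comm):
    """Max model: each task pays only its most expensive external transfer."""
    return sum(ex[t] + max(external_volumes(t, p, alloc, comm), default=0.0)
               for t in ex if alloc[t] == p)
```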


Figure 8: Regression between cost functions (Max, Sum, Rout, Tor, Ref) and execution times. These results were computed from the complete benchmark set.

Two refinements of this cost function were used to take distances between processors into account. Both assume the existence of fixed routing tables.
- The first refinement considers that a communication can be modeled as $\delta \tau L$, where $\delta$ is the length of the shortest path between the processors, $L$ is the number of bytes, and $\tau$ is the time taken to transfer one byte between two adjacent processors.
- The second refinement is based on precise measurements taken on a torus of transputers configured by the VCR router. These measurements give the cost of transferring data between two processors as a function of the number of bytes communicated (with packetisation allowed or not).

A correlation between cost function and execution time can be established (see figure 8). The transputer can in principle communicate over several links simultaneously and overlap communications with computations. However, because of the overhead induced by VCR, this is not the case in practice, so the first cost function (the maximum model) is not applicable. In addition, since VCR uses packetisation and pipelining mechanisms, the first refinement is not usable either. The best cost function is the one which considers the torus and the VCR routing tables (second refinement).
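For illustration, the first refinement amounts to multiplying a hop distance by the per-byte time. The sketch below computes hop distances by breadth-first search over an adjacency-list description of the processor network, standing in for the fixed routing tables assumed above:

```python
from collections import deque

def hop_distance(network, src, dst):
    """BFS hop count on the processor graph `network` (adjacency lists)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in network[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    raise ValueError("disconnected network")

def refined_comm_cost(delta, tau, nbytes):
    """First refinement: delta * tau * L (no startup term in this model)."""
    return delta * tau * nbytes
```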


Figure 9: Comparison between tabu search and LGCF: improvement (in %) of the cost function of tabu over LGCF as a function of the communication factor, for the pde, mat-vect, gauss and strassen graphs.

Figure 9 gives results on some well-known families of task graphs. It compares (in %) the tabu search and LGCF (the best greedy) algorithms for the cost function which takes the torus architecture into account.

5.3.1 Use of the algorithms

The greedy algorithms always delivered their results in less than one second on a Sparc-4 workstation, while the iterative algorithms could spend up to one hour. To distinguish between the algorithms available in the toolbox, it can be noticed that:
- LPTF behaves differently from the other algorithms and is better than modulo in most of the tests.
- Taking communication times into account is important, and a greedy algorithm like LGCF has shown a really interesting behavior.
- The iterative algorithms did not bring important improvements over LGCF. This may be due to several reasons:
  - The graphs of many benchmarks are very structured, with many symmetries and identical tasks, so a greedy algorithm such as LGCF can be sufficient to map these benchmarks efficiently.


Figure 10: Mapping of a program using monitoring tools (first mapping, then execution with monitoring, then trace analysis, then new mapping).

  - In our benchmarks, the communication/computation ratio was not chosen high.

Therefore, it must be stressed that:
- tests using random task graphs demonstrated that iterative algorithms perform better for programs with a high communication/computation ratio;
- when the communication/computation ratio is low, the situation is closer to the case without communications, where the 4/3 bound of LPTF is known, and this bound is already a good result.


6 Description of a new mapping method

In the general case, we propose a stepwise process: a first mapping is computed with a greedy algorithm; if some parameters are unknown, the greedy algorithm uses default values. Next, the parallel program is executed using monitoring tools. The monitoring results are then used to run an iterative mapping algorithm. If the mapping is not good enough, these steps can be repeated as long as needed, the statistical analysis trying each time to improve the mapping; usually, one step is sufficient. Monitoring can be used to determine the mapping parameters, computation and communication costs, and thus to improve the quality of the mapping (see figure 10). The monitoring provides the following information:

- the total execution time of the program;
- the number of bytes communicated between each pair of tasks;
- the computing time of each task;
- the idle time of each task (the time wasted waiting for communications);
- the total idle time of each processor;
- the number of bytes communicated between each pair of processors.
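The stepwise method can be summarized by the loop below; `graph` is assumed to carry the ex and comm dictionaries of section 2.1, and the three callables are placeholders for toolbox components, not actual ALTO entry points:

```python
def refine_mapping(graph, map_greedy, map_iterative, run_with_monitoring,
                   rounds=1):
    """Sketch of the stepwise mapping method using monitoring feedback."""
    alloc = map_greedy(graph)                   # first mapping, default costs
    for _ in range(rounds):                     # usually one round suffices
        measured = run_with_monitoring(graph, alloc)
        graph.ex.update(measured["ex"])         # replace estimated task costs
        graph.comm.update(measured["comm"])     # and communication volumes
        alloc = map_iterative(graph, alloc)     # remap with refined costs
    return alloc
```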


Figure 11: Improvement of a mapping using feedback: execution time of the matrix-vector product as a function of the factor x applied to the communication costs.

The analysis of these data helps to determine good average values for communication and computing costs, and to detect local problems in an initial solution, such as congestion or bad estimates. These data are the effective values of one execution instance of the program. They must be used to adjust the default values of the task graph and to weight the average values given by the user. In our benchmarks, the knowledge of communication and computation costs was very important. To simulate the behaviour of a user having only approximate knowledge of these values, and to exercise the feedback mechanism (the post-mortem trace analysis), we modified the information given to the tabu search for the graph corresponding to the matrix-vector product. In figure 11, the communication costs are multiplied by x (only for the mapping, not for the execution); the closer x is to 1, the better the solution (the lower the execution time).

7 Conclusion

An original solution was designed and implemented; this solution is interfaced with a working programming environment. Three kinds of mapping algorithms are used: greedy, iterative and exact ones. The most complex developments were done for a tabu search and a simulated annealing. These algorithms optimize various objective functions, from the simplest and most portable to the most complex and dedicated to a given architecture. The weights of the task graphs can be tuned using


a post-mortem analysis of traces. The possibility of using directed task graphs after a clustering phase has been highlighted. The use of tracing tools led to a validation of the cost functions and of the mapping algorithms. A benchmark protocol was defined and used. The tests were run on the MegaNode (a 128-transputer machine) using a software routing mechanism (VCR), synthetic task graphs generated with ANDES of the ALPES project (developed at IMAG by Kitajima and Plateau), and the Dominant Sequence Clustering of PYRROS (developed at Rutgers by Gerasoulis and Yang). Future work will address dynamic strategies and take scheduling into account.

References

[BBTW95] J. Błażewicz, P. Bouvry, D. Trystram, and R. Walkowiak. A tabu search algorithm for solving the mapping problem. Submitted to ECCO'95, 1995.

[Ber89] F. Berman. Experience with an automatic solution to the mapping problem. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, 1989.

[BKPT94] P. Bouvry, J.-P. Kitajima, B. Plateau, and D. Trystram. ANDES: a performance evaluation tool, application to the mapping problem. Submitted to a special issue of Performance Evaluation, 1994.

[BM88] S. W. Bollinger and S. F. Midkiff. Processor and link assignment in multicomputers using simulated annealing. In International Conference on Parallel Processing, 1988.

[Bok81] S. H. Bokhari. On the mapping problem. IEEE Transactions on Computers, 1981.

[CD73] E. G. Coffman and P. J. Denning. Operating Systems Theory. Series in Automatic Computation. Prentice-Hall, Englewood Cliffs, NJ, 1973.

[CK88] T. L. Casavant and J. G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 1988.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, 1979.

[GL92] F. Glover and M. Laguna. Tabu search. In Modern Heuristic Techniques for Combinatorial Problems. W. H. Freeman, New York, 1992.

[Gui95] F. Guinand. Overview of parallel programming environments: the scheduling point of view. Rapport APACHE 16, IMAG, Grenoble, 1995. Available by anonymous ftp at imag.fr in pub/APACHE/RAPPORTS.

[HB88] P. Haden and F. Berman. A comparative study of mapping algorithms for an automated parallel programming environment. Technical report, UC San Diego, 1988.

[HJMWS86] J. G. Harp, C. R. Jesshope, T. Muntean, and C. Whitby-Stevens. The development and application of a low cost high performance multiprocessor machine. In ESPRIT'86: Results and Achievements, Amsterdam, 1986. North-Holland.

[Kit94] J. Kitajima. Modèles quantitatifs d'algorithmes parallèles. PhD thesis, Institut National Polytechnique de Grenoble, Grenoble, France, November 1994. In French.

[KN84] H. Kasahara and S. Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Transactions on Computers, 1984.

[KP92] J.-P. Kitajima and B. Plateau. Building synthetic parallel programs: the project ALPES. In N. Topham, R. Ibbett, and T. Bemmerl, editors, Programming Environments for Parallel Computing, pages 161-170, Amsterdam, 1992. IFIP, North-Holland.

[Lee91] C.-L. Lee. Parallel machines scheduling with nonsimultaneous machine available time. Discrete Applied Mathematics, 1991.

[Lo88] V. M. Lo. Algorithms for static task assignment and symmetric contraction in distributed systems. In International Conference on Parallel Processing, 1988.

[NT93] M. Norman and P. Thanish. Models of machines and computation for mapping in multicomputers. ACM Computing Surveys, September 1993.

[Pla94] B. Plateau. Présentation d'APACHE. Rapport APACHE 1, IMAG, Grenoble, October 1994. Available by anonymous ftp at imag.fr in pub/APACHE/RAPPORTS.

[RO93] G. Ramanathan and J. Oren. Survey of commercial parallel machines. ACM SIGARCH Computer Architecture News, 21(3), June 1993.

[Sin87] J. B. Sinclair. Efficient computation of optimal assignments for distributed tasks. Journal of Parallel and Distributed Computing, 1987.

[SS89] N. W. Sauer and M. G. Stone. Preemptive scheduling of interval orders is polynomial. Order, 5:345-348, 1989.

[ST85] C.-C. Shen and W.-H. Tsai. A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion. IEEE Transactions on Computers, 1985.

[Sto77] H. S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Transactions on Software Engineering, 1977.

[YG92] T. Yang and A. Gerasoulis. PYRROS: static scheduling and code generation for message passing multiprocessors. In Proceedings of the 6th ACM International Conference on Supercomputing, pages 428-437. ACM, July 1992.
