Load Balancing by Redundant Decomposition and Mapping*

J.F. de Ronde, A. Schoneveld, P.M.A. Sloot, N. Floros and J. Reeve
University of Amsterdam, Department of Mathematics and Computer Science, Kruislaan 403, 1098 SJ Amsterdam, tel. +31 20 525 7463, fax +31 20 525 7490, http://www.fwi.uva.nl/research/pscs, email: {janr,arjen,peterslo}@fwi.uva.nl. University of Southampton, Department of Computer Science, email: {nf,jsr}@ecs.soton.ac.uk

Abstract. In this paper a methodology is presented that has been developed in the CAMAS project for the purpose of decomposition and mapping of parallel processes to processor topologies. The methodology has been implemented in terms of a toolset, thus allowing automatic decomposition and mapping of parallel processes. The parallel processes and processors are modelled according to a generally applicable formalism, based on the so-called virtual particle model. As a case study the presented methodology is applied to parallel finite element simulations.

Keywords: (redundant) domain decomposition, mapping, virtual particles, parallel process modelling
1 Introduction

Experience gained in the CAMAS project, see [11], indicates that in the community of parallel application developers, a strong need exists for various tools to support efficient code development. One can distinguish e.g. tools for code analysis, performance evaluation and load balancing of parallel applications. The CAMAS project has focused on the development of methodologies on which such tools can be based. The methods that have been developed are implemented in an integrated workbench. In this paper we will focus on a method that has resulted in two tools of this workbench that can be classified as load balancing tools. In close cooperation they allow for automatic decomposition and mapping of parallel processes.

The outline of this paper is as follows. Section 2 will briefly give an overview of the CAMAS tools that were developed and will introduce the approach of load balancing through redundant decomposition and mapping. Section 3 presents the computational methodology that is used to deal with the load balancing problem. Section 4 shows results for the methodology applied to parallel finite element simulation and section 5 discusses the obtained results and in which direction research will continue from here.
* Corresponding author: J.F. de Ronde
1 CAMAS: Computer Aided Migration of Applications System, ESPRIT III project, September 1992 / September 1995

2 Background

Figure 1 shows an overview of the CAMAS tools. The following tools can be identified:

1. code analysis tools
(a) Interprocedural Dependency Analyser (IDA)
(b) Fortran to Symbolic Application Description translator (F2SAD)
For extensive descriptions of both tools see respectively [6] and [12].
2. performance evaluation tools
(a) Parasol I (parallel machine modelling)
(b) Parasol II (parallel performance prediction)
See for example [12] and [2].
3. load balancing tools
(a) Domain Decomposition Tool (DDT)
(b) Mapping tool (MAP)
See respectively [3] and [1].
Fig. 1. CAMAS tools
The methodology behind the grey boxes and ovals in this toolset is the focal point of this paper.
2.1 Load Balancing by Redundant Decomposition and Mapping

The problem of finding an efficient mapping of a set of parallel tasks generally is referred to as the load balancing problem (LBP). We pose that a practical approach for the LBP is to split the problem into two distinct phases: domain decomposition followed by mapping. This is motivated by the following.

Many applications (like finite element) work on data domains of considerable size. In general the intrinsic parallelism (often denoted by "problem size" N) is much higher than the available number of processors (P) in a parallel system. Finding the optimal mapping is an NP-complete problem, which requires O(P^N) different mappings to be evaluated. It is obvious that the size of the solution space is too large to find the best mapping in a reasonable time.

It is therefore essential to reduce this solution space. This can be done by clustering the N parallel tasks into M clusters where M is of O(P) (decomposition). The decomposition determines the connectivity between the clusters as well as their relative computational weight. In this way the LBP is reduced to the mapping of M parallel tasks instead of the N atomic parallel tasks. The M clusters can then be grouped into (a maximum of) P "super-clusters" by means of some optimization method.

In summary we can motivate the two-phase approach as follows: decomposition is necessary to separate the domain of an application into an acceptable number of segments (parallel tasks). Mapping is necessary for optimization of the parallel execution time. It is important that a near optimal solution is still present in the reduced solution space. The chances of this are significantly increased by decomposing into M clusters where M > P (redundant decomposition).
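To make the size reduction concrete, the sketch below compares (on a logarithmic scale, since the numbers do not fit in a machine word) the full mapping space P^N with the reduced space P^M after a redundant decomposition. The specific values of N, M and the population sizes are illustrative choices of ours, not figures from the paper.

```python
import math

# Mapping N atomic tasks onto P processors admits P**N assignments;
# clustering first into M redundant clusters (M > P, but M of O(P))
# leaves only P**M candidate mappings to search.
N = 320     # atomic parallel tasks, e.g. the 320 torus elements of Fig. 2
P = 16      # processors
M = 4 * P   # redundant clusters

log10_full = N * math.log10(P)      # log10 of P**N
log10_reduced = M * math.log10(P)   # log10 of P**M

print(f"full space    ~ 10^{log10_full:.0f} mappings")
print(f"reduced space ~ 10^{log10_reduced:.0f} mappings")
```

Even the reduced space is far too large to enumerate, which is why a heuristic optimizer is still needed for the mapping phase.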
3 Computational Methodology
In order to allow for fast evaluation of a given decomposition and mapping, two models are needed: a parallel application model and a parallel machine model, see e.g. [8]. Both models must be of moderate complexity, allowing for quick evaluation of candidate mappings. However, the model still has to carry enough richness to allow for comparison of the model mapping with the real system. Both parameterizations are instantiations of a more generic formalism used in dynamic complex systems theory: the virtual particle model (vip), see [10]. The fact that a vip formalism can be used to describe the system is a great advantage, since the techniques described below are applicable to all systems that can be described in terms of vip's. Moreover, methods developed for vip's can be applied to the LBP as well.

3.1 Parallel Application Model
A good candidate for a generic formalism to describe the performance of static parallel processes is the static execution graph description. The vertices of the graph correspond to work load while the edges model communication load in the parallel process graph. For example, Fig. 2 shows the modelling procedure for a torus consisting of 32 x 10 = 320 quadrilateral elements. The torus is partitioned into 16 parts (only the top half is shown) and these 16 partitions are consecutively represented in a static execution graph. In this case, the work load weights of the vertices can be chosen identical. The edges in the graph have weight 2, 5 or 10.
Fig. 2. decomposition into 16 vip's of a 320 element torus
3.2 Parallel Machine Model

A parallel machine can be modelled analogously. Now a vertex corresponds to a processor and the attribute is processing power, while an edge corresponds to the physical network connection between processors and the attribute there is link speed or bandwidth. The processor graph should be fully connected. This means that every processor can communicate with every other processor in the topology, although they may not be linked to each other directly in the physical topology.

3.3 Fitness of a Mapping

Having modelled the mapping problem in terms of a compact parameter set we have to define a function that expresses the quality of a mapping. The function (1) e.g. has the property that it has a minimal value in case of optimal load balancing, see e.g. [5] and [4] for other examples of such cost functions.
C = Σ_{n=1}^{N_p} W(n) + (1/2) Σ_{n=1}^{N_p} Σ_{m=1}^{N_p} C(n,m)    (1)

where

1. N_p is the number of processors.
2. W(n) = Work(n) / P(n), with
   (a) P(n): computing speed for processor n in flops/s.
   (b) Work(n): amount of work on processor n in terms of flops.
3. C(n,m) = M(n,m) / P(n,m), with
   (a) P(n,m): bandwidth between processors n and m in bytes/s.
   (b) M(n,m): number of bytes to be sent from processor n to m.

Functions such as (1) are well suited for deciding on the fitness of mappings. The evaluation cost is relatively low (in terms of computing time) and they have proven to be realistic enough, see e.g. [5]. An advantage of using (1) is that it has the locality property. This means that incremental changes in a mapping can be propagated into the cost without having to recalculate the whole cost function. Only a delta has to be recalculated instead. This is specifically useful if an optimization algorithm that is based on incremental changes, such as Simulated Annealing (SA) or steepest descent methods, is applied to the mapping problem. A disadvantage is that it is a less correct model for the absolute cost.

3.4 Decomposition and Mapping: Algorithms

Since the decomposition and mapping phases are processed separately, dedicated methods for each can be developed. Experience has shown that decomposition is performed with considerable efficiency via deterministic partitioning methods that work on exactly the graph representation of an application that is described above. Examples of such algorithms, like recursive spectral bisection (RSB), can be found in [3]. Although these algorithms can take into account the properties of the parallel platform on which the execution graph should be mapped, they are especially suited for parallel systems that can be described recursively, whereas our interest is of a more generic nature.

The problem of evaluating every possible mapping for a problem with granularity N on a parallel platform of P processors is an intractable task for realistic problems. Even if
the granularity of the application first is reduced by means of a redundant decomposition, the amount of possibilities still grows unacceptably. Heuristic methods like genetic algorithms (GA) and SA have proven to be an adequate approach for finding sub-optimal mappings (see e.g. [5]). Motivated by arguments like parallelizability of the algorithm, generic applicability, ease of implementation and extendibility, we have chosen to base the mapping method for which results are given in this paper on genetic algorithms.

4 Results

A car grid, see figure 3, consisting of 7000 finite elements has been partitioned into 16, 32 and 64 parts using RSB. Each of these partitionings is mapped on a homogeneous 16 processor grid topology, which is a model of a 16 processor partition of a Parsytec PowerXplorer, that is, the link speeds and floating point performance of this machine were incorporated into the model. Figure 4 shows how the cost of the 64 partition mapping evolves for the best individual in the genetic population as well as for the average cost. Cost function (1) was used for calculating these costs.

The genetic algorithm initiates with a population of N random chromosomes, where N is proportional to the number of clusters (M) to be mapped. As a selection mechanism roulette wheel selection is applied. Furthermore 1-point crossover (with crossover probability 0.7) and mutation (inversely proportional to M) is applied. Both the evolution of the "cheapest" chromosome as well as the average cost in the population is monitored. If the cost of the cheapest chromosome hasn't changed during the last L genetic steps we assume that the system has converged into a sub-optimal configuration. A typical range for this convergence length L for the experiments presented here is 100 < L < 310.
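The genetic algorithm just described can be sketched as below. The parameter choices mirror the text (roulette wheel selection, 1-point crossover with probability 0.7, mutation rate inversely proportional to M, convergence after a fixed number of stagnant steps), but the code itself is our own illustrative reconstruction, not the CAMAS implementation:

```python
import random

def roulette(pop, fitness):
    """Roulette wheel selection: pick a chromosome with probability
    proportional to its fitness (here fitness = 1/cost)."""
    r = random.uniform(0.0, sum(fitness))
    acc = 0.0
    for chrom, f in zip(pop, fitness):
        acc += f
        if acc >= r:
            return chrom
    return pop[-1]

def evolve(cost, M, P, pop_size=32, p_cross=0.7, stall_limit=100):
    """Map M clusters onto P processors: a chromosome is a length-M
    list of processor ids. Stops when the cheapest chromosome has not
    improved for stall_limit consecutive generations."""
    pop = [[random.randrange(P) for _ in range(M)] for _ in range(pop_size)]
    p_mut = 1.0 / M                   # mutation inversely proportional to M
    best, best_cost, stall = None, float("inf"), 0
    while stall < stall_limit:
        costs = [cost(c) for c in pop]
        fitness = [1.0 / (c + 1e-12) for c in costs]
        cheapest = min(costs)
        if cheapest < best_cost:
            best, best_cost, stall = pop[costs.index(cheapest)][:], cheapest, 0
        else:
            stall += 1
        nxt = []
        while len(nxt) < pop_size:
            a = roulette(pop, fitness)[:]
            b = roulette(pop, fitness)[:]
            if random.random() < p_cross:          # 1-point crossover
                cut = random.randrange(1, M)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for i in range(M):
                    if random.random() < p_mut:    # point mutation
                        child[i] = random.randrange(P)
                nxt.append(child)
        pop = nxt[:pop_size]
    return best, best_cost

# toy usage: pure work-load balance on 4 identical processors
random.seed(42)
imbalance = lambda c: max(c.count(p) for p in range(4)) - min(c.count(p) for p in range(4))
best, best_cost = evolve(imbalance, M=16, P=4)
```

In the experiments of this section the cost argument would be cost function (1) evaluated on the RSB partitionings and the PowerXplorer machine model.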
Fig. 3. car grid
Fig. 4. evolution of average cost and cost of best genetic individual for mapping of 64 car partitions on a 16 processor grid
Figure 5 shows the work load distribution after mapping of each of the three different partitionings. Clearly the distribution of work load gets better when there's more redundancy after decomposition.
Fig. 5. work load distribution for mapping 16, 32 and 64 partitions

Fig. 6. communication load distribution for mapping 16, 32 and 64 partitions
Figure 6 shows the communication load distribution (amount of communication per processor) after mapping of the three different partitionings. Also in this case the distribution of communication load gets better with increased redundancy.
5 Discussion and Future Work

5.1 Discussion
The genetic algorithm converges to (sub)optimal mappings. This follows from a large suite of experiments (data not shown), like mapping on processor networks with respectively infinitely fast processors (all processes are mapped on one processor) and on topologies with an infinitely fast network (work load balance). Also for several regular situations, for which the optimal solution is known analytically, the method converges to this derived solution. For example, mapping of the 16 torus partitions of Fig. 2 on a 4 processor ring results in a sensible clustering of 4 partitions per processor. For an extensive discussion of these experiments see [1].

Mapping of the 16, 32 and 64 car partitions presented in section 4 shows (see Figs. 5 and 6) that the work load as well as the communication pattern considerably improves for increasing redundancy. For example, the 16 to 16 mapping can't even utilize all 16 processors efficiently, and gets stuck in a situation where only 14 processors are used. Furthermore the increased redundancy causes a decrease in the cost of the best mapping found: the redundancy provides more degrees of freedom for the mapping algorithm to obtain a good solution.

It must be noted that the specific parallel system and parallel application parameters are very important for the shape of the "phase space" of the mapping problem. Only in the region where the communication term and the work load term in (1) really frustrate one another (are of comparable magnitude) should behaviour such as presented here be expected. Unfortunately, this tends to be the case in many situations.

5.2 Future Work

The methodology described above and the tools that have emerged from it are applicable to the load balancing problem for applications that display a static work load distribution combined with parallel machines that have static processor characteristics. Furthermore
the description is generic, allowing application of the tools to much more than a dedicated field such as finite element simulation. However, some applications can not be described in terms of a static parallel execution graph. Furthermore parallel machines like workstation clusters are by no means static resources. At the moment the applicability of the tools is therefore somewhat limited. An important part of our future work will be to adapt the load balancing methodology for problems that show dynamic behaviour. We intend to incorporate (optimized and parallel) tool kernels into a runtime support system that allows for task migration, which also has been under development at the University of Amsterdam, see [9].

A report which covers the results of the CAMAS workbench tools is available from the HPCnet consortium through the following URL: http://hpcnet.soton.ac.uk/

References

1. J.F. de Ronde and P.M.A. Sloot. Camas-tr-2.1.3.4 MAP final report. Technical report, University of Amsterdam, October 1995.
2. J.F. de Ronde, B. van Halderen, A. de Mes, M. Beemster, and P.M.A. Sloot. Automatic performance estimation of SPMD programs on MPP. In L. Dekker, W. Smit, and J. C. Zuidervaart, editors, Massively Parallel Processing Applications and Development, pages 381-388. EUROSIM, June 1994.
3. N. Floros. Camas-tr-2.2.2.8 DDT user's guide. Technical report, University of Southampton, April 1995.
4. J. De Keyser and D. Roose. Load balancing data parallel programs on distributed memory computers. Parallel Computing, 19:1199-1219, 1993.
5. N. Mansour and G. Fox. Allocating data to multicomputer nodes by physical optimization algorithms for loosely synchronous computations. Concurrency: Practice and Experience, 4(7):557-574, October 1992.
6. J. Merlin. Camas-tr-2.2.1.2 IDA user's guide. Technical report, University of Southampton, September 1993.
7. N. Floros, J. Reeve, J. Clinckemaille, S. Vlachoutsis, and G. Lonsdale. Comparative efficiencies of domain decompositions. Parallel Computing, 1995. Accepted for publication.
8. M. G. Norman. Models of machines and computation for mapping in multicomputers. ACM Computing Surveys, 25:263-302, 1993.
9. Benno J. Overeinder, Peter M. A. Sloot, and Robbert N. Heederik. A dynamic load balancing system for parallel cluster computing. Future Generation Computer Systems, 1996. Accepted for publication.
10. P.M.A. Sloot, J.A. Kaandorp, and A. Schoneveld. Dynamic complex systems (DCS): a new approach to parallel computing in computational physics. Technical Report TR CS 95, University of Amsterdam, November 1995.
11. P.M.A. Sloot and J. Reeve. Executive report on the CAMAS workbench. Technical Report CAMAS-TR-2.3.7, University of Amsterdam and University of Southampton, October 1995.
12. B. van Halderen and P.M.A. Sloot. Camas-tr-2.1.1.7 SAD/Parasol final report. Technical report, University of Amsterdam, October 1995.