Parallel Load Balancing for Dynamic Execution Environments 

T. Minyard,* Y. Kallinderis,† and K. Schulz‡

Dept. of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, Austin, TX 78712

* Graduate Research Assistant, Member AIAA
† Associate Professor, Senior Member AIAA
‡ Graduate Research Assistant, Member AIAA
Copyright © 1996 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

Abstract

A novel partitioning method which uses orthogonal subdivision of a special octree corresponding to the computational grid is presented. The octree is generated automatically and handles any type of 3-D geometry and domain connectivity. The method is evaluated in terms of execution time as well as the quality of the partitions generated. A parallel load balancing method for dynamic execution environments is also presented. The balancer is designed to work when loads on the processors change due to local mesh adaptation or to changes in the parallel execution system. For the case of a dynamic parallel system, the loads on the processors are determined by run time measurements and the balancer redistributes the work based on these timings. The effectiveness of the balancer is demonstrated via parallel execution times for adaptive turbulent flow simulations.

INTRODUCTION

Recent trends in computational fluid dynamics (CFD) have shown a tremendous increase in the development of parallel algorithms for large-scale CFD simulations. Such parallel algorithms achieve computational speeds which surpass modern vector-supercomputers by dividing the computational domain among a number of processors which perform computations independently. Part of the parallel algorithms' widespread popularity can be attributed to the ease with which a parallel environment can be achieved. With the recent standardization of parallel communication schemes, a simple collection of workstations networked together can behave as a parallel machine and can, in fact, run the same codes as those developed for state-of-the-art parallel architectures.

Implementing CFD simulations in a parallel environment introduces a number of new difficulties not encountered with serial algorithms. First, to achieve optimum scalability, parallel simulations require computational domains to be partitioned equally so as to have identical loads on all processors. These domains can have very complex geometries. The partitioning of the computational domain must be automatic and efficient and it should also seek to minimize the communication requirements along interpartition boundaries. Furthermore, the partitioning algorithm should work effectively for a variety of different geometries and mesh topologies.

A number of approaches for partitioning of computational domains have been developed [1, 2, 3]. Two of the more popular techniques are orthogonal recursive coordinate bisection and eigenvalue recursive bisection. Orthogonal recursive bisection uses cutting planes to partition the grid based on the centroidal coordinates of the cells. This approach is fast but the number of elements on the partition interfaces can be large. Moreover, the method cannot handle complex 3-D grids easily. Eigenvalue recursive bisection requires the solution of an eigenvalue problem and is quite expensive but this technique reduces the number of elements on partition interfaces [2, 4]. One of the most effective eigenvalue techniques is recursive spectral bisection (RSB) which partitions on the basis of graph connectivity. This technique has been used for partitioning of unstructured meshes to obtain high quality subdomains [2, 3].

A second key issue in the development of parallel algorithms is the need for a parallel load balancer capable of adapting to changes in the parallel environment. Imbalances in the load of a parallel environment can arise from local mesh adaptation or from load changes on the parallel machine itself. For example, consider the case of a parallel algorithm executing on a network of workstations and one of the workstations suddenly becomes heavily loaded by other processes. The increased load on only one workstation will hinder the performance of the entire parallel solver. An ideal parallel load balancer should be able to calculate the amount of imbalance among the processors and redistribute the work in as little time as possible. Moreover, it should maintain the quality of the partition interfaces for a variety of geometries and grid types. Another desirable quality of a load balancer is to minimize the amount of data migration so as to reduce the amount of communication.

Several algorithms have been proposed for dynamic load balancing over the past few years. Most of the algorithms are based on either recursive bisection of the computational domain [5] or on local migration techniques such as "greedy" algorithms and simulated annealing [6, 7, 8]. Most of the recursive bisection approaches use coordinate bisection of the grid to adjust the interpartition boundaries. While these methods may be fast, they do not yield high quality partitions for complex geometries. The RSB method results in minimal interpartition boundaries and it has recently been implemented in parallel for static meshes. However, its use as a load balancer may be limited because the partition shapes can change significantly after grid adaptation and the time for data migration would be prohibitive. The "greedy" methods do not have direct control of the shape and form of the interpartition boundaries for unstructured grids. Additionally, these methods are nondeterministic and iterations are needed to achieve a load balance.

A new partitioning scheme is developed in which the orthogonal coordinate division approach is applied to a special octree corresponding to the hybrid mesh. This octree is generated based on the distribution of grid cells in the computational domain so that an octant contains a number of grid cells. The octants are then partitioned and the resulting subdomains have fewer elements on the partition interfaces than if orthogonal division of the grid cells were performed. The generality of the octree-based partitioning is explored by applying the method to a variety of geometries. Qualities of the resulting partitions are compared with those obtained by the RSB method. Scaling of execution times for the octree partitioner with increasing number of partitions is also examined.

A new parallel load balancer is presented which adaptively determines load imbalance and repartitions the computational grid accordingly. The balancer uses the same octree-based division to redistribute the work among the processors. The imbalance is determined by one of two approaches. The first simply counts the number of grid cells and balances based on this number. This approach is applied in the case of a parallel mesh adaptation. The second approach uses run time measurements of the parallel solver and assigns weight factors to the cells. The balancer uses these weight factors to adjust the partitions so that work is distributed evenly. This approach is used in dynamic parallel environments where the load on the processors may vary.

The effectiveness of the octree-based load balancer is examined by comparing execution times for the parallel solver before and after load balancing. The current work uses the parallel solver and adapter presented in [9]. A more thorough discussion of the finite-volume solver and adapter on which the parallel implementation is based is given in [10, 11]. The generation of the hybrid grids used by this work is described in [12].

OCTREE-BASED PARTITIONING OF HYBRID GRIDS

A special octree decomposition of the computational domain is constructed for partitioning of an unstructured mesh. The octree is generated by recursive subdivision of the master octant, which encompasses the entire computational domain, into successively smaller octants. A sweep over the cells in the domain is performed and each cell is placed in the octant in which its centroid lies. When the number of cells in an octant exceeds a specified amount, typically twenty, the octant is refined into eight smaller octants and the cells that were in the parent octant are placed in the appropriate child octant. This process continues until all cells in the domain are placed in their respective octants. The resulting octree has significantly fewer octants than the total number of grid cells. An example of the octree for a tetrahedral mesh around a sphere is shown in Figure 1. A cut of the octree at the equatorial plane is depicted.

Two advantages of the octree-based partitioning method are apparent when compared to previous approaches for partitioning grid cells. First, the octree results in a structure that follows the geometry of interest as shown in Figure 1. The computational cells are clustered around the body to resolve the pertinent flow features. As a result, the octants are refined more in this region while the octants near the far field remain relatively large. This biasing of the octants to the geometry results in partitions that have a lower surface to volume ratio with fewer grid elements on the interpartition boundaries. The second advantage of the special octree is a reduced amount of computational time for partitioning. Generation of the octree is a fast process that requires only a small percent of the total time for partitioning of the grid and once an octree is generated for a hybrid grid, it can be used for any number of partitionings. Since the number of octants is only about ten percent of the total grid cells, far fewer calculations are needed for partitioning and a reduced amount of computational time is realized.

The computational grid is divided into as many subgrids as processors using a partitioning algorithm which consists of the following two steps: (i) Coordinate-based grouping of octants, and (ii) Smoothing of partition boundaries. The following sections present both steps using a tetrahedral grid around a sphere as an example case.

Coordinate-based Grouping of Octants

The grid is partitioned by dividing up the corresponding octree and assigning the cells in an octant to the appropriate subdomain. The octants are divided into groups based upon their centroidal coordinates by cutting planes for the number of subdomains desired. The coordinate-based cutting planes are better suited for division of an octree than for partitioning of the computational cells. For example, if the cutting planes were used to partition the unstructured 3-D mesh about a sphere then the partitioning shown in Figure 2 would result. The figure shows the footprint of the partitions on the symmetry plane for a sixteen processor case. Several of the resulting subdomains are long and thin with a high percentage of nodes and edges on interpartition boundaries and cells that are disconnected. Such a partitioning results in a large amount of communication when used by a parallel flow solver. Figure 3 shows the partitioning of the octree corresponding to the same tetrahedral grid for a sphere. The footprints of the resulting tetrahedral subdomains on the symmetry plane are shown in Figure 4. The partitions are not as long and thin as before and no disconnected cells are present. The percentage of nodes on the boundary has been reduced for the octree partitioning. Furthermore, the computational time for subdivision of the octree is much smaller than for division of the entire computational grid. In this case, the grid contains over 100K tetrahedra, but the corresponding octree has only 10K octants.
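The octree construction described for the hybrid grids (cells sorted into octants by centroid, with an octant refined into eight children once it holds more than about twenty cells) can be illustrated with a short serial sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Octant`, `build_octree`, `insert`, `refine`, and `leaves` are hypothetical, the master octant is taken as the bounding box of the cell centroids, and the depth limit `MAX_DEPTH` is an added safeguard that the paper does not mention.

```python
from dataclasses import dataclass, field

MAX_CELLS = 20   # refinement threshold ("typically twenty" in the text)
MAX_DEPTH = 16   # assumed safety limit, guards against coincident centroids


@dataclass
class Octant:
    lo: tuple                                     # minimum corner (x, y, z)
    hi: tuple                                     # maximum corner (x, y, z)
    cells: list = field(default_factory=list)     # cell indices (leaves only)
    children: list = field(default_factory=list)  # eight children once refined

    def center(self):
        return tuple((a + b) / 2.0 for a, b in zip(self.lo, self.hi))

    def child_for(self, p):
        # Bit d of the child index is set when p lies in the upper half along axis d.
        c = self.center()
        k = sum(1 << d for d in range(3) if p[d] >= c[d])
        return self.children[k]


def refine(octant, centroids, depth):
    """Split an octant into eight children and hand its cells down to them."""
    c = octant.center()
    for k in range(8):
        upper = [(k >> d) & 1 for d in range(3)]
        lo = tuple(c[d] if upper[d] else octant.lo[d] for d in range(3))
        hi = tuple(octant.hi[d] if upper[d] else c[d] for d in range(3))
        octant.children.append(Octant(lo, hi))
    moved, octant.cells = octant.cells, []
    for i in moved:
        insert(octant.child_for(centroids[i]), i, centroids, depth + 1)


def insert(octant, i, centroids, depth=0):
    """Place cell i in the leaf octant containing its centroid, refining if needed."""
    while octant.children:
        octant = octant.child_for(centroids[i])
        depth += 1
    octant.cells.append(i)
    if len(octant.cells) > MAX_CELLS and depth < MAX_DEPTH:
        refine(octant, centroids, depth)


def build_octree(centroids):
    """Master octant = bounding box of all centroids; then sweep over the cells."""
    lo = tuple(min(p[d] for p in centroids) for d in range(3))
    hi = tuple(max(p[d] for p in centroids) for d in range(3))
    root = Octant(lo, hi)
    for i in range(len(centroids)):
        insert(root, i, centroids)
    return root


def leaves(octant):
    """Yield the leaf octants (some may be empty)."""
    if not octant.children:
        yield octant
    else:
        for child in octant.children:
            yield from leaves(child)
```

Because refinement is driven by the local cell count, leaves end up small where cells cluster and large where the grid is coarse, which is the geometric biasing the text attributes to the octree.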

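The coordinate-based grouping of octants can be sketched as a weighted recursive bisection over octant centroids. The paper does not spell out how the cutting planes are chosen, so the details here are assumptions made for illustration: the function name `group_octants` is hypothetical, each cut is taken normal to the coordinate direction of largest extent, and an octant falls on the lower side of a cut when the midpoint of its weight does not exceed the target for that side. The weights could be cell counts per octant.

```python
def group_octants(centroids, weights, nparts):
    """Assign each octant to one of nparts groups using coordinate cutting planes.

    centroids: list of (x, y, z) octant centroids
    weights:   list of octant weights (for example, number of cells per octant)
    Returns a list giving the group index of every octant.
    """
    part = [0] * len(centroids)

    def recurse(ids, nparts, first):
        if nparts == 1 or not ids:
            for i in ids:
                part[i] = first
            return
        # Assumed rule: cut normal to the axis along which the group is longest.
        def extent(d):
            values = [centroids[i][d] for i in ids]
            return max(values) - min(values)
        axis = max(range(3), key=extent)
        ids = sorted(ids, key=lambda i: centroids[i][axis])
        # The lower side receives n_left of the nparts groups and the same
        # fraction of the total weight.
        n_left = nparts // 2
        target = sum(weights[i] for i in ids) * n_left / nparts
        acc, cut = 0.0, 0
        while cut < len(ids) and acc + weights[ids[cut]] / 2.0 <= target:
            acc += weights[ids[cut]]
            cut += 1
        recurse(ids[:cut], n_left, first)
        recurse(ids[cut:], nparts - n_left, first + n_left)

    recurse(list(range(len(centroids))), nparts, 0)
    return part
```

Since the sort and the cuts operate on octants rather than cells, the work scales with the number of octants, which the text puts at roughly ten percent of the number of grid cells. Each cell then inherits the group of the octant that contains it.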

Smoothing of Interpartition Boundaries

Due to the unstructured nature of the grid, interpartition boundaries may be jagged with a large number of nodes and edges on partition interfaces. These boundaries can be improved by applying a smoothing technique to the cells on the partition interfaces. The process begins by determining which cells in the computational domain are candidates for smoothing. A cell is flagged for smoothing if all of its nodes are shared between two neighboring partitions. After flagging all of the candidate cells, the partition interfaces are altered by assigning half of the flagged cells to one partition and the other half to the neighboring partition. Typically, five smoothing iterations are performed. Figure 5 shows the previous octree-based partitioning of the sphere tetrahedral grid after smoothing has been applied. Comparing Figure 5 with Figure 4, it is observed that the irregular and jagged boundaries have now been improved with a lower percentage of nodes and edges on the boundary.

The strategy for partitioning of a hybrid grid is to subdivide the prismatic and tetrahedral regions separately. The prismatic and tetrahedral subdomains can either be combined or kept separate. If both regions are partitioned into the same number of subdomains, then a prismatic and a tetrahedral subdomain can be combined to form a single partition. This approach gives the best load balance among the partitions since each partition will contain the same number of prisms and tetrahedra. However, the different grid subdomains can also be kept in separate partitions but a load balance is much more difficult to obtain.

The structure of the prisms in the normal-to-the-surface direction is exploited. The prisms are defined by their corresponding base faces on the surface. As a result, the stacks of prisms are partitioned by simply partitioning the triangular surface mesh. All cells within each prism-stack are assigned to the same partition. In this way, the data structure operations for partitioning, solving, adapting, and load balancing of the prismatic grid refer to the triangular surface mesh. This results in savings in both memory and execution time.

PERFORMANCE OF OCTREE PARTITIONING

The most important property of an effective grid partitioner is that it should produce balanced subdomains with a minimum number of elements on partition interfaces for all types of grid elements and any complex geometry. Furthermore, the method should be as automated as possible and the amount of time for partitioning should not increase drastically as the number of partitions increases. This section presents the effectiveness of the hybrid grid partitioner by examining partition qualities and timings obtained using the octree-based method. The partition qualities are compared with those obtained from an established grid partitioning method, namely recursive spectral bisection (RSB) [3]. Several geometries are examined including the sphere geometry and an aircraft configuration with and without engines. It should be noted that the octree-based partitioner is automated with minimal user interaction.

In the present work, the quality of a partitioning is defined as the maximum percentage of grid points in a subdomain that are on interpartition boundaries. This percentage relates the amount of communication required for the parallel solver to the number of computations performed within a subdomain. As this percentage increases, a larger portion of the computational time will have to be spent transferring information to neighboring processors.

The first case tested was a hybrid grid for the sphere geometry. Figures 6(a) and 6(b) show the reduction in the maximum percentage of nodes on partition boundaries before and after smoothing for several prismatic and tetrahedral partitionings of the hybrid mesh about the sphere. The smoothing of partition interfaces results in a substantial reduction in the percentages of interface nodes for the tetrahedral subdomains. The figures also show how the present partitioning technique compares to the RSB approach. For most of the partitionings, the present method yields about the same qualities when comparing the maximum local percentage of nodes on partition interfaces. The average percentages of nodes on the boundaries show the same trend as the maximum percentages when compared to the RSB method. It is also noted that the maximum number of neighboring partitions for the octree-based partitioning was about the same as the maximum number of neighbors for the RSB partitioning. The maximum number of neighboring partitions governs the number of communications while the percentage of nodes on interfaces corresponds to the amount of data for communication. Even though RSB does slightly better for the static meshes, implementation of RSB in a parallel environment with dynamic meshes is much more difficult than using a load balancer with octree-based partitioning strategies.

Partitioning of a hybrid mesh about a High Speed Civil Transport (HSCT) aircraft configuration is now considered. The initial hybrid mesh consists of approximately 120K nodes, 170K tetrahedra, 4400 surface triangles, and 176K prisms. The corresponding octree used to partition this mesh contains just over 16K octants. Figure 7(a) shows the signature of sixteen partitions on the upper surface of the HSCT before smoothing of the interpartition boundaries. The surface partitions after smoothing are shown in Figure 7(b). The partition interfaces after smoothing are not as jagged as they were previously and a lower percentage of the nodes and edges are on the boundaries. Figure 8(a) compares the maximum local percentage of prismatic nodes on partition interfaces for the current octree partitioning and a partitioning generated using RSB. The octree technique yielded a slightly higher percentage of nodes on the boundaries than the RSB method. A comparison of the maximum local percentage of tetrahedral nodes on the partition interfaces is presented in Figure 8(b). Again, the percentage of nodes on interpartition boundaries is only slightly less for the RSB than for the present method. The octree partitioning of the tetrahedral portion of the HSCT hybrid grid for a case with sixteen partitions is shown in Figure 9. The figure shows the footprint of the partitions on the symmetry plane of the mesh. The octree biases the partitioning around the aircraft geometry. A cutaway view of the octree for this case is shown in Figure 10. The figure shows the outer prismatic surface and the intersection of the octree with the surface on three planes. The figure illustrates how the octree is biased to the geometry. The footprint of the resulting tetrahedral subdomains on the symmetry plane after smoothing of the partition interfaces is shown in Figure 11. The partitions have smooth boundaries with no disconnected cells.

The amount of execution time required to partition three different hybrid grids is plotted versus increasing number of partitions in Figure 12. The execution times increase linearly with increasing number of partitions. These timings correspond to ten smoothing iterations on the interpartition boundaries. The largest amount of time is spent in smoothing of the tetrahedral portions of the partitioning. Only twenty-five percent of the total execution time is required for generating, sorting, and dividing of the octree. Approximately ten percent of the time is for smoothing of the prism boundaries and the rest of the time is spent smoothing the tetrahedra interfaces. The total execution time could be reduced by performing fewer iterations, but the quality of the partitions will be slightly worse. However, most of the improvement in the partition boundaries is accomplished in the first few smoothing iterations while the remaining iterations improve the partition qualities only slightly.

Using the octree of a hybrid unstructured grid to partition the grid results in subdomains that have better qualities than standard coordinate bisection. Since the octree is biased by the geometry and the distribution of the cells in the grid, the partitions tend to follow the same biasing and the resulting partitions have a better surface to volume ratio.

Universality of application of the present method to very different geometries and grids is examined next. Figure 13 shows the global percentage of nodes on interpartition boundaries plotted against the number of nodes per partition for the sphere, HSCT, and HSCT with engines hybrid grids. The global percentage is calculated by dividing the total number of nodes on interpartition boundaries by the total number of nodes in the grid. The curves are very similar even though the grids have different distributions of cells and nodes. This similarity shows that the current partitioning method yields approximately the same partition qualities for entirely different geometries.

PARALLEL DYNAMIC LOAD BALANCING

The initial grid partitioning algorithm generates subdomains with approximately an equal number of grid cells in each of them. However, the load on the processors may become imbalanced as the grid is adapted dynamically or if the parallel system environment changes. The problem of eliminating this imbalance consists of two independent subproblems. The first concern is the identification of the processors that need to exchange cells with their neighbors along with the number of cells to be exchanged. The second problem concerns the actual exchange of cells between processors including the updating of the pertinent data structures.

Determination of the load imbalance is based on one of two concepts. The first uses the number of cells in a partition and the load balancer just distributes the cells evenly among the processors. This approach is typically used after a dynamic mesh adaptation. The second concept uses run time measurements of the parallel solver to calculate weighting factors for the cells. The load is then balanced based on these weight factors to improve the efficiency of the solver. This method applies to a dynamic parallel environment where the speeds of the individual processors may vary.

The load balancing algorithm consists of four steps: (i) initialization of the octree, (ii) sorting of the octree, (iii) determination of communication patterns, and (iv) local migration of cells.

Initialization of Octree

The first step of the load balancer is to determine in which octants the cells lie. Each processor reads in the global octree generated by the initial grid partitioning. Then, a sweep through the local prismatic and tetrahedral cells is performed and the cells are placed in their respective octants. If the load imbalance results from a local mesh adaptation, then the weight of each octant is calculated solely on the number of cells within that octant. In the case of dynamic execution systems, i.e., clusters of workstations, the weight assigned to each octant is the sum of the cell load factors within that octant. These cell load factors are computed by monitoring the amount of time for each processor to complete one time-step of the parallel solver.

The octree-initialization step of the load balancer is parallel since each processor works only on the cells within the local subdomain. However, the global octree is contained on each processor. While this requires extra memory on each processor, the entire data structure of the octree is only a small percent of the total amount of memory needed by the balancer. Also, the communications and computations required if using a distributed octree would only increase the complexity of the balancer with minimal savings in memory and execution time.

Sorting of the Octree

The same coordinate-based sorting approach used by the hybrid grid partitioner is applied for the load balancer. The sort first gathers all of the octant weight factors onto every processor. This requires two global communications for a hybrid mesh to get the octant weight factors for the prisms and tetrahedra. After obtaining the global weight factors, each processor then performs the orthogonal coordinate-based bisection procedure on the global octree. Even though the calculations are repeated on every processor, the time for sorting of the octree is small when compared to the time to migrate the cells. After sorting of the octree, every processor will know on which processor each octant should be located. The cells within the local subdomains are then colored based on the octant in which the cell lies. At this point, each processor knows which cells need to be transferred to other processors. The significant advantage offered by this approach is that the length and form of the interprocessor boundary can be maintained by the octree-based subdivision.
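The two ways of weighting octants (cell counts after mesh adaptation, timing-based load factors on a dynamic system) can be sketched as follows. The paper states only that cell load factors come from the measured time per solver time-step on each processor; the specific formula used here, measured step time divided by the number of local cells, is an assumption, and the names `cell_load_factors` and `octant_weights` are hypothetical. The sketch is serial: in the parallel balancer each processor would accumulate weights for its own cells, and the global gather would combine them.

```python
from collections import defaultdict


def cell_load_factors(step_time, cells_on_proc):
    """Assumed load factor per cell: measured step time / number of local cells."""
    return {p: step_time[p] / len(cells)
            for p, cells in cells_on_proc.items() if cells}


def octant_weights(cells_on_proc, octant_of_cell, step_time=None):
    """Sum cell weights per octant.

    cells_on_proc:  dict processor -> list of cell ids it currently owns
    octant_of_cell: dict cell id -> octant id (from the octree initialization)
    step_time:      dict processor -> seconds per solver time-step, or None to
                    weight every cell equally (the mesh adaptation case)
    """
    factor = cell_load_factors(step_time, cells_on_proc) if step_time else None
    weights = defaultdict(float)
    for p, cells in cells_on_proc.items():
        for c in cells:
            weights[octant_of_cell[c]] += factor[p] if factor else 1.0
    return dict(weights)
```

With timing-based weights, cells on a slow or heavily loaded processor weigh more, so a weighted bisection of the octants leaves that processor with fewer cells. The octant weights then sum to the total of the measured step times.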

Maintaining the interprocessor boundaries through the octree-based subdivision is very important, especially when multiple load balancings are to be effected in a single execution, as is the case with dynamic adaptive meshes.

Determination of Communication Patterns

Now that the cells are colored by the processor to which they should belong, the global communication pattern for migration must be generated. A sweep through the local cells is performed in order to calculate the number and destination of cells to be transferred. A global gather step is performed so that each processor knows which processors need to send data to other processors. A sweep over the global processors is performed and the processors are colored based on which processors they need to send to and receive from. The processors are then paired with one another according to their color. The number of migration steps that have to be performed is determined by the maximum number of pairings over all processors.

Local Migration of Cells

The final stage of the dynamic load balancer performs the actual migration of the cells among the pairs of processors. The number of migration steps has already been calculated during the determination of communication patterns. The following procedure is performed during each migration step. First, each pair of processors determines which cells and nodes will be on the partition interface of the pair. A smoothing pass similar to the smoothing applied during partitioning is then performed on the interface cells. The processors that are transferring cells then pack the data structures, including the spatial coordinates and the solution vectors for any nodes that do not already reside on the receiving processor. The data is packed into one array so that only one communication has to be performed. Then, all pairs of processors involved in migration update their data structures to remove the holes in the data structures due to cells being transferred and to add any nodes and cells that were received. Since the sending and receiving processors have renumbered their local data structures, the nodes that reside on interpartition boundaries will have to inform their neighbors of the new node numbering.

Each migration step results in a new grid partitioning that will have the complete and updated data structures for the grid as well as the proper interpartition boundary pointers. The load will be balanced only after all migration steps are performed.

PERFORMANCE OF PARALLEL DYNAMIC LOAD BALANCING

A partitioned memory Multiple Instruction Multiple Data (MIMD) architecture typically consists of a collection of multiple processors connected together by a high-speed interconnection network. Each processor has the freedom of executing its own set of instructions on its data. There is no notion of shared memory and the only way that processors can interact is through the connecting network. The IBM SP2 is an example of such an architecture.

The same user program is executed on all the processors, each with its own set of data. Coordination among processors is achieved through message passing for which "send" and "receive" primitives are provided. The programming paradigm is essentially that of any ordinary sequential code. That is, the actual structure of any program written for a parallel machine has basically a sequential form with additional calls to the message passing routines for synchronization among the processors. The message passing libraries used for the present work are based on the Message Passing Interface (MPI). MPI provides a standard so that parallel applications are portable among different parallel machines and clusters of workstations.

This section presents the results for dynamic load balancing with adaptive meshes and dynamic parallel execution systems. Performance of the octree-based dynamic load balancer for hybrid grids is tested using two cases: supersonic flow over a bump in a channel and transonic flow around the ONERA M6 wing. The performance is evaluated in terms of time for balancing and execution time of the hybrid grid solver before and after load balancing.

Adaptive Hybrid Grids

The first case to examine the efficiency of the parallel load balancer involves the parallel mesh adaptation of a hybrid grid over a bump in a channel. Supersonic flow over a 4% bump in a channel is simulated on the IBM SP2 and the mesh is adapted based on flow feature detection. The original hybrid mesh consists of approximately 7K prisms and 11K tetrahedra while the adapted mesh has 10K prisms and 20K tetrahedra. The hybrid mesh was originally partitioned only in the streamwise direction resulting in a strip partitioning. This partitioning requires only two neighbor communications during the load balancing phase. The timings for each stage of the balancer are shown in Figure 14 along with the total time for balancing. The figure shows that the total time increases only slightly with the number of processors. It is noted that the time for migration dominates the total execution time of the balancer. This is expected since the migration of cells requires the greatest amount of calculations and communications.

The second case to test the effectiveness of the load balancer for adapted grids uses a hybrid mesh around an ONERA M6 wing. The hybrid grid for this simulation consists of about 240K prisms and 156K tetrahedra. The surface triangulation for the prismatic region is shown in Figure 15(a). Transonic turbulent flow ($M = 0.84$, $Re = 11.72 \times 10^{6}$) is simulated around the wing at an angle of attack of 3.06 degrees. After achieving a partial solution, the mesh is adapted resulting in approximately 307K prisms and 185K tetrahedra. The adapted surface triangulation is shown in Figure 15(b). The prismatic cells are adapted mainly in the regions of the lambda shock. Figure 16(a) shows the surface partitioning of the original mesh for eight processors, four in the spanwise direction and two in the normal-to-surface direction. Figure 16(b) shows the resulting partition boundaries after load balancing of the adapted mesh. The partition boundaries have moved towards the main regions of adaptation. It is observed that the partition boundaries have moved chordwise at the leading and trailing edge regions. The maximum percentage of nodes on interface boundaries for the original hybrid mesh partitioning is 22.9% while the load balanced adapted mesh has a maximum percentage of 20.5% for the nodes on partition interfaces. The quality of the mesh partitions remained basically the same after load balancing. The slight improvement in quality is due to the increased number of elements in each subdomain. The partition boundaries look jagged on the surface of the wing but this is mainly due to the anisotropic nature of the surface mesh.

Efficiency of a load balancing method depends critically on the number of cells to be migrated. The octree-based load balancer requires a small amount of cell migration since the partition boundaries move only slightly. On the other hand, by comparing the partitionings of the surface triangulation for the original and adapted meshes using RSB shown in Figures 17(a) and 17(b), it is observed that several of the partitions change substantially. This would require a larger number of cells for migration if the load balancing were based on the RSB partitioning. The partitions for the tetrahedral region of the mesh change even more than the prism partitions after adaptation. The computational cost for migration of the cells for this case would be much higher than for the present octree-based load balancer.

Scaling of parallel execution times for the load balancer is examined next. Figure 18 shows the timings for octree-based load balancing on the ONERA M6 wing adapted hybrid mesh with increasing number of processors. As the number of processors increases, the amount of time to initialize, sort, and determine the communication pattern for the octree increases only slightly. However, the time for migration of the cells increases by a large amount from four to sixty-four processors. This increase is due to more migration steps being needed since the number of neighboring processors to transfer cells to has increased. However, the change in time for balancing is not as large from the thirty-two to the sixty-four processor case because only a few more neighbor transfer pairings are needed for the sixty-four processor load balancing.

The total times for load balancing are only a fraction of the time to run the parallel solver for a typical number of 1000 time-steps. Figure 19 shows the timings of the parallel Navier-Stokes solver for 1000 time-steps. The timings for the load balanced cases include the total time for balancing. The unbalanced load timings get progressively worse for more processors due to the increasing imbalance among the processors. However, the execution time for the balanced partitions reduces almost linearly with increasing number of processors. The slight deviation for the sixty-four processor case is due to the fact that there is not enough work on each processor.

[...] tion time compared to the additional nine seconds needed to balance the load.

CONCLUDING REMARKS The octree-based partitioning resulted in highquality, balanced subdomains that that follow the geometry of interest. The octree-based method yielded similar quality subdomains as the RSB method. However, the present method is also applicable to adaptive dynamic meshes, as well as to changing parallel computing environments. Furthermore, the octree-based method produced similar quality partitions for very di erent geometries, namely the sphere and HSCT aircraft con gurations. The importance of the load balancer to dynamic simulations was demonstrated via parallel execution of a Navier-Stokes solver. The savings in solver execution time after load balancing far outweighed the time required by the octree-based load balancer. The current method incurred a relatively small amount of migration which is crucial for ef ciency of the load balancer. The case of an overloaded processor demonstrated the e ectiveness of the balancing method for parallel execution on dynamic systems, such as a cluster of workstations.

Dynamic Parallel Execution Systems With the increasing use of workstation clusters serving as parallel systems, the implementation of a load balancer based on run time measurements of the parallel solver becomes more important. In this type of environment, processors may become loaded due to other jobs running on the workstations. The octree-based load balancer is used to balance the load in this type of changing computing environment. During parallel execution of the solver, the \wait" times of each processor are monitored. These \wait" times measure how long each CPU remains idle while all other processors nish the same task. The cells are assigned a load factor that is proportional to the \wait" time of their processor. The weight factor for an octant is the sum of the load factors for the cells contained within that octant. The octree then balances the load based on the weight factors of the octants rather than the number of cells in the octants. Figure 20 shows the partition boundaries for the prismatic surface mesh after balancing for an overloaded processor whose corresponding partition is shaded darker in the gure. Comparing Figures 16 and 20, it is observed that the partition boundaries have moved towards the overloaded partition. The execution time for 1000 time-steps of the parallel solver on eight processors is just over 173 minutes for the unbalanced partitions while the execution time of the solver after balancing is only 113 minutes. This results in a savings of one hour in execu-
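The run-time weighting described under Dynamic Parallel Execution Systems can be illustrated with a short sketch. It is not the authors' implementation: the function names, the dictionary-based data layout, the per-processor load factors passed in as plain numbers, and the midpoint rule used to cut an ordered list of octants into partitions are all assumptions made for illustration. The only steps taken from the text are that each cell carries the load factor of its processor and that the weight of an octant is the sum of the load factors of its cells.

```python
from collections import defaultdict


def octant_weights(cell_octant, cell_proc, proc_load_factor):
    """Sum per-cell load factors into per-octant weight factors.

    cell_octant: maps cell id -> octant id (hypothetical layout)
    cell_proc: maps cell id -> processor currently owning the cell
    proc_load_factor: maps processor -> load factor derived from its
        measured timings (how it is derived is left to the caller)
    """
    weights = defaultdict(float)
    for cell, octant in cell_octant.items():
        # Every cell inherits the load factor of its processor.
        weights[octant] += proc_load_factor[cell_proc[cell]]
    return dict(weights)


def split_octants_by_weight(ordered_octants, weights, n_parts):
    """Cut an ordered octant list into n_parts of roughly equal weight.

    Assumed rule: an octant goes to the partition that contains the
    midpoint of its interval on the cumulative-weight axis.
    """
    total = sum(weights[o] for o in ordered_octants)
    if n_parts < 1 or total <= 0:
        raise ValueError("need at least one partition and positive weight")
    target = total / n_parts
    parts = [[] for _ in range(n_parts)]
    running = 0.0
    for octant in ordered_octants:
        w = weights[octant]
        index = min(int((running + 0.5 * w) / target), n_parts - 1)
        parts[index].append(octant)
        running += w
    return parts


# Toy data: six cells in three octants, processor 0 weighted twice as heavily.
cell_octant = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
cell_proc = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
weights = octant_weights(cell_octant, cell_proc, {0: 2.0, 1: 1.0})
partitions = split_octants_by_weight(["A", "B", "C"], weights, 2)
```

With these toy inputs the octant weights are 4, 3 and 2, and the split places octant A alone in the first partition (weight 4) and octants B and C in the second (weight 5). Splitting by cell count instead would ignore the heavier load factor on processor 0.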

Acknowledgments

The authors would like to thank Horst Simon for providing us with the RSB partitioning code Version 2.2 written by Steve Barnard and Horst Simon. This work was supported by NSF Grant ASC9357677 (NYI program) and the Texas Advanced Technology Program (ATP) Grant #003658-413. Parallel computing time on the IBM SP2 was provided by the NAS Division of the NASA Ames Research Center. Supercomputing time was provided by the High Performance Computing Facility at the University of Texas at Austin.


References

[1] C. Farhat and M. Lesoinne, "Automatic Partitioning of Unstructured Meshes for the Parallel Solution of Problems in Computational Mechanics," International Journal for Numerical Methods in Engineering, Vol. 36, pp. 745-764, 1993.

[2] A. Pothen, H. D. Simon, and K.-P. Liou, "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM Journal of Matrix Anal. Applications, Vol. 11, No. 3, pp. 430-452, July 1990.

[3] H. D. Simon, "Partitioning of Unstructured Problems for Parallel Processing," Technical Report RNR-91-008, NASA Ames Research Center, Moffett Field, CA, 1991.

[4] E. R. Barnes, "An Algorithm for Partitioning the Nodes of a Graph," SIAM Journal of Alg. Disc. Methods, Vol. 3, p. 541, March 1982.

[5] A. Vidwans, Y. Kallinderis, and V. Venkatakrishnan, "A Parallel Dynamic Load Balancing Algorithm for 3-D Adaptive Unstructured Grids," AIAA Journal, Vol. 32, No. 3, pp. 497-505, March 1994.

[6] H. L. de Cougny, et al., "Load Balancing for the Parallel Adaptive Solution of Partial Differential Equations," Applied Numerical Mathematics, Vol. 16, pp. 157-182, 1994.

[7] R. Lohner, R. Ramamurti, and D. Martin, "A Parallelizable Load Balancing Algorithm," AIAA Paper 93-0061, Reno, NV, January 1993.

[8] R. D. Williams, "Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations," Caltech Concurrent Computation Program Report #C3P913, Pasadena, CA, 1990.

[9] T. Minyard and Y. Kallinderis, "A Parallel Navier-Stokes Method and Grid Adapter with Hybrid Prismatic/Tetrahedral Grids," AIAA Paper 95-0222, Reno, NV, January 1995.

[10] V. Parthasarathy, Y. Kallinderis, and K. Nakajima, "Hybrid Adaptation Method and Directional Viscous Multigrid with Prismatic/Tetrahedral Meshes," AIAA Paper 95-0670, Reno, NV, January 1995.

[11] A. Khawaja, Y. Kallinderis, and V. Parthasarathy, "Implementation of Adaptive Hybrid Grids for 3-D Turbulent Flows," AIAA Paper 96-0026, Reno, NV, January 1996.

[12] A. Khawaja, H. McMorris, and Y. Kallinderis, "Hybrid Grids for Viscous Flows around Complex 3-D Geometries including Multiple Bodies," AIAA Paper 95-1685-CP, San Diego, CA, June 1995.

Figure 1: Two-dimensional view of the special octree generated for a tetrahedral mesh around a sphere. The size and distribution of the octants follow the geometry and grid cell distribution.

Figure 2: Partitioning of a tetrahedral mesh about a sphere using coordinate based cutting planes results in long and thin subdomains with a high percentage of nodes on the boundary and disconnected cells. A view of the sixteen partitions on the symmetry plane of the domain is shown.

Figure 3: Partitioning of the special octree corresponding to the tetrahedral mesh using coordinate based cutting planes. A view of the octant partitions is shown on the symmetry plane corresponding to the sixteen partition case.

Figure 4: Resulting partitions using the octree-based coordinate division before smoothing of partition interfaces. View of the sixteen tetrahedral partitions on the symmetry plane.

Figure 5: Effect of smoothing on the sixteen partitions for the sphere tetrahedral mesh. The partition boundaries are no longer jagged and the percentage of nodes and edges on the boundary has been reduced.

Figure 6: Comparison of the scaling of the maximum percentage of nodes on interface boundaries for varying number of partitions. Cases of (a) prismatic and (b) tetrahedral meshes around the sphere. Axes in both panels: maximum percent nodes on boundary versus number of partitions. Curves: before smoothing; after smoothing (+); recursive spectral bisection, RSB (□).

Figure 7: Signature of sixteen partitions on the surface of the High Speed Civil Transport (HSCT) aircraft configuration (a) before and (b) after smoothing of partition interfaces.

Figure 8: Comparison of the scaling of the maximum percentage of nodes on interface boundaries for varying number of partitions. Cases of (a) prismatic and (b) tetrahedral meshes around the HSCT aircraft. Axes in both panels: maximum percent nodes on boundary versus number of partitions. Curves: octree-based partitioning; recursive spectral bisection, RSB (+).

Figure 9: Partitioning of the octree corresponding to the HSCT tetrahedral mesh using coordinate-based cutting planes results in subdomains that are biased to the geometry. The signature of the sixteen octant partitions is shown on the symmetry plane.

Figure 10: Cutaway view of the sixteen octree partitions for the HSCT tetrahedral grid. The octree is biased to the geometry.

Figure 11: The signature of sixteen tetrahedral partitions on the symmetry plane for the HSCT aircraft obtained using the octree-based partitioning with smoothing.

Figure 12: Scaling of execution time for partitioning of three different hybrid grids with the number of partitions. Axes: execution time for partitioning (sec) versus number of partitions. Curves: sphere geometry; HSCT aircraft configuration without engines (+); HSCT aircraft configuration with engines (□).

Figure 13: Geometry and grid-independence of the octree-based partitioning method. Axes: global percentage of nodes on interfaces versus number of nodes per partition. Curves: sphere geometry; HSCT without engines (+); HSCT with engines (□).

Figure 14: Scaling of execution time for the dynamic load balancer with increasing number of processors. Case of an adapted hybrid grid for flow over a bump in a channel. This case uses a "strip" partitioning of the duct. Axes: time (sec) versus number of processors. Curves: total execution time of balancer; time to initialize octree (+); time for sorting of the octree (□); time to calculate communication pattern; time for migration of cells (△).

Figure 15: Surface triangulation of the (a) unadapted and (b) adapted ONERA M6 wing for the prismatic region of the hybrid grid. The original mesh contains about 12K triangles while the adapted has about 15K triangles.

Figure 16: Partitioning of the (a) unadapted and (b) adapted prismatic surface mesh for the M6 wing using the octree-based method. The partition boundaries have moved towards the regions of adaptation after load balancing. The maximum percentages of nodes on partition interfaces are 22.9% and 20.5% for the unadapted and adapted meshes respectively.
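The partition-quality measure quoted for the M6 wing in Figure 16 (22.9% and 20.5% maximum nodes on partition interfaces) can be computed along the following lines. This is a minimal sketch under an assumed definition, since the text does not spell one out: a node counts as an interface node when it belongs to cells owned by more than one partition, and the percentage is taken per partition relative to that partition's own node count. The function name and the dictionary layout are hypothetical.

```python
def max_interface_node_percent(cell_nodes, cell_part):
    """Return the largest per-partition percentage of interface nodes.

    cell_nodes: maps cell id -> iterable of node ids of that cell
    cell_part: maps cell id -> partition id owning that cell
    """
    # Collect, for every node, the set of partitions whose cells touch it.
    node_parts = {}
    for cell, nodes in cell_nodes.items():
        for node in nodes:
            node_parts.setdefault(node, set()).add(cell_part[cell])

    total = {}   # nodes touched by each partition
    shared = {}  # of those, nodes also touched by another partition
    for parts in node_parts.values():
        for part in parts:
            total[part] = total.get(part, 0) + 1
            if len(parts) > 1:
                shared[part] = shared.get(part, 0) + 1

    return max(100.0 * shared.get(part, 0) / total[part] for part in total)


# Two triangles sharing an edge, one per partition: nodes 1 and 2 are shared.
percent = max_interface_node_percent({0: (0, 1, 2), 1: (1, 2, 3)}, {0: 0, 1: 1})
```

In this toy case each partition touches three nodes, two of which lie on the shared edge, so the measure is two thirds, or about 66.7%. Real partitions with many cells per subdomain give much lower values, as in the figures reported above.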

Figure 17: Surface partitionings for the (a) unadapted and (b) adapted prismatic region around the ONERA M6 wing using RSB. The surface mesh is partitioned into eight subdomains. Several of the partition boundaries have moved significantly, thus requiring a large number of cells to transfer during load balancing.

Figure 18: Timings for parallel load balancing of an adapted hybrid mesh around the ONERA M6 wing. Axes: time (sec) versus number of processors. Curves: total execution time of balancer; time to initialize octree (+); time for sorting of the octree (□); time to calculate communication pattern; time for migration of cells (△).

Figure 19: Effect of load balancing on parallel execution of a Navier-Stokes solver. Case of turbulent transonic flow around the ONERA M6 wing run for 1000 time-steps. Total time for balancing is included in the times for the balanced load case. Axes: execution time (min) versus number of processors. Curves: time for parallel solver with load balancing; time for parallel solver without load balancing (+).

Figure 20: Load balancing in response to an overloaded processor. Load balancing is based on run time measurements of the parallel solver. The partition boundaries have moved towards the overloaded processor (compare with Figure 16).
