The Concurrent Graph:¹
Basic Technology for Irregular Problems
Stephen Taylor, Jerrell Watts, Marc Rieffel, and Michael Palmer
Scalable Concurrent Programming Laboratory
California Institute of Technology
Pasadena, CA 91125
Abstract

This paper describes basic programming technology to support irregular applications on scalable concurrent hardware and shows how the technology has been applied to a variety of large-scale industrial application problems. The technology is based on the concept of a concurrent graph library that provides an adaptive collection of light-weight threads that may relocate between computers dynamically. The graph is portable to a wide range of high-performance multicomputers, shared-memory multiprocessors, and networked workstations. For each machine it is optimized to take advantage of the best available underlying communication and synchronization mechanisms. The graph provides a framework for adaptive refinement of computations, automatic load balancing, and interactive, on-the-fly visualization. It has been applied to a variety of large-scale irregular applications to provide portable, scalable implementations with substantial code reuse. The applications described in this article typify a broad category of problems in continuum and non-continuum flow simulations.
1 The Concurrent Graph

Over the last ten years there has been a constant evolution of high-performance parallel architectures, with new machines appearing on a yearly basis. These architectures increasingly leverage low-cost workstation and shared-memory technology on a small scale (less than 256 computers) with distributed-memory multicomputer technology to provide scalability. These hybrid architectures, such as the Intel Paragon, Cray T3D, and SGI Power Challenge, provide a combination of shared-address-space and message-passing programming models. As the performance gap between these models continues to diminish, there is a gradual evolution toward architectures that support a global address space implemented through a memory hierarchy.

This paper describes basic programming techniques and technology to support large-scale irregular applications on hybrid architectures. This support maintains application investments by providing portability, scalability, and maintainability.

¹ The concurrent graph library was developed under a Presidential Young Investigator Award from the National Science Foundation, ASC-9157650. The applications research described in this report is sponsored by the Advanced Research Projects Agency, ARPA Order number 8176, and monitored by the Office of Naval Research under contract number N00014-91-J-1986.
An application is developed in terms of a concurrent graph library. This library is implemented using a low-latency remote procedure call (RPC). This mechanism represents the lowest common denominator of available programming models. It can be implemented with a variety of methods, including pointer copying on shared-memory machines and message passing on distributed systems; hardware implementations are also available in experimental architectures [6]. By focusing on a single communication concept, portable applications are easier to construct and maintain. The library is in use on a wide range of multicomputers, shared-memory multiprocessors, and networked workstations, including the Cray T3D, Intel Paragon, IBM SP2, and SGI Power Challenge.

In using the library, an application problem is described and implemented as a graph G = {V, E}, where V is a set of nodes and E is a set of edges, as shown in Figure 1. The nodes correspond to partitions of the problem and the edges correspond to data dependencies inherent in the parallel algorithm. Multiple nodes may be mapped to a single computer, or collection of computers sharing memory, in order to overlap communication and computation. Where possible, nodes can be implemented as light-weight threads. This approach separates the logical structure of the application from the underlying machine, which allows the library to alter the relationship dynamically, during program execution.

Figure 1: The Concurrent Graph

The concurrent graph is constructed via a set of routines that create the nodes and the communication channels between them, as well as install their user-created states. For grid-based computations, a partitioner is provided that builds the communication structure automatically. (For other problems, the user must derive the channels manually.) Once the nodes and channels have been created and the computation initiated, nodes communicate using an abstraction of the low-level RPC mechanism, called nodespawn(). This routine is called with a channel, a function, and the function's arguments. The given function is invoked with its arguments in the context of the node at the other end of the channel. In this way messages, such as boundary data values, can arrive and install themselves asynchronously, eliminating double-buffering of messages and unnecessary sequencing. The library also supports a simple global synchronization mechanism, called simply barrier(), that allows typical gather/scatter operations such as global reductions and broadcasts. These two routines are all that is necessary to express a parallel computation using the concurrent graph.
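To make the two-routine model concrete, the following sketch shows how boundary data might flow through nodespawn() and barrier(). Only the two routine names come from the paper; every type, signature, and stub implementation below is a hypothetical illustration, not the library's actual interface.

    /* Hypothetical sketch of the two-routine programming model. The stub
     * nodespawn() simulates local delivery; a real implementation would use
     * pointer copying (shared memory) or message passing (distributed memory). */
    #include <stdio.h>

    typedef void (*handler_fn)(void *state, void *args);

    typedef struct {              /* a graph edge: who is at the other end (assumed) */
        void *remote_state;
    } Channel;

    /* Assumed signature: run the handler in the context of the node at the
     * other end of the channel, so the message installs itself on arrival. */
    static void nodespawn(Channel *ch, handler_fn fn, void *args) {
        fn(ch->remote_state, args);
    }

    /* Assumed signature: global gather/scatter synchronization point. */
    static void barrier(void) { /* reduction/broadcast would happen here */ }

    /* A handler installs arriving boundary data directly into the node's
     * state, eliminating double-buffering and unnecessary sequencing. */
    static void install_boundary(void *state, void *args) {
        double *ghost = state;
        double *incoming = args;
        ghost[0] = incoming[0];
    }

    int main(void) {
        double neighbor_state[1] = {0.0};
        double my_boundary[1]    = {3.14};
        Channel ch = { neighbor_state };

        nodespawn(&ch, install_boundary, my_boundary); /* send boundary values */
        barrier();                                     /* e.g., global reduction */
        printf("neighbor ghost cell = %g\n", neighbor_state[0]);
        return 0;
    }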
Each node in the graph is composed of four components, as shown in Figure 2. The state of a node is the set of variables or data structures that represent a problem partition. The communication list describes the mapping of nodes to computers. There is a collection of application-specific physics routines used to implement each partition. These are encapsulated behind appropriate software interfaces. Finally, there are some other functions that are application and architecture independent but that operate under the assumption that the computation conforms to the architecture of the concurrent graph. These functions are provided by the library and will be described later.

Figure 2: Graph Node

First we describe some typical industrial applications and show how they can be cast in terms of the graph structure. The applications are representative of a wide class of continuum and non-continuum flow problems.
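One way to picture these four components is as a single structure. The layout below is an illustrative assumption; the paper specifies the four roles and the initialize/extract/compute interfaces (see Section 2), but no concrete representation.

    /* Hypothetical in-memory layout of a graph node's four components. */
    typedef struct Channel Channel;            /* edge to a neighboring node */

    typedef struct GraphNode {
        void *state;                           /* variables/data of one partition */

        struct {                               /* communication list: channels and */
            int       n_channels;              /* the node-to-computer mapping     */
            Channel **channels;
        } comm;

        struct {                               /* application-specific physics,    */
            void (*initialize)(void *state);   /* encapsulated behind interfaces   */
            void (*extract)(void *state, void *boundary_out);
            void (*compute)(void *state, double dt);
        } physics;

        /* "other" functions (load measurement, migration, rendering support) are
         * application- and architecture-independent and supplied by the library */
    } GraphNode;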
2 Launch-Vehicle Simulations

The Scalable Concurrent Programming Laboratory, in a collaborative effort with The Aerospace Corporation, has developed a concurrent implementation of the Aerospace Launch System Implicit/Explicit Navier-Stokes code (ALSINS). This general code is the primary fluid dynamics tool used by The Aerospace Corporation for a broad range of practical simulations. Those of interest involve a variety of nozzle flows and multi-body launch vehicle configurations such as the Titan IV shown in Figure 3 [11]. The code utilizes a finite volume TVD scheme for computing both steady state and unsteady solutions to the 3-D compressible Navier-Stokes equations. The scheme is second-order accurate in space and is fully vectorized. A line-by-line relaxation algorithm is used to accelerate the convergence for steady state solutions. The code employs a variety of features that increase its practical utility. These include multi-body support, viscous effects, turbulence modeling, and implicit flow capabilities.

Figure 3: Titan IV Vehicle
Figure 4 illustrates the pressure distribution in a two-dimensional cross section through the supersonic (Mach 1.6) flow-field solution over the Titan IV vehicle. This result was obtained using the concurrent ALSINS code on the Intel Delta machine using 256 compute nodes. The results have been shown to display a close correspondence with experimental wind-tunnel data. The Titan IV vehicle exhibits acoustic buffeting that arises from a recirculation region formed immediately above the main guidance system, adjacent to the payload fairing. This region is present upon close examination of the flow field. In addition to determining buffeting characteristics, the flow-field predictions have a variety of other practical uses: From the calculated pressure field it is possible to determine the forces acting on the vehicle and thereby predict its stability during different regimes of flight. The results can be used to predict the aerodynamic drag of the vehicle, which is useful to engineers in determining payload characteristics. By extending the calculated bow shock structure, it is possible to determine the strength of the sonic boom created by the vehicle. Calculated temperature contours allow the casing temperatures to be quantified, so that appropriate shielding and paint can be designed.

Figure 4: Titan IV Pressure

The concurrent algorithm is based on a domain decomposition whereby the grid is divided into separate partitions. Each partition is solved independently, and appropriate boundary conditions are used to signify the presence of bodies, inflow regions, etc. There is one non-physics boundary condition representing a cut in the domain. This boundary condition represents the fact that communication must be used to solve an area of the flow field. The concurrent algorithm communicates with neighboring partitions to construct inviscid and viscous fluxes at partition interfaces. For example, to compute the inviscid flux E across a cell surface at time t, the second-order accurate numerical scheme requires two cells of information from other partitions at time t − 1, as shown in the difference equations:

$$E^{inv}_{i+1/2,j,k} = E^{inv}\!\left(U^n_{i-1,j,k},\, U^n_{i,j,k},\, U^n_{i+1,j,k},\, U^n_{i+2,j,k}\right)$$

and

$$E^{inv}_{i-1/2,j,k} = E^{inv}\!\left(U^n_{i-2,j,k},\, U^n_{i-1,j,k},\, U^n_{i,j,k},\, U^n_{i+1,j,k}\right)$$

Unfortunately, this communication scheme is not sufficient to calculate the viscous effects, due to cross-derivative terms of the viscous shear. These terms require diagonal corners from adjacent partitions:

$$E^{vis}_{i+1/2,j,k} = E^{vis}\!\left(U^n_{i+1,j\pm1,k},\, U^n_{i+1,j,k\pm1},\, U^n_{i,j\pm1,k},\, U^n_{i,j,k\pm1}\right)$$
The decomposition allows partition interfaces and the grid points on the interface of the decomposition to be mismatched. An example of this gridding structure for a simple double wedge geometry is shown in Figure 5. Notice how the cell boundaries are mismatched along the length of the wedge. This approach substantially simplifies grid generation. It allows isolated regions in the flow field that require a refined mesh (such as the boundary layer) to capture the physics of the flow, without incurring overheads where a large mesh is sufficient.

Figure 5: Mismatched Grid

To use such grids, the entire computational domain is divided into partitions of different size with arbitrary topological arrangement. Blocks are interfaced using an interpolation scheme that constructs the fluxes at the interface and preserves the second-order accuracy and TVD properties of the numerical scheme. The use of a mismatched structure has a number of implications for the concurrent algorithm. Each partition in the domain communicates an arbitrary number of messages, of varying size, at each time step, to arbitrary destinations, thus yielding a static irregular structure. All of the necessary information to construct the messages and predict their destinations can be obtained statically during grid generation.

The abstract algorithm used by this application is shown in Figure 6, cast in terms of the concurrent graph library. Each partition of the problem corresponds to a single node in the graph. The state of each node corresponds to the collection of dependent variables used to represent the flow, i.e., pressure, density, velocity, fluxes, etc. The communication list is built from files constructed during grid generation and represents data dependencies in the numerical scheme. These data dependencies describe values to be extracted from the state and sent between nodes at each iteration. The physics routines associated with the node are encapsulated behind the interfaces provided by the initialize, extract, and compute functions. The latter function performs interpolation on data received during communication and subsequently solves the Navier-Stokes equations, for a single partition of the problem, at a given timestep. In essence, this function is just the original sequential industrial code with a single new boundary condition to represent a cut in the domain. Load balance, in this application, can be achieved statically through bin packing; a sketch of one such packing follows Figure 6.

Side-by-side comparisons of the convergence of the concurrent algorithm with that of the original vector-based code indicated no appreciable differences on large-scale simulations. The critical factor in such applications is the speed with which boundary information is propagated throughout the grid. Domain decomposition allows information to propagate only to a partition boundary within a single timestep; however, by virtue of pipelining, within a few timesteps this information is continually propagated throughout the computational domain. The code is able to realize processor utilization in excess of 70% with only simple mapping strategies.
    partition(...) {
        load geometry data into partition
        initialize state
        calculate local Δt and norm
        gather/scatter to obtain global Δt and norm
        while (time not exhausted) {
            extract partition boundaries
            send partition boundaries to neighbors
            receive adjacent boundaries
            compute dependent variables at t + Δt using neighbor information and Δt
            calculate new local Δt and norm
            gather/scatter to obtain new global Δt and norm
        }
    }
Figure 6: Concurrent Navier-Stokes Algorithm

This has been sufficient to enable a variety of parametric studies of both the Titan IV and Delta II launch vehicles.
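The paper states only that static load balance is achieved through bin packing. The sketch below uses a standard greedy longest-processing-time heuristic as one plausible realization; the actual packing used for ALSINS may differ, and the work estimates are invented for illustration.

    /* Greedy bin packing: assign each partition (heaviest first) to the
     * currently least-loaded computer. */
    #include <stdio.h>

    #define NPART 8
    #define NCOMP 3

    int main(void) {
        /* per-partition work estimates (e.g., cell counts), sorted descending */
        double work[NPART] = {9, 8, 7, 6, 5, 4, 3, 2};
        double load[NCOMP] = {0};
        int owner[NPART];

        for (int p = 0; p < NPART; p++) {
            int best = 0;
            for (int c = 1; c < NCOMP; c++)      /* find least-loaded computer */
                if (load[c] < load[best]) best = c;
            owner[p] = best;
            load[best] += work[p];
        }
        for (int p = 0; p < NPART; p++)
            printf("partition %d -> computer %d\n", p, owner[p]);
        return 0;
    }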
3 Satellite Simulations

Electric propulsion devices are under consideration for a number of space missions, as well as station-keeping applications for communications satellites. The issue of spacecraft contamination resulting from this type of propulsion system is thus receiving increased attention. One such system, the ion thruster, involves a low-energy plasma, created by charge-exchange processes, that can expand around a spacecraft, leading to a current drain on high-voltage surfaces. Enhanced plasma density can also lead to attenuation and refraction of electromagnetic wave transmission and reception. In addition, many thrusters emit heavy metal species, both charged and uncharged, due to erosion; these can easily adhere to spacecraft surfaces. It is important to understand and predict the backflow transport of these species from the plume onto a spacecraft. Thus, a clear understanding of the plumes of electric propulsion thrusters and the transport of contaminating effluents is necessary. Backflow contamination can lead to sputtering and effluent deposition that can affect solar arrays, thermal control surfaces, optical sensors, communications, scientific instrumentation, general structural properties of materials, and spacecraft charging.
The Scalable Concurrent Programming Laboratory, in collaboration with the Space Power and Propulsion Laboratory of the MIT Department of Aeronautics and Astronautics, has recently completed a numerical model of an ion thruster plume and a prototype concurrent code for conducting backflow simulations [9]. The concurrent code applies the plasma particle-in-cell (PIC) technique to the slow charge-exchange (CEX) ions produced in the beam and their transport in the region exterior to the beam. These CEX ions are transported into the backflow region and can present a contamination hazard for the spacecraft. The self-consistent electrostatic potential is determined by solving Poisson's equation over the entire computational domain. Historically, PIC techniques have been applied to only small-scale problems. However, in using parallel computers, like the Intel Delta and Cray T3D, we have been able to develop a fully three-dimensional PIC code that can simulate the backflow over an entire realistic spacecraft. Figure 7 shows the geometry and example results in the cross-section through a three-dimensional calculation of a full-scale ESEX/Argos satellite.

Figure 7: Satellite Simulations

From a computational perspective, this problem involves a computational grid representing the electrostatic field surrounding the spacecraft. In addition, a number of simulation particles are used to represent physical particles emitted from the plume. The particles are moved throughout the computational domain under the influence of the field. At each time step the field is solved self-consistently using a Poisson solver. Thus, the cost of each time step is a function of both the cost of solving the field and the cost of moving particles. Since particle positions are determined dynamically, no static, a priori decomposition can yield a load-balanced calculation; this is a dynamic, irregular computation.

Figure 8 shows the abstract algorithm used by this application, cast in terms of the concurrent graph library. Each node in the graph corresponds to a partition of the grid representing the field around the spacecraft. The state associated with a node is comprised of a portion of the grid and the particles contained within the corresponding portion of physical space. Each partition is solved independently and appropriate boundary conditions are used to signify what should happen at the interface between partitions. As in the Navier-Stokes application, there is one non-physics boundary condition representing a cut in the domain. This boundary condition represents the fact that communication must be used to solve an area of the field or transport particles.
    partition(...) {
        load geometry data into partition
        initialize state
        calculate local Δt and norm
        gather/scatter to obtain global Δt and norm
        while (time not exhausted) {
            move particles by virtue of velocity and Δt
            extract particles that exit partition
            send particles that exit to appropriate partition
            receive new particles from neighboring partitions
            update new particle positions within current partition
            gather/scatter to obtain global norm
            calculate termination condition based on global norm
            while (global norm < termination condition) {
                extract field at boundary of partition
                send boundaries to neighbors
                receive adjacent boundaries from neighbors
                compute single iteration of the field solver
                calculate new local norm
                gather/scatter to obtain new global norm
            }
        }
    }
Figure 8: Concurrent Particle-In-Cell Algorithm
At each timestep, the algorithm has two parts: based on some initial field, particles are injected into the domain and moved according to their velocities. If a particle exits a partition, it is communicated to an appropriate neighboring partition. After all particle movement has been conducted, the field is solved using communication to obtain information related to the field in adjacent partitions. The communication list associated with each node of the graph describes possible destinations for particles that move outside a partition and data dependencies required to implement the field solver. The physics routines used in Figure 8 describe the dynamics of particle movement and the solution of the field. These routines are completely different from those employed in the Navier-Stokes application described in Section 2; however, the general structure of the concurrent algorithm is only a slight refinement of that shown in Figure 6.
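A hedged sketch of the particle-migration step in Figure 8: particles that leave a partition are handed to the owning neighbor. The Particle layout, the slab ownership test, and send_particle() are illustrative assumptions; the real code would ship particles over graph channels with nodespawn().

    #include <stdio.h>

    typedef struct { double x, y, z, vx, vy, vz; } Particle;

    /* assumed: does this x-position still lie inside the local slab partition? */
    static int inside(double lo, double hi, double x) { return x >= lo && x < hi; }

    /* assumed stand-in for nodespawn() on the channel toward the owning neighbor */
    static void send_particle(const Particle *p) {
        printf("migrating particle now at x=%g\n", p->x);
    }

    /* move particles by velocity and dt; compact survivors, ship the rest */
    static int move_particles(Particle *p, int n, double lo, double hi, double dt) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            p[i].x += p[i].vx * dt;
            p[i].y += p[i].vy * dt;
            p[i].z += p[i].vz * dt;
            if (inside(lo, hi, p[i].x))
                p[kept++] = p[i];        /* particle stays in this partition */
            else
                send_particle(&p[i]);    /* extract and send to a neighbor   */
        }
        return kept;                     /* arrivals are installed later by
                                            the receiving node's handler     */
    }

    int main(void) {
        Particle ps[2] = { {0.5,0,0, 1,0,0}, {0.9,0,0, 1,0,0} };
        int n = move_particles(ps, 2, 0.0, 1.0, 0.2); /* second exits at x=1.1 */
        printf("%d particle(s) remain local\n", n);
        return 0;
    }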
4 Plasma Reactor Simulations

Plasmas are used in 30 to 40% of the processing steps in microelectronics fabrication. Furthermore, plasma equipment represents a substantial part of new factory costs. It is now widely recognized that the development of robust plasma processing models and high-performance simulation tools is essential if we are to reduce the significant cost of introducing new processing technologies. Simulation studies have generally employed grossly simplified chemical and physical models in order to reduce computational complexity. Due to the vastly different time scales involved, complete three-dimensional simulations of realistic reactors have not been possible. Methods for reducing the cost of these reactors, while accelerating reactor design and providing for early equipment evaluation by fabricators, are highly desirable. High-performance simulation tools can clearly help meet these needs.

The Scalable Concurrent Programming Laboratory, in collaboration with Intel Corporation and the Phillips Laboratory at Edwards AFB, has developed a concurrent, three-dimensional simulation capability based on the Direct Simulation Monte Carlo (DSMC) method that operates on irregular grids. This capability, called Hawk, has been used by Intel Corporation [10] to provide the first fully three-dimensional simulations of realistic plasma reactors. Hawk has been carefully constructed using modern software engineering practices. As a result, the chemical and surface models are interchangeable. For example, standard models can be used for validation, while more sophisticated plasma details can be used for proprietary reactors. Figure 9 shows the exterior of a typical reactor geometry: the GEC Reference Cell, used primarily for validation. The grid for this geometry was generated using a commercial grid generation tool, ICEM-CFD, that
has been adapted to operate in connection with Hawk.

Figure 9: Plasma Reactor

Figure 10 shows the results from a simulation of this reactor using a 256-node Intel Paragon machine. This simulation demonstrates the flow across the wafer in a two-dimensional section through the reactor. The color range signifies the value of the mean free path between collisions: dark colors indicate a short mean free path and consequently a high incidence of collisions, whereas light colors indicate a long mean free path. Flow is injected at the small lower-right port and exhausts through the large port on the left. Notice the short mean free path at the inlet and the long mean free path at the outlet. This figure illustrates the asymmetric, three-dimensional flow pattern that is typical of low-pressure reactors.

Figure 10: Reactor Simulation

Figure 11 shows the abstract algorithm for this application, cast in terms of the concurrent graph library. From a computational viewpoint this calculation is similar in structure to both the Navier-Stokes and particle-in-cell codes described in Sections 2 and 3, respectively. The main distinction lies in the use of a collisional model of particle motion based on chemical and surface models. As in the other applications, each node in the concurrent graph represents a partition of physical space. The state of a node is in essence the collection of particles contained in a region and a description of the associated electromagnetic field. The communication list is again used to implement data dependencies resulting from particle motion and solution of the field. The physics routines are again completely different from those used in the other applications: they incorporate collision, chemistry, and surface models not present in the other applications.
5 Load Balancing

Recall that each node in the graph may incorporate other, application- and architecture-independent functionality. The graph library has the ability to automatically maintain information on the amount of communication, computation, and idle time for a node and the computer on which it resides. It also provides functions to access this information.
    partition(...) {
        load geometry data into partition
        initialize state
        calculate local statistics
        gather/scatter to obtain global statistics
        while (time not exhausted) {
            move and collide particles according to velocity and Δt
            send away particles that exit current partition
            receive particles from neighboring partitions
            update cells with arriving particles
            gather/scatter to obtain global statistics
            calculate termination condition based on global statistics
            while (global norm < termination condition) {
                send boundaries of field to neighbors
                receive adjacent boundaries from neighbors
                compute single iteration of the field solver
                gather/scatter to obtain new global statistics
            }
        }
    }
Figure 11: Concurrent Direct Simulation Monte Carlo Algorithm
Armed with this information, it is possible to implement dynamic load balancing. The graph library treats the load balancing problem in five phases:
Load Evaluation: The load of the nodes and the total loads of the computers to which they are mapped are calculated.

Profitability Determination: Imbalance is detected, and if the cost of remedying it is exceeded by the benefit of doing so, actual load balancing begins.

Work Transfer Vector Calculation: Based on load measurements, the ideal amount of work to transfer between computers is calculated.

Task Selection: Tasks are selected for transfer or exchange to best fulfill the ideal work transfer vectors. Task selection may be constrained by communication locality.

Task Migration: The selected tasks move to their final destinations with assistance from user-supplied routines that pack, unpack, and free a node's state. A fault-tolerant protocol handles cases of memory allocation failure.
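The phase structure maps naturally onto a per-step balancing routine. The skeleton below is hypothetical: the paper defines the five phases, but all of the function names, signatures, and stub values are assumptions made for illustration.

    #include <stdio.h>

    /* Hypothetical stubs for the five phases */
    static double evaluate_load(void)           { return 0.3; }            /* 1 */
    static int    profitable(double imbalance)  { return imbalance > 0.1; }/* 2 */
    static void   compute_transfer_vectors(double *t, int n)               /* 3 */
        { for (int i = 0; i < n; i++) t[i] = 0.05; }
    static int    select_tasks(const double *t, int n)                     /* 4 */
        { (void)t; (void)n; return 42; }  /* constrained by communication locality */
    static void   migrate_tasks(int task)                                  /* 5 */
        { printf("node-move() task %d\n", task); } /* pack/move/unpack state */

    static void balance_step(void) {
        double transfer[6];                 /* one entry per mesh neighbor (assumed) */
        double imbalance = evaluate_load();
        if (!profitable(imbalance))
            return;                         /* cost outweighs benefit: do nothing */
        compute_transfer_vectors(transfer, 6);  /* e.g., the diffusion algorithm  */
        migrate_tasks(select_tasks(transfer, 6));
    }

    int main(void) { balance_step(); return 0; }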
The graph library incorporates a default transfer vector calculation algorithm that utilizes empirical performance information. This algorithm is based on the notion of heat diffusion: an increase in the workload at a given computer is treated as heat to be diffused to other computers [3]. This method distinguishes itself from other "diffusion-like" methods, such as gradient algorithms, because it is based on the evolution of the actual parabolic partial differential equation describing heat diffusion. It is amenable to rigorous analysis of its correctness and convergence rate. Figure 12 gives the diffusion algorithm, based on a second-order accurate finite differencing scheme [12]; it consists of a simple arithmetic iteration which is performed concurrently by every computer in a parallel machine.

The diffusive load balancing method has a variety of attractive qualities. It is simple to implement, involving only nearest-neighbor communication. It is guaranteed to converge for arbitrary, asynchronously introduced load imbalances, and its rate of convergence has been determined analytically [3]. The algorithm allows a trade-off to be made between the quality of load balance and the time required to achieve the balance. Finally, the method maintains locality present in the original graph in the balanced graph.

Once the ideal amount of work to transfer has been determined, nodes are selected for transfer or exchange between adjacent computers. The selection process involves either exhaustive or partial searches depending on the number of nodes involved: the more nodes, the less thorough the search. The necessity of maintaining communication locality may further constrain node selection. Nodes are eliminated from consideration if their movement would increase the distance to their neighbors beyond a fixed threshold.
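In continuous form, the analogy can be written out explicitly. The display below is a sketch of the standard explicit first-order discretization that the second-order scheme of Figure 12 refines; the notation N(i) for the mesh neighbors of computer i is ours:

    % load u_i on computer i treated as heat diffusing on the processor mesh
    \[
      \frac{\partial u}{\partial t} \;=\; \alpha \nabla^2 u
      \qquad\Longrightarrow\qquad
      u_i^{t+1} \;=\; u_i^{t} \;+\; \alpha \sum_{j \in N(i)} \bigl(u_j^{t} - u_i^{t}\bigr)
    \]

Here u_i is the measured load on computer i, and the work accumulated for transfer along an edge (i, j) is proportional to the flow α(u_i − u_j).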
    diffuse(...)
        ν := ln ε / ln(3α/(1 + 3α))
        t_ij := 0 for each neighbor j ∈ N(i)
        send u_i to all neighbors j ∈ N(i)
        receive u_j from all neighbors j ∈ N(i)
        while u_max > (1 + ε) u_avg do
            t_ij := t_ij + (α/2)(u_i − u_j) for each neighbor j ∈ N(i)
            u_i^(0) := u_i + (α/2)[(Σ_{j∈N(i)} u_j) − 6 u_i]
            for k := 1 to ν do
                send u_i^(k−1) to all neighbors j ∈ N(i)
                receive u_j^(k−1) from all neighbors j ∈ N(i)
                u_i^(k) := u_i^(0)/(1 + 3α) + α/(2(1 + 3α)) Σ_{j∈N(i)} u_j^(k−1)
            end for
            u_i := u_i^(ν)
            send u_i to all neighbors j ∈ N(i)
            receive u_j from all neighbors j ∈ N(i)
            t_ij := t_ij + (α/2)(u_i − u_j) for each neighbor j ∈ N(i)
        end while
    end diffuse

    (Here α is the diffusion coefficient, ε the balance tolerance, ν the number of inner relaxation iterations, and N(i) the mesh neighbors of computer i.)
Figure 12: Diffusive Load Balancing Algorithm

After the load balancing algorithm has determined which and how much work to move, it is necessary to effect the movement of work without affecting the communication structure present in the application. Recall that this information represents data dependencies present in the problem to be solved. These dependencies do not change under differing mappings of partitions to computers. The graph library provides a single operation, node-move(), that allows a node to be moved from a source computer i to a destination computer j so as to effect a change in workload. This operation is illustrated in Figure 13. The operation automatically re-orients the communication list to reflect the movement of the node such that inter-node communication is unchanged. This is achieved through a simple handshake protocol: A ghost node is left at computer i to forward the first incoming message on each channel to computer j. The sender associated with the incoming message is notified of the change in location with a "moved" message, and it handshakes with the ghost node when it has recorded the new location j of the moved node. Subsequent messages to the remapped node will thus proceed directly to computer j. When acknowledgements of the change have been received by each neighboring node, the ghost node terminates. This organization allows nodes to be moved
between computers with only local communication and updates.
Figure 13: Workload Distribution

The algorithm described above has been used to dynamically balance portions of the plasma reactor simulations described in Section 4. The GEC grid described there was divided into 2,560 partitions and mapped onto 256 processors of an Intel Paragon. Because of the wide variance in particle density for each partition, the overall efficiency of the computation was only 11 percent. The efficiency was improved to 86 percent by load balancing, resulting in an 88 percent reduction in run time.

The ion thruster simulations described in Section 3 also benefited from load balancing. In this simulation, for a carefully hand-crafted static decomposition of 1,575 partitions, average processor utilization varied from 58 to 42 percent over a 500-hour run on a 256-processor Cray T3D. Initial application of the load balancing algorithm proved ineffective, raising the efficiency to only 60 percent in the worst case. The reason for this is that each phase of the PIC code (field solve and particle movement) has very different load distribution properties. The solution to this problem is to view load as a vector rather than a scalar, where each component of the vector is the load of a phase of the computation. The modification of the diffusion algorithm is trivial; each scalar u and t in Figure 12 is replaced by a vector. Initial implementations of the vector-based load balancing strategy have improved the efficiency of the PIC code to 72 percent, reducing run time by over 25 percent. Further improvements in the load balance of the PIC code require the use of dynamic granularity adjustment to increase work movement options.
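The node-move() handshake admits a compact sketch. Only the roles come from the paper (the ghost forwards the first message on each channel, senders are told "moved", the ghost dies after all acknowledgements); the message layout and every name below are assumptions.

    #include <stdio.h>

    enum { FORWARDING, DONE };

    typedef struct {
        int new_computer;   /* destination j of the moved node */
        int pending_acks;   /* channels whose senders have not yet handshaken */
        int state;
    } Ghost;

    /* first message on a channel arrives at the ghost left on computer i */
    static void ghost_receive(Ghost *g, int sender, const char *msg) {
        printf("forwarding \"%s\" to computer %d\n", msg, g->new_computer);
        printf("telling sender %d: node moved to %d\n", sender, g->new_computer);
    }

    /* sender has recorded the new location j and handshakes back */
    static void ghost_ack(Ghost *g, int sender) {
        (void)sender;
        if (--g->pending_acks == 0) {
            g->state = DONE;       /* every neighbor now sends directly to j */
            printf("ghost terminating\n");
        }
    }

    int main(void) {
        Ghost g = { 7, 2, FORWARDING };
        ghost_receive(&g, 0, "boundary data");  /* forwarded, sender 0 notified */
        ghost_ack(&g, 0);
        ghost_receive(&g, 1, "particles");
        ghost_ack(&g, 1);                       /* last ack: ghost dies */
        return 0;
    }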
6 Adaptive Computations

Although it is valuable to be able to adjust the workload at each computer, this property is less useful unless it is also possible to adjust the granularity of the computation. To achieve this, the graph library provides two basic functions that operate on nodes: node-split() and node-merge(). The node-split() operation is illustrated in Figure 14. It allows a single node to be
decomposed dynamically into two nodes. This is achieved using application-specific functions provided as arguments to the operation. For grid-based computations the partitioner can be called internally to further subdivide the state of a node. These functions decompose the state and communication list associated with the node to be partitioned. A portion of the state is provided to each resulting node. The communication list is also updated to reflect changes in the inter-node communication structure. As in the case of the node-move() operation, a ghost node is created that is used to forward messages. In this case the ghost node selectively routes messages to the appropriate new node and, through a simple protocol, updates the sender's communication list. This technique is scalable in that it allows the nature of the decomposition to be changed dynamically using a completely local, controlled modification to the graph structure. The operation is carried out at a single computer and involves only simple data structure manipulation. Nodes created within the same computer or collection of computers sharing memory may utilize lightweight threads for execution.
Figure 14: Graph Refinement

The node-merge() operation is illustrated in Figure 15. This operation allows two nodes to be combined to form a new node. The states and communication lists for the two nodes are combined using application-specific functions provided as arguments to the operation. For grid-based computations, library routines can be used to merge adjacent partitions. In this case, one of the nodes being merged acts as a ghost node so as to engage in the normal protocols for establishing direct communication paths within the graph structure. In general, it is valuable to attempt to merge nodes within the same computer where possible. This operation is more efficient because it is based only on thread operations and not message passing.

The node-split() and node-merge() operations can be used recursively to dynamically alter the granularity of workload at a given computer. In combination with node-move(), these functions can be used to implement a wide range of load balancing and adaptive techniques. These techniques are application independent and can be reused without change. They lead to a scalable, asynchronous view of irregular, adaptive applications that exhibits substantial locality of reference and emphasizes local communication.
Figure 15: Graph Coalescing
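A minimal sketch of the user-supplied decomposition and combination functions these operations expect. The paper names node-split() and node-merge() and says they take application-specific functions; the State layout and C signatures below are assumptions for illustration.

    #include <stdio.h>

    typedef struct { int cells; } State;   /* stand-in partition state (assumed) */

    /* user-supplied: divide one state (and its communication list) in two */
    static void split_state(State *in, State *out1, State *out2) {
        out1->cells = in->cells / 2;
        out2->cells = in->cells - out1->cells;
    }

    /* user-supplied: combine two states into one */
    static State merge_state(State a, State b) {
        State m = { a.cells + b.cells };
        return m;
    }

    int main(void) {
        State heavy = { 1000 }, s1, s2;
        split_state(&heavy, &s1, &s2);      /* node-split(): halve an overloaded node */
        State merged = merge_state(s1, s2); /* node-merge(): recombine light nodes   */
        printf("split %d -> %d + %d, merged back to %d\n",
               heavy.cells, s1.cells, s2.cells, merged.cells);
        return 0;
    }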
7 Visualization

As parallel machines continue to scale, the data sets produced by large-scale simulations grow increasingly large. Scientists need to be able to obtain a quick visual check of their data, as well as perform detailed analysis. It would therefore be valuable to connect to a running parallel application, step into its data structures, and walk around inside the changing data in order to understand its evolution. To achieve this interactive exploration of a large data set, it is necessary to allow a computation access to appropriate rendering facilities. These facilities are another example of a functionality that can be incorporated into a single node of the concurrent graph, as shown in Figure 16.

Figure 16: Visualization Node

While the algorithms required to visualize simple structured data sets are well understood, those for visualization of large irregular data sets are in their infancy. The Scalable Concurrent Programming Laboratory, in conjunction with Silicon Graphics Inc., has developed concurrent algorithms for the visualization of large scientific data sets on parallel machines [8]. Figure 17 shows part of the visualization of a large medical volume database. The black lines in the figure show 32 processors adapting themselves to an irregular load distribution. Since this example uses a regular data set, "empty" space within the bounds of the dataset actually requires more expense to render than space containing data: rays that strike opaque material can be terminated before passing all the way through the volume. The rendering algorithms are the first to achieve interactive frame rates on the SGI Power Challenge technology and were demonstrated at both Supercomputing '94 and '95, and at HPCN '95. The data set shown in Figure 17 contains 357 Mbytes, and can be
rendered on a 128-processor Power Challenge Array, at D1 resolution (640x480), at approximately 8 to 10 frames per second.
Figure 17: Example Visualization

Interestingly, the workload resulting from the rendering algorithms is itself an example of an irregular computation and requires dynamic load balancing to keep all computers operating at peak efficiency. The algorithms are based on the notion that the next frame to be rendered will be close in view to the last, by virtue of smooth head motion. As a result, the algorithms are able to predict the likely position of load imbalance and correct the load distribution on-the-fly during rendering.
8 Related Work

Two other libraries that provide support for parallel programming, CHAOS [4] and Cilk [1], are particularly interesting. CHAOS provides a framework for data and control decomposition of irregular, adaptive array-based codes via index translation and communication scheduling. It differs from our approach in that it is appropriate only for FORTRAN-style regular data structures and in that the communication structure is determined implicitly by the reference patterns in the code. While the methods worked well for the two applications presented, they appear to be ill-suited for applications with irregular data structures such as linked lists and trees. Cilk provides a multithreaded environment with integrated load balancing. It is best applied to tree-structured computations, however, and does not fit the SPMD style typical of scientific applications.

A variety of efficient (and informal) load balancing methods have appeared in the literature, such as the Rediflow algorithm of Keller et al. and the improved gradient model of Muniz and Zaluska [5, 7]. Although such approaches are intuitively persuasive, they may also be incorrect and lead to unbalanced loads or unstable
behavior. Diffusive techniques are a better alternative, providing scalability and provable performance. The first analytical work on a correct diffusive dynamic load balancing strategy is due to Cybenko [2]. In addition to an optimal diffusive method for hypercube architectures, he presents an elegant analysis of a general iterative scheme for arbitrary interconnection patterns. His scheme distributes the computational work across all of the affected processors but does not restrict itself to nearest-neighbor communication. Our algorithm is the first diffusive method for mesh architectures which is scalable and has rigorous proofs of correctness and convergence. We are currently exploring implementations of this and other models, including the vector strategy mentioned in Section 5, as well as techniques for heterogeneous and hybrid architectures.
9 Conclusion

As parallel machines continue to scale, there is a steady trend toward increasingly realistic, large-scale simulations across the board, from continuum to non-continuum flow regimes. The complexity of geometries in such simulations forces us to move in the direction of automatic grid generation techniques that produce complex irregular grids. Inevitably, the next step is to pose what-if style questions involving moving boundaries: What if a door opens in flight? What if there is a structural failure in the tail? What if a particular valve is opened? And so on. These factors lead inexorably in the general direction of large-scale, irregular concurrent computations. These calculations inevitably require dynamic partitioning and load balancing techniques to effectively utilize a large parallel machine.

It is important to provide a clear conceptual framework for dealing with these calculations to ensure that the resulting programs are scalable, portable, and maintainable. Programs can only be maintained if they employ sound software engineering principles of encapsulation and information hiding. By clearly addressing these issues it is possible to protect software investments as new technology arrives and machines continue to scale. This paper has described one cohesive approach to these questions and attempts to raise the issues in the context of a representative class of applications in both continuum and non-continuum fluid dynamics. Our intent is not to push the envelope in the design of new mathematics but rather to explore the methods currently in use on industrial problems and provide a growth path to parallel machines. The concepts we describe allow software reuse between applications on widely different types of architecture and allow generic load balancing techniques to be utilized. They lead naturally to the conclusion that applications should be structured so that a partition is a conceptually distinct entity that is able to
execute concurrently, interact with other partitions, move between computers, dynamically adjust its granularity, render itself, and allow interactive understanding and modification of its data structures.
Ongoing research in the Scalable Concurrent Programming Laboratory at Caltech seeks to clarify the issues related to this general precept and to impact large-scale problems through direct application of the basic technology described in this paper. This technology is in its infancy, and significant practical experimentation is still required to quantify many of the trade-offs. However, the technology has already had substantial impact on industrial-strength problems in a broad range of application areas.
References

[1] R. Blumofe, et al., "Cilk: An Efficient Multithreaded Runtime System," Proc. Fifth ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming, pp. 207-216, ACM Press, 1995.

[2] G. Cybenko, "Dynamic Load Balancing for Distributed Memory Multiprocessors," J. Parallel and Distributed Computing, 7:279-301, 1989.

[3] A. Heirich and S. Taylor, "A Parabolic Load Balancing Algorithm," Proc. 24th Int'l Conf. on Parallel Programming, vol. 3, pp. 192-202, CRC Press, 1995.

[4] Y.-S. Hwang, et al., "Runtime and Language Support for Compiling Adaptive Irregular Problems on Distributed-Memory Machines," Software: Practice and Experience, 25:597-621, 1995.

[5] F. Lin and R. Keller, "The Gradient Model Load Balancing Method," IEEE Trans. Software Engineering, 1:32-38, 1987.

[6] D. Maskit and S. Taylor, "A Message-Driven Programming System for Fine-Grain Multicomputers," Software: Practice and Experience, 24:953-980, 1994.

[7] F. Muniz and E. Zaluska, "Parallel Load-Balancing: An Extension to the Gradient Model," Parallel Computing, 21:287-301, 1995.

[8] M. Palmer and S. Taylor, "Interactive Volume Rendering on Shared-Memory Multiprocessors," to appear in Proc. Parallel CFD '95, North Holland, 1996.

[9] R. Samanta Roy, D. Hastings, and S. Taylor, "Three-Dimensional Plasma Particle-in-Cell Calculations of Ion Thruster Backflow Contamination," to appear in the Journal of Computational Physics.

[10] S. Shankar, M. Rieffel, S. Taylor, D. Weaver, and A. Wulf, "Low Pressure Neutral Transport Modeling for Plasma Reactors," Proc. Workshop on Industrial Applications of Plasma Chemistry, vol. A, pp. 31-40, 1995.

[11] S. Taylor and J. Wang, "Launch Vehicle Simulations Using a Concurrent, Implicit Navier-Stokes Solver," to appear in the Journal of Spacecraft and Rockets.

[12] J. Watts, "A Practical Approach to Dynamic Load Balancing," Master's thesis, Caltech Computer Science Department, TR-95-13, 1995.