J Supercomput DOI 10.1007/s11227-007-0110-z

Enabling scalable parallel implementations of structured adaptive mesh refinement applications

Sumir Chandra · Xiaolin Li · Taher Saif · Manish Parashar

© Springer Science+Business Media, LLC 2007

Abstract Parallel implementations of dynamic structured adaptive mesh refinement (SAMR) methods lead to significant runtime management challenges that can limit their scalability on large systems. This paper presents a runtime engine that addresses the scalability of SAMR applications with localized refinements and high SAMR efficiencies on large numbers of processors (up to 1024 processors). The SAMR runtime engine augments hierarchical partitioning with bin-packing based load-balancing to manage the space-time heterogeneity of the SAMR grid hierarchy, and includes a communication substrate that optimizes the use of MPI non-blocking communication primitives. An experimental evaluation on the IBM SP2 supercomputer using the 3-D Richtmyer-Meshkov compressible turbulence kernel demonstrates the effectiveness of the runtime engine in improving SAMR scalability.

Keywords Structured adaptive mesh refinement · SAMR scalability · Hierarchical partitioning · Bin-packing based load-balancing · MPI non-blocking communication optimization · 3-D Richtmyer-Meshkov application

S. Chandra () · X. Li · T. Saif · M. Parashar
The Applied Software Systems Laboratory, Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA
e-mail: [email protected]

X. Li
Scalable Software Systems Laboratory, Department of Computer Science, Oklahoma State University, 219 MSCS, Stillwater, OK 74078, USA
e-mail: [email protected]

T. Saif
e-mail: [email protected]

M. Parashar
e-mail: [email protected]


1 Introduction

Dynamic adaptive mesh refinement (AMR) [3, 5, 29] methods for the numerical solution of partial differential equations (PDEs) employ locally optimal approximations and can yield highly advantageous cost/accuracy ratios when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions with large local solution error. Structured adaptive mesh refinement (SAMR) methods are based on uniform patch-based refinements overlaid on a structured coarse grid, and provide an alternative to the general, unstructured AMR approach. These methods are being effectively used for adaptive PDE solutions in many domains, including computational fluid dynamics [2, 4, 24], numerical relativity [9, 10], astrophysics [1, 6, 18], and subsurface modeling and oil reservoir simulation [22, 30]. SAMR methods can lead to computationally efficient implementations as they require uniform operations on regular arrays and exhibit structured communication patterns. Furthermore, these methods tend to be easier to implement and manage due to their regular structure.

Parallel implementations of SAMR methods offer the potential for accurate solutions of physically realistic models of complex physical phenomena. These implementations also lead to interesting challenges in dynamic resource allocation, data distribution, load-balancing, and runtime management [7]. Critical among these is the partitioning of the adaptive grid hierarchy to balance load, optimize communication and synchronization, minimize data migration costs, and maximize grid quality (e.g., aspect ratio) and available parallelism. While balancing these conflicting requirements is necessary for achieving high scalability of SAMR applications, identifying appropriate trade-offs is non-trivial [28]. This is especially true for SAMR applications with highly dynamic and localized features that require high SAMR efficiencies1 [5], as they lead to deep hierarchies and small regions of refinement with unfavorable computation to communication ratios. The primary objective of the research presented in this paper is to address the scalability requirements of such applications on large numbers of processors (up to 1024 processors).

In this paper, we present a SAMR runtime engine consisting of two components: (1) a framework that augments hierarchical partitioning with bin-packing based load-balancing to address the space-time heterogeneity and dynamism of the SAMR grid hierarchy, and (2) a SAMR communication substrate that is sensitive to the implementations of MPI non-blocking communication primitives and is optimized to reduce communication overheads. The SAMR runtime engine as well as its individual components are experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. Results demonstrate that the proposed techniques individually improve runtime performance, and collectively enable the scalable execution of applications with localized refinements and high SAMR efficiencies, which was previously not feasible.

1 AMR efficiency is a measure of the effectiveness of AMR and is computed as one minus the ratio of the number of grid points used with AMR to the number required by a uniform mesh.
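In terms of grid-point counts, the AMR efficiency defined in the footnote can be written as follows (the symbols N_AMR and N_uniform are introduced here only for illustration):

\[
\eta_{\mathrm{AMR}} \;=\; 1 - \frac{N_{\mathrm{AMR}}}{N_{\mathrm{uniform}}},
\]

where $N_{\mathrm{AMR}}$ is the number of grid points in the adaptive hierarchy and $N_{\mathrm{uniform}}$ is the number of points a uniform mesh at the finest resolution would require. The RM3D runs reported later in Table 1, for example, have $\eta_{\mathrm{AMR}}$ between roughly 85% and 88%.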


The rest of the paper is organized as follows. Section 2 presents an overview of structured adaptive mesh refinement and related SAMR research efforts. Section 3 analyses the computation and communication behavior of parallel SAMR. Section 4 details the SAMR hierarchical partitioning framework, which includes the greedy partitioner and the level-based and bin-packing based partitioning/load-balancing techniques. Section 5 discusses the behavior of MPI non-blocking communication primitives on the IBM SP2 and outlines the optimization strategy for the SAMR communication substrate. Section 6 describes the experimental evaluation of the SAMR runtime engine and the scalability analysis using the RM3D application. Section 7 presents concluding remarks.

2 Structured adaptive mesh refinement

2.1 Overview of SAMR

The numerical solution to a PDE is obtained by discretizing the problem domain and computing an approximate solution to the PDE at the discrete points. One approach to discretizing the domain is to introduce a structured uniform Cartesian grid. The unknowns of the PDE are then approximated numerically at each discrete grid point. The resolution of the grid (or grid spacing) determines the local and global error of this approximation and is typically dictated by the solution features that need to be resolved. The resolution also determines computational costs and storage requirements. For problems with "well behaved" PDE solutions, it is typically possible to find a grid with uniform resolution that provides the required solution accuracy using acceptable computational and storage resources. However, for problems where the solution contains shocks, discontinuities, or steep gradients, finding such a uniform grid that meets accuracy and resource requirements is not always possible. Furthermore, in these classes of problems, the solution features that require high resolution are localized, causing the high resolution and associated computational effort in other regions of the uniform grid to be wasted. Finally, due to solution dynamics in time-dependent problems, it is difficult to estimate the minimum grid resolution required to obtain acceptable solution accuracy.

In the case of SAMR methods, dynamic adaptation is achieved by tracking regions in the domain that require higher resolution and dynamically overlaying finer grids on these regions. These techniques start with a base coarse grid of minimum acceptable resolution that covers the entire computational domain. As the solution progresses, regions in the domain with high solution error, requiring additional resolution, are identified and refined. Refinement proceeds recursively so that refined regions requiring more resolution are similarly tagged and even finer grids are overlaid on these regions. The resulting grid structure is a dynamic adaptive grid hierarchy. The adaptive grid hierarchy corresponding to the SAMR formulation by Berger and Oliger [5] is illustrated in Fig. 1.

2.2 Related work on SAMR

Currently, there exists a wide spectrum of software infrastructures and partitioning libraries that have been developed and deployed to support parallel and distributed implementations of SAMR applications. Each system represents a unique combination of design decisions in terms of algorithms, data structures, decomposition schemes, mapping and distribution strategies, and communication mechanisms. A selection of related SAMR research efforts is summarized in this section.


Fig. 1 Berger-Oliger SAMR formulation of adaptive grid hierarchies

PARAMESH [16] is a FORTRAN library designed to facilitate the parallelization of an existing serial code that uses structured grids. The package builds a hierarchy of sub-grids to cover the computational domain, with spatial resolution varying to satisfy the demands of the application. These sub-grid blocks form the nodes of a tree data-structure (a quad-tree in 2-D or an oct-tree in 3-D). Each grid block has a logically Cartesian structured mesh.

The Structured Adaptive Mesh Refinement Application Infrastructure (SAMRAI) [12, 31] provides computational scientists with general and extensible software support for the prototyping and development of parallel SAMR applications. The framework contains modules for handling tasks such as visualization, mesh management, integration algorithms, and geometry. SAMRAI makes extensive use of C++ object-oriented techniques and various design patterns such as abstract factory, strategy, and chain of responsibility. Load-balancing in SAMRAI occurs independently at each refinement level, resulting in convenient handling but increased parent-child communication.

The Chombo [8] package provides a distributed infrastructure and a set of tools for implementing parallel calculations using finite difference methods for the solution of partial differential equations on block-structured, adaptively refined rectangular grids. Chombo includes both elliptic and time-dependent modules as well as support for parallel platforms and standardized self-describing file formats.

The Grid Adaptive Computational Engine (GrACE) [20] is an adaptive computational and data-management framework for enabling distributed adaptive mesh refinement computations on structured grids. It is built on a virtual, semantically specialized distributed shared memory substrate with multifaceted objects specialized to distributed adaptive grid hierarchies and grid functions.


The development of GrACE is based on systems engineering principles, focusing on abstract types, a layered approach, and separation of concerns. GrACE allows the user to build parallel SAMR applications and provides support for multigrid methods.

One approach to dynamic load-balancing (DLB) [13] for SAMR applications combines a grid-splitting technique with direct grid movements so as to efficiently redistribute workload among processors and reduce parallel execution time. The DLB scheme is composed of two phases: a moving-grid phase and a splitting-grid phase. The moving-grid phase is invoked after each adaptation and utilizes global information to send grids directly from overloaded processors to underloaded processors. If imbalance still exists, the splitting-grid phase is invoked and splits a grid into two smaller grids along the longest dimension. This sequence continues until the load is balanced within a given tolerance. A modified DLB scheme for distributed systems [14] considers the heterogeneity of processors and the heterogeneity and dynamic load of the network. The scheme employs global load-balancing and local load-balancing, and uses a heuristic method to evaluate the computational gain and redistribution cost for global redistribution.

The stack of partitioners within the SAMR runtime engine presented in this paper builds on a locality-preserving space-filling curve [11, 23, 26] based representation of the SAMR grid hierarchy [21] and enhances it based on localized requirements to minimize synchronization costs within a level (level-based partitioning), balance load (bin-packing based load-balancing), or reduce partitioning costs (greedy partitioner). Moreover, the optimization of the underlying SAMR communication substrate alleviates synchronization costs. Coupled together, these features of the SAMR runtime engine address the space-time heterogeneity and dynamic load requirements of SAMR applications, and enable their scalable implementations on large numbers of processors.

3 Computation and communication behavior for parallel SAMR

In the targeted SAMR formulation, the grid hierarchy is refined both in space and in time. Refinements in space create finer level grids which have more grid points/cells than their parents. Refinements in time mean that finer grids take smaller time steps and hence have to be advanced more often. As a result, finer grids not only have greater computational loads, but also have to be integrated and synchronized more often. This results in a space-time heterogeneity in the SAMR adaptive grid hierarchy. Furthermore, regridding occurs at regular intervals at each level of the SAMR hierarchy and results in refined regions being created, moved, and deleted. Figure 2 illustrates a 2-D snapshot of a sample SAMR application, a combustion simulation [25] that investigates the ignition of a H2-Air mixture. The top half of Fig. 2 depicts the temperature profile of the non-uniform temperature field with 3 hot spots, while the bottom half of the figure shows the mass-fraction plots of various radicals produced during the simulation. The application exhibits high dynamism and space-time heterogeneity, and presents significant runtime and scalability challenges.


Fig. 2 2-D snapshots of a combustion simulation illustrating the ignition of a H2-Air mixture in a non-uniform temperature field with three hot spots (Courtesy: J. Ray, et al., Sandia National Laboratory). The top half of the figure shows the temperature profile while the bottom half shows the mass-fraction plots of radicals

3.1 Parallel SAMR implementations

Parallel implementations of SAMR applications typically partition the dynamic grid hierarchy across available processors, and the processors operate on their local portions of the computational domain in parallel. Each processor starts at the coarsest level, integrates the patches at this level, and performs intra-level or "ghost" communications to update the boundaries of the patches. It then recursively operates on the finer grids using the refined time steps, i.e., for each step on a parent grid, there are multiple steps (equal to the time refinement factor) on a child grid. When the parent and child grids are at the same physical time, inter-level communications are used to inject information from the child to its parent. The solution error at different levels of the SAMR grid hierarchy is evaluated at regular intervals, and this error is used to determine the regions where the hierarchy needs to be locally refined or coarsened. Dynamic re-partitioning and redistribution is typically required after this step.
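This recursive advance can be summarized by the following sketch. The routines integrate_patches, exchange_ghosts, restrict_to_parent, and regrid_if_needed are illustrative placeholders (printing stubs here), not the GrACE or RM3D interfaces:

#include <cstdio>

// Placeholder routines standing in for the real SAMR operations (names are
// illustrative, not GrACE API calls).
static void integrate_patches(int level, double dt) { std::printf("integrate level %d, dt=%g\n", level, dt); }
static void exchange_ghosts(int level)              { std::printf("ghost exchange, level %d\n", level); }
static void restrict_to_parent(int child_level)     { std::printf("restrict level %d -> %d\n", child_level, child_level - 1); }
static void regrid_if_needed(int level)             { /* error estimation and regridding would go here */ }

// Berger-Oliger style recursive advance: for each step on a parent grid there
// are 'refine_factor' steps on its child grid, followed by a restriction.
void advance_level(int level, int max_level, int refine_factor, double dt) {
    integrate_patches(level, dt);
    exchange_ghosts(level);
    if (level < max_level) {
        for (int step = 0; step < refine_factor; ++step)
            advance_level(level + 1, max_level, refine_factor, dt / refine_factor);
        restrict_to_parent(level + 1);
    }
    regrid_if_needed(level);
}

int main() {
    // 3-level hierarchy (levels 0-2), factor 2 time refinement, one coarse step.
    advance_level(0, /*max_level=*/2, /*refine_factor=*/2, /*dt=*/1.0);
    return 0;
}

Tracing the output of this sketch reproduces the pattern in Fig. 3: one coarse step triggers two steps on level 1 and four on level 2 before the next coarse step can start.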


The timing diagram in Fig. 3 (not drawn to scale) illustrates the operation of the SAMR algorithm described above using a 3-level grid hierarchy. For simplicity, only the computation and communication behaviors of processors P1 and P2 are shown. The three components of communication overheads (described in Sect. 3.2) are illustrated in the enlarged portion of the time line. Note that the timing diagram shows one time step on the coarsest level (level 0) of the grid hierarchy followed by two time steps on the first refinement level and four time steps on the second level, before the second time step on level 0 can start. Also note that the computation and communication for each refinement level are interleaved.

The overall efficiency of parallel SAMR applications is limited by the ability to partition the underlying grid hierarchies at runtime to expose all inherent parallelism, minimize communication and synchronization overheads, and balance load. A critical requirement while partitioning these adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, the grid hierarchy is dynamic, and application adaptation results in grids being dynamically created, moved, and deleted. This behavior makes it necessary, and quite challenging, to efficiently re-partition the hierarchy at runtime to both balance load and minimize synchronization overheads.

Fig. 3 Timing diagram for a parallel implementation of Berger-Oliger SAMR algorithm, showing computation and communication behaviors for two processors


3.2 Communication overheads for parallel SAMR

As shown in Fig. 3, the communication overheads of parallel SAMR applications primarily consist of three components: (1) Inter-level communications, defined between component grids at different levels of the grid hierarchy, consist of prolongations (coarse-to-fine grid data transfer and interpolation) and restrictions (fine-to-coarse grid data transfer and interpolation). (2) Intra-level communications, required to update the grid elements along the boundaries of local portions of a distributed grid, consist of near-neighbor exchanges; these communications can be scheduled so as to be overlapped with computations on the interior region. (3) Synchronization costs occur when load is not balanced among processors at any time step or refinement level. Note that there are additional communication costs due to the data movement required during dynamic load-balancing and redistribution.

Clearly, an optimal partitioning of the SAMR grid hierarchy and scalable SAMR implementations require a careful consideration of the timing pattern and the communication overheads described above. A critical observation from Fig. 3 is that, in addition to balancing the total load assigned to each processor, the load balance at each refinement level and the communication/synchronization costs within a level need to be addressed. This leads to a trade-off between (i) maintaining parent-child locality for component grids, which results in reduced communication overheads but possibly high load imbalance, and (ii) "orphaning" component grids (i.e., isolating grid elements at different refinement levels of the SAMR hierarchy), resulting in better load balance but slightly higher synchronization costs.

3.3 Driving SAMR application—RM3D kernel

The driving SAMR application used in this paper is the 3-D compressible turbulence application kernel solving the Richtmyer-Meshkov (RM3D) instability. The RM3D application is part of the virtual test facility (VTF) developed at the ASCI/ASAP Center at the California Institute of Technology2. The Richtmyer-Meshkov instability is a fingering instability which occurs at a material interface accelerated by a shock wave. This instability plays an important role in studies of supernovae and inertial confinement fusion. The RM3D application is dynamic in nature, exhibits space-time heterogeneity, and is representative of the simulations targeted by this research. A selection of snapshots of the RM3D adaptive SAMR grid hierarchy is shown in Fig. 4. In particular, RM3D is characterized by highly localized solution features resulting in small patches and deep application grid hierarchies (i.e., a small region is very highly refined in space and time). As a result, the application has increasing computational workloads and greater communication/synchronization requirements at higher refinement levels, with unfavorable computation to communication ratios. The physics modeled by the RM3D application for detonation in a deforming tube is illustrated in Fig. 5.

2 Center for Simulation of Dynamic Response of Materials—http://www.cacr.caltech.edu/ASAP/


Fig. 4 Snapshots of the grid hierarchy for a 3-D Richtmyer-Meshkov simulation. Note the dynamics of the SAMR grid hierarchy as the application evolves

Table 1 Sample SAMR statistics for RM3D application on 32, 64, and 128 processors of IBM SP2 "Blue Horizon" using a 128*32*32 coarse grid, executing for 50 iterations with 3 levels of factor 2 space-time refinements and regridding performed every 4 steps

Application parameter                    32 processors   64 processors   128 processors
AMR efficiency                           87.50%          85.95%          85.29%
Avg. blocks per processor per regrid     15              12              9
Avg. memory per processor (MB)           360.64          202.72          106.08
Minimum block size                       4               4               4
Maximum block size                       128             128             128

Samples of the SAMR statistics for RM3D execution on 32, 64, and 128 processors of the IBM SP2 "Blue Horizon" are listed in Table 1. The resulting SAMR grid hierarchy characteristics significantly limit RM3D scalability on large numbers of processors. The SAMR runtime engine described in Sects. 4 and 5 addresses these runtime challenges in synchronization, load balance, locality, and communication to realize scalable RM3D implementations.

4 SAMR hierarchical partitioning framework

The hierarchical partitioning framework within the SAMR runtime engine consists of a stack of partitioners that can manage the space-time heterogeneity and dynamism of the SAMR grid hierarchy. These dynamic partitioning algorithms are based on a core Composite Grid Distribution Strategy (CGDS) belonging to the GrACE3 SAMR infrastructure [20].

3 GrACE SPMD data-management framework: http://www.caip.rutgers.edu/TASSL/Projects/GrACE.


Fig. 5 Richtmyer-Meshkov detonation in a deforming tube modeled using SAMR with 3 levels of refinement. The Z = 0 plane is visualized on the right (Courtesy: R. Samtaney, VTF+GrACE, Caltech)

This domain-based partitioning strategy performs a composite decomposition of the adaptive grid hierarchy using space-filling curves (SFCs) [11, 17, 23, 26]. SFCs are locality-preserving recursive mappings from n-dimensional space to 1-dimensional space. At each regridding stage, the new refinement regions are added to the SAMR domain and the application grid hierarchy is reconstructed. CGDS uses SFCs to partition the entire SAMR domain into sub-domains such that each sub-domain keeps all its refinement levels as a single composite grid unit. Thus, all inter-level communications are local to a sub-domain and the inter-level communication time is greatly reduced. The resulting composite grid unit list (GUL) for the overall domain must then be partitioned and balanced across processors.

However, certain SAMR applications with localized refinements and deep hierarchies (such as RM3D) have substantially higher computational requirements at finer levels of the grid hierarchy. As described in Sect. 3.2, maintaining a GUL with composite grid units may result in high load imbalance across processors in such cases. To address this concern, CGDS allows a composite grid unit with high workload (greater than the load-balancing threshold) to be orphaned, i.e., separated into multiple sub-domains, each containing a single level of refinement. An efficient load-balancing scheme within CGDS can use this "orphaning" approach to alleviate processor load imbalances and provide improved application performance despite the increase in inter-level communication costs.

The layered structure of the SAMR hierarchical partitioning framework is shown in Fig. 6. Once the grid hierarchy index space is mapped using the SFC+CGDS scheme, higher-level partitioning techniques can be applied in a hierarchical manner using the hierarchical partitioning algorithm (HPA). In HPA [15], processor groups are defined based on the dynamic grid hierarchy structure and correspond to regions of the overall computational domain.

Fig. 6 Layered design of the SAMR hierarchical partitioning framework within the SAMR runtime engine. SFC = space-filling curves, CGDS = Composite Grid Distribution Strategy, HPA = hierarchical partitioning algorithm, LPA = level-based partitioning algorithm, GPA = greedy partitioning algorithm, BPA = bin-packing based partitioning algorithm

The top processor group partitions the global GUL obtained initially and assigns portions to each processor sub-group in a hierarchical manner. In this way, HPA further localizes communication to sub-groups, reduces global communication and synchronization costs, and enables concurrent communication. Within each processor sub-group, higher-level partitioning strategies are then applied based on the local requirements of the SAMR grid hierarchy sub-domain. The objective could be to minimize synchronization costs within a level using the level-based partitioning algorithm (LPA), to efficiently balance load using the bin-packing based partitioning algorithm (BPA), to reduce partitioning costs using the greedy partitioning algorithm (GPA), or combinations of the above. GPA and BPA form the underlying distribution schemes that can work independently or can be augmented using LPA and/or HPA.

4.1 Greedy partitioning algorithm

GrACE uses a default greedy partitioning algorithm to partition the global GUL and produce a local GUL for each processor. The GPA scheme performs a rapid partitioning of the SAMR grid hierarchy as it scans the global GUL only once while attempting to distribute the load equally among all processors, based on a linear assignment of grid units to processors. If the workload of a grid unit exceeds the processor threshold, it is assigned to the next successive processor and the threshold is adjusted. GPA helps in reducing partitioning costs and works quite well for a relatively homogeneous computational domain with few levels of relatively uniform refinement, and for small-to-medium scale application runs. However, due to the greedy nature of the algorithm, GPA tends to overload the processors encountered near the end of the global GUL, since the load imbalances from previous processors have to be absorbed by these latter processors. Scalable SAMR applications require a good load balance during the computational phase between two regrids of the dynamic grid hierarchy. In applications with localized features and deep grid hierarchies, the load imbalance in GPA at higher levels of refinement can lead to large synchronization delays, thus limiting SAMR scalability.
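The single-scan, linear assignment performed by GPA over the SFC-ordered grid unit list can be sketched as follows. The GridUnit type and the simple threshold test are illustrative placeholders; the GrACE implementation additionally adjusts the threshold and handles oversized grid units:

#include <vector>

// Simplified grid unit: 'load' is the computational workload of a composite unit.
struct GridUnit { double load; };

// Greedy partitioning (GPA-style sketch): scan the SFC-ordered global grid unit
// list once and assign consecutive units to processors until each processor's
// share (total load / number of processors) is reached.
std::vector<int> greedy_partition(const std::vector<GridUnit>& gul, int nprocs) {
    double total = 0.0;
    for (const auto& u : gul) total += u.load;
    const double threshold = total / nprocs;

    std::vector<int> owner(gul.size());
    int proc = 0;
    double assigned = 0.0;               // load assigned to the current processor
    for (std::size_t i = 0; i < gul.size(); ++i) {
        owner[i] = proc;
        assigned += gul[i].load;
        // Move on once the current processor is "full"; the last processor
        // absorbs any remaining imbalance, which is the weakness noted above.
        if (assigned >= threshold && proc < nprocs - 1) {
            ++proc;
            assigned = 0.0;
        }
    }
    return owner;                        // owner[i] = processor assigned to unit i
}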


Fig. 7 Partitions of a 1-D grid hierarchy for (a) GPA, (b) LPA

4.2 Level-based partitioning algorithm

The computational workload for a certain patch of the SAMR application is tightly coupled to the refinement level at which the patch exists. The computational workload at a finer level is considerably greater than that at coarser levels. The level-based partitioning algorithm (LPA) [15] attempts to simultaneously balance load and minimize synchronization cost. LPA essentially preprocesses the global application computational units represented by a global GUL, disassembles them based on their refinement levels, and feeds the resulting homogeneous units at each refinement level to GPA (or any other partitioning/load-balancing scheme). The GPA scheme then partitions this list to balance the workload. Due to this preprocessing, the load on each refinement level is also balanced. LPA benefits from the SFC-based technique by maintaining parent-child relationships throughout the composite grid and localizing inter-level communications, while simultaneously balancing the load at each refinement level, which reduces the synchronization cost, as demonstrated by the following example.

Consider the partitioning of a one-dimensional grid hierarchy with two refinement levels, as illustrated in Fig. 7. For this 1-D example, GPA partitions the composite grid unit list into two sub-domains. These two parts contain exactly the same load: the load assigned to P0 is 2 + 2 × 4 = 10 units, while the load assigned to P1 is also 10 units. From the viewpoint of the GPA scheme, the partition is perfectly balanced, as shown in Fig. 7a. However, due to the heterogeneity of the SAMR algorithm, this distribution leads to large synchronization costs, as shown in the timing diagram (Fig. 8a). The LPA scheme takes these synchronization costs at each refinement level into consideration. For this simple example, LPA produces the partition shown in Fig. 7b, which results in the computation and communication behavior depicted in Fig. 8b. As a result, there is an improvement in overall execution time and a reduction in communication and synchronization time.
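The preprocessing step of LPA can be sketched as follows; the GridUnit type, the grouping by refinement level, and the embedded per-level balancer (a greedy scan standing in for GPA or BPA) are illustrative placeholders rather than the actual GrACE interfaces:

#include <map>
#include <vector>

// Simplified grid unit: a patch (or composite-unit slice) at one refinement level.
struct GridUnit { int level; double load; };

// Level-based partitioning (LPA-style sketch): disassemble the global grid unit
// list by refinement level, then balance each homogeneous per-level list
// independently, so that the load at every refinement level -- not just the
// total load -- is balanced.
std::map<int, std::vector<int>> level_based_partition(
        const std::vector<GridUnit>& gul, int nprocs) {
    // Group unit indices by refinement level, preserving SFC order within a level.
    std::map<int, std::vector<std::size_t>> by_level;
    for (std::size_t i = 0; i < gul.size(); ++i)
        by_level[gul[i].level].push_back(i);

    // owner_per_level[level][k] = processor owning the k-th unit of that level.
    std::map<int, std::vector<int>> owner_per_level;
    for (const auto& [level, indices] : by_level) {
        // Stand-in per-level balancer: greedy threshold scan as in GPA.
        double total = 0.0;
        for (auto i : indices) total += gul[i].load;
        const double threshold = total / nprocs;

        std::vector<int> owner(indices.size());
        int proc = 0;
        double assigned = 0.0;
        for (std::size_t k = 0; k < indices.size(); ++k) {
            owner[k] = proc;
            assigned += gul[indices[k]].load;
            if (assigned >= threshold && proc < nprocs - 1) { ++proc; assigned = 0.0; }
        }
        owner_per_level[level] = owner;
    }
    return owner_per_level;
}

Applied to the 1-D example of Fig. 7, this per-level balancing assigns half of the level-0 units and half of the level-1 units to each processor, which is exactly the Fig. 7b distribution.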


Fig. 8 Timing diagrams showing computation and communication behaviors for a 1-D grid hierarchy partitioned using (a) GPA, (b) LPA

4.3 Bin-packing based partitioning algorithm

The bin-packing based partitioning algorithm (BPA) improves the load balance during the SAMR partitioning phase. The computational workload associated with a GUL at different refinement levels of the SAMR grid hierarchy is distributed among available processors. The distribution is performed under constraints such as minimum block size (granularity) and aspect ratio. Grid units with loads larger than the threshold limit are decomposed geometrically along each dimension into smaller grid units, as long as the granularity constraint is satisfied. This decomposition can occur recursively for a grid unit if its workload exceeds the processor threshold. If the workload is still high and the orphaning strategy is enabled, the grid units with minimum granularity are separated into multiple uni-level grid elements for better load balance.

Initially, BPA distributes the global GUL workload among processors based on the processor load threshold, in a manner similar to GPA. A grid unit that cannot be allocated to the current processor, even after decomposition and orphaning, is assigned to the next consecutive processor. However, no processor accepts work greater than the threshold in this first phase. Grid units representing loads left unallocated after the first phase are distributed among processors using a "best-fit" approach. If no processor matches the load requirements of an unallocated grid unit, a "most-free" approach (i.e., the processor with the least load accepts the unallocated work) is adopted, until all the work in the global GUL is assigned. BPA allows the user to set a tolerance value that determines the acceptable workload imbalance for the SAMR application.

In the case of BPA, the load imbalance, if any, is low since it is bounded by the tolerance threshold. Due to the underlying bin-packing algorithm, the BPA technique provides overall better load balance for the grid hierarchy partitions among processors as compared to the GPA scheme. However, a large number of patches may be created as a result of multiple patch divisions.


Also, the load distribution strategy in BPA can result in multiple scans of the grid unit list, which marginally increase the partitioning overheads. A combined approach using LPA and BPA can provide good scalability benefits for SAMR applications since LPA reduces the synchronization costs and BPA yields good load balance at each refinement level. The experimental evaluation presented in Sect. 6 employs this combined approach: it disassembles the application's global GULs into uniform patches at the various refinement levels using level-based partitioning, and then applies bin-packing based load-balancing to the patches at each refinement level of the SAMR grid hierarchy.
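The two-phase assignment described above is sketched below. The WorkUnit type is a placeholder, and the geometric decomposition, orphaning, and aspect-ratio constraints of the real BPA are omitted; only the threshold-limited first pass and the best-fit/most-free second pass are shown:

#include <algorithm>
#include <vector>

// Simplified unit of work at a single refinement level.
struct WorkUnit { double load; };

// Bin-packing based partitioning (BPA-style sketch).
// Phase 1: threshold-limited assignment; units that do not fit are deferred.
// Phase 2: place leftovers "best-fit", falling back to "most-free".
std::vector<int> bin_pack_partition(const std::vector<WorkUnit>& units,
                                    int nprocs, double tolerance) {
    double total = 0.0;
    for (const auto& u : units) total += u.load;
    const double threshold = (total / nprocs) * (1.0 + tolerance);

    std::vector<double> bin(nprocs, 0.0);      // load currently assigned to each processor
    std::vector<int> owner(units.size(), -1);
    std::vector<std::size_t> leftovers;

    // Phase 1: scan units in SFC order; no processor accepts more than the threshold.
    int proc = 0;
    for (std::size_t i = 0; i < units.size(); ++i) {
        while (proc < nprocs && bin[proc] + units[i].load > threshold) ++proc;
        if (proc < nprocs) { owner[i] = proc; bin[proc] += units[i].load; }
        else leftovers.push_back(i);           // deferred to phase 2
    }

    // Phase 2: best-fit -- the processor whose remaining capacity fits the unit
    // most tightly; if nothing fits, most-free -- the least loaded processor.
    for (auto i : leftovers) {
        int best = -1;
        double best_slack = 1e300;
        for (int p = 0; p < nprocs; ++p) {
            double slack = threshold - (bin[p] + units[i].load);
            if (slack >= 0.0 && slack < best_slack) { best = p; best_slack = slack; }
        }
        if (best < 0)
            best = static_cast<int>(std::min_element(bin.begin(), bin.end()) - bin.begin());
        owner[i] = best;
        bin[best] += units[i].load;
    }
    return owner;
}

In the combined approach used in Sect. 6, a routine of this kind would be applied separately to the per-level unit lists produced by LPA, which is what bounds the imbalance at every refinement level.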

5 Optimizing the SAMR communication substrate

Due to irregular load distributions and communication requirements across different levels of the grid hierarchy, parallel SAMR implementations make extensive use of MPI non-blocking primitives to reduce synchronization overheads. A typical implementation is illustrated in Fig. 9. In this implementation, each processor maintains local lists of all messages to be sent to and received from its neighboring processors. As seen in Fig. 9, each process first posts non-blocking receives (MPI_Irecv) and then packs each message from its send list into a buffer and sends it using MPI_Isend. Following each MPI_Isend, a corresponding MPI_Wait is posted to ensure completion of the send operation so that the corresponding send buffer can be freed. Once all the sends are completed, an MPI_Waitall is posted to check completion of the receives. This exchange is the typical ghost communication associated with parallel finite difference PDE solvers.

{
  /* Initiate all non-blocking receives beforehand */
  LOOP
    MPI_Irecv();
  END LOOP
  for each data operation {
    for all neighboring processors {
      Pack local boundary values
      /* Send boundary values */
      MPI_Isend();
      /* Wait for completion of my sends */
      MPI_Wait();
      Free the corresponding send buffer
    }
    /* Wait for completion of all my receives */
    MPI_Waitall();
    Unpack received data
  }
  ***** COMPUTATION CONTINUES *****
}

Fig. 9 Intra-level communication model for parallel SAMR implementations


Typical message sizes for SAMR applications are on the order of hundreds of kilobytes. However, increasing message sizes can affect the behavior of MPI non-blocking communication, thereby influencing the synchronization latencies and scalability of SAMR applications. Asynchronous MPI behavior is limited by the size of the communication buffers and the message-passing implementation. As a result, naive use of these operations without an understanding of the underlying implementation can result in serious performance degradation, often producing synchronous behavior. As a part of this research, we have investigated the behavior of the non-blocking communication primitives provided by popular MPI implementations (such as the MPICH implementation on a Linux Beowulf cluster and the native IBM MPI implementation on an IBM SP2). Since the performance of MPI primitives is sensitive to implementation details, we propose architecture-specific strategies that can be effectively applied within the SAMR communication model to reduce processor synchronization costs [27]. A "staggered sends" approach improves performance for the MPICH implementation on a Beowulf architecture, while the IBM SP2 benefits from a "delayed waits" strategy. The primary objective of this paper is to address and evaluate SAMR scalability on the large numbers of processors available on the IBM SP2. Therefore, the rest of this section focuses on the analysis and optimization of MPI non-blocking communication primitives on the IBM SP2. However, we emphasize that non-blocking communication optimizations are applicable to other architectures as well, depending on the MPI semantics.

5.1 Analysis of MPI non-blocking communication semantics on IBM SP2

This section experimentally investigates the effect of message size on MPI non-blocking communication behavior for the IBM MPI version 3 release 2 implementation on the IBM SP2, and determines the thresholds at which the non-blocking calls require synchronization. In this analysis, the sending process (process 0) issues MPI_Isend (IS) at time-step T0 to initiate a non-blocking send operation. At the same time, the receiving process (process 1) posts a matching MPI_Irecv (IR) call. Both processes then execute unrelated computation before executing an MPI_Wait call (denoted by Ws on the send side and Wr on the receive side) at T3 to wait for completion of the communication. The processes synchronize at the start of the experiment and use deterministic offsets to vary the values of T0, T1, T2, and T3 at each process. The values of T0–T3 are approximately the same on the two processes, and the message size is varied for different evaluations on the IBM SP2. The system buffer size is maintained at the default value.

For smaller message sizes (1 KB), the expected non-blocking MPI semantics are observed. As seen in Fig. 10, IS and IR return immediately. Furthermore, Ws and Wr posted after local computations also return almost immediately, indicating that the message was delivered during the computation. This behavior also holds for larger message sizes (greater than or equal to 100 KB), i.e., the IBM MPI_Isend implementation continues to return immediately as per the MPI specification, and Ws and Wr take minimal time (of the order of microseconds) to return.
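The following is a sketch of this kind of two-process timing experiment; the message size, the dummy computation loop, and the reported timings are illustrative choices and not the exact benchmark used in this study:

#include <mpi.h>
#include <vector>
#include <cstdio>

// Each rank posts its non-blocking call (IS on the sender, IR on the receiver)
// at "T0", performs unrelated computation, and calls MPI_Wait ("Ws"/"Wr") at
// "T3". Comparing the post and wait times for different message sizes shows
// where the implementation starts to synchronize. Run with exactly 2 processes.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) { if (rank == 0) std::printf("run with 2 processes\n"); MPI_Finalize(); return 0; }

    const int count = 125000;                      // 1 MB of doubles; vary to probe thresholds
    std::vector<double> buf(count, 1.0);
    MPI_Request req;

    double t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Isend(buf.data(), count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);   // IS at T0
    else
        MPI_Irecv(buf.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);   // IR at T0
    double t_post = MPI_Wtime() - t0;

    volatile double sink = 0.0;                    // stand-in for unrelated computation
    for (long i = 0; i < 20000000L; ++i) sink += 1e-9 * i;

    double t1 = MPI_Wtime();
    MPI_Wait(&req, MPI_STATUS_IGNORE);             // Ws on rank 0, Wr on rank 1 ("T3")
    double t_wait = MPI_Wtime() - t1;

    std::printf("rank %d: post=%.6f s, wait=%.6f s (sink=%g)\n",
                rank, t_post, t_wait, (double)sink);
    MPI_Finalize();
    return 0;
}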


To analyze the effect of increasing message size on the IBM SP2, the MPI_Wait on the send side (Ws) is moved to T1, i.e., directly after the send, to simulate the situation where one might want to reuse the send buffer. However, the MPI_Wait call on the receive side (Wr) remains at T3. In this case, the non-blocking behavior remains unchanged for small messages. However, for message sizes greater than or equal to 100 KB, it is observed that Ws blocks until Wr is posted by the receiver at T3. This behavior is illustrated in Fig. 11. In an experiment where both processes exchange messages, IS and IR are posted at T0, process 0 posts Ws at T1 while process 1 posts Ws at T2, and both processes post Wr at T3. The message size is maintained at 100 KB. Deadlock is avoided in this case because Ws, which is posted at T1 and blocks on process 0, returns as soon as process 1 posts Ws at T2, rather than waiting for the corresponding Wr at T3.

The IBM SP2 parallel operating environment (POE) imposes a limit, called the "eager limit", on the MPI message size that can be sent out asynchronously. This limit is set by the environment variable MP_EAGER_LIMIT and is directly dependent on the size of the memory that MPI uses and the number of processes in the execution environment [19]. For message sizes less than this eager limit, messages are sent asynchronously. When message sizes exceed the eager limit, the IBM MPI implementation switches to a synchronous mode such that Ws blocks until an acknowledgment arrives from the receiver (Wr is posted), as demonstrated in the above experiment. Note that the synchronization call on the receive side need not be a matching wait. In fact, the receiver may post any call to MPI_Wait (or any of its variants) to complete the required synchronization.

5.2 "Delayed waits" on IBM SP2

The analysis of the IBM MPI implementation demonstrates that the time spent waiting for processes to synchronize, rather than the network latency, is the major source of communication overhead. This problem is particularly significant in applications where communications are not completely synchronized and there is some load imbalance, which is typical of SAMR applications. Increasing the MPI eager limit on the IBM SP2 is not a scalable solution, since increasing the value of this environment variable increases the MPI memory usage per processor, thus reducing the amount of overall memory available to the SAMR application. A more scalable strategy is to address this at the application level by appropriately positioning the IS, IR, Ws, and Wr calls.

Fig. 10 MPI test profile on IBM SP2: Ws and Wr are posted at the same time-step


Fig. 11 MPI test profile on IBM SP2: Ws and Wr are posted at different time-steps

For n = 1 to number_of_messages_to_receive {
  MPI_IRECV (n, msgid_n)
}
*** COMPUTE ***
For n = 1 to number_of_messages_to_send {
  MPI_ISEND (n, send_msgid_n)
  MPI_WAIT (send_msgid_n)
}
MPI_WAITALL (recv_msgid_*)

Fig. 12 Example of a MPI non-blocking implementation on IBM SP2

For n = 1 to number_of_messages_to_receive {
  MPI_IRECV (n, msgid_n)
}
**** COMPUTE ****
For n = 1 to number_of_messages_to_send {
  MPI_ISEND (n, send_msgid_n)
}
MPI_WAITALL (recv_msgid_* + send_msgid_*)

Fig. 13 Optimization of MPI non-blocking implementation on IBM SP2 using delayed waits

The basic optimization strategy consists of delaying Ws until after Wr, as shown in Figs. 12 and 13. To illustrate the strategy, consider a scenario in which two processes exchange a sequence of messages and the execution sequence is split into steps T0–T3. Both processes post MPI_Irecv (IR) calls at T0, and Wall denotes an MPI_Waitall call. Assume that, due to load imbalance, process 0 performs computation until T2 while process 1 computes only until T1. Ws posted on process 1 at T1 will block until process 0 posts Ws at T2. For a large number of messages, this delay can become quite significant. Consequently, to minimize the blocking overhead due to Ws on process 1, it must be moved as close to T2 as possible, corresponding to Ws on process 0. If Ws is removed from the send loop and a collective MPI_Waitall is posted instead, as shown in Fig. 13, process 1 posts all of its non-blocking sends without blocking.


Moreover, by the time process 1 reaches T2, it has already posted IS for all of its messages and is waiting on Wall, thus reducing synchronization delays.
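A minimal runnable instantiation of the Fig. 13 pattern is sketched below; the message count, message size, and buffers are arbitrary choices for illustration and do not reproduce the actual SAMR ghost-exchange code:

#include <mpi.h>
#include <vector>
#include <cstdio>

// "Delayed waits": post all MPI_Irecv calls, then all MPI_Isend calls, and
// complete everything with a single MPI_Waitall instead of an MPI_Wait after
// each send. Run with 2 processes (e.g., mpirun -np 2 ./delayed_waits).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) { if (rank == 0) std::printf("run with 2 processes\n"); MPI_Finalize(); return 0; }

    const int nmsg = 4;                  // messages exchanged in each direction
    const int count = 50000;             // ~400 KB per message, i.e., well beyond the eager-limit regime of Sect. 5.1
    const int peer = 1 - rank;

    std::vector<std::vector<double>> sendbuf(nmsg, std::vector<double>(count, rank));
    std::vector<std::vector<double>> recvbuf(nmsg, std::vector<double>(count, 0.0));
    std::vector<MPI_Request> requests;   // receive requests first, then send requests

    // Post all non-blocking receives up front.
    for (int m = 0; m < nmsg; ++m) {
        MPI_Request r;
        MPI_Irecv(recvbuf[m].data(), count, MPI_DOUBLE, peer, m, MPI_COMM_WORLD, &r);
        requests.push_back(r);
    }

    // ... computation on the interior region would overlap here ...

    // Post all non-blocking sends; do NOT wait after each one.
    for (int m = 0; m < nmsg; ++m) {
        MPI_Request r;
        MPI_Isend(sendbuf[m].data(), count, MPI_DOUBLE, peer, m, MPI_COMM_WORLD, &r);
        requests.push_back(r);
    }

    // Single collective completion of all sends and receives ("delayed waits").
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);

    if (rank == 0) std::printf("exchange of %d messages complete\n", nmsg);
    MPI_Finalize();
    return 0;
}

The essential point is that no MPI_Wait is issued inside the send loop; the send buffers are kept alive until the single MPI_Waitall completes all send and receive requests together.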

6 Scalability evaluation of the SAMR runtime engine

The experimental evaluations of the individual components of the SAMR runtime engine, such as LPA, BPA, and the delayed waits optimization, as well as the overall SAMR scalability, are performed using the RM3D compressible turbulence application kernel (described in Sect. 3.3) on the NPACI IBM SP2 "Blue Horizon" supercomputer. Blue Horizon is a teraflop-scale Power3-based clustered symmetric multiprocessing (SMP) system from IBM, installed at the San Diego Supercomputing Center (SDSC). The machine contains 1,152 processors running AIX that are arranged as 144 SMP compute nodes. The nodes have a total of 576 GB of main memory; each node is equipped with 4 GB of memory shared among its eight 375 MHz Power3 processors. Each node also has several gigabytes of local disk space. Nodes are connected by the Colony switch, a proprietary IBM interconnect.

6.1 Evaluation: LPA and BPA techniques

The LPA and BPA partitioning/load-balancing techniques are evaluated for the RM3D application on 64 processors of Blue Horizon using the GrACE infrastructure. RM3D uses a base grid of 128*32*32 and 3 levels of factor 2 space-time refinements with regridding performed every 8 time-steps at each level. The RM3D application executes for 100 iterations, and the total execution time, synchronization (Sync) time, recompose time, average maximum load imbalance, and the number of boxes are measured for each of the following three configurations: (i) the default GrACE partitioner (GPA), (ii) the LPA scheme using GrACE, and (iii) the LPA + BPA technique using GrACE, i.e., the combined approach.

Figure 14 illustrates the effect of the LPA and BPA partitioning schemes on RM3D application performance. Note that the values plotted in the figure are normalized against the corresponding maximum value. The results demonstrate that the LPA scheme helps to reduce application synchronization time while the BPA technique provides better load balance. The combined approach reduces the overall execution time by around 10% and results in improved application performance.

Fig. 14 Evaluation of RM3D performance (normalized values) for LPA and BPA partitioning schemes on 64 processors of Blue Horizon


The LPA + BPA strategy improves the load balance by approximately 70% as compared to the default GPA scheme. The improvements in load balance and synchronization time outweigh the overheads (an increase in the number of boxes) incurred while performing the optimizations within LPA and BPA. This evaluation is performed at a reduced scale on fewer processors, and hence the performance benefits of the SAMR partitioning framework are lower. However, these hierarchical partitioning strategies become critical when addressing scalable SAMR implementations, as shown later in this paper.

6.2 Evaluation: delayed waits optimization

The evaluation of the delayed waits optimization consists of measuring the message passing and application execution times for the RM3D application with and without the optimization in the SAMR communication substrate. Except for the MPI non-blocking optimization, all other application-specific and refinement-specific parameters are kept constant. RM3D uses a base grid size of 256*64*64 and 3 levels of factor 2 space-time refinements with regridding performed every 8 time-steps at each level, and the application executes for 100 iterations. Figures 15 and 16 compare the execution times and communication times for the two configurations on 64, 128, and 256 processors of Blue Horizon, respectively. The delayed waits optimization helps to reduce the overall execution time, primarily due to the decrease in message passing time. In this evaluation, the reduction in communication time is 44.37% on average and results in improved application performance.

Fig. 15 RM3D performance improvement in total execution time using delayed waits optimization for MPI non-blocking communication primitives on IBM SP2

Fig. 16 Comparison of RM3D message passing performance using delayed waits optimization for MPI non-blocking communication primitives on IBM SP2


6.3 Evaluation: overall RM3D scalability

The evaluation of overall RM3D SAMR scalability uses a base coarse grid of size 128*32*32, and the application executes for 1000 coarse-level time-steps. The experiments are performed on 256, 512, and 1024 processors of Blue Horizon using 4 levels of factor 2 space-time refinements, with regridding performed every 64 time-steps at each level. The partitioning algorithm chosen from the hierarchical SAMR partitioning framework for these large-scale evaluations is LPA + BPA, since the combination of these two partitioners results in reduced synchronization costs and better load balance, as described in Sects. 4 and 6.1. The orphaning strategy and a local refinement approach are used in conjunction with LPA + BPA in this evaluation since the RM3D application exhibits localized refinement patterns and deep hierarchies with large computational and communication requirements. Furthermore, the delayed waits strategy that optimizes MPI non-blocking communication is employed for these RM3D scalability tests to further reduce messaging-level synchronization delays, as discussed in Sect. 6.2. The minimum block size for a patch on the grid is maintained at 16.

The scalability tests on 256, 512, and 1024 processors measure the runtime metrics for the RM3D application. For these three experiments, the overall execution time, average computation time, average synchronization time, and average regridding/redistribution time are illustrated in Figs. 17, 18, 19, and 20, respectively. The vertical error bars in Figs. 18, 19, and 20 represent the standard deviations of the corresponding metrics.

Fig. 17 Scalability evaluation of the overall execution time for 1000 coarse-level iterations of RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon

Fig. 18 Scalability evaluation of the average computation time for 1000 coarse-level iterations of RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the computation times

Table 2 Scalability evaluation for RM3D application on 256, 512, and 1024 processors of Blue Horizon in terms of coefficients of variation (CV) and additional metrics. The RM3D evaluation is performed for 1000 coarse-level iterations on a 128*32*32 base grid with 4 levels of refinement

Application parameter    256 processors   512 processors   1024 processors
Computation CV           11.31%           15.5%            12.91%
Synchronization CV       8.48%            6.62%            7.41%
Regridding CV            2.89%            7.52%            9.1%
Avg. time per sync       0.43 sec         0.38 sec         0.22 sec
Avg. time per regrid     5.35 sec         7.74 sec         9.46 sec

Table 2 presents the coefficients of variation4 (CV) for the computation, synchronization, and regridding times, and the average time per synchronization and regrid operation. As seen in Fig. 17, the scalability ratio of the overall application execution time from 256 to 512 processors is 1.394 (the ideal scalability ratio is 2), yielding a parallel efficiency of 69.72%. The corresponding values of scalability ratio and parallel efficiency as processors increase from 512 to 1024 are 1.661 and 83.05%, respectively. These execution times and parallel efficiencies indicate reasonably good scalability, considering the high computation and communication runtime requirements of the RM3D application. The LPA partitioning, bin-packing based load-balancing, and MPI non-blocking communication optimization techniques collectively enable scalable RM3D runs with multiple refinement levels on large numbers of processors.

The RM3D average computation time, shown in Fig. 18, scales quite well. The computation CV in Table 2 is reasonably low for the entire evaluation, implying good overall application load balance provided by the SAMR runtime engine (LPA + BPA strategy). Note that in the case of large numbers of processors, the RM3D application has a few phases in which there is not enough workload in the domain to be distributed among all processors. In such cases, some processors remain idle during the computation phase, which affects the standard deviations of the computation times. Hence, the computation CVs for 512 and 1024 processors are slightly higher than for 256 processors, as shown in Table 2. However, this lack of computation on large numbers of processors is an intrinsic characteristic of the application and not a limitation of the SAMR runtime engine.

The scalability of the synchronization time is limited by the highly communication-dominated application behavior and unfavorable computation to communication ratios, as described in Sect. 3.3 and observed in Fig. 19. The evaluation exhibits a reasonably low synchronization CV in Table 2, which can be attributed to the communication improvements provided by the SAMR runtime engine (LPA and delayed waits techniques). As observed in Table 2, the average time taken per synchronization operation decreases as the number of processors increases. This is primarily due to the reduction in the size of the grid units owned by each processor, resulting in smaller message transfers for boundary updates. The decrease in synchronization time is proportionately greater when processors increase from 512 to 1024 due to smaller processor workloads, further reduced message sizes, and greater parallelism among message transfers.

4 The coefficient of variation is a dimensionless number defined as the ratio of the standard deviation to the mean for a particular metric, and is usually expressed as a percentage.
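For reference, the two derived quantities used in this discussion are

\[
\mathrm{CV} \;=\; \frac{\sigma}{\mu} \times 100\%,
\qquad
\text{parallel efficiency} \;=\; \frac{\text{measured scalability ratio}}{\text{ideal scalability ratio}},
\]

so that, for example, the 256-to-512 processor ratio of 1.394 against an ideal ratio of 2 gives $1.394/2 \approx 69.7\%$, and the 512-to-1024 ratio of 1.661 gives $1.661/2 \approx 83.05\%$, matching the efficiencies quoted above.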

Fig. 19 Scalability evaluation of the average synchronization time for 1000 coarse-level iterations of RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the synchronization times

Fig. 20 Scalability evaluation of the average regridding time for 1000 coarse-level iterations of RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the regridding times

Moreover, as described earlier, in the case of large numbers of processors some application phases may not have sufficient workload to allocate to all processors during redistribution, forcing some processors to remain idle. In such cases, this can result in reduced synchronization times since the idle processors do not participate in the synchronization process. However, this is not a limitation of the SAMR runtime engine, but a direct consequence of the lack of application dynamics.

The scalability evaluation of the regridding costs for the RM3D application is plotted in Fig. 20. SAMR regridding/redistribution entails error estimation, clustering and refinement of application sub-domains, global domain decomposition/partitioning, and reconstruction of the application grid hierarchy, followed by data migration among processors to reflect the new distribution. The average regridding time increases from 256 to 1024 processors since it is more expensive to perform global decomposition operations for larger numbers of processors. However, the partitioning overheads induced by the SAMR hierarchical partitioning framework are minuscule compared to the average regridding costs. As an example, the overall partitioning overhead for the SAMR runtime engine on 512 processors is 4.69 seconds, which is negligible compared to the average regridding time of 364.01 seconds. The regridding CV and the average time per regrid, noted in Table 2, show a trend similar to the average regridding time: these metrics increase in value for larger numbers of processors.


To avoid prohibitive regridding costs, the redistribution of the SAMR grid hierarchy is performed less frequently, in periods of 64 time-steps on each level. This scalability evaluation on 256–1024 processors analyzed the runtime behavior and overall performance of the RM3D application with a 128*32*32 base grid and 4 levels of refinement. Note that a unigrid RM3D implementation for the same experimental settings would use a domain of size 1024*256*256 with approximately 67 million computational grid points. Such a unigrid implementation would require an extremely large overall memory (a total of 600 GB) that accounts for the spatial resolution (1024*256*256), the grid data representation (8 bytes for "double" data), local copies in memory for grid functions and data structures (approximately 50), storage for the previous, current, and next time states (3 sets in total), and the temporal refinement corresponding to the number of time-steps at the finest level (a factor of 8). With 0.5 GB of memory available per processor on Blue Horizon, this unigrid implementation would require 1200 processors to complete execution, which exceeds the system configuration of 1152 processors. Thus, unigrid implementations are not even feasible for large-scale execution of the RM3D application on the IBM SP2. Moreover, even if such unigrid RM3D implementations were possible on large systems, they would entail a tremendous waste of computational resources since the RM3D application exhibits localized refinement patterns with high SAMR efficiencies. Consequently, a scalable SAMR solution is the only viable alternative for realizing such large-scale implementations, which underlines the primary motivation for this research.
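Written out, the memory estimate above is

\[
1024 \times 256 \times 256 \approx 6.7 \times 10^{7}\ \text{points}
\;\times\; 8\ \text{bytes} \;\approx\; 0.5\ \text{GB per grid function copy},
\]
\[
0.5\ \text{GB} \times 50\ \text{copies} \times 3\ \text{time states} \times 8\ (\text{temporal refinement}) \;=\; 600\ \text{GB},
\]

which, at 0.5 GB of memory per processor, corresponds to the $600/0.5 = 1200$ processors quoted above.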

6.4 Evaluation: SAMR benefits for RM3D performance

To further evaluate the benefits of using SAMR, another experiment compares the RM3D application performance for configurations with identical finest-level resolution but different coarse/base grid sizes and numbers of refinement levels. This evaluation is performed on 512 processors of Blue Horizon, with all other application-specific and refinement-specific parameters kept constant. Note that for every step on the coarse level (level 0), two steps are taken at the first refinement level (level 1), four steps on level 2, and so on. Each configuration has the same finest resolution and executes for the same number of time-steps (in this case, 8000) at the finest level of the SAMR grid hierarchy. The RM3D domain size at the finest level is 1024*256*256 and the evaluation uses the LPA + BPA technique with a minimum block size of 16. The first configuration (4-Level Run) uses a coarse grid of size 128*32*32 with 4 levels of factor 2 space-time refinements, and executes for 1000 coarse-level iterations, which correspond to 8000 steps at the finest level. The second configuration (5-Level Run) uses a 64*16*16 base grid with 5 refinement levels and runs for 500 coarse-level iterations, corresponding to 8000 steps at the finest level, to achieve the same resolution.
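Since the evaluation uses the LPA + BPA technique, the following C++ fragment is a minimal sketch of the greedy bin-packing idea behind the BPA load-balancing step, applied to the patches of a single refinement level. This is our own illustration under stated assumptions, not the paper's implementation: the names Patch and binPackLevel are hypothetical, and the patches are assumed to already respect the minimum block size (here, 16).

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // One SAMR patch at a given refinement level: its computational workload
    // and the processor it is assigned to.
    struct Patch { long work; int owner; };

    // Greedy bin-packing: sort patches by decreasing workload and assign each
    // one to the currently least-loaded processor.
    void binPackLevel(std::vector<Patch>& patches, int numProcs) {
        using Bin = std::pair<long, int>;  // (current load, processor rank)
        std::priority_queue<Bin, std::vector<Bin>, std::greater<Bin>> bins;
        for (int p = 0; p < numProcs; ++p) bins.push({0L, p});

        std::sort(patches.begin(), patches.end(),
                  [](const Patch& a, const Patch& b) { return a.work > b.work; });
        for (Patch& patch : patches) {
            Bin bin = bins.top();           // least-loaded processor so far
            bins.pop();
            patch.owner = bin.second;
            bins.push({bin.first + patch.work, bin.second});
        }
    }

In the actual runtime engine, such an assignment would presumably be recomputed per refinement level during regridding and would drive the subsequent data migration; the real BPA also honors the locality and granularity constraints described earlier in the paper.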

Table 3 RM3D performance evaluation of SAMR on 512 processors of Blue Horizon for different application base grids and varying refinement levels. Both experiments have the same finest resolution and execute for 8000 steps at the finest level

Application parameter        4-Level Run              5-Level Run             Performance
                             (128*32*32 base grid)    (64*16*16 base grid)    improvement
Overall Execution time       10805 sec                6434.62 sec             40.45%
Avg. Computation time        2912.51 sec              1710.15 sec             41.28%
Avg. Synchronization time    2823.05 sec              1677.18 sec             40.59%
Avg. Regridding time         364.01 sec               255.34 sec              29.85%

Table 4 Coefficients of variation (CV) and additional metrics for evaluating RM3D SAMR performance on 512 processors of Blue Horizon using 4 and 5 refinement levels

Application parameter        4-Level Run              5-Level Run
                             (128*32*32 base)         (64*16*16 base)
Computation CV               15.5%                    14.27%
Synchronization CV           6.62%                    7.67%
Regridding CV                7.52%                    4.39%
Avg. time per sync           0.38 sec                 0.32 sec
Avg. time per regrid         7.74 sec                 7.74 sec
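As a consistency check on the two configurations (our arithmetic, assuming factor 2 temporal refinement at every level and counting the base grid as one of the stated levels, so the finest level is level 3 in the 4-Level Run and level 4 in the 5-Level Run):

\[
\text{4-Level Run: } 1000\times 2^{3} = 8000 \text{ finest-level steps},
\qquad
\text{5-Level Run: } 500\times 2^{4} = 8000 \text{ finest-level steps}.
\]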

Table 3 presents the runtime metrics for the 4-Level and 5-Level configurations and the performance improvement obtained with SAMR. Table 4 lists the coefficients of variation for the computation, synchronization, and regridding times for the two configurations, as well as the average time per synchronization and regrid operation. In this evaluation, the error estimator used to determine refinement at each level of the hierarchy is an absolute threshold of 0.005. Note that the quality of the solutions obtained in the two configurations is comparable; however, the grid hierarchies are not identical. It is typically not possible to guarantee identical grid hierarchies for different application base grids with varying refinement levels if the application uses localized refinements and performs runtime adaptations on the SAMR grid hierarchy. However, we reiterate that the two grid hierarchies in this evaluation are comparable, which is validated by the similar average time per regrid (shown in Table 4) for the 4-Level and 5-Level runs.

SAMR techniques seek to improve the accuracy of the solution by dynamically refining the computational grid only in regions with large local solution error. Since the refinements are localized, the 5-Level Run has fewer grid points in the application grid hierarchy at the finest resolution, and hence uses fewer resources than the corresponding 4-Level Run. Consequently, the average times for all runtime metrics shown in Table 3 and the CV values for computation and regridding (in Table 4) are lower for the 5-Level Run. Though the average time for each synchronization operation is lower for the 5-Level Run (as seen in Table 4), the standard deviation per synchronization operation is similar for both configurations. Consequently, the 5-Level Run has a relatively higher synchronization CV than the 4-Level Run, due to deeper SAMR hierarchies requiring more inter-level communications. As observed from Table 3, the overall RM3D execution time improves by around 40% for the 5-Level hierarchy due to the efficiency of the basic Berger-Oliger SAMR algorithm. These results corroborate our claim that the RM3D application exhibits localized refinement patterns with high SAMR efficiencies, and can derive substantial performance benefits from scalable SAMR implementations.
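The performance improvement column in Table 3 is consistent with computing the relative reduction with respect to the 4-Level Run (our reading of the table, shown here for the overall execution time):

\[
\text{improvement} = \frac{T_{\text{4-Level}} - T_{\text{5-Level}}}{T_{\text{4-Level}}}\times 100\%
= \frac{10805 - 6434.62}{10805}\times 100\% \approx 40.45\%.
\]

The same formula reproduces the 41.28%, 40.59%, and 29.85% entries for the computation, synchronization, and regridding times, respectively.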


7 Conclusion

This paper presented a SAMR runtime engine to address the scalability of SAMR applications with localized refinements and high SAMR efficiencies on large numbers of processors (up to 1024 processors). The SAMR runtime engine consists of two components: (1) a hierarchical partitioning/load-balancing framework that can address the space-time heterogeneity and dynamism of the SAMR grid hierarchy, and (2) a SAMR communication substrate that optimizes the use of MPI non-blocking communication primitives. The hierarchical partitioning approach reduces global synchronization costs by localizing communication to processor sub-groups, and enables concurrent communication. The level-based partitioning scheme maintains locality and reduces synchronization overheads, while the bin-packing based load-balancing technique balances the computational workload at each refinement level. The underlying MPI non-blocking communication optimization uses a delayed waits strategy and helps reduce overall communication costs. The SAMR runtime engine as well as its individual components were experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. Results demonstrated that the proposed techniques improve the performance of applications with localized refinements and high SAMR efficiencies, and enable scalable SAMR implementations that have hitherto not been feasible.

Acknowledgements This work was supported in part by the National Science Foundation via grant numbers ACI 9984357 (CAREERS), EIA 0103674 (NGS) and EIA 0120934 (ITR), and by DOE ASCI/ASAP (Caltech) via grant numbers PC295251 and 1052856 awarded to Manish Parashar. The authors would like to thank Johan Steensland for collaboration and valuable research discussions, and Ravi Samtaney and Jaideep Ray for making their applications available for use.

References

1. ASCI Alliance, http://www.llnl.gov/asci-alliances/asci-chicago.html, University of Chicago
2. ASCI/ASAP Center, http://www.cacr.caltech.edu/ASAP, California Institute of Technology
3. Bell J, Berger M, Saltzman J, Welcome M (1994) Three-dimensional adaptive mesh refinement for hyperbolic conservation laws. SIAM J Sci Comput 15(1):127–138
4. Berger M, Hedstrom G, Oliger J, Rodrigue G (1983) Adaptive mesh refinement for 1-dimensional gas dynamics. In: IMACS Trans Sci Comput. IMACS/North Holland, pp 43–47
5. Berger M, Oliger J (1984) Adaptive mesh refinement for hyperbolic partial differential equations. J Comput Phys 53(March):484–512
6. Bryan G (1999) Fluids in the universe: adaptive mesh refinement in cosmology. Comput Sci Eng (March-April):46–53
7. Chandra S, Sinha S, Parashar M, Zhang Y, Yang J, Hariri S (2002) Adaptive runtime management of SAMR applications. In: Sahni S, Prasanna V, Shukla U (eds), Proceedings of 9th international conference on high performance computing (HiPC'02), Lecture notes in computer science, Springer, Bangalore, India, vol 2552, December 2002, pp 564–574
8. CHOMBO, http://seesar.lbl.gov/anag/chombo/, NERSC, ANAG of Lawrence Berkeley National Lab, CA, USA, 2003
9. Choptuik M (1989) Experiences with an adaptive mesh refinement algorithm in numerical relativity. In: Evans C, Finn L, Hobill D (eds), Frontiers in numerical relativity. Cambridge University Press, London, pp 206–221
10. Hawley S, Choptuik M (2000) Boson stars driven to the brink of black hole formation. Phys Rev D 62:104024
11. Hilbert D (1891) Über die stetige Abbildung einer Linie auf Flächenstück. Math Ann 38:459–460
12. Kohn S, SAMRAI Homepage, Structured adaptive mesh refinement applications infrastructure. http://www.llnl.gov/CASC/SAMRAI/, 1999
13. Lan Z, Taylor V, Bryan G (2001) Dynamic load balancing for structured adaptive mesh refinement applications. In: Proceedings of international conference on parallel processing, Valencia, Spain, 2001, pp 571–579
14. Lan Z, Taylor V, Bryan G (2001) Dynamic load balancing of SAMR applications on distributed systems. In: Proceedings of supercomputing conference (SC'01), Denver, CO, 2001
15. Li X, Parashar M (2003) Dynamic load partitioning strategies for managing data of space and time heterogeneity in parallel SAMR applications. In: Kosch H, Boszormenyi L, Hellwagner H (eds), Proceedings of 9th international Euro-Par conference (Euro-Par'03), Lecture notes in computer science, Springer, Klagenfurt, Austria, vol 2790, August 2003, pp 181–188
16. MacNeice P, Olson K, Mobarry C, de Fainchtein R, Packer C (2000) PARAMESH: a parallel adaptive mesh refinement community toolkit. Comput Phys Commun 126:330–354
17. Moon B, Jagadish H, Faloutsos C, Saltz J (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13(1):124–141
18. Norman M, Bryan G (1998) Cosmological adaptive mesh refinement. In: Miyama S, Tomisaka K (eds), Numerical astrophysics. Kluwer, Tokyo
19. Parallel Environment (PE) for AIX V3R2.0, MPI programming guide, 2nd edn, December 2001
20. Parashar M, Browne J (1996) On partitioning dynamic adaptive grid hierarchies. In: Proceedings of 29th annual Hawaii international conference on system sciences, January 1996, pp 604–613
21. Parashar M, Browne J (1995) Distributed dynamic data structures for parallel adaptive mesh refinement. In: Proceedings of international conference for high performance computing, December 1995, pp 22–27
22. Parashar M, Wheeler J, Pope G, Wang K, Wang P (1997) A new generation EOS compositional reservoir simulator: Part II—framework and multiprocessing. In: Society of petroleum engineering reservoir simulation symposium, Dallas, TX, June 1997
23. Peano G (1890) Sur une courbe qui remplit toute une aire plane. Math Ann 36:157–160
24. Pember R, Bell J, Colella P, Crutchfield W, Welcome M (1993) Adaptive Cartesian grid methods for representing geometry in inviscid compressible flow. In: 11th AIAA computational fluid dynamics conference, Orlando, FL, July 1993
25. Ray J, Najm H, Milne R, Devine K, Kempka S (2000) Triple flame structure and dynamics at the stabilization point of an unsteady lifted jet diffusion flame. Proc Combust Inst 28(1):219–226
26. Sagan H (1994) Space-filling curves. Springer
27. Saif T (2004) Architecture specific communication optimizations for structured adaptive mesh refinement applications, M.S. thesis, Graduate School, Rutgers University
28. Steensland J, Chandra S, Parashar M (2002) An application-centric characterization of domain-based SFC partitioners for parallel SAMR. IEEE Trans Parallel Distribut Syst 13(12):1275–1289
29. Steinthorsson E, Modiano D (1995) Advanced methodology for simulation of complex flows using structured grid systems. In: Surface modeling, grid generation, and related issues in CFD solutions, NASA Conference Publication 3291, May 1995
30. Wang P, Yotov I, Arbogast T, Dawson C, Parashar M, Sepehrnoori K (1997) A new generation EOS compositional reservoir simulator: Part I—formulation and discretization. In: Society of petroleum engineering reservoir simulation symposium, Dallas, TX, June 1997
31. Wissink A, Hornung R, Kohn S, Smith S, Elliott N (2001) Large scale parallel structured AMR calculations using the SAMRAI framework. In: Proceedings of supercomputing conference (SC'01), Denver, CO, 2001

Sumir Chandra is a Ph.D. student in the Department of Electrical and Computer Engineering and a researcher at the Center for Advanced Information Processing at Rutgers University in Piscataway, NJ, USA. He has also been a visiting researcher at Sandia National Laboratories in Livermore, California. He received a B.E. degree in Electronics Engineering from Bombay University, India, and an M.S. degree in Computer Engineering from Rutgers University. He is a member of the IEEE, IEEE Computer Society, and Society for Modeling and Simulation International. His research interests include computational science, software engineering, performance optimization, modeling and simulation.

Xiaolin Li is Assistant Professor of Computer Science at Oklahoma State University. He received a B.E. degree from Qingdao University, China, an M.E. degree from Zhejiang University, China, and Ph.D. degrees from the National University of Singapore and Rutgers University, USA. His research interests include distributed systems, sensor networks, and software engineering. He directs the Scalable Software Systems Laboratory (http://www.cs.okstate.edu/~xiaolin/S3Lab). He is a member of the IEEE, a member of the executive committee of the IEEE Computer Society Technical Committee on Scalable Computing (TCSC), and a member of the ACM.

Taher Saif is a Senior Java Consultant at The A Consulting Team Inc., NJ. He has been consulting with TD Ameritrade, a brokerage firm based in NJ, for the past 3 years. Prior to his current position, he completed his Master's in Computer Engineering at Rutgers University, where he was a researcher at the Center for Advanced Information Processing. His research interests include parallel/distributed and autonomic computing, software engineering, and performance optimization.

Manish Parashar is Professor of Electrical and Computer Engineering at Rutgers University, where he is also co-director of the Center for Advanced Information Processing and director of the Applied Software Systems Laboratory. He received a B.E. degree in Electronics and Telecommunications from Bombay University, India, and M.S. and Ph.D. degrees in Computer Engineering from Syracuse University. He has received the Rutgers Board of Trustees Award for Excellence in Research (2004–2005), the NSF CAREER Award (1999), and the Enrico Fermi Scholarship from Argonne National Laboratory (1996). His research interests include autonomic computing, parallel and distributed computing, scientific computing, and software engineering.