Nonuniformly Communicating Noncontiguous Data: A Case Study with PETSc and MPI ∗†

P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur, W. Gropp
Mathematics and Computer Science Division, Argonne National Laboratory
{balaji, buntinas, balay, bsmith, thakur, gropp}@mcs.anl.gov

Abstract

Due to the complexity associated with developing parallel applications, scientists and engineers rely on high-level software libraries such as PETSc, ScaLAPACK and PESSL to ease this task. Such libraries assist application developers by providing abstractions for mathematical operations, data representation and management of parallel layouts of the data, while internally using communication libraries such as MPI and PVM. With high-level libraries managing data layout and communication internally, it can be expected that they organize application data suitably for performing the library operations optimally. However, this places additional overhead on the underlying communication library by making the data layout noncontiguous in memory and the communication volumes (data transferred to each process) nonuniform. In this paper, we analyze the overheads associated with these two aspects (noncontiguous data layouts and nonuniform communication volumes) in the context of the PETSc software toolkit over the MPI communication library. We describe the issues with the current approaches used by MPICH2 (an implementation of MPI), propose different approaches to handle these issues and evaluate these approaches with microbenchmarks as well as an application over the PETSc software library. Our experimental results demonstrate close to an order of magnitude improvement in the performance of a 3-D Laplacian multi-grid solver application when evaluated on a 128-processor cluster.

1 Introduction

Developing large-scale parallel applications and simulations has been a cumbersome and inherently daunting task, owing to the complexity of current-generation parallel systems. Accordingly, scientists and engineers rely on high-level software libraries such as PETSc [1][2], ScaLAPACK [3] and PESSL to ease implementation and minimize the development cycle required to parallelize their applications. These high-level software libraries internally use parallel communication libraries such as the Message Passing Interface (MPI) [6] or the Parallel Virtual Machine (PVM) [20] to move the appropriate data between processes as needed.

Most scientific applications formulate the physical equations corresponding to their problems in discrete numerical forms. Similarly, they represent the domain on which they intend to solve the equations as a grid of data points. Such a formulation involves representing the problem domain (1-D, 2-D or 3-D) as a structured grid, unstructured grid or adaptive mesh, and evaluating the discretized equations using the associated data. High-level software libraries assist such a representation by providing abstractions for mathematical operations, data representation and management of parallel layouts of the data as required by the application. With the high-level libraries managing the data layout internally, it is only intuitive for them to choose to organize the data in representations suitable for performing the library operations optimally.

∗ This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.
† The authors would like to thank Dr. Panda and his team from the Ohio State University for allowing us access to their 64-node InfiniBand cluster.


However, this places additional overhead on the underlying communication library in two aspects:

Noncontiguous data layout: Since the overall data layout is optimized for the library operations, the partial data that needs to be communicated to other processes tends to be laid out noncontiguously in memory. For example, representing a 2-D grid as a contiguous vector might be optimal for computational purposes. However, if a process needs to communicate a column of the grid to its neighbor, this partial data is now laid out noncontiguously in memory with a uniform stride.

Nonuniform communication volumes: In grid layouts (e.g., structured grids), several applications communicate only (or mostly) with their immediate neighbors. Depending on the mapping of data to processes in each dimension (e.g., a process could potentially manage a nonsquare region in a 2-D grid) and on the discretization model used (e.g., star- or box-type stencils, as will be described in Section 2), the amount of data communicated to different neighbors can differ. In short, depending on the way the high-level library manages data internally, a process can communicate vastly different volumes of data with each of the other processes in the system.

In this paper, we analyze the overheads associated with these two aspects (noncontiguity in data layout and nonuniformity in communication volumes) in the context of the PETSc software toolkit over the MPI communication library. We describe the issues with the current approaches used by MPICH2 [17] (an implementation of MPI) and its derivatives (e.g., the MVAPICH2 [10] implementation from the Ohio State University) and propose different approaches for handling these issues. Specifically, for the first aspect (noncontiguous data layout), we propose a new dual-context look-ahead design for handling inefficiencies in noncontiguous data processing. For the second aspect (nonuniform communication volumes), we describe multiple designs for different communication operations that support nonuniform communication volumes (e.g., MPI_Allgatherv, MPI_Alltoallw).

Apart from proposing and describing the different designs, we also experimentally evaluate each of them to illustrate the inefficiencies associated with existing approaches and the benefits obtainable with our approaches. Next, we evaluate our integrated framework (comprising all our new designs) with the PETSc toolkit to understand the impact these designs can achieve at the level of high-level software libraries. Finally, we evaluate our framework with a 3-D Laplacian multi-grid solver application using the PETSc library over MPI. Our experimental results demonstrate that most current approaches do not follow any sophisticated mechanisms to enhance nonuniform data processing and communication and instead use the same mechanisms as for uniform data processing; this leads to a significant loss in the performance and scalability of applications. On the other hand, sophisticated schemes specifically tuned for nonuniform data communication can avoid such inefficiencies and achieve overall high performance. For example, with our framework, we could achieve close to an order of magnitude improvement in the performance of a 3-D Laplacian multi-grid solver application when evaluated on a cluster of 128 processors.

The remainder of the paper is organized as follows: in Section 2, we provide a brief overview of the PETSc toolkit. We analyze the requirements imposed by applications and high-level software libraries such as PETSc on the underlying MPI communication library in Section 3. We describe our designs to tackle issues with nonuniform data processing and communication in Section 4 and evaluate them in Section 5. Related work is described in Section 6, and conclusions and future work are presented in Section 7.

2 The PETSc Toolkit

The Portable Extensible Toolkit for Scientific Computation (PETSc) was developed in the Mathematics and Computer Science Division at Argonne National Laboratory. PETSc is a set of software tools for users writing large-scale application code involving solutions to Partial Differential Equations (PDEs). Application domains currently using PETSc include nanosimulations, biological sciences, fusion, geosciences, environmental/subsurface flow, computational fluid dynamics, wave propagation and others.

Figure 1: PETSc Architecture (levels of abstraction: application codes and PDE solvers built over TS (time stepping), SNES (nonlinear equations solvers) and SLES (linear equations solvers, comprising KSP Krylov subspace methods and PC preconditioners), over matrices, vectors, index sets and Draw, layered on BLAS, LAPACK and MPI)

Figure 1 describes the abstract components of PETSc. Like other high-level software libraries, PETSc provides a suite of data structures and routines to create vectors, matrices, and distributed arrays, both sequential and parallel, as well as routines for linear and nonlinear numerical solvers that can be easily used in application code written in C, C++, Fortran and Python. It also incorporates time-stepping methods and graphics. PETSc utilizes the Message Passing Interface (MPI) standard for all inter-process data communication while providing implicit message passing for the application, i.e., it transparently handles the issues associated with moving data between processes without requiring the application to directly call any data transfer routines. This includes handling parallel data layouts (parallel vectors and matrices), communicating ghost-point data (data points that are needed for computation but reside in a different process's memory) and others.

2.1 Handling Parallel Data Layouts

As mentioned earlier, PETSc provides mathematical abstractions like matrices and vectors, and utilities like distributed arrays, to represent a grid in a parallel manner. With the data objects distributed across processes, data communication, when required, is handled internally by the PETSc library, transparently to the application code. Some of the application computation requires data from neighboring processes (bordering data points, or ghost points). This communication can also be handled inside the PETSc library routines.

Figure 2: Ghost Points: bordering portions of a process's local data (local data points, ghost data points and the process boundary) that are used for computation

Figure 2 illustrates the concept of ghost points for structured and unstructured grids. As shown in the figure, the entire data required for processing is distributed amongst the different processes. However, to evaluate a local function, each process requires its local portion of the data as well as the bordering portions, or ghost points, that are owned by the neighboring processes. This data layout pattern presents three interesting aspects in the context of data communication, namely (i) noncontiguous data layout, (ii) nonuniform communication volumes to different processes and (iii) a nonuniform set of processes that are communicated with.

Noncontiguous data layout: Application data (or field values) is represented in a PETSc vector object, which is stored as a contiguous array on each process. When this data corresponds to a multidimensional structured grid, the array values are contiguous in the first dimension of the grid but strided in the other dimensions. Furthermore, each grid point might have multiple field values (e.g., pressure, temperature, x-velocity and y-velocity). These values are stored interlaced in the PETSc vector. This representation allows PETSc to perform the required mathematical operations efficiently and with the least overhead. Now, when the data corresponding to the ghost points needs to be communicated to a neighboring process, PETSc uses MPI to perform this data transfer efficiently. However, from MPI's perspective, the data that needs to be communicated is laid out in a noncontiguous manner. As we will see in Section 3, handling this efficiently is a non-trivial problem that needs to be addressed.

Nonuniform communication volumes: Depending on the discretization model used, the number of neighbors that need to be contacted by a process and the volume of data that needs to be transferred to each neighbor differ. For example, Figure 3 illustrates two such discretization models, namely the box-type stencil and the star-type stencil. As shown in the figure, in a star-type stencil, each process has to communicate with two neighbors in each dimension, i.e., 4 processes in a 2-D grid. The volume of communication in each dimension is the same, but could differ across dimensions (e.g., if a process manages a nonsquare portion of the grid). A box-type stencil is more complicated with respect to the communication pattern. In this model, each process has to communicate with four neighbors in each dimension, two along the sides and two along the corners. The interesting aspect of this communication pattern is that the volume of data communicated to the neighbors that share a side is typically much larger than the volume of data communicated to the neighbors that share only a corner. Further, as the dimensionality of the grid increases, the number of different volumes that need to be communicated could potentially increase quadratically.

Figure 3: Data Layout Models: Box-type Stencil and Star-type Stencil

Nonuniform set of processes to communicate with: As mentioned earlier, applications solving PDEs typically perform communication only with their neighbors. This is handled within PETSc using MPI collective operations (e.g., MPI_Alltoallw) where each process communicates a non-zero amount of data to its neighbors and zero data to the rest of the processes. This can be considered a special case of the previous aspect (nonuniform communication volumes), but it is relevant to note separately due to its potential to minimize the impact of skew if designed appropriately. It is to be noted that all processes perform communication with a set of processes during the collective operation, but each process does not communicate with every other process. Thus, if a set of processes is delayed, an appropriately designed collective communication operation should not delay a process which does not belong to the set of delayed processes.

3 Understanding upper-layer requirements on MPI

In this section we describe the implications of the requirements from upper layers (such as high-level software libraries) on MPI. Specifically, we analyze MPICH2 (a high-performance open-source MPI implementation from Argonne National Laboratory) and its derivative MPI implementations (e.g., MVAPICH2 over the InfiniBand network from the Ohio State University), and identify design issues that affect their capability to handle the requirements from upper layers.

3.1 Issues with Noncontiguous Data

As described in the previous section, scientific applications often need to send noncontiguous portions of memory to other processes. For instance, a process may need to send a column of an array to a neighboring process. In the C programming language, the elements of the column would not be located contiguously in memory. MPI provides a mechanism to describe noncontiguous memory regions, using MPI derived datatypes. Once a derived datatype has been created describing the noncontiguous region, it can be passed as a parameter to MPI communication functions, and the MPI library will handle sending the noncontiguous data. In this way, the use of datatypes in an application simplifies coding.

Figure 4: An 8x8 2-D matrix. Each element consists of three doubles.

Figure 5: Noncontiguous memory layout of the first column of the matrix (successive column elements are one full row apart in memory).

Figure 6: MPI datatype describing the first column of the matrix: a vector (count=8, stride=8) of contiguous (count=3) doubles.

Figure 7: Packing part of the first column of the matrix into a copy buffer.

The alternative to using MPI datatypes is for the programmer to explicitly pack the noncontiguous data into a contiguous buffer and then send that buffer; on the receiving side, the programmer would have to receive the data into a contiguous intermediate buffer and then unpack it.

Let us consider an 8x8 2-D matrix where each element consists of three double-precision floating-point values. Figure 4 shows such a matrix. If we consider only the first column of the matrix, the data in this column would be laid out noncontiguously in memory, as shown in Figure 5. Figure 6 depicts a datatype that could be used to describe this column. Datatypes are defined recursively, where a new datatype is built up from other datatypes. In this case a single matrix element is described by a contiguous datatype of three doubles. The column of these elements is described by a vector, with a stride of eight, of the element datatype.

Although the MPI implementation can use network operations which can directly send from or receive into noncontiguous buffers, e.g., writev in the sockets API, these generally perform well only for datatypes which are dense, i.e., which have medium to large segments of contiguous data. For sparse noncontiguous datatypes, which consist of many short contiguous segments, it is often more efficient to pack the data into an intermediate buffer before sending (Figure 7). Further, most MPI implementations perform the packing of the data and the actual communication from the packed buffer in a pipelined manner to improve performance. Accordingly, an intermediate buffer of a certain size (depending on the pipelining granularity) is allocated and data is packed into it. The arrow in Figure 6 shows how much of the derived datatype has been processed and copied into the intermediate buffer by the MPI implementation, and represents the current context of the datatype. This context is stored internally by the MPI implementation, so that when the next portion of the datatype is packed (due to the pipelining of packing and communication), the datatype processing can quickly reload the context and continue from there instead of re-searching through the datatype for the appropriate context at every packing event.

While context-based datatype processing is efficient in the simple case where there is no additional processing between pipeline events, in real implementations this is rarely the case. For example, before each data packing event, MPICH2 examines the derived datatype to see whether it is sparse or dense in order to decide whether or not to pack the data into an intermediate buffer, i.e., it performs a look-ahead in the derived datatype from the current context. Thus, the context of the derived datatype is incremented by the look-ahead amount. In case the datatype turns out to be sparse, data packing needs to be performed from the previous context; this is a problem, since the current context has already been incremented by the look-ahead amount. Currently, this problem is dealt with by re-searching the entire derived datatype for the appropriate context. This search time, however, increases quadratically with the size of the derived datatype. In Section 4.1 we present an optimization to the datatype processing code in MPICH2 to eliminate this overhead.
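To make the datatype of Figure 6 concrete, the following sketch builds and sends the column datatype for the 8x8 matrix of 3-double elements discussed above. It is illustrative only: the function name, buffer layout and tag are our assumptions, not PETSc or MPICH2 code.

```c
/* Sketch: describe the first column of an 8x8 matrix whose elements are
 * three doubles each, then send it with a single MPI call. */
#include <mpi.h>

#define ROWS    8
#define COLS    8
#define NFIELDS 3                      /* doubles per matrix element */

void send_first_column(double matrix[ROWS][COLS][NFIELDS], int dest,
                       MPI_Comm comm)
{
    MPI_Datatype elem, column;

    /* One matrix element: 3 contiguous doubles (Figure 6, inner type). */
    MPI_Type_contiguous(NFIELDS, MPI_DOUBLE, &elem);

    /* The column: 8 elements, each one full row (8 elements) apart. */
    MPI_Type_vector(ROWS, 1, COLS, elem, &column);
    MPI_Type_commit(&column);

    /* The MPI library walks this noncontiguous layout internally,
     * possibly packing into intermediate buffers in a pipelined way. */
    MPI_Send(&matrix[0][0][0], 1, column, dest, /*tag=*/0, comm);

    MPI_Type_free(&column);
    MPI_Type_free(&elem);
}
```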

3.2 Issues with Nonuniform Communication Volumes

A common model for scientific applications is to repeatedly perform a computation phase followed by a communication phase, in which all processes perform the communication phase at the same time. To support such models, MPI defines collective communication operations in which all processes send and receive some portions of data. For instance, in the MPI_Allgather() operation all processes participate and each process specifies data to be sent to all other processes, as well as the buffer into which it will receive the data from each of the other processes. In this operation the amount of data that each process sends is required to be the same, i.e., the volume of data communicated by each process should be uniform. MPI also provides a variant of this operation called MPI_Allgatherv(), which allows each process to send a different amount of data, i.e., the volume of data communicated by each process can be nonuniform.

MPICH2 collective communication operations have been optimized for uniform communication patterns. However, when there is a large difference between the amounts of data sent by the processes, these operations perform suboptimally. Consider the case where the processes call MPI_Allgatherv() and one process has a large amount of data to send, while the others have small amounts of data to send. Because the total amount of data to be sent is large, MPICH2 uses the ring algorithm, which is optimal for large messages in the uniform communication volume case. In the ring algorithm, the processes are arranged in a logical ring and each process receives data from its predecessor and sends data to its successor. If each process were to send the same amount of data, the communication would be well balanced, and each process's capacity would be fully utilized through continuously sending and receiving. On the other hand, in our example, because of the large imbalance of message sizes, most of the time only one process will be sending and one process will be receiving a large message; the other processes will be idly waiting for the next message. Thus, the communication time is dominated by the large message being passed around the ring. Figure 8 illustrates this problem.

Figure 8: Allgatherv Ring Algorithm for Large Messages – One large message in the communication volume set can sequentialize the entire communication operation
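The imbalance described above is easy to reproduce with a small MPI_Allgatherv call such as the sketch below, where rank 0 contributes a large buffer and every other rank contributes a single double. The sizes here are illustrative (they mirror the microbenchmark in Section 5.3), not prescribed by MPICH2.

```c
/* Sketch: nonuniform MPI_Allgatherv where rank 0 contributes LARGE doubles
 * and every other rank contributes one double. */
#include <mpi.h>
#include <stdlib.h>

#define LARGE 4096   /* rank 0's contribution, in doubles (assumed) */

void uneven_allgatherv(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) {
        counts[i] = (i == 0) ? LARGE : 1;   /* nonuniform volumes */
        displs[i] = total;
        total += counts[i];
    }

    double *sendbuf = calloc(counts[rank], sizeof(double));
    double *recvbuf = malloc(total * sizeof(double));

    /* The total volume is dominated by rank 0's message, so an
     * implementation that chooses its algorithm from the total size alone
     * (e.g., the ring algorithm) serializes that one large message. */
    MPI_Allgatherv(sendbuf, counts[rank], MPI_DOUBLE,
                   recvbuf, counts, displs, MPI_DOUBLE, comm);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
}
```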


In some collective communication patterns used by scientific applications, each process may have different data to send to different processes. MPI provides the MPI_Alltoall() operation to support such a pattern for uniform communication volumes (i.e., each process sends the same volume of data to each of the other processes). The MPI_Alltoallw() operation is the nonuniform communication volume counterpart of MPI_Alltoall(). As a special case, the MPI_Alltoallw() operation also supports a communication pattern where each process sends and receives data to and from some processes, but not necessarily every other process; this is achieved by specifying a zero communication volume for the other processes in the collective operation. For instance, this operation can be used to perform a nearest-neighbor communication pattern, where processes are arranged logically in a grid and each process exchanges data only with the processes adjacent to it in the grid. Note that in this operation, every process exchanges data with some processes and hence participates in the collective operation; each process just does not communicate with every other process in the system.

Because the MPICH2 collective communication operations are optimized for uniform communication volumes, the MPI_Alltoallw() implementation assumes (from a performance standpoint) that the same amount of data is sent and received between every two processes. So, cases where the communication volumes are nonuniform might not be handled in the most efficient way. Specifically, the MPI_Alltoallw() implementation requires each process to send and receive a message to and from every other process in a round-robin manner. This approach has two major disadvantages. First, if a process has no data to exchange with another process, sending and receiving zero bytes of data (as is done in the current implementation of MPICH2 and its derivatives) adds an additional synchronization step between the two processes. For example, in the case of nearest-neighbor communication, each process will only have data to exchange with a few other processes, regardless of the number of processes in the application. Second, let us consider a scenario where process 1 has to send a large message to process 2 and a small message to process 3. Now, in the round-robin order, if process 2 comes before process 3, the data to be sent to process 2 is processed before that of process 3. In cases where the processing overhead is large (e.g., if the data is noncontiguous), process 3 could be significantly delayed. On the other hand, if the smaller messages are sent out first, this can be avoided. In Section 4.2 we present optimizations to the MPICH2 collective communication operations for nonuniform communication volumes.
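A hedged sketch of the nearest-neighbor pattern just described: every process exchanges data only with its two ring neighbors and passes zero counts for everyone else. The ring topology, message size and at-least-three-process assumption are ours, chosen purely for illustration.

```c
/* Sketch: nearest-neighbor exchange on a logical ring via MPI_Alltoallw.
 * Only the two neighbors get nonzero counts; all other counts stay zero.
 * Assumes at least 3 processes so left != right. */
#include <mpi.h>
#include <stdlib.h>

#define NELEM 100   /* doubles exchanged with each neighbor (assumed) */

void ring_neighbor_exchange(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    int *scounts = calloc(size, sizeof(int));
    int *rcounts = calloc(size, sizeof(int));
    int *sdispls = calloc(size, sizeof(int));   /* Alltoallw: byte offsets */
    int *rdispls = calloc(size, sizeof(int));
    MPI_Datatype *stypes = malloc(size * sizeof(MPI_Datatype));
    MPI_Datatype *rtypes = malloc(size * sizeof(MPI_Datatype));

    double *sendbuf = calloc(2 * NELEM, sizeof(double));
    double *recvbuf = calloc(2 * NELEM, sizeof(double));

    for (int i = 0; i < size; i++)
        stypes[i] = rtypes[i] = MPI_DOUBLE;     /* counts default to zero */

    scounts[left]  = rcounts[left]  = NELEM;
    scounts[right] = rcounts[right] = NELEM;
    sdispls[right] = rdispls[right] = NELEM * sizeof(double);

    /* With the existing round-robin implementation, even the zero-byte
     * exchanges with non-neighbors act as synchronization points. */
    MPI_Alltoallw(sendbuf, scounts, sdispls, stypes,
                  recvbuf, rcounts, rdispls, rtypes, comm);

    free(scounts); free(rcounts); free(sdispls); free(rdispls);
    free(stypes); free(rtypes); free(sendbuf); free(recvbuf);
}
```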

4 Redesigning MPI Communication

As described in Section 3, most current approaches used by MPI implementations do not follow any sophisticated mechanisms to enhance noncontiguous data processing and nonuniform communication volumes, and instead use the same mechanisms as for uniform data processing. In this section, we describe different approaches to handle the issues pointed out in Section 3 and discuss the advantages of these approaches compared to the current approaches. In Section 5, we evaluate our schemes independently as well as with PETSc-based applications to understand the benefits achievable.

4.1 Processing Noncontiguous Data

As described in Section 3.1, approaches used by current MPI implementations to handle noncontiguous data rely on continuous parsing of the derived datatype. If a look-ahead needs to be performed within the datatype to understand whether portions of the datatype are sparse or dense, this continuity is lost. This results in the algorithm having to search the derived datatype at each step to find the position up to which it had previously parsed and continue from there. Thus, the search time needed to find the previous position increases quadratically with the size of the derived datatype.

In this section, we present a new approach to handle such issues with derived datatype processing, namely a dual-context look-ahead approach (Figure 9). The primary idea of this approach is to utilize two contexts to parse the datatype instead of one. These two contexts can be thought of as snapshots of the layout of the derived datatype.

Figure 9: Dual-context Look-ahead Design (Context 1 performs look-ahead processing over the noncontiguous application buffers; Context 2 packs into the intermediate buffer and communicates)

The first context is primarily utilized for look-aheads in order to understand the structure of the upcoming portions of the derived datatype. This context rolls forward within the datatype, allowing the implementation to analyze the structure of the datatype and make a decision on the algorithm to be utilized (for sparse and dense derived datatypes). Note that this context only requires parsing through the signature of the datatype (i.e., the number of contiguous regions in the datatype) and not the actual data.

Thus, when the contiguous elements of the derived datatype are large, the cost of this look-ahead is very small compared to the cost of actually packing the data. However, when the contiguous elements are small, the cost of the look-ahead is comparable to that of the actual packing.

The second context is utilized for the actual processing of the datatype, including packing the data if needed and communicating the appropriate data in a pipelined fashion. This processing touches the actual data, making its overhead proportional to the amount of data. The functionality of this context is identical to that of the previous approaches for derived datatype processing, which used a single context to keep track of the position in the datatype (as described in Section 3.1).

The main issue with the dual-context look-ahead approach is the redundant parsing of the derived datatype signature. In the case where the algorithm decides that the datatype is dense, there is no redundant processing. However, in the case where the algorithm decides that the datatype is sparse, the processing done by the first context has to be repeated by the second context in order to pack and communicate the data. In this regard, we point out that if this redundant processing were not present (as in the current approaches), there would be no way of keeping track of the position in the datatype up to which the packing was done. Thus, we would have to search through the part of the datatype that had already been packed to reach the appropriate position.

In summary, existing approaches search the derived datatype which has already been processed in order to get back to the appropriate position, whereas our new approach parses later elements in the datatype and avoids searching through the already processed elements. The benefit of our approach is that the amount of look-ahead needed is typically very small (e.g., 15 elements in the current design); thus this time is nearly constant. On the other hand, the search time for the current approach increases linearly at each step with the size of the datatype; performing this search at every look-up makes the overall search time increase quadratically with the datatype size.
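The following sketch is not the MPICH2 code; it only illustrates the dual-context idea over a flattened datatype (an array of contiguous segments, a representation we assume for brevity). One context scans the signature to classify the upcoming chunk, while a separate pack context does the byte copying, so no backward search is ever needed.

```c
/* Sketch of the dual-context idea over a flattened datatype. */
#include <stddef.h>
#include <string.h>

typedef struct { const char *addr; size_t len; } segment_t;
typedef struct { size_t seg; size_t off; } context_t;   /* position in type */

#define LOOKAHEAD_SEGS  15      /* small, near-constant look-ahead window */
#define DENSE_THRESHOLD 4096    /* assumed: average bytes per segment */

/* Context 1: peek at the next few segments without touching any data. */
static int next_chunk_is_dense(const segment_t *segs, size_t nsegs,
                               context_t lookahead)
{
    size_t bytes = 0, n = 0;
    for (size_t s = lookahead.seg; s < nsegs && n < LOOKAHEAD_SEGS; s++, n++)
        bytes += segs[s].len;
    return n > 0 && bytes / n >= DENSE_THRESHOLD;
}

/* Context 2: pack up to 'budget' bytes starting where the previous pack
 * stopped.  Because this context is kept separately, no re-search is needed
 * even after the look-ahead context above has rolled forward. */
static size_t pack_next(const segment_t *segs, size_t nsegs,
                        context_t *pack, char *buf, size_t budget)
{
    size_t copied = 0;
    while (pack->seg < nsegs && copied < budget) {
        size_t avail = segs[pack->seg].len - pack->off;
        size_t take  = avail < budget - copied ? avail : budget - copied;
        memcpy(buf + copied, segs[pack->seg].addr + pack->off, take);
        copied    += take;
        pack->off += take;
        if (pack->off == segs[pack->seg].len) { pack->seg++; pack->off = 0; }
    }
    return copied;
}
```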

4.2 Handling Nonuniform Communication Volumes

As described in Section 3.2, current collective communication operations are ill-equipped to deal with applications which use nonuniform communication volumes, i.e., where not all processes send or receive the same amount of data. When the difference in the communication volumes is large, this often results in sequentialization of the communication and a significant loss of performance. In the extreme case, when a process pair exchanges no data at all, this can also result in increased skew, making the application more sensitive to load imbalance and system noise. In this section, we propose different designs for two such collective communication operations that deal with nonuniform communication volumes, namely MPI_Allgatherv (Section 4.2.1) and MPI_Alltoallw (Section 4.2.2).

4.2.1 Enhancing MPI_Allgatherv

When the total size of the data that needs to be communicated is large, MPI_Allgatherv currently uses a ring algorithm (which is optimal for uniform communication). At every step, each node forwards the data it received in the previous step to the next node. If one node is sending a large amount of data and the remaining nodes are sending small amounts of data, this essentially sequentializes the transfer of the large message, thus taking O(N) time for communication, where N is the number of nodes participating in the operation. On the other hand, if we can quickly identify the communication volumes of all the processes, we can design more sophisticated algorithms for the case of nonuniform communication volumes. Thus, we break this problem down into two sub-problems: (i) designing an efficient approach to quickly identify large nonuniformities in the communication volume set and (ii) designing a more efficient algorithm, based on the knowledge of the nonuniform communication volumes, to transfer the data.

Identifying Nonuniformities in the Communication Volume Set: We formulate this subproblem as an outlier detection problem, i.e., we need to identify whether a small subset of the communication volume set falls significantly outside the range of the rest of the set. We do this by computing the outlier ratio, using Equation (1), and comparing it to a threshold value.

    outlier ratio = k_select(COMM_VOL_SET, N) / k_select(COMM_VOL_SET, N × OUTLIER_FRACT)    (1)

In this equation, COMM_VOL_SET is the set of communication volumes used by each of the processes in the system (note that this information is already available at each process in an MPI_Allgatherv operation), N is the number of processes participating in the communication operation, and OUTLIER_FRACT is the fraction of processes that have to be outside the range of volumes encompassing the bulk of the messages in order to be called outliers. Here, k_select() determines the k-th smallest element of a set of elements. We utilize the algorithm by Floyd and Rivest to evaluate k_select() in linear time. Thus, the entire mathematical formulation described above can be computed in linear time. It is to be noted that the existing approach parses through the entire communication volume set to identify the total communication volume, which is already linear time. With our approach, we are increasing the coefficient of the linear time taken, but not its computational complexity. There exist alternative approaches which utilize variants of the above model to reduce this coefficient, which we are currently studying.
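As a concrete reading of Equation (1), the sketch below computes the outlier ratio using a simple sort-based k-selection; the design described above uses the linear-time Floyd-Rivest selection instead, and the threshold and OUTLIER_FRACT values are assumptions, not values taken from the implementation.

```c
/* Sketch: Equation (1) as code.  A sort-based k-selection is used here for
 * brevity; the paper's design uses Floyd-Rivest selection (linear time). */
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* k-th smallest element (1-based) of vols[0..n-1]. */
static int k_select(const int *vols, int n, int k)
{
    int *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) tmp[i] = vols[i];
    qsort(tmp, n, sizeof(int), cmp_int);
    int v = tmp[k - 1];
    free(tmp);
    return v;
}

/* Outlier ratio per Equation (1): the largest volume divided by the volume
 * at position N * OUTLIER_FRACT in sorted order. */
double outlier_ratio(const int *comm_vol_set, int n, double outlier_fract)
{
    int k = (int)(n * outlier_fract);
    if (k < 1) k = 1;
    int num = k_select(comm_vol_set, n, n);
    int den = k_select(comm_vol_set, n, k);
    return den > 0 ? (double)num / den : (double)num;
}

/* Usage (assumed threshold): switch to the nonuniform-volume algorithms
 * when outlier_ratio(vols, nprocs, OUTLIER_FRACT) exceeds the threshold. */
```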

Designing Efficient Algorithms based on the Communication Volume Characteristics: Based on the knowledge of the communication volumes obtained as described above, in cases where some of the communication volumes are significantly larger than the rest, we use one of two algorithms. The first algorithm is a recursive doubling approach, which is used when a power-of-two number of processes participate in the communication. The second algorithm is a variant of the dissemination algorithm [8], and is used for non-power-of-two numbers of processes.

The recursive doubling algorithm proceeds in log N phases, where in each phase each process exchanges data with another process. With each phase the amount of data exchanged increases, as each pair of processes exchanges not only their own data but also the data which they have received in the previous phases. Figure 10 shows the basic communication pattern for the recursive doubling algorithm for eight processes.

Figure 10: Recursive doubling algorithm for eight processes

The dissemination algorithm proceeds in ⌈log N⌉ phases. In phase p, each process with rank i sends its data to the process with rank (i + 2^p) mod N and receives data from the process with rank (i − 2^p) mod N, where N is the number of participating processes. As with the recursive doubling algorithm, the amount of data sent and received in each phase doubles from the previous phase. Figure 11 shows the basic communication pattern for the dissemination algorithm.

Figure 11: Dissemination algorithm for five processes

The main benefit of both these algorithms is that the movement of the large outlier messages to all the processes is carried out simultaneously by multiple processes. For example, if we just consider the movement of one large message, the data follows a binomial tree pattern instead of the sequential pattern of the previous ring algorithm.
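A minimal sketch of the dissemination-style schedule just described (send to rank i + 2^p, receive from rank i − 2^p, all modulo N). The real payload movement and the trimming of the final partial phase are elided, and the function name is ours, not taken from MVAPICH2 or MPICH2.

```c
/* Sketch: dissemination schedule.  Instead of moving real payloads, each
 * phase exchanges only the count of "gathered" items, to show how the
 * per-phase volume roughly doubles across ceil(log2 N) phases. */
#include <mpi.h>

void dissemination_schedule(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int have = 1;                                   /* items gathered so far */
    for (int dist = 1; dist < size; dist *= 2) {    /* ceil(log2 N) phases   */
        int dst = (rank + dist) % size;             /* send to rank i + 2^p  */
        int src = (rank - dist + size) % size;      /* recv from rank i - 2^p */
        int incoming = 0;

        MPI_Sendrecv(&have, 1, MPI_INT, dst, 0,
                     &incoming, 1, MPI_INT, src, 0,
                     comm, MPI_STATUS_IGNORE);
        have += incoming;   /* a real implementation would cap this at N */
    }
    /* After ceil(log2 N) phases every rank has contributions from all N. */
}
```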

4.2.2 Enhancing MPI_Alltoallw

Current algorithms for implementing MPI_Alltoallw, again, do not carefully consider the overheads associated with nonuniform communication volumes. Specifically, if some processes either communicate significantly larger volumes of data that require preprocessing (e.g., noncontiguous data that needs to be packed) than the others, or do not communicate any data at all, current algorithms tend to perform suboptimally.

In our approach, for each process, we divide the data to be communicated to every other process into bins based on the volume of data to be communicated. If the data being communicated requires preprocessing, we process the bins containing small messages first and then move to the bins containing larger messages. In this approach, since the processing of the data is sequentialized by the host processor, remote processes that communicate only small amounts of data do not have to wait for the larger messages to be processed. In the existing approach, however, no such prioritization is performed; this could potentially cause the wait times for processes that only communicate small amounts of data to be higher than required. Another optimization is to place processes with which there is no communication in a separate bin, which is exempted completely. This reduces unnecessary skew in the collective operation. In our implementation, we used three bins: zero-size messages, small messages (smaller than a threshold) and large messages.
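A sketch of the binning order just described: destinations are split into zero, small and large bins, and the small bin is processed first. The threshold value and the processing callback are placeholders for illustration, not the MPICH2 implementation.

```c
/* Sketch: process per-destination messages in bin order (small before
 * large; zero-byte destinations skipped entirely). */
#include <stddef.h>

#define SMALL_MSG_THRESHOLD 16384   /* bytes; assumed value */

typedef void (*process_fn)(int dest, size_t nbytes);

void process_in_bin_order(const size_t *nbytes, int nprocs, process_fn process)
{
    /* Bin 1: small (but nonzero) messages first, so lightly loaded peers
     * are not stuck behind the packing of large noncontiguous buffers. */
    for (int i = 0; i < nprocs; i++)
        if (nbytes[i] > 0 && nbytes[i] <= SMALL_MSG_THRESHOLD)
            process(i, nbytes[i]);

    /* Bin 2: large messages afterwards. */
    for (int i = 0; i < nprocs; i++)
        if (nbytes[i] > SMALL_MSG_THRESHOLD)
            process(i, nbytes[i]);

    /* Bin 0 (zero bytes) is exempted completely: no zero-byte send/recv,
     * and hence no extra synchronization with processes we never talk to. */
}
```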

5 Experimental Evaluation

In this section, we evaluate the approaches described in Section 4 and compare them with the existing approaches. We perform microbenchmark-level evaluations of the proposed schemes in Sections 5.2 and 5.3. Evaluation with PETSc benchmarks is performed in Section 5.4, and evaluation of the 3-D Laplacian multi-grid solver application is presented in Section 5.5.

5.1 Experimental Testbed

The experimental testbed used in this paper consisted of two clusters.

Cluster 1: a 32-node cluster of dual Intel EM64T 3.6GHz processors with 2MB L2 cache and 2GB DDR2 400MHz SDRAM. The chipset on these machines was the Intel E7520 (Lindenhurst) chipset. We used the RedHat AS4 Linux distribution with the kernel.org 2.6.16 kernel installed on each machine.

Cluster 2: a 32-node cluster of dual AMD Opteron 2.8GHz processors with 1MB L2 cache and 4GB DDR 400MHz SDRAM. The chipset on these machines was the dual NVidia 2200/2050 chipset. We used the RedHat AS4 Linux distribution with the kernel.org 2.6.16 kernel installed on each machine.

Network: all 64 nodes were connected with Mellanox MT25208 InfiniBand DDR adapters through a 144-port IB switch.

Software: we used the OpenFabrics Gen2 stack as the underlying IB driver and verbs interface. Above this interface, we used the MVAPICH2 implementation of MPI (a derivative of the MPICH2 stack from Argonne National Laboratory) over InfiniBand. Specifically, we used MVAPICH2-0.9.5 as the base case and compared it against our optimizations to it (labeled MVAPICH2-New).

5.2 Evaluating Noncontiguous Data Processing Overheads

In this section, we evaluate our optimizations to noncontiguous datatype processing using a benchmark which sends a matrix from one process to another while transposing it.

Figure 12: Performance of the matrix transpose benchmark (latency in msec vs. matrix size, 64x64 to 1024x1024)

Figure 13: Datatype processing breakup: (a) current approach and (b) proposed dual-context look-ahead approach (percentage of time spent in Comm, Pack and Search for matrix sizes 64x64 to 1024x1024)

In the benchmark, the sender sends the data in column-major order while the receiver receives it in row-major order, effectively transposing the matrix. Because of the noncontiguous layout of the data being sent, the performance of this benchmark is highly dependent on the performance of the MPI implementation's noncontiguous datatype processing.

Figure 12 shows the evaluation of this benchmark. In the figure, we see that as the size of the matrix being transposed increases, the time to perform the transpose increases much faster for the original implementation, labeled MVAPICH2-0.9.5, than for the optimized implementation, labeled MVAPICH2-New. For the 1024x1024 matrix, the optimizations give over 85% improvement over the original implementation. Because the transpose time grows with the matrix size much faster for the original implementation than for the optimized one, the improvement gained from the optimization will further increase for larger matrices.

Figure 13 shows the breakdown of the time spent in the transpose benchmark for the current approach and the optimized implementation. The breakdown for each matrix size is normalized to 100%. The figure shows the time spent performing communication, packing the data, and searching for the current location in the datatype. As expected, because the original implementation loses its context within the derived datatype each time it sends a portion of the noncontiguous data, the search time increases dramatically with the size of the matrix. In the optimized implementation we have eliminated the search time altogether by using the dual-context approach, and so communication dominates the time to perform the benchmark.
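The transpose benchmark can be expressed with derived datatypes roughly as follows: the sender describes a column with a strided vector type while the receiver lands the data as a contiguous row. The names, matrix layout and tag are illustrative, not the exact benchmark code.

```c
/* Sketch of the transpose benchmark: rank 0 sends column j of an NxN matrix
 * of 3-double elements using a strided datatype; rank 1 receives it as a
 * contiguous row, so column j becomes row j. */
#include <mpi.h>

void transpose_column(double *matrix, int n, int j, int rank, MPI_Comm comm)
{
    MPI_Datatype elem, column;
    MPI_Type_contiguous(3, MPI_DOUBLE, &elem);     /* one matrix element */
    MPI_Type_vector(n, 1, n, elem, &column);       /* column: stride = n */
    MPI_Type_commit(&column);

    if (rank == 0) {
        /* Sending the strided column exercises the datatype-processing
         * (look-ahead and packing) path measured in Figures 12 and 13. */
        MPI_Send(matrix + (size_t)3 * j, 1, column, 1, 0, comm);
    } else if (rank == 1) {
        /* Receive as a contiguous row of 3*n doubles. */
        MPI_Recv(matrix + (size_t)3 * n * j, 3 * n, MPI_DOUBLE,
                 0, 0, comm, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Type_free(&elem);
}
```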

5.3 Evaluating Nonuniform Volume Collective Communication

We evaluated the performance of our optimizations for nonuniform communication patterns using MPI-level microbenchmarks. In the first benchmark, we measured the average latency of MPI_Allgatherv when Process 0 sends a large amount of data while the other processes send only one double. Figure 14 shows the results of this benchmark. The graph on the left shows the latency of MPI_Allgatherv for 64 processes as we change the amount of data that Process 0 sends. Notice that as this amount increases, the latency of the original implementation increases at a faster rate than that of our optimized implementation. The graph on the right shows the latency of MPI_Allgatherv when Process 0 sends 32 KB of data as we vary the number of processes. Here we see that as the number of processes increases, again the latency of the original implementation increases at a faster rate than that of our optimized implementation. We see up to a 20% improvement for 64 processes when Process 0 is sending 32 KB of data.

Figure 14: MPI_Allgatherv Performance: (a) with varying problem size and (b) with varying system size (latency in µsec)

Our next benchmark evaluated the performance of MPI_Alltoallw for nonuniform communication patterns. In this benchmark, the processes are arranged in a logical ring and each process has a 10x10 matrix of doubles to exchange with its successor and predecessor, but nothing to exchange with the other processes.

Figure 15: MPI_Alltoallw performance (latency in µsec vs. number of processes)

Figure 15 shows the results of this benchmark as the number of processes is varied. We see that the original implementation is much more easily impacted by any minor imbalance between processes as compared to our optimized implementation. Note that we did not add any artificial skew to the benchmark, but given that we used a combination of two different clusters in our evaluation (32 Intel nodes and 32 Opteron nodes), some skew is bound to be present between the processes. For the 128-process case we see over 88% improvement. The evaluation up to 32 processes was done completely on the Opteron cluster; the benefit in this case is about 50%.

5.4 PETSc Vector Scatter Benchmark

We also evaluated the benefit of our optimizations at the PETSc library level. We measured the latency of a vector scatter benchmark which stresses the communication portion of PETSc without utilizing any of its mathematical operations. In this benchmark, we create two 1-D grids of elements with one degree of freedom (i.e., each element comprises one double-precision value). These 1-D grids are initially passed over to PETSc to lay them out in a parallel manner across multiple processes. Once laid out, each process scatters the elements in its portion of the first vector to unique portions of the second vector. This benchmark emulates the communication portion of applications that need to operate on multiple 1-D grid structures whose elements are distributed across the system.

Because of the poor performance of the original implementation, the PETSc library by default does not use derived datatypes or collective operations. Rather, it uses a hand-tuned algorithm which explicitly performs the packing of data and individual sends and receives to scatter the vector. Together with this hand-tuned option, PETSc also has a mechanism to use derived datatypes and collective operations for MPI implementations which optimize these routines. We compared the performance of PETSc for three implementations: (i) the default hand-tuned implementation (labeled hand-tuned), (ii) the implementation that uses derived datatypes and collective operations with the original MPI (MVAPICH2-0.9.5), and (iii) the same implementation with our optimized MPI (MVAPICH2-New).

Figure 16 shows the results of this benchmark as we varied the number of processes. The 1-D grid size is scaled according to the number of processes, so the number of elements managed by each process is constant across the graph.

Figure 16: PETSc Vector Scatter Benchmark: (a) Absolute Performance and (b) Percentage Improvement (latency in sec and improvement vs. number of processes)

The graph on the left shows the latency of the three implementations, while the graph on the right shows the percentage improvement of the optimized implementation over the original implementation and the hand-tuned implementation. We see that our optimized implementation shows a large improvement over the original implementation as the number of processors increases: over 95% for 128 processors. The hand-tuned implementation does perform slightly better than our optimized implementation (almost 4% better at 128 processes). This indicates that there are more optimizations that we can apply to MPI. Also, we would like the readers to note that though the hand-tuned implementation does perform better, it requires implementing the packing and communication pattern explicitly, which in general complicates the library's communication implementation. By using MPI capabilities (derived datatypes and collective operations), these implementations can be simplified, which may be a desirable trade-off considering the small performance gain from using a hand-tuned algorithm.
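For orientation, the sketch below shows roughly what a vector-scatter microbenchmark in the spirit of this section looks like at the PETSc level, written against a PETSc 3.x-style API. Call signatures differ across PETSc versions and the index layout is our own simplification, so treat this as illustrative rather than the actual benchmark.

```c
/* Sketch: scatter each process's locally owned slice of one parallel vector
 * into the neighboring process's slice of another (wrapping around), so the
 * scatter moves data between processes. */
#include <petscvec.h>

int main(int argc, char **argv)
{
    Vec        src, dst;
    IS         from, to;
    VecScatter ctx;
    PetscInt   lo, hi, nlocal, nglobal;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Two parallel 1-D vectors, one degree of freedom per element. */
    VecCreateMPI(PETSC_COMM_WORLD, 100000, PETSC_DECIDE, &src);
    VecDuplicate(src, &dst);

    VecGetOwnershipRange(src, &lo, &hi);
    VecGetSize(src, &nglobal);
    nlocal = hi - lo;
    ISCreateStride(PETSC_COMM_WORLD, nlocal, lo, 1, &from);
    ISCreateStride(PETSC_COMM_WORLD, nlocal, (lo + nlocal) % nglobal, 1, &to);

    VecScatterCreate(src, from, dst, to, &ctx);
    VecScatterBegin(ctx, src, dst, INSERT_VALUES, SCATTER_FORWARD);
    VecScatterEnd(ctx, src, dst, INSERT_VALUES, SCATTER_FORWARD);

    VecScatterDestroy(&ctx);
    ISDestroy(&from);
    ISDestroy(&to);
    VecDestroy(&src);
    VecDestroy(&dst);
    return PetscFinalize();
}
```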

5.5 3-D Laplacian Multi-grid Solver

To evaluate the benefit of our optimized implementation, we measured the performance of a 3-D Laplacian multi-grid solver application that is modeled by the following partial differential equation:

    ∇u = i ∂u/∂x + j ∂u/∂y + k ∂u/∂z    (2)

with boundary conditions 0 ≤ x, y, z ≤ 1. A grid of size 100x100x100 is used with one degree of freedom. The data grid varies the values of the variables (x, y, z) uniformly across the grid in each dimension.

The application itself utilizes PETSc to solve this equation using a multi-grid solver. In our experiments, we evaluated the application using three levels in the multi-grid to solve the equation.

Figure 17: 3-D Laplacian Multi-grid Solver Application Evaluation: (a) Absolute Performance and (b) Percentage Improvement (execution time in sec and improvement vs. number of processes)

Figure 17 shows the execution time of the application as the number of processes is varied. We see that the execution time of the application using our optimized implementation continues to decrease even up to 128 processes. However, when the application used the original implementation, the execution time started increasing after 32 processes. This indicates that the application scales better with the optimized implementation than with the original implementation. Our optimized implementation gives almost 90% improvement in the application execution time for 128 processes. Note that when the application used the hand-tuned implementation, it performed better than our implementation by just over 10% for 4 processors. However, as the number of processors increases, this performance gap between the hand-tuned and our optimized implementations decreases, and it is less than 3% for 128 processes.

6 Related Work There has been a lot of previous work on optimizing the performance of noncontiguous data transfers, including packing and unpacking techniques [4, 18, 7] and approaches to utilize intelligent network adapters [23, 24, 19]. However, none of these approaches consider the search time to be a significant bottleneck. But, as we described in this paper, the search time within a datatype increases quadratically with the datatype size and becomes a significant bottleneck for large datatypes. In [18], the authors proposed an efficient mechanism to reduce this search time in simple cases; however, with advanced schemes, especially those involving look-aheads, this scheme loses its benefit. There has also been a significant amount of work in optimizing various collective operations including broadcast [9, 13, 12, 14], reduce [16, 15], allgather [11, 14, 21], alltoall [22], etc. Most of this work has been for uniform volume communication.


There has also been some limited work on optimizing the performance of collective operations that perform nonuniform volume communication [5]. However, the approaches proposed in that literature optimize performance based on network topology rather than on the variation in the communication volumes between processes. On the whole, this paper addresses an important problem and presents various novel designs to tackle it in an efficient manner.

7 Concluding Remarks

In this paper, we analyzed two kinds of overheads (noncontiguous data layouts and nonuniform volumes of communication to different processes) that are typically created as a byproduct of multi-layered software structures containing at least the applications, high-level software libraries and the communication libraries. We used the PETSc software library and the MPI communication library as a case study. We observed that the approaches currently used in MPI implementations are tuned to support contiguous and uniform data communication and are ill-equipped to handle noncontiguous and nonuniform communication, especially when the two occur together. We designed sophisticated approaches that are specifically optimized for nonuniform communication of noncontiguous data and demonstrated that such approaches can provide significant benefits in some cases. We evaluated our approaches with MPI-level microbenchmarks, PETSc-level benchmarks and a 3-D Laplacian multi-grid solver application. Our experimental results demonstrate close to an order of magnitude improvement in the performance of the 3-D Laplacian multi-grid solver application when evaluated on a 128-processor cluster.

In the future, we would like to analyze overheads in other high-level libraries and software as well. For example, applications using the FLASH software for solving nuclear astrophysical problems related to exploding stars typically rely on adaptive meshes where the area of interest is dynamically discovered. Accordingly, the processing required for fine-grained analysis in the area of interest is load-balanced between processors. Of course, depending on the granularity of the load balancing, this could create significant amounts of skew between processes. Studying the impact of such upper layers on the MPI library would be interesting.

References

[1] S. Balay, K. Buschelman, W. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang. PETSc Web page, 2001. http://www.mcs.anl.gov/petsc.
[2] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, 1997.
[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, 1997.
[4] S. Byna, W. Gropp, X.-H. Sun, and R. Thakur. Improving the Performance of MPI Derived Datatypes by Optimizing Memory Access Costs. In SC, 2003.
[5] A. Faraj and X. Yuan. Automatic Generation and Tuning of MPI Collective Communication Routines. In ICS, 2005.
[6] MPI Forum. MPI: A Message Passing Interface. In SC, 1993.
[7] W. Gropp, E. Lusk, and D. Swider. Improving the Performance of MPI Derived Datatypes. In MPIDC, 1999.
[8] Han and Finkel. An optimal scheme for disseminating information. In ICPP, 1988.
[9] C. T. Ho and S. L. Johnsson. Distributed Routing Algorithms for Broadcasting and Personalized Communication in Hypercubes. In ICPP, 1986.
[10] W. Huang, G. Santhanaraman, H. W. Jin, Q. Gao, and D. K. Panda. Design of High Performance MVAPICH2: MPI2 over InfiniBand. In CCGRID, 2006.
[11] M. Jacunski, P. Sadayappan, and D. K. Panda. All-to-All Broadcast on Switch-Based Clusters of Workstations. In IPDPS, 1999.
[12] S. L. Johnsson and C.-T. Ho. Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE TC, 1989.
[13] Manoj Kumar. Supporting Broadcast Connections in Benes Networks. Technical Report RC 14063, IBM Research, May 1988.
[14] S. Lee and K. G. Shin. Interleaved All-to-all Reliable Broadcast On Meshes and Hypercubes. In ICPP, 1990.
[15] A. Mamidala, J. Liu, and D. K. Panda. Efficient Barrier and Allreduce on IBA clusters using Hardware Multicast and Adaptive Algorithms. In Cluster, 2004.
[16] A. Moody, F. Petrini, and D. K. Panda. Efficient Reduce Algorithms on the Quadrics Network. In IPDPS, 2002.
[17] MPICH2. http://www.mcs.anl.gov/mpi/.
[18] R. Ross, N. Miller, and W. Gropp. Implementing Fast and Reusable Datatype Processing. In EuroPVM/MPI, 2003.
[19] G. Santhanaraman, J. Wu, and D. K. Panda. Zero-Copy MPI Derived Datatype Communication over InfiniBand. In EuroPVM/MPI, September 2004.
[20] V. S. Sunderam. PVM: A Framework for Parallel and Distributed Computing. Concurrency: Practice and Experience, 2(4):315–339, December 1990.
[21] S. Sur, U. Bondhugula, A. Mamidala, H.-W. Jin, and D. K. Panda. High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters. In HiPC, 2005.
[22] S. Sur, H. W. Jin, and D. K. Panda. Efficient and Scalable All-to-All Exchange for InfiniBand-based Clusters. In ICPP, 2004.
[23] V. Tipparaju, G. Santhanaraman, J. Nieplocha, and D. K. Panda. Host-assisted Zero-copy Remote Memory Access Communication on InfiniBand. In IPDPS, 2004.
[24] J. Wu, P. Wyckoff, and D. K. Panda. High Performance Implementation of MPI Datatype Communication over InfiniBand. In IPDPS, 2004.


The submitted manuscript has been created by the University of Chicago as Operator of Argonne National Laboratory (“Argonne”) under Contract No. W-31-109-ENG-38 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.