Leveraging C++ Meta-programming Capabilities to Simplify the Message Passing Programming Model

Simone Pellegrini, Radu Prodan, and Thomas Fahringer

University of Innsbruck – Distributed and Parallel Systems Group
Technikerstr. 21A, 6020 Innsbruck, Austria
{spellegrini,radu,tf}@dps.uibk.ac.at
Abstract. Message passing is the primary programming model used for distributed memory systems. Because it aims at performance, its level of abstraction is low, which makes distributed memory programming difficult and error-prone. In this paper, we leverage the expressivity and meta-programming capabilities of the C++ language to raise the level of abstraction and simplify message passing programming. We redefine the semantics of the assignment operator to work in a distributed memory fashion and leave to the compiler the burden of generating the required communication operations. By enforcing stricter checks at compile time, we are able to statically capture common programming errors without causing runtime overhead.

Keywords: Message passing, C++, Meta-programming, PGAS.
1 Introduction
The message passing paradigm is frequently used in High Performance Computing (HPC) for programming computer clusters and supercomputers. Compared to other existing parallel programming models such as OpenMP, message passing offers two basic primitives: send and receive. The burden of managing almost every aspect of program execution, including data partitioning, communication, and synchronisation between processes, is left to the programmer. A low level of abstraction is helpful for writing highly optimised programs; however, it makes distributed memory programming difficult and error-prone. Recently, new programming models aiming at simplifying distributed memory programming have been increasingly adopted. An example is the Partitioned Global Address Space (PGAS) model, which provides the programmer with a logically global memory address space where variables may be directly read and written by any process. Below this logical view, each variable is physically associated with a single process. Any attempt to read or write memory locations physically allocated on a different process results in a communication operation generated either by a runtime environment (as in the Global Arrays library [7]) or during the compilation process (as in Co-array Fortran and UPC [8,3]). However, because
of the increased level of abstraction, the programmer loses control over the generation of communication and synchronisation operations, resulting in significant performance losses compared to manually written message passing applications.

The motivation for the research presented in this paper is the observation that sending a message from a process A to a process B is semantically equivalent to an assignment operation: the content of the memory cell owned by process B is overwritten with data residing in process A’s memory space. We use the C++ operator overloading mechanism and template meta-programming techniques [4] to enable the automatic generation of low-level communication primitives by the standard C++ compiler. For example, whenever an assignment operator involving memory cells residing on different processes is encountered, the compiler generates the required communication statements. Additionally, we generate for each process rank a separate executable containing only those operations involving the memory cells assigned to it, which eliminates the control flow overhead incurred by the Single Program Multiple Data (SPMD) nature of the input program. The main advantage of our approach is that it achieves a level of abstraction similar to PGAS-based languages by exploiting only features of the standard C++ language and compiler. Furthermore, because the underlying programming model is still message passing, the programmer retains full control over the resulting performance.

In Section 2, we provide an overview of our new approach to writing message passing parallel programs. In Section 3, we discuss the implementation details of the mem_wrap object, the main abstraction behind our method. Section 4 compares our method with a UPC-based implementation of a Jacobi relaxation algorithm. Section 5 concludes the paper and highlights future work.
2 Overview
This section gives a brief overview of our technique; further details are given in Sections 3 and 4. Let us consider the simple message passing program in Listing 1.1, written in MPI [2], the de-facto standard for programming HPC applications. Two processes are involved in this example: process rank 0 computes the value of the π constant (pi) and sends it to process rank 1. The computed value is then used by both processes for further computation. A first characteristic of the program is the use of the SPMD technique, which generates a single executable that is spawned on multiple processors. To customize the program behaviour for a specific process rank, the programmer must repeatedly use control statements to guide each process’s flow of execution (lines 2 and 5). The use of control flow statements is in general a source of inefficiencies and limits compiler analysis and optimizations. Additionally, mispredicted branches cause significant performance penalties on modern pipelined CPU architectures. Because the generated executable contains code that is never executed on a particular process rank, the L1 instruction cache may also be used sub-optimally.
Listing 1.1. Simple message passing program in MPI
1  float pi;
2  if (rank == 0) {
3      pi = calc_pi();
4      MPI_Send(&pi, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
5  } else if (rank == 1)
6      MPI_Recv(&pi, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
7  use(pi);
A second observation is that message passing programs are often complex to read and, more importantly, to analyse. Because the programmer is forced by the programming model to describe the low-level operations (i.e. the “how”), the semantics of the program (i.e. the “what”) is mostly hidden. For example, although a connection between the send and receive operations in lines 4 and 6 exists, it is only implicit in the mind of the programmer and not made explicit in the code. This hidden knowledge could be used by the compiler to improve error checking and program performance, but it is unfortunately very hard to capture by static analysis [5,9]. For example, the compiler could enforce that the amount of data received is not less than the amount of data sent, or use constant propagation to remove communication statements when the transmitted value is constant (as detected by compiler dataflow analysis).
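To make the first of these checks concrete, the following minimal sketch (our own illustration, not part of the paper's implementation) shows how buffer sizes that are known at compile time can be verified before any communication takes place; the helper name checked_transfer and the use of C++11 static_assert are assumptions made for brevity.

#include <cstddef>

// Hypothetical helper: transfer SendCount floats into a buffer of RecvCount floats.
// Both sizes are compile-time constants, so the check costs nothing at runtime.
template <std::size_t SendCount, std::size_t RecvCount>
void checked_transfer(const float (&send_buf)[SendCount], float (&recv_buf)[RecvCount]) {
    static_assert(RecvCount >= SendCount,
                  "amount of data received must not be less than the amount sent");
    // ... the actual MPI_Send / MPI_Recv calls would be issued here ...
    (void)send_buf; (void)recv_buf;
}

int main() {
    float out[8], in[8];
    checked_transfer(out, in);        // compiles: the receiver provides enough space
    // float small[4];
    // checked_transfer(out, small); // would be rejected at compile time
    return 0;
}

Such a check rejects an erroneous program at compile time, where a plain MPI_Send/MPI_Recv pair would only fail at runtime.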
Listing 1.2. Overload of assignment operator in C++

1  mem_wrap<float> pi;  // manages memory allocation in the distributed env.
2  pi[r0] = calc_pi();  // Rank 0 executes calc_pi() and writes the returned value
3                       // into its own copy of pi
4  pi[r1] = pi[r0];     // Copies the value of pi owned by process rank 0 onto the
5                       // memory cell owned by process rank 1 (by using send/recv)
6  use(*pi);
In this paper, we propose a different approach which lets the programmer focus on the program semantics (the “what”) and lets the compiler deal with the generation of the required communication operations. The idea is not entirely new [3]; however, instead of introducing a new programming model (e.g. PGAS) and the underlying language support (e.g. UPC), we exploit the capabilities of the standard C++ language and compiler. Listing 1.2 shows a simple C++ program semantically equivalent to the previous example. The first aspect to note is the lack of any control flow statements, which is achieved by offloading all memory operations to a new data type, mem_wrap, acting as a memory wrapper for distributed memory environments. The input program is compiled multiple times, each time for a different process rank. Keeping the value of the process rank constant at compile time allows meta-programming techniques to be used for specializing the semantics of operations involving mem_wrap instances. For example, the initialisation of a memory cell owned by process rank 0 results in a no-operation (NOP) when the program is compiled for process rank 1 (line 2).
Assignment operations involving memory cells residing in different address spaces are replaced by communication statements (line 4). Table 1 shows the code generated at compile time by our approach for the processes with ranks 0 and 1 from the program in Listing 1.2. The SPMD input program is compiled into multiple executables (as many as the number of processes) and subsequently executed using the Multiple Program Multiple Data (MPMD) paradigm.

Table 1. Compiler generated code for process ranks 0 and 1
Rank 0:
1  float pi;
2  pi = calc_pi();
3  MPI_Send(&pi, 1, MPI_FLOAT, 1, 0, ...);
4  use(pi);

Rank 1:
1  float pi;
2  MPI_Recv(&pi, 1, MPI_FLOAT, 0, 0, ...);
3  use(pi);
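The code in Table 1 can be obtained because the process rank is a constant during each compilation pass. The following sketch is our own minimal illustration of this mechanism, not the paper's actual implementation: it assumes a hypothetical RANK macro fixed on the compiler command line (e.g. -DRANK=0 or -DRANK=1), so that the comparison against the rank is resolved at compile time and each executable retains only its own MPI call.

#include <mpi.h>

#ifndef RANK              // assumed to be set per compilation, e.g. -DRANK=0
#define RANK 0
#endif

// Copy one float from rank Src to rank Dst. All three parameters are compile-time
// constants, so the compiler removes the branch that does not apply to this rank.
template <int Rank, int Src, int Dst>
inline void copy_float(float& value) {
    if (Rank == Src)
        MPI_Send(&value, 1, MPI_FLOAT, Dst, 0, MPI_COMM_WORLD);
    else if (Rank == Dst)
        MPI_Recv(&value, 1, MPI_FLOAT, Src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    float pi = 0.0f;
    if (RANK == 0) pi = 3.14159f;   // stands in for calc_pi()
    copy_float<RANK, 0, 1>(pi);     // reduces to a bare MPI_Send or MPI_Recv
    // ... use(pi) would follow here ...
    MPI_Finalize();
    return 0;
}

Compiling this file once with -DRANK=0 and once with -DRANK=1 yields two executables that closely resemble the two codes of Table 1.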
Running the MPMD program generated by our technique produces very promising results. We executed both the SPMD and MPMD executables on an Intel Xeon X5570 CPU and an AMD Opteron 2435, both compiled with GCC 4.5.3 and optimizations enabled (-O3). We repeated the code snippet one billion times and used shared memory communication (the SM module of Open MPI’s Modular Component Architecture) to reduce the communication overhead. The main program loop was executed 10 times; the average execution time and standard deviation are shown in Table 2. A considerable performance improvement of around 30% is observed on the Intel architecture, while on the AMD CPU the improvement is around 5%. Because the two processors have a similar L1 cache size (i.e. 64KB), we believe that the main source of performance improvement is the simplification of the control flow.

Table 2. Execution time for each process of the program in Listing 1.2 using the SPMD and MPMD models
                       SPMD                          MPMD
               Exec. time   Standard       Exec. time   Standard
                  [ms]      deviation         [ms]      deviation    Speedup
Intel Xeon        8180         440             6162         129        1.32
AMD Opteron       9638         166             9296         177        1.04
To explain the performance improvement, we executed the code snippet with performance counters enabled on the Intel CPU, using the PAPI library [1]. The measured values of three performance counters are shown in Table 3: the instruction cache misses for both levels 1 and 2, and the total number of conditional branch instructions.
Table 3. Performance counter values for the Intel architecture
Hardware counter                        SPMD          MPMD
L1 Instruction Cache misses          4,253,718     4,246,317
L2 Instruction Cache misses        681,689,158   681,689,158
Conditional branch instructions      4,260,166     4,254,384
The code snippet is small enough to fit easily in the L2 cache; therefore, no difference in L2 cache misses is visible. However, the utilization of the L1 cache is improved for the MPMD code: the number of cache misses is reduced by 0.5%, because removing unreachable branches improves code locality. The number of conditional branch instructions is reduced by a similar amount. This alone, however, cannot explain the 32% speedup, which we believe to be the result of optimizations (e.g. loop unrolling and constant propagation) performed by the compiler on the MPMD code. Thanks to the simplification of the control flow obtained with our meta-programming technique, the compiler can perform more aggressive optimizations that are not applicable to the SPMD version.
3 The mem_wrap Object
Meta-programming is the practice of writing a computer program that writes or manipulates other programs (or itself) as its data. Meta-programming can be used to perform part of the computation at compile time instead of at runtime. By combining templates and meta-programming, it is possible in C++ to specialize the implementation of generic functions based on particular properties of the input parameters. For example, a generic function can have two implementations depending on whether the input parameter is a pointer or a value type. Because these checks are conducted at compile time, the expressions used to select a particular implementation must involve compile-time constants only.
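The following self-contained sketch (ours, not taken from the paper) illustrates exactly this kind of dispatch: a generic describe() function with two implementations, one selected when the argument is a pointer and one when it is a value type. For brevity it uses std::is_pointer from C++11; the paper itself relies on the Boost.MPL idioms visible in Listing 1.4.

#include <iostream>
#include <type_traits>

// Implementation chosen when T is a pointer type.
template <class T>
void describe_impl(const T& v, std::true_type) {
    std::cout << "pointer, pointee = " << *v << std::endl;
}

// Implementation chosen when T is a value type.
template <class T>
void describe_impl(const T& v, std::false_type) {
    std::cout << "value = " << v << std::endl;
}

// The tag object is a compile-time constant, so the selection has no runtime cost.
template <class T>
void describe(const T& v) {
    describe_impl(v, typename std::is_pointer<T>::type());
}

int main() {
    int x = 42;
    describe(x);    // prints "value = 42"
    describe(&x);   // prints "pointer, pointee = 42"
    return 0;
}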
Listing 1.3. mem_wrap object interface

 1  template <class T, class Sel, class R>
 2  struct mem_wrap {
 3      T& operator*();                      // Access to managed memory
 4
 5      mem_wrap& operator=(const T&);
 6      template <class T2, class Sel2, class R2>
 7      mem_wrap& operator=(const mem_wrap<T2, Sel2, R2>&);
 8      template <class R2>
 9      mem_wrap<T, Sel, R2> operator[](const R2&);
10  };
Our approach is based on a similar mechanism. The objective is to introduce an enhanced assignment operator which, depending on the types of the left- and right-hand side expressions, is specialized to implement different semantics. We introduce a new data type called mem_wrap, illustrated in Listing 1.3, that manages the allocation of and accesses to memory locations in the distributed memory environment. The first template parameter T represents the wrapped type, which allows the management of single elements (e.g. mem_wrap<float>) as well as of more complex data types such as arrays. The second parameter Sel is the selector, which decides whether the wrapped object (of type T) has to be allocated on the particular process rank for which the input program is being compiled. For example, by using the expression Rank%2==0 as a selector, we enforce that only even process ranks allocate the memory hosting the object of type T. We refer to these instances of mem_wrap as active. Ranks for which the selector is not satisfied (the odd ranks in this example) allocate an empty wrapper instance called a shadow. A shadow wrapper acts as a pointer to a memory location on a different machine and can be used to read data from it. Note that mem_wrap does not perform any data partitioning; the programmer is still responsible for dividing the memory space among the processes. Because a mem_wrap instance can refer to memory locations in multiple address spaces, the R parameter is used to address the copy owned by a specific process rank. Among several others, mem_wrap provides three basic methods: a dereferencing operator * used to directly access the memory managed by the wrapper (line 3), an assignment operator = overloaded to work with instances of the wrapped type T (line 5) or with mem_wrap instances (line 7), and a subscript operator [ ] used to select the copy of the wrapped data which belongs to a particular address space.
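How a selector can drive the active/shadow decision entirely at compile time can be sketched with ordinary partial template specialization. The code below is our own illustration, not the authors' mem_wrap: the MY_RANK macro, the storage helper, and simple_wrap are assumed names, and the selector is written in the spirit of the even-rank expression discussed above.

#include <iostream>

#ifndef MY_RANK            // assumed to be fixed per compilation, e.g. -DMY_RANK=0
#define MY_RANK 0
#endif

// Primary template: the "active" case, which really allocates the wrapped value.
template <class T, bool Active>
struct storage {
    T value;
    bool is_active() const { return true; }
};

// Partial specialization: the "shadow" case, which holds no local memory at all
// and would only act as a handle to the copy owned by another rank.
template <class T>
struct storage<T, false> {
    bool is_active() const { return false; }
};

// Selector: true only for even process ranks (cf. Rank%2==0 above).
template <int Rank>
struct even_rank { static const bool value = (Rank % 2 == 0); };

// The wrapper inherits either the active or the shadow storage, decided at compile time.
template <class T, template <int> class Sel, int Rank = MY_RANK>
struct simple_wrap : storage<T, Sel<Rank>::value> { };

int main() {
    simple_wrap<float, even_rank> x;
    std::cout << (x.is_active() ? "active" : "shadow") << std::endl;
    return 0;
}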
template struct even { template struct apply : public mpl::bool { }; }; mem wrap vect(100); for (unsigned int i=0; i