OpenMP for Adaptive Master-Slave Message Passing Applications

Panagiotis E. Hadjidoukas, Eleftherios D. Polychronopoulos, and Theodore S. Papatheodorou

High Performance Information Systems Laboratory (HPCLAB),
Department of Computer Engineering and Informatics,
University of Patras, Rio 26500, Patras, Greece
http://www.hpclab.ceid.upatras.gr
{peh,edp,tsp}@hpclab.ceid.upatras.gr
Abstract. This paper presents a prototype runtime environment for programming and executing adaptive master-slave message passing applications on clusters of multiprocessors. A sophisticated portable runtime library provides transparent load balancing and exports a convenient application programming interface (API) for multilevel fork-join RPC-like parallelism on top of the Message Passing Interface. This API can be used directly or through OpenMP directives. A source-to-source translator converts programs that use an extended version of the OpenMP workqueuing execution model into equivalent programs with calls to the runtime library. Experimental results show that our runtime environment combines the simplicity of OpenMP with the performance of message passing.
1 Introduction
Master-slave computing is a fundamental approach for parallel and distributed applications and has been used successfully on a wide class of applications. On distributed memory machines, programming a master-slave application is a difficult task that requires knowledge of the primitives provided by the underlying message passing environment, as well as additional programming effort when the application exhibits load imbalance. OpenMP [10] is an emerging standard for programming shared-memory multiprocessors and, more recently, clusters of workstations and SMPs through software distributed shared memory. Since OpenMP represents a master-slave (fork-join) programming paradigm for shared memory, it is possible to extend OpenMP for executing the task parallelism of master-slave message passing applications. OmniRPC [13] is an initial proposal for using OpenMP on client-server applications through remote procedure calls (RPC).

This paper presents a runtime environment for programming and executing adaptive master-slave message passing applications on clusters of SMPs. A portable runtime library exports a “shared-memory” API that can be used either directly or through OpenMP directives and a source-to-source translator, thus providing ease of programming and transparent load balancing without requiring any knowledge of low-level message passing primitives.
The library has been designed so that the same application code, or even binary, adapts to shared-memory multiprocessors, clusters of SMPs, and metacomputing environments. The primary goal of this work is the easier development of efficient master-slave message passing applications using OpenMP directives, rather than an extension of OpenMP. The translator converts programs that use an extended version of the proposed OpenMP workqueuing execution model [15] into equivalent programs with calls to the runtime library. Experimental results on two architectural platforms (a shared-memory multiprocessor and a cluster of SMPs) and two different operating systems (Windows and Linux) indicate the efficient execution of adaptive OpenMP master-slave message passing programs on our runtime environment.

The rest of this paper is organized as follows: Section 2 describes the general design and architecture of the proposed runtime system. Section 3 presents our OpenMP extensions for master-slave message-passing programming. Experimental evaluation with OpenMP programs is reported in Section 4. Related work is presented in Section 5. We discuss our future research in Section 6.
2 Runtime Library
The proposed runtime library provides an implementation of the Nano-threads Programming Model on top of message passing and specifically the Message Passing Interface (MPI) [7]. This model has been implemented on shared-memory multiprocessors in the context of the NANOS project [9] and recently on clusters of SMPs, through SDSM, within the POP (Performance Portability of OpenMP) project [12]. The NANOS API is targeted by the NanosCompiler [1], which converts programs written in Fortran77 that use OpenMP directives to equivalent programs that use this API. Our runtime library exports a slightly extended version of the NANOS API for master-slave message passing applications.
2.1 Design and Architecture
Figure 1 illustrates the modular design of our runtime system. NanosRTL is the core runtime library that implements the architecture and the functionality of the Nano-threads Programming Model. Its implementation is based on the POSIX Threads API and the interfaces exported by the remaining components of the runtime environment. Moreover, a POSIX Threads library for Windows and a simple unix2nt software layer allow the compilation of our code, without modifications, on both Unix and Windows platforms.

UTHLIB (Underlying Threads Library) is a portable thread package that provides the primary primitives (creation and context switching) for managing non-preemptive user-level threads. It is intended for implementing two-level thread models (libraries), where virtual processors are system-scope POSIX threads. It is portable since its machine-dependent parts are based on either the POSIX setjmp-longjmp calls or the ucontext operations.
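To illustrate the kind of primitives UTHLIB provides, the following is a minimal sketch of non-preemptive user-level thread creation and context switching built on the ucontext operations; the uth_t type and the function names are illustrative assumptions, not the actual UTHLIB interface.

    #include <ucontext.h>
    #include <stdlib.h>

    #define UTH_STACK_SIZE (64 * 1024)

    typedef struct uth {
        ucontext_t ctx;      /* saved execution context */
        void      *stack;    /* private stack of the user-level thread */
    } uth_t;

    /* Create a user-level thread that will run func() when switched to. */
    void uth_create(uth_t *uth, uth_t *return_to, void (*func)(void))
    {
        uth->stack = malloc(UTH_STACK_SIZE);
        getcontext(&uth->ctx);
        uth->ctx.uc_stack.ss_sp   = uth->stack;
        uth->ctx.uc_stack.ss_size = UTH_STACK_SIZE;
        uth->ctx.uc_link          = &return_to->ctx;  /* context resumed when func returns */
        makecontext(&uth->ctx, func, 0);
    }

    /* Cooperative (non-preemptive) switch from the current thread to another. */
    void uth_switch(uth_t *from, uth_t *to)
    {
        swapcontext(&from->ctx, &to->ctx);
    }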
Fig. 1. Modular Design (components: NANOS API, NanosRTL with master-slave operations, UTHLib, MP library, SDSM library)
The MPI library is used by the runtime system for the management of its internal structures and the explicit, though transparent, movement of data. Since MPI corresponds to an SPMD (Single Program Multiple Data) execution model, the runtime system provides several mechanisms to support a fork-join parallel execution model (e.g. memory allocation, broadcasting). Moreover, our two-level thread model requires a thread-safe MPI library. An optional component of our runtime environment is an SDSM library, which provides an additional way for implicit movement of data. The integration of the MPI and SDSM libraries is performed by linking them together and appropriately setting the environment variables that each library requires.

In master-slave computing, the master distributes the data to the slaves and waits for the processed results to be returned. A task in our runtime environment corresponds to the execution of a function on a set of data that are passed as arguments to this function. Each task is represented by a work descriptor (called a nano-thread), a data structure that encapsulates all the necessary information of a task: function pointer, arguments (private data or pointers to them), dependencies, and successors [9]. This descriptor is separated from its execution vehicle, i.e. an underlying user-level thread, which is created on demand, according to the adopted lazy stack allocation policy. Similar to Remote Procedure Call (RPC), a descriptor is associated with the data of its corresponding function. The user defines the quantity (count), MPI datatype, and passing method for each argument, information that is also stored in the descriptor. This definition is the only major extension to the existing NANOS API.

The runtime library has the same architecture as our OpenMP runtime system on top of SDSM, described thoroughly in [5]. The main difference is that the consistency protocol has been replaced with message passing: there are no shared data or stacks, and data movement has to be performed explicitly with MPI calls. Each node of the cluster (process) consists of one or more virtual processors and a special I/O thread, the Listener, which is responsible for dependence and queue management and supports the transparent and asynchronous movement of data. There are per-virtual-processor intra- and inter-node ready queues and per-node global queues. The insertion or stealing of a descriptor into or from a queue that resides in the same node (process) is performed through hardware shared memory without any data movement.
Otherwise, the operations are performed with explicit messages to the Listener of the remote node. This combination of hardware shared memory with explicit messages is also used to maintain the coherence of the fork-join execution model: each work descriptor (task) is associated with an owner node. If a task finishes on its owner node, its parent is notified directly through (hardware) shared memory. Otherwise, its descriptor returns to the owner node and the notification is performed by the Listener.
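To make this concrete, the following is a minimal sketch of what such a work descriptor might look like; the type and field names are illustrative assumptions and not the actual NanosRTL definitions.

    #include <mpi.h>

    #define MAX_ARGS 16

    /* Illustrative passing-method definitions (see Section 2.2). */
    enum { CALL_BY_VAL, CALL_BY_PTR, CALL_BY_REF, CALL_BY_RES, CALL_BY_SVM };

    typedef struct nano_arg {
        int          count;       /* number of elements */
        MPI_Datatype datatype;    /* MPI datatype of the argument */
        int          passing;     /* one of the CALL_BY_* definitions */
        void        *data;        /* scalar value or pointer to the argument */
    } nano_arg_t;

    typedef struct nano_desc {
        void (*func)();                  /* function executed by the task */
        int          nargs;              /* number of arguments */
        nano_arg_t   args[MAX_ARGS];     /* per-argument MPI description */
        int          owner_node;         /* node (process) that created the task */
        int          depend_count;       /* unresolved dependencies */
        struct nano_desc *successors;    /* descriptors notified on completion */
    } nano_desc_t;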
2.2 Data Movement
The insertion of a descriptor into a remote queue corresponds to MPI_Send calls for the descriptor and any arguments, based on their description. On the other side, the Listener accepts the descriptor, analyzes its information, and allocates the necessary space to receive the arguments. The runtime system ensures the coherence of the descriptor and its data by appropriately setting the tag field in each sent MPI message. Specifically, the tag field denotes the local identifier of the virtual processor that sends the message. Once the Listener has received a descriptor, it receives the subsequent data from the source of the message (i.e. the specific virtual processor). A virtual processor executing a descriptor actually executes the function with the locally stored arguments. When it finishes, it sends the descriptor back to its owner node, along with any arguments that represent results. These are received asynchronously by the Listener and copied to their actual memory locations in the address space of the owner.

All the aforementioned movement of data is transparent to the user; the only point that reveals the underlying MPI programming is the description of the arguments. Currently, the following definitions determine the passing method for an argument:

– CALL_BY_VAL: The argument is passed by value. If it is a scalar value, it is stored directly in the descriptor.
– CALL_BY_PTR: As above, but the scalar value has to be copied from the specified address. This option can also be used for sending user-defined MPI datatypes.
– CALL_BY_REF: The argument represents data that are sent with the descriptor and returned as a result to the owner node's address space.
– CALL_BY_RES: No data has to be sent, but data will be returned as a result. The target node assumes it receives data initialized to zero.
– CALL_BY_SVM: The argument is an address in the shared virtual memory of a software DSM system. Data will be transferred implicitly, through the memory consistency protocol.

For arguments (data) that represent results and return to the master, the runtime system supports some primary reduction operations that can be performed on both single values and arrays. The following definitions can be combined (OR'ed) with the passing method definitions:
– OP_SUM: The results (returned data) are added to the data that reside at the specified memory address.
– OP_MUL: The results are multiplied with the data that reside at the specified memory address.

The first list can be expanded with other forms of data management/fetching, such as MPI one-sided communications or SHMEM operations, and the second with more reduction operations.
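As an illustration of how a returned result could be merged into the owner's memory, the sketch below applies these reduction operations to a received buffer of doubles; the function, the flag values, and the double-typed buffer are assumptions made for the example, not the library's actual interface.

    enum { OP_NONE = 0, OP_SUM, OP_MUL };   /* illustrative reduction flags */

    /* Merge a returned argument (recv) into the owner's copy (dest). */
    static void merge_result(double *dest, const double *recv, int count, int op)
    {
        for (int i = 0; i < count; i++) {
            if (op == OP_SUM)
                dest[i] += recv[i];        /* accumulate into existing data */
            else if (op == OP_MUL)
                dest[i] *= recv[i];        /* multiply with existing data */
            else
                dest[i] = recv[i];         /* plain CALL_BY_REF / CALL_BY_RES result */
        }
    }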
2.3 Load Balancing
According to the Nano-threads Programming Model, the scheduling loop of a virtual processor is invoked when the current user-level thread finishes or blocks. A virtual processor visits the ready queues in a hierarchical way in order to find a new descriptor to execute. The stealing of a descriptor from a remote queue includes the corresponding data movement, given that the descriptor does not return to its owner node. Stealing is performed synchronously: the virtual processor that issued the request waits for an answer from the Listener of the target node. The answer is either a special descriptor denoting that no work was available at that node, or a ready descriptor, which will be executed directly by the virtual processor. Due to the adopted lazy stack allocation policy [4], a descriptor (and its data) can be transferred among the nodes without complications, since an underlying thread is created only at the time of its execution. The two-level thread model of our runtime environment favors a fine decomposition of applications, which can be exploited to achieve better load balancing. These conditions minimize, or even eliminate, the need for user-level thread migration, which is difficult to support in heterogeneous and metacomputing environments.
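A minimal sketch of this hierarchical scheduling loop is shown below; the queue and steal helpers, as well as the reuse of the nano_desc_t type sketched in Section 2.1, are hypothetical placeholders rather than the runtime's actual internal interface.

    /* Hypothetical internal primitives (not the actual NanosRTL interface). */
    extern nano_desc_t *local_dequeue(int vp_id);            /* own ready queues      */
    extern nano_desc_t *node_dequeue(int node);              /* per-node global queue */
    extern nano_desc_t *request_steal(int node, int vp_id);  /* ask a remote Listener */

    /* Scheduling loop of a virtual processor: visit the ready queues
       hierarchically until a ready descriptor is found. */
    nano_desc_t *find_next_task(int vp_id, int my_node, int num_nodes)
    {
        nano_desc_t *desc;

        /* 1. Own intra- and inter-node ready queues (hardware shared memory). */
        if ((desc = local_dequeue(vp_id)) != NULL)
            return desc;

        /* 2. Global queue of the local node. */
        if ((desc = node_dequeue(my_node)) != NULL)
            return desc;

        /* 3. Remote nodes: synchronous steal requests; the Listener replies with
              either a ready descriptor or a "no work" indication (NULL here). */
        for (int node = 0; node < num_nodes; node++) {
            if (node == my_node)
                continue;
            if ((desc = request_steal(node, vp_id)) != NULL)
                return desc;    /* data has already moved with the descriptor */
        }
        return NULL;            /* no work found; the caller may retry or idle */
    }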
2.4 Integration with SDSM
As mentioned above, we have managed to integrate an SDSM library into this runtime environment, which provides a method for implicit movement of data. A distinct classification of data is required, as described in [2], since data that reside in shared virtual memory cannot be sent explicitly (with MPI_Send). On Windows platforms, we have successfully integrated the MPIPro [8] and SVMLib [11] libraries. With SVMLib, the user can disable the SDSM protocol for a specific memory range, a feature that allows the explicit movement of shared data. On Unix (Linux), we have integrated MPIPro with Mome [6]. According to Mome's weak consistency model, a process can issue explicit consistency requests for specific views of a shared-memory segment, a feature that provides a receiver-initiated method for data transfer. This integration of MPI and SDSM provides a very flexible, though complex, programming environment that can be exploited by unifying our OpenMP environment on top of SDSM with the work presented in this paper. However, a thorough study of this integration is beyond the scope of this paper.
3 OpenMP Extensions
In [13], the authors propose a parallel programming model for cluster and global computing using OpenMP and a thread-safe remote procedure call facility, OmniRPC. Multiple remote procedure calls can be outstanding simultaneously from multithreaded programs written in OpenMP, using parallel for or sections constructs. These constructs have the disadvantage that all work units that can be executed must be known at the time the construct begins execution, and they have difficulties handling applications with irregular parallelism. Moreover, in master-slave (client-server) computing, the master (client) usually executes task-specific initializations before distributing the tasks to the slaves (servers).

The execution model that addresses these issues and fits best in our case is the proposed OpenMP workqueuing model [15]. This model is a flexible mechanism for specifying units of work that are not pre-computed at the start of the worksharing construct. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread, and then the code inside the taskq block is executed single-threaded. The task pragma specifies a unit of work, potentially executed by a different thread. The following fragment of code is a typical example that uses the workqueuing constructs in order to execute task parallelism. This pattern is found in most master-slave message passing applications, where the master distributes a number of tasks among the slaves (do_work), usually after having performed some task-specific initialization (j = i*i):

    int i, j;
    #pragma intel omp parallel taskq
    {
        for (i = 0; i < 10; i++) {
            j = i*i;
            #pragma intel omp task
            {
                do_work(j);
            }
        }
    }
3.1 Workqueuing Constructs
In this section, we present our extensions to the proposed OpenMP workqueuing model for master-slave message passing computing. Since we target a pure distributed memory environment with private address spaces, a default(private) clause is implied naturally in the OpenMP programs. Our OpenMP compilation environment exploits the two workqueuing constructs presented in the previous example (intel omp parallel taskq and intel omp task). For convenience, we omit the intel keyword and replace omp with domp. These directives are parsed by a source-to-source translator (comp2mpi), which generates an equivalent program with calls to our runtime library. Similar directives have been defined for the Fortran language and a corresponding translator has been developed. Neither of the two constructs is altered; only the second one is extended with two optional clauses (schedule and callback):
    #pragma domp parallel taskq
    #pragma domp task [schedule(processor|node)] [callback(<function>)] [<clauses>]
The MPI description for the arguments of a function is provided with the following format:

    #pragma domp function <function name> <number of arguments>
    #pragma domp function <argument description>
    ...
    #pragma domp function <argument description>

where,

    <argument description> -> <count>,<datatype>,<passing method>[|<reduction>]
    <count>                -> number of elements
    <datatype>             -> valid MPI data type
    <passing method>       -> CALL_BY_VAL | CALL_BY_PTR | CALL_BY_REF |
                              CALL_BY_RES | CALL_BY_SVM
    <reduction>            -> OP_SUM | OP_MUL
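For instance, a hypothetical slave routine whose second argument is an array of 100 doubles, returned as a result and accumulated (OP_SUM) into the master's copy, could be described as follows; the function name, the computation, and the array size are illustrative assumptions.

    void accumulate(int seed, double *partial)
    {
        #pragma domp function accumulate 2
        #pragma domp function 1,MPI_INT,CALL_BY_VAL
        #pragma domp function 100,MPI_DOUBLE,CALL_BY_RES|OP_SUM
        int i;
        for (i = 0; i < 100; i++)
            partial[i] = (double)seed * i;   /* partial results, reduced on return */
    }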
Finally, the schedule clause determines the task distribution scheme that will be followed by the OpenMP translator. Currently, the generated tasks can be distributed on a per-processor (default) or on a per-node basis. The latter can be useful for tasks that generate intra-node (shared-memory) parallelism either by using the NANOS API or through OpenMP directives and the NanosCompiler. Alternatively, there can be one virtual processor per node and the intra-node OpenMP parallelism can be executed using other OpenMP compilation environments.
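The fragment below is a hypothetical illustration of the per-node scheme: each task placed with schedule(node) exploits intra-node shared-memory parallelism through a plain OpenMP directive. The function and variable names, the block size, and the combination with a standard omp parallel for are assumptions made for the example.

    void process_block(int n, double *block)
    {
        #pragma domp function process_block 2
        #pragma domp function 1,MPI_INT,CALL_BY_VAL
        #pragma domp function 1024,MPI_DOUBLE,CALL_BY_REF
        int i;
        #pragma omp parallel for            /* intra-node (shared-memory) parallelism */
        for (i = 0; i < n; i++)
            block[i] = block[i] * block[i];
    }

    void master(double *data, int nblocks)
    {
        int b;
        #pragma domp parallel taskq
        for (b = 0; b < nblocks; b++) {
            #pragma domp task schedule(node)
            process_block(1024, &data[b*1024]);
        }
    }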
3.2 Examples
Figure 2 presents two programs that use our OpenMP extensions. The first application is a recursive version of Fibonacci. It exhibits multilevel parallelism, and each task generates two new tasks that are distributed across the nodes. If we remove the pragmas that describe the function and the schedule clause, the resulting code is compliant with the proposed workqueuing model. The second example is a typical master-slave application: each task computes the square root of a given number and returns the result to the master. Tasks are distributed across the processors, and whenever a task completes, a callback function (cbfunc) is executed asynchronously.
4 Experimental Evaluation
Our runtime environment is portable and allows the execution of the same application binary on both shared-memory multiprocessors and distributed memory
Fig. 2. Example programs using the proposed OpenMP extensions:

    /* Fibonacci */
    void fib(int n, int *res)
    {
        #pragma domp function fib 2
        #pragma domp function 1,MPI_INT,CALL_BY_VAL
        #pragma domp function 1,MPI_INT,CALL_BY_REF
        int res1 = 0, res2 = 0;

        if (n < 2) {
            *res = n;
        } else if (n < 30) {
            fib(n-1, &res1);
            fib(n-2, &res2);
            *res = res1 + res2;
        } else {
            #pragma domp parallel taskq schedule(node)
            {
                #pragma domp task
                fib(n-1, &res1);
                #pragma domp task
                fib(n-2, &res2);
            }
            *res = res1 + res2;
        }
    }

    void main()
    {
        int res, n = 36;
        fib(n, &res);
    }

    /* Master-Slave Demo Application */
    void taskfunc(double in, double *out)
    {
        #pragma domp function taskfunc 2
        #pragma domp function 1,MPI_DOUBLE,CALL_BY_VAL
        #pragma domp function 1,MPI_DOUBLE,CALL_BY_RES
        *out = sqrt(in);
    }

    void cbfunc(double in, double *out)
    {
        printf("sqrt(%f)=%f\n", in, *out);
    }

    void main()
    {
        int cnt = 100, i;
        double *result;
        double num;

        result = (double *)malloc(cnt*sizeof(double));
        #pragma domp parallel taskq
        for (i = 0; i