
Experiences on the Implementation of PARMACS Macros Using Different Multiprocessor Operating System Interfaces

Ernest Artiaga, Nacho Navarro, Xavier Martorell, Yolanda Becerra, Marisa Gil, Albert Serra
Department of Computer Architecture
Universitat Politècnica de Catalunya
c/ Jordi Girona 3, Mòdul D6, 08071 Barcelona, Spain
e-mail: {ernest, nacho, xavim, yolandab, marisa, alberts}@ac.upc.es

ABSTRACT

In order to evaluate how good parallel systems are, it is necessary to know how parallel programs behave. The SPLASH-2 applications provide us with a realistic workload for such systems. We have therefore made several implementations of the PARMACS macros used by the SPLASH-2 applications, based on different execution and synchronization models, from classical Unix processes to multithreaded systems. The results have been tested on two different multiprocessor systems (Digital and Silicon Graphics). As the parallel constructs in the SPLASH-2 applications are limited to those provided by PARMACS, we can easily study the overhead introduced by synchronization and parallelism management.

KEYWORDS: Microkernel, multithreaded, multiprocessor, ANL macros, PARMACS, SPLASH-2, parallel applications, Mach, CThreads, Irix, Sprocs.

1. Introduction

In order to evaluate how good parallel systems are, it is necessary to know how parallel programs behave when running on them. Evaluation cannot rely on very simple and unrealistic test codes; it is desirable to test the real, complete programs that will eventually run on such systems.

The Stanford Parallel Applications for Shared-Memory (SPLASH) [SING92] is a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. It contains programs that represent a wide range of computations in scientific, engineering and graphics domains. The SPLASH applications are written in C, using the PARMACS macros from ANL [LUSK87] for parallel constructs. These macros implement code for concurrency and synchronization for different architectures, so that the applications are programmed in an architecture-independent way and can be ported to other systems by simply changing the implementation of the macros.

In our group we are developing a parallel execution environment, based on multiprocessor multithreaded microkernels [BECE96]. The environment will provide a set of tools (libraries, servers, ...) to extract the maximum performance from parallel applications [GIL95]. We use the SPLASH-2 test suite to study the behavior of parallel applications. The macros will be instrumented in order to obtain a trace of the parallel behavior of the programs (including synchronization, context switches, ...). This information can be passed as hints to the operating system during program execution, to help it take the most appropriate scheduling and resource management decisions for each application, improving the performance of the system.

This paper describes our experience implementing PARMACS to execute the SPLASH programs. Our intention is to observe the real behavior of the system not by simulation, but by actually executing the applications on a parallel machine. Different implementations are used to study the benefits of multithreaded models with respect to classical Unix implementations.

This work has been supported by the Ministry of Education of Spain (CICYT) under contract TIC 94-439.

2. The test suite

We have decided to use the SPLASH-2 [WOO95] application suite as a workload to test the system. These programs cover a wide range of scientific applications and are commonly used for architectural studies. They comprise a set of complete applications and computational kernels (i.e. implementations of algorithms widely used in scientific computing). Some of the programs it currently contains are shown in Table 1.

Application   Description
Barnes        Interaction of a system of bodies in three dimensions, using the Barnes-Hut hierarchical N-body method.
Cholesky      Factors a sparse matrix into the product of a lower triangular matrix and its transpose.
LU            Factors a dense matrix into the product of a lower triangular and an upper triangular matrix.
Ocean         Studies large-scale ocean movements based on eddy and boundary currents.
Radiosity     Computes the distribution of light in a scene using the iterative hierarchical diffuse radiosity method.

TABLE 1. Some SPLASH-2 programs

The SPLASH-2 programs are written in C, extended with a set of macros for parallel constructs (PARMACS). These macros were developed at the Argonne National Laboratory and can be used with the m4 preprocessor. PARMACS offers basic primitives for synchronization, creation of parallel processes and shared memory allocation [LUSK87][BOYL87]. Implementations for several systems such as the Encore Multimax, SGI, Alliant and others are publicly available; nevertheless, it is often necessary to modify the implementation to adapt the macros to specific systems or special requirements. In our case, we have decided to make our own implementations of the PARMACS macros in order to adapt them to the capabilities of modern operating systems, while trying to keep the different versions comparable. Special care has been taken to minimize the impact of the macro code on the program behavior. Macros have been implemented for System V IPC, Unix BSD, OSF/1 on Mach 3.0 (using the CThreads library) and SGI's Irix 6.2 (using sprocs), focusing our efforts on the subset of PARMACS used by SPLASH-2 (Table 2). A brief description of the macros can also be found in [ARTI97]. As the PARMACS macros can be implemented for different architectures and parallel models (Unix processes, threads, etc.), we will use the word process to denote an execution flow regardless of its actual implementation, unless otherwise stated. A sketch of how an application uses the macros is shown below.
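To make the use of the macros concrete, the following is a small, hypothetical PARMACS-style program written in the spirit of the SPLASH-2 sources (it is not taken from the suite). The macro names follow Table 2; the shared-memory size passed to MAIN_INITENV, the punctuation around macro invocations and the exact expansion all depend on the particular PARMACS implementation chosen with m4, so treat those details as assumptions.

```c
/* example.C -- hypothetical pre-m4 PARMACS source, in the style of the
 * SPLASH-2 programs (not part of the suite).  Each upper-case macro is
 * expanded by m4 into the code of the selected PARMACS implementation. */
#include <stdio.h>

MAIN_ENV                          /* PARMACS environment declarations */

#define NPROCS 4

BARDEC(start)                     /* barrier shared by all processes  */
LOCKDEC(sum_lock)                 /* lock protecting the shared sum   */

double *sum;                      /* will point into shared memory    */

void worker(void)
{
    BARRIER(start, NPROCS);       /* wait until every process is ready */
    LOCK(sum_lock);
    *sum += 1.0;                  /* short critical section            */
    UNLOCK(sum_lock);
}

int main(void)
{
    int i;

    MAIN_INITENV(,8000000);       /* set up the PARMACS environment;
                                     the size argument is illustrative */
    BARINIT(start);
    LOCKINIT(sum_lock);
    sum = (double *) G_MALLOC(sizeof(double));
    *sum = 0.0;

    for (i = 1; i < NPROCS; i++)
        CREATE(worker);           /* spawn NPROCS-1 child processes    */
    worker();                     /* the parent also takes part        */

    WAIT_FOR_END(NPROCS - 1);     /* wait for the children to finish   */
    printf("sum = %.1f\n", *sum);
    MAIN_END;                     /* terminate the PARMACS environment */
    return 0;
}
```

A source like this would typically be run through m4 together with the macro file of the chosen version before being compiled.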

3. Architectural background

In order to fully exploit the services offered by a particular system, it is important to know its characteristics. Most programs are portable from one system to another, but efficiency is normally not maintained across ports to different architectures without modifying the code. Two different shared-memory multiprocessors have been used: a DEC 433 MP (4 processors) running Mach 3.0, and a Silicon Graphics Power Challenge R10000 (12 processors) running Irix. We intend to compare the performance of a traditional Unix implementation of the PARMACS macros with implementations that use the multithreading capabilities of both the Mach and Irix systems.

Mach is a multiprocessor multithreaded microkernel [ACCE86][BLAC90]. All threads in a task share the same address space, so parallelism can be easily exploited by using several threads in a single task. The threads offered by the Mach kernel are much lighter than Unix processes, so their use results in an efficiency improvement [OSF93]. The CThreads library for Mach 3.0 provides lightweight user-level threads and synchronization primitives [COOP88].

The SGI Power Challenge runs Irix64 6.2, a Unix-like operating system [SGI95]. Like Mach 3.0, Irix also supports multithreading, using sprocs. The sproc mechanism allows the creation of processes that can share the same address space and file descriptors. Creating sprocs is not as expensive as creating ordinary Unix processes. Low-level, hardware-based synchronization primitives like test_and_set are available at user level.
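As a concrete illustration of the user-level primitives just mentioned, the following is a minimal sketch of a spin lock built on a test-and-set operation. The Irix-specific call is not reproduced here: GCC's __sync_lock_test_and_set builtin and POSIX sched_yield are used purely as stand-ins for the hardware primitive and for "yield the CPU", so this is an illustration rather than the authors' code.

```c
/* Sketch of a user-level spin lock built on a test-and-set primitive.
 * On Irix the hardware-based primitive mentioned in the text would be
 * used; here GCC's __sync_lock_test_and_set builtin stands in for it,
 * and sched_yield() stands in for whatever yield call the target
 * system provides.  Illustrative only. */
#include <sched.h>

typedef volatile unsigned long spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* __sync_lock_test_and_set atomically writes 1 and returns the
     * previous value: 0 means we acquired the lock. */
    while (__sync_lock_test_and_set(l, 1UL) != 0UL)
        sched_yield();          /* back off instead of burning the CPU */
}

static void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);     /* atomically reset the flag to 0 */
}
```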

MAIN_ENV, EXTERN_ENV
    Variable and symbol definitions for the PARMACS environment.

MAIN_INITENV(,unsigned int shared_mem), MAIN_END
    Initialization and termination of the PARMACS environment. They should be the first and the last statements in the application.

CLOCK(unsigned long time)
    Get the current time.

CREATE(void (*proc)(void)), WAIT_FOR_END(int n)
    Create a new process, starting in the proc routine. Wait for the child processes to finish.

G_MALLOC(int size), G_FREE(void *ptr)
    Allocate and deallocate shared memory.

LOCKDEC(variable), LOCKINIT(lock l), LOCK(lock l), UNLOCK(lock l)
    Declaration, initialization and usage of binary semaphores.

ALOCKDEC(variable, n), ALOCKINIT(alock al, int n), ALOCK(alock al, int i), AULOCK(alock al, int i)
    Declaration, initialization and usage of arrays of binary semaphores. Note that they do not provide atomic operations on several semaphores of the array simultaneously.

BARDEC(variable), BARINIT(barrier b), BARRIER(barrier b, int nprocs)
    Declaration, initialization and usage of barriers.

GSDEC(variable), GSINIT(gs g), GETSUB(gs g, int i, int max, int nprocs)
    Global subscript management for self-scheduled loops. Each call to GETSUB returns a unique subscript from 0 to max. At the end of the loop, -1 is returned and each process waits for the others.

PAUSEDEC(variable [,n]), PAUSEINIT(event ev [, int n]), SETPAUSE(event ev [, int n]), CLEARPAUSE(event ev [, int n]), WAITPAUSE(event ev [, int n]), PAUSE(event ev [, int n]), EVENT(event ev [, int n])
    Operations for synchronization via events. PAUSEDEC declares an array of events. Each event can be set or cleared using SETPAUSE and CLEARPAUSE. The rest of the operations block processes waiting for a certain event to be set or cleared. PAUSE and EVENT reset the event when the caller is awakened.

TABLE 2. PARMACS macros used in SPLASH-2


4. Implementing the PARMACS macros

In order to test the behavior of parallel applications running in parallel environments, we have implemented both multithreaded and multiprocess versions of the PARMACS macros. This allows us to execute the SPLASH applications.

4.1. General Comments

As a reference, we have developed parallel versions of the macros based on traditional Unix processes. The first one is based on System V IPC and runs on both platforms (DEC and SGI). Another version is based on BSD (using mmap and msemaphores). Finally, we have developed multithreaded implementations of the macros. The version for Mach is based on the CThreads library, which provides user-level threads (and can also be used as a friendly interface to the kernel threads).

The Irix version uses sprocs as lightweight processes to obtain parallelism.

The implementation of the different versions of the PARMACS macros has required the addition of data structures with both shared and per-process information. These new data structures must reside somewhere in the application address space, and the SPLASH applications (and scientific applications in general) group and/or distribute their data in memory in order to achieve the best memory/cache performance (by trying to reduce page faults and cache misses). So, it is important to check whether the new data introduced by the PARMACS implementation changes the location of the application data. We have decided to group the data structures needed to implement the PARMACS macros and to allocate memory for them at the beginning of the program execution. This way, we can easily distinguish the memory accesses due to the PARMACS data structures from the memory access pattern of the application, and thus measure the noise introduced by the parallel implementation in the memory behavior.

Variables used for synchronization (such as barriers, locks, events, etc.) are visible to the programmer, who is responsible for locating them in the proper place in the application space. In the traditional Unix versions, synchronization variables are merely pointers to the shared PARMACS data structures, where the synchronization data resides; this is necessary to ensure that any process is able to access and modify the data. In the multithreaded versions, however, the synchronization data is located where the variable is declared: all processes can access it because they all share the same address space (Figure 1).

FIGURE 1. Implementation of the PARMACS synchronization variables. In the Unix versions, the sync variable can be stored in local memory and holds an index into the synchronization data (value, mutex data, ...) kept in the shared _parmacs_data structure; in the multithreaded versions, the synchronization data is stored where the variable is declared, since the whole address space is shared.
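The difference between the two layouts in Figure 1 can be summarized with the following hypothetical type declarations; the actual type and field names of the authors' implementation are not given in the paper and are invented here for illustration.

```c
/* Hypothetical sketch of the two representations of a PARMACS lock
 * variable described above; the real field and type names used by the
 * authors are not given in the paper. */

/* Unix (System V / BSD) versions: the variable visible to the program
 * is only an index into the shared _parmacs_data area, where the real
 * synchronization data lives so that every process can reach it. */
typedef struct {
    int index;                      /* slot in _parmacs_data->sync_array */
} parmacs_lock_unix_t;

/* Multithreaded (sprocs / CThreads) versions: the whole address space
 * is shared, so the synchronization data can be stored right where the
 * variable is declared. */
typedef struct {
    volatile unsigned long value;   /* the lock word itself */
    /* ... mutex/condition data for the blocking variants ... */
} parmacs_lock_threaded_t;
```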

Note that the SPLASH-2 applications allocate all the memory needed for their execution dynamically after the program starts, so that memory is correctly located and aligned to contain the application data and does not interfere with the PARMACS data. In our implementations, we have tried to minimize the impact of the effects described in the previous paragraphs and to reduce the overhead of parallelism management to a minimum.

4.2. PARMACS macros implementation

In this section we comment on our implementations of the PARMACS macros. Versions have been designed for System V, BSD, Irix 6.2 and Mach OSF/1 (using the CThreads library). A more detailed description of each implementation can be found in [ARTI97].

System V IPC provides very rich functionality, especially for semaphores, but at a high cost in system time. The macros for System V rely heavily on operating system functionality, so a great number of system calls are made during program execution, which increases the execution time significantly. An important restriction of this implementation is that the number of IPC objects is limited by the system. Moreover, IPC objects outlive the application that created them, so PARMACS must keep track of the IPC resources in order to deallocate them at the end of the application. A sketch of a lock built on System V semaphores is shown below.

The functionality provided by BSD is simpler, and thus faster. An important part of the synchronization mechanisms is implemented at user level and introduces much less overhead than the System V implementation. In both the System V and BSD implementations, parallelism is obtained via classical Unix processes, each with its own private address space.
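To make the per-operation cost argument concrete, the following sketch shows how a LOCK/UNLOCK pair could be mapped onto a System V IPC semaphore using the standard semget/semctl/semop interface. It is an illustration of the approach, not the authors' code, and error handling is omitted.

```c
/* Sketch of LOCK/UNLOCK on top of System V IPC semaphores.  Each lock
 * or unlock is a semop() system call, which is where the system time
 * discussed in the text comes from.  Error handling omitted. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Create one binary semaphore, initially unlocked (value 1).  The
 * returned identifier must be remembered so that it can be removed
 * with semctl(..., IPC_RMID, ...) when the application ends. */
static int sysv_lock_init(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun { int val; struct semid_ds *buf; unsigned short *array; } arg;
    arg.val = 1;
    semctl(semid, 0, SETVAL, arg);
    return semid;
}

static void sysv_lock(int semid)
{
    struct sembuf op = { 0, -1, 0 };   /* P: wait and decrement        */
    semop(semid, &op, 1);              /* one system call per LOCK     */
}

static void sysv_unlock(int semid)
{
    struct sembuf op = { 0, +1, 0 };   /* V: increment, wake a waiter  */
    semop(semid, &op, 1);              /* one system call per UNLOCK   */
}
```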

FIGURE 2. Relationship between PARMACS data structures and per-thread data (PRDAs for Irix sprocs, and cthread structs for CThreads on Mach 3.0). Stacks, application data and the _parmacs_data structure (per-process entries with process identifiers, timestamps and synchronization information, plus the parent PARMACS code and process data) live in the shared address space; each sproc's PRDA user area, or each cthread's user-data field, holds a pointer to its own per-process entry.

The Irix version uses the sproc facility to implement parallelism. The sproc system call creates a new process that is a clone of the calling process. The new child shares the address space of the parent process, though the parent and the child each have their own program counter and stack pointer. Each sproc also has its own private data area (PRDA), which is not shared with other processes. Part of the PRDA can be used by the application, and it is mapped at a fixed virtual address. In our implementation, the PRDA contains a pointer to the per-process data in the global PARMACS data structure, as shown in Figure 2.

The PARMACS implementation for OSF Mach 3.0 is based on the CThreads library. The CThreads package is a run-time library that provides primitives for manipulating threads of control in support of multithreaded programming. CThreads are scheduled on top of Mach kernel threads, using a FIFO policy. They can also be wired to kernel threads to obtain a one-to-one mapping between user-level and kernel-level threads. All Mach threads in a task share all the task resources, especially the address space. To provide private data for each cthread, the CThreads package allows a small amount of user data to be stored in the cthread structure itself. In our implementation, each cthread keeps a pointer to its own per-process entry in this private user-data field (Figure 2).
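A sketch of how each execution flow could locate its own per-process entry follows. The CThreads calls (cthread_self, cthread_set_data, cthread_data) follow the interface described in [COOP88], but their exact signatures should be treated as assumptions; the Irix counterpart, which would keep the same pointer in the user part of the PRDA, is only indicated in the comments. The structure and function names below are hypothetical.

```c
/* Sketch of per-flow data lookup in the multithreaded versions.  The
 * CThreads calls follow the cthreads.h interface of [COOP88]; treat
 * the exact names as assumptions.  The Irix version would instead
 * store the pointer in the application-usable part of the PRDA, which
 * is mapped at a fixed virtual address.  Illustrative only. */
#include <cthreads.h>

struct parmacs_per_process {
    int id;                     /* logical PARMACS process number */
    /* ... timestamps, statistics, ... */
};

/* Hypothetical per-process table inside the global PARMACS data. */
extern struct parmacs_per_process _parmacs_per_process[];

/* Called once in each newly created cthread. */
static void parmacs_register_self(int id)
{
    cthread_set_data(cthread_self(),
                     (void *) &_parmacs_per_process[id]);
}

/* Used by the macros to reach the caller's private entry. */
static struct parmacs_per_process *parmacs_my_data(void)
{
    return (struct parmacs_per_process *) cthread_data(cthread_self());
}
```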

In the next subsections we describe particular details of the implementation of the different PARMACS macros listed in Table 2.

4.2.1. Initialization

In order to use the PARMACS macros, certain data structures must be maintained. As discussed in Section 4.1, these structures are packed together in order to have greater control over their location. The PARMACS information is initialized by MAIN_INITENV, and this macro is responsible for making the data accessible from all processes. Since traditional Unix processes do not share the address space, the PARMACS structures must be explicitly located in shared memory (using the corresponding system calls). For the same reason, the PARMACS main data structure in the Unix versions also has space for the data required by the synchronization variables. The System V version additionally keeps a list of the IPC identifiers so that they can be disposed of at the end of the execution. The Irix and Mach versions, on the other hand, need space to keep per-process information. Most Unix applications use address 0 (zero) to detect null pointers, and the ability of Mach to allocate memory at that address can confuse them; for this reason, the PARMACS implementation for Mach allocates and protects memory at that logical address during initialization.

4.2.2. Shared memory management

Shared memory is allocated using the G_MALLOC macro. For the multithreaded versions (Irix and Mach), G_MALLOC is simply implemented as a normal malloc call. This works because the address space is shared by all PARMACS processes. The multiprocess versions (System V and BSD) need special system calls to allocate shared memory. In these versions, all the shared memory needed by the application is allocated in the MAIN_INITENV macro, and G_MALLOC invocations return portions of that shared area.

4.2.3. Process creation

PARMACS processes are created via the fork system call in the implementations for System V and BSD. Synchronization with child termination is performed via the wait system call. The implementation for Irix uses the sproc system call to create a new process that shares the address space with its parent. The Mach 3.0 PARMACS implementation creates processes via the cthread_fork call, thus creating a new cthread. The number of kernel threads to use can be tuned at execution time; by default there is no limit, so each cthread has a corresponding Mach kernel thread. In both multithreaded versions, synchronization with child termination is implemented at user level. A sketch of G_MALLOC and CREATE is shown below.
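The following sketch summarizes how G_MALLOC and CREATE could look in the different versions. The names and the bump-pointer allocator are hypothetical simplifications (no G_FREE, no error handling); the Irix and Mach creation calls (sproc, cthread_fork) are only quoted in comments, and the authors' actual code is described in [ARTI97].

```c
/* Hypothetical sketch of G_MALLOC and CREATE; not the authors' code. */
#include <stdlib.h>
#include <unistd.h>

/* Multithreaded versions (Irix sprocs, Mach CThreads): the whole
 * address space is shared, so G_MALLOC can be the normal allocator. */
#define G_MALLOC(n)  malloc(n)

/* Unix versions (System V, BSD): all shared memory is obtained once
 * in MAIN_INITENV (shmget/shmat or mmap); G_MALLOC then carves pieces
 * out of that region with a simple bump pointer.  In a real version
 * the bump pointer itself would also have to live in shared memory
 * (or allocation be restricted to the initial process); that detail
 * is glossed over here. */
static char  *shared_area;          /* mapped by MAIN_INITENV */
static size_t shared_used;

static void *g_malloc_unix(size_t n)
{
    void *p = shared_area + shared_used;
    shared_used += (n + 7) & ~(size_t) 7;   /* keep 8-byte alignment */
    return p;
}

/* CREATE in the Unix versions: plain fork(); WAIT_FOR_END then maps
 * to repeated wait() calls.  The Irix version would call sproc(entry,
 * ...) to share the address space, and the Mach version
 * cthread_fork(entry, arg); those calls are not reproduced here. */
static void create_unix(void (*proc)(void))
{
    if (fork() == 0) {
        proc();
        _exit(0);
    }
}
```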

4.2.4. Synchronization

The implementation of the PARMACS synchronization primitives uses functionality specific to each version. In System V, the synchronization variables have been implemented using System V IPC semaphores. Operations on semaphores include modifying the semaphore value or waiting for a certain value to be reached. System V semaphores have complex functionality which provides great flexibility, but also a high cost. Synchronization in BSD is based on msems (msemaphore structures). These structures must be located within a shared memory region and are used to implement binary semaphores. As their functionality is very simple (just locking and unlocking), complex synchronization operations are usually implemented by performing active waits on shared memory regions, protected by msemaphores. Synchronization in the Irix version is based on the test_and_set primitives provided by Irix, which are used to implement spin (active wait) locks; the different PARMACS synchronization mechanisms are then built on these spin locks. The PARMACS synchronization macros for Mach are based on the synchronization mechanisms provided by the CThreads package: mutual exclusion locks, condition variables and spin locks. Spin locks avoid the overhead of blocking and waking up threads. Condition variables are used when one thread wants to wait until some shared data changes; each condition variable must be protected by a mutex (blocking lock), and mutexes and condition variables can be combined in this way to implement monitors. All four versions use mechanisms to yield the CPU when the active waits used to implement synchronization operations could hurt performance (e.g. by preventing other processes from running). The next subsections give details about the implementation of the different synchronization macros.

Locks and arrays of locks

In the Irix and Mach implementations, locks are implemented as spin locks; as locks in SPLASH-2 protect very short sequences of instructions, it is not worthwhile to use more expensive mechanisms. As for the Unix implementations, BSD simply uses an msemaphore, and System V uses a semaphore set with a single semaphore inside. Arrays of locks are implemented as arrays of the same structures used for locks, except in System V, which uses a semaphore set with the specified number of entries.

Barriers

The implementation of barriers is quite simple in Mach. They consist of a mutex, a condition variable and a counter. Each thread entering the barrier acquires the lock, increases the counter and waits on the condition until the last thread resets the counter and sends a broadcast to all the threads blocked on the condition, releasing them (see the sketch below). Barriers in Irix and BSD are implemented using a lock for mutual exclusion and a counter, both located in shared memory. As processes enter the barrier, they increase the counter and then wait for it to reach a certain value, set by the last process to enter. The barrier structure for the System V version consists of a semaphore set with two individual semaphores and a counter. The first semaphore is used for mutual exclusion, and the second one is used to block the processes; the last process to enter the barrier releases the others. Special care must be taken to prevent a process from reusing a barrier before all processes have exited the previous call, thus ensuring the correctness of the barrier behavior.
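The Mach/CThreads barrier just described (a mutex, a condition variable and a counter) could look roughly as follows. The mutex and condition primitives follow the CThreads interface of [COOP88], but the exact declarations and the structure layout are assumptions, not the authors' code.

```c
/* Sketch of the Mach/CThreads barrier described above: a mutex, a
 * condition variable and a counter.  The mutex and condition names
 * follow the cthreads.h interface; treat them as assumptions. */
#include <cthreads.h>

struct parmacs_barrier {
    struct mutex     lock;      /* protects count                    */
    struct condition wakeup;    /* where the waiting threads block   */
    int              count;     /* threads that have arrived so far  */
};

static void barrier_init(struct parmacs_barrier *b)
{
    mutex_init(&b->lock);
    condition_init(&b->wakeup);
    b->count = 0;
}

static void barrier_wait(struct parmacs_barrier *b, int nprocs)
{
    mutex_lock(&b->lock);
    if (++b->count == nprocs) {
        b->count = 0;                    /* last thread resets ...    */
        condition_broadcast(&b->wakeup); /* ... and releases the rest */
    } else {
        condition_wait(&b->wakeup, &b->lock);
    }
    mutex_unlock(&b->lock);
}
```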

Global subscripts

Global subscripts have the same components as a barrier, plus a field containing the next subscript value. This field is protected by mutual exclusion and increased by each process invoking the GETSUB macro. When the loop is finished, a barrier is used to make all the processes finish it simultaneously (a sketch of GETSUB is shown below).

Events

Events in the Mach version are implemented using a structure which contains the value of the event, a mutual exclusion variable, and two condition variables. Setting and clearing an event simply consists of changing its value and broadcasting the proper condition variable. For Irix and BSD, events are implemented using a lock (spin lock or semaphore) and the value of the event; most of the macros simply consist of actively waiting for the event to have the proper value and then changing the value again, if necessary. Finally, events in System V consist of a semaphore set with three entries: the first one is for mutual exclusion, and the other two are used to block processes waiting for the event to be cleared or set, respectively.
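A rough sketch of the GETSUB logic follows, reusing the hypothetical spin-lock and barrier primitives from the earlier sketches (declared here only as prototypes). It illustrates the scheme described above; it is not the authors' code.

```c
/* Sketch of the global-subscript (GETSUB) logic: a shared counter
 * protected by a lock, plus a barrier once the loop is exhausted.
 * The lock and barrier primitives are the hypothetical ones sketched
 * earlier in this section. */

typedef volatile unsigned long spinlock_t;
struct parmacs_barrier;                    /* see the barrier sketch */

extern void spin_lock(spinlock_t *l);
extern void spin_unlock(spinlock_t *l);
extern void barrier_wait(struct parmacs_barrier *b, int nprocs);

struct parmacs_gs {
    spinlock_t              lock;   /* protects next                  */
    int                     next;   /* next subscript to hand out     */
    struct parmacs_barrier *bar;    /* entered once the loop is done  */
};

/* Returns a unique subscript in 0..max, or -1 once the loop is
 * exhausted; in the -1 case every process waits for the others. */
static int getsub(struct parmacs_gs *gs, int max, int nprocs)
{
    int sub;

    spin_lock(&gs->lock);
    sub = gs->next;
    if (sub <= max)
        gs->next = sub + 1;
    spin_unlock(&gs->lock);

    if (sub > max) {
        /* Re-arming the counter for the next self-scheduled loop is
         * omitted here; doing it safely is one of the details glossed
         * over in this sketch. */
        barrier_wait(gs->bar, nprocs);
        return -1;
    }
    return sub;
}
```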

5. Running the SPLASH-2 programs

In order to test our implementations, we have compiled some of the SPLASH-2 applications with the different versions of PARMACS and run them on both the DEC and the SGI machines. In this section we comment on some details needed to get the programs running. The results of some executions are given to show the differences between the implementations, though a performance analysis is beyond the scope of this paper.

5.1. General comments

Some aspects of running the SPLASH-2 applications with PARMACS are system dependent. Here we mention some issues to take into account when using our PARMACS implementations. The most relevant caveats concern the system limits for System V IPC. These limits include the maximum number of semaphores in a semaphore set, the maximum size of a shared memory segment, and the maximum number of semaphore sets and shared memory segments allocatable by a user. The maximum number of IPC resources that can be allocated affects the number of barriers, locks and other synchronization variables that the application can use. SPLASH-2 applications can usually be adjusted to fit the system limits, though this can sometimes affect their performance. Although the IPC limits only affect the implementation based on System V IPC, we have limited the number of resources used in some SPLASH-2 applications to adapt them to the system on which they would run, and we used the modified sources with all the different implementations of PARMACS in order to obtain comparable results.

5.2. An example of execution: Barnes

The Barnes program simulates the interaction of a system of bodies in three dimensions over a number of time steps, using the Barnes-Hut hierarchical N-body method. The maximum length of a lock array has been limited to fit the System V IPC limits. Table 3 shows the results of executing Barnes with the different versions of the macros on the SGI Power Challenge and the DEC 433 MP. The input parameters correspond to the base problem as defined in the SPLASH-2 documentation, using 4 processors. Measurements have been obtained with the built-in csh command time (in seconds).

Version                          User    System
SGI/IRIX:
  System V IPC                   20.5      17.6
  Irix and sprocs                19.2       0.0
DEC/OSF/1:
  System V IPC                  499.1      98.7
  BSD                           488.3       2.1
  Mach 3.0-OSF/1 and CThreads   482.7       0.7

TABLE 3. User and system times (seconds) for BARNES running on the SGI Power Challenge and the DEC 433 MP

Though an accurate analysis of the performance results is not the objective of this paper, we can say that the table above confirms the expected behavior. We can observe an important overhead in the System V version due to the use of its IPC facilities, which produce a high number of expensive system calls. Versions that rely more on user-level synchronization mechanisms perform better. Finally, using lightweight abstractions results in improvements, especially in system time, as can be seen for the multithreaded versions. Figure 3 shows the performance speed-up obtained when using from 1 to 4 processors on the DEC. Results with null macros are included for one processor. As can be seen, the BSD and Mach macros do not introduce significant overhead when running on a single processor.

FIGURE 3. Speed-up for Barnes on a DEC 433 MP (16384 particles). The plot shows execution time in seconds (0 to 720) against the number of processors (1 to 4) for the BARNES.unix.null, BARNES.bsd, BARNES.sysv and BARNES.mach.cthreads configurations.

6. Conclusions and future work

It is necessary to know how parallel applications behave on different systems. In order to test our parallel environment, we are using the SPLASH-2 benchmarks. The PARMACS macros used by these benchmarks provide an easy way to test different parallel programming models, since all parallel constructs are limited to a well-specified interface. To analyze the behavior of a new model, we can change the macro implementation and execute the same program with a different implementation of the parallel constructs. Moreover, the PARMACS macros are a good place to instrument applications in order to examine their parallel behavior, including bottlenecks and synchronization overhead. They also allow us to easily port the parallel programs to different architectures and analyze them.

We have developed several implementations of the PARMACS macros based on both system-level and user-level mechanisms. Our preliminary tests show that the user-level based implementations perform better, despite the fact that spin locks have been used at user level and, thus, some CPU time is wasted.

The next step will consist of instrumenting the macros in order to take accurate measurements. We are obtaining measurements of the parallelization overhead for the different implementations and detecting hot spots in the applications. This information will help increase the efficiency of the parallel environments.

7. Acknowledgments

We thank Angel Toribio and Albert Vila, who designed and implemented previous versions of the PARMACS macros in our group.

Bibliography

[ACCE86] M.J. Accetta et al. "Mach: A New Kernel Foundation for Unix Development", 1986 Summer USENIX Conference, July 1986.
[LUSK87] E.L. Lusk, R.A. Overbeek. "Use of Monitors in FORTRAN: A Tutorial on the Barrier, Self-scheduling DO-Loop, and Askfor Monitors", Technical Report No. ANL-84-51, Rev. 1, Argonne National Laboratory, June 1987.
[BOYL87] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, R. Stevens. "Portable Programs for Parallel Processors", Holt, Rinehart, and Winston, 1987.
[COOP88] E.C. Cooper, R.P. Draves. "CThreads", Technical Report CMU-CS-88-154, School of Computer Science, Carnegie Mellon University, February 1988.
[BLAC90] D.L. Black. "Scheduling and Resource Management Techniques for Multiprocessors", PhD Thesis, CMU-CS-90-152, Carnegie Mellon University, July 1990.
[SING92] J.P. Singh, W.-D. Weber, A. Gupta. "SPLASH: Stanford Parallel Applications for Shared Memory", Computer Architecture News, 20(1): 5-44, March 1992.
[OSF93] "OSF Mach Kernel Principles", Open Software Foundation and Carnegie Mellon University, June 1993.
[WOO95] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, A. Gupta. "The SPLASH-2 Programs: Characterization and Methodological Considerations", Proceedings of the 22nd Annual International Symposium on Computer Architecture, p. 24-36, June 1995.
[SGI95] "Power Challenge Technical Report", Silicon Graphics Computer Systems, 1995.
[GIL95] M. Gil, X. Martorell, Y. Becerra, E. Artiaga, A. Serra, N. Navarro. "Herramientas para la Gestión de la Localidad en Microkernels de Memoria Compartida" (Tools for Locality Management in Shared-Memory Microkernels), VI Jornadas de Paralelismo, July 1995.
[BECE96] Y. Becerra, E. Artiaga, A. Serra, J. Corbalán, M. Gil, X. Martorell, N. Navarro. "Soporte del Entorno Operativo a la Gestión de la Memoria para la Ejecución de Aplicaciones Paralelas" (Operating Environment Support for Memory Management in the Execution of Parallel Applications), Report UPC-DAC-1996-38, Department of Computer Architecture, Polytechnic University of Catalonia. Presented at VII Jornadas de Paralelismo, Santiago de Compostela, September 1996.
[ARTI97] E. Artiaga, N. Navarro, X. Martorell, Y. Becerra. "Implementing PARMACS Macros for Shared-Memory Multiprocessor Environments", Report UPC-DAC-1997-07, Department of Computer Architecture, Polytechnic University of Catalonia, January 1997.
