JaMP: An Implementation of OpenMP for a Java DSM

M. Klemm, M. Bezold, R. Veldema, and M. Philippsen

Computer Science Department, University of Erlangen-Nuremberg
Martensstrasse 3, 91058 Erlangen, Germany
{klemm,veldema,philippsen}@cs.fau.de, [email protected]
Abstract. In this paper we present JaMP, an adaptation of the OpenMP standard. JaMP is fitted to Jackal, a software-based DSM implementation for Java. While the set of supported directives is directly adopted from the OpenMP standard, we also satisfy all requirements enforced by the Java Language Specification and the Java Memory Model. JaMP implements a (large) subset of the OpenMP specification, but its expressiveness is still comparable to that of OpenMP. We evaluated the performance of the JaMP compiler with a set of small micro-benchmarks and a Java implementation of the Lattice-Boltzmann Method (LBM). The JaMP-parallelized version achieves about 83% of the speed-up of a manually parallelized LBM written in C and MPI.
1 Introduction
Today's High Performance Computing (HPC) landscape consists of various architectures and platforms. The two most widely used platforms are large-scale Symmetric Multiprocessors (SMPs) employing shared memory and compute clusters. The distributed nature of clusters forces the programmer to use message-oriented programming models such as MPI [11] or PVM [8] to transfer data between the nodes of the cluster. Software-based Distributed Shared Memory (S-DSM) systems hide the complexity of message passing by adding a middleware layer that automatically takes care of both accessing remote objects and maintaining memory consistency. The programmer can then concentrate on developing efficient solutions for the scientific problem at hand rather than worrying about placing send and receive operations at the right locations in the program. The Jackal project [15] implements an S-DSM system for the Java programming language. When compiling a Java program natively for a given platform, Jackal inserts so-called access checks that form the basis of Jackal's S-DSM implementation. Each access to a Java object is prefixed by an access check that
tests whether or not the object is already cached at the local node. If the object is not cached, the access check invokes the runtime system to transfer the object to the local node. If the object is already cached, no further action is required. For SMP machines, OpenMP [13] has become a widely accepted, standardized programming model. With OpenMP, the programmer writes a sequential version of the program and then inserts special directives that the OpenMP compiler uses to generate a parallel version of the program on top of a thread library. The remainder of the paper is organized as follows. Section 2 gives a short overview of related work in the field of OpenMP and DSMs. Section 3 briefly introduces the Jackal DSM project. Section 4 shows how the OpenMP directives were adapted to the Java programming language. The implementation of JaMP in Jackal is presented in Section 5. Section 6 evaluates the performance of JaMP. Section 7 discusses future work, and Section 8 concludes.
2 Related Work
POSIX threads [3, 5] and the Java Threading API [9] are two popular APIs for parallelizing programs on SMPs. In both approaches the programmer is explicitly responsible for writing a parallel version of the program by creating, synchronizing, and joining worker threads. With OpenMP [13], the compiler is able to semi-automatically parallelize programs. OpenMP allows the programmer to write a sequential program and enrich it with OpenMP directives that tell the compiler how to transform the program into a parallel version with the help of a thread library such as POSIX threads. Besides commercial OpenMP compilers for C/C++ and Fortran (such as the Intel compiler suite, http://www.intel.com/cd/software/products/asmo-na/eng/compilers/index.htm, or the compilers of the Portland Group, http://www.pgroup.com/products/cdkindex.htm), there also exist various open-source implementations of the OpenMP standard such as OdinMP/CCp for C/C++ [1] and the Omni OpenMP compiler for C/C++ and Fortran [10]. The latter are source-to-source compilers that preprocess source code containing OpenMP directives and emit a transformed source program that employs a native threading API and can then be compiled into an executable with another compiler. In the case of OdinMP the source is parallelized using pthreads, whereas Omni allows the use of different threading APIs. We are not aware of any OpenMP specification for the Java programming language. However, with JOMP [2] there exists a proposal that transfers a subset of the OpenMP standard to Java. The JOMP compiler is a source-to-source compiler that transforms JOMP code to standard Java source code and uses the Java Threading API for parallelism. In contrast to JOMP, the JaMP compiler potentially benefits from translating the OpenMP directives to native code rather than rewriting the source code, because the Jackal compiler is aware of the parallelization applied. For example, this enables the compiler to perform data race analysis, to employ explicit send/receive operations instead of the DSM protocol, and the like. In the case of a source-to-source translator, the parallelization is hidden behind calls to the threading API in the final compilation stage, which makes analyses related to the parallelization complex.
int foo(SomeObject o) {
  return o.field;
}
(a)

int foo(SomeObject o) {
  if (!readable(o))
    fetch(o, readable);
  return o.field;
}
(b)

Fig. 1. Example function foo() before (a) and after (b) insertion of the access checks by the compiler.
Intel Cluster OMP is a commercial OpenMP compiler that extends the OpenMP specification with a special attribute to support data shared between different cluster nodes. The DSM is provided by an extended version of the TreadMarks DSM [6]. Omni/SCASH [12] is another project for transparently executing OpenMP-enriched programs in the DSM environment of SCASH [4].
3 The Jackal DSM System
Jackal implements an object-based DSM system that automatically distributes a thread-parallel Java program onto a cluster. Object-based DSMs distribute data at the level of objects rather than at the level of operating-system pages or even (hardware) cache lines. With Jackal, multiple threads may share a process on a single machine, both to enable the efficient use of multi-processor SMP nodes in a cluster and to allow the communication of one thread to overlap with the computation of another. Jackal's DSM functionality is achieved by prefixing each object access with a test of whether or not that object is locally available, a so-called access check. If it is not, a request is sent to the machine holding the object's master copy (the home-node). Fig. 1(a) shows a simple example of a function foo() containing a read-only object access; Fig. 1(b) shows foo() after the access check has been inserted. At runtime, the address space of each process of a Jackal application is partitioned into three storage areas: a garbage-collected heap that is used to allocate Java objects at the time of their instantiation by new; an object cache that temporarily stores copies of objects that have been transferred from their home-node after an access check; and an administrative storage area that stores the data the runtime system uses to maintain information about the DSM state, locking, etc. In particular, the following data structures are employed: a per-thread flush list that maintains the set of objects that have to be flushed to their home-nodes whenever a synchronization point is reached; a read/write bitmap that contains information about the accessibility of the objects that are transferred to the local node; and a set of hashtables that maps global object references to memory addresses on the local node. Besides being a highly optimizing compiler, the Jackal compiler together with the DSM runtime also provides a set of optimizations directly related to the DSM functionality [16, 15]. Firstly, redundant access checks are removed by aggressively applying static optimizations. Hence, the access check is made once, and as long as the compiler is able to prove that the object is still locally available, no further access checks need to be performed. Secondly, read-only replication is used to replicate Java objects for which no thread issues write requests. A thread is then able to avoid flushing and invalidating such a Java object whenever a synchronization point is reached. Finally, to avoid false sharing, multiple writers are allowed to concurrently modify a particular object. The multi-writer protocol sends a differential image to the home-node of the object when performing a flush operation.
4 JaMP Directives
The JaMP directives are specified such that they closely follow the OpenMP standard. The syntax of the JaMP directives is identical to that of their OpenMP counterparts. This ensures that an OpenMP programmer is able to use JaMP without having to learn a set of new directives. The model introduced by JaMP is as expressive as the OpenMP programming model, although we restrict some directives to simplify the JaMP compiler back-end. Because the Java language specification does not support pragmas as C/C++ does, Jackal provides its own implementation of the pragma concept:

//#pragma <category> <directive> <attributes>
Each pragma statement is opened by a Java line comment starting with #pragma, followed by a category that distinguishes different pragma types. Similarly to OpenMP, which uses omp as its category identifier, JaMP uses jamp. This identifier is followed by a directive part which is evaluated by the corresponding compiler pass. Options and parameters are passed to the pragma in the attributes part of the pragma definition. For example, the following OpenMP directive

#pragma omp parallel private(x)
can easily be transformed into a corresponding JaMP directive:

//#pragma jamp parallel private(x)
int a = 1; int b = 2; int c = 3;
//#pragma jamp parallel private(a) firstprivate(b) shared(c)
System.out.println("a=" + a + ", b=" + b + " c=" + c);

Fig. 2. Example of a parallel region with a private, a firstprivate, and a shared variable.

//#pragma jamp parallel
//#pragma jamp for
for (<init>; <cond>; <inc>) {
  // some code
}

Fig. 3. Example of the for directive.
Please note that a structured block is either a single statement or a code block enclosed in curly braces. As the code examples (Fig. 2 to 4) show, it is straightforward for an OpenMP programmer to turn OpenMP pragmas into the corresponding JaMP pragmas.
4.1 Parallel Regions
To mark a section of a program as parallel, it is enclosed by a jamp parallel directive; see Fig. 2 for an example. JaMP supports all types of access restrictions that are defined by OpenMP. For variables marked as shared, the same variable is used by all threads of the parallel region. For private variables, every thread receives an uninitialized private copy of the variable. To initialize a private variable with the value it had before the parallel region, the firstprivate attribute can be used. The this variable of the Java programming language is always implicitly passed to parallel regions as a shared variable. At the end of each parallel region, there is an implicit barrier at which the threads wait until all other threads created in that region have reached it. If the nowait attribute is present, the implicit barrier after the parallel region is omitted.
4.2 Work Sharing Directives
The iteration space of a loop can be distributed among a set of worker threads by means of the for directive (see Fig. 3). In the subsequent for statement, init is the initialization expression of the loop, cond is a loop-invariant termination condition, and inc is an increment expression that increments the loop counter by some loop-invariant value. The loop variable is made thread-private, as required by the OpenMP standard. The load distribution of the work-sharing directive can be controlled by the schedule attribute (not shown in the example), which can have the values static, dynamic, and guided.
int sumOfArray(int[] array) {
  int sum = 0;
  //#pragma jamp parallel
  {
    //#pragma jamp for reduction(+:sum)
    for (int i = 0; i < array.length; i++) {
      sum += array[i];
    }
  }
  return sum;
}

Fig. 4. Example of a parallel summation of all array elements.
With a static distribution, the iteration space of the loop is divided into chunks of equal size that are assigned to the individual worker threads. A dynamic distribution partitions the iteration space into chunks of a certain size, which defaults to 1. Each thread then repeatedly requests a new chunk as long as there are unprocessed chunks left. Similarly to dynamic, with guided scheduling each thread requests unprocessed chunks; however, the chunk size starts at one third of the size of the iteration space and is cut in half at each step.
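As an illustration of the schedule attribute described above, the following sketch shows a dynamically scheduled JaMP loop. It assumes that JaMP follows the OpenMP clause syntax for schedule; the chunk size 4 and the names n and process() are placeholders introduced for this example only.

//#pragma jamp parallel
{
  //#pragma jamp for schedule(dynamic, 4)
  for (int i = 0; i < n; i++) {
    process(i); // each thread repeatedly requests chunks of 4 iterations
  }
}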
4.3 Reductions
With the reduction attribute, JaMP allows reductions of partial results that have been computed by the individual worker threads. JaMP supports all arithmetic reductions defined by the OpenMP standard. Fig. 4 shows the summation of an array into a single variable. The summation is parallelized by means of a work-sharing construct together with a reduction of the partial sums of the worker threads.
4.4 Other Directives
In addition, JaMP fully supports parallel sections marked by the sections directive. Whereas with the parallel and for constructs each thread executes the same parts of the parallel region, sections allows the programmer to specify a set of section directives, each of which is executed by a different thread. It is also possible to define regions that are executed by only one thread using the single and master directives. In contrast to master, single allows the specification of attributes such as data-access clauses and contains an implicit barrier at the end of the construct. User-defined barriers can be placed by means of the barrier directive. At the location of the barrier statement, each thread waits until all other threads have reached the barrier. The critical directive can be used to mark critical sections that may be executed by only one thread at a time.
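The following sketch combines the directives just described in one parallel region. It is purely illustrative; doSetup(), partA(), partB(), total, and localResult are placeholder names introduced for this example.

//#pragma jamp parallel
{
  //#pragma jamp single
  doSetup();                 // executed by exactly one thread; implicit barrier afterwards

  //#pragma jamp sections
  {
    //#pragma jamp section
    partA();                 // executed by one thread of the team
    //#pragma jamp section
    partB();                 // executed by another thread of the team
  }

  //#pragma jamp barrier     // all threads wait here

  //#pragma jamp critical
  {
    total += localResult;    // only one thread at a time enters this block
  }
}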
4.5 Limitations of JaMP
JaMP does not yet implement the whole OpenMP specification. The limitations listed in this section, however, do not restrict the expressiveness of the programming model offered by JaMP: the missing directives can be expressed by other directives or by means of the Java programming language without much effort. For reasons of simplicity, JaMP currently does not support combined directives such as parallel for or parallel sections, which are shortcuts defined by the OpenMP specification. JaMP also does not provide any means to declare orphaned regions. An orphaned region is a region (e.g. a for region) that is not enclosed by a lexically surrounding parallel directive. In such cases an OpenMP-compliant implementation is forced to dynamically determine a parallel region in the call stack of the current function; if there is none, the orphaned region is executed sequentially. Privatization of instance variables or class variables is currently not possible. In particular, threadprivate and the attributes copyin and copyout are not supported. We have chosen to limit the functionality of the JaMP compiler in these respects to achieve a smaller and simpler implementation. We also limit the use of the num_threads and if clauses. The num_threads clause may only contain a constant or a local variable; complex expressions can be stored in additional local variables. The if attribute can easily be expressed by means of Java's if statement.
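The following sketch illustrates the suggested workarounds: a complex thread-count expression is stored in a local variable, and an ordinary Java if statement replaces OpenMP's if clause. The names nt, n, threshold, and work() are placeholders for this example, and the num_threads spelling assumes JaMP follows the OpenMP clause syntax.

int nt = Runtime.getRuntime().availableProcessors(); // complex expression stored in a local variable
if (n > threshold) {                                 // replaces an OpenMP if(n > threshold) clause
  //#pragma jamp parallel num_threads(nt)
  {
    //#pragma jamp for
    for (int i = 0; i < n; i++) {
      work(i);
    }
  }
} else {
  for (int i = 0; i < n; i++) {                      // sequential fallback
    work(i);
  }
}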
5 Implementation Details
The implementation of JaMP in the Jackal framework is divided into three parts: (1) an extension to Jackal's compiler back-end, (2) an implementation of JaMP classes in the Java library, and (3) a dispatcher added to the runtime system, written in C for efficiency reasons (see Fig. 5). Consider Fig. 4 again. The JaMP compiler pass identifies the parallel region and moves the code enclosed by it into a newly created function; how to extract arbitrary code sequences is described in [7]. At runtime, the address of the newly created function is registered such that a mapping between the address and the function's globally unique name is established. Additionally, the compiler inserts code that starts up and terminates the parallel execution of the region. Fig. 7 shows the pseudo-code of Fig. 4 after the transformation has been applied. In line 3 of Fig. 7, the thread team is constructed and initialized. Internally, the dispatcher locates the address of the spliced-out function by searching for the function name in a look-up table. The execution of the parallel region starts (line 7) when the stub invokes the function pointer through the dispatcher.
Fig. 5. Architecture of the JaMP implementation.
Fig. 6. Accessibility of private and shared data.
The implicit barrier is hidden inside the startAndJoin call. Finally, in line 8 the reduction of the partial sums is performed. Data access is handled by two types of Java objects, as can be seen in Fig. 6. When a parallel region begins execution, a Java object of type JampParam is created to store the shared data. For each worker thread, a private JampParam object is allocated that receives the thread's private data. For each possible variable type (int, double, etc.), there is an array of the particular type inside JampParam. For every variable, an offset into the respective array is generated by the JaMP compiler. These two objects are then passed to the spliced-out function (see Fig. 7 at line 11). Correspondingly, each access to a shared variable inside the parallel region has to be altered to access the correct JampParam object instead of the former variable. In Fig. 2, for example, the private variable a might be assigned offset 0 and variable b might be located at offset 1 of a thread-private JampParam object. Correspondingly, the shared variable c might be stored at offset 0 of the shared JampParam object. A JaMP for loop's behavior has to reflect the specified work-sharing models static, dynamic, and guided. This is handled by a special Java object that contains the original loop boundary, the termination condition, and the increment, as well as the list of blocks (in case of dynamic or guided scheduling).
 1  int sumOfArray(int[] array) {
 2    int sum = 0;
 3    JampThreadTeam team = new JampThreadTeam(..., "jamp_method_1");
 4    for (int i = 0; i < N; i++)
 5      team.addThread(i);
 6
 7    team.startAndJoin();
 8    sum = team.getReductionIntPlus();
 9    return sum;
10  }
11  void jamp_method_1(JampParam sharedVars, JampParam privateVars,
12                     JampThreadTeam team) {
13    // initialization: create list of blocks
14    // based on the work-sharing model
15    int blockId = getBlockId(...);
16    int count = getNumberOfBlocks(blockId);
17    int sum = 0;
18    int[] array = (int[]) sharedVars.getObject(0);
19    int i = getFirstIteration(blockId);
20    while (blockId != -1) {
21      sum += array[i];
22      i++;
23      count--;
24      if (count == 0) {
25        blockId = getNextBlockId();
26        count = getNumberOfBlocks(blockId);
27        i = getFirstIteration(blockId);
28      }
29    }
30    privateVars.reductionVar = sum;
31  }

Fig. 7. Example in Fig. 4 after the JaMP transformation.
As can be seen in Fig. 7, the loop is rewritten in such a way that every thread handles only a part of the iteration space. After the initialization, which depends on the chosen work-sharing model, a first block and the number of iterations in that block are loaded (lines 15–16). getBlockId() returns -1 when there are no more blocks left. In the new loop body, the old loop body is executed until the current block is finished (lines 21–24). After that, the next block is loaded (lines 25–27) and the computation continues until all blocks have been processed. Inside the loop body, the original loop counter i is no longer used to determine when the loop is finished, as the number of iterations in a block is stored in count.
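To illustrate how variable accesses are redirected through the JampParam objects described above, the following conceptual sketch shows a possible JampParam layout and a rewritten access. The field names are assumptions made for this illustration, not Jackal's actual implementation; only getObject() also appears in Fig. 7.

class JampParam {
  int[]    intVars;      // one slot per int variable of the region
  double[] doubleVars;   // one slot per double variable of the region
  Object[] objectVars;   // one slot per reference variable of the region

  Object getObject(int offset) { return objectVars[offset]; }
}

// For Fig. 2, an access such as c = a + b inside the parallel region could
// conceptually be rewritten by the compiler into
//   sharedVars.intVars[0] = privateVars.intVars[0] + privateVars.intVars[1];
// with a at private offset 0, b at private offset 1, and c at shared offset 0.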
6 Performance Evaluation
We tested the performance of the JaMP implementation with a set of micro-benchmarks and the computationally intensive Lattice-Boltzmann Method. The benchmarks were run on a commodity cluster of AMD Opteron nodes (2.2 GHz) with 16 GB of main memory. The nodes are equipped with 1 Gbit/s Ethernet and run SuSE SLES 9 (kernel version 2.6.5-7.201-smp).
6.1 Micro-benchmarks
To determine the speed of the basic JaMP operations, we implemented a set of micro-benchmarks. These measurements show how long it takes to create a parallel region, to wait at a barrier, and how costly it is to access shared variables. To ensure that the compiler does not remove any instructions, we compiled the benchmark program with most compiler optimizations turned off. The only optimizations applied to the code are method inlining and bounds-check elimination; both are always enabled for the JaMP runtime to ensure that the JampParam objects are accessed in an optimal manner. The first micro-benchmark measures thread start-up time. To speed up the creation of parallel regions, we implemented a thread pool that reuses threads from earlier computations. Figure 8 shows that the time increases linearly with the number of nodes, with the thread-pool version being three times faster. The thread-creation overhead includes setting up Jackal's internal data structures that implement the DSM system (see Section 3): flush lists for the administration of to-be-flushed objects, read/write bitmaps that maintain information about the accessibility of objects, and a set of hashtables used by the DSM protocol. In the second benchmark, we determined how long waiting at a barrier takes. Barriers occur quite often in a JaMP program: (1) as explicit barriers requested by a barrier directive and (2) as implicit barriers at the end of parallel regions. As can be seen in Figure 8, the time increases with the number of nodes. The barrier implementation used is rather simple and implemented in Java. Thus, the whole DSM protocol stack is involved whenever a barrier is entered. The barrier code itself consists of a synchronized method which updates a counter until the barrier's limit is reached. If the limit is reached, Thread.notify() is executed; otherwise Thread.wait() is called. The lock() and unlock() due to the synchronized block cause a minimum of 4 messages and a thread creation at the receiver to handle the lock request. Lock and unlock both have the side effect of flushing the barrier object, again causing a minimum of 4 messages to be exchanged. The barrier's counter update causes the barrier to be fetched in write mode (in turn causing home-migration, which may cause more messages) and a differential image to be created. The wait() and notify() calls cause a minimum of 2 messages each and a thread to be created at the machine currently owning the barrier's lock. In total, a minimum of 12 messages and some thread creations occur per barrier entry.
Fig. 8. Results of the micro-benchmarks: Creating parallel regions and waiting at barriers.
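For illustration, the following minimal sketch shows a counter-based barrier of the kind described above, built on a synchronized method with wait()/notifyAll(). It is a conceptual reconstruction, not Jackal's actual barrier class; the generation counter is added here to guard against spurious wake-ups.

class SimpleBarrier {
  private final int limit;     // number of threads that must arrive
  private int arrived = 0;
  private int generation = 0;  // distinguishes successive uses of the barrier

  SimpleBarrier(int limit) { this.limit = limit; }

  synchronized void await() throws InterruptedException {
    int gen = generation;
    arrived++;
    if (arrived == limit) {
      arrived = 0;
      generation++;
      notifyAll();             // the last thread releases all waiting threads
    } else {
      while (gen == generation) {
        wait();                // block until the barrier trips
      }
    }
  }
}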
When shared variables are used in a parallel region, the code is modified such that the JampParam object is accessed. To determine the overhead of accessing shared JaMP variables vs. instance variables, we perform 10^6 variable accesses to both variable types in separate loops and then subtract the loop overhead from both results. Accesses to shared JaMP variables are about 2.6 times slower than accesses to instance variables. The higher latency of the variable access is caused by the array access inside the JampParam object. Please note that for shared variables this type of access is always required, as the value of a given variable has to be shared with other threads. For a firstprivate variable, this type of access occurs only once, when the variable is first accessed inside the parallel region; every other access goes through either a machine register or a local temporary.
6.2 Lattice-Boltzmann Method
The Lattice-Boltzmann Method (LBM) [17] is used to simulate fluids by means of cellular automata. Space and time are discretized and normalized. In our case, LBM operates on a 2D domain divided into cells. Each cell holds a finite number of states called distribution functions. In one time step the whole set of states is updated synchronously by deterministic, uniform update rules. The evolution of the state of a given cell depends only on its neighboring cells. In our example, the computational kernel of LBM repeatedly applies two steps to each of the cells in the domain: the stream step and the collide step. During the stream step, the particles from the neighboring cells flow into the middle cell. The collide step then computes how collisions have affected the particles during the stream step. The kernel contains several loop-carried dependencies incurred by reading the neighboring cells while updating a middle cell. To remove these dependencies from the kernel, we use two distinct domains: the source domain and the destination domain. In one time step, the source domain is read and the destination domain is updated accordingly. After the update of a time step is finished, the two domains are swapped.
Fig. 9. Speed-up and parallel efficiency of the Lattice-Boltzmann Method on up to 8 CPUs of the AMD cluster using the DSM of Jackal.
In our implementation, the 2D domain is implemented by means of a 3D Java array. The first and second dimensions are used as the x-axis and the y-axis, and the third dimension is used to store the nine distribution functions of the cell at position (x, y). The kernel was parallelized in a straightforward manner: the domain is decomposed along the y-axis, that is, the outermost loop is distributed over the worker threads. A similar scheme is used for the manually parallelized version in [14], where LBM was scaled up to 512 CPUs of a Hitachi SR8000. Fig. 9 shows the speed-up and the parallel efficiency achieved by both the JaMP-parallelized LBM and the manually parallelized LBM. Please note that the number of CPUs for the manually parallelized LBM is limited to powers of two. For 8 nodes, JaMP achieves a speed-up of about 6.1. This is about 83% of the speed-up achieved by the LBM kernel that is written in C and uses MPI for communication. The lower scalability is due to the DSM environment, which strongly depends on the communication layer (in our case Ethernet). Because Jackal is an object-based DSM, each cell is transferred separately between the nodes. In contrast, the MPI version transfers a whole partition, i.e., each process requests the cells from other processes with a single MPI request.
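As an illustration of the parallelization described above, the following sketch shows the loop structure of one time step of the JaMP-parallelized kernel. It is a hedged sketch under the stated decomposition; streamAndCollide(), runLbm(), and timeSteps are placeholder names, not the actual kernel code.

void runLbm(double[][][] src, double[][][] dst, int timeSteps) {
  int sizeX = src.length;        // first dimension: x-axis
  int sizeY = src[0].length;     // second dimension: y-axis; third holds the 9 distribution functions
  for (int t = 0; t < timeSteps; t++) {
    //#pragma jamp parallel
    {
      //#pragma jamp for
      for (int y = 1; y < sizeY - 1; y++) {         // domain decomposed along the y-axis
        for (int x = 1; x < sizeX - 1; x++) {
          streamAndCollide(src, dst, x, y);         // reads neighbors in src, writes dst[x][y]
        }
      }
    }
    double[][][] tmp = src; src = dst; dst = tmp;   // swap source and destination domains
  }
}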
7 Future Work
Thus far, JaMP relies on Jackal's object-flushing implementation, but the amount of flushed data could be reduced by using knowledge of JaMP's parallelizations. Within a parallel region, the compiler is able to determine which data was modified during the execution. Only the data that is needed by other threads executing the same parallel region needs to be flushed when a barrier is
encountered. Thus, the amount of data that is exchanged can be reduced significantly. Another factor in Jackal's performance is the application's data-distribution scheme. Again, for a JaMP parallel region the compiler can exploit information about how the computation will be distributed to align data such that the data needed for a computation is allocated at the executing node. Prefetching of data needed in the near future of the computation is another way to hide latencies caused by the DSM environment. So far, OpenMP and JaMP only support work sharing for algorithms that process data structures such as multidimensional arrays. Arbitrary data structures such as object graphs cannot be partitioned. There is also a lack of support for true object orientation. For example, OpenMP does not support arbitrary reductions (e.g. reductions over user-defined types) in a work-sharing directive. For applications that rely on reductions of user-defined data types, parallelization still has to be done manually, which proves to be a complex and cumbersome task that often leads to load imbalances and poor overall scalability. JaMP currently lacks support for changing the number of machines at runtime. For example, nodes might be removed because of a hardware failure or the like, while the computation should continue without interruption. In addition, one might want to add or remove nodes to dynamically adapt to the application's resource needs or to the current availability of resources. Increasing or decreasing the number of nodes participating in a parallel computation involves repartitioning the data after the number of nodes has changed.
8 Conclusions
We have presented an implementation of JaMP as an extension to Jackal's native Java compiler. With JaMP, a programmer writes a purely sequential program and enriches it with parallelization directives to obtain a parallel JaMP program. The directives are expressed as pragmas that are implemented by a special type of Java comment. Our current implementation of JaMP supports a large subset of the OpenMP directives and provides a programming model that is as expressive as that of OpenMP. We evaluated the performance of our JaMP implementation using a set of micro-benchmarks and the computationally intensive Lattice-Boltzmann Method. For LBM, our measurements show that the parallelization achieves about 83% of the scalability of a manually parallelized LBM; for 8 nodes, a speed-up of about 6.1 was achieved. The measurements also show that the decreasing efficiency is due to Jackal's DSM implementation.
References

1. C. Brunschen and M. Brorsson. OdinMP/CCp - a Portable Implementation of OpenMP for C. Concurrency: Practice and Experience, 12(12):1193–1203, 2000.
2. J. M. Bull and M. E. Kambites. JOMP - an OpenMP-like Interface for Java. In Java Grande, pages 44–53, 2000.
3. U. Drepper and I. Molnar. The Native POSIX Thread Library for Linux. Technical report, Red Hat, February 2003.
4. H. Harada, Y. Ishikawa, A. Hori, H. Tezuka, S. Sumimoto, and T. Takahashi. Dynamic Home Node Reallocation on Software Distributed Shared Memory. In Proc. of the 4th Intl. Conf. on High-Performance Computing in the Asia-Pacific Region, pages 158–163, Beijing, China, May 2000.
5. IEEE. Threads Extension for Portable Operating Systems (Draft 6), February 1992. P1003.4a/D6.
6. P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proc. of the Winter 1994 Usenix Conf., pages 115–131, San Francisco, CA, January 1994.
7. M. Klemm, R. Veldema, and M. Philippsen. Latency Reduction in Software-DSMs by Means of Dynamic Function Splicing. In T. Gonzalez, editor, Proc. of the 16th IASTED Intl. Conf. on Parallel and Distributed Computing and Systems, pages 362–367, 2004.
8. L. Kowalik, editor. PVM: Parallel Virtual Machine. MIT Press, 1994.
9. D. Lea. Concurrent Programming in Java. Addison-Wesley, Boston, 2nd edition, 2003.
10. M. Sato, S. Satoh, K. Kusano, and Y. Tanaka. Design of OpenMP Compiler for an SMP Cluster. In Proc. of the 1st European Workshop on OpenMP, pages 32–39, Lund, Sweden, September 1999.
11. Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997.
12. Y. Ojima and M. Sato. Performance of Cluster-enabled OpenMP for the SCASH Software Distributed Shared Memory System. In Proc. of the 3rd Intl. Symp. on Cluster Computing and the Grid, pages 450–456, Tokyo, Japan, May 2003.
13. OpenMP C and C++ Application Program Interface, Version 2.0, March 2002.
14. T. Pohl, N. Thürey, F. Deserno, U. Rüde, P. Lammers, G. Wellein, and T. Zeiser. Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures. In Proc. of the IEEE/ACM SC 2004 Conf., pages 21–33, Pittsburgh, PA, USA, August 2004.
15. R. Veldema, R. F. H. Hofman, R. A. F. Bhoedjang, and H. E. Bal. Runtime Optimizations for a Java DSM Implementation. In Proc. of the 2001 Joint ACM-ISCOPE Conf. on Java Grande, pages 153–162, Palo Alto, CA, June 2001.
16. R. Veldema, C. Jacobs, R. F. H. Hofman, and H. E. Bal. Object Combining: A New Aggressive Optimization for Object Intensive Programs. In Proc. of the 2002 Joint ACM-ISCOPE Conf. on Java Grande, pages 165–174, Seattle, WA, 2002.
17. D. A. Wolf-Gladrow. Lattice-Gas Cellular Automata and Lattice Boltzmann Models, volume 1725 of Lecture Notes in Mathematics. Springer, 2000.