Design Issues for Efficient Implementation of MPI in Java
Glenn Judd, Mark Clement, Quinn Snell
Computer Science Department, Brigham Young University, Provo, USA

Vladimir Getov
School of Computer Science, University of Westminster, London, UK
Abstract

While there is growing interest in using Java for high-performance applications, many in the high-performance computing community do not believe that Java can match the performance of traditional native message passing environments. This paper discusses critical issues that must be addressed in the design of Java-based message passing systems. Efficient handling of these issues allows Java MPI applications to obtain performance which rivals that of traditional native message passing systems. To illustrate these concepts, the design and performance of a pure Java implementation of MPI are discussed.
1 Introduction

The Message Passing Interface (MPI) [1] has proven to be an effective means of writing portable parallel programs. With the increasing interest in using Java for high-performance computing, several groups have investigated using MPI from within Java. Nevertheless, there are still many in the high-performance computing community who are skeptical that Java MPI performance can compete with native MPI. These skeptics usually point to data showing that early Java implementations of message passing standards [2] performed orders of magnitude slower than native versions. Their skepticism is further backed by the fact that, as of yet, no MPI implementation written completely in Java has been competitive with native MPI implementations.

To investigate possible Java MPI performance, we have designed and implemented MPIJ, an MPI implementation written completely in Java which seeks to be competitive with native MPI implementations in clustered computing environments. In this paper we discuss issues that must be addressed in order to efficiently implement MPI in Java, explain how they are concretely addressed in MPIJ, and compare the performance obtained using MPIJ with native MPI. Our results show that there are many cases where MPI in Java can compete with native MPI.

Section 2 reviews previous work on Java message passing. In Section 3, we discuss issues which must be addressed in order to allow for efficient communication between MPI processes in Java. In Section 4 we discuss issues related to supporting threads in Java MPI processes. Section 5 discusses methods for integrating high-performance libraries into Java. In Section 6 we discuss the design of a pure Java implementation of MPI: MPIJ. Section 7 presents performance results.
2 Related Work

Over the last few years many Java message passing systems have been developed. A large number of these systems, such as JavaParty [3], JET [4], and IceT [5], have developed novel parallel programming methodologies using Java. Others have looked at using variations on Java Remote Method Invocation [6] or JavaSpaces [7] for high-performance computing. A number of efforts have also investigated using Java versions of established message passing standards such as MPI [1] and Parallel Virtual Machine (PVM) [8]. JPVM [2] is an implementation of PVM written completely in Java. Unfortunately, JPVM has very poor performance compared to native PVM and MPI systems. mpiJava [9] is a Java wrapper to native MPI implementations. It allows application code to be written in pure Java, but currently requires a native MPI implementation in order to function. JavaMPI [10] is also a Java wrapper to native MPI libraries, but JavaMPI wrappers are generated automatically with the help of a special-purpose tool called JCI (Java-to-C Interface generator).

Efforts are currently underway to develop a standard Java MPI binding in order to increase the interoperability and quality of Java MPI bindings [11]. This research adds to this effort by exploring issues that must be addressed in order to efficiently implement MPI in Java. These issues can then be addressed both by the developing Java MPI bindings and by the Java environment itself, in order to foster the development of efficient message passing systems which are written completely in Java.

3 Network Communication in Java

3.1 Native Marshalling

High network communication performance is a critical element of any MPI implementation. Achieving high network communication performance under Java requires consideration of issues not found under native code. Before discussing these issues, consider Figure 6, which compares Java byte array communication to C byte array communication. It is clear that Java communication of byte arrays completely matches C in this case. As both C and Java rely on the same underlying communication library to carry out the communication, it is not surprising that they should achieve very comparable results.

However, most communication in MPI consists of data other than bytes. In C this is a trivial issue since an array of any type can be type cast to a byte array, but in Java this issue is significant because a simple type cast is not permitted. The most common method of sending non-byte data in Java is to marshal the data into a byte array and then send this byte array. Performing this marshaling in Java code is an intrinsically less efficient operation than performing it in native code. Consider the following Java code fragment, which marshals an array of doubles:

    void marshal(double[] src, byte[] dst) {
        int count = 0;
        for (int i = 0; i < src.length; i++) {
            // write each double as eight bytes, most significant byte first
            long value = Double.doubleToLongBits(src[i]);
            dst[count++] = (byte)((int)(value >>> 56));
            dst[count++] = (byte)((int)(value >>> 48));
            dst[count++] = (byte)((int)(value >>> 40));
            dst[count++] = (byte)((int)(value >>> 32));
            dst[count++] = (byte)((value >>> 24));
            dst[count++] = (byte)((value >>> 16));
            dst[count++] = (byte)((value >>> 8));
            dst[count++] = (byte)((value >>> 0));
        }
    }
Figure 1: Native vs. Java Marshalling (JDK 1.2 on a Pentium II 266MHz machine running Windows NT)
In C this marshaling can be accomplished with a simple memcpy. In Java the equivalent marshaling code requires a total of 135 Java bytecode instructions, including a method invocation, several shift operations, and several type conversion operations, none of which are required in C. It is suggested in [12] that just-in-time (JIT) compilers should be able to optimize data marshaling into its native equivalent (i.e. eliminate the method invocation, shifts, etc., and replace them with memcpy-like code). While this is theoretically possible, it is complicated by the fact that there are several primitive types and several approaches to marshaling each of these types that a JIT compiler would need to be able to optimize. At this time, we are not aware of any JIT compiler which even attempts this optimization.

A much simpler solution is to follow the precedent established by the System.arraycopy method already included in Java. This method is used to efficiently copy data between Java arrays. Currently it requires both source and destination to be of the same data type. This routine could be extended to allow the source and destination arrays to be of different primitive types. Alternatively, a new method could be added which would specifically allow for copying data between arrays of different primitive types. Note that such a method would not compromise Java language safety or security, as it introduces no new functionality but rather expedites existing functionality. As shown in Figure 1, this method enables huge performance increases in data marshaling speed, and allows data marshaling to occur at the same speed as a memcpy.
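To make the proposal concrete, the fragment below sketches one possible shape for such a routine. The class name, signature, and semantics are hypothetical (they are not part of any released JDK); in MPIJ's case the same role is played by a small native marshaling library, as described in Section 6.

    // Hypothetical API, not part of the JDK: one possible shape for the extension
    // discussed above. A JVM (or, in MPIJ's case, a small native library) could
    // implement this as a bounds-checked block copy, the moral equivalent of memcpy.
    public final class Marshal {
        // Copies `count` doubles from src (starting at srcPos) into dst (starting at
        // dstPos, measured in bytes), using the platform's native byte order.
        public static native void arraycopy(double[] src, int srcPos,
                                            byte[] dst, int dstPos, int count);

        // Usage (hypothetical): replaces the shift-and-mask loop shown earlier.
        //   Marshal.arraycopy(sendData, 0, sendBuffer, 0, sendData.length);
    }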
3.2 Typed Array Communication

The native marshaling we have discussed still requires a memory copy. Native code is able to transfer data over the network without any copy unless the destination machine of a message uses different byte ordering, in which case a byte-order-changing memory copy is required. The same approach could be used in Java code, but Java's current design does not allow this. Currently in Java, all network communication is sent using java.io.InputStream and java.io.OutputStream; these classes only provide routines for sending bytes. This design allows Java to leave byte ordering undefined, so that each Java Virtual Machine (JVM) can use the native machine's byte ordering. Since JVMs are only able to send bytes to each other, different internal byte ordering is unimportant. While this is a clean design, it limits performance for primitive arrays of types other than byte.

A possible solution is to add input and output classes which are able to send typed data directly without any memory copy. These classes would have the ability to automatically determine when different byte ordering is used on the input and output machines, and introduce byte ordering changes only when needed. These classes could also provide a uniform means of access to the non-TCP/IP communication hardware found in many high-performance computing clusters. Under this scheme, Java MPI implementations would request that a "factory" provide a typed array communication class capable of communicating on the local machine's specialized network.

Now consider a typical MPI implementation of MPI's standard communication mode: for messages below a certain size threshold, messages are buffered and sent; for messages above the threshold, the sender blocks until the receiver actually posts the receive. This allows the message to be sent without any buffering. Native marshaling allows buffering in Java to occur at essentially the same speed as buffering in native code. However, native marshaling does not allow Java code to send without buffering. Using typed input and output classes alleviates this problem by allowing communication to occur directly between send and receive buffers. Together, the two additions to Java we have discussed allow Java applications to achieve communication performance which is comparable to that of native applications.
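As an illustration of what such classes might look like, the sketch below shows a minimal typed output interface and factory. The names and interfaces are hypothetical, not existing JDK or MPIJ classes; the default returned by the factory simply falls back to big-endian marshaling through DataOutputStream, while a platform with a specialized interconnect or byte-order-aware native I/O would supply its own implementation.

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Hypothetical interface and factory sketching the typed array stream idea.
    interface TypedOutput {
        void writeDoubles(double[] buf, int offset, int count) throws IOException;
        void writeInts(int[] buf, int offset, int count) throws IOException;
        void flush() throws IOException;
    }

    final class TypedStreams {
        // Default implementation: marshal through DataOutputStream (always big-endian).
        // A zero-copy, byte-order-negotiating implementation would be returned instead
        // on machines that provide one.
        static TypedOutput newOutput(OutputStream out) {
            final DataOutputStream dos = new DataOutputStream(out);
            return new TypedOutput() {
                public void writeDoubles(double[] buf, int offset, int count) throws IOException {
                    for (int i = 0; i < count; i++) dos.writeDouble(buf[offset + i]);
                }
                public void writeInts(int[] buf, int offset, int count) throws IOException {
                    for (int i = 0; i < count; i++) dos.writeInt(buf[offset + i]);
                }
                public void flush() throws IOException {
                    dos.flush();
                }
            };
        }
    }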
3.3 Shared Memory Communication

On multi-processor machines, it is desirable to use direct memory transfers for communication. In Java this is accomplished by placing multiple MPI processes in a single JVM. This introduces a significant difference from native MPI: MPI processes become Java threads. This means that multiple threads executing a class which uses class variables (class variables are global to the JVM) will all see the same data. This is inconvenient for applications which assume the native MPI process model, where MPI processes do not share global information. However, the cost of this inconvenience is more than made up for by the fact that Java class variables can be exploited by programmers to provide a very simple and very powerful shared memory mechanism for threads residing on the same machine. MPI implementations can exploit this shared memory mechanism to speed up both point-to-point communication and global operations. We have found that global operations in Java MPI benefit greatly from using Java class variables to organize a rendezvous and direct memory transfer, rather than the standard method of implementing global operations on top of point-to-point communication.
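The sketch below illustrates the idea with a one-shot broadcast among threads in a single JVM. The class is hypothetical and much simpler than MPIJ's actual collective code, but it shows how a static (class) variable turns a global operation into a direct in-memory hand-off instead of a sequence of point-to-point messages.

    // Hypothetical helper: MPI "processes" running as threads in one JVM rendezvous
    // through a class variable. One-shot broadcast only; a reusable version would
    // track broadcast epochs.
    final class SharedBcast {
        private static final Object lock = new Object();
        private static double[] published;   // buffer published by the root thread

        // The root publishes its array; the others block until it appears and then
        // read it directly, with no marshaling and no socket traffic.
        static double[] bcast(double[] buf, boolean isRoot) throws InterruptedException {
            synchronized (lock) {
                if (isRoot) {
                    published = buf;
                    lock.notifyAll();
                } else {
                    while (published == null) {
                        lock.wait();
                    }
                }
                return published;
            }
        }
    }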
4 Thread Support

4.1 Thread Support for Shared Memory Utilization

Programmers desiring the maximum amount of performance on multi-processor machines can write programs which use Java MPI calls between machines and Java threads within a machine. One simple way that Java MPI implementations can aid this process is to provide a method for determining the number of processors on the machine. Java provides no mechanism for determining the number of processors, but this can be overcome by writing a method which determines the number of effective processors. This is easily accomplished by dividing a fixed amount of work among increasing numbers of threads; the number of processors is indicated by the point at which a significant drop in the amount of work completed per CPU occurs. This method can also be used by Java MPI implementations to automatically determine the number of MPI processes to run on the machine, rather than relying on a process group file.
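A minimal sketch of this probe is shown below; the work kernel, iteration count, and slowdown threshold are arbitrary choices, not MPIJ's actual values. (Later Java releases added Runtime.getRuntime().availableProcessors(), which answers the question directly, but the sketch follows the timing-based approach described above.)

    // Hypothetical probe: run the same fixed per-thread work on 1, 2, 3, ... threads.
    // Wall-clock time stays roughly flat while each thread gets its own CPU, and
    // jumps once the machine is oversubscribed.
    final class CpuProbe {
        static volatile double sink;   // keeps the busy-work from being optimized away

        private static void burn(int iters) {
            double x = 1.0;
            for (int i = 0; i < iters; i++) {
                x = x * 1.0000001 + 0.0000001;   // arbitrary floating-point busy work
            }
            sink = x;
        }

        private static long timeRun(int threads, final int iters) throws InterruptedException {
            Thread[] workers = new Thread[threads];
            long start = System.currentTimeMillis();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() { burn(iters); }
                });
                workers[i].start();
            }
            for (int i = 0; i < threads; i++) {
                workers[i].join();
            }
            return System.currentTimeMillis() - start;
        }

        // Largest thread count whose wall time is still close to the single-thread time.
        static int effectiveProcessors(int maxToTest) throws InterruptedException {
            final int iters = 50000000;
            long base = timeRun(1, iters);
            int cpus = 1;
            for (int n = 2; n <= maxToTest; n++) {
                if (timeRun(n, iters) > base * 1.5) {
                    break;   // significant slowdown: more threads than CPUs
                }
                cpus = n;
            }
            return cpus;
        }
    }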
4.2 General Thread Support

Threads in Java are useful for more than just taking advantage of multiple processors. They allow many important functions, such as I/O, to be performed outside of the main thread of execution. The widespread use of threading in Java makes support for threading a very important issue. Unfortunately, MPI includes very little thread support. Rather, MPI merely delineates what a "threaded" version of MPI should provide and what the user should be required to provide. As threads are pervasive in Java, any Java MPI implementation should, at least, follow the guidelines provided by MPI for a threaded MPI. However, the ease and power of Java threading beg for a more elegant solution.
5 Integrating Standard High Performance Libraries Into Java

A significant issue that must be addressed is how to integrate the additions we have proposed into Java. Native marshaling could be added to Java's current API with very little difficulty, but the proposed classes for high-performance communication of typed arrays introduce a more substantial change. As Java is largely driven by the needs of business applications, it is unlikely that a substantial class like the typed array networking class will make it into the core Java API, in spite of the huge performance increases possible. The issue of how to integrate a high-performance computing API into Java is faced by several efforts to establish standard Java high-performance computing libraries [12]. No standard method has yet been established for inclusion of these libraries in Java. We therefore propose a straightforward and flexible system for adding high-performance libraries to Java.

With the introduction of the Java 2 platform, Java contains core API packages named java.* and several standard extension packages named javax.*. When Sun introduced the Swing API (Java's new GUI API), Sun originally defined its package as com.sun.java.swing.*. In order to move Swing into the core library for Java 2, but leave it as an extension for JDK 1.1, Sun defined Swing's package to be javax.swing, and then defined javax.swing as a core API, in addition to the java.* packages, in Java 1.2. Following this pattern, we propose that libraries critical for high-performance computing be included in standard Java Grande extensions javax.grande.*. Parts of these libraries which are useful to the general public could eventually be defined as core. Less critical libraries, or libraries still under development, could be defined in grande.org.* packages. These classes could eventually be promoted to javax.* if necessary. Under this scheme, standard native libraries critical for performance would be installed on systems; if a native library cannot be found, a default Java implementation is substituted. In this way applications can have both portability to machines which do not have any native code installed and superior performance on machines which do.
6 Design Principles

In designing our implementation of MPI in Java, we followed four major principles:
Pure Java implementation. A pure Java implementation is very desirable as it inherits all of Java's cross-platform, security, and language safety features. The only exception we allowed to the pure Java implementation is on systems where a library for native marshaling of arrays is available. In this case, it is best to use native marshaling. If the native marshaling library is unavailable, we simply use Java marshaling. As will be shown, this small bit of native code allows MPIJ messaging to compete favorably with native message passing schemes.

Java Grande Forum MPI binding proposal compliance. The MPI standard contains bindings for C, FORTRAN, and C++, but none for Java. It is important to have well-defined Java bindings for MPI in order to foster compatible, high-quality implementations. To remedy this situation, we are working as part of the Java Grande Forum Concurrency and Applications Working Group to develop MPI bindings. We sought to follow these emerging bindings in order to allow programs written under MPIJ to run under other Java MPI systems and vice versa.
High communication performance. Efficient communication is critical in order to make Java MPI a viable alternative to native MPI. When MPIJ is started, it first searches for a native marshaling library. If one is found, MPIJ uses it to perform native marshaling as described earlier (a sketch of this startup check appears after this list); if not, MPIJ uses a Java library for marshaling data. MPIJ does not yet use any typed array communication classes. We are currently working on incorporating them into MPIJ, and we expect to see significant performance increases when they are included. On multi-processor machines, MPIJ makes use of shared access to class variables to perform efficient collective communication. This allows us to copy data directly between source and destination buffers and achieve a high degree of efficiency.
Independence from any particular application framework. This greatly increases the usability of MPIJ by allowing it to be used by any framework which provides a few simple startup functions.
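A minimal sketch of the native marshaling startup check mentioned under the high communication performance principle follows. The class and library names are hypothetical, and MPIJ's actual initialization differs in detail, but the pattern is the same: try to load the native library and quietly fall back to pure Java marshaling.

    // Hypothetical startup check (class and library names are illustrative only).
    final class MarshallerSelector {
        private static final boolean nativeAvailable;
        static {
            boolean ok = false;
            try {
                System.loadLibrary("nativemarshal");  // hypothetical library name
                ok = true;
            } catch (UnsatisfiedLinkError e) {
                // No native library on this machine: stay 100% Java.
            }
            nativeAvailable = ok;
        }
        static boolean useNativeMarshalling() { return nativeAvailable; }
    }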
7 Performance Results

7.1 Test Environment

To quantify the performance of our current MPIJ implementation, we ran benchmarks on three different parallel computing environments:

1. A cluster of dual-processor Pentium II 266 MHz Windows NT machines under JDK 1.2, communicating via switched 100 Mbps Ethernet, with only one MPI process per machine.

2. The same cluster as in 1, but with each machine running up to two MPI processes (one per CPU).

3. A four-processor Xeon 400 MHz Windows NT machine under JDK 1.2.

As stated, one of our major aims is to show that MPI under Java can match native MPI performance. In order to demonstrate this we compare MPIJ performance with one of the best available MPI systems for Windows NT, WMPI [13]. WMPI was chosen because an evaluation study elsewhere [14] showed it to have very good shared memory and distributed memory communication performance. We do not compare our results with JPVM and PVM 3.4 under Windows NT because their performance is significantly less than that of WMPI. We also do not compare against MPI on Linux, because WMPI performance is fairly comparable to it (NT and Linux bandwidth on our hardware is nearly equal, while Linux latency is much lower), and because Java on Linux is far less advanced than Java on Windows NT.

7.2 Point-to-Point Communication Performance

Ping Pong. The Ping Pong benchmark finds the maximum bandwidth that can be achieved sending bytes between two nodes, one direction at a time. As can be seen in Figure 2 and Figure 3, MPIJ distributed memory communication is essentially equivalent to that of WMPI. Shared memory performance of MPIJ is reasonably close to that of WMPI. (Note that this test was run with explicit tags and sources in the MPI receive call. MPIJ currently performs significantly slower when using the MPI.ANY_SOURCE wildcard. The cause of this inefficiency seems to be synchronization overhead, and we are investigating more efficient implementations.)

Figure 2: Ping Pong Distributed Memory

Figure 3: Ping Pong Shared Memory
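For reference, the Ping Pong measurement loop has roughly the following shape when written against the draft mpiJava-style binding [9]. The class and method names follow that draft rather than MPIJ's internal API, and the single message size and fixed repetition count are simplifications of the actual benchmark, which sweeps message sizes.

    import mpi.*;   // package name from the draft mpiJava-style binding [9]

    // Run with exactly two MPI processes; rank 0 reports the bandwidth.
    public class PingPong {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int reps = 100;
            byte[] buf = new byte[65536];   // one message size only in this sketch

            double start = MPI.Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, 1, 0);
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, 1, 0);
                } else if (rank == 1) {
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, 0, 0);
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, 0, 0);
                }
            }
            double elapsed = MPI.Wtime() - start;
            if (rank == 0) {
                // two messages per iteration, one direction at a time
                double mbPerSec = (2.0 * reps * buf.length) / (elapsed * 1.0e6);
                System.out.println("bandwidth: " + mbPerSec + " MB/s");
            }
            MPI.Finalize();
        }
    }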
Ping Ping. The Ping Ping test (Figure 4 and Figure 5) finds the maximum bandwidth that can be obtained between two nodes when messages are being sent simultaneously in both directions. Once again, MPIJ distributed memory performance is equivalent to that of WMPI. However, in this case, WMPI significantly outperforms MPIJ in a shared memory environment.

Figure 4: Ping Ping Distributed Memory

Figure 5: Ping Ping Shared Memory

Communication of Various Primitive Types. The Ping Pong and Ping Ping tests measure communication of bytes. As stated previously, communicating other data types is more troublesome in Java. Figure 6 compares MPIJ and WMPI communication of double-precision floating-point data and integer data. The native marshaling technique mentioned previously allows MPIJ to reach essentially the same performance as WMPI on double-precision floating-point data, and on integer data MPIJ actually outperforms WMPI slightly. If a native marshaling library is unavailable, MPIJ will use pure Java marshaling. The lowest line in Figure 6 represents MPIJ double-precision floating-point communication performance when Java marshaling is used, and clearly shows that the use of Java marshaling instead of native marshaling results in significantly worse performance.

Figure 6: Communication of Primitive Types

Startup Latency. Table 1 shows MPIJ startup latency for both distributed and shared memory. MPIJ distributed memory latency is lower than that of WMPI. This is possibly due to the fact that MPIJ is implemented directly on the Java socket API, while WMPI relies on an intermediate API before accessing the Windows socket API. However, MPIJ is significantly slower in shared memory mode. This is possibly due to Java synchronization overhead.

System   Startup Latency (Shared Memory)   Startup Latency (Distributed Memory)
WMPI     60 μsec                           422 μsec
MPIJ     109 μsec                          352 μsec

Table 1: Latencies for MPIJ and WMPI

7.3 Other Benchmarks

Barrier. The Barrier test measures process synchronization performance. Figures 7, 8, and 9 compare MPIJ performance to WMPI for the hybrid system, the shared memory system, and the distributed memory system. MPIJ performs well in both the hybrid and distributed memory modes, but is significantly slower in shared memory. This performance gap should shrink significantly once we optimize the shared memory barrier code.

Figure 7: Barrier Hybrid Memory

Figure 8: Barrier Shared Memory

Figure 9: Barrier Distributed Memory

NAS Parallel Benchmarks: Integer Sort. As a final test, we evaluated the performance of MPIJ on a single NAS Parallel Benchmark, Integer Sort [15]. We compare this performance with the performance of both WMPI on the four-processor Xeon and MPI on an IBM SP2. A critical element for this benchmark is the performance of the MPI function ALLTOALLV. We optimized this function to exploit shared memory variables. As shown in Figure 10, MPIJ was able to outperform WMPI, and performed quite well compared to the SP2 [16].

Figure 10: Integer Sort
8 Conclusions

We have shown several instances where MPI implemented in Java can match the performance of native MPI in a clustered environment. Achieving this performance in Java requires careful implementation of data marshaling. Currently, data marshaling must occur in native code in order to achieve high performance. In our view, this functionality should be added to the core Java classes, either by allowing System.arraycopy to copy between arrays of different types or by adding a method which has this functionality. However, this still requires a memory copy. The most demanding environments will require a zero-copy communication system. This is possible by adding a class similar to DataOutputStream that is capable of sending arrays without marshaling unless the message destination requires different byte ordering.
We have also shown that Java MPI implementations which allow multiple threads to exist in a single JVM can exploit shared access to static (class) variables. We have demonstrated how this technique can be used by MPI to speed up global operations, but application programmers could also use threads directly to allow shared access to data without any message passing. As Java MPI implementations mature and incorporate key communication capabilities, they will be able to provide a viable alternative to native MPI implementations.
9 Future Work

We have examined some of the most critical Java MPI performance issues, but there are still many other open questions to be addressed. In addition, while our implementation of MPI contains the most essential functionality, it is not yet complete. Future work will address implementation of the remaining MPI features as they are included in the final Java MPI bindings, as well as implementation on supercomputers such as the IBM SP2.
References

[1] MPI Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8(3/4), 1994.

[2] A. Ferrari. JPVM: Network parallel computing in Java. Concurrency: Practice and Experience, vol. 10 (11-13), pp. 985-992, 1998. http://www.cs.virginia.edu/jpvm

[3] M. Philippsen and M. Zenger. JavaParty - transparent remote objects in Java. Concurrency: Practice and Experience, vol. 9 (11), pp. 1225-1242, 1997.

[4] H. Pedroso, L. M. Silva, and J. G. Silva. Web-based metacomputing with JET. Concurrency: Practice and Experience, vol. 9 (11), pp. 1169-1173, 1997.

[5] P. Gray and V. Sunderam. IceT: Distributed computing and Java. Concurrency: Practice and Experience, vol. 9 (11), pp. 1161-1167, 1997.

[6] JavaSoft. Remote Method Invocation. Technical report, http://java.sun.com/products/jdk/1.1/docs/guide/rmi/index.html, 1997.

[7] JavaSoft. JavaSpaces. Technical report, http://chatsubo.javasoft.com/javaspaces/, 1997.

[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM 3 user's guide and reference manual. Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, Sept. 1994.

[9] B. Carpenter, G. Fox, G. Zhang, and X. Li. A draft Java binding for MPI. Nov. 1997. http://www.npac.syr.edu/projects/pcrc/HPJava/mpiJava.html

[10] S. Mintchev and V. Getov. Towards portable message passing in Java: Binding MPI. In M. Bubak, J. Dongarra, J. Wasniewski (Eds.), Recent Advances in PVM and MPI, LNCS, Springer, pp. 135-142, Nov. 1997. http://perun.hscs.wmin.ac.uk/JavaMPI/

[11] B. Carpenter, V. Getov, G. Judd, T. Skjellum, and G. Fox. MPI for Java: Position Document and Draft API Specification. Technical Report JGF-TR-03, Java Grande Forum, Nov. 1998. http://www.javagrande.org/reports.htm

[12] Java Grande Forum. Making Java Work for High-End Computing. Technical Report JGF-TR-01, Java Grande Forum, Nov. 1998. http://www.javagrande.org/reports.htm

[13] WMPI. Technical report, http://dsg.dei.uc.pt/w32mpi/, 1998.

[14] M. Baker and G. Fox. MPI on NT: A preliminary evaluation of the available environments. In J. Rolim (Ed.), Parallel and Distributed Computing (12th IPPS and 9th SPDP), LNCS, Springer, pp. 549-563, April 1998.

[15] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994. http://science.nas.nasa.gov/Software/NPB/

[16] V. Getov, S. Flynn-Hummel, and S. Mintchev. High-performance parallel programming in Java: Exploiting native libraries. Concurrency: Practice and Experience, vol. 10 (11-13), pp. 863-872, 1998.