Massively Parallel Computing in Java

Vladimir Getov†, Susan Flynn-Hummel‡, Sava Mintchev†, Ton Ngo‡
† University of Westminster    ‡ IBM T. J. Watson Research Center

TR-CSPE-10, November 12, 1997
To appear in the Proceedings of MPPM'97, IEEE Computer Society Press
Abstract
Although Java was not specifically designed for the computationally intensive numeric applications that are the typical fodder of highly parallel machines, its widespread popularity and portability make it an interesting candidate vehicle for massively parallel programming. With the advent of high-performance optimizing Java compilers, the open question is: how can Java programs best exploit massive parallelism? The authors have been addressing this question via libraries of Java routines for specifying and coordinating parallel codes. It would be most desirable to have these routines written in 100%-Pure Java; however, a more expedient solution is to provide Java wrappers (stubs) to existing parallel coordination libraries, such as MPI. MPI is an attractive alternative because, like Java, it is portable. We discuss both approaches here. In undertaking this study, we have also identified some minor modifications of the current language specification that would make 100%-Pure Java parallel programming more natural.
1 Introduction
Because of fundamental physical limits on the speed of uniprocessors, the advent of parallel computing as the panacea for the ever-increasing demands for computational power in this information age has been widely heralded. To paraphrase Mark Twain, however, the death of the uniprocessor has been greatly exaggerated, and the parallel computing industry continues to flounder. Although companies such as IBM and Intel market machines with over 512 processors, parallel computing has yet to become ubiquitous. Arguably the most serious obstacle to the acceptance of parallel computing is the so-called software crisis (see papers in [15]). Software, in general, is considered the most complex artifact in computer science [4]; since the lifespan of parallel machines has been so
brief, their software environments rarely reach maturity, and the parallel software crisis is especially acute. Hence portability, in particular, is a critical issue in enabling high-performance parallel computing. It has been posited that the parallel computing industry needs to take a lesson from the PC and workstation industries, whose profitability was driven by applications (desktop management and publishing, for example) [10]. Indeed, the parallel computing industry has been described as solutions looking for problems [1]; what have been in short supply are killer applications. Two attractive candidates are financial and new media applications: they are information-driven and computationally intensive, as well as being growing markets. Both financial and new media applications are increasingly being written in Java, a highly portable language [5]; Java thus addresses the dual goals of portable parallel software and support for potential killer parallel applications. Obviously single-node performance is also important, and, fortunately, rapid progress is being made in developing optimizing Java compilers, such as hpcj from IBM for the RS/6000 processor. There are several ways that Java programs can exploit parallelism, for example by using:

1. the Java concurrency constructs (see the sketch below),

2. the Java Remote Method Invocation (RMI) standard, or

3. Java wrappers (stubs) for existing parallel constructs, e.g. those of the Message Passing Interface (MPI) library.

Java and RMI were designed as vehicles for programming the World Wide Web (WWW), and hence their concurrency features are, not surprisingly, less than ideal for programming large-scale parallel machines.
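To make option 1 concrete, here is a minimal sketch of our own, not taken from the paper: a data-parallel summation written directly with Java threads. Because Java has no parallel loop construct, every thread must be spawned, parameterized by its id, and joined by hand; the class name, thread count, and chunking scheme are all our choices.

    class ParallelSum {
        static final int NTHREADS = 4;                  // assumed thread count
        static final long[] partial = new long[NTHREADS];

        public static void main(String[] args) throws InterruptedException {
            final int[] data = new int[1 << 20];
            for (int i = 0; i < data.length; i++) data[i] = 1;

            Thread[] workers = new Thread[NTHREADS];
            for (int t = 0; t < NTHREADS; t++) {
                final int id = t;                       // each worker is parameterized by its id
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        int chunk = data.length / NTHREADS;
                        int lo = id * chunk;
                        int hi = (id == NTHREADS - 1) ? data.length : lo + chunk;
                        long sum = 0;
                        for (int i = lo; i < hi; i++) sum += data[i];
                        partial[id] = sum;              // no collective reduction; done by hand
                    }
                });
                workers[t].start();
            }

            long total = 0;
            for (int t = 0; t < NTHREADS; t++) {
                workers[t].join();                      // explicit join, thread by thread
                total += partial[t];
            }
            System.out.println("sum = " + total);       // prints 1048576
        }
    }

Even for this trivial loop, the bookkeeping (chunking, ids, joins, a shared result array) is exactly the kind of code that the library routines and MPI wrappers discussed below are meant to hide.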
Large-scale parallel machines are often programmed in a Single Program Multiple Data (SPMD) style, which means that all nodes execute the same code, parameterized by their node ids. The typical internode communication/synchronization paradigms are either collective (e.g., barriers, broadcasts, etc.) or producer-consumer (i.e., one node produces data that is consumed by another). MPI [13, 14] is a widely adopted communication library standard, and hence, like Java, portable. It has been implemented on many parallel machines and clusters of workstations. MPI has a rich set of collective message-passing routines; although one can establish channels between processors, it does not provide high-level support for producer-consumer sharing. Unfortunately, for minor technical reasons, producer-consumer sharing is currently also awkward to implement in Java [6]. In the remainder of this paper, we explore using Java as a massively parallel programming language. We begin in Section 2 by characterizing parallel machine applications. The suitability of writing such applications in Java is then considered and some problems identified. It is concluded that the most expedient way to write massively parallel programs in Java is to write wrappers for MPI routines. Using a tool developed at the University of Westminster, Java wrappers for MPI routines were generated [12] and ported to an IBM SP. The SP consists of a collection of RS/6000 nodes connected by a high-speed switch, which allowed us to use the IBM optimizing hpcj compiler. The generation of the wrappers and some experimental results are described in Section 3.
2 Parallelism

The parallelism found in systems and in numeric applications differs significantly, and the two kinds have been classified as control and data parallelism, respectively [6]. Parallelism in systems is typically heterogeneous with irregular communication patterns, while, as noted above, in numeric applications it is typically homogeneous with regular communication patterns. Parallel constructs, such as monitors, were originally included in programming languages to express the inherent concurrency found in operating systems. These constructs are also well suited for distributed computing. However, they tend to be too heavy-weight for parallel computing, whose raison d'etre is efficiency. The primary sources of parallelism in numeric codes are loops, whose iterates communicate collectively or in a producer-consumer fashion. They are written in a data-parallel or SPMD manner, where data (e.g., an array) provides the parallel dimension. Control and data parallelism also differ in their abundance: in systems there are often only a few parallel activities, while in numeric codes the parallelism may be massive.

The parallel constructs of Java were intended to support Internet and Graphical User Interface (GUI) applications, that is, systems. They include parallel threads and monitors. A Java Remote Procedure Call (RPC) interface, called RMI, has been defined for distributed computing. Java does not contain a parallel loop construct (one must explicitly spawn off and initialize each parallel thread of execution), nor does it have a construct for collective communication. A rather odd quirk of the Java language is that monitor signals (notifies in Java terminology) are not queued; that is, a signal issued by a thread is lost if there are no pending waits. This makes programming producer-consumer code awkward [6], since one has to ensure that the consumer is ready before the producer.
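The standard workaround for non-queued notifies has to be re-coded by hand each time it is needed. The following is a minimal sketch of our own (not from the paper): a one-slot buffer in which every wait is guarded by a state flag, so a notify issued before the other party is waiting is never relied upon.

    // A one-slot buffer whose waits are guarded by a flag, so a notify()
    // issued when nobody is waiting is harmless.
    class Slot {
        private int value;
        private boolean full = false;

        public synchronized void put(int v) throws InterruptedException {
            while (full) wait();          // wait until the slot is empty
            value = v;
            full = true;
            notifyAll();                  // wake a waiting consumer, if any
        }

        public synchronized int get() throws InterruptedException {
            while (!full) wait();         // re-check the flag; a missed notify does not matter
            full = false;
            notifyAll();                  // wake a waiting producer, if any
            return value;
        }
    }

Packaging such patterns once and for all is part of what the library routines discussed next are intended to do.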
To facilitate SPMD programming in Java, a library of routines could be provided. One approach is to implement these routines with calls to the Java concurrency constructs; a compiler with a priori knowledge of the routines could then optimize the generated code for a particular architecture. The program development methodology would consist of three steps:

1. Compile the code using the standard Sunsoft Java compiler to ensure compatibility.

2. Compile the code using a compiler optimized for a specific machine; portability is lost, but safety is retained.

3. Further improve performance by turning off type checking and recompiling.

We have proposed such a library elsewhere [6], and are currently working on its implementation. A second approach to massively parallel programming in Java is to generate wrappers for an existing communications library; MPI, although fairly low-level, is a good choice because it is very portable. In the next section, we discuss the generation of, and experimentation with, Java MPI-routine wrappers. The Java wrappers were generated automatically from C function declarations. Higher-level communication libraries that use calls to MPI routines have been designed for C and Fortran [11], and the same could be done for Java, thereby achieving both ease of use and portability.
3 Binding a native MPI library to Java
The binding of MPI to Java amounts to dynamically linking an existing C library to the Java virtual machine. At first sight this should not be a problem, as Java implementations support a native interface via which C functions can be called. There are some hidden problems, however. First of all, native interfaces are reasonably convenient when writing new C code to be called from Java, but rather inadequate for linking pre-existing C code. The difficulty stems from the fact that Java in general has different data formats from C, and therefore existing C code cannot be called from Java without prior modification. Linking a C library (e.g. MPI) to Java is also accompanied by portability problems. The native interface is not part of the Java language specification [5], and different vendors offer incompatible interfaces. Furthermore, native interfaces are not yet stable and are likely to undergo change with each new major release of a Java implementation (JNI in Sun's JDK 1.1 is regarded as the definitive native interface, but it is not yet supported in all Java implementations on different platforms by other vendors). Thus, to maintain the portability of MPI libraries (which is, after all, their defining feature!), one would have to cater for a variety of native interfaces.
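For concreteness, the following sketch of ours shows roughly what calling C from Java through a native interface involves; it is written in the style of Sun's JNI mentioned above rather than the JDK 1.0.2 interfaces actually used in this work, and the class and library names are invented. The matching C side must export a function that follows the interface's naming convention (Java_NativeAdd_add for JNI) and use the interface's C types; writing that per-function stub code is exactly what the JCI tool described below automates.

    // Illustration only (our sketch, JNI style); the 1997 native interfaces differ in detail.
    public class NativeAdd {
        static { System.loadLibrary("nativeadd"); }       // hypothetical shared library name
        public native int add(int a, int b);              // body is supplied by a C stub function
        public static void main(String[] args) {
            System.out.println(new NativeAdd().add(2, 3)); // prints 5 once the stub is linked
        }
    }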
3.1 The Java-to-C interface generator
In order to call a C function from Java, we have to supply for each formal argument of the C function a corresponding actual argument in Java. Unfortunately, the disparity between data layout in the two languages is large enough to rule out a direct mapping in general. For instance:
- primitive types in C may be of varying sizes, different from the standard Java sizes;

- there is no direct analog to C pointers in Java;

- multidimensional arrays in C have no direct counterpart in Java (see the sketch after this list);

- C structures can be emulated by Java objects, but the layout of the fields of an object may be different from the layout of a C structure;

- C functions passed as arguments have no direct counterpart in Java.
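The array item is easy to make concrete. The following small program is our illustration, not the paper's: a Java two-dimensional array is an array of references to separately allocated rows, so there is no single contiguous block that could be handed to a C function expecting, say, a double[4][4].

    // Java "2-D" arrays are ragged arrays of row references, not one contiguous block as in C.
    public class Layout {
        public static void main(String[] args) {
            double[][] a = new double[4][4];   // one spine array plus four row arrays
            a[2] = new double[17];             // a row can be replaced and resized at will
            System.out.println(a[2].length);   // prints 17; copying or flattening is
        }                                      // needed before passing such data to C
    }

Bridging such mismatches is precisely the conversion work that the stub layer described next has to perform.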
We want to link a large C library (MPI) to a Java virtual machine. Because of the disparity between C and Java data types, we are faced with two options: rewrite the library C functions so that they conform to the particular native interface of our Java VM, or write an additional layer of "stub" C functions which would provide an interface between the Java VM (or rather its native interface) and the library. Software engineering considerations make the first option a nonstarter: it is not our job to tamper with a library supported by others. But the second option is not very attractive either, considering that a native library like MPI can have hundreds of accessible functions. The solution is to automate the creation of the additional interface layer.

The Java-to-C interface generator, or JCI, takes as input a header file containing the C function prototypes of the native library. It outputs a number of files comprising the additional interface: a file of C stub functions; a Java file of class and native method declarations; and shell scripts for doing the compilation and linking. The JCI tool generates a C stub function and a Java native method declaration for each exported function of the native library. Every C stub function takes arguments whose types correspond directly to those of the Java native method, and converts the arguments into the form expected by the C library function.

As we mentioned in Section 3, different Java native interfaces exist, and thus different code may be required for binding a native library to each Java implementation. We have tried to limit the implementation dependence of the JCI output to a set of macro definitions describing the particular native interface. Thus a library can be re-bound to a new Java machine simply by providing the appropriate macros.

Every MPI library has in excess of 120 functions. The JCI tool allowed us to bind all those functions to Java without extra effort. Since all MPI libraries are standardized, the binding generated by JCI should be applicable without modification to any MPI library. From the application programmer's perspective, accessing MPI functions in Java is no harder than it is in C.

Example 3.1 A small test program using MPI:

    class TestMPI {
      public static void main(String args[])
      { MPIconst     constMPI = new MPIconst();
        MPI          javaMPI  = new MPI();
        ObjectOfJint argc     = new ObjectOfJint(args.length),
                     rank     = new ObjectOfJint(),
                     answer   = new ObjectOfJint(0);

        constMPI.MPI_Init (argc, args);
        javaMPI.MPI_Comm_rank (constMPI.MPI_COMM_WORLD, rank);
        if (rank.val == 0) answer.val = 42;
        javaMPI.MPI_Bcast (answer, 1, constMPI.MPI_INT, 0,
                           constMPI.MPI_COMM_WORLD);
        System.out.println ("My rank is " + rank.val +
                            ", The Answer is " + answer.val);
        javaMPI.MPI_Finalize ();
      }
    }
When the program of Example 3.1 is run in SPMD mode, the root process (whose rank is 0) broadcasts an integer to all other processes. The example illustrates the use of Java objects to simulate C pointers in a type-safe way. The class ObjectOfJint, whose definition has been generated by the JCI tool, contains a single field val of type int. An object of that class, e.g. answer, acts as a pointer to an integer, and it can be dereferenced as answer.val. As the Java binding for MPI has been generated automatically from the C prototypes of the MPI functions, it is very close to the C binding. This similarity means that the Java binding is almost completely documented by the MPI-1 standard, with the addition of a table of the mapping of C types into Java types (for the mapping of primitive types, see the documentation of the particular Java implementation; for other types, see the documentation of the JavaMPI binding). All MPI functions reside in one class (MPI), and all MPI constants in another class (MPIconst). However, there is nothing to prevent us from parting with the MPI-1 C-style binding and adopting a more object-oriented approach by grouping MPI functions into a hierarchy of classes. So far we have bound MPI to two varieties of the Java virtual machine: JDK 1.0.2 [9] for Solaris and for AIX 4.1 [8]. The MPI implementation we have used is LAM from the Ohio Supercomputer Center [3].
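The generated declarations themselves are not listed in the paper. Judging from the names used in Example 3.1 and the description above, the Java side of the binding might plausibly resemble the following sketch; the signatures, the communicator type, and the omission of most of the 120-plus functions are our guesses rather than actual JCI output.

    // Hypothetical reconstruction of a fragment of the generated Java-side binding;
    // the real JCI output may differ (and each public class would go in its own file).
    public class ObjectOfJint {                  // stands in for a C "int *"
        public int val;                          // dereference as obj.val
        public ObjectOfJint()      { }
        public ObjectOfJint(int v) { val = v; }
    }

    public class MPI {                           // one native method per exported MPI function
        public native int MPI_Comm_rank(Object comm, ObjectOfJint rank);  // comm type is a guess
        public native int MPI_Finalize();
    }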
3.2 Experimental results
In order to evaluate the performance of the Java binding to native libraries, we have translated into Java a C + MPI benchmark: the IS kernel from the NAS Parallel Benchmark suite NPB 2.2 [2]. The program sorts in parallel an array of N integers; N = 8M for IS Class A. The original C and the new Java versions of IS are quite similar, which allows a meaningful comparison of performance results.

We have run the IS benchmark on two platforms: a cluster of Sun Sparc workstations, and the IBM SP2 system at the Cornell Theory Center. Each SP node used has a 120 MHz POWER2 Super Chip processor, 256 MB of memory, a 128 KB data cache, and a 256-bit memory bus. The results obtained on the SP2 machine are shown in Table 1 and Figure 1. The two Java implementations used are IBM's port of JDK 1.0.2D (with the JIT compiler enabled) and IBM's Java compiler (hpcj with flags -O and -noruncheck). The MPI library is a customized version of LAM 6.1 (earlier results obtained with the original LAM 6.1, as reported in [12], show poor scalability with respect to the number of processors). We opted for LAM rather than the proprietary IBM MPI library because the version of the latter available to us (PSSP 2.1) does not support the re-entrant C library required for Java [7]. A fully thread-safe implementation of MPI has not been required in our experiments so far, since MPI is used in a single-threaded way; however, programs compiled by hpcj must be linked to the re-entrant C library. The results for the C version of IS under both LAM and IBM MPI are also given for comparison.

It is important to identify the sources of the slowdown of the Java version of IS running under JDK with respect to the C version. To that end we have instrumented the JavaMPI binding and gathered additional measurements. It turns out that the cumulative time spent in the C functions of the JavaMPI binding is approximately 20 milliseconds in all cases, and thus has a negligible share in the breakdown of the total execution time for the Java version of IS. Clearly the JavaMPI binding does not introduce a noticeable overhead in the results of Table 1. The performance of the Java IS programs compiled with hpcj is very impressive, and provides evidence that the Java language can be used successfully in high-performance computing.
Class A                               Number of processors
Language   MPI implem.           1        2        4        8       16

Execution time (sec):
JDK        LAM                   -    48.04    24.72    12.78     6.94
hpcj       LAM                   -    23.27    13.47     6.65     3.49
C          LAM               42.16    24.52    12.66     6.13     3.28
C          IBM MPI           40.94    21.62    10.27     4.92     2.76

Mop/s total:
JDK        LAM                   -     1.75     3.39     6.56    12.08
hpcj       LAM                   -     3.60     6.23    12.62    24.01
C          LAM                1.99     3.42     6.63    13.69    25.54
C          IBM MPI            2.05     3.88     8.16    14.21    30.35
Table 1: Execution statistics for the C and Java IS benchmarks on the IBM SP2 machine at Cornell Theory Center, July 1997
Figure 1: Execution time for IS Class A on the IBM SP2 system at Cornell Theory Center, July 1997 (execution time in seconds versus number of processors, 2 to 16, for the four configurations of Table 1: JDK + LAM, hpcj + LAM, C + LAM, and C + IBM MPI)
References

[1] G. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin/Cummings, 2nd edition, 1994.

[2] D. Bailey et al. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994. http://science.nas.nasa.gov/Software/NPB.

[3] G. Burns, R. Daoud, and J. Vaigl. LAM: An open cluster environment for MPI. In Supercomputing Symposium '94, Toronto, Canada, June 1994. http://www.osc.edu/lam.html.

[4] E. A. Feigenbaum. The intelligent use of machine intelligence. CrossTalk: The Journal of Defense Software Engineering, 8(8):10-13, August 1995.

[5] J. Gosling, W. Joy, and G. Steele. The Java Language Specification, Version 1.0. Addison-Wesley, Reading, Mass., 1996.

[6] S. F. Hummel, T. Ngo, and H. Srinivasan. SPMD programming in Java. Concurrency: Practice and Experience, June 1997.

[7] IBM. PE for AIX: MPI Programming and Subroutine Reference. http://www.rs6000.ibm.com/resource/aix resource/sp books/pe/.

[8] IBM UK Hursley Lab. Centre for Java Technology Development. http://ncc.hursley.ibm.com/javainfo/hurindex.html.

[9] JavaSoft. Home page. http://www.javasoft.com/.

[10] A. Krekelis. Multimedia: an arena for the culmination of parallel computing. IEEE Concurrency, pages 6-8, January-March 1997.

[11] S. Mintchev and V. Getov. PMPI: High-level message passing in Fortran 77 and C. In B. Hertzberger and P. Sloot, editors, High-Performance Computing and Networking (HPCN'97), pages 603-614, Vienna, Austria, 1997. Springer LNCS 1225.

[12] S. Mintchev and V. Getov. Towards portable message passing in Java: Binding MPI. In Proceedings of EuroPVM-MPI, pages 135-142, Krakow, Poland, November 1997. Springer LNCS 1332.

[13] MPI Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8(3/4), 1994.

[14] MPI Forum. MPI-2: Extensions to the Message-Passing Interface. http://www.mcs.anl.gov/Projects/mpi/mpi2/mpi2.html, 1997.

[15] U. Vishkin, editor. Developing a Computer Science Agenda for High-Performance Computing, pages 30-31. ACM Press, 1994.