AJaPACK: Experiments in Performance Portable Parallel Java Numerical Libraries

Shigeo Itou, Satoshi Matsuoka, Hirokazu Hasegawa
Tokyo Institute of Technology
2-12-1 Ookayama, Meguro-ku, Tokyo, Japan
{itou,matsu,b962220}@is.titech.ac.jp

ABSTRACT

Although Java promises platform portability among a diverse set of systems, for most Java platforms today it is not clear whether they are appropriate for high-performance numerical computing. In fact, most previous attempts at utilizing Java for HPC sacrificed Java's portability, or did not achieve the performance required for HPC. Instead, we propose an alternative methodology based on a Downloadable Self-tuning Library, and constructed an experimental prototype called AJaPACK, a portable, high-performance parallel BLAS library for Java which "tunes" itself to the environment onto which it is installed. Once AJaPACK is downloaded and executed, the Java version of ATLAS (ATLAS for Java) and the parallelized version of JLAPACK combine to achieve optimized pure Java execution for the given environment. Benchmarks have shown that AJaPACK achieves approximately 1/2 to 1/5 of the speed of optimized C-ATLAS and vendor-supplied BLAS libraries, and with portable parallelization in SMP environments, achieves performance superior to single-threaded C-based native libraries. This is an order of magnitude better w.r.t. performance than previous pure Java BLAS libraries, and opens up further possibilities of employing Java in HPC settings, but it also shows that JIT compilers with optimizations expecting numerical code highly tuned at the source or bytecode level would be highly desirable.

1. INTRODUCTION

Distributed and high-performance computing are becoming much more synergetic thanks to the widespread availability of high-performance networks and inexpensive computing nodes such as PC clusters. In such an environment, HPC applications and libraries that are highly portable across a variety of highly divergent execution platforms are in absolute need. Traditional HPC languages such as C or Fortran do not completely standardize the language, the library, the machine binary, the parallel machine architecture, nor the execution environment. As a result, it is quite difficult to have portable HPC applications that work ubiquitously in a distributed environment. Although efforts such as HPF and OpenMP strive to achieve some portability of parallel code, a single binary working across different parallel machines is still unavailable.

Java has received considerable attention in the HPC community, thanks to its good object-oriented language design, as well as support of standard language features for parallel and distributed computing such as threads and RMI. Moreover, standardized Java bytecode supports portable execution via the Java Virtual Machine. On the other hand, it is not clear whether Java as is is entirely appropriate for portable distributed HPC programming, as pointed out by various efforts including those by Java Grande. In particular, the Java execution environment is quite divergent, ranging from pure interpretation and dynamic JIT compilation to static compilation; as a result, an optimization scheme for a particular Java platform may not apply well to others. This includes not only the compiler, but also the run-time system.

We claim that the important characteristic for HPC in a network environment is "performance portability" for code that is downloaded and moves across the network. By performance portability we do not mean that a given code must perform the same across divergent machines; rather, it is a property whereby the same code will perform within some certain fraction of the peak performance of different machines. For example, when a certain optimization is performed on the code, if the code exhibits measurable performance improvement over different machines, then the optimization is performance portable. For Fortran and C this is typically the case, but for Java it is not clear, and in fact several counter cases have been reported [1]. From this perspective, we categorize several previous approaches employed for HPC computing in Java:



• Employing Native Libraries: An HPC native library is called using interfaces such as JNI. Kernel numerical performance obviously matches that of C or Fortran, but portability naturally suffers; moreover, maintaining multiple native code versions for downloading would be quite cumbersome. Examples include JCI (a Java-to-C interface generator) [2] and the frontend wrapper in IceT [3].

• Optimizing (Static) Compiler: A traditional optimizing compiler, as with C and Fortran. Examples include the IBM High Performance Compiler for Java [4] and Fujitsu HBC [6]. In this case one obtains utmost efficiency, sometimes almost matching that of optimizing Fortran [4]. On the other hand, not only does portability suffer, because the compiler only speeds up the program on a particular platform, but it also becomes more difficult to implement the dynamic aspects of Java, such as dynamic loading, security checks, reflection, etc. In fact, as far as we know, both compilers do not support the official full Java spec in this respect.


• Automated translation from other programming languages to Java: There are several tools that translate existing code in traditional HPC languages, such as Fortran, into Java. An example is f2j [7], where Fortran LAPACK is automatically translated into Java in a source-to-source fashion. However, in order to implement Fortran semantics, the translated source embodies various artifacts such as "pseudo-gotos", and as a result becomes rather difficult to understand for further tuning. Moreover, as it does not take specific Java features into account, the performance of the pure Java code is low compared to the original native Fortran.

• Manual Tuning: We tune a particular Java HPC code at either the source or bytecode level by hand, leaving the low-level optimization to the JIT compiler. Examples include the evaluation of Daxpy and other BLAS implementations by Pozo [8]. Although portability is quite high with this approach, there is no guarantee that performance portability is achieved on different platforms, due to the divergent nature of Java platform implementations mentioned above.

In order to achieve performance portability of Java numerical code, we are investigating the construction of a self-tuning library and compilation framework for Java. Basically, we obtain the best performance on each Java execution platform by automatically tuning for that particular platform at the source and bytecode level. This allows leveraging of existing JVMs and JIT compilers, while still obtaining the best speed achievable for that particular platform. More concretely, we implement the compiler/code-generator, performance monitor, code tester, the high-level library classes, etc., entirely in Java; when a particular library is downloaded over the network for the first time, not only the library class file itself but the entire self-tuning framework is downloaded, and optimization occurs either on the spot or off-line when machine utilization is low. Moreover, we define or generate parallel multithreaded code when possible, thanks to the portability of Java threads. Although a similar strategy has been attempted in various settings, most recently in efforts such as ATLAS and PHiPAC described below, with Java it is easy to make the entire process automated, without user intervention. Furthermore, a self-tuning library is more significant for Java, again due to the relatively divergent performance characteristics of Java execution environments. On the other hand, there are several technical challenges that are open questions, mainly regarding the feasibility of the approach:

• Ubiquitous Effectiveness of Self-tuning on Various Java Platforms: Although previous work on adaptive compilation, as well as more aggressive self-tuning libraries such as PHiPAC [9] and ATLAS [10], has shown success for Fortran or C, often achieving the performance of vendor-tuned libraries, it is not clear whether the same strategy would be effective for Java platforms. In fact, there are various ways in which traditional optimization strategies applicable to Fortran and C would not be effective; we must investigate whether self-tuning will (or will not) perform well under different Java platforms. A partial study for the L1-blocked BLAS core has been done in [1], but a study using a larger-grained library with parallel execution is required.

• Effective Architecture for Self-tuning Libraries: Since libraries are somewhat persistent and used multiple times over different applications, one can afford longer time durations for tuning. However, when libraries are huge, tuning time may still outweigh the gain in execution time of the application utilizing the library. Thus, as with traditional libraries, the tuning architecture would ideally identify the kernel routines that would be most beneficial, and leave the non-kernel, higher-level routines to standard JIT compiler optimization. It is not clear, however, how much the slower execution of higher-level routines under Java (caused, for example, by non-aliasable, non-contiguous array semantics often requiring array copies) would penalize the overall library performance.

• Use of Java Threads for Large-scale HPC: Although Java is multithreaded at the programming language level, it is not clear how well parallel multithreading will scale, especially w.r.t. numerical computation. Early versions of Java only supported "green threads", which were merely coroutines. Many recent versions of Java are truly multithreaded (native threads), but various research has shown the substantial overhead associated with multithreading in Java and proposed various solutions (six papers on Java thread synchronization appeared in OOPSLA'99 [11]). All such research we know of to date, however, optimizes cases where the threads do NOT synchronize, eliminating or minimizing the cost of synchronization. It is not clear how scalable Java performance is for highly-tuned numerical code, especially for code which DOES synchronize and suffers from other overheads such as thread scheduling.

In order to investigate whether downloadable self-tuning libraries are feasible for attaining performance portability, we are constructing AJaPACK, a prototype self-tuning linear algebra package for dense matrices, as a proof-of-concept experiment. AJaPACK employs our pure-Java port of ATLAS [10], called ATLAS for Java. ATLAS for Java, as with the original ATLAS, allows generation of optimized L1-blocked small-matrix kernel Level 3 BLAS code for each Java platform. Experiments show that, on various platforms, ATLAS for Java benefits from the L1 blocking optimization, and exhibits 1/2 to 1/4 the speed of the C version of ATLAS on the best JIT compilers for the particular hardware.

In order to implement higher-level linear algebra routines on top of ATLAS for Java, we modified and extended JLAPACK/Harpoon [12] significantly, as well as parallelizing some of the routines, notably gemm(). This allowed a performance increase of several factors compared to the original JLAPACK/Harpoon implementation. Parallelization of gemm() was done in several ways, including static and dynamic job partitioning and thread allocation, and we tested its scalability on several different large SMP platforms. Benchmark results show that, with static parallelization, we obtain high scalability of the gemm() code on some platforms; however, dynamic job allocation did not scale as well. Moreover, we found that some platforms have relatively poor support for multithreading, resulting in performance loss due to parallelization. We believe this is because Java in general is still at a development stage with respect to scalable parallel programming; as such, support for low-overhead context switching, effective memory management in the context of parallel processing (including parallel garbage collection), etc., will need ubiquitous deployment on all Java platforms.

2. AJAPACK: AUTOMATIC PARALLEL TUNING LIBRARY FOR PARALLEL BLAS IN JAVA

2.1 Overview of Downloadable Self-Tuning Library and AJaPACK

We first overview the general architecture of a downloadable self-tuning library; as seen in Figure 1, it is composed of five components, all written in Java.

[Figure 1: Overview of Downloadable Self-tuning Library in Java. The code generator, driver, and performance monitor (ATLAS for Java) exchange tuning parameters, generated kernels, and measured results; the selected kernel is embedded into the higher-level library classes (JLAPACK/BLAS/GEMM).]

• Code Generator/Compiler: Generates the performance-critical portion of the library. Ideally, we would employ a general compiler framework such as our OpenJIT [13] as a toolkit with which to easily construct a customized code generator for each library. Since AJaPACK is a proof-of-concept prototype, however, we directly targeted dense linear algebra computation, using the code generator of ATLAS for Java as well as additional code for feasibility experimentation.

• Generated Kernel: The numerical kernel generated by the code generator. The driver tests its performance, and the optimal one is automatically embedded into the higher-level library classes.

• Driver: The driver module that tests the performance of the generated kernel and reports the measured performance to the performance monitor.

• Performance Monitor: Receives reports from the driver on the performance of each generated kernel routine, and feeds the performance back to the code generator, guiding the generation of additional test kernel code. As such, the performance monitor is parameterized by a search heuristic, and guides the pruned search for possibly optimal kernel code.

• Higher-Level Library Classes: Parts of the library that provide higher-level APIs but are not on the performance-critical path. By embedding the generated kernel tuned to the particular Java platform, the library executes at the best performance possible.

We owe much of the higher-level architecture to a generalization of ATLAS. Again, we feel that automated code tuning is even more important for Java than ATLAS was for C/Fortran, for several reasons including those mentioned earlier, namely:

1. Automated downloading and execution of the entire self-tuning library is easy because of Java portability.
2. The Java platform is much more divergent, and performance portability is more difficult to achieve with mere static code optimization.
3. As mentioned earlier, manual tuning of Java numeric code is difficult due to language design and implementation features, and it is better to resort to automated means.
4. As the tuning occurs at the source-code or bytecode level, it can readily adapt to cases where the underlying architecture changes, for with Java it is easy to maintain the downloaded code in portable format.
5. Since the validity of the generated kernel code is checked by the Java security checker, the user can have better confidence that whatever code is dynamically generated will be safer than Fortran or C code.

In AJaPACK, each component has the following implementation with respect to the systems we have ported or developed (described in detail later):

• Code Generator/Compiler: The code generator portion of ATLAS for Java, namely parts of xemit.java and xmmsearch.java as seen in Figure 2.
• Generated Kernel: The small L1 cache-blocked Level 3 BLAS in Java which xemit.java generates. Currently we emit a source file which is compiled by javac, although in principle direct generation of bytecode is also possible.
• Driver: Corresponds to fc.java of ATLAS for Java.
• Performance Monitor: Corresponds to parts of xmmsearch.java in ATLAS for Java.
• Higher-Level Library Classes: As mentioned earlier, we employed JLAPACK/Harpoon [12] by Chatterjee et al. to work with ATLAS for Java, and furthermore parallelized the BLAS routines. We describe the details of the parallelization and its implications in later sections.
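To make the interplay of these components concrete, the following is a minimal sketch of the tuning loop, written with hypothetical names (TuningLoopSketch, Kernel, generate, benchmark); it is not the actual AJaPACK code, which emits Java source, compiles it with javac, and uses the pruned ATLAS search heuristics rather than the tiny exhaustive search shown here.

    import java.util.Arrays;

    // Hypothetical sketch of the self-tuning loop: for each candidate L1 blocking
    // factor, "generate" a kernel, have the driver time it, and let the
    // performance monitor keep the best one.
    public class TuningLoopSketch {

        // Stand-in for the generated kernel: C += A * B on one NB x NB block.
        interface Kernel {
            void run(double[] a, double[] b, double[] c);
        }

        // "Code generator": here just a blocked triple loop parameterized by nb;
        // the real generator also varies unrolling and software pipelining in the
        // emitted source.
        static Kernel generate(final int nb) {
            return new Kernel() {
                public void run(double[] a, double[] b, double[] c) {
                    for (int i = 0; i < nb; i++)
                        for (int k = 0; k < nb; k++) {
                            double aik = a[i * nb + k];
                            for (int j = 0; j < nb; j++)
                                c[i * nb + j] += aik * b[k * nb + j];
                        }
                }
            };
        }

        // "Driver": time the kernel and report MFlop/s back to the monitor.
        static double benchmark(Kernel kernel, int nb, int reps) {
            double[] a = new double[nb * nb], b = new double[nb * nb], c = new double[nb * nb];
            Arrays.fill(a, 1.0);
            Arrays.fill(b, 2.0);
            long t0 = System.nanoTime();
            for (int r = 0; r < reps; r++) kernel.run(a, b, c);
            double sec = (System.nanoTime() - t0) / 1e9;
            return 2.0 * nb * nb * nb * reps / (sec * 1e6);   // MFlop/s
        }

        // "Performance monitor": exhaustive search over a tiny parameter space.
        public static void main(String[] args) {
            int[] candidateNB = {16, 24, 32, 48, 64};
            int bestNB = -1;
            double bestMflops = 0.0;
            for (int nb : candidateNB) {
                double mflops = benchmark(generate(nb), nb, 200);
                if (mflops > bestMflops) { bestMflops = mflops; bestNB = nb; }
            }
            System.out.println("best NB = " + bestNB + " at " + bestMflops + " MFlop/s");
        }
    }

Running such a loop once per download, and embedding the winning kernel into the higher-level classes, is the essence of the downloadable self-tuning library; the search space and pruning strategy are where ATLAS for Java differs from this toy version.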

2.2 Overview of the Original ATLAS

The original ATLAS (Automatically Tuned Linear Algebra Software) [10] is a source-level automated tuning tool for the BLAS, developed by Whaley and Dongarra at the University of Tennessee. ATLAS searches for the most appropriate L1 cache blocking factor, numbers of loop unrolls, and software pipelining latencies, and generates the best C-based BLAS Level 3 code for a given platform.

[Figure 2: Architecture of ATLAS (v.1.0)]

The parameters employed in ATLAS v.1.0 (the current version of ATLAS is 3.0) are basically muladd, NB, MU, NU, KU, and LAT. Muladd indicates whether fused multiply-add is available; NB is the L1 cache blocking size; MU, NU, and KU are the numbers of unrolls for each loop; and LAT is the latency factor for software pipelining. Given such a parameter space, ATLAS generates the cache-blocked kernel code for each parameter set, tests its performance, and finally outputs the one with the best performance. In order to prune the search space, ATLAS employs a prescribed search strategy [10]. Although the search space is considerably pruned, ATLAS execution is still quite expensive, and such extensive optimization is difficult to embed in a compiler. However, for libraries, such cost can be amortized over multiple executions of the library, and in fact it is reported that ATLAS-generated code matches the best results of vendor-supplied optimized BLAS libraries. We also note here that ATLAS represents all matrices in one-dimensional form, as required by LAPACK. This is an advantage for Java, as it avoids several overhead issues relevant to the performance of array accesses, including the non-rectangular representation of multi-dimensional arrays and the requirement to check array bounds for each dimension.

2.3 ATLAS for Java

ATLAS for Java is a port of ATLAS v.1.0 to Java. Because the Java VM intervenes between the native CPU and ATLAS, what we obtain by automated tuning is the best possible performance for the particular Java platform, not for the particular hardware architecture. Still, this in a sense satisfies performance portability, as it is the best performance that the particular Java platform can produce without resorting to non-portable means such as native methods. ATLAS for Java is entirely written in Java, so as to work portably across all standard JDK environments that provide standard tools such as javac. It basically employs the same search algorithm as the original ATLAS, employing cache blocking, loop unrolling, and software pipelining.

The original ATLAS makes extensive use of pointer passing for arrays and scalars. Because Java has no pointers to scalars or array interiors, we defined a wrapper class which embodies the array indices as well as the contents of the array in one-dimensional form. As objects are passed by reference, we can pass all the arguments at once. We also performed an initial analysis to verify that the object field accesses were being compiled away and not causing overhead. The same applies to normal scalars where the argument is used as an in-out parameter; again, we defined wrapper classes, and made sure that no overhead occurs due to this indirection.

Figure 3 illustrates sample code generated by ATLAS for Java. We note that arrays are referenced via fields, but otherwise local variables are used exclusively, in the hope that they will be mapped to registers, just as with C and Fortran. We also note that the parameters for this code are far from the optimum actually found; on modern-day processors with deep pipelining and large L1 caches, inner loops are unrolled over 30 times, and the generated classfile typically exceeds 200 kilobytes.

    class mm {
      void dNBmm(myarray ma, int idc) {
        do /* N-loop */ {
          do /* M-loop */ {
            indexb = indexB;
            c00_0 = 0.0;
            b0 = ma.B[indexb];
            a0 = ma.A[indexA];
            /* Start floating point pipe */
            m0 = a0 * b0;
            b0 = ma.B[indexb + 1];
            a0 = ma.A[indexA + 1];
            /* easy loop to unroll */
            for (k = 0; k > 0; k--) { indexA++; indexb++; }
            /* Do last iteration of K-loop, and drain the pipe */
            c00_0 += m0;
            m0 = a0 * b0;
            c00_0 += m0;
            indexA += 2;
            ma.C0[C0index] += c00_0;
            C0index++;
          } while (indexA != stM);
          indexA = ma.indexa;
          C0index += incC;
          indexB += 2;
        } while (indexB != stN);
      }
    }

Figure 3: Sample (not optimal) code generated by ATLAS for Java
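Since Java has no pointers into array interiors, the wrapper object described above carries the one-dimensional arrays together with the current index state, so that a single object reference conveys all in-out arguments to the kernel. Below is a minimal sketch of such a wrapper; the field names echo the Figure 3 sample (ma.A, ma.B, ma.C0, ma.indexa), but the exact layout in AJaPACK may differ, and the scalar holder is purely illustrative.

    // Sketch of the wrapper standing in for C pointers-to-array-interiors: the
    // matrices are stored in one-dimensional form and the current positions are
    // plain int fields, so one reference carries all in-out state.
    class MyArray {
        double[] A;   // left operand, 1-D storage
        double[] B;   // right operand
        double[] C0;  // result block
        int indexa;   // current offset into A (reset at the start of each block)
        int indexB;   // current offset into B

        MyArray(double[] a, double[] b, double[] c0) {
            this.A = a; this.B = b; this.C0 = c0;
        }
    }

    // Scalar in-out parameters get the same treatment: a one-field holder object
    // passed by reference instead of a pointer to the scalar.
    class DoubleRef {
        double value;
        DoubleRef(double value) { this.value = value; }
    }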

2.4 AJaPACK High-Level Library Class - JLAPACK/Harpoon

For AJaPACK's high-level library class API, we considered the available LAPACKs in Java. In order to achieve portable performance, we also undertook parallelizing the code so that multiprocessing platforms can automatically take advantage of parallel Java threads. We found two good Java LAPACKs available. One mechanically converts the LAPACK written in Fortran into equivalent Java code [7]. Thus, the resulting code is not object-oriented; rather, Java is employed as an intermediate target language. We considered this to be somewhat inappropriate, as it is difficult to have a clean API between the optimized kernel and the higher-level classes. Moreover, as far as we found, not all of the LAPACK routines execute correctly, likely due to the mechanical translation. The other one we considered is JLAPACK/Harpoon. Its architecture is well designed, with appropriate object encapsulation of array types and operations. The problem is that only a very few LAPACK routines are actually implemented (gesv, getrf, getrs, getf2, and laswp; by contrast, f2j implements all 288 LAPACK routines). For the purpose of our research on producing a prototype proof-of-concept, we decided to use JLAPACK, due to its clean object-oriented API plus stable and correct operation. The original JLAPACK consists of the three packages below:

• JLAPACK
• JBLAS
• JLASTRUCT

Packages JLAPACK and JBLAS are ports of LAPACK and the BLAS into Java, respectively. Package JLASTRUCT adds various helper and glue methods where a mere port of the Fortran LAPACK is insufficient. For AJaPACK, we reimplemented JBLAS so that JLAPACK calls the optimized kernel generated by ATLAS for Java. More concretely, we changed the xBLAS classes (where x is D, C, etc.), in particular the xgemm() methods, so that they perform optimized blocking and call the kernel routines. Moreover, we wrote glue code so that the kernel routines produce the appropriate descriptor objects for arrays employed by JLAPACK. This turned out to be nontrivial, as considerable modifications were required to ATLAS for Java, as well as a good amount of glue code for the interfaces. We are also writing additional routines available in LAPACK, such as LU decomposition, as well as parallelizing gemm() and additional algorithms. For LU, we are implementing both sequential and parallel versions of blocked LU (dgetrf) and the recursive algorithm by Gustavson [14] (rgetf2). Both employ gemm() for blocked operations.

There are several minor implementation notes. First, the blocked subarrays always need to be copied, as no aliasing is possible with Java. Instead of generating new submatrices every time (typically about 36 by 36), we try to reuse the submatrices, avoiding object allocation and deallocation. (Note that this reuse does preclude exploiting Java semantics for the 0 filling, as newly allocated arrays are guaranteed to contain 0.) Another minor note concerns arrays that do not exactly match the block size. In order to always employ the fast kernel code, we fill the unused portion with 0. A simple strategy is to determine this dynamically, but instead we generate the boundary blocks once and cache them. Although this uses O(n) memory for n x n arrays, in our benchmarks this strategy was 10-30% faster, though in some cases the inverse was seen. We provide a switch so that dynamic filling can be used in memory-tight situations.
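The block handling just described can be pictured with a short sketch: one preallocated NB x NB buffer is reused across calls, and boundary blocks that do not fill the blocking factor are zero-padded so the fixed-size kernel can still be used. The class and method names below are hypothetical; AJaPACK's actual copy routines also have to honor JLAPACK's matrix descriptors and transpose flags, and may cache the padded boundary blocks instead of refilling them.

    // Hedged sketch of reusing one NB x NB scratch block and zero-padding partial
    // boundary blocks so the fixed-size generated kernel can always be applied.
    class BlockCopy {
        static final int NB = 36;                              // typical blocking factor found by the search
        private final double[] scratch = new double[NB * NB];  // reused, never reallocated

        // Copy the block starting at (row0, col0) of an n x n row-major matrix;
        // boundary blocks are padded with zeros.
        double[] copyBlock(double[] matrix, int n, int row0, int col0) {
            int rows = Math.min(NB, n - row0);
            int cols = Math.min(NB, n - col0);
            if (rows < NB || cols < NB)
                java.util.Arrays.fill(scratch, 0.0);           // zero-pad only boundary blocks
            for (int i = 0; i < rows; i++)
                System.arraycopy(matrix, (row0 + i) * n + col0, scratch, i * NB, cols);
            return scratch;
        }
    }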

3. PARALLELIZING AJAPACK

The advantage of Java for performance portability is multithreading at the language level. In fact, it is not just the language specification itself, but also the fact that the entire system, including the run-time system, the libraries, profilers, debuggers, etc., assumes that the language is multithreaded. This is important for ubiquitous multi-threaded parallel execution; e.g., the libraries and JVMs are designed to be reentrant. Inherent support for parallelism is what distinguishes Java from C or Fortran, where parallel execution is added as an afterthought and its safety is not at all guaranteed across all systems, libraries, compilers, etc.

In order to achieve good performance in a portable way, downloadable self-tuning libraries should exploit the availability of multiple CPUs on SMPs wherever possible. On the other hand, the characteristics of truly parallel execution on all Java platforms, especially their scalability on intensive scientific code, have not been well established. We explored whether parallelization of AJaPACK would be feasible using Java threads across all platforms. Parallelization for BLAS is well known, especially with subarray blocking: if A x B is performed as a summation of Aij x Bjk, then each Aij x Bjk can be computed independently in parallel. The question is rather what style or method of parallel programming is appropriate for achieving proper scalability. So, for BLAS we implemented the three most likely styles of multithreaded parallelism, and compared their performance on SMPs ranging from 2 to 60 processors on various Java platforms:

Fine-grained Master-Worker (FMW): We generate a prescribed number of worker threads (typically the number of processors + 1-2), and each worker thread on each iteration requests work from the master work queue, computes the product, stores the result, and repeats the sequence until all the work is exhausted. In the fine-grained version we allocate a single small blocked matrix multiply per worker request. Although this seems too fine-grained, it eliminates various overheads, as in Java the arrays must be copied up front since there is no aliasing. Synchronization occurs on fetching work from the master queue and when the result is written back to the product matrix.

Coarse-Grained Master-Worker (CMW): Similar to FMW, but the worker acquires multiple tasks at once, increasing granularity. In order to reduce synchronization costs, we stage the execution into three phases, namely array copying, matrix multiply, and writing back all the results. Although this may seem more efficient than FMW, in practice it could add overhead due to the extra complexity involved.

Statically-Decomposed Fork-Join (SFJ): Since matrix multiply is deterministic, we decompose the outermost I-loop and allocate tasks to each forked worker in a balanced fashion. The program needs no synchronization for blocked distribution except for the outermost fork-join, as writebacks into the product array are independent. More sophisticated strategies, such as work-stealing amongst multiple work queues, are possible.

Of the three, Master-Worker parallelization may incur more overhead, especially for BLAS where it is relatively easy to perform static decomposition; but for arbitrary problems load balancing occurs naturally, and it is thus more flexible. Since it was obvious that SFJ would win out, we compared FMW and CMW against SFJ to see whether they would nevertheless scale equivalently, exhibiting the cost of Java multithreading overhead. These results will become apparent in the next section.
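As a concrete illustration of the SFJ style, the sketch below statically splits the outermost i-loop of C = A x B over a fixed number of Java threads and joins them at the end; no synchronization is needed inside the loop because each thread writes a disjoint row range of C. For readability it uses a plain triple loop rather than the blocked kernel that AJaPACK actually dispatches to each worker, and the class name is hypothetical.

    // Statically-decomposed fork-join (SFJ) sketch: partition the outermost i-loop
    // of C = A * B (n x n matrices in row-major 1-D arrays) across worker threads.
    public class SfjGemm {
        public static void gemm(final double[] a, final double[] b, final double[] c,
                                final int n, int numThreads) throws InterruptedException {
            Thread[] workers = new Thread[numThreads];
            for (int t = 0; t < numThreads; t++) {
                final int lo = t * n / numThreads;        // balanced block distribution
                final int hi = (t + 1) * n / numThreads;  // rows [lo, hi) for this worker
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        for (int i = lo; i < hi; i++)
                            for (int k = 0; k < n; k++) {
                                double aik = a[i * n + k];
                                for (int j = 0; j < n; j++)
                                    c[i * n + j] += aik * b[k * n + j];
                            }
                    }
                });
                workers[t].start();                       // fork
            }
            for (Thread w : workers) w.join();            // join: the only synchronization
        }
    }

The master-worker variants replace the static split with a shared, synchronized queue of block tasks; the cost of that queue access, plus the staged copy and writeback phases in CMW, is exactly the overhead the benchmarks in the next section try to expose.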

4. PERFORMANCE BENCHMARK

We measure the performance of AJaPACK: how the kernel tuned by ATLAS for Java compares against the C kernels tuned by the original ATLAS (v.2.0), and how much performance the higher-level library APIs sacrifice, on each platform. We also investigate how much of the performance can be recovered by parallel execution, and how it scales. Finally, we explore which style of parallelism is effective for implementing BLAS, as well as LU factorization.

4.1 Evaluation Environment and Methodologies

Our previous work [1] showed that x86 platforms exhibited the best sequential performance for Java. We re-tested several JIT compilers and chose the IBM JDK 1.1.8, which includes a tuned and heavily modified version of Sun's original JVM as well as a high-performance JIT compiler, and which exhibited the best performance in our initial tests. For SPARCs, we employed Sun's Research VM (Solaris 7 Production Release JVM), and turned on the flag that enables the optimizing JIT compiler, which is claimed to be faster than the default JIT (java -Xoptimize). According to our measurements, Sun's HotSpot was much slower with respect to numerical code, and in fact the adaptive compilation in the new release failed to get triggered in ATLAS for Java, resulting in performance of around 1 MFlops. We also tested several SMP and CC-NUMA platforms, namely the Ultra Enterprise servers and the Origin 2000. As mentioned above, these have been shown not to have the best sequential numerical execution speed, especially in comparison to their C counterparts. On the other hand, they claim to support high-performance native threads, which the JVMs should be exploiting for server-style applications.

• PC Platforms
  - Dual PIII PC (Pentium III 450MHz x 2) + Linux RedHat 6.0; L1 cache: 16KB instruction + 16KB data; L2 cache: 512KB; IBM JDK 1.1.8 JVM with optimizing JIT compiler.
  - Athlon PC (Athlon 600MHz) + Linux RedHat 6.0; L1 cache: 64KB instruction + 64KB data; L2 cache: 512KB; IBM JDK 1.1.8 JVM with optimizing JIT compiler.

• SMP Platforms
  - Sun Enterprise 4000 (UltraSPARC 300MHz x 8 procs) + Solaris 2.6; L1 cache: 16KB instruction + 16KB data; L2 cache: 1MB; Solaris Production Release JVM 1.2 with optimizing JIT compiler (JBE).
  - Sun Enterprise 10000 (StarFire) (UltraSPARC 250MHz x 60 procs) + Solaris 2.6; L1 cache: 16KB instruction + 16KB data; L2 cache: 4MB; Solaris Production Release JVM 1.2 with optimizing JIT compiler (JBE).
  - SGI Origin 2000 (R10000 250MHz x 16 procs) + IRIX; L1 cache: 32KB instruction + 32KB data; L2 cache: 4MB; SGI JDK 1.2.1.

For all the platforms, we performed the benchmarks with the following methodologies:

• We tested the basic kernel performance as reported by xmmsearch, the gemm() performance, and the performance of blocked and blocked recursive LU.
• As a reference, we tested C performance for the original ATLAS and the ATLAS-enhanced C-LAPACK against AJaPACK.
• For Java gemm() and LU factorizations, we tested sequential and parallel versions. For gemm(), we tested the three parallelization methodologies, namely SFJ, FMW, and CMW. (The parallel LU factorization is in early development, and is not fully optimized, as seen in the benchmarks.)
• For all benchmarks, we vary the sizes of the problem matrices. For parallel benchmarks, we also vary the number of threads. The reported performance figures are the best scores achieved over the respective parameters; for example, when we vary the matrix size for the parallel versions, the reported score is the best amongst the different numbers of threads benchmarked.

4.2 Overview of Results

Table 1 shows the peak performance achieved by each library. As we see, sequential performance differs according to the type of CPU and the Java platform employed. On the Athlon we obtain the best score for xmmsearch.java, recording nearly 300 MFlops, which is approximately 1/2 of C performance. The same is true for the Pentium III, where the absolute performance is lower although still about 1/2 of C. However, for SPARCs the Java score is lower relative to C, at approximately 1/3 of C performance. A similar phenomenon was seen for the Origin 2000.

We also see a sharp decline in performance when we move to the higher-level class libraries, in contrast to C. In fact, we observe an approximately 40-60% performance drop compared to the xmmsearch performance, whereas for C we observe a much smaller or negligible penalty. Preliminary investigation revealed that this is attributable to overhead imposed by the object-oriented nature of the library. Because gemm() generally accepts various forms of underlying data representation of the matrix (such as being transposed), copying portions of the matrix into the L1-blocked subarray requires sophisticated translations. In fact, the current version of AJaPACK employs element-wise copying by specifying higher-level matrix (not array) indices, incurring an invocation of the accessor method for each element copied. We initially assessed that the copying cost would be considerably smaller than the actual computation on the L1-blocked subarray, and that the compiler would compile away the method call and its overhead, but this was not the case. For future versions we plan to add methods that copy the subarray all at once, according to each underlying matrix data representation.

For blocked LU, we see that for C, Gustavson's recursive algorithm is generally superior to the standard, non-recursive algorithm. On the other hand, for Java the recursive algorithm is slower; our preliminary profiling analysis has revealed that this is probably due to the BLAS L1 and L2 operations not being appropriately cache-blocked. In fact, as the recursion becomes deeper, the blocksize of the ATLAS for Java-generated multiply routine turned out to be excessively large. Still, more investigation is needed.

Parallel execution on the SMP platforms scaled well, but not on a PC, where practically no benefit, and even performance loss, was incurred with parallelization.

[Table 1: Summary of Peak Performances Achieved by Each Library, in MFlops. Rows: xmmsearch, GEMM seq., GEMM par., LU block seq., LU block par., LU recur. seq., LU recur. par. Columns: C and Java figures for each platform. The xmmsearch row reads 401.9 (E4K C), 132.9 (E4K J), 281.2 (E10K C), 110.4 (E10K J), and 375.6 (PIII C). Legend: E4K = Sun Enterprise 4000 (UltraSPARC 300MHz x 8); E10K = Sun Enterprise 10000 (UltraSPARC 250MHz x 60); PIII = Dual Pentium III PC (450MHz x 2); Athl = Athlon PC (600MHz x 1); O2K = Origin 2000 (R10000 250MHz x 16). C and J denote C and Java, respectively; "--" indicates a benchmark not yet performed due to time restrictions.]
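The copy overhead identified above comes from performing the L1-block copy one element at a time through the matrix accessor, which costs a method invocation per element. The planned fix is a bulk copy specialized to the underlying one-dimensional representation; the sketch below contrasts the two approaches, with a hypothetical Matrix interface standing in for JLAPACK's actual descriptor API.

    // Sketch contrasting element-wise copy through an accessor (one method call
    // per element) with a bulk row copy specialized to 1-D row-major storage.
    // The Matrix interface here is illustrative, not JLAPACK's actual API.
    interface Matrix {
        double get(int i, int j);
        double[] storage();   // underlying 1-D row-major array
        int ld();             // leading dimension (row length)
    }

    class SubarrayCopy {
        // Current scheme: the accessor call per element dominates for small blocks.
        static void copyElementwise(Matrix m, int r0, int c0, int nb, double[] block) {
            for (int i = 0; i < nb; i++)
                for (int j = 0; j < nb; j++)
                    block[i * nb + j] = m.get(r0 + i, c0 + j);
        }

        // Planned scheme: one System.arraycopy per row of the block.
        static void copyBulk(Matrix m, int r0, int c0, int nb, double[] block) {
            double[] s = m.storage();
            int ld = m.ld();
            for (int i = 0; i < nb; i++)
                System.arraycopy(s, (r0 + i) * ld + c0, block, i * nb, nb);
        }
    }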

4.3 Detailed Results for Each Platform

Figures 4 through 16 describe detailed performance measurements for each platform.

Athlon PC: The Athlon is a uniprocessor machine, as no current chipset supports a multiprocessor configuration. We nevertheless tested the Athlon under a multiprocessing setting to investigate whether the parallelized code would penalize performance and, if so, by how much. As mentioned earlier, the performance of the Athlon is quite impressive, with xmmsearch reaching nearly 300 MFlops. However, whereas C ATLAS incurs very little penalty for gemm(), we are impacted with nearly 45% overhead for Java, likely due to the subarray copying overhead. For blocked LU, we again only reach approximately 50% of C performance for similar reasons. Parallelization also penalizes performance, as we see in Table 1 and Figures 4-6, especially for master-worker parallelism. This shows that a portable library must judge the number of CPUs available, and employ the sequential version if only a single CPU is available.

[Figure 4: Athlon PC (600MHz x 1) Gemm Performance with Varying Matrix Sizes]
[Figure 5: Athlon PC (600MHz x 1) Gemm Performance with Varying Num. of Threads]
[Figure 6: Athlon PC (600MHz x 1) LU Performance with Varying Matrix Sizes]

Dual Pentium III PC: The dual processor should give us twice the performance, since there should be little sequential overhead on such a low-parallelism machine. However, this is not necessarily the case: here, the sequential speed outclasses all parallel versions. Close examination of the graph in Figure 8 reveals that the performance saturates at 2 threads. Another anomaly is seen in Figure 7, where the master-worker performance suddenly drops at matrix size 500. These suggest that, although IBM JDK 1.1.8 is a native-threads implementation, there are anomalies which preclude smooth parallel operation, especially on Linux. We plan to investigate the phenomenon on larger Pentium Xeon machines.

[Figure 7: Dual Pentium III PC (450MHz x 2) Gemm Performance with Varying Matrix Sizes]
[Figure 8: Dual Pentium III PC (450MHz x 2) Gemm Performance with Varying Num. of Threads]
[Figure 9: Dual Pentium III PC (450MHz x 2) LU Performance with Varying Matrix Sizes]

Enterprise 4000 and 10000: Here we observe the parallel speedups in Figures 10-14. For relatively low numbers of processors (the Enterprise 4000, and also the Enterprise 10000 with fewer than 20 threads), gemm() scales well for static fork-join, whereas master-worker seems to incur some overhead. However, when we increase the number of threads beyond 20 (Figure 14), we start observing a dropoff in scalability even for static decomposition. Still, we reach maximum performance when the number of threads reaches the number of processors on each machine (Figures 11 and 14). This suggests that, with better JIT compilers tuned for numerical performance, we could obtain significant performance with SMPs for Java numerical computing.

[Figure 10: Enterprise 4000 Gemm Performance with Varying Matrix Sizes]
[Figure 11: Enterprise 4000 Gemm Performance with Varying Num. of Threads]
[Figure 12: Enterprise 4000 LU Performance with Varying Matrix Sizes]
[Figure 13: Enterprise 10000 Gemm Performance with Varying Matrix Sizes]
[Figure 14: Enterprise 10000 Gemm Performance with Varying Num. of Threads]

Origin 2000: We observe behavior similar to the Enterprise servers on the Origin 2000 (Figures 15 and 16). At 16 processors, it does not seem to have reached the limits of scalable Java parallel execution. The difference, however, is that the fine-grained multithreading has very little scalability, and exhibits extremely poor performance compared to static fork-join. This could be attributed to differences in thread scheduling and mutual exclusion in the operating system, the JVM, or both.

[Figure 15: Origin 2000 Gemm Performance with Varying Matrix Sizes]
[Figure 16: Origin 2000 Gemm Performance with Varying Num. of Threads]

5. RELATED WORK

There has been a recent surge of efforts to implement numerical libraries in pure Java. A notable example is the Java Array Package by IBM [5], where a flexible Java array class is defined so that it can be run as pure Java code or subjected to optimization by a special optimizing compiler. Another is the work on developing numerical libraries in Java [8], which supports standard linear algebra operations such as BLAS, LU decomposition, QR decomposition, etc. JLAPACK by Blount and Chatterjee [12] and f2j are efforts to port LAPACK to Java. Pozo et al. have proposed SciMark as a benchmark for Java numerical computing (http://gams.nist.gov/javanumerics/). These efforts and others, especially those by the Java Grande Numerics Working Group, are quite significant in attempting to make Java applicable to hard-core numerical computing traditionally dominated by C and Fortran. However, although optimizing individual Java compilers has been investigated, especially in the context of the Java array package and IBM HPCJ [4], achieving performance-portable numerical code, and its implications especially for parallel machines, has not been well investigated.

Both PHiPAC [9] and ATLAS are numerical kernel generators for optimized blocked GEMM. Although they have very similar objectives, ATLAS aims to tune the BLAS kernel for each platform, while PHiPAC aims to be more general; however, the search on PHiPAC takes considerably longer, reportedly requiring several days.

MTL [15] and Blitz++ [16] are portable matrix and linear algebra libraries in C++. By extensive use of C++ templates, allowing template metaprogramming techniques, both permit extensive optimizations such as loop unrolling and some loop restructuring that compilers traditionally performed for C and Fortran. This allows performance nearing or matching that of Fortran-optimized code or even vendor libraries, just as with PHiPAC or ATLAS. However, they do not offer any support for "tuning" the parameters for loop unrolls and the like; thus, they would need to integrate the tuning techniques of, say, ATLAS to be truly portable. For Java this becomes more difficult due to the lack of compile-time template support; rather, source-code generation tools for Java such as EPP [17] or OpenJIT could be used in place of templates.

What we really propose is not merely to follow in the footsteps of Fortran and C and create "static" libraries, but to exploit the characteristics of Java, such as dynamic compilation, portable code, automatic downloading, security checks, etc., to aggressively optimize Java libraries automatically for each Java platform. This is more difficult with non-portable languages such as C or Fortran, where such infrastructures are not provided.

6. CONCLUSION

We have introduced AJaPACK, a self-tuning parallel linear algebra package for dense matrices in Java. AJaPACK tunes itself to the respective Java platform using the ATLAS technology, and utilizes the parallel threads natively provided by Java platforms running on SMPs. Benchmarks show that AJaPACK reaches approximately 50%-25% of the performance of the ATLAS-based C library, and with parallelization, exceeds that performance. It is substantially superior to the tested figures for f2j and JLAPACK/Harpoon on the same platforms. On the other hand, AJaPACK is still somewhat reliant on the quality of both the JVM and the JIT compiler, the former for multithreading performance and the latter for code quality. We see substantial overhead for the higher-level libraries; this is due to object-oriented data encapsulation and also the fact that it is difficult to alias Java subarrays, requiring element-wise array copies that incur a method invocation on every copy. We need to add considerable code to AJaPACK to cope with optimal copying of subarrays, and this is a subject of immediate future work. Thread parallelization scaled well on Solaris and the O2K, both server platforms; this was somewhat negated by JIT compilers that relatively underperform for numerical computing. On the other hand, on the dual-processor PC/Linux platform, although the JIT compiler was superb, it did not have proper multithreading support.

Future work includes identifying other overheads of the higher-level libraries. Also, for AJaPACK in particular, we need to improve the search strategy; for this purpose, we are looking at the newly released ATLAS 3.0. We would, however, like to generalize our framework by not relying directly on ATLAS for Java, but rather coming up with a more generalized toolkit for code transformation and generation based on OpenJIT. This will allow us to expand the domain of applicability to other numerical algorithms, as well as allowing other people to construct their own libraries. Ideally, coding a library plus a small additional effort, such as annotating the core elements and/or following a certain design pattern based on such a framework, will allow users to write a wide range of performance portable parallel libraries for Java.

Acknowledgments

We deeply thank Jack Dongarra, Clint Whaley, and Sid Chatterjee for the development of their respective libraries, which we have substantially employed as a basis, and for their valuable feedback on our work. We also thank the Department of Information Science of the University of Tokyo and the Electrotechnical Laboratory, especially Kenjiro Taura and Osamu Tatebe, who have made our usage of the E10000 and the Origin 2000 possible.

REFERENCES
[1] Satoshi Matsuoka and Shigeo Itou. Towards Performance Evaluation of High-Performance Computing on Multiple Java Platforms. Proceedings of the ICS'99 Workshop on Java for High-Performance Computing, Rhodes, Greece, July 1999. (http://www.csrd.uiuc.edu/fcs99/workshops.html)
[2] Sava Mintchev and Vladimir Getov. Automatic Binding of Native Scientific Libraries to Java. Proceedings of ISCOPE'97, Springer LNCS 1343, pp. 129-136.
[3] Paul Gray and Vaidy Sunderam. The IceT Environment for Parallel and Distributed Computing. Proceedings of ISCOPE'97, Springer LNCS 1343, pp. 275-282.
[4] J. E. Moreira, S. P. Midkiff, and M. Gupta. From Flop to Megaflops: Java for Technical Computing. Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC'98), 1998. IBM Research Report 21166.
[5] J. E. Moreira, S. P. Midkiff, M. Gupta, and R. Lawrence. High Performance Computing with the Array Package for Java: A Case Study using Data Mining. Proceedings of Supercomputing'99, Portland, Oregon, 1999 (CD-ROM proceedings).
[6] The J-Accelerator and HBC (High-Speed Bytecode Compiler). http://www.fujitsu.co.jp/hypertext/softinfo/product/use/jac/
[7] David M. Doolin, Jack Dongarra, and Keith Seymour. JLAPACK - Compiling LAPACK FORTRAN to Java. Technical Report ut-cs-98-390, University of Tennessee, 1998.
[8] Ronald F. Boisvert, Jack Dongarra, Roldan Pozo, Karin Remington, and G. W. Stewart. Developing Numerical Libraries in Java. Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing, 1998.
[9] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing Matrix Multiply Using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology. Proceedings of the ACM International Conference on Supercomputing, July 1997.
[10] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. Proceedings of IEEE/ACM Supercomputing'98, Nov. 1998.
[11] Proceedings of OOPSLA'99, Denver, Colorado, ACM Press, November 1999.
[12] Brian Blount and Sid Chatterjee. An Evaluation of Java for Numerical Computing. Proceedings of ISCOPE'98, Springer LNCS 1505, 1998, pp. 35-46.
[13] Satoshi Matsuoka et al. The OpenJIT Project. http://www.openjit.org/
[14] Fred G. Gustavson. Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms. IBM Journal of Research and Development, Vol. 41, No. 6, 1997, p. 737.
[15] Jeremy Siek and Andrew Lumsdaine. The Matrix Template Library: A Generic Programming Approach to High-Performance Numerical Linear Algebra. Proceedings of ISCOPE'98, Springer LNCS 1505, 1998, pp. 59-70.
[16] Todd Veldhuizen. Arrays in Blitz++. Proceedings of ISCOPE'98, Springer LNCS 1505, 1998, pp. 223-230.
[17] Yuuji Ichisugi and Yves Roudier. Extensible Java Preprocessor Kit and Tiny Data-Parallel Java. Proceedings of ISCOPE'97, Springer LNCS 1343, 1997, pp. 153-160.