Comparison of C and Java Performance in Finite Element Computations

Computers and Structures, 81, 2003, pp. 2401-2408

G.P. Nikishkov*
University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan

Yu.G. Nikishkov
Alventive Inc, 700 Galleria Parkway, Suite 400, Atlanta, Georgia 30339, USA

V.V. Savchenko
Hosei University, 3-7-2, Kajino-cho, Koganei-shi, Tokyo 184-8584, Japan

Summary

The Java programming language has several features that make it attractive for software development in computational engineering and sciences. One major obstacle to the use of Java in computationally intensive applications is its reputation as a slow language in comparison to Fortran or C. In this paper the performance of the developed Java finite element code is compared to that of the C code on the solution of three-dimensional elasticity problems. It is shown that simple code tuning can provide a Java/C performance ratio of 90% for the LDU solution of finite element equations. The PCG iterative solution algorithm is 1.5 times slower with the tuned Java code than with the C code. We recommend using JVM 1.2, since in many cases it is considerably faster in finite element computations than JVMs 1.3 and 1.4.

Keywords: Finite element method, Java, Performance, Tuning.

Introduction

Finite element codes were traditionally developed in Fortran [1] and recently in Fortran 90 [2]. During the last decade FEM developers started to use the C programming language. Some of them advocate using objects and the C++ language in order to handle complexity in finite element software [3-5]. Using data hiding, encapsulation and inheritance, which are typical of the object-oriented approach, allows creating more reliable finite element codes. Such codes are easier to modify and support. In 1995, Sun Microsystems released Java, a new multipurpose, object-oriented programming language [6]. Java language constructions are similar to those in C and C++. However, Java is simpler than C++ and possesses some attractive features that C++ does not have. Java has built-in garbage collection that prevents memory leaks: memory is freed automatically by detecting objects that are no longer in use. Java includes built-in data structures and methods for creating graphical user interfaces and for communicating with other computers over a network. Another advantage of Java is its portability.
A Java program is compiled into byte code, which can be interpreted by a Java Virtual Machine (JVM) [7]. Once a specific computer has a JVM designed for it, it can execute any Java program that has been compiled into byte code. A JVM is embedded in most popular Web browsers, so Java applets can be downloaded through the net and executed within any Java-enabled browser. Although Java has attractive features for producing portable, architecturally neutral code, it is not widely used in engineering computations. The slower speed of Java codes is usually considered its main disadvantage. Unlike natively compiled code, which is a series of microprocessor instructions, an interpreter must first translate the Java byte code into the

E-mail: [email protected], Tel: +81 242 37 2649, Fax: +81 242 37 2706


equivalent microprocessor instructions. Obviously, this translation leads to slower operation of Java code. However, a Just-In-Time (JIT) compiler can significantly speed up the execution of Java applications and applets. The JIT, which is an integral part of the JVM, takes the bytecodes and compiles them into native code before execution. Since Java is a dynamic language, the JIT compiles methods on a method-by-method basis just before they are called. If the same method is called many times, or if the method contains a loop with many repetitions, re-execution of the native code can dramatically change the performance of Java code. Java performance in numerical computing was considered in several publications [8-10]. It was shown that high-performance numerical codes can be developed in Java with suitable code development techniques. While papers [8-10] deal with general issues of numerical computing, this paper addresses Java performance and tuning in finite element computations. We present our experience in designing an efficient finite element code in Java. The code is developed in such a way that object-oriented features with considerable overhead are not used in performance-critical code sections. The performance of the developed Java finite element code is compared to that of an analogous C code on finite element solutions of three-dimensional elasticity problems using Intel computers. For running the Java code we employed Sun JVMs 1.1, 1.2, 1.3 and 1.4. It is shown that the use of different Java machines can lead to considerably different performance of the Java code. With proper coding and JVM selection, the Java finite element code can be almost as fast as the C code.

Design of the Java Finite Element Code

The object-oriented approach is widely used to create reusable, extensible, and reliable components, which can be used in later research and practical applications.
It should be noted that the key concept of "classes" in object-oriented programming might not always be ideal for computationally intensive sections of codes. Object creation and destruction in Java are expensive operations. Besides the time overhead of object creation and garbage collection, there is a significant space overhead for objects: the JVM adds to each allocated object internal information needed for the future garbage collection process and other information required by the Java language. The use of a large number of small objects can lead to considerable time and space overhead. As experiments show, a possible way to increase computing performance is to reduce object-creation expenses by using primitive types in place of objects. For a variable of a primitive type, the JVM allocates the variable directly on the stack (local variable) or within the memory used for the object (member variable). For such variables there is no object creation overhead and no garbage collection overhead. Another useful technique is object reuse. This technique can be illustrated by an example of an inner loop within a method doing a large number of iterations: object allocations should be moved outside the loop so that they are done only once. The object reuse technique may work especially well in combination with using primitive values in place of objects. It should be noted that efficiency-critical code sections are small in comparison to the whole finite element code. We propose to employ useful object-oriented features of the Java language in designing the whole finite element code and to find a compromise between using objects and providing high efficiency for the computationally intensive sections of the code. Keeping in mind the above efficiency considerations, we developed the Java finite element code JFEM for the solution of two-dimensional and three-dimensional elasticity problems.
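The two tuning techniques above can be sketched in a minimal Java illustration (class and method names are ours, not taken from JFEM):

```java
public class ObjectReuseDemo {
    // Untuned style: a scratch array is allocated on every iteration,
    // creating allocation and garbage-collection overhead inside the hot loop.
    static double sumNaive(double[][] rows) {
        double s = 0;
        for (double[] row : rows) {
            double[] scratch = new double[row.length]; // allocation per iteration
            for (int i = 0; i < row.length; i++) scratch[i] = 2 * row[i];
            for (double v : scratch) s += v;
        }
        return s;
    }

    // Tuned style: the scratch array is allocated once and reused,
    // and only primitive types appear inside the loop body.
    static double sumReuse(double[][] rows, double[] scratch) {
        double s = 0;
        for (double[] row : rows) {
            for (int i = 0; i < row.length; i++) scratch[i] = 2 * row[i];
            for (int i = 0; i < row.length; i++) s += scratch[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2, 3}, {4, 5, 6}};
        double[] scratch = new double[3]; // allocated once, outside the loop
        System.out.println(sumNaive(rows) == sumReuse(rows, scratch)); // true
    }
}
```

Both methods compute the same result; only the allocation pattern differs.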
The class hierarchy of the JFEM code is presented in Fig. 1. The class design allows extensibility of the code. Abstract classes are used for the definition of classes for nodes, finite elements, material models and equation solvers. The abstract class defines the overall structure of the hierarchy. It contains the data members and

member methods. Some methods can be implemented in the abstract class; other methods are implemented in classes lower in the hierarchy. For example, the abstract class Element contains methods for data manipulations (connectivity data and nodal data) that are common to all element types. Methods for computing shape functions, derivatives of shape functions, the element stiffness matrix, the element load vector etc. are implemented in classes Element2D8N and Element3D20N for the two-dimensional 8-noded element and the three-dimensional 20-noded element. It is worth noting that we try to restrict the use of objects in computationally intensive parts of the finite element procedure. Class Node is used during input of the nodal data for the finite element model. During calculation of the element stiffness matrices and during the assembly and solution of the equation system, only primitive types are used in operations with nodal data.

Assembly and Solution of Equation System

For linear problems, the main fraction of computing time is related to calculation of element stiffness matrices, assembly of the equation system and its solution. Here we present the algorithm of element stiffness matrix computation and consider two algorithms of equation solution: the LDU decomposition and preconditioned conjugate gradient (PCG) methods.

Computation of element stiffness matrices

The usual relation for calculation of the element stiffness matrix [k] has the following appearance:

[k] = \int_V [B]^T [E] [B] \, dV    (1)

where [B] is the displacement differentiation matrix, [E] is the elasticity matrix and V is the element volume. Integration is performed numerically with the use of a Gaussian rule. We calculate element stiffness matrices in a more efficient way, using an explicit expression for the integrated function. After performing the multiplications in relation (1), the coefficients of the element stiffness matrix are equal to:

k_{mn}^{ii} = \int_V \left[ (\lambda + 2\mu) \frac{\partial N_m}{\partial x_i} \frac{\partial N_n}{\partial x_i} + \mu \left( \frac{\partial N_m}{\partial x_{i+1}} \frac{\partial N_n}{\partial x_{i+1}} + \frac{\partial N_m}{\partial x_{i+2}} \frac{\partial N_n}{\partial x_{i+2}} \right) \right] dV

k_{mn}^{ij} = \int_V \left( \lambda \frac{\partial N_m}{\partial x_i} \frac{\partial N_n}{\partial x_j} + \mu \frac{\partial N_m}{\partial x_j} \frac{\partial N_n}{\partial x_i} \right) dV    (2)

Here m, n are local node numbers; i, j are indices related to the coordinate axes (x1, x2, x3); N_m is the shape function for node m; λ and μ are the Lame elastic constants. In our codes, expressions (2) are calculated using a special 14-point integration rule suitable for the 20-noded brick-type element. Since the element stiffness matrix is symmetric, only one symmetric half of the matrix and the diagonal coefficients are computed and then used for assembly of the global stiffness matrix.
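For reference, the Lame constants λ and μ appearing in equation (2) follow from Young's modulus E and Poisson's ratio ν through the standard elasticity relations (this conversion is not stated in the paper but is standard material-model bookkeeping):

```java
public class LameConstants {
    // lambda = E*nu / ((1 + nu)*(1 - 2*nu))
    static double lambda(double e, double nu) {
        return e * nu / ((1 + nu) * (1 - 2 * nu));
    }

    // mu (shear modulus) = E / (2*(1 + nu))
    static double mu(double e, double nu) {
        return e / (2 * (1 + nu));
    }

    public static void main(String[] args) {
        // For E = 1 and nu = 0.25 both constants equal 0.4
        System.out.println(lambda(1.0, 0.25) + " " + mu(1.0, 0.25));
    }
}
```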


LDU solution of equation system

A symmetric matrix [A] = [a_ij] of order n is stored in profile form by columns. Each column of the matrix starts from the first top nonzero element and ends at the diagonal element. The matrix is represented by two arrays: a one-dimensional double array a containing the matrix elements, and an index array pcol. Assuming that array indices begin from one, the ith element of pcol contains the index in the array a of the first element of the ith column, minus one. The length of the ith column is given by pcol[i+1]-pcol[i]. The length of the array a is equal to pcol[n+1]. The location (row number) of the first nonzero element in the ith column of the matrix [A] is given by the function FN(i): FN(i) = i - (pcol[i+1] - pcol[i]) + 1.
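The profile indexing rules can be checked on a tiny example. The sketch below uses zero-based Java arrays, so the one-based formulas in the text become FN(j) = j - (pcol[j+1] - pcol[j]) + 1 and a_ij → a[i + pcol[j+1] - j - 1] (the matrix contents are a made-up example):

```java
public class ProfileStorageDemo {
    // Symmetric matrix stored by columns, each column from its first
    // nonzero row down to the diagonal:
    //   [4 1 0]
    //   [1 3 1]
    //   [0 1 2]
    static final double[] a = {4, 1, 3, 1, 2};
    static final int[] pcol = {0, 1, 3, 5}; // pcol[j] = start of column j in a

    // Row of the first stored element of column j (zero-based).
    static int fn(int j) { return j - (pcol[j + 1] - pcol[j]) + 1; }

    // Fetch A[i][j] for i <= j, assuming the element lies within the profile.
    static double get(int i, int j) { return a[i + pcol[j + 1] - j - 1]; }

    public static void main(String[] args) {
        System.out.println(fn(2));      // 1: column 2 starts at row 1
        System.out.println(get(1, 2));  // 1.0: element A[1][2]
    }
}
```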

The following correspondence relation can easily be obtained for the transition from two-index matrix notation to one-dimensional array notation: aij → a[i+pcol[j+1]-j]. Note that this relation is valid only for the elements of the matrix [A] that are stored in the array a. Solution of a symmetric equation system consists of [U]T[D][U] decomposition of the system matrix followed by forward reduction and backsubstitution for the right-hand side. The [U]T[D][U] decomposition takes the overwhelming majority of the computing time. The right-looking algorithm of the decomposition can be presented as the following pseudo-code:

do j=2,n
  Cdivt(j)
  do i=j,n
    Cmod(j,i)
  end do
end do
do j=2,n
  Cdiv(j)
end do

Cdivt(j) =
  do i=FN(j),j-1
    ti = aij/aii
  end do

Cmod(j,i) =
  do k=max(FN(j),FN(i)),j-1
    aji = aji - tk*aki
  end do

Cdiv(j) =
  do i=FN(j),j-1
    aij = aij/aii
  end do
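A compact Java realization of the untuned algorithm above may look as follows (a sketch in zero-based indexing; this is our reconstruction for illustration, not the JFEM source):

```java
// Profile (skyline) U^T D U decomposition and solution, zero-based indexing.
// a: column-stored profile, pcol: column start indices.
public class ProfileLDU {
    static int fn(int[] pcol, int j) { return j - (pcol[j + 1] - pcol[j]) + 1; }

    // Decompose A = U^T D U in place: off-diagonal entries of a become U,
    // diagonal entries become D.
    static void decompose(double[] a, int[] pcol, int n) {
        double[] g = new double[n]; // unscaled column entries D[i]*U[i][j]
        for (int j = 1; j < n; j++) {
            int fj = fn(pcol, j);
            double d = a[pcol[j + 1] - 1]; // A[j][j]
            for (int i = fj; i < j; i++) {
                int fi = fn(pcol, i);
                double s = a[i + pcol[j + 1] - j - 1]; // A[i][j]
                for (int k = Math.max(fi, fj); k < i; k++)
                    s -= a[k + pcol[i + 1] - i - 1] * g[k]; // Cmod step
                g[i] = s;
                double u = s / a[pcol[i + 1] - 1]; // Cdiv step: divide by D[i]
                a[i + pcol[j + 1] - j - 1] = u;
                d -= u * s;
            }
            a[pcol[j + 1] - 1] = d; // D[j]
        }
    }

    // Forward reduction and backsubstitution; overwrites b with the solution.
    static void solve(double[] a, int[] pcol, double[] b, int n) {
        for (int j = 1; j < n; j++)
            for (int i = fn(pcol, j); i < j; i++)
                b[j] -= a[i + pcol[j + 1] - j - 1] * b[i]; // forward reduction
        for (int j = 0; j < n; j++) b[j] /= a[pcol[j + 1] - 1]; // scale by D
        for (int j = n - 1; j > 0; j--)
            for (int i = fn(pcol, j); i < j; i++)
                b[i] -= a[i + pcol[j + 1] - j - 1] * b[j]; // backsubstitution
    }
}
```

For the 3x3 profile matrix [[4,1,0],[1,3,1],[0,1,2]] and right-hand side (6,10,8), the routines return the solution (1,2,3).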

The do loop that takes most of the LDU decomposition time is contained in the procedure Cmod(j,i): one column of the matrix is used to modify another column inside the inner do loop. Two operands must be loaded from memory in order to perform one floating-point multiply-add (FMA) operation. Data loads can be economized by outer loop unrolling. After unrolling the two outer loops, the tuned version of the LDU decomposition is as follows:


do j=1,n,d
  Bdivt(j,d)
  do i=j+d,n,d
    BBmod(j,i,d)
  end do
end do
do j=2,n
  Cdiv(j)
end do

Bdivt(k,d) =
  do j=k,k+d-1
    do i=FN(k),j-1
      tij = aij/aii
    end do
    do i=j,k+d-1
      do l=max(FN(j),FN(i)),j-1
        aji = aji - tlj*ali
      end do
    end do
  end do

BBmod(j,i,d=2) =
  do k=max(FN(j),FN(i)),j-1
    aji = aji - tkj*aki
    aj+1,i = aj+1,i - tk,j+1*aki
    aj,i+1 = aj,i+1 - tkj*ak,i+1
    aj+1,i+1 = aj+1,i+1 - tk,j+1*ak,i+1
  end do
  if j ≥ FN(i) then
    aj+1,i = aj+1,i - tj,j+1*aji
    aj+1,i+1 = aj+1,i+1 - tj,j+1*aj,i+1
  end if

The method BBmod(j,i,d) modifies a column block starting at column i using a column block that starts at column j and contains d columns. The pseudo-code above shows the block size d = 2 for brevity; in the three-dimensional problems solved here, the natural choice for the block size is d = 3. It is assumed that the columns in a block start at the same row of the matrix [A]. This is fulfilled automatically if the column block contains the columns related to one node of the finite element mesh.

PCG solution of equation system

The preconditioned conjugate gradient (PCG) method is an iterative procedure that does not alter the equation matrix. Because of this, only the nonzero coefficients of the finite element global stiffness matrix need be stored. The sparse structure of the matrix should be taken into account in matrix-vector multiplications. We use the sparse-row format for the equation matrix. In this format, all information about the matrix is contained in three arrays: a – an array of doubles containing the non-zero elements of the matrix, row by row; col – an array of column indices for the non-zero elements of the array a; prow – an array of indices of the starting elements of the matrix rows in the array a, again assuming that indices start from one.


Preconditioning techniques are not the subject of this work; simple diagonal preconditioning is used in our PCG solution procedure for finite element equations. The most time-consuming operation in the PCG solution procedure is the sparse matrix-vector product inside the iteration loop. The matrix-vector multiplication {y} = [A]{x} for a matrix [A] in sparse-row format is performed as follows:

do j=1,n
  y[j] = 0
  do i=prow[j],prow[j+1]-1
    y[j] = y[j] + a[i]*x[col[i]]
  end do
end do
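In Java the sparse-row product above translates directly (zero-based indices; the matrix contents are a made-up example):

```java
public class SparseRowMatVec {
    // y = A*x for A in sparse-row format (a: values, col: column indices,
    // prow: row start indices, with prow[n] = number of nonzeros).
    static double[] multiply(double[] a, int[] col, int[] prow, double[] x) {
        int n = prow.length - 1;
        double[] y = new double[n];
        for (int j = 0; j < n; j++) {
            double s = 0;
            for (int i = prow[j]; i < prow[j + 1]; i++)
                s += a[i] * x[col[i]]; // non-consecutive access through col[i]
            y[j] = s;
        }
        return y;
    }

    public static void main(String[] args) {
        //     [1 0 2]
        // A = [0 3 0],  x = (1,1,1)  =>  y = (3,3,9)
        //     [4 0 5]
        double[] a = {1, 2, 3, 4, 5};
        int[] col = {0, 2, 1, 0, 2};
        int[] prow = {0, 2, 3, 5};
        double[] y = multiply(a, col, prow, new double[]{1, 1, 1});
        System.out.println(y[0] + " " + y[1] + " " + y[2]); // 3.0 3.0 9.0
    }
}
```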

Previous experience with tuning C codes showed that little can be done to speed up the sparse matrix-vector product. To our surprise, simple inner-loop unrolling may improve Java code performance on certain Java machines:

do j=1,n
  y[j] = 0
  do i=prow[j],prow[j+1]-1,3
    y[j] = y[j] + a[i]*x[col[i]] + a[i+1]*x[col[i+1]] + a[i+2]*x[col[i+2]]
  end do
end do
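Note that the unrolled inner loop implicitly assumes the number of nonzeros per row is a multiple of three, which holds for the three-dimensional problems considered here since every node carries three degreeses of freedom grouped together in the matrix. A Java sketch of the unrolled kernel (our illustration, with made-up matrix contents):

```java
public class UnrolledMatVec {
    // y = A*x in sparse-row format with the inner loop unrolled by three.
    // Requires every row length to be a multiple of 3.
    static double[] multiply3(double[] a, int[] col, int[] prow, double[] x) {
        int n = prow.length - 1;
        double[] y = new double[n];
        for (int j = 0; j < n; j++) {
            double s = 0;
            for (int i = prow[j]; i < prow[j + 1]; i += 3)
                s += a[i] * x[col[i]]
                   + a[i + 1] * x[col[i + 1]]
                   + a[i + 2] * x[col[i + 2]];
            y[j] = s;
        }
        return y;
    }

    public static void main(String[] args) {
        // 2x6 example: each row holds 6 nonzeros (a multiple of 3).
        double[] a = {1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1};
        int[] col = {0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5};
        int[] prow = {0, 6, 12};
        double[] y = multiply3(a, col, prow, new double[]{1, 1, 1, 1, 1, 1});
        System.out.println(y[0] + " " + y[1]); // 21.0 21.0
    }
}
```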

Experiments with unrolling the outer loop led to slower calculations. The speed-up of the sparse matrix-vector product after inner-loop unrolling, and the lack of it after outer-loop unrolling, can be explained by internal compilation features of the Java compilers.

Experimental Results

We compared our C and Java implementations of the finite element method on a series of three-dimensional elasticity problems. The test problem is simple tension of an elastic cube. Three-dimensional meshes of E × E × E brick-type 20-noded elements are used for C-Java benchmarking. The value of E varies from 4 to 12, thus providing meshes from 64 elements (1275 degrees of freedom) to 1728 elements (24843 degrees of freedom). The mesh E = 8 is shown in Fig. 2. Two computer systems were used for running the C and Java finite element codes: 1) a notebook computer with an Intel Pentium III 1.0GHz processor; 2) a desktop computer with an Intel Pentium 4 1.8GHz processor. Both computers run the Microsoft Windows XP Professional operating system. The C code was compiled using Microsoft Visual C++ 6.0 with maximum speed optimization. The Java code was compiled using the javac compiler developed by Sun Microsystems with the optimization option -O and run using a Java virtual machine (JVM). Four JVMs were used: 1) JVM 1.1.8; 2) JVM 1.2.2-011 with the Symantec Just-In-Time compiler; 3) Java HotSpot Client VM 1.3.1_02-b02; 4) Java HotSpot Client VM 1.4.0-b92. Results for assembly of the global stiffness matrix in the profile format and for the LDU solution of the equation system are presented in Fig. 3. Since it is difficult to determine the

megaflops rate for the assembly phase, we present the C/Java performance comparison as ratios of the computing time used by the C code to the computing time used by the Java code. For the mobile system (Pentium III 1.0GHz), JVM 1.2 provides performance around 90% of the C performance; JVMs 1.1, 1.3 and 1.4 show lower performance. For the desktop system (Pentium 4 1.8GHz), assembly of the matrix in the profile format is faster with JVM 1.2 than with the C code, while the performance of JVMs 1.1, 1.3 and 1.4 is around 75% of the C code performance. Fig. 4 shows megaflops rates for the LDU solution of the equation system stored in the profile format. Surprisingly, the untuned version of the Java code produces approximately the same speed of calculation not only for all JVMs but also for both computer systems (JVM 1.4 demonstrates a lower megaflops rate on the mobile system). The C code megaflops rates are quite different for the mobile and desktop computers. Java/C performance of the untuned code is roughly 80% for the mobile system and only 40% for the desktop system. Tuning of the C and Java codes changes the performance ratios dramatically. JVM 1.2 shows computing rates that are very close to the C rates: for both computer systems the Java/C megaflops ratio is around 90%. JVMs 1.3 and 1.4 produce lower speeds for the tuned LDU code, and JVM 1.1 provides the worst results. Significant performance drops are observed for the tuned LDU code when using JVM 1.3. Such phenomena can be explained by data block conflicts in cache memory for certain profiles of the equation system. Fig. 5 presents a comparison of C and Java speeds for the assembly of the global stiffness matrix in the sparse-row format. JVM 1.2 produces the best speeds both for the desktop and for the mobile systems. For the desktop system, the speed of the Java code run with JVM 1.2 is higher than the C code speed. The lowest speeds are shown by JVMs 1.3 and 1.4 (40-60% of the C speed).
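One caveat in C-versus-Java timing comparisons is JIT warm-up: a Java method runs interpreted until the JIT has compiled it, so a kernel should be executed several times before measurement begins. A minimal harness of the kind one might use is sketched below (our illustration, not the instrumentation used in the paper; the dot-product kernel is a stand-in for a FEM kernel):

```java
public class Benchmark {
    // Kernel to be measured: dot product as a stand-in for a FEM kernel.
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    public static void main(String[] args) {
        int n = 100000;
        double[] x = new double[n], y = new double[n];
        java.util.Arrays.fill(x, 1.0);
        java.util.Arrays.fill(y, 2.0);

        // Warm-up so the JIT compiles dot() before timing starts.
        double sink = 0;
        for (int i = 0; i < 100; i++) sink += dot(x, y);

        int reps = 1000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += dot(x, y);
        long t1 = System.nanoTime();

        // 2n flops per call (one multiply and one add per element);
        // printing sink prevents the JIT from eliminating the loop.
        double mflops = 2.0 * n * reps / (t1 - t0) * 1e3;
        System.out.println("MFlops: " + mflops + " (sink=" + sink + ")");
    }
}
```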
Megaflops rates for the PCG solution of the equation system are depicted in Fig. 6. For the mobile system, JVMs 1.1 and 1.2 demonstrate rates around 100 Mflops for both the untuned and the tuned PCG procedure. The megaflops rate of JVMs 1.3 and 1.4 for the untuned code is considerably lower, and tuning has no effect on the speed of the Java code for JVM 1.4. The C code speed is 100-120 Mflops in both tuned and untuned versions. For the desktop system and the untuned PCG solution, Java is about 2.5 times slower than C. Tuning does not affect the speed of the C code. However, simple code tuning with unrolling only the inner loop of the sparse matrix-vector product improves Java performance considerably for JVMs 1.2 and 1.3, making the Java speed equal to 2/3 of the C speed. Running the tuned Java code on JVM 1.4 leads to a paradox: the speed becomes lower than the speed of the untuned Java code. The data presented in Figs 3-6 show performance results for three types of computations: 1) calculation of element stiffness matrices and assembly of the global stiffness matrix: mostly computations with scalar variables; 2) LDU solution of the equation system: mostly a triple loop of multiply-add operations on columns with consecutive access to operands; 3) PCG solution of the equation system: mostly a double loop of multiply-add operations with non-consecutive access to operands. The experimental results show that the performance of Java is on par with C (or slightly lower) for computations involving mostly scalar variables. For multiply-add operations with consecutive access to array elements inside the triple loop, the Java performance can reach 90% of the C performance after tuning. For multiply-add operations with non-consecutive access to array elements inside double loops, the Java performance is 65-90% of the C performance depending on the computer system. It should be noted that these conclusions hold when the proper Java machine (JVM 1.2) is chosen.
While it is reasonable to use the latest Java SDK (Software Development Kit) for most purposes, we also recommend installing the Java Runtime Environment JRE 1.2 and employing it for performing large finite element analyses.


Conclusion

We have designed an object-oriented version of a three-dimensional finite element code for elasticity problems and implemented it in the Java programming language. Special attention has been devoted to the efficient implementation of computationally intensive sections of the code. Only primitive-type variables and one-dimensional arrays are used in the computation of element stiffness matrices and in the solution of the equation system, thus excluding the overhead associated with fully object-oriented programming. The performance of the Java code has been compared to the performance of a similar C code on the solution of three-dimensional elasticity problems using computers with Intel Pentium processors: a Pentium III 1.0 GHz mobile and a Pentium 4 1.8 GHz. Java Virtual Machines 1.1, 1.2, 1.3 and 1.4 were used for running the Java code. The experimental results show that the performance of the Java finite element code is roughly equal to the performance of the C code for calculation of element stiffness matrices and assembly of the global equation system when using JVM 1.2. JVMs 1.3 and 1.4 provide considerably lower performance than the two other JVMs for the assembly of the global stiffness matrix in the sparse-row format. Untuned Java code shows relatively low performance for the LDU solution of the equation system in the profile format. However, tuning affects the speed of the Java code more than the speed of the C code: the performance of the tuned Java code running on JVM 1.2 is about 90% of the C code performance. The PCG iterative solution of the equation system is 1.5 times slower with the tuned Java code than with the C code. It is worth mentioning that the later JVMs 1.3 and 1.4 in many cases provide noticeably lower performance than JVM 1.2. Our general conclusion is that the Java language is quite suitable for the development of finite element software.
With the use of proper coding and tuning the performance of the Java code is comparable to the performance of the corresponding tuned C code. We recommend using JVM 1.2 for large finite element analyses.



References

1. Bathe KJ. Finite Element Procedures. Englewood Cliffs: Prentice-Hall, 1996.
2. Smith IM, Griffiths DV. Programming the Finite Element Method. Chichester: Wiley, 1998.
3. Mackie RI. Using objects to handle complexity in finite element software. Engineering with Computers 1997; 13: 99-111.
4. Mackie RI. Object-Oriented Methods and Finite Element Analysis. Stirling: Saxe-Coburg, 2001.
5. Dubois-Pelerin Y, Pegon P. Object-oriented programming in nonlinear finite element analysis. Computers and Structures 1998; 67: 225-241.
6. Gosling J, Joy B, Steele G. The Java Language Specification. Reading, MA: Addison-Wesley, 1996.
7. Lindholm T, Yellin F. The Java Virtual Machine Specification. Reading, MA: Addison-Wesley, 1996.
8. Boisvert RF, Moreira J, Philippsen M, Pozo R. Java and numerical computing. Computing in Science and Engineering 2001; March/April: 18-24.
9. Kruger D. Performance tuning in Java. Java Developers Journal 2002; August: 44-52.
10. Moreira JE, Midkiff SP, Gupta M, Artigas PV, Snir M, Lawrence RD. Java programming for high-performance numerical computing. IBM Systems Journal 2000; 39: 21-56.



class JFem – main class controlling the FEM solution
interface CNST – collection of constants used during solution
class Element – abstract finite element
  class Element2D8N – 2D quadrilateral 8-noded element
  class Element3D20N – 3D hexahedral 20-noded element
class FiniteElementModel – description of the finite element model
class LoadVectorAssembler – boundary conditions for the finite element model
class Material – abstract material model
  class ElasticMaterial – material model for elasticity problems
class DataFileReader – reading the data file
class Solver – abstract finite element solver
  class ProfileLDUSolver – solution of the finite element equation system by the direct LDU method with profile storage of the matrix
  class SparseRowPCGSolver – solution of the finite element equation system by the preconditioned conjugate gradient method
class Node – abstract node of the finite element model
  class Node2D – node of the 2D finite element model
  class Node3D – node of the 3D finite element model

Fig. 1. The class hierarchy of the Java finite element code JFEM

Fig. 2. Finite element mesh consisting of 20-noded brick-type elements.


[Figure 3: two panels plotting tC/tJava against number of DOF (10^3, 0 to 25). Top panel "Assembly of profile system, Pentium III 1.0GHz mobile" and bottom panel "Assembly of profile system, Pentium 4 1.8GHz", each with curves for JVM 1.1, 1.2, 1.3 and 1.4.]

Fig. 3. Ratio of the C code time to Java code time for the assembly of the global stiffness matrix in the profile format.


[Figure 4: two panels plotting MFlops against number of DOF (10^3, 0 to 25). Top panel "LDU solution, Pentium III 1.0 GHz mobile" and bottom panel "LDU solution, Pentium 4 1.8GHz", each with curves for MS C and MS C Tuned, tuned JVM 1.1-1.4, and untuned JVM 1.1-1.4.]

Fig. 4. Megaflops rate for the LDU solution of the equation system in the profile format.


[Figure 5: two panels plotting tC/tJava against number of DOF (10^3, 0 to 25). Top panel "Assembly of sparse row system, Pentium III 1.0GHz mobile" and bottom panel "Assembly of sparse row system, Pentium 4 1.8GHz", each with curves for JVM 1.1, 1.2, 1.3 and 1.4.]

Fig. 5. Ratio of the C code time to Java code time for the assembly of the global stiffness matrix in the sparse row format.


[Figure 6: two panels plotting MFlops against number of DOF (10^3, 0 to 25). Top panel "PCG solution, Pentium III 1.0GHz mobile" and bottom panel "PCG solution, Pentium 4 1.8GHz", each with curves for MS C and MS C Tuned and for tuned and untuned JVM 1.1-1.4.]

Fig. 6. Megaflops rate for the PCG solution of the equation system in the sparse row format.
