Proceedings of the 10th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2010 27–30 June 2010.

Java/C++/Fortran cross-platform performance and consistency issues for large simulation codes

Satya Baboolal¹

¹ School of Computer Science, University of KwaZulu-Natal, Durban, South Africa.
email: [email protected]

Abstract

The suitability of different programming languages for scientific computing has been the subject of many debates and studies. Java is a popular multi-purpose programming language, and it is not surprising that many recent studies have examined its performance in various application areas, in particular in the scientific computing arena. One aspect of interest here is Java's documented unpredictable behaviour in respect of floating point computations. Since the Java virtual machine's floating point behaviour has been designed to adhere very strictly to the IEEE standard for binary floating point systems, with the intention of achieving portability and consistency, we find that this restriction can lead to inconsistent behaviours across various platforms. Moreover, such behaviour is not confined to Java alone. The other aspect which has interested the scientific computing community has been speed benchmarks for the different languages employed in this arena. We attempt to gain useful insight into these two performance aspects by porting into Java, C/C++ and Fortran 77 a code employing a two-dimensional high-resolution finite-difference scheme for simulating nonlinear wave propagation in a multi-fluid plasma under electromagnetic fields, and comparing the relative performances of these implementations on 32-bit and 64-bit PC platforms.

Key words: Java floating point, IEEE floating point, numerical simulation

1 Introduction

In this report we are concerned, in the first instance, with examining the behaviour of Java as a candidate implementation of the IEEE 754 (1985) and revised IEEE 854 (1987, 2008) standards [1, 2, 3] for binary floating-point representation and computation, so we briefly review some salient features of these standards. Floating-point numbers are in general normalized before storage and can be represented in one of the forms [1, 2]:

single precision (32 bits): s E F ≡ s eeee eeee fff ffff ffff ffff ffff ffff, with value $(-1)^s \times 1.F \times 2^{E-127}$;

double precision (64 bits): s E F ≡ s eee eeee eeee ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff, with value $(-1)^s \times 1.F \times 2^{E-1023}$.

In the single precision case the sign bit (bit 31) is s = 0 for a positive number and s = 1 for a negative number, the binary fraction F = fff...f occupies bits 0–22, and the exponent E = eeeeeeee, which is biased (by 127) to avoid storing negative values, occupies bits 23–30, so that $E_{\max} = 11111111_2 = 255$ and $E_{\min} = 0$. These extreme values of E are reserved for special conditions, so that the allowable range of E is 00000001 ... 11111110 (1 ... 254), giving an exponent range $E - 127 = -126 \ldots +127$. Similar considerations apply to the double precision case. In addition, the IEEE standard includes provisions for extended-precision numbers, for handling denormal numbers (i.e. numbers obtained from calculations whose results fall, on the positive side, between zero and the smallest non-zero normalized number representable in the floating point system), for infinities and for not-a-numbers (NANs). Moreover, the default behaviour specified by the IEEE standard is to allow computations to continue in spite of the occurrence of these special values, by masking the corresponding exceptions. This may or may not be desirable in every situation.

Most present-day processors have floating point units (FPUs) which implement the IEEE standard by default. In particular, on Intel x86 processors [4, 5], floating point behaviour can be controlled by setting the floating point control word register (FPCSR), a special 16-bit register. The current control word in the FPCSR controls the arithmetic accuracy employed in calculating intermediate results in the 80-bit FPU general registers, how rounding is done when register contents are manipulated and stored in memory, and how denormals are handled, amongst other effects. Corresponding to the FPCSR is a floating point status-word register, which the CPU sets depending on the result of the last executed floating point instruction. The x87 instruction set includes the FLDCW (load control word) and FSTCW (store control word) machine instructions for manipulating the FPCSR. For example, the instruction FLDCW 639 will set it to the hexadecimal value 027F, allowing for 53-bit mantissa precision and rounding to the nearest floating point number, and the instruction FLDCW 895 will set it to the hexadecimal value 037F, allowing for 64-bit mantissa precision and rounding to the nearest floating point number. In addition, more convenient functions may be available in some operating systems with particular language bindings. On more recent Intel processors, additional floating point operations [4, 5] can be carried out on separate processing units within the CPU as streaming pipelined instructions (MMX, SSE, SSE2, SSE3, ...), which can be enabled by compiler switches. These instructions are controlled by a separate combined control-status word register, MXCSR, a 32-bit register which allows one to set flags to handle denormal processing, rounding and so on. In this paper we shall deal essentially with the former x87 instructions, since they are the more accurate: the latter employ the default IEEE register precision, a crucial aspect for our codes.
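As an illustration (our own sketch, not taken from the simulation codes), the following C++ fragment reads and loads the x87 control word with the FSTCW/FLDCW instructions and also reads MXCSR for the SSE units; it assumes a GCC-compatible compiler (e.g. MinGW) targeting x86, with SSE support enabled for the _mm_getcsr call:

    #include <cstdio>
    #include <xmmintrin.h>   // _mm_getcsr for the SSE MXCSR register

    // Read the current x87 control word (FSTCW).
    static unsigned short get_x87_control_word() {
        unsigned short cw;
        __asm__ __volatile__("fstcw %0" : "=m"(cw));
        return cw;
    }

    // Load a new x87 control word (FLDCW).
    static void set_x87_control_word(unsigned short cw) {
        __asm__ __volatile__("fldcw %0" : : "m"(cw));
    }

    int main() {
        printf("x87 control word (original): 0x%04x\n", get_x87_control_word());

        // 0x037F (decimal 895): 64-bit mantissa precision, round to nearest,
        // all floating point exceptions masked -- the setting discussed above.
        set_x87_control_word(0x037F);
        printf("x87 control word (new):      0x%04x\n", get_x87_control_word());

        // The SSE/SSE2 units are controlled separately through MXCSR.
        printf("MXCSR: 0x%08x\n", _mm_getcsr());
        return 0;
    }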

2 Code for simulating nonlinear waves in electromagnetic plasmas

We handle performance issues here by multi-language implementations of a code for simulating electromagnetic shock-like structures in a plasma fluid consisting of singly charged ions and electrons subject to the electromagnetic field. A complete description of the model and algorithm used is given elsewhere [6]. It suffices to mention that it employs a two-dimensional high-resolution Riemann-solver-free central difference scheme on staggered grids to numerically solve model equations cast into the first-order hyperbolic PDE system form [6]:
\[
\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} + \frac{\partial G(U)}{\partial y} = S(U) \qquad (1)
\]
In the above U(x, y, z, t) is the unknown (m-dimensional) vector, F(U) is the x-flux vector, G(U) is the y-flux vector and S(U) is a source vector function, with x and y the only two spatial coordinates considered (there being no variation in the z direction) and t the time coordinate. To numerically integrate this system, a uniform rectangular grid with spacings ∆x and ∆y in the respective X and Y directions is used to obtain [6],
\[
\begin{aligned}
\bar{U}^{n+1}_{j+\frac{1}{2},k+\frac{1}{2}} = {} & \frac{1}{4}\left[\bar{U}^{n}_{j,k} + \bar{U}^{n}_{j,k+1} + \bar{U}^{n}_{j+1,k} + \bar{U}^{n}_{j+1,k+1}\right] \\
& + \frac{1}{16}\left[U_{x\,j,k} - U_{x\,j+1,k} + U_{x\,j,k+1} - U_{x\,j+1,k+1}\right] \\
& + \frac{1}{16}\left[U_{y\,j,k} - U_{y\,j,k+1} + U_{y\,j+1,k} - U_{y\,j+1,k+1}\right] \\
& - \frac{\Delta t}{2\Delta x}\left[F^{n+\frac{1}{2}}_{j+1,k} - F^{n+\frac{1}{2}}_{j,k} + F^{n+\frac{1}{2}}_{j+1,k+1} - F^{n+\frac{1}{2}}_{j,k+1}\right] \\
& - \frac{\Delta t}{2\Delta y}\left[G^{n+\frac{1}{2}}_{j,k+1} - G^{n+\frac{1}{2}}_{j,k} + G^{n+\frac{1}{2}}_{j+1,k+1} - G^{n+\frac{1}{2}}_{j+1,k}\right] \\
& + \frac{\Delta t}{4}\left[S^{n+\frac{1}{2}}_{j+1,k+1} + S^{n+\frac{1}{2}}_{j+1,k} + S^{n+\frac{1}{2}}_{j,k+1} + S^{n+\frac{1}{2}}_{j,k}\right]. \qquad (2)
\end{aligned}
\]

This scheme advances the cell average vectors $\bar{U}^{n}_{j,k}$, where j, k are the spatial discretization indices and n is the time level index, with time spacing ∆t. It is used in conjunction with the derivative array approximations ($U_x$ and $U_y$) and suitable boundary conditions. For more details consult [6].
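For orientation, a schematic C++ sketch of one such staggered update for a single scalar component is given below; the array names, calling convention and loop bounds are our own illustrative assumptions, and the derivative arrays and the half-time-step flux and source values are assumed to have been computed beforehand as described in [6]:

    #include <vector>

    using Grid = std::vector<std::vector<double>>;   // [j][k] node-indexed arrays

    // One application of the staggered update (2) for a scalar component u.
    // u      : cell averages at time level n
    // ux, uy : derivative array approximations U_x, U_y
    // fh, gh : half-step fluxes F^{n+1/2}, G^{n+1/2}
    // sh     : half-step source S^{n+1/2}
    // unew   : output, the new average on the staggered cell (j+1/2, k+1/2)
    void staggered_step(const Grid& u, const Grid& ux, const Grid& uy,
                        const Grid& fh, const Grid& gh, const Grid& sh,
                        double dt, double dx, double dy, Grid& unew)
    {
        const std::size_t Nx = u.size(), Ny = u[0].size();
        for (std::size_t j = 0; j + 1 < Nx; ++j) {
            for (std::size_t k = 0; k + 1 < Ny; ++k) {
                double avg  = 0.25 * (u[j][k] + u[j][k+1] + u[j+1][k] + u[j+1][k+1]);
                double corx = (ux[j][k] - ux[j+1][k] + ux[j][k+1] - ux[j+1][k+1]) / 16.0;
                double cory = (uy[j][k] - uy[j][k+1] + uy[j+1][k] - uy[j+1][k+1]) / 16.0;
                double flx  = dt / (2.0 * dx) *
                              (fh[j+1][k] - fh[j][k] + fh[j+1][k+1] - fh[j][k+1]);
                double fly  = dt / (2.0 * dy) *
                              (gh[j][k+1] - gh[j][k] + gh[j+1][k+1] - gh[j+1][k]);
                double src  = dt / 4.0 *
                              (sh[j+1][k+1] + sh[j+1][k] + sh[j][k+1] + sh[j][k]);
                unew[j][k] = avg + corx + cory - flx - fly + src;
            }
        }
    }

In the full m-component system each of these arrays carries an additional component index, and boundary conditions are applied as in [6].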

3 Multi-platform implementations

We have coded the complete time-evolutionary algorithm in Fortran 77, C/C++ and Java, and linked the DISLIN package [7] to provide graphics during the simulation runs. Double-precision words (64 bits) were used for all floating-point variables throughout. The codes were run under Microsoft Windows XP 32-bit (Xp32) on various PC configurations (Pentium IV desktop, notebook with Intel Centrino CPU, Intel Core 2 Quad). Additional tests on the Core 2 Quad machine were done with the Salford (32-bit) and Gfortran (32- and 64-bit) compilers under the Microsoft Windows XP 64-bit (Xp64) and SUSE Linux 64-bit (SuSe64) operating systems. Selecting a somewhat modest-sized problem (an XY grid of 201 × 201 discretization points), our findings for the various compiler suites, with their default settings, are tabulated in summary form below:

Compiler/OS                              Stable?                       Consistent?
Salford FTN95 (Fortran 77)/Xp32          Yes, over long times          —
Mingw g77/Xp32                           Yes, over long times          Yes, with above
Gfortran 32-bit/Xp32                     Yes, over long times          Yes, with above
Mingw C (GNU C)/Xp32                     Yes, over long times          Yes, with above
MS Visual C++ Express 8/Xp32             Emits NANs over long times    No: results meaningless
Sun Java 5/Xp32                          Emits NANs over long times    No: results meaningless
Salford FTN95 (Fortran 77)/Xp64          Emits NANs over long times    No: results meaningless
Gfortran 32-bit/Xp64                     Emits NANs over long times    No: results meaningless
Gfortran 64-bit/Xp64                     Emits NANs over long times    No: results meaningless
Gfortran 64-bit/SuSe64                   Yes, over long times          Yes, with above first four

We observe that stable and consistent behaviour is obtained in the first four and last cases in the table, with Fig. 1 giving a typical result of the evolution of the electron fluid density as a shock wave. Such computations prevail over several thousand time steps, whilst remaining stable and giving meaningful physical results. For all other cases the results obtained are unstable and quite meaningless. Fig. 2 depicts such results for the codes written in Visual C++ and Sun Java 5. These codes emit NANs or meaningless results of no physical significance. Upon investigation of the discrepancies we find that both MS Visual C++ and Sun Java adhere strictly to the IEEE standard: in particular, the most significant point of departure is that their FPU register precision is 64 bits (53-bit mantissa), whilst the first four compilers all employ 80 bits (64-bit mantissa). Thus intermediate results in the FPU registers can suffer a significant loss of precision even before being rounded to 64-bit double-precision words for storage in memory. In large simulation codes such as this one, such errors can accumulate and swamp the computations over time. Another IEEE feature is that floating-point overflows and underflows are masked to allow computations to continue regardless of the occurrence of NANs at intermediate steps, which is the situation observed in the last two implementations. Furthermore, tests were performed on an Intel Core 2 Quad machine by installing 64-bit operating systems and compilers. We find here that under MS Windows Xp64 all the compilers, including the Gfortran 64-bit compiler, fail to achieve consistency and stability. However, under SUSE Linux-64 we can again achieve consistency and stability with the Gfortran 64-bit compiler by setting the appropriate command-line switches, as indicated in the next section.
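To make the register-precision issue concrete, the following small C++ fragment (our own illustration, not part of the simulation codes) produces different answers depending on whether the intermediate sum a + b is held to a 64-bit mantissa in an x87 register or rounded to a 53-bit mantissa double; the actual outcome also depends on whether the compiler keeps the temporary on the x87 stack or spills it to memory, which is precisely the source of the cross-platform inconsistency discussed above:

    #include <cstdio>

    int main() {
        // b = 2^-60 is far below half an ulp of 1.0 (2^-53), so in strict 64-bit
        // double arithmetic the sum a + b rounds back to 1.0 and b is lost.
        volatile double a = 1.0;
        volatile double b = 1.0 / static_cast<double>(1ULL << 60);

        double d = (a + b) - a;

        // Typical outcomes (compiler- and settings-dependent):
        //   intermediate held in an 80-bit x87 register (64-bit mantissa): d == 2^-60
        //   strict 64-bit intermediates (SSE2, an x87 53-bit precision control
        //   word, or the temporary spilled to a double in memory):         d == 0
        printf("d = %.20g\n", d);
        return 0;
    }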

4 Code fixes for numerical consistency

4.1 Fortran/C++/Java 5 MS Windows Xp-32 implementations

In the case of Salford FTN95, GCC (MinGW C/C++) and Visual C++ 2008 we can set the floating point control word either with inline assembly code or by using WIN32 functions, such as the call to the system function

_control87(_PC_64,MCW_PC);

or

_control87(0x0008001F,0xFFFFFFFF);

These details are available in other works [4]. However, for Sun Java we cannot employ these means. We have thus created a C++ DLL which may be called from Java by employing the following process:

1. Create a (MinGW) C++ project, e.g. cpplibDLL.
2. Set the project options to WIN32 DLL. The default output file name will be cpplibDLL.dll.
3. Include the standard header jni.h (Sun Microsystems') and your cpplibDLL.h.

Figure 1: Salford FTN95/GNU C computed electron density shock structures (FORTRAN run; panels at NT = 0, 100 and 500).

Figure 2: Sun Java 5/VC++ computed electron density shock structures (JAVA run; panels at NT = 0, 100 and 120).

4. Compile/build this into cpplibDLL.dll in your working directory.
5. Create the Java invoking program javacppprog.java.
6. Compile and run the Java program.

Then the following code implementations may be used:

//cpplibDLL.cpp file looks like:
//------------------------------
#include "cpplibDLL.h"
#include "jni.h"
#include <stdio.h>   // printf
#include <stdlib.h>  // system
#include <float.h>   // _control87

//To create an export function for
//the export library:
JNIEXPORT void JNICALL Java_javacppprog_controlWord(JNIEnv *env, jobject obj)
{
    printf("Hello from C++ function controlWord()!\n");
    printf("Original: 0x%.8x\n", _control87(0,0));
    //Set FPU control word:
    _control87(0x0008001F,0xFFFFFFFF);
    //system("PAUSE");
    printf("New: 0x%.8x\n", _control87(0,0));
    return;
}
//etc...for other functions
//-------------------------

//cpplibDLL.h header file
//-----------------------
/* The header file may be generated by:
   javah javacppprog.java
   or you can edit the abovementioned header such as: */
#include "c:\...\jni.h"
#ifndef _Included_cpplibDLL
#define _Included_cpplibDLL
#ifdef __cplusplus
extern "C" {
#endif
JNIEXPORT void JNICALL Java_javacppprog_controlWord(JNIEnv *, jobject);
JNIEXPORT void JNICALL Java_javacppprog_controlWord2(JNIEnv *, jobject);
#ifdef __cplusplus
}
#endif
#endif
//end of header file cpplibDLL.h
//------------------------------

//javacppprog.java
//----------------
import java.io.IOException;
import java.text.NumberFormat;
.........
public class javacppprog{
    static final int NxPts = 201, NyPts = 201, MEQS = 16, ...,
    ..............
    ..............
    public static native void controlWord();
    ....
    static{System.loadLibrary("cpplibDLL");}
    public static void main(String [] args)
    {
        int NGRIDA, NDH, NT, ........
        ......................
        //Set FPU control word:
        controlWord();
        .....
        //Main Time/While loop
        //--------------------
        while (NT < NSTEPS)
        {
            controlWord();
            t=t+dt;
            .........
        }//end main while loop
        .....
    } //main
} //end of class javacppprog

Now, setting the FPU control word to the same value as in the Fortran/MinGW cases, we obtain stable and meaningful results in agreement with those of Fig. 1. The results from the similarly amended Visual C++ program are the same. We note that in both these cases we found it necessary to reapply this setting repeatedly inside the time loop of the code, since returns from certain functions can cause the compiler or Windows default setting to be resumed. The computational cost of this process is negligible in comparison with the time-loop traversal time for any realistic simulation.

4.2 Sun Java 6 MS Windows Xp-32 implementations

In the case of Java 6 we find that attempts to set the control word as in the above section fail, with results as in Fig. 2 once again. In fact, Java 6 masks this low-level function call. At this stage we can only speculate that this is a deliberate design feature intended to protect the Java working environment.

4.3 MS Windows Xp-64 implementations

We find in these cases that even setting the control word as above does not fix the problem. In fact, the Xp-64 operating system overrides the settings, so that the IEEE standard is maintained. For instance, the command line to invoke the Gfortran 64-bit compiler from an install directory's bin sub-directory, compile some prog.for and force the generation of code for the x87 FPU with 80-bit register precision is:

..\bin\x86_64-pc-mingw32-gfortran -c -mpc80 -mfpmath=387 prog.for

But even this imposition is ignored under Windows Xp-64, since the latter switches off access to the MMX and x87 units. It appears [5] that Microsoft is migrating to developing code employing floating point computations only for the SSE2 (and successor) architectures, thus dropping the legacy x87 architecture, which might explain our observations.

4.4 SUSE Linux-64 implementations

Here, although the default floating point behaviour is IEEE compliant, the use of the x87 FPU can be enforced with the command line:

gfortran -c -mpc80 -mfpmath=387 prog.for

5 Speed tests

To gain some insight into the relative performance of Java with respect to speed, we have conducted benchmark tests on the Java, C and Fortran code versions and compared the CPU execution times (single-threaded user code + operating system support code) of the three versions of our 2D code. The results are summarized in the table below for a 500 time-step run in each case:

Compiler/OS                                CPU time (secs)
Salford FTN95 (Fortran 77)/MS Win Xp32     264.141
Mingw C (GNU C)/MS Win Xp32                155.968
Sun Java 6/MS Win Xp32                     200.031

Thus it is clear that Java is competitive with C and, moreover, performs better than Fortran when the latter is running with no code optimizations.

6 Conclusion

By means of multi-language codings of an algorithm for the numerical integration of hyperbolic systems for 3-D electromagnetic plasma fluid equations allowing wave propagation in two dimensions [6], we have conducted floating point consistency tests and speed benchmarks. Our findings indicate that for compilers, such as Salford Fortran 95 and GNU MinGW Fortran and C/C++, that employ 80-bit accuracy in the FPU registers for intermediate calculations, overriding the IEEE standard of 64-bit accuracy, the results over long time runs are stable and consistent, in agreement with previous results. When we employ compilers (Sun Java 5/6 and MS Visual C++ Express 8) that strictly adhere to the IEEE standard in this respect, the results obtained are not consistent with the previous case and, moreover, degenerate into meaningless computations which merely evolve NANs. However, for 32-bit operating systems and compilers (MS Visual C++ Express 8, Java 5), when we set the processor FPU control word to override the IEEE accuracy to 80 bits we obtain stable and consistent results as before. Exceptionally, in the case of Java 6 (and later), attempts to set the control word fail. Under Windows Xp-64, all compilers (32- and 64-bit) are restricted by the operating system to the IEEE default, and hence the codes fail. Nevertheless, under SUSE Linux-64, we find Gfortran to exhibit stable behaviour when the compiler is invoked to generate x87 code. As far as speed benchmarks are concerned, our tests with 32-bit compilers indicate that the C/C++ code performs best, followed (surprisingly) by Java 6 and then Salford FTN95. Thus Java may be seen to be competitive for large simulation codes, were it not for the inconsistencies in its floating point behaviour.

References

[1] W. Kahan, Lecture Notes on the Status of the IEEE Standard 754 for Binary Floating-Point Arithmetic, Dept. of Elect. Eng. and Computer Science, University of California, Berkeley, (1996) 1–29.

[2] D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Comput. Surv. 23 (1991) 5–48.

[3] W. Kahan and J.D. Darcy, How Java's Floating-Point Hurts Everyone Everywhere, Dept. of Elect. Eng. and Computer Science, University of California, Berkeley, (1998, 2004) 1–81. Available online: http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf (originally presented at the ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford University, March 1, 1998).

[4] Shawn D. Casey, x87 and SSE floating point assists in IA-32: Flush-to-zero (FTZ) and Denormals-are-zero (DAZ), Intel Corp., (2007, 2008). Available online: http://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia32-flush-to-zero-ftz-and-denormals-are-zero-daz/

[5] Microsoft Corp., Run-time library reference, http://msdn.microsoft.com/en-us/library (Dec. 2009).

[6] S. Baboolal, High-resolution numerical simulation of 2D nonlinear wave structures in electromagnetic fluids with absorbing boundary conditions, J. Comput. Appl. Math. 234 (2010) 1710–1716.

[7] H. Michels, Dislin Graphics Package: ver. 9.4, Max Planck Institute for Solar System Research, Katlenburg-Lindau, Germany. Available online: http://www.mps.mpg.de/dislin, November 2008.
