Java in High-Performance Computing
Dawid Weiss
Carrot Search; Institute of Computing Science, Poznań University of Technology
GeeCON, Poznań, 05/2010

Learn from the mistakes of others. You can’t live long enough to make them all yourself. — Eleanor Roosevelt

Talk outline
• What is “High performance”?
• What is “Java”?
• Measuring performance (benchmarking).
• HPPC library.

Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.

Divide-and-conquer style algorithm

for (Example e : examples) {
    if (e.hasQuiz()) {
        e.showQuiz();
    } else {
        e.showCode();
    }
    e.explain();
    e.deriveConclusions();
}

— PART I —

High Performance Computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. — Wikipedia

Is Java faster than C/C++? The short answer is: it depends. — Cliff Click

It’s usually hard to make a fast program run faster.
It’s easy to make a slow program run even slower.
It’s easy to make fast hardware run slow.

For now, HPC means:
• limited allowed computation time,
• constrained resources (hardware, memory).

Good HPC software ∝ no (obvious) flaws.

— PART II —

What is Java? (Recall: Is Java faster than C/C++?)

Example 1

public void testSum1() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
    result = sum;
}

public void testSum2() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += sum2(i, i);
    result = sum;
}

where sum1 and sum2 each add their arguments and return the result, and COUNT is sufficiently large...

VM                sum1    sum2    sum3    sum4
sun-1.6.0-20      0.04    2.62    1.05    3.76
sun-1.6.0-16      0.04    3.20    1.39    4.99
sun-1.5.0-18      0.04    3.29    1.46    5.20
ibm-1.6.2         0.08    6.28    0.16   14.64
jrockit-27.5.0    0.18    0.16    1.16    3.18
harmony-r917296   0.17    0.35    9.18   22.49

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

int sum1(int a, int b) {
    return a + b;
}

Integer sum2(Integer a, Integer b) {
    return a + b;
}

↓ (after the compiler desugars autoboxing)

Integer sum2(Integer a, Integer b) {
    return Integer.valueOf(a.intValue() + b.intValue());
}

int sum3(int... args) {
    int sum = 0;
    for (int i = 0; i < args.length; i++)
        sum += args[i];
    return sum;
}

Integer sum4(Integer... args) {
    int sum = 0;
    for (int i = 0; i < args.length; i++) {
        sum += args[i];
    }
    return sum;
}

↓ (varargs become an array allocated at every call site)

Integer sum4(Integer[] args) {
    // ...
}

Conclusions
• Syntactic sugar may be costly.
• Primitive types are fast.
• Large differences between different VMs.
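To see the cost of the sugar on a given VM, the two loops from Example 1 can be placed side by side. The sketch below is illustrative only: the class name BoxingSketch, the helper names and the loop size are assumptions, and the timings it prints depend entirely on the VM and hardware.

```java
public class BoxingSketch {
    // Primitive version: no allocation involved.
    static int sum1(int a, int b) {
        return a + b;
    }

    // Boxed version as written in source code.
    static Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    // The same method with boxing and unboxing spelled out by hand.
    static Integer sum2Desugared(Integer a, Integer b) {
        return Integer.valueOf(a.intValue() + b.intValue());
    }

    // Loop from Example 1 using the primitive method.
    static int loopPrimitive(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) sum += sum1(i, i);
        return sum;
    }

    // Same loop; every call boxes two arguments and unboxes the result.
    static int loopBoxed(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) sum += sum2(i, i);
        return sum;
    }

    public static void main(String[] args) {
        int count = 10000000; // hypothetical COUNT

        long t0 = System.nanoTime();
        int a = loopPrimitive(count);
        long t1 = System.nanoTime();
        int b = loopBoxed(count);
        long t2 = System.nanoTime();

        // Results are printed so the loops cannot be removed as dead code.
        System.out.println("primitive: " + (t1 - t0) / 1e9 + " s, result " + a);
        System.out.println("boxed:     " + (t2 - t1) / 1e9 + " s, result " + b);
    }
}
```

A single unwarmed run like this is only a rough indication; the tables above were produced with warmup and repeated measured rounds.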

Example 2

Write once, run anywhere!

But it’s the same VM!

It works on my machine!

private static boolean ready;

public static void startThread() {
    new Thread() {
        public void run() {
            try {
                sleep(2000);
            } catch (Exception e) {
                /* ignore */
            }
            System.out.println("Marking loop exit.");
            ready = true;
        }
    }.start();
}

public static void main(String[] args) {
    startThread();
    System.out.println("Entering the loop...");
    while (!ready) {
        // Do nothing.
    }
    System.out.println("Done, I left the loop!");
}

while (!ready) {
    // Do nothing.
}

≡?

boolean r = ready;
while (!r) {
    // Do nothing.
}

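In most cases yes, from a JMM (Java Memory Model) perspective: ready is not volatile, so the compiler is allowed to read it once and never again.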
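One way to make the loop terminate reliably is to declare the flag volatile, which forbids hoisting the read out of the loop. A minimal sketch follows; the class name VisibilitySketch, the waitUntilReady helper and the shortened sleep are assumptions made for illustration.

```java
public class VisibilitySketch {
    // volatile: each read in the loop must observe the most recent write.
    private static volatile boolean ready;

    static void startThread(final long delayMillis) {
        new Thread() {
            public void run() {
                try {
                    sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("Marking loop exit.");
                ready = true;
            }
        }.start();
    }

    // Spins until the other thread sets the flag; returns true once the loop exits.
    static boolean waitUntilReady(long delayMillis) {
        ready = false;
        startThread(delayMillis);
        while (!ready) {
            // Busy-wait; the volatile read is repeated on every iteration.
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Entering the loop...");
        waitUntilReady(200);
        System.out.println("Done, I left the loop!");
    }
}
```

Busy-waiting is kept only to mirror Example 2; the point of the sketch is the volatile modifier, not the waiting strategy.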

JVM Internals. . .

C1:
• fast
• not (much) optimization

C2:
• slow(er) than C1
• a lot of JMM-allowed optimizations

There are hundreds of JVM tuning/diagnostic switches.

My personal favorite:

Conclusions
• Bytecode is far from what is executed.
• A lot going on under the (VM) hood.
• Bad code may work, but will eventually crash.
• HotSpot-level optimizations are good.
• If there is a bug in the HotSpot compiler...

Any other diversifying factors?

J2ME
• more VM vendors,
• hardware diversity,
• software and hardware quirks.

Non-JVM target platforms
• Dalvik
• GWT
• IKVM

Conclusions
• There is no “single” Java performance model.
• Performance depends on the VM, environment, class library, hardware.
• Apply benchmark-and-correct cycle.

Benchmarking

Example 3

public void testSum1() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
    result = sum;
}

public void testSum1_2() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
}

VM                sum1    sum1_2
sun-1.6.0-20      0.04    0.00
sun-1.6.0-16      0.04    0.00
sun-1.5.0-18      0.04    0.00
ibm-1.6.2         0.08    0.01
jrockit-27.5.0    0.17    0.08
harmony-r917296   0.17    0.11

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...

 - method holder: ’com/dawidweiss/geecon2010/Example03’
 - access:        0xc1000001 public
 - name:          ’testSum1_2’
...
010   pushq  rbp
      subq   rsp, #16   # Create frame
      nop               # nop for patch_verified_entry
016   addq   rsp, 16    # Destroy frame
      popq   rbp
      testl  rax, [rip + #offset_to_poll_page]   # Safepoint: poll for GC
021   ret

Conclusions
• Benchmarks must be executed to provide feedback.
• HotSpot is smart and effective at removing dead code.
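A benchmark therefore has to publish its result somewhere the compiler cannot prove unused, and it should discard warmup rounds before timing. The sketch below is a hand-rolled harness written for illustration, not the one used to produce the tables in this talk; BenchSketch, round, measure and the round counts are hypothetical names and values.

```java
public class BenchSketch {
    // Written after every round so the loop has an observable effect.
    static volatile int result;

    static int sum1(int a, int b) {
        return a + b;
    }

    // One benchmark round: the work whose time we want to observe.
    static int round(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) sum += sum1(i, i);
        return sum;
    }

    // Runs warmup rounds (discarded), then measured rounds.
    // Returns the mean time of a measured round in seconds.
    static double measure(int warmup, int measured, int count) {
        for (int i = 0; i < warmup; i++) {
            result = round(count);
        }
        long total = 0;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            result = round(count);
            total += System.nanoTime() - start;
        }
        return total / (double) measured / 1e9;
    }

    public static void main(String[] args) {
        // 5 warmup and 10 measured rounds, mirroring the table footnotes.
        double avg = measure(5, 10, 100000000);
        System.out.printf("sum1: %.2f s per round (result=%d)%n", avg, result);
    }
}
```

Storing the sum in a field is the same trick as `result = sum` in testSum1; dropping that assignment reproduces the testSum1_2 situation, where the whole loop may be eliminated.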

Example 4

@Test
public void testAdd1() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++) {
        sum += add1(i);
    }
    guard = sum;
}

public int add1(int i) {
    return i + 1;
}

Note add1 is virtual.

switch                              testAdd1
-XX:+Inlining -XX:+PrintInlining    0.04
-XX:-Inlining                       0.45

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200, JRE 1.7b80-debug).

Most Java calls are monomorphic.

HotSpot adjusts to megamorphic calls automatically.

Example 5

abstract class Superclass {
    abstract int call();
}

class Sub1 extends Superclass { int call() { return 1; } }
class Sub2 extends Superclass { int call() { return 2; } }
class Sub3 extends Superclass { int call() { return 3; } }

Superclass[] mixed = initWithRandomInstances(10000);
Superclass[] solid = initWithSub1Instances(10000);

@Test
public void testMonomorphic() {
    int sum = 0;
    int m = solid.length;
    for (int i = 0; i < COUNT; i++)
        sum += solid[i % m].call();
    guard = sum;
}

@Test
public void testMegamorphic() {
    int sum = 0;
    int m = mixed.length;
    for (int i = 0; i < COUNT; i++)
        sum += mixed[i % m].call();
    guard = sum;
}

VM                monomorphic   megamorphic
sun-1.6.0-20      0.19          0.32
sun-1.6.0-16      0.19          0.34
sun-1.5.0-18      0.18          0.34
ibm-1.6.2         0.20          0.30
jrockit-27.5.0    0.22          0.29
harmony-r917296   0.27          0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

Example 6

@Test
public void testBitCount1() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += Integer.bitCount(i);
    guard = sum;
}

@Test
public void testBitCount2() {
    int sum = 0;
    for (int i = 0; i < COUNT; i++)
        sum += bitCount(i);
    guard = sum;
}

/* Copied from
 * {@link Integer#bitCount} */
static int bitCount(int i) {
    // HD, Figure 5-2
    i = i - ((i >>> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
    i = (i + (i >>> 4)) & 0x0f0f0f0f;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    return i & 0x3f;
}

VM              testBitCount1   testBitCount2
sun-1.6.0-20    0.43            0.43
sun-1.7.0-b80   0.43            0.43

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

VM              testBitCount1   testBitCount2
sun-1.6.0-20    0.08            0.33
sun-1.7.0-b83   0.07            0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Windows 7, Intel I7 860).

... -XX:+PrintInlining ...

...
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Example06.testBitCount1: [measured 10 out of 15 rounds]
round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00]
...
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
Example06.testBitCount2: [measured 10 out of 15 rounds]
round: 0.32 [+- 0.01], round.gc: 0.00 [+- 0.00]
...

... -XX:+PrintOptoAssembly ...

{method}
 - klass: {other class}
 - method holder: com/dawidweiss/geecon2010/Example06
 - name: testBitCount1
...
0c2 B13: # B12 B14
...

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

HashMap rehashes your (carefully crafted) hash code.
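Why the rehash exists can be seen on a power-of-two table, where only the low bits of the hash select a bucket. The sketch below copies the supplemental hash and index computation as found in JDK 6's java.util.HashMap; the class name RehashSketch, the table length of 16 and the two sample keys are assumptions chosen for illustration.

```java
public class RehashSketch {
    // Supplemental hash as found in java.util.HashMap (JDK 6).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Bucket index in a power-of-two table: only the low bits matter.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int[] keys = { 0x10000, 0x20000 }; // hash codes differing only in high bits
        for (int key : keys) {
            System.out.println("0x" + Integer.toHexString(key)
                + ": raw bucket " + indexFor(key, length)
                + ", rehashed bucket " + indexFor(hash(key), length));
        }
    }
}
```

Without the rehash both sample keys fall into bucket 0; with it they land in buckets 1 and 2. The price is that a hash code already tuned for the table is scrambled again, and the extra shifts and XORs are paid on every lookup.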

HPPC approach (example):

public class LongIntOpenHashMap implements LongIntMap {
    // ...
    public LongIntOpenHashMap(int initialCapacity, float loadFactor,
                              LongHashFunction keyHashFunction,
                              IntHashFunction valueHashFunction) {
        // ...
    }
    // ...
}

Defaults: LongMurmurHash, IntHashFunction.
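The idea of a primitive map with a caller-supplied hash function can be sketched in a few dozen lines. What follows is not HPPC's implementation: LongIntMapSketch, the LongHash interface and withDefaultHash are hypothetical names, the table never grows, and the putOrAdd semantics (store an initial value, or add a delta to an existing one) are an assumption for this sketch.

```java
public class LongIntMapSketch {
    // Hypothetical stand-in for a pluggable key hash function.
    interface LongHash {
        int hash(long key);
    }

    private final long[] keys;     // primitive keys, no boxing
    private final int[] values;    // primitive values, no boxing
    private final boolean[] used;  // slot occupancy markers
    private final LongHash hashFn;
    private int size;

    // capacity must be a power of two; the table never grows in this sketch.
    LongIntMapSketch(int capacity, LongHash hashFn) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.keys = new long[capacity];
        this.values = new int[capacity];
        this.used = new boolean[capacity];
        this.hashFn = hashFn;
    }

    // Simple default: fold the upper 32 bits into the lower 32 bits.
    static LongIntMapSketch withDefaultHash(int capacity) {
        return new LongIntMapSketch(capacity, new LongHash() {
            public int hash(long key) {
                return (int) (key ^ (key >>> 32));
            }
        });
    }

    // Linear probing: returns the slot holding key, or the first free slot.
    // Terminates because one slot is always left free (see putOrAdd).
    private int slotFor(long key) {
        int mask = keys.length - 1;
        int slot = hashFn.hash(key) & mask;
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    // Adds delta to the value stored under key, or stores initial if absent.
    int putOrAdd(long key, int initial, int delta) {
        int slot = slotFor(key);
        if (used[slot]) {
            return values[slot] += delta;
        }
        if (size == keys.length - 1) {
            throw new IllegalStateException("sketch table is full");
        }
        used[slot] = true;
        keys[slot] = key;
        size++;
        return values[slot] = initial;
    }

    // Returns the stored value, or 0 if the key is absent.
    int get(long key) {
        int slot = slotFor(key);
        return used[slot] ? values[slot] : 0;
    }

    public static void main(String[] args) {
        LongIntMapSketch counts = withDefaultHash(16);
        long[] data = { 7L, 42L, 7L, 7L };
        for (long d : data) {
            counts.putOrAdd(d, 1, 1);
        }
        System.out.println("7 seen " + counts.get(7L) + " times");
    }
}
```

Because the hash function is a constructor argument, a caller who already has well-distributed keys can pass a cheap function and avoid the unconditional rehash that java.util.HashMap applies.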

Example 7

Frequency count of character bigrams in a given text.

• HPPC:

final char[] CHARS = DATA;
final IntIntOpenHashMap counts = new IntIntOpenHashMap();
for (int i = 0; i < CHARS.length - 1; i++) {
    counts.putOrAdd((CHARS[i]