
Runtime and Data Management for Heterogeneous Computing in Java


Juan José Fumero, Toomas Remmelg, Michel Steuwer, Christophe Dubach


The University of Edinburgh

Principles and Practices of Programming on the Java Platform 2015


Outline

1. Introduction
2. API
3. Runtime Code Generation
4. Data Management
5. Results
6. Conclusion


Heterogeneous Computing


• N-Body app (NVIDIA SDK): ~105x speedup over sequential
• LU Decomposition (Rodinia benchmark): ~10x over 32 OpenMP threads


Cool, but how to program?


Example in OpenCL

// Create host buffers
int *A, ...;
// Initialization
...
// Platform
cl_uint numPlatforms = 0;
cl_platform_id *platforms;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
platforms = (cl_platform_id *) malloc(numPlatforms * sizeof(cl_platform_id));
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
cl_uint numDevices = 0;
cl_device_id *devices;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
// Allocate space for each device
devices = (cl_device_id *) malloc(numDevices * sizeof(cl_device_id));
// Fill in devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
cl_context context;
context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdQ;
cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
cl_mem d_A, d_B, d_C;
d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
...
// Check errors
...


Example in OpenCL (continued)

const char *sourceFile = "kernel.cl";
source = readsource(sourceFile);
program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &status);
cl_int buildErr;
buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
// Create a kernel
kernel = clCreateKernel(program, "vecadd", &status);

status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

size_t globalWorkSize[1] = { ELEMENTS };
size_t local_item_size[1] = { 5 };

clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

// Free memory


OpenCL example

__kernel void vecadd(__global int *a, __global int *b, __global int *c) {
    int idx = get_global_id(0);
    c[idx] = a[idx] * b[idx];
}

• "Hello world" app: ~250 lines of code (including error checking)
• Low-level and device-specific code
• Knowledge about the target architecture required
• If the GPU/accelerator changes, tuning is required


OpenCL programming is hard and error-prone!

Higher levels of abstraction

Programming for Heterogeneous Computing

Higher levels of abstraction

Similar works

• Sumatra (discontinued): Stream API for HSAIL
• AMD Aparapi: Java API for OpenCL
• NVIDIA Nova: functional programming language for CPU/GPU
• Copperhead: subset of Python that can be executed on heterogeneous platforms

Our Approach

Three levels of abstraction:
• Parallel skeletons: API based on a functional programming style (map/reduce)
• High-level optimising library which rewrites operations to target specific hardware
• OpenCL code generation and runtime with data management for heterogeneous architectures

Example: Saxpy

// Computation function
ArrayFunc<Tuple2<Float, Float>, Float> mult =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

// Prepare the input
Tuple2<Float, Float>[] input = new Tuple2[size];
for (int i = 0; i < input.length; ++i) {
    input[i] = new Tuple2<>((float) (i * 0.323), (float) (i + 2.0));
}

// Computation
Float[] output = mult.apply(input);

If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:

map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
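As a rough illustration of this rewrite (a minimal sketch only: the class and method names below are placeholders, not the real JPAI implementation, and the OpenCL stage is stubbed with a sequential loop):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of map(λ) = CopyIn().computeOCL(λ).CopyOut() — illustrative names only.
class MapAcceleratorSketch<T, R> {
    private final Function<T, R> lambda;

    MapAcceleratorSketch(Function<T, R> lambda) {
        this.lambda = lambda;
    }

    List<R> apply(List<T> input) {
        List<T> onDevice = copyIn(input);        // stage 1: host -> device transfer
        List<R> computed = computeOCL(onDevice); // stage 2: run the generated kernel
        return copyOut(computed);                // stage 3: device -> host transfer
    }

    private List<T> copyIn(List<T> in) { return in; }    // stub: would allocate device buffers
    private List<R> computeOCL(List<T> in) {             // stub: sequential stand-in for the kernel
        List<R> out = new ArrayList<>();
        for (T t : in) out.add(lambda.apply(t));
        return out;
    }
    private List<R> copyOut(List<R> out) { return out; } // stub: would read device buffers back
}

The point is only the shape of the composition: copy-in, kernel execution, copy-out, with the user's λ left untouched.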

Our Approach: Overview

Class hierarchy: Function → ArrayFunc → { Map, Reduce, ... }, with Map specialised into MapThreads and MapOpenCL. Their apply() methods:

// Map
apply() {
    for (i = 0; i < size; ++i)
        out[i] = f.apply(in[i]);
}

// MapThreads
apply() {
    for (thread : threads)
        thread.performMapSeq();
}

// MapOpenCL
apply() {
    copyToDevice();
    execute();
    copyToHost();
}
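A minimal, self-contained sketch of the MapThreads strategy above (simplified and assumed: the real JPAI class manages its threads and chunking differently):

import java.util.function.Function;

class MapThreadsSketch {
    // Split the map across nThreads Java threads; each thread handles a strided slice.
    static <T, R> void mapInParallel(T[] in, R[] out, Function<T, R> f, int nThreads)
            throws InterruptedException {
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; ++t) {
            final int id = t;
            threads[t] = new Thread(() -> {
                // sequential map over this thread's elements
                for (int i = id; i < in.length; i += nThreads) {
                    out[i] = f.apply(in[i]);
                }
            });
            threads[t].start();
        }
        for (Thread worker : threads) {
            worker.join(); // wait for all partial maps to finish
        }
    }
}

MapOpenCL replaces the per-thread loop with the copyToDevice/execute/copyToHost sequence shown above.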

Runtime Code Generation: Workflow

[Figure: compilation workflow. The Java source (Map.apply(f)) is compiled to Java bytecode; inside the Graal VM the runtime performs 1. type inference, 2. IR generation (CFG + dataflow, Graal IR), 3. optimisations, and 4. kernel generation, producing the OpenCL kernel.]

[Figure: Graal IR for a simple user lambda. The initial graph guards the parameter with a null check (IsNull, GuardingPi(NullCheckException)), unboxes it via Invoke#Integer.intValue, converts it to double, multiplies by the constant 2.0 and boxes the result via Invoke#Double.valueOf before the Return; after optimisation only the DoubleConvert, the multiplication and the Return remain.]

From this IR the code generator emits a scalar function:

inline double lambda0(int p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

OpenCL code generated

double lambda0(float p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

__kernel void lambdaComputationKernel(__global float *p0,
                                      __global int *p0_index_data,
                                      __global double *p1,
                                      __global int *p1_index_data) {
    int p0_dim_1 = 0;
    int p1_dim_1 = 0;
    int gs = get_global_size(0);
    int loop_1 = get_global_id(0);
    for (;; loop_1 += gs) {
        int p0_len_dim_1 = p0_index_data[p0_dim_1];
        bool cond_2 = loop_1 < p0_len_dim_1;
        if (cond_2) {
            float auxVar0 = p0[loop_1];
            double res = lambda0(auxVar0);
            p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
        } else {
            break;
        }
    }
}

Main Components

[Figure: system overview — user code and the API (parallel pattern composition) on top of the runtime, which comprises the Graal bridge, skeleton/pattern rewriting, buffer management, the accelerator runtime, Java threads, code cache management, deoptimisation(*), the code generator, the optimizer and OCL JIT compilation.]

Investigation of runtime for Black-Scholes

Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]

[Figure: share of total runtime (%) spent in unmarshalling, CopyToCPU, GPU execution, CopyToGPU, marshalling and Java overhead.]

• Un/marshalling the data takes up to 90% of the time
• The computation step should be dominant

This is not acceptable. Can we do better?

Custom Array Type: PArray

[Figure: data layout. Programmer's view — a PArray of Tuple2<Float, Double> elements at indices 0 … n-1. Graal-OCL VM view — the same data stored column-wise in a FloatBuffer (all float fields) and a DoubleBuffer (all double fields).]

With this layout, un/marshal operations are not necessary.
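A minimal sketch of this struct-of-arrays layout (assumed and simplified: the real PArray is generic over tuple types and exposes its buffers directly to the OpenCL runtime):

import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

// Column-wise storage for an array of Tuple2<Float, Double>: one primitive
// buffer per tuple field, so no per-element marshalling is needed.
class PArrayOfTuple2Sketch {
    private final FloatBuffer firstField;    // all _1 (float) values
    private final DoubleBuffer secondField;  // all _2 (double) values

    PArrayOfTuple2Sketch(int size) {
        this.firstField = FloatBuffer.allocate(size);
        this.secondField = DoubleBuffer.allocate(size);
    }

    void put(int index, float first, double second) {
        firstField.put(index, first);    // update the float column in place
        secondField.put(index, second);  // update the double column in place
    }

    float getFirst(int index)  { return firstField.get(index); }
    double getSecond(int index) { return secondField.get(index); }
}

Because each field lives in one primitive buffer, the data can be handed to the device as-is, which is what removes the marshalling step measured on the previous slide.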

Example of JPAI

ArrayFunc<Tuple2<Float, Double>, Double> f =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

PArray<Tuple2<Float, Double>> input = new PArray<>(size);
for (int i = 0; i < size; ++i) {
    input.put(i, new Tuple2<>((float) i, (double) i + 2));
}

PArray<Double> output = f.apply(input);

pArray.put(2, new Tuple2<>(2.0f, 4.0));

[Figure: the call above writes straight into the backing buffers — FloatBuffer: 0.0f, 1.0f, 2.0f, ...; DoubleBuffer: 2.0, 3.0, 4.0, ...]

Setup

• 5 applications
• Comparison with:
  • Java sequential (Graal-compiled code)
  • Java threads
• AMD and NVIDIA GPUs
• Java arrays vs. custom PArray

Java Threads Execution

[Figure: speedup over sequential Java with 1, 2, 4, 8 and 16 Java threads for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large input sizes (speedups up to about 6x). CPU: Intel Core i7-4770K @ 3.50 GHz.]

OpenCL GPU Execution

[Figure: speedup over sequential Java (log scale, 0.1x to 1000x) for the marshalling and optimized versions on an NVIDIA GeForce GTX Titan Black and an AMD Radeon R9 295, across Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo (small and large inputs); Saxpy drops as low as 0.004x.]

OpenCL GPU Execution (annotated)

[Same figure as the previous slide, annotated with factors of roughly 12x, 70x and 10x for selected benchmarks.]

.zip(Conclusions).map(Future)

Present
• We have presented an API to enable heterogeneous computing in Java
• A custom array type to reduce overheads when transferring data
• A runtime system to run heterogeneous applications within Java

Future
• Runtime data type specialization
• Code generation for multiple devices
• Runtime scheduling (where is the best place to run the code?)

Thanks so much for your attention

This work was supported by a grant from: [sponsor logos]

Juan Fumero
Email: [email protected]
Webpage: http://homepages.inf.ed.ac.uk/s1369892/