Runtime and Data Management for Heterogeneous Computing in Java

Juan José Fumero, Toomas Remmelg, Michel Steuwer, Christophe Dubach
The University of Edinburgh

Principles and Practices of Programming on the Java Platform (PPPJ) 2015
Outline

1. Introduction
2. API
3. Runtime Code Generation
4. Data Management
5. Results
6. Conclusion
Heterogeneous Computing

• NBody app (NVIDIA SDK): ~105x speedup over sequential
• LU Decomposition (Rodinia benchmark): ~10x over 32 OpenMP threads
Cool, but how do we program it?
Example in OpenCL (host code, part 1)

// Create host buffers
int *A, ...;
// Initialization
...
// Platform
cl_uint numPlatforms = 0;
cl_platform_id *platforms;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
platforms = (cl_platform_id *) malloc(numPlatforms * sizeof(cl_platform_id));
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
cl_uint numDevices = 0;
cl_device_id *devices;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
// Allocate space for each device
devices = (cl_device_id *) malloc(numDevices * sizeof(cl_device_id));
// Fill in devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
cl_context context;
context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdQ;
cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
cl_mem d_A, d_B, d_C;
d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
...
// Check errors
...
Example in OpenCL (host code, part 2)

const char *sourceFile = "kernel.cl";
source = readsource(sourceFile);
program = clCreateProgramWithSource(context, 1, (const char **)&source, NULL, &status);
cl_int buildErr;
buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
// Create a kernel
kernel = clCreateKernel(program, "vecadd", &status);

status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

size_t globalWorkSize[1] = { ELEMENTS };
size_t local_item_size[1] = { 5 };

clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

// Free memory
OpenCL example (kernel)

kernel void vecadd(global int *a, global int *b, global int *c) {
    int idx = get_global_id(0);
    c[idx] = a[idx] * b[idx];
}

• "Hello world" app: ~250 lines of code (including error checking)
• Low-level and device-specific code
• Requires knowledge of the target architecture
• If the GPU/accelerator changes, re-tuning is required
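For contrast with the ~250-line OpenCL version, the computation itself fits in a few lines of plain Java. A minimal sketch (class and method names are illustrative, not from the talk):

```java
import java.util.Arrays;

// Plain-Java version of the vecadd kernel above, for contrast with
// the OpenCL boilerplate: the computation itself is tiny.
public class VecAdd {
    // Element-wise product, as the OpenCL kernel computes it.
    static int[] vecadd(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] * b[i];
        }
        return c;
    }

    public static void main(String[] args) {
        int[] c = vecadd(new int[]{1, 2, 3}, new int[]{4, 5, 6});
        System.out.println(Arrays.toString(c)); // prints [4, 10, 18]
    }
}
```

The point is not performance but the gap in programming effort: everything else in the OpenCL listing is device discovery, buffer management, and error handling.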
OpenCL programming is hard and error-prone!
Higher levels of abstraction

Programming for Heterogeneous Computing

Higher levels of abstraction
Similar works

• Sumatra (discontinued): Stream API targeting HSAIL
• AMD Aparapi: Java API for OpenCL
• NVIDIA Nova: functional programming language for CPU/GPU
• Copperhead: a subset of Python that can be executed on heterogeneous platforms
Our Approach

Three levels of abstraction:
• Parallel skeletons: an API based on a functional programming style (map/reduce)
• A high-level optimising library that rewrites operations to target specific hardware
• OpenCL code generation and a runtime with data management for heterogeneous architectures
Example: Saxpy

// Computation function
ArrayFunc<Tuple2<Float, Float>, Float> mult =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

// Prepare the input
Tuple2[] input = new Tuple2[size];
for (int i = 0; i < input.length; ++i) {
    input[i]._1 = (float) (i * 0.323);
    input[i]._2 = (float) (i + 2.0);
}

// Computation
Float[] output = mult.apply(input);

If the accelerator is enabled, the map expression is automatically rewritten into lower-level operations:

    map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
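The rewrite rule above can be sketched as a plain-Java pipeline. All names here are illustrative (the real JPAI implementation generates OpenCL; this sketch simulates the device with arrays):

```java
import java.util.function.Function;

// Sketch of the map-rewriting idea: when an accelerator is enabled,
// map(f) becomes copy-in -> compute -> copy-out. The "device" here
// is simulated with plain arrays.
public class MapRewrite {
    // Sequential map skeleton over a float array.
    static float[] map(Function<Float, Float> f, float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) out[i] = f.apply(in[i]);
        return out;
    }

    // "Accelerator" version: the same computation wrapped in explicit
    // stages, mirroring CopyIn().computeOCL(f).CopyOut().
    static float[] mapAccelerator(Function<Float, Float> f, float[] in) {
        float[] deviceIn = in.clone();        // CopyIn()
        float[] deviceOut = map(f, deviceIn); // computeOCL(f)
        return deviceOut.clone();             // CopyOut()
    }

    public static void main(String[] args) {
        float[] r = mapAccelerator(x -> 2.5f * x + 2.0f, new float[]{1f, 2f});
        System.out.println(r[0] + " " + r[1]); // prints 4.5 7.0
    }
}
```

Because both sides of the rewrite compute the same function, the runtime is free to pick either form depending on whether a device is available.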
Our Approach: Overview

Class hierarchy: Function → ArrayFunc → { Map, Reduce, ... },
with Map specialised as MapThreads and MapOpenCL.

Map.apply():
    for (i = 0; i < size; ++i)
        out[i] = f.apply(in[i]);

MapThreads.apply():
    for (thread : threads)
        thread.performMapSeq();

MapOpenCL.apply():
    copyToDevice();
    execute();
    copyToHost();
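The hierarchy above can be sketched in a few lines of Java. This is specialised to int[] to keep it short, and the names are illustrative rather than JPAI's actual generic signatures:

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Sketch of the skeleton hierarchy: a sequential Map and a
// MapThreads that splits the iteration space across Java threads.
public class Skeletons {
    interface ArrayFunc { int[] apply(int[] in); }

    static class Map implements ArrayFunc {
        final IntUnaryOperator f;
        Map(IntUnaryOperator f) { this.f = f; }
        public int[] apply(int[] in) {
            int[] out = new int[in.length];
            for (int i = 0; i < in.length; i++) out[i] = f.applyAsInt(in[i]);
            return out;
        }
    }

    static class MapThreads extends Map {
        final int nThreads;
        MapThreads(IntUnaryOperator f, int nThreads) { super(f); this.nThreads = nThreads; }
        public int[] apply(int[] in) {
            int[] out = new int[in.length];
            Thread[] ts = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int tid = t;
                // Each thread handles a strided slice of the array.
                ts[t] = new Thread(() -> {
                    for (int i = tid; i < in.length; i += nThreads)
                        out[i] = f.applyAsInt(in[i]);
                });
                ts[t].start();
            }
            for (Thread th : ts) {
                try { th.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        int[] r = new MapThreads(x -> x * 2, 4).apply(new int[]{1, 2, 3, 4, 5});
        System.out.println(Arrays.toString(r)); // prints [2, 4, 6, 8, 10]
    }
}
```

A MapOpenCL subclass would override apply() with the copyToDevice/execute/copyToHost sequence shown on the slide.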
Runtime Code Generation: Workflow

Java source (map.apply(f)) is compiled to Java bytecode:

    ...
    aload_2
    iload_3
    aload_0
    getfield
    aaload
    invokeinterface #apply
    aastore
    iinc
    iload_3
    ...

The Graal VM then performs:
    1. Type inference
    2. IR generation (CFG + dataflow: Graal IR)
    3. Optimizations
    4. Kernel generation, producing an OpenCL kernel:

    kernel void kernel(global float* input, global float* output) { ...; }
Runtime Code Generation: Optimizing the Graal IR

[Figure: Graal IR graphs of the lambda before and after optimization.
The initial graph surrounds the actual computation (DoubleConvert,
multiply by the constant 2.0, Return) with boxing and null-check
nodes (IsNull, GuardingPi(NullCheckException), Invoke#Integer.intValue,
Unbox, Box, Invoke#Double.valueOf); the optimized graph keeps only
the computation.]

The resulting device function:

    inline double lambda0(int p0)
    {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }
OpenCL code generated

double lambda0(float p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

kernel void lambdaComputationKernel(global float *p0,
                                    global int *p0_index_data,
                                    global double *p1,
                                    global int *p1_index_data) {
    int p0_dim_1 = 0;
    int p1_dim_1 = 0;
    int gs = get_global_size(0);
    int loop_1 = get_global_id(0);
    for (; ; loop_1 += gs) {
        int p0_len_dim_1 = p0_index_data[p0_dim_1];
        bool cond_2 = loop_1 < p0_len_dim_1;
        if (cond_2) {
            float auxVar0 = p0[loop_1];
            double res = lambda0(auxVar0);
            p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
        } else {
            break;
        }
    }
}
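Semantically, the generated kernel parallelises a simple loop: each work-item handles a strided subset of the indices. A sketch of the same computation as sequential Java (illustrative names, not the talk's actual backend):

```java
// Sketch: the sequential loop that the generated OpenCL kernel
// parallelises. lambda0 mirrors the generated device function.
public class KernelSemantics {
    static double lambda0(float p0) {
        double cast = (double) p0;
        return cast * 2.0;
    }

    // Apply lambda0 to every element; on the GPU, the iteration
    // space is split across work-items instead of a loop.
    static double[] applyAll(float[] p0) {
        double[] p1 = new double[p0.length];
        for (int i = 0; i < p0.length; i++) {
            p1[i] = lambda0(p0[i]);
        }
        return p1;
    }

    public static void main(String[] args) {
        double[] r = applyAll(new float[]{1f, 2f, 3f});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints 2.0 4.0 6.0
    }
}
```

The index_data arrays in the generated kernel carry array lengths and offsets, which this flat-array sketch does not need.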
Main Components

• User code / API: parallel pattern composition
• Runtime:
    - Graal bridge
    - Skeleton/pattern rewriting
    - Buffer management
    - Accelerator runtime / Java threads
    - Code cache management
    - Deoptimisation (*)
    - Code generator and optimizer (OpenCL JIT compilation)
Investigation of the runtime for Black-Scholes

Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]

[Figure: breakdown of total runtime in % into marshaling, CopyToGPU,
GPU execution, CopyToCPU, unmarshaling, and Java overhead.]

• Un/marshaling the data takes up to 90% of the time
• The computation step should be dominant

This is not acceptable. Can we do better?
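The marshaling step being measured is essentially an array-of-structs to struct-of-arrays conversion. A sketch of what it entails (the Tuple2 class here is a stand-in, not JPAI's):

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

// Sketch of the marshaling step: converting a Java Tuple2[] (array
// of objects) into the flat primitive buffers OpenCL needs. One
// full pass per field, touching every object, is the per-element
// work that dominated the Black-Scholes runtime.
public class Marshal {
    static class Tuple2 {
        final float _1; final double _2;
        Tuple2(float a, double b) { _1 = a; _2 = b; }
    }

    static FloatBuffer marshalFirst(Tuple2[] in) {
        FloatBuffer fb = FloatBuffer.allocate(in.length);
        for (Tuple2 t : in) fb.put(t._1);
        return fb;
    }

    static DoubleBuffer marshalSecond(Tuple2[] in) {
        DoubleBuffer db = DoubleBuffer.allocate(in.length);
        for (Tuple2 t : in) db.put(t._2);
        return db;
    }

    public static void main(String[] args) {
        Tuple2[] in = { new Tuple2(1f, 2.0), new Tuple2(3f, 4.0) };
        System.out.println(marshalFirst(in).get(0));  // prints 1.0
        System.out.println(marshalSecond(in).get(1)); // prints 4.0
    }
}
```

Unmarshaling is the reverse pass, allocating a fresh object per element, which is why the two steps together dominate the runtime.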
Custom Array Type: PArray

Programmer's view: a PArray of Tuple2<Float, Double> elements looks
like a plain array (indices 0 .. n-1), each element holding a float
and a double.

Graal-OCL VM view: the elements are stored column-wise in a
FloatBuffer (all the floats, 0 .. n-1) and a DoubleBuffer (all the
doubles, 0 .. n-1).

With this layout, un/marshal operations are not necessary.
Example of JPAI

ArrayFunc<Tuple2<Float, Double>, Double> f =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

PArray<Tuple2<Float, Double>> input = new PArray<>(size);
for (int i = 0; i < size; ++i) {
    input.put(i, new Tuple2<>((float) i, (double) i + 2));
}

PArray<Double> output = f.apply(input);

A call such as

    pArray.put(2, new Tuple2<>(2.0f, 4.0));

writes straight into the underlying buffers:

    FloatBuffer:  0.0f  1.0f  2.0f  ...
    DoubleBuffer: 2.0   3.0   4.0   ...
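A minimal sketch of the struct-of-arrays idea behind PArray (names and API are illustrative, not JPAI's actual signatures):

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

// Sketch of a PArray-like container for (float, double) pairs: the
// programmer puts and gets tuple fields, but storage is two flat
// primitive buffers, so no marshaling is needed before a GPU copy.
public class PairArray {
    private final FloatBuffer firsts;
    private final DoubleBuffer seconds;

    PairArray(int size) {
        firsts = FloatBuffer.allocate(size);
        seconds = DoubleBuffer.allocate(size);
    }

    // Store the tuple's fields directly into the flat buffers.
    void put(int i, float first, double second) {
        firsts.put(i, first);
        seconds.put(i, second);
    }

    float getFirst(int i)   { return firsts.get(i); }
    double getSecond(int i) { return seconds.get(i); }

    public static void main(String[] args) {
        PairArray pa = new PairArray(3);
        pa.put(2, 2.0f, 4.0);
        System.out.println(pa.getFirst(2) + " " + pa.getSecond(2)); // prints 2.0 4.0
    }
}
```

The design trade-off: element access goes through buffer indexing instead of object fields, but the data already sits in exactly the layout the OpenCL buffers expect.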
Setup

• 5 applications
• Comparison with:
    - Java sequential (Graal-compiled code)
    - AMD and NVIDIA GPUs
    - Java arrays vs. custom PArray
    - Java threads
Java Threads Execution

[Figure: speedup over Java sequential with 1, 2, 4, 8 and 16 Java
threads, on small and large inputs of Saxpy, K-Means, Black-Scholes,
N-Body and Monte Carlo; speedups reach up to ~6x.]

CPU: Intel(R) Core(TM) i7-4770K @ 3.50GHz
OpenCL GPU Execution

[Figure: speedup over Java sequential (log scale, 0.1x to 1000x) for
the marshaling and optimized (PArray) variants on NVIDIA and AMD
GPUs, over small and large inputs of Saxpy, K-Means, Black-Scholes,
N-Body and Monte Carlo; Saxpy small drops as low as 0.004x.]

AMD Radeon R9 295 · NVIDIA GeForce GTX Titan Black
OpenCL GPU Execution (highlights)

[Figure: the same chart with callouts of 12x, 70x and 10x, marking
the gap between the marshaling and optimized (PArray) versions on
selected benchmarks.]

AMD Radeon R9 295 · NVIDIA GeForce GTX Titan Black
.zip(Conclusions).map(Future)

Present:
• We have presented an API that enables heterogeneous computing in Java
• A custom array type (PArray) that reduces overheads when transferring data
• A runtime system to run heterogeneous applications within Java

Future:
• Runtime data type specialization
• Code generation for multiple devices
• Runtime scheduling (where is the best place to run the code?)
Thanks so much for your attention

This work was supported by a grant from: [sponsor logo]

Juan Fumero
Email: [email protected]
Webpage: http://homepages.inf.ed.ac.uk/s1369892/