Introduction
Array API
Runtime Code Generation
Preliminary Results
Conclusions
A Composable Array Function Interface for Heterogeneous Computing in Java
Juan José Fumero (λ), Michel Steuwer (π), Christophe Dubach (λ)
λ University of Edinburgh, UK
π University of Münster, Germany
ARRAY'14, 13.06.2014
Programming for Heterogeneous Computing
Previous Work
Embedded DSLs in high-level languages:
PyCUDA, PyOpenCL, ...
JOCL, JavaCL, JCuda, ...
Stream programming:
IBM Liquid Metal: new operators for task and data parallelism
Sumatra: Stream API based on JDK 8 for GPU array programming
API Description
Array Programming Interface
[Class diagram: Function, ArrayFunction, Map, Reduce, Zip]
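The sketch below illustrates how an array function interface of this kind might compose operations. It is not the authors' implementation: the method signatures, the `identity()` entry point and the explicit array-constructor argument of `map` are assumptions made so that the example compiles on its own. Zip is left out to keep it short.

```java
import java.util.Arrays;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.IntFunction;

// Hypothetical sketch: names follow the class diagram, signatures are assumed.
final class ArrayFunctionSketch {

    /** A function from an input array to an output array. */
    interface ArrayFunction<T, R> {
        R[] apply(T[] input);

        /** Chain an element-wise map after this function. */
        default <U> ArrayFunction<T, U> map(Function<R, U> f, IntFunction<U[]> newArray) {
            return input -> {
                R[] mid = this.apply(input);
                U[] out = newArray.apply(mid.length);
                for (int i = 0; i < mid.length; i++) {
                    out[i] = f.apply(mid[i]);
                }
                return out;
            };
        }

        /** Chain a reduction after this function; the result is a one-element array. */
        default ArrayFunction<T, R> reduce(BinaryOperator<R> op) {
            return input -> {
                R[] mid = this.apply(input);
                if (mid.length == 0) {
                    return mid; // nothing to reduce
                }
                R acc = mid[0];
                for (int i = 1; i < mid.length; i++) {
                    acc = op.apply(acc, mid[i]);
                }
                R[] out = Arrays.copyOf(mid, 1); // keeps the runtime element type
                out[0] = acc;
                return out;
            };
        }
    }

    /** Start of a pipeline: returns its input unchanged. */
    static <T> ArrayFunction<T, T> identity() {
        return input -> input;
    }

    public static void main(String[] args) {
        ArrayFunction<Integer, Integer> sumOfSquares =
                ArrayFunctionSketch.<Integer>identity()
                        .map(x -> x * x, Integer[]::new)
                        .reduce(Integer::sum);
        Integer[] result = sumOfSquares.apply(new Integer[] {1, 2, 3});
        System.out.println(result[0]); // prints 14
    }
}
```

Each call returns a new function rather than a result, so a pipeline is only a description of work until `apply` is invoked. That is the property a runtime would need in order to generate code for the whole pipeline at once.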
Example - dotProduct
$\sum_{i=0}^{n-1} a_i \cdot b_i$

f = zip().map(x -> x._1 * x._2).reduce((x, y) -> x + y);
Float[] a = new Float[N];
Float[] b = new Float[N];
Float[] result = f.apply(new Tuple(a, b));
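To make the three stages of the pipeline explicit, here is a plain sequential Java sketch in which zip, map and reduce are ordinary static methods. The `Pair` class and all method names are hypothetical stand-ins for the slide's `Tuple` and API. The code only shows what each stage computes, not how the API executes it.

```java
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class DotProductStages {

    /** Hypothetical pair type standing in for the slide's Tuple (fields _1 and _2). */
    static final class Pair {
        final float _1;
        final float _2;

        Pair(float first, float second) {
            _1 = first;
            _2 = second;
        }
    }

    /** zip: pair up the elements of a and b position by position. */
    static Pair[] zip(Float[] a, Float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        Pair[] out = new Pair[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = new Pair(a[i], b[i]);
        }
        return out;
    }

    /** map: apply f to every pair. */
    static Float[] map(Pair[] in, Function<Pair, Float> f) {
        Float[] out = new Float[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = f.apply(in[i]);
        }
        return out;
    }

    /** reduce: fold the array with op, starting from the given identity value. */
    static Float reduce(Float[] in, Float identity, BinaryOperator<Float> op) {
        Float acc = identity;
        for (Float x : in) {
            acc = op.apply(acc, x);
        }
        return acc;
    }

    /** The same lambdas as on the slide, applied stage by stage. */
    static Float dotProduct(Float[] a, Float[] b) {
        return reduce(map(zip(a, b), x -> x._1 * x._2), 0f, (x, y) -> x + y);
    }

    public static void main(String[] args) {
        Float[] a = {1f, 2f, 3f};
        Float[] b = {4f, 5f, 6f};
        System.out.println(dotProduct(a, b)); // prints 32.0
    }
}
```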
Runtime Code Generation
Deoptimisation Process
Vision in the Future
Opportunities for Specialisation
Setup
Workstation with the AMD SDK OpenCL driver
Black-Scholes problem
Comparison with:
Java Sequential: primitive data types
Java Objects: using Float and Tuples
Array Function: our API
Java threads
OpenCL GPU
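For readers unfamiliar with the benchmark, the sketch below prices a European call option with the standard Black-Scholes formula. It includes one loop over primitive `float[]` and one over boxed `Float[]`, loosely mirroring the "Java Sequential" and "Java Objects" baselines. This is not the AMD SDK kernel or the authors' benchmark code: the class name, method names and choice of parameters are assumptions, and the normal CDF uses the well-known Abramowitz-Stegun polynomial approximation.

```java
final class BlackScholesSketch {

    /** Standard normal CDF, Abramowitz-Stegun polynomial approximation. */
    static float cnd(float x) {
        final float a1 = 0.319381530f;
        final float a2 = -0.356563782f;
        final float a3 = 1.781477937f;
        final float a4 = -1.821255978f;
        final float a5 = 1.330274429f;
        float k = 1.0f / (1.0f + 0.2316419f * Math.abs(x));
        float poly = k * (a1 + k * (a2 + k * (a3 + k * (a4 + k * a5))));
        float tail = (float) (Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI)) * poly;
        return x >= 0 ? 1.0f - tail : tail;
    }

    /** Call price for spot s, strike k, rate r, volatility sigma, maturity t (years). */
    static float callPrice(float s, float k, float r, float sigma, float t) {
        float sqrtT = (float) Math.sqrt(t);
        float d1 = (float) ((Math.log(s / k) + (r + 0.5f * sigma * sigma) * t)
                / (sigma * sqrtT));
        float d2 = d1 - sigma * sqrtT;
        return s * cnd(d1) - k * (float) Math.exp(-r * t) * cnd(d2);
    }

    /** Primitive baseline: no boxing, one float per element. */
    static float[] pricePrimitive(float[] spot, float k, float r, float sigma, float t) {
        float[] out = new float[spot.length];
        for (int i = 0; i < spot.length; i++) {
            out[i] = callPrice(spot[i], k, r, sigma, t);
        }
        return out;
    }

    /** Object baseline: every element is a boxed Float, unboxed and re-boxed per call. */
    static Float[] priceBoxed(Float[] spot, Float k, Float r, Float sigma, Float t) {
        Float[] out = new Float[spot.length];
        for (int i = 0; i < spot.length; i++) {
            out[i] = callPrice(spot[i], k, r, sigma, t);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] prices = pricePrimitive(new float[] {90f, 100f, 110f}, 100f, 0.05f, 0.2f, 1f);
        for (float p : prices) {
            System.out.println(p);
        }
    }
}
```

Both loops compute the same values. The only difference is the element representation, which is what separates the two sequential baselines.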
Sequential Version
Black-Scholes (AMD version)
[Plot: runtime in milliseconds (axis ticks up to 1000) against input size (512 to 1M) for ArrayFunction API, Sequential J. Objects and Sequential J. Primitive]
Parallel Executions
Black-Scholes on AMD GPU and Intel 16 cores
[Plot: speedup (axis ticks 0 to 50) against input size (512 to 1M) for #32 Java Threads and OpenCL GPU]
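One plausible way to run a map on Java threads is to split the index range into contiguous chunks and give each chunk to its own thread. The sketch below shows that idea only. The slides do not say how the authors partition the work, so the class, the `FloatOp` interface and the chunking strategy are all assumptions.

```java
final class ThreadedMapSketch {

    /** Hypothetical primitive float-to-float function, avoiding boxing. */
    interface FloatOp {
        float apply(float x);
    }

    /** Apply f to every element of in, using nThreads worker threads. */
    static float[] parallelMap(float[] in, FloatOp f, int nThreads) {
        if (nThreads < 1) {
            throw new IllegalArgumentException("nThreads must be at least 1");
        }
        float[] out = new float[in.length];
        Thread[] workers = new Thread[nThreads];
        int chunk = (in.length + nThreads - 1) / nThreads; // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int start = Math.min(in.length, t * chunk);
            final int end = Math.min(in.length, start + chunk);
            // Each worker writes a disjoint slice of out, so no locking is needed.
            workers[t] = new Thread(() -> {
                for (int i = start; i < end; i++) {
                    out[i] = f.apply(in[i]);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join(); // join makes the workers' writes visible to this thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] in = new float[1000];
        for (int i = 0; i < in.length; i++) {
            in[i] = i;
        }
        float[] out = parallelMap(in, x -> 2f * x, 32);
        System.out.println(out[999]); // prints 1998.0
    }
}
```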
GPU execution time breakdown
Black-Scholes on AMD Tahiti
1M elements
Kernel execution workflow
.zip(Conclusions).map(Future)
Present
Java Array Programming API: a very high-level approach to using parallel patterns in heterogeneous systems
We have presented an early prototype of Map/Reduce using Graal, JDK 8 and OpenCL
Future
Runtime scheduling (where is the best place to run the code?)
Code generation for multiple devices
Specialised code generation at runtime can improve performance and portability
Thanks so much for your attention
This work was supported by a grant from:
Juan José Fumero
[email protected]