Introduction

Array API

Runtime Code Generation

Preliminary Results

Conclusions

A Composable Array Function Interface for Heterogeneous Computing in Java

Juan José Fumero (University of Edinburgh, UK)
Michel Steuwer (University of Münster, Germany)
Christophe Dubach (University of Edinburgh, UK)

ARRAY'14, 13.06.2014

Programming for Heterogeneous Computing

Previous Work

Embedded DSLs in high-level languages:
  PyCUDA, PyOpenCL, ...
  JOCL, JavaCL, JCuda, ...

Stream programming:
  IBM Liquid Metal: new operators for task and data parallelism
  Sumatra: Stream API based on JDK 8 for GPU array programming

API Description: Array Programming Interface

[Figure: class diagram showing Function, ArrayFunction, Map, Reduce and Zip]

Example - dotProduct

$\sum_{i=0}^{n-1} a_i * b_i$

    f = zip().map(x -> x._1 * x._2).reduce((x, y) -> x + y);
    Float[] a = new Float[N];
    Float[] b = new Float[N];
    Float[] result = f.apply(new Tuple(a, b));
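
To make the composition on this slide concrete, here is a minimal sequential sketch in plain Java 8. It is not the authors' implementation: `Pair`, `SeqArrayFunction` and `DotProductSketch` are hypothetical names invented for illustration, results are returned as a `List` rather than an array to avoid generic array creation, and everything runs on the CPU with no OpenCL code generation. It only shows how `zip`, `map` and `reduce` can be chained into a single reusable function object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical stand-in for the Tuple type used on the slide. */
final class Pair<A, B> {
    final A _1;
    final B _2;

    Pair(A first, B second) {
        this._1 = first;
        this._2 = second;
    }
}

/** Hypothetical, purely sequential array function: input of type I, output elements of type O. */
interface SeqArrayFunction<I, O> {
    List<O> apply(I input);

    /** Returns a new function that applies fn to every element produced by this one. */
    default <R> SeqArrayFunction<I, R> map(Function<O, R> fn) {
        return input -> {
            List<R> out = new ArrayList<>();
            for (O element : apply(input)) {
                out.add(fn.apply(element));
            }
            return out;
        };
    }

    /** Returns a new function that folds this function's output into at most one element. */
    default SeqArrayFunction<I, O> reduce(BinaryOperator<O> op) {
        return input -> {
            List<O> in = apply(input);
            if (in.isEmpty()) {
                return Collections.emptyList();
            }
            O acc = in.get(0);
            for (int i = 1; i < in.size(); i++) {
                acc = op.apply(acc, in.get(i));
            }
            return Collections.singletonList(acc);
        };
    }

    /** Pairs up the elements of two arrays, stopping at the shorter one. */
    static <A, B> SeqArrayFunction<Pair<A[], B[]>, Pair<A, B>> zip() {
        return input -> {
            int n = Math.min(input._1.length, input._2.length);
            List<Pair<A, B>> out = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                out.add(new Pair<>(input._1[i], input._2[i]));
            }
            return out;
        };
    }
}

final class DotProductSketch {
    /** Dot product of two non-empty arrays, built by composing zip, map and reduce. */
    static float dot(Float[] a, Float[] b) {
        SeqArrayFunction<Pair<Float[], Float[]>, Float> f =
            SeqArrayFunction.<Float, Float>zip()
                .map(x -> x._1 * x._2)
                .reduce((x, y) -> x + y);
        return f.apply(new Pair<>(a, b)).get(0);
    }

    public static void main(String[] args) {
        Float[] a = {1f, 2f, 3f};
        Float[] b = {4f, 5f, 6f};
        System.out.println(dot(a, b));
    }
}
```

The function `f` is built once and can be applied to any pair of arrays. Because the chain is a value and does not execute immediately, a runtime is free to inspect it before running it.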

Runtime Code Generation

Deoptimisation Process

Vision in the Future: Opportunities for Specialisation

Setup

Workstation with AMD SDK OpenCL driver

Benchmark: Black-Scholes

Comparison with:
  Java Sequential: primitive data types
  Java Objects: using Float and Tuples
  Array Function: our API
  Java threads
  OpenCL GPU
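
For readers unfamiliar with the benchmark, the following sketch shows the per-element work in a Black-Scholes computation, written with primitive types as in the "Java Sequential" variant. It is not the AMD SDK kernel or the authors' code: it is the standard closed-form formula for European call and put prices, with the cumulative normal distribution approximated by the Abramowitz-Stegun polynomial (formula 26.2.17). The class and method names are hypothetical, and `double` is used here for simplicity.

```java
final class BlackScholesSketch {
    /** Cumulative standard normal distribution, Abramowitz-Stegun 26.2.17 approximation. */
    static double cnd(double x) {
        final double a1 = 0.319381530;
        final double a2 = -0.356563782;
        final double a3 = 1.781477937;
        final double a4 = -1.821255978;
        final double a5 = 1.330274429;
        double k = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        // Horner evaluation of a1*k + a2*k^2 + ... + a5*k^5.
        double poly = k * (a1 + k * (a2 + k * (a3 + k * (a4 + k * a5))));
        // Upper tail probability for |x|.
        double tail = Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0.0 ? 1.0 - tail : tail;
    }

    /**
     * Prices one European option pair.
     * s: spot price, k: strike, r: risk-free rate, sigma: volatility, t: years to expiry.
     * Returns {call, put}.
     */
    static double[] price(double s, double k, double r, double sigma, double t) {
        double sqrtT = Math.sqrt(t);
        double d1 = (Math.log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrtT);
        double d2 = d1 - sigma * sqrtT;
        double discountedStrike = k * Math.exp(-r * t);
        double call = s * cnd(d1) - discountedStrike * cnd(d2);
        double put = discountedStrike * cnd(-d2) - s * cnd(-d1);
        return new double[] {call, put};
    }

    public static void main(String[] args) {
        double[] result = price(100.0, 100.0, 0.05, 0.2, 1.0);
        System.out.println("call = " + result[0] + ", put = " + result[1]);
    }
}
```

Each option is priced independently of the others, which is why the benchmark fits a map over the input array.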

Sequential Version

[Figure: Black-Scholes (AMD version), runtime in milliseconds versus input size (512 to 1M elements). Series: ArrayFunction API, Sequential J. Objects, Sequential J. Primitive]

Parallel Executions

[Figure: Black-Scholes on AMD GPU and Intel 16 cores, speedup versus input size (512 to 1M elements). Series: #32 Java Threads, OpenCL GPU]

GPU execution time breakdown

[Figure: Black-Scholes on AMD Tahiti, 1M elements; kernel execution workflow]

.zip(Conclusions).map(Future)

Present
  Java Array Programming API: a very high-level approach to using parallel patterns in heterogeneous systems
  We have presented an early prototype of Map/Reduce using Graal, JDK 8 and OpenCL

Future
  Runtime scheduling (where is the best place to run the code?)
  Code generation for multiple devices
  Specialised code generation at runtime can improve performance and portability

Thanks so much for your attention

This work was supported by a grant from:

Juan José Fumero
