A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms

Javier Cuenca, Luis P. García, Domingo Giménez

Parallel Computing Group University of Murcia, SPAIN

parallelum

Introduction

- The arrival of multicores: parallel computing more available to the scientific groups
- To describe parallelism to the compiler: OpenMP, a directive-based syntax
- Problem:
  - there are no standardized OpenMP compilers
  - number of threads → the performance greatly varies
- Our goal: a Poly-Compilation Engine (PCE)
  - generates different executables of each routine (one for each compiler in the system)
  - selects the executable which best fits the problem characteristics

ICCS 2010. Amsterdam

Outline

- Introduction
- Our approach: Poly-Compilation Engine (PCE)
  - PCB: Benchmarking Routines of the PCE
  - Proof of concept of the PCE
- Conclusions

Our approach: Poly-Compilation Engine (PCE)

[Figure: block diagram of the PCE, built up step by step over several slides. The BR source is compiled with Compiler A and Compiler B, giving BRA and BRB (the PCB), which yield SPA and SPB (the SP, plus other SP). The X source is compiled with Compiler A and Compiler B, giving XA and XB. The Decision Engine takes the SP, the problem size and the X model, and outputs the AP (version of X to use, number of threads, other AP).]


PCB: Benchmarking Routines of the PCE

To test the efficiency of OpenMP primitives for thread creation and management

- R-generate
  - Creating a series of threads with a fixed quantity of work to do per thread
  - To compare the time of creating and managing threads
- R-pfor
  - A simple for loop with a significant amount of work inside each iteration
  - To compare the time of dynamically distributing a set of homogeneous tasks
- R-barriers
  - A barrier primitive set after a parallel working area
  - To compare the times to perform a global synchronization of all the threads

PCB: Benchmarking Routines of the PCE

Experiments: multicore platforms + compilers:

- P2c: Intel Pentium, 2.8 GHz, with 2 cores. Compilers: icc 10.1 and gcc 4.3.2.
- A4c: Alpha EV68CB, 1 GHz, with 4 cores. Compilers: cc 6.3 and gcc 4.3.
- X4c: Intel Xeon, 3 GHz, with 4 cores. Compilers: icc 10.1 and gcc 4.2.3.
- X8c: Intel Xeon, 2 GHz, with 8 cores. Compilers: icc 10.1 and gcc 3.4.6.

PCB: Benchmarking Routines of the PCE
R-generate

[Figure: four panels, one per platform (P2c, A4c, X4c, X8c). Each panel plots time (seconds, log. scale) against the number of threads (1 to 20), with one curve per compiler: icc and gcc on P2c, X4c and X8c; cc and gcc on A4c.]

PCB: Benchmarking Routines of the PCE
R-generate: execution time model

- number of generated threads ≤ number of available cores

$$T_{\text{R-generate}} = P\,T_{gen} + N\,T_{work}$$

- number of generated threads > number of available cores

$$T_{\text{R-generate}} = P\,T_{gen} + N\,T_{work}\,\frac{P}{C}\left(1 + \frac{T_{sw}}{T_{cpu}}\right)$$
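The two cases of the R-generate model can be evaluated numerically. The following is a minimal Python sketch, assuming that P is the number of generated threads, C the number of available cores and N a factor scaling the work term; the slides do not spell these symbols out. The function name and the sample values are illustrative and are not taken from the presentation.

```python
def t_r_generate(p, c, n, t_gen, t_work, tsw_over_tcpu):
    """Modelled R-generate time, in the same unit as t_gen and t_work."""
    creation = p * t_gen   # thread creation term, P * Tgen
    work = n * t_work      # work term, N * Twork
    if p <= c:
        return creation + work
    # More threads than cores: the work term is scaled by (P/C) * (1 + Tsw/Tcpu).
    return creation + work * (p / c) * (1 + tsw_over_tcpu)


# Illustrative values only (assumptions, not measurements from the slides).
within = t_r_generate(p=2, c=4, n=10, t_gen=1.0, t_work=2.0, tsw_over_tcpu=0.5)
over = t_r_generate(p=8, c=4, n=10, t_gen=1.0, t_work=2.0, tsw_over_tcpu=0.5)
print(within, over)
```

With these sample values the oversubscribed call is the slower one, because the work term picks up both the P/C factor and the switching overhead.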

PCB: Benchmarking Routines of the PCE
R-pfor

[Figure: four panels, one per platform (P2c, A4c, X4c, X8c). Each panel plots time (seconds, log. scale) against the number of threads, with one curve per compiler: icc and gcc on P2c, X4c and X8c; cc and gcc on A4c.]

PCB: Benchmarking Routines of the PCE
R-pfor: execution time model

- number of generated threads ≤ number of available cores

$$T_{\text{R-pfor}} = P\,T_{gen} + \frac{N}{P}\,T_{work}$$

- number of generated threads > number of available cores

$$T_{\text{R-pfor}} = P\,T_{gen} + \frac{N}{C}\,T_{work}\left(1 + \frac{T_{sw}}{T_{cpu}}\right)$$
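The R-pfor model shows why the number of threads matters: the creation term grows with P while the work term shrinks only until P reaches C. A small Python sketch of that trade-off follows, assuming P is the number of threads, C the number of cores and N the number of loop iterations. The helper `best_threads` and all numeric values are hypothetical and serve only to illustrate how a model like this could be scanned for the cheapest thread count.

```python
def t_r_pfor(p, c, n, t_gen, t_work, tsw_over_tcpu):
    """Modelled R-pfor time, in the same unit as t_gen and t_work."""
    if p <= c:
        # Each thread gets N/P iterations and has a core to itself.
        return p * t_gen + (n / p) * t_work
    # More threads than cores: N/C iterations per core plus switching overhead.
    return p * t_gen + (n / c) * t_work * (1 + tsw_over_tcpu)


def best_threads(c, n, t_gen, t_work, tsw_over_tcpu, max_threads):
    """Thread count in 1..max_threads with the lowest modelled time."""
    return min(
        range(1, max_threads + 1),
        key=lambda p: t_r_pfor(p, c, n, t_gen, t_work, tsw_over_tcpu),
    )


# Illustrative values only (assumptions, not measurements from the slides).
best = best_threads(c=4, n=100, t_gen=1.0, t_work=1.0,
                    tsw_over_tcpu=0.5, max_threads=8)
print(best)
```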

PCB: Benchmarking Routines of the PCE
R-barriers

[Figure: four panels, one per platform (P2c, A4c, X4c, X8c). Each panel plots time (seconds, log. scale) against the number of threads, with one curve per compiler: icc and gcc on P2c, X4c and X8c; cc and gcc on A4c.]

PCB: Benchmarking Routines of the PCE
R-barriers: execution time model

- number of generated threads ≤ number of available cores

$$T_{\text{R-barriers}} = P\,T_{gen} + N\,T_{work} + P\,T_{syn}$$

- number of generated threads > number of available cores

$$T_{\text{R-barriers}} = P\,T_{gen} + N\,T_{work}\,\frac{P}{C}\left(1 + \frac{T_{sw}}{T_{cpu}}\right) + P\,T_{syn}$$
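One possible reading of the R-barriers model, not stated on the slides, is that it equals the R-generate model plus a synchronization term. Under the assumption that both routines are run with the same P, N and C, subtracting one model from the other isolates that term in either case (threads ≤ cores or threads > cores):

```latex
\begin{aligned}
T_{\text{R-barriers}} - T_{\text{R-generate}} &= P\,T_{syn} \\
T_{syn} &= \frac{T_{\text{R-barriers}} - T_{\text{R-generate}}}{P}
\end{aligned}
```

This is only a sketch of how a synchronization cost could be estimated from two measured times; the presentation does not say how the value was actually obtained.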


Proof of concept of the PCE: routine X

A simple example of how the PCE can work with a routine X:

1. Estimating the SP by means of the PCB.
2. Estimated SP + model of routine X → image of the behaviour of X.
3. Image of the behaviour of X → PCE decides the AP (to reduce its execution time).

[Figure: PCE block diagram. The BR source compiled with Compiler A and Compiler B gives BRA and BRB (the PCB), which yield SPA and SPB; the X source compiled with both compilers gives XA and XB; the Decision Engine combines the SP, the problem size and the X model to output the AP (version of X to use, number of threads, other AP).]
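The three steps can be sketched as a small search over candidate Algorithmic Parameters. The Python below is a hypothetical illustration, not the authors' implementation: `decide_ap`, `toy_model`, the compiler names and every numeric value are assumptions introduced here. It takes one set of System Parameters per compiler, evaluates a time model for each (compiler, threads) pair, and returns the pair with the lowest modelled time.

```python
from itertools import product


def decide_ap(sp_by_compiler, model, problem_size, thread_counts):
    """Return (modelled_time, compiler, threads) with the lowest modelled time."""
    candidates = (
        (model(sp, problem_size, p), name, p)
        for (name, sp), p in product(sp_by_compiler.items(), thread_counts)
    )
    return min(candidates)


def toy_model(sp, n, p):
    """Hypothetical time model: thread creation cost plus evenly shared work."""
    return p * sp["t_gen"] + n * sp["t_work"] / p


# Step 1 stand-in: SP values that a benchmark run might produce (made up here).
sp_by_compiler = {
    "compiler_A": {"t_gen": 75.0, "t_work": 2.0},
    "compiler_B": {"t_gen": 25.0, "t_work": 3.0},
}

# Steps 2 and 3: evaluate the model for every candidate and keep the cheapest.
predicted, compiler, threads = decide_ap(
    sp_by_compiler, toy_model, problem_size=1000, thread_counts=range(1, 9)
)
print(compiler, threads, predicted)
```

Passing the model as a callable keeps the search independent of any particular routine, so the same loop could be reused with a different execution time model.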

Proof of concept of the PCE: routine R-strassen
Execution time model

- number of generated threads ≤ number of available cores
- number of generated threads > number of available cores

Proof of concept of the PCE: routine R-strassen
SP values

Platform  Compiler  Tgen (µs)  Tsw/Tcpu      Tadd (ns)    Tmul (ps)
P2c       icc       75         2 + 0.01P     20 + 0.05P   400 + 100P
P2c       gcc       25         7 - 0.01P     20           400 + 0.1P
X4c       icc       75         0.9 + 0.3P    23 + 0.3P    140 + 10P
X4c       gcc       25         0.9 + 0.01P   30 - 0.3P    140 - P
A4c       cc        75         0.8 + 0.2P    40 + P       60
A4c       gcc       25         0.8 + 0.02P   40 - 0.1P    60 - 0.5P
X8c       icc       75         6 + 0.05P     10           100
X8c       gcc       25         0.5 + 0.01P   10           100
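Several SP entries are linear expressions in P, so their value depends on the thread count being considered. As a worked example, the Python sketch below stores the Tsw/Tcpu row as coefficient pairs and evaluates it for a given P. It assumes that P denotes the number of threads, as in the execution time models; the dictionary layout and the function name are illustrative choices, not part of the presentation.

```python
# Tsw/Tcpu row of the SP table, stored as (a, b) for the expression a + b*P.
TSW_OVER_TCPU = {
    ("P2c", "icc"): (2.0, 0.01),
    ("P2c", "gcc"): (7.0, -0.01),
    ("X4c", "icc"): (0.9, 0.3),
    ("X4c", "gcc"): (0.9, 0.01),
    ("A4c", "cc"): (0.8, 0.2),
    ("A4c", "gcc"): (0.8, 0.02),
    ("X8c", "icc"): (6.0, 0.05),
    ("X8c", "gcc"): (0.5, 0.01),
}


def tsw_over_tcpu(platform, compiler, p):
    """Evaluate the tabulated Tsw/Tcpu expression for p threads."""
    a, b = TSW_OVER_TCPU[(platform, compiler)]
    return a + b * p


# Print the ratio for every platform/compiler pair at P = 8 (an arbitrary choice).
for platform, compiler in TSW_OVER_TCPU:
    print(platform, compiler, round(tsw_over_tcpu(platform, compiler, 8), 2))
```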

Proof of concept of the PCE: routine R-strassen
Experimental vs. theoretical times

[Figure: four panels: P2c, P2c (theoretical), X4c and X4c (theoretical). Each panel plots time (seconds) against the thread configuration, labelled from 1x1 to 7x7, with one curve per compiler (icc and gcc).]

Proof of concept of the PCE: routine R-strassen
Experimental vs. theoretical times

[Figure: four panels: A4c, A4c (theoretical), X8c and X8c (theoretical). Each panel plots time (seconds) against the thread configuration, labelled from 1x1 to 7x7, with one curve per compiler: cc and gcc on A4c; icc and gcc on X8c.]

Proof of concept of the PCE: routine R-strassen
Taking decisions: AP values

                                     P2c   X4c   A4c   X8c
Compiled version                     gcc   gcc   gcc   gcc
Number of threads for the 1st level  7     4     4     7
Number of threads for the 2nd level  7     1     1     2

Algorithmic Parameters values, for the R-strassen routine, taken by the Poly-Compilation Engine for different platforms and compilers (Problem size=1000).

Proof of concept of the PCE: routine R-strassen
Executing the routine

      P2c    X4c    A4c    X8c
PCE   1.19   0.50   0.49   0.16
ORA   1.17   0.49   0.45   0.11
HW    1.37   0.55   0.65   0.12
SW    1.22   1.31   1.20   0.32

Execution times of the R-strassen routine obtained with different Algorithmic Parameters values sets (times in seconds). Problem size=100
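To make the comparison between rows easier to read, the times can be expressed relative to the PCE row. The Python sketch below only performs that arithmetic on the tabulated values. The slides do not expand the row labels ORA, HW and SW, so no interpretation of them is assumed here, and the function name is an illustrative choice.

```python
PLATFORMS = ["P2c", "X4c", "A4c", "X8c"]

# Execution times in seconds, copied from the table.
TIMES = {
    "PCE": [1.19, 0.50, 0.49, 0.16],
    "ORA": [1.17, 0.49, 0.45, 0.11],
    "HW": [1.37, 0.55, 0.65, 0.12],
    "SW": [1.22, 1.31, 1.20, 0.32],
}


def ratio_to_pce(row):
    """Per-platform ratio row_time / PCE_time (above 1 means PCE is faster)."""
    return {
        platform: t / pce
        for platform, t, pce in zip(PLATFORMS, TIMES[row], TIMES["PCE"])
    }


for row in ("ORA", "HW", "SW"):
    print(row, {plat: round(r, 2) for plat, r in ratio_to_pce(row).items()})
```

A ratio below 1 marks a platform where that row's parameter set ran faster than the one chosen by the PCE.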


Conclusions

- A good choice of compiler on a multicore system helps to accelerate the solution of problems.
- The best compiler depends on: the type of routine, the number of threads, the problem size, ...
- Working on the design and implementation of a complete Poly-Compilation Engine (PCE), which:
  - Calculates the basic System Parameters (SP) for each platform-compiler pair.
  - Selects the Algorithmic Parameters (the compiled version and the number of threads) by means of the theoretical model of the execution time of the routine, using the measured SP.
- The behaviour of a PCE prototype has been shown with a routine.
- Similar studies with other linear algebra routines are being performed.
- Extending to heterogeneous platforms.