A Proposal for Autotuning Linear Algebra Routines

A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms Javier Cuenca, Luis P. García Domingo Giménez

Parallel Computing Group University of Murcia, SPAIN

parallelum

Introduction parallelum z

z

z

The arrival of multicores: parallel computing more available to the scientific groups To describe parallelism to the compiler: OpenMP, a directive-based syntax Problem: z z

z

there are no standardized OpenMP compilers number of threads Æ the performance greatly varies

Our goal: a Poly-Compilation Engine (PCE) z

z

ICCS 2010. Amsterdam

generates different executables of each routine (one for each compiler in the system) selects the executable which best fits the problem characteristics

A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms

2

Outline parallelum

z

Introduction

z

Our approach: Poly-Compilation Engine (PCE) z z

z

PCB: Benchmarking Routines of the PCE Proof of concept of the PCE

Conclusions



3

Our approach: Poly-Compilation Engine (PCE) parallelum

BR source

X source

PCE ICCS 2010. Amsterdam


4

Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A

BR source Compiler B

BRA BRB PCB

X source



5



BRA

SPA

BRB

SPB

PCB

other SP

SP

X source



6



BRA

SPA

BRB

SPB

PCB

other SP

SP

Compiler A

X source Compiler

XA XB

PCE

B



7



BRA

SPA

BRB

SPB

PCB

other SP

SP

Compiler A

X source Compiler

XA

Decision

XB

Engine

PCE

B



8



BRA

SPA

BRB

SPB

PCB

other SP

SP

Compiler A

X source Compiler

XA

Decision

problem size

XB

Engine

X model

PCE

B



9



BRA

SPA

BRB

SPB

PCB

other SP

SP

Compiler A

X source Compiler

XA

Decision

problem size

XB

Engine

X model

PCE

B

AP


(version of X to use, number of threads, other AP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms

10

Outline parallelum

z

Introduction

z


z


Conclusions



11

PCB: Benchmarking Routines of the PCE parallelum

To test the efficiency of OpenMP primitives for threads creation and management

z

z

R-generate z z

z

R-pfor z z

z

A simple for loop where there is a signifficant work inside each iteration To compare the time of distributing dynamically a set of homogeneous tasks

R-barriers z z


Creating a series of threads with a fixed quantity of work to do per thread To compare the time of creating and managing threads

A barrier primitive set after a parallel working area To compare the times to perform a global synchronization of all the threads


12

PCB: Benchmarking Routines of the PCE parallelum z

Experiments: Multicore platforms + compilers: z

z

z

z


P2c z Intel Pentium, 2.8 GHz, with 2 cores. z Compilers: icc 10.1 and gcc 4.3.2. A4c z Alpha EV68CB, 1 GHz, with 4 cores. z Compilers: cc 6.3 and gcc 4.3. X4c z Intel Xeon, 3 GHz, with 4 cores. z Compilers: icc 10.1 and gcc 4.2.3. X8c z Intel Xeon, 2 GHz, with 8 cores. z Compilers: icc 10.1 and gcc 3.4.6


13

PCB: Benchmarking Routines of the PCE R-generate parallelum P2c

A4c

icc gcc

cc gcc

1 1 2

3 4

5 6

7 8

9 10 11 12 13 14 15 16 17 18 19 20

0,1

19

17

15

13

9

11

7

5

3

0,1

time (seconds) (log. scale)


1

1

0,01 0,001

0,01

0,001

0,0001 0,00001

0,0001

threads

X4c

threads

X8c

icc gcc

1,00000

1 1 2 3 4 5

6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0,1



0,10000 0,01000 0,00100 0,00010

0,01 0,001 0,0001 0,00001 0,000001

0,00001 threads


icc gcc

threads


14

PCB: Benchmarking Routines of the PCE R-generate Execution time model z

parallelum

number of generated threads ≤ number of available cores

TR − generate = PTgen + NTwork z

number of generated threads > number of available cores

TR − generate = PTgen + NTwork


Tsw P ( 1+ C Tcpu


)

15

PCB: Benchmarking Routines of the PCE R-pfor parallelum P2c

A4c

icc gcc

cc gcc

100 19

10

19

17

15

0,1

13

9

11

7

5

3

1

1



17

15

13

9

11

7

5

3

1

1

0,01 0,001

0,1

threads

threads

X4c

X8c

icc gcc

icc gcc

0,01

19

17

15

13

0,01 0,001 0,0001 0,00001

0,001 threads


9

7

5

0,1

11

0,1

3

1

19

17

15

13

11

9

7

5

3

1



1

1


threads

16

PCB: Benchmarking Routines of the PCE R-pfor Execution time model z


TR − pfor z

parallelum

NT = PTgen + Twork P


TR − pfor


⎛ Tsw ⎞ NT ⎟ Twork ⎜1 + = PTgen + ⎜ T ⎟ C cpu ⎠ ⎝


17

PCB: Benchmarking Routines of the PCE R-barriers parallelum P2c

A4c

icc gcc 100



1

0,1

0,01

10

1

0,1

0,01

0,001

threads

threads

X8c

icc gcc

100

1000

10

100



X4c

1 0,1 0,01

icc gcc

10 1 0,1 0,01

0,001

threads


cc gcc


threads

18

PCB: Benchmarking Routines of the PCE R-barriers Execution time model z

parallelum


TR −barriers = PTgen + NTwork + PTsyn z


TR −barriers = PTgen + NTwork


P ⎛⎜ Tsw ⎞⎟ 1+ + PTsyn C ⎜⎝ Tcpu ⎟⎠


19

Outline parallelum

z

Introduction

z


z


Conclusions



20

Proof of concept of the PCE parallelum Compiler A


BRA

SPA

BRB

SPB

PCB

other SP

SP

Compiler A

X source Compiler

XA

Decision

problem size

XB

Engine

X model

PCE

B

AP


(version of X to use, number of threads, other SP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms

21

Proof of concept of the PCE routine X parallelum Compiler A

BRA

SPA

other SP with a routine X BR SP zz AAsimple example of how the PCE can work B B simple example of how the PCE can work with a routine X Compiler 1.1. estimating estimatingthe theSP SPby bymeans meansof ofthe thePCB PCB SP B PCB 2.2. estimated SP + model of routine XÆ estimated SP + model of routine XÆimage imageof ofthe thebehaviour behaviourof ofXX 3.3. Compiler image imageof ofthe thebehaviour behaviourof ofXX Æ ÆPCE PCEdecides decidesthe theAP AP(to (toreduce reduceits its execution time) A execution time) problem size Decision XA X source Engine X model XB BR source

PCE

Compiler B

AP


(version of X to use, number of threads, other SP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms

22

Proof of concept of the PCE routine R-strassen Execution time model

parallelum

z


z




23

Proof of concept of the PCE routine R-strassen SP values P2c

parallelum

X4c

A4c

X8c

icc

gcc

icc

gcc

cc

gcc

icc

gcc

Tgen (µs)

75

25

75

25

75

25

75

25

Tsw/ Tcpu

2 + 0.01P

7 - 0.01P

0.9 + 0.3P

0.9 + 0.01P

0.8 + 0.2P

0.8 + 0.02P

6 + 0.05P

0.5 + 0.01P

Tadd (ns)

20 + 0.05 P

20

23 + 0.3P

30 – 0.3P

40 + P

40 – 0.1P

10

10

Tmul (ps)

400 + 100P

400 + 0.1P

140 + 10P

140 – P

60

60 – 0.5P

100

100



24

Proof of concept of the PCE routine R-strassen Experimental Vs. Theoretical times icc

P2c

P2c (theoretical)

2

Time (seconds)

2 1,5 1 0,5

1,5 1 0,5 0 1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7

7x 7

7x 4

7x 1

6x 5

6x 2

5x 6

5x 3

4x 7

4x 4

4x 1

3x 5

3x 2

2x 6

2x 3

1x 7

1x 4

1x 1

0

Threads

Threads

icc

X4c

X4c (theoretical) 2,5

Time (seconds)

2,00 1,80 1,60 1,40 1,20 1,00 0,80 0,60 0,40 0,20 0,00

2 1,5 1 0,5

7x 7

7x 1

7x 4

6x 5

6x 2

5x 6

4x 7

5x 3

4x 4

4x 1

3x 5

3x 2

2x 6

2x 3

1x 7

0

1x 4

1x 1

icc gcc

gcc Time (seconds)

icc gcc

gcc

2,5

Time (seconds)

parallelum

1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7

Threads



Threads

25

Proof of concept of the PCE routine R-strassen Experimental Vs. Theoretical times

parallelum

cc

A4c

A4c (theoretical)

gcc

gcc 4 3,5

2

Time (seconds)

Time (seconds)

2,5

1,5 1 0,5

2 1,5 1

7x 7

7x 4

7x 1

6x 5

6x 2

5x 6

5x 3

4x 7

4x 4

4x 1

3x 5

3x 2

2x 6

2x 3

1x 7

0 1x 4

1x 1

3 2,5

0,5

0


Threads

Threads

icc

X8c

X8c (theoretical)

gcc

Time (seconds)

0,5 0,4 0,3 0,2 0,1

0,3 0,25 0,2 0,15 0,1 0,05 0

7x 7

7x 4

7x 1

6x 5

5x 6

6x 2

5x 3

4x 7

4x 4

4x 1

3x 5

3x 2

2x 6

2x 3

1x 7

1x 4

0 1x 1

icc gcc

0,35

0,6

Time (seconds)

cc


Threads



Threads

26

Proof of concept of the PCE routine R-strassen taking decisions: AP values

parallelum

P2c

X4c

A4c

X8c

gcc

gcc

gcc

gcc

Number of threads for the 1st level

7

4

4

7

Number of threads for the 2nd level

7

1

1

2

Compiled version

Algorithmic Parameters values, for the R-strassen routine, taken by the PolyCompilation Engine for different platforms and compilers (Problem size=1000).



27

Proof of concept of the PCE routine R-strassen executing the routine

parallelum

P2c

X4c

A4c

X8c

PCE

1.19

0.50

0.49

0.16

ORA

1.17

0.49

0.45

0.11

HW

1.37

0.55

0.65

0.12

SW

1.22

1.31

1.20

0.32

Execution times of the R-strassen routine obtained with different Algorithmic Parameters values sets (times in seconds). Problem size=100



28

Outline parallelum

z

Introduction

z


z


Conclusions



29

Conclusions parallelum z

z

z

z z z

A good choice of compiler in a multicore contributes to accelerating the solution of problems Best compiler depends on: the type of routine, the number of threads, the problem size,... Working on the design and implementation of a complete PolyCompilation Engine (PCE). z Calculates the basic System Parameters (SP) for platform-compiler. z Selects the Algorithmic Parameters (the compiled version and the number of threads), by means of the theoretical model of the execution time of the routine with the SP measurement The behaviour of a PCE prototype has been shown with a routine Similar studies with other linear algebra routines are being performed. Extending to heterogeneous platforms



30

A Proposal for Autotuning Linear Algebra Routines

A Proposal for Autotuning Linear Algebra Routines

Suggest Documents

performance evaluation of linear algebra routines

Implementing Linear Algebra Routines on Multi-Core ... - The Netlib

Sage for Linear Algebra - A First Course in Linear Algebra

Applied Linear Algebra de Boor Book proposal for Springer

Linear Algebra

LINEAR ALGEBRA

Linear Algebra

Linear Algebra

Linear Algebra

Linear Algebra

Linear Algebra

Linear Algebra

Linear Algebra

Linear Algebra

ScaLAPACK: A Linear Algebra Library for Message

A Glossary for Elementary Linear Algebra

GLOSSARY: A DICTIONARY FOR LINEAR ALGEBRA

Templates for Linear Algebra Problems

Templates for Linear Algebra Problems

Appendix A: Linear Algebra: Vectors

1 SUB THEME PROPOSAL: ROUTINES, TRANSFER AND ...

1 SUB THEME PROPOSAL: ROUTINES, TRANSFER AND ...

Linear Equations in Linear Algebra

Math 511 Linear Algebra with Applications. Text: Linear Algebra and ...