6x2. 6x5. 7x1. 7x4. 7x7. Threads. Time. (s ec on ds) icc gcc. P2c (theoretical). 0. 0,5. 1. 1,5. 2. 1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 ...
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms Javier Cuenca, Luis P. García Domingo Giménez
Parallel Computing Group University of Murcia, SPAIN
parallelum
Introduction parallelum z
z
z
The arrival of multicores: parallel computing more available to the scientific groups To describe parallelism to the compiler: OpenMP, a directive-based syntax Problem: z z
z
there are no standardized OpenMP compilers number of threads Æ the performance greatly varies
Our goal: a Poly-Compilation Engine (PCE) z
z
ICCS 2010. Amsterdam
generates different executables of each routine (one for each compiler in the system) selects the executable which best fits the problem characteristics
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
2
Outline parallelum
z
Introduction
z
Our approach: Poly-Compilation Engine (PCE) z z
z
PCB: Benchmarking Routines of the PCE Proof of concept of the PCE
Conclusions
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
3
Our approach: Poly-Compilation Engine (PCE) parallelum
BR source
X source
PCE ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
4
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA BRB PCB
X source
PCE ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
5
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
X source
PCE ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
6
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
Compiler A
X source Compiler
XA XB
PCE
B
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
7
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
Compiler A
X source Compiler
XA
Decision
XB
Engine
PCE
B
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
8
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
Compiler A
X source Compiler
XA
Decision
problem size
XB
Engine
X model
PCE
B
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
9
Our approach: Poly-Compilation Engine (PCE) parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
Compiler A
X source Compiler
XA
Decision
problem size
XB
Engine
X model
PCE
B
AP
ICCS 2010. Amsterdam
(version of X to use, number of threads, other AP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
10
Outline parallelum
z
Introduction
z
Our approach: Poly-Compilation Engine (PCE) z z
z
PCB: Benchmarking Routines of the PCE Proof of concept of the PCE
Conclusions
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
11
PCB: Benchmarking Routines of the PCE parallelum
To test the efficiency of OpenMP primitives for threads creation and management
z
z
R-generate z z
z
R-pfor z z
z
A simple for loop where there is a signifficant work inside each iteration To compare the time of distributing dynamically a set of homogeneous tasks
R-barriers z z
ICCS 2010. Amsterdam
Creating a series of threads with a fixed quantity of work to do per thread To compare the time of creating and managing threads
A barrier primitive set after a parallel working area To compare the times to perform a global synchronization of all the threads
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
12
PCB: Benchmarking Routines of the PCE parallelum z
Experiments: Multicore platforms + compilers: z
z
z
z
ICCS 2010. Amsterdam
P2c z Intel Pentium, 2.8 GHz, with 2 cores. z Compilers: icc 10.1 and gcc 4.3.2. A4c z Alpha EV68CB, 1 GHz, with 4 cores. z Compilers: cc 6.3 and gcc 4.3. X4c z Intel Xeon, 3 GHz, with 4 cores. z Compilers: icc 10.1 and gcc 4.2.3. X8c z Intel Xeon, 2 GHz, with 8 cores. z Compilers: icc 10.1 and gcc 3.4.6
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
13
PCB: Benchmarking Routines of the PCE R-generate parallelum P2c
A4c
icc gcc
cc gcc
1 1 2
3 4
5 6
7 8
9 10 11 12 13 14 15 16 17 18 19 20
0,1
19
17
15
13
9
11
7
5
3
0,1
time (seconds) (log. scale)
time (seconds) (log. scale)
1
1
0,01 0,001
0,01
0,001
0,0001 0,00001
0,0001
threads
X4c
threads
X8c
icc gcc
1,00000
1 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0,1
time (seconds) (log. scale)
time (seconds) (log. scale)
0,10000 0,01000 0,00100 0,00010
0,01 0,001 0,0001 0,00001 0,000001
0,00001 threads
ICCS 2010. Amsterdam
icc gcc
threads
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
14
PCB: Benchmarking Routines of the PCE R-generate Execution time model z
parallelum
number of generated threads ≤ number of available cores
TR − generate = PTgen + NTwork z
number of generated threads > number of available cores
TR − generate = PTgen + NTwork
ICCS 2010. Amsterdam
Tsw P ( 1+ C Tcpu
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
)
15
PCB: Benchmarking Routines of the PCE R-pfor parallelum P2c
A4c
icc gcc
cc gcc
100 19
10
19
17
15
0,1
13
9
11
7
5
3
1
1
time (seconds) (log. scale)
time (seconds) (log. scale)
17
15
13
9
11
7
5
3
1
1
0,01 0,001
0,1
threads
threads
X4c
X8c
icc gcc
icc gcc
0,01
19
17
15
13
0,01 0,001 0,0001 0,00001
0,001 threads
ICCS 2010. Amsterdam
9
7
5
0,1
11
0,1
3
1
19
17
15
13
11
9
7
5
3
1
time (seconds) (log. scale)
time (seconds) (log. scale)
1
1
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
threads
16
PCB: Benchmarking Routines of the PCE R-pfor Execution time model z
number of generated threads ≤ number of available cores
TR − pfor z
parallelum
NT = PTgen + Twork P
number of generated threads > number of available cores
TR − pfor
ICCS 2010. Amsterdam
⎛ Tsw ⎞ NT ⎟ Twork ⎜1 + = PTgen + ⎜ T ⎟ C cpu ⎠ ⎝
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
17
PCB: Benchmarking Routines of the PCE R-barriers parallelum P2c
A4c
icc gcc 100
time (seconds) (log. scale)
time (seconds) (log. scale)
1
0,1
0,01
10
1
0,1
0,01
0,001
threads
threads
X8c
icc gcc
100
1000
10
100
time (seconds) (log. scale)
time (seconds) (log. scale)
X4c
1 0,1 0,01
icc gcc
10 1 0,1 0,01
0,001
threads
ICCS 2010. Amsterdam
cc gcc
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
threads
18
PCB: Benchmarking Routines of the PCE R-barriers Execution time model z
parallelum
number of generated threads ≤ number of available cores
TR −barriers = PTgen + NTwork + PTsyn z
number of generated threads > number of available cores
TR −barriers = PTgen + NTwork
ICCS 2010. Amsterdam
P ⎛⎜ Tsw ⎞⎟ 1+ + PTsyn C ⎜⎝ Tcpu ⎟⎠
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
19
Outline parallelum
z
Introduction
z
Our approach: Poly-Compilation Engine (PCE) z z
z
PCB: Benchmarking Routines of the PCE Proof of concept of the PCE
Conclusions
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
20
Proof of concept of the PCE parallelum Compiler A
BR source Compiler B
BRA
SPA
BRB
SPB
PCB
other SP
SP
Compiler A
X source Compiler
XA
Decision
problem size
XB
Engine
X model
PCE
B
AP
ICCS 2010. Amsterdam
(version of X to use, number of threads, other SP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
21
Proof of concept of the PCE routine X parallelum Compiler A
BRA
SPA
other SP with a routine X BR SP zz AAsimple example of how the PCE can work B B simple example of how the PCE can work with a routine X Compiler 1.1. estimating estimatingthe theSP SPby bymeans meansof ofthe thePCB PCB SP B PCB 2.2. estimated SP + model of routine XÆ estimated SP + model of routine XÆimage imageof ofthe thebehaviour behaviourof ofXX 3.3. Compiler image imageof ofthe thebehaviour behaviourof ofXX Æ ÆPCE PCEdecides decidesthe theAP AP(to (toreduce reduceits its execution time) A execution time) problem size Decision XA X source Engine X model XB BR source
PCE
Compiler B
AP
ICCS 2010. Amsterdam
(version of X to use, number of threads, other SP) A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
22
Proof of concept of the PCE routine R-strassen Execution time model
parallelum
z
number of generated threads ≤ number of available cores
z
number of generated threads > number of available cores
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
23
Proof of concept of the PCE routine R-strassen SP values P2c
parallelum
X4c
A4c
X8c
icc
gcc
icc
gcc
cc
gcc
icc
gcc
Tgen (µs)
75
25
75
25
75
25
75
25
Tsw/ Tcpu
2 + 0.01P
7 - 0.01P
0.9 + 0.3P
0.9 + 0.01P
0.8 + 0.2P
0.8 + 0.02P
6 + 0.05P
0.5 + 0.01P
Tadd (ns)
20 + 0.05 P
20
23 + 0.3P
30 – 0.3P
40 + P
40 – 0.1P
10
10
Tmul (ps)
400 + 100P
400 + 0.1P
140 + 10P
140 – P
60
60 – 0.5P
100
100
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
24
Proof of concept of the PCE routine R-strassen Experimental Vs. Theoretical times icc
P2c
P2c (theoretical)
2
Time (seconds)
2 1,5 1 0,5
1,5 1 0,5 0 1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7
7x 7
7x 4
7x 1
6x 5
6x 2
5x 6
5x 3
4x 7
4x 4
4x 1
3x 5
3x 2
2x 6
2x 3
1x 7
1x 4
1x 1
0
Threads
Threads
icc
X4c
X4c (theoretical) 2,5
Time (seconds)
2,00 1,80 1,60 1,40 1,20 1,00 0,80 0,60 0,40 0,20 0,00
2 1,5 1 0,5
7x 7
7x 1
7x 4
6x 5
6x 2
5x 6
4x 7
5x 3
4x 4
4x 1
3x 5
3x 2
2x 6
2x 3
1x 7
0
1x 4
1x 1
icc gcc
gcc Time (seconds)
icc gcc
gcc
2,5
Time (seconds)
parallelum
1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7
Threads
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
Threads
25
Proof of concept of the PCE routine R-strassen Experimental Vs. Theoretical times
parallelum
cc
A4c
A4c (theoretical)
gcc
gcc 4 3,5
2
Time (seconds)
Time (seconds)
2,5
1,5 1 0,5
2 1,5 1
7x 7
7x 4
7x 1
6x 5
6x 2
5x 6
5x 3
4x 7
4x 4
4x 1
3x 5
3x 2
2x 6
2x 3
1x 7
0 1x 4
1x 1
3 2,5
0,5
0
1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7
Threads
Threads
icc
X8c
X8c (theoretical)
gcc
Time (seconds)
0,5 0,4 0,3 0,2 0,1
0,3 0,25 0,2 0,15 0,1 0,05 0
7x 7
7x 4
7x 1
6x 5
5x 6
6x 2
5x 3
4x 7
4x 4
4x 1
3x 5
3x 2
2x 6
2x 3
1x 7
1x 4
0 1x 1
icc gcc
0,35
0,6
Time (seconds)
cc
1x1 1x4 1x7 2x3 2x6 3x2 3x5 4x1 4x4 4x7 5x3 5x6 6x2 6x5 7x1 7x4 7x7
Threads
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
Threads
26
Proof of concept of the PCE routine R-strassen taking decisions: AP values
parallelum
P2c
X4c
A4c
X8c
gcc
gcc
gcc
gcc
Number of threads for the 1st level
7
4
4
7
Number of threads for the 2nd level
7
1
1
2
Compiled version
Algorithmic Parameters values, for the R-strassen routine, taken by the PolyCompilation Engine for different platforms and compilers (Problem size=1000).
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
27
Proof of concept of the PCE routine R-strassen executing the routine
parallelum
P2c
X4c
A4c
X8c
PCE
1.19
0.50
0.49
0.16
ORA
1.17
0.49
0.45
0.11
HW
1.37
0.55
0.65
0.12
SW
1.22
1.31
1.20
0.32
Execution times of the R-strassen routine obtained with different Algorithmic Parameters values sets (times in seconds). Problem size=100
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
28
Outline parallelum
z
Introduction
z
Our approach: Poly-Compilation Engine (PCE) z z
z
PCB: Benchmarking Routines of the PCE Proof of concept of the PCE
Conclusions
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
29
Conclusions parallelum z
z
z
z z z
A good choice of compiler in a multicore contributes to accelerating the solution of problems Best compiler depends on: the type of routine, the number of threads, the problem size,... Working on the design and implementation of a complete PolyCompilation Engine (PCE). z Calculates the basic System Parameters (SP) for platform-compiler. z Selects the Algorithmic Parameters (the compiled version and the number of threads), by means of the theoretical model of the execution time of the routine with the SP measurement The behaviour of a PCE prototype has been shown with a routine Similar studies with other linear algebra routines are being performed. Extending to heterogeneous platforms
ICCS 2010. Amsterdam
A Proposal for Autotuning Linear Algebra Routines on Multicore Platforms
30