Performance of Cluster-enabled OpenMP for the SCASH Software ...

Performance of Cluster-enabled OpenMP for the SCASH Software Distributed Shared Memory System

Yoshinori Ojima 1), Mitsuhisa Sato 1), Hiroshi Harada 2), Yutaka Ishikawa 3)
1) Information Science and Electronics, University of Tsukuba
2) Hewlett-Packard Japan, Ltd.
3) The University of Tokyo

Overview
• Background, objective
• Software DSM system SCASH
• Omni/SCASH: OpenMP implementation for software DSM
• Comparison of the basic performance of Myrinet and Ethernet
• Performance evaluation, discussion
• Conclusion, future work

Background
• PC clusters have become a popular parallel computing platform.
• Distributed memory programming
  – Message passing libraries (MPI, PVM)
  – Programming cost is large

• Shared memory programming
  – Programmers can parallelize easily with OpenMP
  – Programming cost is small

Omni/SCASH: OpenMP for a Software Distributed Shared Memory (DSM) system. This is a cluster-enabled OpenMP.

Objectives of this research
• To evaluate the performance of Omni/SCASH
• To investigate which performance factors depend on the communication performance of the network
• To investigate the problems of using a commodity network (Ethernet) as well as Myrinet

Software DSM System SCASH
• A software DSM system provided as part of the SCore cluster system software
• Uses the PM communication library; implemented as a user-level library
• Page-based consistency (using kernel page faults)
• Two page consistency protocols
  – invalidate and update

• Eager Release Consistency (ERC) memory model with a multiple-writer protocol (diff)
• Consistency maintenance communication occurs at synchronization points

Omni/SCASH
• OpenMP compiler for SCASH
  – It translates OpenMP programs into multithreaded programs linked to the SCASH runtime library, which has a "shmem" memory model

OpenMP memory model:
• All variables are shared by default
• No explicit shared memory allocation

"shmem" memory model:
• All variables declared statically in global scope are private
• The shared address space must be allocated by a library function at runtime

Omni OpenMP Compiler

Transformation for the "shmem" model
• OpenMP compiler for the "shmem" memory model
  – Detects references to shared data objects
  – Rewrites them into references to objects allocated in the shared address space at runtime
  – A global variable declaration becomes a declaration of a pointer that will point into a shared object at runtime

Original code:

  double x;
  double a[100];
  …
  a[10] = x;

Translated code:

  double *_G_x;
  double *_G_a;
  …
  (_G_a)[10] = (*_G_x);
  …
  static _G_DATA_INIT() {
    _shm_data_init(&_G_x, 8, 0);
    _shm_data_init(&_G_a, 800, 0);
  }

Extensions to OpenMP in Omni/SCASH
• In software DSMs, the allocation of pages to home nodes greatly affects performance
• OpenMP
  – does not provide facilities for specifying how data is to be arranged within the memory space
  – has no loop scheduling method to align iterations with the placement of the data they access

Extension to OpenMP

An example of the extensions
• Data mapping directive

  double a[100][200];
  #pragma omni mapping(a[block][*])

• Loop scheduling clause, "affinity"

  #pragma omp for schedule(affinity, a[i][*])
  for (i = 1; i < 99; i++)
    for (j = 0; j < 200; j++)
      a[i][j] = ...;

Comparison of basic network performance

Basic performance of networks
• Comparison of the basic communication performance of Ethernet and Myrinet
  – Page transmission cost
  – Overhead of the barrier operation

• Measurement conditions
  – Programs are parallelized with OpenMP
  – 1 processor per node is used

Evaluation platform
• PC cluster "COSMO"

  CPU       Pentium II Xeon 450MHz (4-way SMP)
  L2 Cache  1MB
  Memory    2GB
  Nodes     8
  Network   800Mbps Myrinet, 100Base-TX Ethernet
  OS        Linux kernel 2.4.18
  SCore     version 5.0.1
  Compiler  gcc 2.96 (optimization option -O4)

Page transmission cost
• One-dimensional array of 32KB (8 pages)
• The master thread writes to every element in the array, which is followed by a barrier point
• At the barrier point, modified data is copied back to its home nodes
• Execution time is measured from the beginning of the array write operation to completion of the barrier operation

[Figure: the master thread writes a 4-byte × 8K-element array (8 pages), distributed one page per node]

[Figure: page transmission cost. X axis: range to write (index, up to 8000); Y axis: elapsed time (msec); Ethernet vs Myrinet. The gap for one step is the page transmission cost: Myrinet ~0.24 ms/page, Ethernet ~0.7 ms/page.]

Overhead of barrier operation
• A two-dimensional array of 256KB (8 pages × 8) is allocated in the shared memory space
• It is mapped with block distribution
• Slave threads read every element mapped to the master thread, which is followed by a barrier point
• At the barrier point, no consistency maintenance communication occurs

[Figure: overhead of the barrier operation. X axis: range to read (index); Y axis: elapsed time (msec). "Barrier" measures from the beginning of the read to completion of the barrier operation; "No-barrier" measures to completion of the read operation. Barrier overhead: Myrinet ~0.7 ms, Ethernet ~9 ms.]

Performance Evaluation

Performance Evaluation
• laplace
  – A simple Laplace equation solver using a Jacobi 5-point stencil operation
  – Written in C
  – Two versions
    • Parallelized with OpenMP (OpenMP version)
    • Written using the SCASH library (SCASH version)

  – The array size is 1024×1024 (double precision)
  – The number of iterations is 50
• NPB EP
  – Fortran, parallelized with OpenMP (by RWC), Class A
• NPB BT, SP
  – Fortran, parallelized with OpenMP, Class A
  – Affinity scheduling applied

Laplace results

[Figure: two charts, SCASH version and OpenMP version, plotting elapsed time (sec) against the number of processors (1, 2, 4, 8, 16) for 1-way/2-way nodes on Ethernet and Myrinet. Speedup with 16 processors: Myrinet ~13.3 times, Ethernet ~3.0 times.]


Application of affinity scheduling
• Laplace (OpenMP version)
  – A mismatch exists between the alignment of data and loops, so the number of page faults increases
  – Apply affinity scheduling to match the alignment of data and loops

[Figure: performance improvement with affinity scheduling (Ethernet): elapsed time (sec) vs number of processors (1, 2, 4, 8, 16) for 1-way/2-way, normal vs affinity scheduling. Improvement of ~10%.]


NPB EP results

[Figure: two charts (Myrinet, Ethernet) plotting performance (Mops) against the number of processors (up to 16) for OMP (1-way/2-way) and MPI (1-way/2-way). Speedup: Myrinet ~11.6 times, Ethernet ~7.2 times.]

NPB SP results

[Figure: two charts (Myrinet, Ethernet) plotting performance (Mops) against the number of processors for OMP (1-way/2-way) and MPI. Speedup: Myrinet ~6.2 times; Ethernet: no speedup.]

NPB BT results

[Figure: two charts (Myrinet, Ethernet) plotting performance (Mops) against the number of processors for OMP (1-way/2-way) and MPI. Speedup: Myrinet ~9.5 times, Ethernet ~3.4 times.]

Summary
• We evaluated the performance of Omni/SCASH
  – Good performance was achieved with Myrinet
  – With Ethernet, the overhead of the barrier operation was very large

• For Ethernet-based clusters:
  – Applications with little communication are suitable for Ethernet
  – When using Omni/SCASH, communication in applications must be carefully optimized

Future work
• More detailed analysis of:
  – the performance difference between Ethernet and Myrinet
  – the overhead of the barrier operation with Ethernet

• Re-design Omni/SCASH for Ethernet
• To improve locality, we are currently designing a first-touch page allocation facility
• Performance tuning tools to make performance tuning easier
  – Performance counters, profiler