Performance of Cluster-enabled OpenMP for the SCASH Software Distributed Shared Memory System
Yoshinori Ojima 1), Mitsuhisa Sato 1), Hiroshi Harada 2), Yutaka Ishikawa 3)
1) Information Science and Electronics, University of Tsukuba  2) Hewlett-Packard Japan, Ltd.  3) The University of Tokyo
Overview • Background, objective • Software DSM system SCASH • Omni/SCASH: OpenMP implementation for Software DSM • Comparison of the basic performance of Myrinet and Ethernet • Performance evaluation, Discussion • Conclusion, future work
Background
• PC clusters have become a popular parallel computing platform.
• Distributed memory programming
  – Message passing libraries (MPI, PVM)
  – Programming cost is large
• Shared memory programming
  – Programmers can parallelize easily with OpenMP
  – Programming cost is small
Omni/SCASH: OpenMP for a Software Distributed Shared Memory (DSM) System. This is a cluster-enabled OpenMP implementation.
The objectives of this research
• To evaluate the performance of Omni/SCASH
• To investigate which performance factors depend on the communication performance of the network
• To investigate the problems of using a commodity network (Ethernet) as well as Myrinet
Software DSM System SCASH
• A software DSM system provided as part of the SCore cluster system software
• Uses the PM communication library; implemented as a user-level library
• Page-based coherence (using kernel page faults)
• Two page consistency protocols
  – invalidate and update
• Eager Release Consistency (ERC) memory model with a multiple-writer protocol (diffs)
• Consistency-maintenance communication occurs at synchronization points
Omni/SCASH
• OpenMP compiler for SCASH, built on the Omni OpenMP Compiler
  – Translates OpenMP programs into multithreaded programs linked with the SCASH runtime library
• The "shmem" memory model differs from OpenMP's:
  – OpenMP: all variables are shared by default; no explicit shared memory allocation
  – "shmem": all variables declared statically in global scope are private; the shared address space must be allocated by a library function at runtime
Transformation for the "shmem" model
• The OpenMP compiler for the "shmem" memory model
  – Detects references to shared data objects
  – Rewrites them into references to objects allocated in the shared address space at runtime
  – A global variable declaration becomes a declaration of a pointer that will point into a shared object at runtime
• Example source:
    double x;
    double a[100];
    ...
    a[10] = x;
Extension of OpenMP in Omni/SCASH
• In software DSMs, the allocation of pages to home nodes greatly affects performance
• OpenMP
  – does not provide facilities for specifying how data is to be arranged within the memory space
  – has no loop scheduling method that matches iterations to the placement of the data they access
An example of the extension
• Data mapping directive
    double a[100][200];
    #pragma omni mapping(a[block][*])
Basic performance of the networks
• Comparison of the basic communication performance of Ethernet and Myrinet
  – Page transmission cost
  – Overhead of barrier operation
• Measurement conditions
  – Programs are parallelized with OpenMP
  – 1 processor per node is used
Evaluation platform: PC cluster "COSMO"
  CPU:      Pentium II Xeon 450MHz (4-way SMP)
  L2 Cache: 1MB
  Memory:   2GB
  Nodes:    8
  Network:  800Mbps Myrinet, 100base-TX Ethernet
  OS:       Linux Kernel 2.4.18
  SCore:    version 5.0.1
  Compiler: gcc 2.96 (optimization option -O4)
Page transmission cost
• One-dimensional array of 32KB (8 pages): 4-byte elements, 8k elements
• The master thread writes to every element of the array, followed by a barrier point
• At the barrier point, modified data is copied back to its home nodes
• Measured: execution time from the beginning of the array writes to completion of the barrier operation
[Diagram: master thread writing across the 8-page array]
[Figure: Page transmission cost. Elapsed time (msec, up to ~8000) for Ethernet vs. Myrinet; the gap for one step is the page transmission cost.]
Overhead of barrier operation
• A two-dimensional array of 256KB (8 pages × 8) is allocated in shared memory space
• It is mapped with block distribution
• Slave threads read every element mapped to the master thread, followed by a barrier point
• At the barrier point, no consistency-maintenance communication occurs
[Diagram: slave threads reading the master thread's portion of the array]
[Figure: Overhead of barrier operation. Elapsed time (msec) vs. range to read (index), with and without barrier, on Ethernet and Myrinet. "Barrier" measures from the beginning to the completion of the barrier operation; "no-barrier" measures from the beginning to the completion of the read operation. Barrier overhead: Myrinet ~0.7ms, Ethernet ~9ms.]
Performance Evaluation
Performance Evaluation
• laplace
  – A simple Laplace equation solver using a Jacobi 5-point stencil operation
  – Written in C
  – Two versions
    • Parallelized with OpenMP (OpenMP version)
    • Written using the SCASH library (SCASH version)
  – The array size is 1024×1024 (double precision)
  – The number of iterations is 50
• NPB EP
  – Fortran, parallelized with OpenMP (by RWC), Class A
• NPB BT, SP
  – Fortran, parallelized with OpenMP, Class A
  – Affinity scheduling applied
[Figure: laplace. Elapsed time (sec) vs. number of processors (1, 2, 4, 8, 16) for the SCASH version and the OpenMP version, each with 1-way(Ether), 2-way(Ether), 1-way(Myri), 2-way(Myri). Speedup with 16 processors: Myrinet ~13.3 times, Ethernet ~3.0 times.]
Application of affinity scheduling
• laplace (OpenMP version)
  – A mismatch exists between the alignment of data and loops, so the number of page faults increases
  – Affinity scheduling is applied to match the alignment of data and loops
[Figure: Performance improvement with affinity scheduling (Ethernet). Elapsed time (sec) vs. number of processors; improvement of ~10%.]
[Figure: NPB EP. Performance (Mops) vs. number of processors (0 to 16) for OMP(1way), OMP(2way), MPI(1way), MPI(2way), on Myrinet and Ethernet. Speedup: Ethernet ~7.2 times.]
[Figure: NPB SP. Performance (Mops) vs. number of processors (0 to 16) for OMP(1way), OMP(2way), MPI, on Myrinet and Ethernet. Speedup: Myrinet ~6.2 times, Ethernet: no speedup.]
[Figure: NPB BT. Performance (Mops) vs. number of processors (0 to 16) for OMP(1way), OMP(2way), MPI, on Myrinet and Ethernet. Speedup: Myrinet ~9.5 times, Ethernet ~3.4 times.]
Summary
• We evaluated the performance of Omni/SCASH
  – Good performance was achieved with Myrinet
  – With Ethernet, the overhead of the barrier operation was very large
• For Ethernet-based clusters:
  – Applications with little communication are suitable
  – When using Omni/SCASH, communication in applications must be optimized carefully
Future work
• More detailed analysis of:
  – the performance difference between Ethernet and Myrinet
  – the overhead of the barrier operation with Ethernet
• Re-design Omni/SCASH for Ethernet
• To improve locality, we are currently designing a first-touch page allocation facility
• Tools to make performance tuning easier
  – Performance counters, profiler