Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP ...

Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP Applications Patrick Carribault, Marc Pérache and Hervé Jourdren

CEA, DAM, DIF, F-91297 Arpajon France

IWOMP'2011

June 14th 2011

1

Introduction/Context

HPC Architecture: Petaflop/s Era Multicore processors as basic blocks Clusters of ccNUMA nodes

Parallel programming models MPI: distributed-memory model OpenMP: shared-memory model

Hybrid MPI/OpenMP (or mixed-mode programming) Promising solution (benefit from both models for data parallelism) Thread-based implementation How to handle data visibility?

Contributions Study on data visibility in thread-based parallel programming models New mechanism to handle programming-model private data

IWOMP'2011

June 14th 2011

2

Outline Introduction/Context Running Example Distributed memory (MPI) Shared memory (OpenMP) Hybrid (mixed-mode) MPI/OpenMP Extended Thread-Local Storage (TLS) Design Implementation Experimental Results Statistics on global variables Overhead of extended TLS Conclusion & Future Work IWOMP'2011

June 14th 2011

3

Running Example 1/4 Sample MPI code Global variable a (catch MPI rank)

int a = -1 ; void f() { printf("Rank = %d\n", a) ;

Traditional MPI approach Each MPI task is a UNIX process

}

int main( int argc, char ** argv ) {

Global variables are duplicated

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &a);

Thread-based MPI approach

MPI_Barrier(MPI_COMM_WORLD);

Each MPI task is a thread

f();

Process virtualization mechanism

MPI_Finalize(); }

Global variables are shared among MPI task located on the same node Need for a mechanism to duplicate global variables IWOMP'2011

June 14th 2011

4

Running Example 2/4 Multiple approaches to restore data flow

int a = -1 ; void f() {

Array privatization Source transformation Promote variables to arrays

printf("Rank = %d\n", a) ; }

int main( int argc, char ** argv ) {

Use rank to index this array

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &a);

Thread-Local Storage (TLS)

MPI_Barrier(MPI_COMM_WORLD);

Runtime mechanism

f();

Need compiler cooperation

MPI_Finalize();

Local storage for every thread

}

Handle thread creation/destruction

IWOMP'2011

June 14th 2011

5

Running Example 3/4 Hybrid MPI/OpenMP code Global variable a (catch MPI rank) Used in OpenMP parallel region Thread-based MPI approach Variable a should be duplicated for each MPI task Each OpenMP thread should access the right copy of a Need for a mechanism to duplicate global variables and allow shared access among threads

int a = -1 ; void f() { printf("Rank = %d\n", a ) ; } int main( int argc, char ** argv ) { MPI_Init( &argc, &argv ) ; MPI_Comm_rank(MPI_COMM_WORLD, &a) ; MPI_Barrier(MPI_COMM_WORLD); #pragma omp parallel { f() ; } MPI_Finalize() ; }

Array privatization? OK TLS? NO

IWOMP'2011

June 14th 2011

6

Running Example 4/4 Hybrid MPI/OpenMP code Global variable a (catch MPI rank) Variable b private to OpenMP threads Both used in OpenMP parallel region Thread-based MPI approach Variable a should be duplicated for each MPI task Each OpenMP thread should access the right copy of a Variable b should be duplicated for each OpenMP thread Contribution: extension of TLS mechanism to allow multiple levels Based on TLS One level for each parallel programming model IWOMP'2011

int a = -1 ; int b = -1 ; #pragma omp threadprivate(b) void f() { printf( "MPI Rank = %d and" " OpenMP Rank = %d\n", a, b ) ; } int main( int argc, char ** argv ) { MPI_Init( &argc, &argv ) ; MPI_Comm_rank(MPI_COMM_WORLD,&a) ; MPI_Barrier(MPI_COMM_WORLD); #pragma omp parallel { b = omp_get_thread_num() ; f() ; } MPI_Finalize() ; } June 14th 2011

7


June 14th 2011

8

TLS Design Efficient and flexible usage of thread-local data Extension to C and C++ language New keyword __thread Cooperation between compiler/linker and runtime system Full C++ support in C++0x standard (thread-local object)

IWOMP'2011

June 14th 2011

9

Extended TLS Design Main idea: add the notion of levels to thread-local data One additional pointer (programming-model level)

IWOMP'2011

June 14th 2011

10

Extended TLS Implementation Cooperation between compiler and runtime system Compiler part in GCC Runtime part in MPC (Message-Passing Computing) Compiler part in GCC New pass to put variables to the right extended-TLS level Modification of backend part for code generation (link to the runtime system) Runtime part in MPC Integrated to user-level thread mechanism Copy-on-write optimization Modified context switch to update pointer to extended TLS variables Implementation of extended TLS freely available in MPC 2.2 at http://mpc.sourceforge.net IWOMP'2011

June 14th 2011

11

Application to Parallel Programming Automatic privatization Automatically convert any MPI code for thread-based MPI compliance Put global variables in extended-TLS MPI level Thread-based hybrid programming Automatically handle thread-based MPI/OpenMP programming model Put global variables into extended-TLS MPI level Put OpenMP threadprivate variables into OpenMP level Embedding external tools Automatically convert external tools (JITs, interpreters, VMs …) Possibility to embed these tools in threads IWOMP'2011

June 14th 2011

12


June 14th 2011

13

Experimental Results Evaluation of extended TLS overhead compared to regular TLS Target architecture TERA 100 machine As of November 2010, rank 6th in Top500 Quad-socket eight-core Nehalem EX nodes Statistics Number of variables to update per benchmark Experimental results NAS benchmark MPI, OpenMP and MultiZone version

IWOMP'2011

June 14th 2011

14

Experimental Results NAS MultiZone benchmarks High-level language: Fortran Parallel programming models: MPI+OpenMP Statistics on the number of variables to update for thread-safety compliance

IWOMP'2011

June 14th 2011

15

20

20

15

15

10

10

5

0 1

4

9

16

25

Overhead (%)

Overhead (%)

Experimental Results

5

0 1

-5

-5

-10

-10

-15

4

9

16

25

-15

Number of MPI tasks Overhead over TLS (w/ dynamic library)

Number of MPI Tasks

Overhead over TLS

Overhead over TLS (w/ dynamic library)

Overhead over TLS

Experiments on NAS 3.3 MPI class B (left: SP, right: BT) Overhead of extended TLS approach compared to regular TLS Overhead up to 15% (depending on benchmark) Dynamic library hampers TLS optimizations (large part of overhead) Extended TLS hampers linker optimizations IWOMP'2011

June 14th 2011

16

20

20

15

15

10

10

5

0 1

2

4

8

16

32

64

Overhead(%)

Overhead(%)


5

0 1

-5

-5

-10

-10

-15

2

4

8

16

32

-15

Number of OpenMP threads Overhead over TLS (w/ dynamic library)

Number of OpenMP threads

Overhead over TLS

Overhead over TLS (w/ dynamic library)

Overhead over TLS

Experiments on NAS 3.3 OpenMP class B (left: SP, right: BT) Overhead of extended TLS approach compared to executable with TLS Threadprivate variables are put into TLS Smaller overhead

IWOMP'2011

June 14th 2011

17

20

20

15

15

10

10

5

0 1

4

9

16

25

Overhead (%)

Overhead (%)


5

0 1

-5

-5

-10

-10

-15

4

9

16

25

-15


Overhead over TLS


Overhead over TLS

Experiments on NAS MZ 3.2 class B (left: SP, right: BT) Overhead of extended TLS approach compared to executable with TLS Global variables and threadprivate variables are put into extended TLS Overhead up to 10% (mainly because of no TLS optimizations)

IWOMP'2011

June 14th 2011

18


June 14th 2011

19

Conclusion Context Mixing multiple thread-based parallel programming models Contributions New mechanism based on extension of TLS (Thread-Local Storage) Cooperation between compiler (GCC) and runtime system (MPC) Freely available at http://mpc.sourceforge.net (version 2.2) Contact: [email protected] or [email protected] Future Work Optimization of linker tool to hoist TLS accesses Deal with more applications/embedded tools Extend this mechanism to more than 2 programming models

IWOMP'2011

June 14th 2011

20

Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP ...

Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP ...

Suggest Documents

Extension of Sikuli Tool to Support Automated Tests to ...

Food Storage Guide - NDSU Agriculture and Extension

Extension modules for storage, visualization and

Preventing Silage Storage Losses - UW-Extension

Food Storage Guide - NDSU Agriculture and Extension

Extension of MIH to Support FPMIPv6 for Optimized ... - arXiv

SEEG Assistant: a 3DSlicer extension to support epilepsy surgery - Core

An Extension of the SCORM Standard to Support

cmpSCTP-An Extension of SCTP to Support ... - Semantic Scholar

Expedite: An Operating System Extension to Support Low ... - CiteSeerX

More than 300 Economists Say to Congress: Support Extension of ...

On the Extension of Xception to Support Software Fault ... - CiteSeerX

Data-Aware Extension of BPEL to Support Data-Intensive Service ...

fossil group to support google's extension of android into wearables

More than 300 Economists Say to Congress: Support Extension of ...

An Annotational Extension of Message Sequence Charts to Support ...

Topic Extraction and Extension to Support Concept ... - Semantic Scholar

An Extension of RSVP to Support Real-Time ... - Semantic Scholar

Organizational Principles of Cloud Storage to Support ...

An Agent Framework for CE Devices to Support Storage

An architecture to support storage and retrieval of events - CiteSeerX

Providing Frequency Support of Hydro-Pumped Storage to Taiwan ...

An architecture to support storage and retrieval of events - CiteSeerX

Storage Designed to Support an Oracle Database | INFINIDAT