Intel Xeon Phi MIC Offload Programming Models - TACC User Portal

Intel Xeon Phi MIC Offload Programming Models Kent Milfeld Nov. 16, 2014 These slides: tinyurl.com/sc14-phi © The University of Texas at Austin, 2014 Please see the final slide for copyright and licensing information.

PowerPoint original available on request ver KM2014-011

Key References •  Jeffers and Reinders, Intel Xeon Phi... –  High Performance Parallelism Pearls, Nov. 17

•  Intel Developer Zone –  http://software.intel.com/en-us/mic-developer –  http://software.intel.com/en-us/articles/effective-use-of-the-intelcompilers-offload-features

•  Stampede User Guide and related TACC resources –  Search User Guide for "Advanced Offload" and follow link Other specific recommendations throughout this presentation

2

Overview •  Basic Offload Process •  Automatic Offload •  Explicit Offload (directives) •  Virtual Shared Memory –  Offload Examples •  Building Libraries –  Controlling Offload •  Asynchronous Offload –  Monitoring –  Environment Variables

Source code available on Stampede: tar xvf ~train00/offload_demos.tar

3

Offloading: MIC as coprocessor A program running on the host offloads work by directing the MIC to execute a specified block of code. The host also directs the exchange of data between host and MIC.

app

“...do work and

running on host

host

Ideally, the host stays active while the MIC coprocessor does its assigned work.

deliver results as directed...”

x16 PCIe

node 4

Offload Models •  Compiler Assisted Offload (CAO) Insert Directives –  Explicit (Non-shared Memory Model) •  Programmer directs code execution AND data movement

–  Implicit (Virtual Shared Memory Model) •  Programmer marks some data as “shared” in the virtual sense •  Runtime automatically synchronizes values between host and MIC

•  Automatic Offload (AO) Some MKL lib routines offload when enabled –  Computationally intensive calls to Intel Math Kernel Library (MKL) –  MKL automatically manages details –  More than just offload: work division across host and MIC!

5

Explicit Model: Direct Control of Data Movement •  Available for C/C++ and Fortran •  Rich set of Data Motion directives and clauses (for Non-Shared Model through COI*) •  Supports simple (“bitwise copyable”) data structures (think 1d arrays of scalars)

*Coprocessor Offload Infrastructure 6

F90

program main

Explicit Offload

use omp_lib integer :: nprocs nprocs = omp_get_num_procs() print*, "procs: ", nprocs end program

ifort -openmp off00host.f90 icc -openmp off00host.c #include #include int main( void )

C/C++ {

int totalProcs;

Simple Fortran and C codes that each return "procs: 16" on Sandy Bridge host…

totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0; } 7

F90

program main

Explicit Offload

use omp_lib integer :: nprocs

offload directive

!dir$ offload target(mic) nprocs = omp_get_num_procs()

runs on MIC

print*, "procs: ", nprocs end program

runs on host

ifort -openmp off01simple.f90 icc -openmp off01simple.c

#include #include int main( void ) int totalProcs;

Add a one-line directive/pragma that offloads to the MIC the one line of executable code that occurs below it…

C/C++ {

offload pragma

runs on MIC #pragma offload target(mic) totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0;

…codes now return "procs: 240"… }

runs on host 8

F90

program main

Explicit Offload

use omp_lib integer :: nprocs !dir$ offload target(mic) nprocs = omp_get_num_procs()

don't use "-mmic"

print*, "procs: ", nprocs end program

ifort -openmp off01simple.f90 icc -openmp off01simple.c #include #include int main( void )

C/C++ {

int totalProcs;

Don't even need to change the compile line…

#pragma offload target(mic) totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0; } 9

F90

program main

Explicit Offload

use omp_lib integer :: nprocs

Execution blocks until offload finished

!dir$ offload target(mic) nprocs = omp_get_num_procs() print*, "procs: ", nprocs end program

off01simple #include #include int main( void ) int totalProcs;

Not asynchronous (yet): the host pauses until MIC is finished.

C/C++ {

Execution blocks until offload finished

#pragma offload target(mic) totalProcs = omp_get_num_procs();

printf( "procs: %d\n", totalProcs ); return 0; } 10

F90 !dir$ offload begin target(mic) nprocs = omp_get_num_procs() maxthreads = omp_get_max_threads() !dir$ end offload

Explicit Offload off02block

C/C++

Can offload section of code as a block

#pragma offload target(mic) { totalProcs = omp_get_num_procs(); maxThreads = omp_get_max_threads(); }

11

program main integer, parameter :: N = 500000 real :: a(N)

F90

Explicit Offload

! constant ! on stack

!dir$ offload target(mic) !$omp parallel do do i=1,N a(i) = real(i) end do !$omp end parallel do ...

off03omp

C/C++

int main( void ) {

…or an OpenMP region defined by an omp directive…

double a[500000]; // on the stack; //literal here is important int i; #pragma offload target(mic) #pragma omp parallel for for ( i=0; i

Intel Xeon Phi MIC Offload Programming Models - TACC User Portal

Intel Xeon Phi MIC Offload Programming Models - TACC User Portal

Suggest Documents

Introduction to Intel Xeon Phi Coprocessors â Part I - TACC User Portal [PDF]

Paper 1 Comparison of Intel Xeon Phi Offload ...

Best Practice Guide Intel Xeon Phi v1.1 - prace

Best Practice Guide Intel Xeon Phi v1.1 - prace

Download-This-Intel-Xeon-Phi-.pdf - Google Drive

Best Practice Guide Intel Xeon Phi v1.1 - prace

Effective SIMD Vectorization for Intel Xeon Phi Coprocessors

Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned - prace

Preliminary Experiments with XKaapi on Intel Xeon Phi ... - HAL-Inria

Best Practice Guide Intel Xeon Phi v1.1 - prace

Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU

Optimizing Performance of ROMS on Intel Xeon Phi - Science Direct

Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned - prace

Best Practice Guide Intel Xeon Phi v2.0 - prace

Intel Xeon Phi accelerated WRF Goddard microphysics scheme - GMDD

Parallel finite element analysis on the Intel Xeon Phi - EMiT

Preliminary experiences with the uintah framework on Intel Xeon Phi ...

SGI with Intel® Xeon Phi™ Coprocessors Complete Solutions

White Paper System Administration for the Intel® Xeon Phi ...

Nvidia GPUs vs. Intel Xeon Phi - Distributed Systems

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Optimizing Performance of ROMS on Intel Xeon Phi - ScienceDirect

QPACE 2 and Domain Decomposition on the Intel Xeon Phi

Test-Driving Intel Xeon Phi - Cees de Laat

Intel Xeon Phi MIC Offload Programming Models - TACC User Portal