Nov 16, 2014 - integer, parameter :: N = 100000 ! constant real. :: a(N), b(N), c(N), d(N) ! on stack ... !dir$ offload
Intel Xeon Phi MIC Offload Programming Models Kent Milfeld Nov. 16, 2014 These slides: tinyurl.com/sc14-phi © The University of Texas at Austin, 2014 Please see the final slide for copyright and licensing information.
PowerPoint original available on request ver KM2014-011
Key References • Jeffers and Reinders, Intel Xeon Phi... – High Performance Parallelism Pearls, Nov. 17
• Intel Developer Zone – http://software.intel.com/en-us/mic-developer – http://software.intel.com/en-us/articles/effective-use-of-the-intelcompilers-offload-features
• Stampede User Guide and related TACC resources – Search User Guide for "Advanced Offload" and follow link Other specific recommendations throughout this presentation
2
Overview • Basic Offload Process • Automatic Offload • Explicit Offload (directives) • Virtual Shared Memory – Offload Examples • Building Libraries – Controlling Offload • Asynchronous Offload – Monitoring – Environment Variables
Source code available on Stampede: tar xvf ~train00/offload_demos.tar
3
Offloading: MIC as coprocessor A program running on the host offloads work by directing the MIC to execute a specified block of code. The host also directs the exchange of data between host and MIC.
app
“...do work and
running on host
host
Ideally, the host stays active while the MIC coprocessor does its assigned work.
deliver results as directed...”
x16 PCIe
node 4
Offload Models • Compiler Assisted Offload (CAO) Insert Directives – Explicit (Non-shared Memory Model) • Programmer directs code execution AND data movement
– Implicit (Virtual Shared Memory Model) • Programmer marks some data as “shared” in the virtual sense • Runtime automatically synchronizes values between host and MIC
• Automatic Offload (AO) Some MKL lib routines offload when enabled – Computationally intensive calls to Intel Math Kernel Library (MKL) – MKL automatically manages details – More than just offload: work division across host and MIC!
5
Explicit Model: Direct Control of Data Movement • Available for C/C++ and Fortran • Rich set of Data Motion directives and clauses (for Non-Shared Model through COI*) • Supports simple (“bitwise copyable”) data structures (think 1d arrays of scalars)
*Coprocessor Offload Infrastructure 6
F90
program main
Explicit Offload
use omp_lib integer :: nprocs nprocs = omp_get_num_procs() print*, "procs: ", nprocs end program
ifort -openmp off00host.f90 icc -openmp off00host.c #include #include int main( void )
C/C++ {
int totalProcs;
Simple Fortran and C codes that each return "procs: 16" on Sandy Bridge host…
totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0; } 7
F90
program main
Explicit Offload
use omp_lib integer :: nprocs
offload directive
!dir$ offload target(mic) nprocs = omp_get_num_procs()
runs on MIC
print*, "procs: ", nprocs end program
runs on host
ifort -openmp off01simple.f90 icc -openmp off01simple.c
#include #include int main( void ) int totalProcs;
Add a one-line directive/pragma that offloads to the MIC the one line of executable code that occurs below it…
C/C++ {
offload pragma
runs on MIC #pragma offload target(mic) totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0;
…codes now return "procs: 240"… }
runs on host 8
F90
program main
Explicit Offload
use omp_lib integer :: nprocs !dir$ offload target(mic) nprocs = omp_get_num_procs()
don't use "-mmic"
print*, "procs: ", nprocs end program
ifort -openmp off01simple.f90 icc -openmp off01simple.c #include #include int main( void )
C/C++ {
int totalProcs;
Don't even need to change the compile line…
#pragma offload target(mic) totalProcs = omp_get_num_procs(); printf( "procs: %d\n", totalProcs ); return 0; } 9
F90
program main
Explicit Offload
use omp_lib integer :: nprocs
Execution blocks until offload finished
!dir$ offload target(mic) nprocs = omp_get_num_procs() print*, "procs: ", nprocs end program
off01simple #include #include int main( void ) int totalProcs;
Not asynchronous (yet): the host pauses until MIC is finished.
C/C++ {
Execution blocks until offload finished
#pragma offload target(mic) totalProcs = omp_get_num_procs();
printf( "procs: %d\n", totalProcs ); return 0; } 10
F90 !dir$ offload begin target(mic) nprocs = omp_get_num_procs() maxthreads = omp_get_max_threads() !dir$ end offload
Explicit Offload off02block
C/C++
Can offload section of code as a block
#pragma offload target(mic) { totalProcs = omp_get_num_procs(); maxThreads = omp_get_max_threads(); }
11
program main integer, parameter :: N = 500000 real :: a(N)
F90
Explicit Offload
! constant ! on stack
!dir$ offload target(mic) !$omp parallel do do i=1,N a(i) = real(i) end do !$omp end parallel do ...
off03omp
C/C++
int main( void ) {
…or an OpenMP region defined by an omp directive…
double a[500000]; // on the stack; //literal here is important int i; #pragma offload target(mic) #pragma omp parallel for for ( i=0; i