Accelerators and Exascale Systems: A Programmer's Perspective
Tim Mattson ([email protected]), Intel, Parallel Computing Lab
Disclaimer
• The views expressed in this talk are those of the speaker and not his employer.
• If I say something "smart" or worthwhile: credit goes to the many smart people I work with.
• If I say something stupid: it's my own fault.
I work in Intel’s research labs. I don’t build products. Instead, I get to poke into dark corners and think silly thoughts… just to make sure we don’t miss any great ideas. Hence, my views are by design far “off the roadmap”.
Outline
• Accelerators:
• Software:
• The Future:
Accelerators in HPC: papers from AsHES sum it up nicely
[Bar chart: accelerator papers per year at CACHES 2011 and AsHES 2012–2016, broken out by type: GPU, CPU (many-core), FPGA, Cell, Memory Cube, and theory papers.]
In HPC, "accelerator" means GPU first and many-core CPUs a distant second. Everything else is in the noise.
Outside HPC: Accelerators are HUGE
• The data center is full of accelerators:
– Packet processing
– Smart NICs
– Cryptographic engines
– DBMS accelerators (indexing, hashing, etc.)
– Machine learning (GPU dominated … primarily for training)
– Spiking neural network chips (on the horizon … IBM TrueNorth)
• Most of these start as an FPGA and then, once the algorithms stabilize, turn into ASICs.
Accelerators are grabbing an increasing fraction of the data center MIPS … which is why Intel bought Altera and Nervana.
Accelerators beyond the GPU
• The challenge facing accelerators: they specialize to particular algorithms … algorithms evolve, so accelerators are always chasing a moving target.
• Solution: programmable hardware (the FPGA):
– O(10^6) 1-bit logic/register elements
– O(10^3) 20 Kb memory blocks
– O(10^3) floating-point multiply/add blocks
– Fixed-function units for common basic operations: transceivers, memory controllers, ARM® cores
Do you need Verilog or VHDL to use an FPGA? No … OpenCL will do.
Gzip compression on an FPGA, OpenCL vs. hand-coded Verilog:
• The OpenCL version was 10% slower and used 12% more resources, with 3x faster development time.
• An Altera summer intern ported and optimized the GZIP algorithm in OpenCL in less than a month; an FPGA engineer at an industry-leading company took 3 months to code the Verilog.
Source: http://www.eecg.utoronto.ca/~mohamed/iwocl14.pdf
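To give a feel for what "OpenCL on an FPGA" means, here is a minimal, hypothetical single-work-item kernel in the style FPGA OpenCL compilers turn into a deep pipeline. It is not the Altera gzip code; the kernel and its byte-frequency task are my own illustration (loosely inspired by the symbol-counting pass of Huffman coding).

    // Hypothetical single-work-item OpenCL kernel: the whole loop nest runs as
    // one work-item, and the FPGA compiler pipelines the loops in hardware.
    __kernel void byte_histogram(__global const uchar *in,
                                 const unsigned int n,
                                 __global unsigned int *hist)  // 256 bins, zeroed by the host
    {
        unsigned int local_hist[256];                 // private array, held on chip
        for (int b = 0; b < 256; b++) local_hist[b] = 0;

        for (unsigned int i = 0; i < n; i++)          // the compiler pipelines this loop
            local_hist[in[i]]++;

        for (int b = 0; b < 256; b++) hist[b] = local_hist[b];
    }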
An OpenCL™ Deep Learning Accelerator on Arria 10
Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu. DOI: http://dx.doi.org/10.1145/3020078.3021738
[Chart: AlexNet (all of the layers) throughput per watt, img/s/W:]
• DLA on an Arria 10 1150 FPGA: 1024 img/s @ 45 W
• Xilinx KU060 FPGA: 104 img/s @ 25 W
• Nvidia TitanX GPU: 5120 img/s @ 227 W
• Nvidia M4 GPU: 1150 img/s @ 58 W
This shows you can program an FPGA with OpenCL and get good results. I do not like comparing to NVIDIA or Xilinx or anyone else; I include those numbers only to show that the FPGA/OpenCL results are reasonably good compared to competitors.
Deep Learning is driving the cutting edge of Accelerators • The GPU put Deep Learning on the map • Accelerators will join the fray to make it power efficient • Google TPU … the shot (ASIC) heard round the industry
The board fits in a SATA disk slot and connects via PCIe Gen3 x16.
Systolic matrix multiply unit … outputs 256 results per cycle once the pipeline is filled.
TPU block diagram
• An array of 256×256 8-bit MACs; the 16-bit products are accumulated into 32-bit accumulators (4096 accumulators of 256 elements each), enough space to support double buffering at peak speed.
• Supports the inference phase of deep learning at a throughput of 92 TeraOps/sec. The programming model is typically TensorFlow.
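To make the arithmetic concrete, here is a plain-C sketch of the quantized multiply-accumulate the matrix unit performs: 8-bit operands, 16-bit products, 32-bit accumulation. This is my own illustration, not TPU code (the function name, signedness choices, and row-major layout are assumptions); the systolic array streams the same computation through hardware rather than executing loops.

    /* Hypothetical sketch of the TPU matrix unit's arithmetic:
     * 8-bit activations x 8-bit weights -> 16-bit products, summed
     * into 32-bit accumulators. */
    #include <stdint.h>

    void quantized_matmul(int n, int k, int m,
                          const uint8_t *act,   /* n x k activations, row-major */
                          const int8_t  *wgt,   /* k x m weights,     row-major */
                          int32_t       *out)   /* n x m 32-bit results         */
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                int32_t acc = 0;
                for (int p = 0; p < k; p++) {
                    int16_t prod = (int16_t)act[i*k + p] * (int16_t)wgt[p*m + j];
                    acc += prod;              /* widen into the 32-bit accumulator */
                }
                out[i*m + j] = acc;
            }
    }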
TPU performance results
Three types of neural networks (NN), in two flavors each, together comprise 95% of Google's inference workloads:
• Multi-Layer Perceptrons
• Convolutional NNs
• Long Short-Term Memory (a recurrent NN)
[Performance figure: stars are the TPU, triangles the K80 GPU, circles the Haswell CPU.]
TPU
• The TPU has disrupted the status quo … there is no turning back.
• Motivated by the needs of AI, Google took its deep-learning hardware fate into its own hands and created an amazing application-specific chip (ASIC) … and it took them about 15 months to do it.
Order-of-magnitude differences between commercial products are rare in computer architecture, which may lead to the TPU becoming an archetype for domain-specific architectures. We expect that many will build successors that will raise the bar even higher.
Machine learning: the killer app that's changing our world
• From autonomous vehicles to intentional programming, machine learning will take over increasing amounts of our lives.
• The winners of the AI game will "rule" … and they are willing to spend what it takes to win.
• This is pushing hardware innovation like nothing I've EVER seen.
• I just hope the "Brave New World" we're building is one we'd want to live in.
Outline
• Accelerators: The real action is outside traditional HPC. Can we somehow benefit from this work, or will we stay locked onto GPUs?
• Software:
• The Future:
People buy applications, not computers
• If programmers aren't happy, nobody is happy.
• Programmers need:
– Execution models they can use to understand performance issues during algorithm design
– Tools to help debug and optimize code
– Supporting libraries needed by their apps (e.g., BLAS)
– The ability to support ALL the platforms THEIR customers care about
• Performance portability: a single source (with O(zero) #ifdefs) runs fast on all platforms of interest.
• Maintainability: support the platforms of interest from a single code base.
Attacks on performance portability are often used to convince people to lock themselves into a vendor's software platform. But they miss the point. Maintainability is key. Specialization within a common source base is how we've always done things … even in sequential programming.
Be careful believing anything a vendor says!
• Remember the immortal words of Upton Sinclair: "It is difficult to get a man to understand something when his salary depends upon his not understanding it."
• I work for Intel … so be cautious with any benchmarks or hardware performance comparisons I might make.
• Furthermore, I helped create both OpenMP and OpenCL … to say I am biased is a gross understatement.
So I depend on the work of others …
• Acknowledgments: Simon McIntosh-Smith and his group at Bristol University, who specialize in portable parallel programming with OpenCL and OpenMP 4.5.
– Portable Performance with OpenCL, Simon McIntosh-Smith and Tim Mattson, in High Performance Parallelism Pearls, eds. Jim Jeffers and James Reinders, Morgan Kaufmann, 2014.
– An Evaluation of Emerging Many-Core Parallel Programming Models, M. Martineau, S. McIntosh-Smith, M. Boulton, W. Gaudin, Proc. of the 7th Intl. Workshop on Programming Models and Applications for Multicores and Manycores.
– Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model, M. Martineau, S. McIntosh-Smith, W. Gaudin, IEEE Intl. Parallel and Distributed Processing Symposium Workshops, 2016.
– Pragmatic Performance Portability with OpenMP 4.x, M. Martineau, J. Price, S. McIntosh-Smith, and W. Gaudin, IWOMP, 2016.
I love the Single Instruction Multiple Thread (SIMT) model
• Dominant as a "proprietary" solution based on CUDA and OpenACC#.
• But there is an open-standard response, supported to varying degrees by all major vendors: OpenCL, which provides SIMT programming for CPUs, GPUs, DSPs, and FPGAs. Basically, an open standard that generalizes the SIMT platform pioneered by our friends at NVIDIA®.
• OpenMP 4.0 added target and device directives, based on the same work that was used to create OpenACC. Therefore, just like OpenACC, you can program a GPU with OpenMP!!!
# Yes, I am aware that OpenACC is trying to become an open standard. But it isn't there yet and is still basically tied to Nvidia products.
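As a concrete illustration, here is a minimal sketch (not a tuned kernel) of what the OpenMP 4.x target directives look like for offloading a simple vector add to a device:

    // Minimal sketch: vector add offloaded with the OpenMP 4.x target directives.
    // The map clauses copy a and b to the device and bring c back.
    #include <stdio.h>
    #define N 100000

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp teams distribute parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }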
Portable performance: dense matrix multiplication
Transform the basic serial matrix multiply into a multiplication over blocks:

    void mat_mul(int N, float *A, float *B, float *C)
    {
        int ib, jb, kb;
        int NB = N / block_size;          // assume N % block_size == 0
        for (ib = 0; ib < NB; ib++)
            for (jb = 0; jb < NB; jb++)
                for (kb = 0; kb < NB; kb++)
                    sgemm(C, A, B, …);    // C(ib,jb) += A(ib,kb) * B(kb,jb)
    }

[Diagram: C(ib,jb) = A(ib,:) x B(:,jb), a row of blocks of A times a column of blocks of B.]
Note: sgemm is the name of the Level-3 BLAS routine that multiplies two matrices.
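The point of the blocking transformation is data reuse: each block_size x block_size block of A and B is loaded once and then used block_size times, so the working set fits in cache (or, in the OpenCL kernel below, in local memory) instead of streaming from DRAM on every access.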
Blocked matrix multiply: kernel

    #define blksz 16
    __kernel void mmul(
        const unsigned int N,
        __global float* A,
        __global float* B,
        __global float* C,
        __local  float* Awrk,
        __local  float* Bwrk)
    {
        int kloc, Kblk;
        float Ctmp = 0.0f;

        // this work-item computes one element, C(i,j), of block C(Iblk,Jblk)
        int i    = get_global_id(0),  j    = get_global_id(1);   // row, column of C
        int Iblk = get_group_id(0),   Jblk = get_group_id(1);    // block indices
        int iloc = get_local_id(0),   jloc = get_local_id(1);    // indices within the block
        int Nblk = N / blksz;

        // upper-left-corner and inc for A and B
        int Abase = Iblk*N*blksz;   int Ainc = blksz;
        int Bbase = Jblk*blksz;     int Binc = blksz*N;

        // C(Iblk,Jblk) = (sum over Kblk) A(Iblk,Kblk) * B(Kblk,Jblk)
        for (Kblk = 0; Kblk < Nblk; Kblk++) {
            // each work-item copies one element of the A and B blocks into local memory
            Awrk[iloc*blksz + jloc] = A[Abase + iloc*N + jloc];
            Bwrk[iloc*blksz + jloc] = B[Bbase + iloc*N + jloc];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (kloc = 0; kloc < blksz; kloc++)   // partial dot product for this block pair
                Ctmp += Awrk[iloc*blksz + kloc] * Bwrk[kloc*blksz + jloc];
            barrier(CLK_LOCAL_MEM_FENCE);

            Abase += Ainc;
            Bbase += Binc;
        }
        C[i*N + j] = Ctmp;   // one element of C per work-item
    }
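This kernel only works if each work-group is exactly blksz x blksz and the two __local arguments are given blksz*blksz floats of local memory. A minimal host-side launch sketch (kern, q, d_A, d_B, and d_C are placeholder names for the kernel, queue, and device buffers; N is held in an unsigned int; error checking omitted):

    // Host-side launch sketch for the blocked kernel above.
    size_t global[2] = { N, N };            // one work-item per element of C
    size_t local[2]  = { blksz, blksz };    // one work-group per block of C
    clSetKernelArg(kern, 0, sizeof(unsigned int), &N);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(kern, 3, sizeof(cl_mem), &d_C);
    clSetKernelArg(kern, 4, blksz * blksz * sizeof(float), NULL);  // __local Awrk
    clSetKernelArg(kern, 5, blksz * blksz * sizeof(float), NULL);  // __local Bwrk
    clEnqueueNDRangeKernel(q, kern, 2, NULL, global, local, 0, NULL, NULL);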