Accelerators and Exascale Systems: A Programmer's Perspective
Tim Mattson ([email protected]), Intel, Parallel Computing Lab
Disclaimer
• The views expressed in this talk are those of the speaker and not his employer.
• If I say something "smart" or worthwhile: credit goes to the many smart people I work with.
• If I say something stupid: it's my own fault.
I work in Intel’s research labs. I don’t build products. Instead, I get to poke into dark corners and think silly thoughts… just to make sure we don’t miss any great ideas. Hence, my views are by design far “off the roadmap”.
Outline
• Accelerators:
• Software:
• The Future:
Accelerators in HPC: papers from AsHES sum it up nicely
[Bar chart: accelerator papers per year at CACHES 2011 and AsHES 2012–2016, broken out by type: GPU, CPU (many-core), FPGA, Cell, Memory Cube, and theory papers.]
In HPC, "accelerator" means GPU first and many-core CPUs a distant second. Everything else is in the noise.
Outside HPC: Accelerators are HUGE
• The data center is full of accelerators:
– Packet processing
– Smart NICs
– Cryptographic engines
– DBMS accelerators (indexing, hashing, etc.)
– Machine learning (GPU dominated … primarily for training)
– Spiking neural network chips (on the horizon … IBM TrueNorth)
• Most of these start as an FPGA and then, once the algorithms stabilize, turn into ASICs.
Accelerators are grabbing an increasing fraction of the data center MIPS … which is why Intel bought Altera and Nervana.
Accelerators beyond the GPU
• The challenge facing accelerators: they specialize to particular algorithms … algorithms evolve, so accelerators are always chasing a moving target.
• Solution: programmable hardware (the FPGA):
– O(10^6) 1-bit logic/register elements
– O(10^3) 20 Kb memory blocks
– O(10^3) floating-point multiply/add blocks
– Fixed-function units for common basic operations: transceivers, memory controllers, ARM® cores
Do you need Verilog or VHDL to use an FPGA? No … OpenCL will do.
Gzip compression on an FPGA, OpenCL vs. hand-coded Verilog:
• The OpenCL version was 10% slower and used 12% more resources, with 3x faster development time.
• An Altera summer intern ported and optimized the GZIP algorithm in OpenCL in less than a month; an FPGA engineer at an industry-leading company took 3 months to code the Verilog.
Source: http://www.eecg.utoronto.ca/~mohamed/iwocl14.pdf
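To give a feel for what "OpenCL on an FPGA" means, here is a minimal, hypothetical single-work-item kernel in the style FPGA OpenCL compilers turn into a deep pipeline. It is not the Altera gzip code; the kernel and its byte-frequency task are my own illustration (loosely inspired by the symbol-counting pass of Huffman coding).

    // Hypothetical single-work-item OpenCL kernel: the whole loop nest runs as
    // one work-item, and the FPGA compiler pipelines the loops in hardware.
    __kernel void byte_histogram(__global const uchar *in,
                                 const unsigned int n,
                                 __global unsigned int *hist)  // 256 bins, zeroed by the host
    {
        unsigned int local_hist[256];                 // private array, held on chip
        for (int b = 0; b < 256; b++) local_hist[b] = 0;

        for (unsigned int i = 0; i < n; i++)          // the compiler pipelines this loop
            local_hist[in[i]]++;

        for (int b = 0; b < 256; b++) hist[b] = local_hist[b];
    }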
An OpenCL™ Deep Learning Accelerator on Arria 10
Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu. DOI: http://dx.doi.org/10.1145/3020078.3021738
[Chart: AlexNet (all of the layers) throughput per watt, img/s/W:]
• DLA on an Arria 10 1150 FPGA: 1024 img/s @ 45 W
• Xilinx KU060 FPGA: 104 img/s @ 25 W
• Nvidia TitanX GPU: 5120 img/s @ 227 W
• Nvidia M4 GPU: 1150 img/s @ 58 W
This shows you can program an FPGA with OpenCL and get good results. I do not like comparing to NVIDIA or Xilinx or anyone else; I include those numbers only to show that the FPGA/OpenCL results are reasonably good compared to competitors.
Deep Learning is driving the cutting edge of Accelerators • The GPU put Deep Learning on the map • Accelerators will join the fray to make it power efficient • Google TPU … the shot (ASIC) heard round the industry
The board fits in a SATA disk slot and connects via PCIe Gen3 x16.
Systolic matrix multiply unit … outputs 256 results per cycle once the pipeline is filled.
TPU block diagram
• An array of 256×256 8-bit MACs; the 16-bit products are accumulated into 32-bit accumulators (4096 accumulators of 256 elements each), enough space to support double buffering at peak speed.
• Supports the inference phase of deep learning at a throughput of 92 TeraOps/sec. The programming model is typically TensorFlow.
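To make the arithmetic concrete, here is a plain-C sketch of the quantized multiply-accumulate the matrix unit performs: 8-bit operands, 16-bit products, 32-bit accumulation. This is my own illustration, not TPU code (the function name, signedness choices, and row-major layout are assumptions); the systolic array streams the same computation through hardware rather than executing loops.

    /* Hypothetical sketch of the TPU matrix unit's arithmetic:
     * 8-bit activations x 8-bit weights -> 16-bit products, summed
     * into 32-bit accumulators. */
    #include <stdint.h>

    void quantized_matmul(int n, int k, int m,
                          const uint8_t *act,   /* n x k activations, row-major */
                          const int8_t  *wgt,   /* k x m weights,     row-major */
                          int32_t       *out)   /* n x m 32-bit results         */
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                int32_t acc = 0;
                for (int p = 0; p < k; p++) {
                    int16_t prod = (int16_t)act[i*k + p] * (int16_t)wgt[p*m + j];
                    acc += prod;              /* widen into the 32-bit accumulator */
                }
                out[i*m + j] = acc;
            }
    }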
TPU performance results
Three types of neural networks (NN), in two flavors each, together comprise 95% of Google's inference workloads:
• Multi-Layer Perceptrons
• Convolutional NNs
• Long Short-Term Memory (a recurrent NN)
[Performance figure: stars are the TPU, triangles the K80 GPU, circles the Haswell CPU.]
TPU
• The TPU has disrupted the status quo … there is no turning back.
• Motivated by the needs of AI, Google took its deep-learning hardware fate into its own hands and created an amazing application-specific chip (ASIC) … and it took them about 15 months to do it.
Order-of-magnitude differences between commercial products are rare in computer architecture, which may lead to the TPU becoming an archetype for domain-specific architectures. We expect that many will build successors that will raise the bar even higher.
Machine learning: the killer app that's changing our world
• From autonomous vehicles to intentional programming, machine learning will take over increasing amounts of our lives.
• The winners of the AI game will "rule" … and they are willing to spend what it takes to win.
• This is pushing hardware innovation like nothing I've EVER seen.
• I just hope the "Brave New World" we're building is one we'd want to live in.
Outline
• Accelerators: The real action is outside traditional HPC. Can we somehow benefit from this work, or will we stay locked onto GPUs?
• Software:
• The Future:
People buy applications, not computers
• If programmers aren't happy, nobody is happy.
• Programmers need:
– Execution models they can use to understand performance issues during algorithm design
– Tools to help debug and optimize code
– Supporting libraries needed by their apps (e.g., BLAS)
– The ability to support ALL the platforms THEIR customers care about
• Performance portability: a single source (with O(zero) #ifdefs) runs fast on all platforms of interest.
• Maintainability: support the platforms of interest from a single code base.
Attacks on performance portability are often used to convince people to lock themselves into a vendor's software platform. But they miss the point. Maintainability is key. Specialization within a common source base is how we've always done things … even in sequential programming.
Be careful believing anything a vendor says!
• Remember the immortal words of Upton Sinclair: "It is difficult to get a man to understand something when his salary depends upon his not understanding it."
• I work for Intel … so be cautious with any benchmarks or hardware performance comparisons I might make.
• Furthermore, I helped create both OpenMP and OpenCL … to say I am biased is a gross understatement.
So I depend on the work of others …
• Acknowledgments: Simon McIntosh-Smith and his group at Bristol University, who specialize in portable parallel programming with OpenCL and OpenMP 4.5.
– Portable Performance with OpenCL, Simon McIntosh-Smith and Tim Mattson, in High Performance Parallelism Pearls, eds. Jim Jeffers and James Reinders, Morgan Kaufmann, 2014.
– An Evaluation of Emerging Many-Core Parallel Programming Models, M. Martineau, S. McIntosh-Smith, M. Boulton, W. Gaudin, Proc. of the 7th Intl. Workshop on Programming Models and Applications for Multicores and Manycores.
– Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model, M. Martineau, S. McIntosh-Smith, W. Gaudin, IEEE Intl. Parallel and Distributed Processing Symposium Workshops, 2016.
– Pragmatic Performance Portability with OpenMP 4.x, M. Martineau, J. Price, S. McIntosh-Smith, and W. Gaudin, IWOMP, 2016.
I love the Single Instruction Multiple Thread (SIMT) model
• Dominant as a "proprietary" solution based on CUDA and OpenACC#.
• But there is an open-standard response, supported to varying degrees by all major vendors: OpenCL, which provides SIMT programming for CPUs, GPUs, DSPs, and FPGAs. Basically, an open standard that generalizes the SIMT platform pioneered by our friends at NVIDIA®.
• OpenMP 4.0 added target and device directives, based on the same work that was used to create OpenACC. Therefore, just like OpenACC, you can program a GPU with OpenMP!!!
# Yes, I am aware that OpenACC is trying to become an open standard. But it isn't there yet and is still basically tied to Nvidia products.
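As a concrete illustration, here is a minimal sketch (not a tuned kernel) of what the OpenMP 4.x target directives look like for offloading a simple vector add to a device:

    // Minimal sketch: vector add offloaded with the OpenMP 4.x target directives.
    // The map clauses copy a and b to the device and bring c back.
    #include <stdio.h>
    #define N 100000

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp teams distribute parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }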
Portable performance: dense matrix multiplication
Transform the basic serial matrix multiply into a multiplication over blocks:

    void mat_mul(int N, float *A, float *B, float *C)
    {
        int ib, jb, kb;
        int NB = N / block_size;          // assume N % block_size == 0
        for (ib = 0; ib < NB; ib++)
            for (jb = 0; jb < NB; jb++)
                for (kb = 0; kb < NB; kb++)
                    sgemm(C, A, B, …);    // C(ib,jb) += A(ib,kb) * B(kb,jb)
    }

[Diagram: C(ib,jb) = A(ib,:) x B(:,jb), a row of blocks of A times a column of blocks of B.]
Note: sgemm is the name of the Level-3 BLAS routine that multiplies two matrices.
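The point of the blocking transformation is data reuse: each block_size x block_size block of A and B is loaded once and then used block_size times, so the working set fits in cache (or, in the OpenCL kernel below, in local memory) instead of streaming from DRAM on every access.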
Blocked matrix multiply: kernel

    #define blksz 16
    __kernel void mmul(
        const unsigned int N,
        __global float* A,
        __global float* B,
        __global float* C,
        __local  float* Awrk,
        __local  float* Bwrk)
    {
        int kloc, Kblk;
        float Ctmp = 0.0f;

        // this work-item computes one element, C(i,j), of block C(Iblk,Jblk)
        int i    = get_global_id(0),  j    = get_global_id(1);   // row, column of C
        int Iblk = get_group_id(0),   Jblk = get_group_id(1);    // block indices
        int iloc = get_local_id(0),   jloc = get_local_id(1);    // indices within the block
        int Nblk = N / blksz;

        // upper-left-corner and inc for A and B
        int Abase = Iblk*N*blksz;   int Ainc = blksz;
        int Bbase = Jblk*blksz;     int Binc = blksz*N;

        // C(Iblk,Jblk) = (sum over Kblk) A(Iblk,Kblk) * B(Kblk,Jblk)
        for (Kblk = 0; Kblk < Nblk; Kblk++) {
            // each work-item copies one element of the A and B blocks into local memory
            Awrk[iloc*blksz + jloc] = A[Abase + iloc*N + jloc];
            Bwrk[iloc*blksz + jloc] = B[Bbase + iloc*N + jloc];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (kloc = 0; kloc < blksz; kloc++)   // partial dot product for this block pair
                Ctmp += Awrk[iloc*blksz + kloc] * Bwrk[kloc*blksz + jloc];
            barrier(CLK_LOCAL_MEM_FENCE);

            Abase += Ainc;
            Bbase += Binc;
        }
        C[i*N + j] = Ctmp;   // one element of C per work-item
    }
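This kernel only works if each work-group is exactly blksz x blksz and the two __local arguments are given blksz*blksz floats of local memory. A minimal host-side launch sketch (kern, q, d_A, d_B, and d_C are placeholder names for the kernel, queue, and device buffers; N is held in an unsigned int; error checking omitted):

    // Host-side launch sketch for the blocked kernel above.
    size_t global[2] = { N, N };            // one work-item per element of C
    size_t local[2]  = { blksz, blksz };    // one work-group per block of C
    clSetKernelArg(kern, 0, sizeof(unsigned int), &N);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(kern, 3, sizeof(cl_mem), &d_C);
    clSetKernelArg(kern, 4, blksz * blksz * sizeof(float), NULL);  // __local Awrk
    clSetKernelArg(kern, 5, blksz * blksz * sizeof(float), NULL);  // __local Bwrk
    clEnqueueNDRangeKernel(q, kern, 2, NULL, global, local, 0, NULL, NULL);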