Lecture 1 Introduction
Welcome to CSE 260! • Your instructor is Scott Baden
[email protected]
Office hours in EBU3B Room 3244: TBD (suggest after class)
• Your TA is Edmund Wong
[email protected]
• The class home page is http://www.cse.ucsd.edu/classes/fa12/cse260-b
• CSME?
• Logins: Moodle, Bang, Lilliput
Text and readings • Required texts
Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010)
An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
• Assigned class readings will include handouts and on-line material
• Texts on reserve in the S&E library
• Lecture slides: www.cse.ucsd.edu/classes/fa12/cse260-b/Lectures
Course Requirements • Programming labs: 60%
Teams of 2 or 3
Includes a lab report
Find a partner using the “looking for a partner” Moodle forum
• Project: 40%
Teams of 2 or 3
See list of projects
Polished lab report
Policies • Academic Integrity
Do your own work
Plagiarism and cheating will not be tolerated
• By taking this course, you implicitly agree to abide by the course policies: www.cse.ucsd.edu/classes/fa12/cse260-b/Policies.html
Labs and Scheduling • Room full of Mac Minis with GPU
Lab sessions in APM
• One lecture to be rescheduled: Weds 11/14 → TBD
  Will come earlier in the quarter
  You will be working on projects by then
Programming Laboratories
• 3 labs
  #1: Performance programming a single core
  #2: GPU Fermi (NERSC’s Dirac) - CUDA
  #3: MPI on a cluster (Bang) - MPI
• Project: 3 or 4 weeks
  Teams of 2 or 3
  Teams of 3 will have a larger workload than teams of 2
  Establish a schedule with your partner from the start
Course overview and background • How to solve computationally intensive problems on parallel computers
Software techniques Performance tradeoffs
• Background
Graduate standing
Recommended: computer architecture (CSE 240A)
Students outside CSE are welcome
See me if you are unsure about your background
• Prior experience
Parallel computation?
Numerical analysis?
Background Markers
• C/C++
• Java
• Fortran?
• Navier Stokes Equations: ∇·u = 0,  Dρ/Dt + ρ(∇·v) = 0
• Sparse factorization
• TLB misses
• Multithreading
• MPI
• CUDA, GPUs
• RPC
• Abstract base class
• Taylor series: f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + …
Syllabus
• Fundamentals
  Motivation, system organization, hardware execution models, limits to performance, program execution models, theoretical models
• Software and programming
  Programming models and techniques: message passing, multithreading
  Architectural considerations: GPUs and multicore
  Higher level run time models, language support
  CUDA, OpenMP, Pthreads, MPI (a minimal multithreading sketch follows this list)
• Parallel algorithm design and implementation
  Case studies to develop a repertoire of problem solving techniques: discretization, sorting, linear algebra, irregular problems
  Data structures and their efficient implementation: load balancing and performance
  Performance tradeoffs, evaluation, and tuning
  Exascale computing: communication avoidance, resilience
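As a quick preview of the multithreading model listed above, here is a minimal OpenMP sketch in C. It is illustrative only and not taken from the course materials; the array, its size, and all names are assumptions. The message-passing (MPI) and GPU (CUDA) models covered later distribute the same kind of loop across processes or device threads instead.

    /* Minimal OpenMP sketch (illustrative, not course material): each thread
       sums a chunk of the iterations; reduction(+:sum) combines the partial sums. */
    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 20)
    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;    /* fill with known values */

        #pragma omp parallel for reduction(+:sum)  /* distribute iterations over threads */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.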
What is parallel processing? • Decompose a workload onto simultaneously executing physical resources • Multiple processors co-operate to process a related set of tasks – tightly coupled • Improve some aspect of performance
Speedup: 100 processors run ×100 faster than one
Capability: tackle a larger problem, more accurately
Algorithmic, e.g. search
Locality: more cache memory and bandwidth
• Resilience is an issue at the high end (exascale: 10^18 flops/sec)
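To make the speedup bullet concrete (the numbers are illustrative, not from the slide): if one processor takes T(1) = 100 seconds and p processors take T(p) seconds, speedup is S(p) = T(1)/T(p). Perfect scaling on p = 100 processors would give T(100) = 1 second, i.e. S(100) = 100; in practice serial sections and communication keep S(p) below p.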
Parallel Processing, Concurrency & Distributed Computing
• Parallel processing
  Performance (and capacity) is the main goal
  More tightly coupled than distributed computation
• Concurrency
  Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions
  Performance need not be the main goal
• Distributed computation
  Geographically distributed
  Multiple resources computing & communicating unreliably
  Cloud or Grid computing, large amounts of storage
  Looser, coarser grained communication and synchronization
• May or may not involve separate physical resources, e.g. multitasking (“Virtual Parallelism”)
Granularity
• A measure of how often a computation communicates, and at what scale
  Distributed computer: a whole program
  Multicomputer: a function, a loop nest
  Multiprocessor: a memory reference
  Multicore: similar to a multiprocessor but perhaps finer grained
  GPU: kernel thread
  Instruction level parallelism: instruction, register
The impact of technology
Why is parallelism inevitable? • Physical limits on processor clock speed and heat dissipation • A parallel computer increases memory capacity and bandwidth as well as the computational rate
[Figures: Nvidia GPU; average CPU clock speeds over time (via Bill Gropp), http://www.pcpitstop.com/research/cpu.asp]
Today’s laptop would have been yesterday’s supercomputer
• Cray-1 Supercomputer
  80 MHz processor
  240 Mflops/sec
  8 Megabytes memory
  Water cooled
  1.8m H x 2.2m W
  4 tons
  Over $10M in 1976
• MacBook Air
  1.6GHz dual-core Intel Core i5
  2 cores × 8 flops/cycle × 1.6 GHz = 25.6 Gflops/sec
  2 Gigabytes memory, 3 Megabytes shared cache
  64GB Flash storage
  Intel HD Graphics 3000, 256MB shared DDR3 SDRAM
  Color display
  Wireless networking
  Air cooled
  ~1.7 x 30 x 19.2 cm, 1.08 kg
  $999 in January 2012
Technological disruption
• Transformational: modelling, healthcare…
• New capabilities
• Changes the common wisdom for solving a problem, including the implementation
[Pictured systems: Cray-1, 1976, 240 Megaflops; Connection Machine CM-2, 1987; ASCI Red, 1997, 1 Tflop; Beowulf cluster, late 1990s; Sony Playstation 3, 150 Gflops, 2006; Nvidia Tesla, 4.14 Tflops, 2009; Intel 48-core processor, 2009; Tilera 100-core processor, 2009]
Trends in Top 500 supercomputers
www.top500.org
The age of the multi-core processor
• On chip parallel computer
• IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
• First dual core laptops (2005-6)
• GPUs (nVidia, ATI): a supercomputer on a desktop
[Image: realworldtech.com]
The GPU
• Specialized many-core processor
• Massively multithreaded, long vectors
• Reduced on-chip memory per core
• Explicitly manage the memory hierarchy
[Figure: GFLOPS vs. year, 2001-2009, for AMD (GPU), NVIDIA (GPU), and Intel (CPU); many-core GPUs pull ahead of multicore CPUs (dual-core, quad-core). Courtesy: John Owens; Christopher Dyken, SINTEF]
The impact
• You are taking this class
• A renaissance in parallel computation
• Parallelism is no longer restricted to the HPC cult with large machine rooms; it is relevant to everyone
• We all have a parallel computer at our fingertips
• Users may or may not need to know they are using a parallel computer
Motivating Applications
A Motivating Application - TeraShake
• Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA, using seismic, geophysical, and other data from the Southern California Earthquake Center
epicenter.usc.edu/cmeportal/TeraShake.html
How TeraShake Works
• Divide up Southern California into blocks
• For each block, get all the data about geological structures, fault information, …
• Map the blocks onto processors of the supercomputer (a decomposition sketch appears below)
• Run the simulation using current information on fault activity and on the physics of earthquakes
[Photo: SDSC machine room, DataCentral@SDSC]
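A minimal sketch in C of the block-decomposition step described above; this is not the actual TeraShake code, and N (grid points per dimension) and P (processor count) are hypothetical values. It simply shows how a regular grid is carved into nearly equal slabs, one per processor.

    /* Hypothetical block decomposition: split the z axis of an N x N x N grid
       into P nearly equal slabs, one slab per processor (rank). */
    #include <stdio.h>

    int main(void)
    {
        const int N = 3000;          /* assumed grid points per dimension */
        const int P = 240;           /* assumed number of processors      */
        const int base = N / P, rem = N % P;

        for (int rank = 0; rank < P; rank++) {
            int z_lo = rank * base + (rank < rem ? rank : rem);
            int z_hi = z_lo + base + (rank < rem ? 1 : 0);   /* exclusive upper bound */
            if (rank < 3 || rank == P - 1)                   /* print a few sample ranks */
                printf("rank %3d owns z planes [%4d, %4d)\n", rank, z_lo, z_hi);
        }
        return 0;
    }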
Animation
Application #2: modeling the heart on a GPU
• Fred Lionetti and Andrew McCulloch
• nVidia GTX-295, single GPU: 240 single precision units @ 1.242 GHz
• ×134 speedup compared with OpenMP running on a quad core i7 @ 2.93GHz
• ×20 improvement for the entire simulation
Face detection with Viola-Jones algorithm
• Searches images for features of a human face
[Figure: a feature within a detection window scanned over an image (photo: Sonia Sotomayor)]
• GPU performance competitive with FPGAs, but far lower development cost
• Jason Oberg, Daniel Hefenbrock, Tan Nguyen [CSE 260, fa’09 → fccm’10]
How do we know if we’ve succeeded? • Capability
We solved a problem that we couldn’t solve before, or under conditions that were not possible previously
• Performance
Solve the same problem in less time than before
This can provide a capability if we are solving many problem instances
• The result achieved must justify the effort
Enable new scientific discovery
Is it that simple? • Simplified processor design, but more user control over the hardware resources
Redesign the software
Rethink the problem solving technique
• If we don’t use the parallelism, we lose it
Amdahl’s law: serial sections (a worked instance follows this list)
Von Neumann bottleneck
Load imbalances
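A worked instance of Amdahl’s law for the serial-sections item above (the fraction is illustrative, not from the slide): with parallelizable fraction f and p processors, speedup is bounded by S(p) = 1 / ((1 − f) + f/p). For f = 0.95 and p = 100, S = 1 / (0.05 + 0.0095) ≈ 16.8, far below the ideal 100; the 5% serial section dominates.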
Why is parallel programming challenging? • A well behaved single processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically • There is no magic compiler that can turn a serial program into an efficient parallel program all the time and on all machines
Performance programming involving low-level details: heavily application dependent
Irregularity in the computation and its data structures forces us to think even harder
Users don’t start from scratch - they reuse old code. Poorly structured code, or code structured for older architectures, can entail costly reprogramming
The two mantras for high performance
• Domain-specific knowledge is important in optimizing performance, especially locality
• Significant time investment for each 10-fold increase in performance
[Figure: performance steps: today → ×10 → ×100]
Performance and Implementation Issues
• Conserve locality
• Hide latency
• Little’s Law [1961]
  # threads = performance × latency:  T = p × λ
  p = 1 - 8 flops/cycle, λ = 500 cycles/word
• Data motion cost is growing relative to computation, and λ is increasing with time
[Figures: a processor connected to memory (DRAM), with p operations in flight over a latency of λ cycles; data motion cost vs. computation over the years]
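A worked instance of Little’s Law using the parameters on this slide: with λ = 500 cycles/word, a processor completing p = 1 flop/cycle needs T = 1 × 500 = 500 operations in flight to hide memory latency, and at p = 8 flops/cycle it needs T = 8 × 500 = 4000. This is why GPUs and multicore processors lean on massive multithreading to keep their functional units busy.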