Lecture 1: Introduction

Welcome to CSE 260!

•  Your instructor is Scott Baden
   - [email protected]
   - Office hours in EBU3B, Room 3244: TBD; suggested time is after class
•  Your TA is Edmund Wong
   - [email protected]
•  The class home page is http://www.cse.ucsd.edu/classes/fa12/cse260-b
•  CSME?
•  Logins
   - Moodle
   - Bang
   - Lilliput

Text and readings

•  Required texts
   - Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010)
   - An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
•  Assigned class readings will include handouts and on-line material
•  Texts are on reserve in the S&E library
•  Lecture slides: www.cse.ucsd.edu/classes/fa12/cse260-b/Lectures

Course Requirements

•  Programming labs: 60%
   - Teams of 2 or 3
   - Includes a lab report
   - Find a partner using the "looking for a partner" Moodle forum
•  Project: 40%
   - Teams of 2 or 3
   - See the list of projects
   - Polished lab report

Policies

•  Academic Integrity
   - Do your own work
   - Plagiarism and cheating will not be tolerated
•  By taking this course, you implicitly agree to abide by the course policies:
   www.cse.ucsd.edu/classes/fa12/cse260-b/Policies.html

Labs and Scheduling

•  Room full of Mac Minis with GPUs
   - Lab sessions in APM
•  One lecture to be rescheduled
   - Weds 11/14 → TBD
   - Will come earlier in the quarter
   - You will be working on projects by then

Programming Laboratories

•  3 labs
   - #1: Performance programming a single core
   - #2: GPU (Fermi, NERSC's Dirac) - CUDA
   - #3: MPI on a cluster (Bang)
•  Project: 3 or 4 weeks
•  Teams of 2 or 3
•  Teams of 3 will have a larger workload than teams of 2
•  Establish a schedule with your partner from the start

Course overview and background

•  How to solve computationally intensive problems on parallel computers
   - Software techniques
   - Performance tradeoffs
•  Background
   - Graduate standing
   - Recommended: computer architecture (CSE 240A)
   - Students outside CSE are welcome
   - See me if you are unsure about your background
•  Prior experience
   - Parallel computation?
   - Numerical analysis?

Background Markers

•  C/C++
•  Java
•  Fortran?
•  Navier-Stokes equations, e.g. $\nabla \cdot u = 0$ and $\frac{D\rho}{Dt} + \rho\,(\nabla \cdot v) = 0$
•  Taylor series: $f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots$
•  Sparse factorization
•  TLB misses
•  Multithreading
•  MPI
•  CUDA, GPUs
•  RPC
•  Abstract base class

Syllabus

•  Fundamentals
   - Motivation, system organization, hardware execution models, limits to performance, program execution models, theoretical models
•  Software and programming
   - Programming models and techniques: message passing, multithreading
   - Architectural considerations: GPUs and multicore
   - Higher level run time models, language support
   - CUDA, OpenMP, pthreads, MPI
•  Parallel algorithm design and implementation
   - Case studies to develop a repertoire of problem solving techniques: discretization, sorting, linear algebra, irregular problems
   - Data structures and their efficient implementation: load balancing and performance
   - Performance tradeoffs, evaluation, and tuning
   - Exascale computing: communication avoidance, resilience

What is parallel processing?

•  Decompose a workload onto simultaneously executing physical resources
•  Multiple processors co-operate to process a related set of tasks (tightly coupled)
•  Improve some aspect of performance
   - Speedup: 100 processors run 100× faster than one (formalized below)
   - Capability: tackle a larger problem, more accurately
   - Algorithmic, e.g. search
   - Locality: more cache memory and bandwidth
•  Resilience is an issue at the high end (Exascale: 10^18 flops/sec)
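For reference, here is the standard formalization behind the speedup bullet above (generic definitions, not taken from the slides):

```latex
% Speedup and parallel efficiency on p processors, where T(1) is the best
% serial running time and T(p) is the running time on p processors:
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
% "100 processors run 100x faster than one" means S(100) = 100,
% i.e. ideal efficiency E(100) = 1.
```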

Parallel Processing, Concurrency & Distributed Computing

•  Parallel processing
   - Performance (and capacity) is the main goal
   - More tightly coupled than distributed computation
•  Concurrency
   - Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions
   - Performance need not be the main goal
•  Distributed computation
   - Geographically distributed
   - Multiple resources computing & communicating unreliably
   - Cloud or Grid computing, large amounts of storage
   - Looser, coarser grained communication and synchronization
•  May or may not involve separate physical resources, e.g. multitasking ("Virtual Parallelism")

Granularity

•  A measure of how often a computation communicates, and at what scale
   - Distributed computer: a whole program
   - Multicomputer: a function, a loop nest
   - Multiprocessor: the above, plus memory references
   - Multicore: similar to a multiprocessor, but perhaps finer grained
   - GPU: a kernel thread
   - Instruction level parallelism: an instruction, a register

The impact of technology


Why is parallelism inevitable?

•  Physical limits on processor clock speed and heat dissipation
•  A parallel computer increases memory capacity and bandwidth as well as the computational rate

[Figures: an Nvidia GPU; average CPU clock speeds over time (via Bill Gropp), http://www.pcpitstop.com/research/cpu.asp]

Today's laptop would have been yesterday's supercomputer

•  Cray-1 Supercomputer
   - 80 MHz processor
   - 240 Mflops/sec
   - 8 Megabytes memory
   - Water cooled
   - 1.8 m H x 2.2 m W, 4 tons
   - Over $10M in 1976
•  MacBook Air
   - 1.6 GHz dual-core Intel Core i5
   - 2 x 8 x 1.6 = 25.6 Gflops/sec
   - 2 Gigabytes memory, 3 Megabytes shared cache
   - 64 GB Flash storage
   - Intel HD Graphics 3000, 256 MB shared DDR3 SDRAM
   - Color display
   - Wireless networking
   - Air cooled
   - ~1.7 x 30 x 19.2 cm, 1.08 kg
   - $999 in January 2012

Technological disruption

•  Transformational: modelling, healthcare, …
•  New capabilities
•  Changes the common wisdom for solving a problem, including the implementation

[Photos: Cray-1, 1976, 240 Megaflops; Connection Machine CM-2, 1987; ASCI Red, 1997, 1 Tflop; Beowulf cluster, late 1990s; Sony Playstation 3, 150 Gflops, 2006; Nvidia Tesla, 4.14 Tflops, 2009; Intel 48 core processor, 2009; Tilera 100 core processor, 2009]

Trends in Top 500 supercomputers

[Figure: performance trends from the Top 500 list, www.top500.org]

The age of the multi-core processor

•  On-chip parallel computer
•  IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
•  First dual-core laptops (2005-6)
•  GPUs (nVidia, ATI): a supercomputer on a desktop

[Die photo: realworldtech.com]

The GPU

•  Specialized many-core processor
•  Massively multithreaded, long vectors
•  Reduced on-chip memory per core
•  Explicitly manage the memory hierarchy (see the sketch below)

[Figure: peak GFLOPS of many-core GPUs (AMD, NVIDIA) vs. multicore CPUs (Intel, dual-core to quad-core), 2001-2009. Courtesy: John Owens; Christopher Dyken, SINTEF]
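To make those bullets concrete, here is a minimal CUDA sketch (mine, not from the slides): a 3-point averaging stencil in which each thread computes one element, and each thread block explicitly stages its tile plus a one-element halo in on-chip shared memory. All names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>

// Illustrative 3-point averaging stencil: each thread produces one output
// element; each block stages its tile (plus a halo cell on each side) in
// shared memory, i.e. the memory hierarchy is managed explicitly.
__global__ void stencil3(const float *in, float *out, int n) {
    extern __shared__ float tile[];                   // blockDim.x + 2 floats
    int gi = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int li = threadIdx.x + 1;                         // local index (slot 0 is halo)

    tile[li] = (gi < n) ? in[gi] : 0.0f;
    if (threadIdx.x == 0)                             // left halo cell
        tile[0] = (gi > 0) ? in[gi - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                // right halo cell
        tile[blockDim.x + 1] = (gi + 1 < n) ? in[gi + 1] : 0.0f;
    __syncthreads();                                  // wait until the tile is loaded

    if (gi < n)
        out[gi] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // Input data would normally be copied in here (omitted in this sketch).
    // Thousands of threads are launched at once: "massively multithreaded".
    stencil3<<<blocks, threads, (threads + 2) * sizeof(float)>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The shared-memory tile is the "explicitly managed memory hierarchy" in miniature: the programmer, not the hardware, decides what stays on chip.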

The impact

•  You are taking this class
•  A renaissance in parallel computation
•  Parallelism is no longer restricted to the HPC cult with large machine rooms; it is relevant to everyone
•  We all have a parallel computer at our fingertips
•  Users may or may not need to know they are using a parallel computer

Motivating Applications


A Motivating Application: TeraShake

•  Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA, using seismic, geophysical, and other data from the Southern California Earthquake Center
•  epicenter.usc.edu/cmeportal/TeraShake.html

How TeraShake Works

•  Divide up Southern California into blocks
•  For each block, get all the data about geological structures, fault information, …
•  Map the blocks onto processors of the supercomputer (a generic mapping is sketched below)
•  Run the simulation using current information on fault activity and on the physics of earthquakes

[Photo: SDSC machine room, DataCentral@SDSC]
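The slides do not spell out TeraShake's actual data distribution; the sketch below is a generic 2D block decomposition of the kind the "map the blocks onto processors" step refers to. All symbols (N_x, N_y, P_x, P_y) are illustrative assumptions.

```latex
% Assume (for illustration only) a uniform grid of N_x \times N_y blocks and
% P = P_x \times P_y processors, with P_x dividing N_x and P_y dividing N_y.
% Block (i,j), with 0 \le i < N_x and 0 \le j < N_y, is assigned to the rank
r(i,j) = \left\lfloor \frac{i}{N_x / P_x} \right\rfloor
       + P_x \left\lfloor \frac{j}{N_y / P_y} \right\rfloor
% so each processor owns a contiguous (N_x/P_x) \times (N_y/P_y) patch of blocks.
```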

Animation


Application #2: modeling the heart on a GPU

•  Fred Lionetti and Andrew McCulloch
•  nVidia GTX-295, single GPU: 240 single precision units @ 1.242 GHz
•  ×134 speedup compared with OpenMP running on a quad-core i7 @ 2.93 GHz
•  ×20 improvement over the entire simulation

Face detection with the Viola-Jones algorithm

•  Searches images for features of a human face by sweeping a detection window over the image (a simplified GPU sketch follows)
•  GPU performance competitive with FPGAs, but at far lower development cost
•  Jason Oberg, Daniel Hefenbrock, Tan Nguyen [CSE 260, fa'09 → fccm'10]

[Figure: detection window and feature overlaid on an image of Sonia Sotomayor]
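A rough illustration of why this workload suits a GPU (this is not the authors' code; all names, the feature shape, and the threshold are made up): each candidate window position gets its own thread, and a Haar-like rectangle feature is evaluated in constant time from a precomputed integral image. A real Viola-Jones cascade evaluates many features per window; this shows only the innermost building block, and the host code leaves data loading out.

```cuda
#include <cuda_runtime.h>

// Sum of pixels in the rectangle [x0,x1) x [y0,y1), computed in O(1) from an
// inclusive-prefix-sum "integral image" ii of size w x h (row-major).
__device__ float rectSum(const float *ii, int w, int x0, int y0, int x1, int y1) {
    float A = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0.0f;
    float B = (y0 > 0) ? ii[(y0 - 1) * w + (x1 - 1)] : 0.0f;
    float C = (x0 > 0) ? ii[(y1 - 1) * w + (x0 - 1)] : 0.0f;
    float D = ii[(y1 - 1) * w + (x1 - 1)];
    return D - B - C + A;
}

// One thread per candidate window position: evaluate a single two-rectangle
// Haar-like feature (upper half vs. lower half of the window) and record
// whether its response exceeds an arbitrary threshold.
__global__ void evalFeature(const float *ii, int w, int h,
                            int win, float thresh, unsigned char *hit) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x + win > w || y + win > h) return;           // window must fit in the image
    float top    = rectSum(ii, w, x, y,           x + win, y + win / 2);
    float bottom = rectSum(ii, w, x, y + win / 2, x + win, y + win);
    hit[y * w + x] = (bottom - top > thresh) ? 1 : 0;
}

int main() {
    const int w = 640, h = 480, win = 24;             // illustrative sizes
    float *d_ii;
    unsigned char *d_hit;
    cudaMalloc(&d_ii, w * h * sizeof(float));
    cudaMalloc(&d_hit, w * h);
    // The integral image would be computed and copied into d_ii here (omitted).
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    evalFeature<<<grid, block>>>(d_ii, w, h, win, 100.0f, d_hit);
    cudaDeviceSynchronize();
    cudaFree(d_ii);
    cudaFree(d_hit);
    return 0;
}
```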

How do we know if we've succeeded?

•  Capability
   - We solved a problem that we couldn't solve before, or under conditions that were not possible previously
•  Performance
   - Solve the same problem in less time than before
   - This can provide a capability if we are solving many problem instances
•  The result achieved must justify the effort
   - Enable new scientific discovery

Is it that simple?

•  Simplified processor design, but more user control over the hardware resources
   - Redesign the software
   - Rethink the problem solving technique
•  If we don't use the parallelism, we lose it
   - Amdahl's law: serial sections (worked example below)
   - Von Neumann bottleneck
   - Load imbalances
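A worked instance of the Amdahl's law bullet (the formula is standard; the 95% figure is just an example, not from the slides):

```latex
% If a fraction f of the work is parallelizable and the rest is serial,
% the best possible speedup on p processors is
S(p) = \frac{1}{(1 - f) + f/p}
% e.g. with f = 0.95 and p = 100 processors:
S(100) = \frac{1}{0.05 + 0.95/100} \approx 16.8
% Even a 5% serial section caps the speedup far below 100.
```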

Why is parallel programming challenging?

•  A well-behaved single processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically
•  There is no magic compiler that can turn a serial program into an efficient parallel program all the time and on all machines
   - Performance programming involves low-level details and is heavily application dependent
   - Irregularity in the computation and its data structures forces us to think even harder
   - Users don't start from scratch: they reuse old code. Poorly structured code, or code structured for older architectures, can entail costly reprogramming

The two mantras for high performance

•  Domain-specific knowledge is important in optimizing performance, especially locality
•  Expect a significant time investment for each 10-fold increase in performance

[Figure: stair-step schematic from "Today" to ×10 to ×100 performance]

Performance and Implementation Issues

•  Conserve locality
•  Hide latency
•  Little's Law [1961]: # threads = performance × latency, i.e. T = p × λ (worked numbers below)
   - p = 1 to 8 flops/cycle, λ = 500 cycles/word
•  Data motion cost is growing relative to computation, and λ is increasing with time

[Figure: the processor-memory (DRAM) performance gap widening over the years; schematic of p outstanding requests, each with latency λ. Image: fotosearch.com]
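Plugging the slide's own numbers into Little's Law:

```latex
% Concurrency needed to hide memory latency:
T = p \times \lambda
% With p = 8 flops/cycle and \lambda = 500 cycles/word (the figures above),
T = 8 \times 500 = 4000
% i.e. on the order of 4000 independent operations must be in flight
% to keep the processor busy while waiting on memory.
```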