Lecture 1 Introduction
Welcome to CSE 260! • Your instructor is Scott Baden
[email protected]
Office hours in EBU3B Room 3244: TBD (suggest after class)
• Your TA is Edmund Wong
[email protected]
• The class home page is http://www.cse.ucsd.edu/classes/fa12/cse260-b
• CSME?
• Logins: Moodle, Bang, Lilliput
Text and readings • Required texts
Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010)
An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
• Assigned class readings will include handouts and on-line material
• Texts on reserve in the S&E library
• Lecture slides: www.cse.ucsd.edu/classes/fa12/cse260-b/Lectures
Course Requirements • Programming labs: 60%
Teams of 2 or 3
Includes a lab report
Find a partner using the “looking for a partner” Moodle forum
• Project: 40%
Teams of 2 or 3
See list of projects
Polished lab report
Policies • Academic Integrity
Do your own work
Plagiarism and cheating will not be tolerated
• By taking this course, you implicitly agree to abide by the course policies: www.cse.ucsd.edu/classes/fa12/cse260-b/Policies.html
Labs and Scheduling • Room full of Mac Minis with GPU
Lab sessions in APM
• One lecture to be rescheduled: Weds 11/14 → TBD
  Will come earlier in the quarter
  You will be working on projects by then
Programming Laboratories
• 3 labs
  #1: Performance programming a single core
  #2: GPU Fermi (NERSC’s Dirac) - CUDA
  #3: MPI on a cluster (Bang) - MPI
• Project: 3 or 4 weeks
  Teams of 2 or 3
  Teams of 3 will have a larger workload than teams of 2
  Establish a schedule with your partner from the start
Course overview and background • How to solve computationally intensive problems on parallel computers
Software techniques Performance tradeoffs
• Background
Graduate standing
Recommended: computer architecture (CSE 240A)
Students outside CSE are welcome
See me if you are unsure about your background
• Prior experience
Parallel computation?
Numerical analysis?
Background Markers
• C/C++
• Java
• Fortran?
• Navier Stokes Equations: ∇·u = 0,  Dρ/Dt + ρ(∇·v) = 0
• Sparse factorization
• TLB misses
• Multithreading
• MPI
• CUDA, GPUs
• RPC
• Abstract base class
• Taylor series: f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + …
Syllabus
• Fundamentals
  Motivation, system organization, hardware execution models, limits to performance, program execution models, theoretical models
• Software and programming
  Programming models and techniques: message passing, multithreading
  Architectural considerations: GPUs and multicore
  Higher level run time models, language support
  CUDA, OpenMP, Pthreads, MPI (a minimal multithreading sketch follows this list)
• Parallel algorithm design and implementation
  Case studies to develop a repertoire of problem solving techniques: discretization, sorting, linear algebra, irregular problems
  Data structures and their efficient implementation: load balancing and performance
  Performance tradeoffs, evaluation, and tuning
  Exascale computing: communication avoidance, resilience
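As a quick preview of the multithreading model listed above, here is a minimal OpenMP sketch in C. It is illustrative only and not taken from the course materials; the array, its size, and all names are assumptions. The message-passing (MPI) and GPU (CUDA) models covered later distribute the same kind of loop across processes or device threads instead.

    /* Minimal OpenMP sketch (illustrative, not course material): each thread
       sums a chunk of the iterations; reduction(+:sum) combines the partial sums. */
    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 20)
    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;    /* fill with known values */

        #pragma omp parallel for reduction(+:sum)  /* distribute iterations over threads */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.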
What is parallel processing? • Decompose a workload onto simultaneously executing physical resources • Multiple processors co-operate to process a related set of tasks – tightly coupled • Improve some aspect of performance
Speedup: 100 processors run ×100 faster than one
Capability: tackle a larger problem, more accurately
Algorithmic, e.g. search
Locality: more cache memory and bandwidth
• Resilience is an issue at the high end (exascale: 10^18 flops/sec)
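To make the speedup bullet concrete (the numbers are illustrative, not from the slide): if one processor takes T(1) = 100 seconds and p processors take T(p) seconds, speedup is S(p) = T(1)/T(p). Perfect scaling on p = 100 processors would give T(100) = 1 second, i.e. S(100) = 100; in practice serial sections and communication keep S(p) below p.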
Parallel Processing, Concurrency & Distributed Computing
• Parallel processing
  Performance (and capacity) is the main goal
  More tightly coupled than distributed computation
• Concurrency
  Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions
  Performance need not be the main goal
• Distributed computation
  Geographically distributed
  Multiple resources computing & communicating unreliably
  Cloud or Grid computing, large amounts of storage
  Looser, coarser grained communication and synchronization
• May or may not involve separate physical resources, e.g. multitasking (“Virtual Parallelism”)
Granularity
• A measure of how often a computation communicates, and at what scale
  Distributed computer: a whole program
  Multicomputer: a function, a loop nest
  Multiprocessor: a memory reference
  Multicore: similar to a multiprocessor but perhaps finer grained
  GPU: kernel thread
  Instruction level parallelism: instruction, register
The impact of technology
Why is parallelism inevitable? • Physical limits on processor clock speed and heat dissipation • A parallel computer increases memory capacity and bandwidth as well as the computational rate
[Figures: Nvidia GPU; average CPU clock speeds over time (via Bill Gropp), http://www.pcpitstop.com/research/cpu.asp]
Today’s laptop would have been yesterday’s supercomputer
• Cray-1 Supercomputer
  80 MHz processor
  240 Mflops/sec
  8 Megabytes memory
  Water cooled
  1.8m H x 2.2m W
  4 tons
  Over $10M in 1976
• MacBook Air
  1.6GHz dual-core Intel Core i5
  2 cores × 8 flops/cycle × 1.6 GHz = 25.6 Gflops/sec
  2 Gigabytes memory, 3 Megabytes shared cache
  64GB Flash storage
  Intel HD Graphics 3000, 256MB shared DDR3 SDRAM
  Color display
  Wireless networking
  Air cooled
  ~1.7 x 30 x 19.2 cm, 1.08 kg
  $999 in January 2012
Technological disruption
• Transformational: modelling, healthcare…
• New capabilities
• Changes the common wisdom for solving a problem, including the implementation
[Pictured systems: Cray-1, 1976, 240 Megaflops; Connection Machine CM-2, 1987; ASCI Red, 1997, 1 Tflop; Beowulf cluster, late 1990s; Sony Playstation 3, 150 Gflops, 2006; Nvidia Tesla, 4.14 Tflops, 2009; Intel 48-core processor, 2009; Tilera 100-core processor, 2009]
Trends in Top 500 supercomputers
www.top500.org
The age of the multi-core processor
• On chip parallel computer
• IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
• First dual core laptops (2005-6)
• GPUs (nVidia, ATI): a supercomputer on a desktop
[Image: realworldtech.com]
The GPU
• Specialized many-core processor
• Massively multithreaded, long vectors
• Reduced on-chip memory per core
• Explicitly manage the memory hierarchy
[Figure: GFLOPS vs. year, 2001-2009, for AMD (GPU), NVIDIA (GPU), and Intel (CPU); many-core GPUs pull ahead of multicore CPUs (dual-core, quad-core). Courtesy: John Owens; Christopher Dyken, SINTEF]
The impact
• You are taking this class
• A renaissance in parallel computation
• Parallelism is no longer restricted to the HPC cult with large machine rooms; it is relevant to everyone
• We all have a parallel computer at our fingertips
• Users may or may not need to know they are using a parallel computer
Motivating Applications
A Motivating Application - TeraShake
• Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA, using seismic, geophysical, and other data from the Southern California Earthquake Center
epicenter.usc.edu/cmeportal/TeraShake.html
How TeraShake Works
• Divide up Southern California into blocks
• For each block, get all the data about geological structures, fault information, …
• Map the blocks onto processors of the supercomputer (a decomposition sketch appears below)
• Run the simulation using current information on fault activity and on the physics of earthquakes
[Photo: SDSC machine room, DataCentral@SDSC]
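A minimal sketch in C of the block-decomposition step described above; this is not the actual TeraShake code, and N (grid points per dimension) and P (processor count) are hypothetical values. It simply shows how a regular grid is carved into nearly equal slabs, one per processor.

    /* Hypothetical block decomposition: split the z axis of an N x N x N grid
       into P nearly equal slabs, one slab per processor (rank). */
    #include <stdio.h>

    int main(void)
    {
        const int N = 3000;          /* assumed grid points per dimension */
        const int P = 240;           /* assumed number of processors      */
        const int base = N / P, rem = N % P;

        for (int rank = 0; rank < P; rank++) {
            int z_lo = rank * base + (rank < rem ? rank : rem);
            int z_hi = z_lo + base + (rank < rem ? 1 : 0);   /* exclusive upper bound */
            if (rank < 3 || rank == P - 1)                   /* print a few sample ranks */
                printf("rank %3d owns z planes [%4d, %4d)\n", rank, z_lo, z_hi);
        }
        return 0;
    }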
Animation
Application #2: modeling the heart on a GPU
• Fred Lionetti and Andrew McCulloch
• nVidia GTX-295, single GPU: 240 single precision units @ 1.242 GHz
• ×134 speedup compared with OpenMP running on a quad core i7 @ 2.93GHz
• ×20 improvement for the entire simulation
Face detection with Viola-Jones algorithm
• Searches images for features of a human face
[Figure: a feature within a detection window scanned over an image (photo: Sonia Sotomayor)]
• GPU performance competitive with FPGAs, but far lower development cost
• Jason Oberg, Daniel Hefenbrock, Tan Nguyen [CSE 260, fa’09 → fccm’10]
How do we know if we’ve succeeded? • Capability
We solved a problem that we couldn’t solve before, or under conditions that were not possible previously
• Performance
Solve the same problem in less time than before
This can provide a capability if we are solving many problem instances
• The result achieved must justify the effort
Enable new scientific discovery
Is it that simple? • Simplified processor design, but more user control over the hardware resources
Redesign the software
Rethink the problem solving technique
• If we don’t use the parallelism, we lose it
Amdahl’s law: serial sections (a worked instance follows this list)
Von Neumann bottleneck
Load imbalances
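A worked instance of Amdahl’s law for the serial-sections item above (the fraction is illustrative, not from the slide): with parallelizable fraction f and p processors, speedup is bounded by S(p) = 1 / ((1 − f) + f/p). For f = 0.95 and p = 100, S = 1 / (0.05 + 0.0095) ≈ 16.8, far below the ideal 100; the 5% serial section dominates.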
Why is parallel programming challenging? • A well behaved single processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically • There is no magic compiler that can turn a serial program into an efficient parallel program all the time and on all machines
Performance programming involving low-level details: heavily application dependent
Irregularity in the computation and its data structures forces us to think even harder
Users don’t start from scratch - they reuse old code. Poorly structured code, or code structured for older architectures, can entail costly reprogramming
The two mantras for high performance
• Domain-specific knowledge is important in optimizing performance, especially locality
• Significant time investment for each 10-fold increase in performance
[Figure: performance steps: today → ×10 → ×100]
Performance and Implementation Issues
• Conserve locality
• Hide latency
• Little’s Law [1961]
  # threads = performance × latency:  T = p × λ
  p = 1 - 8 flops/cycle, λ = 500 cycles/word
• Data motion cost is growing relative to computation, and λ is increasing with time
[Figures: a processor connected to memory (DRAM), with p operations in flight over a latency of λ cycles; data motion cost vs. computation over the years]
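A worked instance of Little’s Law using the parameters on this slide: with λ = 500 cycles/word, a processor completing p = 1 flop/cycle needs T = 1 × 500 = 500 operations in flight to hide memory latency, and at p = 8 flops/cycle it needs T = 8 × 500 = 4000. This is why GPUs and multicore processors lean on massive multithreading to keep their functional units busy.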