Lecture 1 - Computer Science Division


CS267: Applications of Parallel Computers
Lecture 1: Introduction

Jim Demmel, EECS & Math Departments, [email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr14/

Outline
•  Why powerful computers must be parallel processors
   -  all of them, including your laptops and handhelds
•  Large Computational Science and Engineering (CSE) problems require powerful computers
   -  commercial problems too
•  Why writing (fast) parallel programs is hard
   -  but things are improving
•  Structure of the course

Units of Measure
•  High Performance Computing (HPC) units are:
   -  Flop: floating point operation, usually double precision unless noted
   -  Flop/s: floating point operations per second
   -  Bytes: size of data (a double precision floating point number is 8 bytes)
•  Typical sizes are millions, billions, trillions…
   -  Mega:  Mflop/s = 10^6 flop/sec;  Mbyte = 2^20 = 1048576 ~ 10^6 bytes
   -  Giga:  Gflop/s = 10^9 flop/sec;  Gbyte = 2^30 ~ 10^9 bytes
   -  Tera:  Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
   -  Peta:  Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
   -  Exa:   Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
   -  Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
   -  Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes
•  Current fastest (public) machine ~ 55 Pflop/s, 3.1M cores
   -  Up-to-date list at www.top500.org
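As a quick use of these units: dividing the quoted peak rate by the core count gives a per-core rate. A minimal Python sketch; the 55 Pflop/s and 3.1M-core figures are the approximate values quoted above.

    # Per-core rate of the fastest (public) machine, using the figures above
    peak_flop_s = 55e15          # ~55 Pflop/s aggregate
    cores = 3.1e6                # ~3.1 million cores
    print(f"{peak_flop_s / cores / 1e9:.1f} Gflop/s per core")   # ~17.7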

Why powerful computers are parallel, circa 1991-2006 (since 2007: all computers)

Tunnel Vision by Experts
•  "I think there is a world market for maybe five computers."
   -  Thomas Watson, chairman of IBM, 1943.
•  "There is no reason for any individual to have a computer in their home."
   -  Ken Olson, president and founder of Digital Equipment Corporation, 1977.
•  "640K [of memory] ought to be enough for anybody."
   -  Bill Gates, chairman of Microsoft, 1981.
•  "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
   -  Ken Kennedy, CRPC Director, 1994
Slide source: Warfield et al.

Technology Trends: Microprocessor Capacity
•  Moore's Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
•  2X transistors/chip every 1.5 years: called "Moore's Law"
•  Microprocessors have become smaller, denser, and more powerful.
[Chart: transistor counts per chip over time, illustrating Moore's Law]

Microprocessor Transistors / Clock (1970-2000)
[Chart: transistors (thousands) and clock frequency (MHz), 1970-2000]
Slide source: Jack Dongarra

Impact of Device Shrinkage
•  What happens when the feature size (transistor size) shrinks by a factor of x?
•  Clock rate goes up by x because wires are shorter
   -  actually less than x, because of power consumption
•  Transistors per unit area goes up by x^2
•  Die size also tends to increase
   -  typically another factor of ~x
•  Raw computing power of the chip goes up by ~x^4 !
   -  typically x^3 of this is devoted to either on-chip
      -  parallelism: hidden parallelism such as ILP
      -  locality: caches
•  So most programs run x^3 times faster, without changing them
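A worked example of the shrinkage scaling above, as a small Python sketch (the shrink factor x = 2 is an illustrative value of mine, not taken from the slides):

    # Idealized effect of shrinking the feature size by a factor x
    x = 2.0
    clock_gain = x                                     # wires are shorter (less in practice, due to power)
    transistor_gain = x**2 * x                         # x^2 from density, ~x from a larger die => ~x^3
    raw_compute_gain = clock_gain * transistor_gain    # ~x^4
    print(transistor_gain, raw_compute_gain)           # 8.0 16.0 for x = 2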

Manufacturing Issues Limit Performance
•  Moore's 2nd law (Rock's law): manufacturing costs go up
   -  Demo of 0.06 micron CMOS (Source: Forbes Magazine)
•  Yield
   -  What percentage of the chips are usable?
   -  E.g., Cell processor (PS3) was sold with 7 out of 8 cores "on" to improve yield
•  Manufacturing costs and yield problems limit use of density

Power Density Limits Serial Performance
[Chart: power density (W/cm^2) of microprocessors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 through Pentium and P6, 1970-2010, compared with a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]
Scaling clock speed (business as usual) will not work
•  Concurrent systems are more power efficient
   -  Dynamic power is proportional to V^2 f C
   -  Increasing frequency (f) also increases supply voltage (V) → cubic effect
   -  Increasing cores increases capacitance (C), but only linearly
   -  Save power by lowering clock speed (see the sketch below)
•  High performance serial processors waste power
   -  Speculation, dynamic dependence checking, etc. burn power
   -  Implicit parallelism discovery
•  More transistors, but not faster serial processors

Revolution in Processors
[Chart: transistors (thousands), clock frequency (MHz), power (W), and number of cores, 1970-2010]
•  Chip density is continuing to increase ~2x every 2 years
•  Clock speed is not
•  Number of processor cores may double instead
•  Power is under control, no longer growing

Parallelism in 2014?
•  These arguments are no longer theoretical
•  All major processor vendors are producing multicore chips
   -  Every machine will soon be a parallel machine
   -  To keep doubling performance, parallelism must double
•  Which (commercial) applications can use this parallelism?
   -  Do they have to be rewritten from scratch?
•  Will all programmers have to be parallel programmers?
   -  New software model needed
   -  Try to hide complexity from most programmers, eventually
   -  In the meantime, need to understand it
•  Computer industry is betting on this big change, but does not have all the answers
   -  Berkeley ParLab, then ASPIRE, established to work on this
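The sketch referenced above: a minimal illustration of why many slower cores beat one faster core under the dynamic-power model P ~ C·V^2·f. The assumption that supply voltage scales roughly linearly with frequency is mine, chosen to reproduce the slide's "cubic effect"; all numbers are illustrative.

    # Dynamic power ~ C * V^2 * f, with V assumed roughly proportional to f
    def dynamic_power(cores, freq):
        C = 1.0 * cores          # capacitance grows roughly linearly with core count
        V = 1.0 * freq           # assumption: voltage tracks frequency
        return C * V**2 * freq   # => power ~ cores * freq^3

    base = dynamic_power(cores=1, freq=1.0)
    print(dynamic_power(cores=1, freq=2.0) / base)  # 1 core at 2x clock: ~8x the power
    print(dynamic_power(cores=2, freq=1.0) / base)  # 2 cores at 1x clock: ~2x the power
    # Both give ~2x the nominal throughput; the parallel option uses far less power.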

Memory is Not Keeping Pace
Technology trends work against a constant or increasing memory per core
•  Memory density is doubling every three years; processor logic is doubling every two
•  Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
[Charts: Cost of Computation vs. Memory (Source: David Turek, IBM); memory density trend (Source: IBM)]
Question: Can you double concurrency without doubling memory?
•  Strong scaling: fixed problem size, increase number of processors
•  Weak scaling: grow problem size proportionally to number of processors

The TOP500 Project
•  Listing the 500 most powerful computers in the world
•  Yardstick: Rmax of Linpack
   -  Solve Ax=b, dense problem, matrix is random
   -  Dominated by dense matrix-matrix multiply
   -  (A toy version of the measurement is sketched below.)
•  Updated twice a year:
   -  ISC'xy in June in Germany
   -  SCxy in November in the U.S.
•  All information available from the TOP500 web site at: www.top500.org
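For intuition about the Rmax yardstick: the benchmark times a dense solve of Ax=b and divides the standard LU flop count by the elapsed time. A toy single-node version (not the real HPL benchmark; assumes NumPy is installed):

    import time
    import numpy as np

    n = 2000                                  # toy size; real HPL runs use much larger n
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
    t1 = time.perf_counter()

    flops = (2.0 / 3.0) * n**3                # standard operation count for LU
    print(f"~{flops / (t1 - t0) / 1e9:.1f} Gflop/s on this solve")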

42nd List: The TOP10 in November 2013

#  | Site                                                | Manufacturer | Computer                                                        | Country     | Cores     | Rmax [Pflops] | Power [MW]
1  | National University of Defense Technology           | NUDT         | Tianhe-2, NUDT TH-IVB-FEP, Xeon 12C 2.2GHz, Intel Xeon Phi      | China       | 3,120,000 | 33.9          | 17.8
2  | Oak Ridge National Laboratory                        | Cray         | Titan, Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        | USA         | 560,640   | 17.6          | 8.21
3  | Lawrence Livermore National Laboratory               | IBM          | Sequoia, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | USA         | 1,572,864 | 17.2          | 7.89
4  | RIKEN Advanced Institute for Computational Science   | Fujitsu      | K Computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect            | Japan       | 795,024   | 10.5          | 12.7
5  | Argonne National Laboratory                          | IBM          | Mira, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  | USA         | 786,432   | 8.59          | 3.95
6  | Swiss National Supercomputing Centre (CSCS)          | Cray         | Piz Daint, Cray XC30, Xeon E5 8C 2.6GHz, Aries, NVIDIA K20x     | Switzerland | 115,984   | 6.27          | 2.33
7  | Texas Advanced Computing Center/UT                   | Dell         | Stampede, PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    | USA         | 462,462   | 5.17          | 4.51
8  | Forschungszentrum Juelich (FZJ)                      | IBM          | JUQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | Germany     | 458,752   | 5.01          | 2.30
9  | Lawrence Livermore National Laboratory               | IBM          | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                | USA         | 393,216   | 4.29          | 1.97
10 | Leibniz Rechenzentrum                                | IBM          | SuperMUC, iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  | Germany     | 147,456   | 2.90          | 3.52

42nd List: The TOP9 + the one you'll use
The same ranks 1-9 as above, plus the machine used in this course:

28 | Lawrence Berkeley National Laboratory                | Cray         | Hopper, Cray XE6, Opteron 12C 2.1 GHz, Gemini                   | USA         | 153,408   | 1.05          | 2.91

The TOP10 in November 2012

Rank | Site                                                | Manufacturer | Computer                                                        | Country | Cores     | Rmax [Pflops] | Power [MW]
1    | Oak Ridge National Laboratory                        | Cray         | Titan, Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        | USA     | 560,640   | 17.59         | 8.21
2    | Lawrence Livermore National Laboratory               | IBM          | Sequoia, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | USA     | 1,572,864 | 16.32         | 7.89
3    | RIKEN Advanced Institute for Computational Science   | Fujitsu      | K computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect            | Japan   | 705,024   | 10.51         | 12.66
4    | Argonne National Laboratory                          | IBM          | Mira, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  | USA     | 786,432   | 8.16          | 3.95
5    | Forschungszentrum Juelich (FZJ)                      | IBM          | JUQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | Germany | 393,216   | 4.14          | 1.97
6    | Leibniz Rechenzentrum                                | IBM          | SuperMUC, iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  | Germany | 147,456   | 2.90          | 3.42
7    | Texas Advanced Computing Center/UT                   | Dell         | Stampede, PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    | USA     | 204,900   | 2.66          | -
8    | National SuperComputer Center in Tianjin             | NUDT         | Tianhe-1A, NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C             | China   | 186,368   | 2.57          | 4.04
9    | CINECA                                               | IBM          | Fermi, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                 | Italy   | 163,840   | 1.73          | 0.82
10   | IBM                                                  | IBM          | DARPA Trial Subset, Power 775, Power7 8C 3.84GHz, Custom        | USA     | 63,360    | 1.52          | 3.58

The TOP10 in November 2012, plus one
The same ranks 1-10 as above, plus:

19   | Lawrence Berkeley National Laboratory                | Cray         | Hopper, Cray XE6, Opteron 12C 2.1 GHz, Gemini                   | USA     | 153,408   | 1.054         | 2.91

Performance Development (Nov 2012)
[Chart: TOP500 performance over time, 1993-2012. In November 2012: SUM = 162 PFlop/s, N=1 = 17.6 PFlop/s, N=500 = 76.5 TFlop/s. In 1993: SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s.]

Performance Development (Nov 2013)
[Chart: TOP500 performance over time, 1993-2013. In November 2013: SUM = 250 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 118 TFlop/s; the 1993 starting points are the same as above.]

Projected Performance Development (Nov 2013)
[Chart: extrapolation of the TOP500 SUM, N=1, and N=500 trend lines toward 1 Eflop/s.]

Projected Performance Development (Nov 2012)
[Chart: the same extrapolation based on the November 2012 list.]

Core Count
[Chart]

Moore's Law reinterpreted
•  Number of cores per chip can double every two years
•  Clock speed will not increase (possibly decrease)
•  Need to deal with systems with millions of concurrent threads
•  Need to deal with inter-chip parallelism as well as intra-chip parallelism

Outline (recap)
•  Why powerful computers must be parallel processors (all, including your laptops and handhelds)
•  Large CSE problems require powerful computers (commercial problems too)
•  Why writing (fast) parallel programs is hard (but things are improving)
•  Structure of the course

Computational Science - News
"An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing … to the integration of computer science concepts, tools, and theorems into the very fabric of science."
   -  Science 2020 Report, March 2006
(See also Nature, March 23, 2006.)

Drivers for Change
•  Continued exponential increase in computational power
   -  Can simulate what theory and experiment can't do
   -  Simulation is becoming the third pillar of science, complementing theory and experiment
•  Continued exponential increase in experimental data
   -  Moore's Law applies to sensors too
   -  Need to analyze all that data
   -  Techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications

Simulation: The Third Pillar of Science
•  Traditional scientific and engineering method:
   (1) Do theory or paper design
   (2) Perform experiments or build system
•  Limitations:
   -  Too difficult: build large wind tunnels
   -  Too expensive: build a throw-away passenger jet
   -  Too slow: wait for climate or galactic evolution
   -  Too dangerous: weapons, drug design, climate experimentation
•  Computational science and engineering paradigm:
   (3) Use computers to simulate and analyze the phenomenon
   -  Based on known physical laws and efficient numerical methods
   -  Analyze simulation results with computational tools and methods beyond what is possible manually
[Diagram: Theory, Experiment, Simulation as the three pillars]

Data Driven Science
•  Scientific data sets are growing exponentially
   -  Ability to generate data is exceeding our ability to store and analyze
   -  Simulation systems and some observational devices grow in capability with Moore's Law
•  Petabyte (PB) data sets will soon be common:
   -  Climate modeling: estimates of the next IPCC data are in the 10s of petabytes
   -  Genome: JGI alone will have 0.5 petabyte of data this year and double each year
   -  Particle physics: LHC is projected to produce 16 petabytes of data per year
   -  Astrophysics: LSST and others will produce 5 petabytes/year (via 3.2 Gigapixel camera)
•  Create scientific communities with "Science Gateways" to data

Some Particularly Challenging Computations
•  Science
   -  Global climate modeling
   -  Biology: genomics; protein folding; drug design
   -  Astrophysical modeling
   -  Computational chemistry
   -  Computational material sciences and nanosciences
•  Engineering
   -  Semiconductor design
   -  Earthquake and structural modeling
   -  Computational fluid dynamics (airplane design)
   -  Combustion (engine design)
   -  Crash simulation
•  Business
   -  Financial and economic modeling
   -  Transaction processing, web services and search engines
•  Defense
   -  Nuclear weapons: test by simulations
   -  Cryptography

Economic Impact of HPC
•  Airlines:
   -  System-wide logistics optimization systems on parallel systems.
   -  Savings: approx. $100 million per airline per year.
•  Automotive design:
   -  Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics.
   -  One company has a 500+ CPU parallel system.
   -  Savings: approx. $1 billion per company per year.
•  Semiconductor industry:
   -  Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
   -  Savings: approx. $1 billion per company per year.
•  Energy:
   -  Computational modeling improved performance of current nuclear power plants, equivalent to building two new power plants.

$5B World Market in Technical Computing in 2004
[Chart: market share by segment, 1998-2003: Scientific Research and R&D; Mechanical Design/Engineering Analysis; Mechanical Design and Drafting; Imaging; Geoscience and Geoengineering; Electrical Design/Engineering Analysis; Economics/Financial; Digital Content Creation and Distribution; Classified Defense; Chemical Engineering; Biosciences; Simulation; Other Technical Management and Support]
Source: IDC 2004, from NRC Future of Supercomputing Report

What Supercomputers Do – Two Examples
•  Climate modeling
   -  simulation replacing an experiment that is too slow
•  Cosmic microwave background radiation
   -  analyzing massive amounts of data with new tools

Global Climate Modeling Problem
•  Problem is to compute: f(latitude, longitude, elevation, time) → "weather" = (temperature, pressure, humidity, wind velocity)
•  Approach:
   -  Discretize the domain, e.g., a measurement point every 10 km
   -  Devise an algorithm to predict weather at time t+δt given t
•  Uses:
   -  Predict major events, e.g., El Niño
   -  Use in setting air emissions standards
   -  Evaluate global warming scenarios
Source: http://www.epm.ornl.gov/chammp/chammp.html

Global Climate Modeling Computation
•  One piece is modeling the fluid flow in the atmosphere
   -  Solve Navier-Stokes equations
   -  Roughly 100 Flops per grid point with a 1 minute timestep
•  Computational requirements (the arithmetic is sketched below):
   -  To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
   -  Weather prediction (7 days in 24 hours) → 56 Gflop/s
   -  Climate prediction (50 years in 30 days) → 4.8 Tflop/s
   -  To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
•  To double the grid resolution, computation is 8x to 16x
•  State of the art models require integration of atmosphere, clouds, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more
•  Current models are coarser than this

High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
[Figure: simulated precipitation (millimeters/day)]
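The flop-rate requirements above are simple arithmetic on the slide's figure of 5 x 10^11 flops per 1-minute timestep; a short Python sketch (small differences from the slide's numbers come from rounding):

    # Flop rates needed for climate/weather simulation, from the slide's figures
    flops_per_step = 5e11                     # flops per simulated minute
    realtime = flops_per_step / 60.0          # simulate one minute per minute of compute
    print(f"real time: {realtime / 1e9:.0f} Gflop/s")                         # ~8 Gflop/s

    def rate(sim_days, wall_hours):
        """Rate needed to simulate sim_days of climate in wall_hours of compute."""
        steps = sim_days * 24 * 60            # one step per simulated minute
        return steps * flops_per_step / (wall_hours * 3600.0)

    print(f"7 days in 24 hours:   {rate(7, 24) / 1e9:.0f} Gflop/s")           # ~58  (slide: 56)
    print(f"50 years in 30 days:  {rate(50*365, 30*24) / 1e12:.1f} Tflop/s")  # ~5.1 (slide: 4.8)
    print(f"50 years in 12 hours: {rate(50*365, 12) / 1e12:.0f} Tflop/s")     # ~304 (slide: 288)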

U.S.A. Hurricane
[Figure: simulated hurricane over the U.S. Source: Data from M. Wehner, visualization by Prabhat, LBNL]

NERSC User George Smoot wins 2006 Nobel Prize in Physics
•  Smoot and Mather: the 1992 COBE experiment showed the anisotropy of the CMB
•  Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years

The Current CMB Map
[Figure: full-sky CMB map]
•  Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
•  Extracting these µKelvin fluctuations from inherently noisy data is a serious computational challenge.

Evolution of CMB Data Sets: Cost > O(Np^3)

Experiment        | Nt      | Np     | Nb     | Limiting Data | Notes
COBE (1989)       | 2x10^9  | 6x10^3 | 3x10^1 | Time          | Satellite, Workstation
BOOMERanG (1998)  | 3x10^8  | 5x10^5 | 3x10^1 | Pixel         | Balloon, 1st HPC/NERSC
WMAP (2001) (4yr) | 7x10^10 | 4x10^7 | 1x10^3 | ?             | Satellite, Analysis-bound
Planck (2007)     | 5x10^11 | 6x10^8 | 6x10^3 | Time/Pixel    | Satellite, Major HPC/DA effort
POLARBEAR (2007)  | 8x10^12 | 6x10^6 | 1x10^3 | Time          | Ground, NG-multiplexing
CMBPol (~2020)    | 10^14   | 10^9   | 10^4   | Time/Pixel    | Satellite, Early planning/design

(data compression: Nt → Np → Nb)
Source: J. Borrill, LBNL
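Since the analysis cost grows at least like Np^3 (the slide's bound), the pixel counts in the table above translate into enormous jumps in cost from one experiment to the next; a quick check using the Np column:

    # Analysis cost relative to COBE, taking cost ~ Np^3 as a lower bound
    Np = {"COBE": 6e3, "BOOMERanG": 5e5, "WMAP": 4e7,
          "Planck": 6e8, "POLARBEAR": 6e6, "CMBPol": 1e9}
    for name, pixels in Np.items():
        print(f"{name:10s} {(pixels / Np['COBE'])**3:.1e}x")   # Planck is ~1e15x COBE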

Which commercial applications require parallelism?
What do commercial and CSE applications have in common?

Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)
[Charts: heat maps showing how heavily each application area (Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser) uses each of the 13 motifs/dwarfs: 1 Finite State Mach., 2 Combinational, 3 Graph Traversal, 4 Structured Grid, 5 Dense Matrix, 6 Sparse Matrix, 7 Spectral (FFT), 8 Dynamic Prog, 9 N-Body, 10 MapReduce, 11 Backtrack/B&B, 12 Graphical Models, 13 Unstructured Grid]
Analyzed in detail in the "Berkeley View" report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Outline (recap)
•  Why powerful computers must be parallel processors (all, including your laptops and handhelds)
•  Large CSE problems require powerful computers (commercial problems too)
•  Why writing (fast) parallel programs is hard (but things are improving)
•  Structure of the course

Principles of Parallel Computing
•  Finding enough parallelism (Amdahl's Law)
•  Granularity – how big should each parallel task be
•  Locality – moving data costs more than arithmetic
•  Load balance – don't want 1K processors to wait for one slow one
•  Coordination and synchronization – sharing data safely
•  Performance modeling/debugging/tuning
All of these things make parallel programming even harder than sequential programming.

"Automatic" Parallelism in Modern Machines
•  Bit level parallelism
   -  within floating point operations, etc.
•  Instruction level parallelism (ILP)
   -  multiple instructions execute per clock cycle
•  Memory system parallelism
   -  overlap of memory operations with computation

Finding Enough Parallelism
•  Suppose only part of an application seems parallel
•  Amdahl's law
   -  let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
   -  P = number of processors
   -  Speedup(P) = Time(1)/Time(P)
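These definitions lead to the usual Amdahl bound, Speedup(P) ≤ 1 / (s + (1-s)/P); a minimal Python sketch (the 1% serial fraction is an illustrative value, not from the slides):

    def amdahl_speedup(s, p):
        """Upper bound on speedup with serial fraction s on p processors."""
        return 1.0 / (s + (1.0 - s) / p)

    s = 0.01                                   # 1% of the work is sequential (illustrative)
    for p in (10, 100, 1000, 10**6):
        print(p, round(amdahl_speedup(s, p), 1))
    # Even with a million processors, speedup is capped near 1/s = 100.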