CS267
Applications of Parallel Computers

Lecture 1: Introduction

Jim Demmel
EECS & Math Departments
[email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr14/

Outline
• Why powerful computers must be parallel processors
  - Including your laptops and handhelds
• Large Computational Science and Engineering (CSE) problems require powerful computers
  - Commercial problems too
• Why writing (fast) parallel programs is hard
  - But things are improving
• Structure of the course
Units of Measure
• High Performance Computing (HPC) units are:
  - Flop: floating point operation, usually double precision unless noted
  - Flop/s: floating point operations per second
  - Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions...
  - Mega:  Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
  - Giga:  Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
  - Tera:  Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
  - Peta:  Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
  - Exa:   Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
  - Zetta: Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
  - Yotta: Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 55 Pflop/s, 3.1M cores
  - Up-to-date list at www.top500.org
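To make the units concrete, here is a small back-of-the-envelope sketch (illustrative only) using the two numbers quoted on this slide, the ~55 Pflop/s peak rate and ~3.1M cores of the current fastest machine:

```python
# Back-of-the-envelope unit conversions using the numbers on this slide.
PFLOPS = 1e15      # 1 Pflop/s = 10^15 floating point operations per second
GBYTE  = 2**30     # 1 Gbyte = 2^30 bytes (~10^9)

peak_flops = 55 * PFLOPS   # ~55 Pflop/s, per the slide
cores      = 3.1e6         # ~3.1M cores, per the slide

# Average peak rate available per core
print(f"peak flop/s per core: {peak_flops / cores:.2e}")   # ~1.8e10, i.e. ~18 Gflop/s

# Memory needed to hold a dense n x n matrix of double-precision numbers
n = 10**5
bytes_needed = 8 * n * n                                    # 8 bytes per double
print(f"{n} x {n} doubles: {bytes_needed / GBYTE:.0f} Gbytes")  # ~75 Gbytes
```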
Why powerful computers are parallel circa 1991-2006 (since ~2007: all computers)
Tunnel Vision by Experts
• "I think there is a world market for maybe five computers."
  - Thomas Watson, chairman of IBM, 1943
• "There is no reason for any individual to have a computer in their home."
  - Ken Olson, president and founder of Digital Equipment Corporation, 1977
• "640K [of memory] ought to be enough for anybody."
  - Bill Gates, chairman of Microsoft, 1981
• "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
  - Ken Kennedy, CRPC Director, 1994
Slide source: Warfield et al.

Technology Trends: Microprocessor Capacity
• Moore's Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
• 2X transistors/chip every 1.5 years, called "Moore's Law"
• Microprocessors have become smaller, denser, and more powerful.

Microprocessor Transistors / Clock (1970-2000)
[Figure: transistors (thousands) and clock frequency (MHz) of microprocessors, 1970-2000, both growing exponentially. Slide source: Jack Dongarra]

Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x, because wires are shorter
  - actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  - typically another factor of ~x
• Raw computing power of the chip goes up by ~x^4 (x from clock rate, x^2 from density, x from die size)!
  - typically x^3 of this is devoted to either on-chip
    - parallelism: hidden parallelism such as ILP
    - locality: caches
• So most programs run x^3 times faster, without changing them
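A tiny numeric check of the shrinkage argument above, with the shrink factor x as a parameter (an idealized sketch; real chips deviate from these exponents, especially for clock rate):

```python
# Idealized scaling when the feature size shrinks by a factor of x,
# following the argument on the slide above.
def shrink(x):
    clock   = x                      # shorter wires -> clock up by ~x (less in practice)
    density = x**2                   # transistors per unit area up by x^2
    die     = x                      # die area tends to grow by another ~x
    raw     = clock * density * die  # ~x^4 raw computing power
    return clock, density, die, raw

for x in (1.4, 2.0):
    c, d, a, r = shrink(x)
    print(f"x={x}: clock x{c:.2f}, density x{d:.2f}, die x{a:.2f}, raw power x{r:.2f}")
# x=2.0 gives clock x2, density x4, die x2, raw power x16 (= 2^4)
```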
Manufacturing Issues Limit Performance
• 2nd Moore's law (Rock's law): costs go up
  [Image: "Demo of 0.06 micron CMOS". Source: Forbes Magazine]
• Yield
  - What percentage of the chips are usable?
  - E.g., the Cell processor (PS3) was sold with 7 out of 8 cores "on" to improve yield
Manufacturing costs and yield problems limit use of density

Power Density Limits Serial Performance
• Concurrent systems are more power efficient
  - Dynamic power is proportional to V^2 f C
  - Increasing frequency (f) also increases supply voltage (V) → cubic effect
  - Increasing cores increases capacitance (C) but only linearly
  - Save power by lowering clock speed
• High performance serial processors waste power
  - Speculation, dynamic dependence checking, etc. burn power
  - Implicit parallelism discovery
• More transistors, but not faster serial processors
Scaling clock speed (business as usual) will not work
[Figure: power density (W/cm^2) of microprocessors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, P6) vs. year, heading toward the power density of a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]

Revolution in Processors
[Figure: transistors (thousands), clock frequency (MHz), power (W), and number of cores per chip, 1970-2010.]
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• Power is under control, no longer growing

Parallelism in 2014?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
  - Every machine will soon be a parallel machine
  - To keep doubling performance, parallelism must double
• Which (commercial) applications can use this parallelism?
  - Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
  - New software model needed
  - Try to hide complexity from most programmers – eventually
  - In the meantime, need to understand it
• Computer industry betting on this big change, but does not have all the answers
  - Berkeley ParLab, then ASPIRE, established to work on this
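A minimal numeric sketch of the dynamic-power argument above (dynamic power proportional to C·V^2·f). It assumes, purely for illustration, that the supply voltage can be scaled down in proportion to the clock frequency:

```python
# Dynamic power ~ C * V^2 * f, per the Power Density slide above.
def dynamic_power(C, V, f):
    return C * V**2 * f

C, V, f = 1.0, 1.0, 1.0   # normalized units

one_fast = dynamic_power(C, V, f)              # 1 core at full clock and voltage
two_slow = dynamic_power(2 * C, V / 2, f / 2)  # 2 cores: 2x capacitance, half clock/voltage

print(one_fast, two_slow)   # 1.0 vs 0.25
```

Two slower cores deliver the same nominal operation rate (2 x f/2 = f) for roughly a quarter of the dynamic power, which is why core counts rather than clock rates are what keep growing.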
Memory is Not Keeping Pace
Technology trends are against a constant or increasing memory per core
• Memory density is doubling every three years; processor logic is doubling every two
• Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
[Figures: Cost of Computation vs. Memory (Source: David Turek, IBM); memory cost trends (Source: IBM)]
Question: Can you double concurrency without doubling memory?
• Strong scaling: fixed problem size, increase number of processors
• Weak scaling: grow problem size proportionally to number of processors

The TOP500 Project
• Listing the 500 most powerful computers in the world
• Yardstick: Rmax of Linpack
  - Solve Ax=b, dense problem, matrix is random
  - Dominated by dense matrix-matrix multiply
• Updated twice a year:
  - ISC'xy in June in Germany
  - SCxy in November in the U.S.
• All information available from the TOP500 web site at: www.top500.org
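As a side note on the yardstick: HPL computes Rmax as the standard Linpack operation count, 2/3·n^3 + 2·n^2 flops for factoring and solving the dense n x n system, divided by the wall-clock time. A small sketch with made-up values of n and run time (not a real benchmark result):

```python
# Rmax = (Linpack flop count) / (wall-clock time), using the standard
# HPL operation count 2/3 n^3 + 2 n^2 for LU factorization plus solve.
def linpack_flops(n):
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

n       = 4_000_000        # hypothetical problem dimension
seconds = 2 * 3600         # hypothetical 2-hour run
rmax    = linpack_flops(n) / seconds
print(f"Rmax ~ {rmax / 1e15:.1f} Pflop/s")   # ~5.9 Pflop/s for these made-up numbers
```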
42nd List: The TOP10 in November 2013

#  | Site                                                | Manufacturer | Computer                                                        | Country     | Cores     | Rmax [Pflop/s] | Power [MW]
1  | National University of Defense Technology           | NUDT         | Tianhe-2, NUDT TH-IVB-FEP, Xeon 12C 2.2GHz, Intel Xeon Phi      | China       | 3,120,000 | 33.9           | 17.8
2  | Oak Ridge National Laboratory                        | Cray         | Titan, Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        | USA         | 560,640   | 17.6           | 8.21
3  | Lawrence Livermore National Laboratory               | IBM          | Sequoia, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | USA         | 1,572,864 | 17.2           | 7.89
4  | RIKEN Advanced Institute for Computational Science   | Fujitsu      | K Computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect            | Japan       | 705,024   | 10.5           | 12.7
5  | Argonne National Laboratory                          | IBM          | Mira, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  | USA         | 786,432   | 8.59           | 3.95
6  | Swiss National Supercomputing Centre (CSCS)          | Cray         | Piz Daint, Cray XC30, Xeon E5 8C 2.6GHz, Aries, NVIDIA K20x     | Switzerland | 115,984   | 6.27           | 2.33
7  | Texas Advanced Computing Center/UT                   | Dell         | Stampede, PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    | USA         | 462,462   | 5.17           | 4.51
8  | Forschungszentrum Juelich (FZJ)                      | IBM          | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | Germany     | 458,752   | 5.01           | 2.30
9  | Lawrence Livermore National Laboratory               | IBM          | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                | USA         | 393,216   | 4.29           | 1.97
10 | Leibniz Rechenzentrum                                | IBM          | SuperMUC, iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  | Germany     | 147,456   | 2.90           | 3.52

42nd List: The TOP9 + the one you'll use

Ranks 1-9 as above, with #10 replaced by the machine you'll use in this course:

28 | Lawrence Berkeley National Laboratory                | Cray         | Hopper, Cray XE6, Opteron 12C 2.1 GHz, Gemini                   | USA         | 153,408   | 1.05           | 2.91
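One quantity worth deriving from the table above is energy efficiency, Rmax per watt. A quick sketch using the numbers as printed in the table:

```python
# Gflop/s per watt for a few entries of the November 2013 table above.
systems = {   # name: (Rmax in Pflop/s, power in MW), from the table
    "Tianhe-2":  (33.9, 17.8),
    "Titan":     (17.6, 8.21),
    "Sequoia":   (17.2, 7.89),
    "Piz Daint": (6.27, 2.33),
}
for name, (rmax_pf, power_mw) in systems.items():
    gflops_per_watt = (rmax_pf * 1e6) / (power_mw * 1e6)  # Pflop/s -> Gflop/s, MW -> W
    print(f"{name:10s} {gflops_per_watt:.2f} Gflop/s per watt")
# Tianhe-2 ~1.90, Titan ~2.14, Sequoia ~2.18, Piz Daint ~2.69
```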
The TOP10 in November 2012

Rank | Site                                                | Manufacturer | Computer                                                       | Country | Cores     | Rmax [Pflop/s] | Power [MW]
1    | Oak Ridge National Laboratory                       | Cray         | Titan, Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x       | USA     | 560,640   | 17.59          | 8.21
2    | Lawrence Livermore National Laboratory              | IBM          | Sequoia, BlueGene/Q, Power BQC 16C 1.6GHz, Custom              | USA     | 1,572,864 | 16.32          | 7.89
3    | RIKEN Advanced Institute for Computational Science  | Fujitsu      | K computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect           | Japan   | 705,024   | 10.51          | 12.66
4    | Argonne National Laboratory                         | IBM          | Mira, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                 | USA     | 786,432   | 8.16           | 3.95
5    | Forschungszentrum Juelich (FZJ)                     | IBM          | JUQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz, Custom              | Germany | 393,216   | 4.14           | 1.97
6    | Leibniz Rechenzentrum                               | IBM          | SuperMUC, iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR | Germany | 147,456   | 2.90           | 3.42
7    | Texas Advanced Computing Center/UT                  | Dell         | Stampede, PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi   | USA     | 204,900   | 2.66           | n/a
8    | National SuperComputer Center in Tianjin            | NUDT         | Tianhe-1A, NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C            | China   | 186,368   | 2.57           | 4.04
9    | CINECA                                              | IBM          | Fermi, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                | Italy   | 163,840   | 1.73           | 0.82
10   | IBM                                                 | IBM          | DARPA Trial Subset, Power 775, Power7 8C 3.84GHz, Custom       | USA     | 63,360    | 1.52           | 3.58

The TOP10 in November 2012, plus one

The same ten systems, plus the one you'll use in this course:

19   | Lawrence Berkeley National Laboratory               | Cray         | Hopper, Cray XE6, 6C 2.1 GHz                                   | USA     | 153,408   | 1.054          | 2.91
Performance Development (Nov 2013)
[Figure: TOP500 performance over time on a log scale from 100 Mflop/s to 1 Eflop/s. As of November 2013: SUM = 250 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 118 TFlop/s. The first list (June 1993) started at SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s.]

Performance Development (Nov 2012)
[Figure: the same chart as of November 2012: SUM = 162 PFlop/s, N=1 = 17.6 PFlop/s, N=500 = 76.5 TFlop/s.]
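The charts compress two decades of growth. A quick sketch of the implied average annual growth rate of the #1 system, using the endpoint values shown (59.7 Gflop/s on the first list in June 1993, 33.9 Pflop/s in November 2013):

```python
# Average annual growth of the #1 TOP500 system, from the chart endpoints.
start = 59.7e9    # flop/s, #1 system on the first list (June 1993)
end   = 33.9e15   # flop/s, #1 system in November 2013
years = 20.4      # roughly the span between the two lists

growth_per_year = (end / start) ** (1 / years)
print(f"total factor: {end / start:.2e}, ~{growth_per_year:.2f}x per year")
# ~5.7e5 overall, roughly 1.9x per year (a bit faster than doubling every 2 years)
```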
Projected Performance Development (Nov 2013)
[Figure: extrapolation of the SUM, N=1, and N=500 trend lines toward 1 Eflop/s.]

Projected Performance Development (Nov 2012)
[Figure: the same extrapolation as of the November 2012 list.]

Core Count
[Figure: core counts of TOP500 systems over time.]
Moore's Law reinterpreted
• Number of cores per chip can double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with millions of concurrent threads
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
Outline
• Why powerful computers must be parallel processors
  - Including your laptops and handhelds
• Large CSE problems require powerful computers
  - Commercial problems too
• Why writing (fast) parallel programs is hard
  - But things are improving
• Structure of the course

Computational Science - News
"An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing ... to the integration of computer science concepts, tools, and theorems into the very fabric of science."
- Science 2020 Report, March 2006
[Image: Nature cover, March 23, 2006]

Drivers for Change
• Continued exponential increase in computational power
  - Simulation is becoming the third pillar of science, complementing theory and experiment
  - Can simulate what theory and experiment can't do
• Continued exponential increase in experimental data
  - Moore's Law applies to sensors too
  - Need to analyze all that data
  - Techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data rich scientific applications

Simulation: The Third Pillar of Science
• Traditional scientific and engineering method:
  (1) Do theory or paper design
  (2) Perform experiments or build system
• Limitations:
  - Too difficult: build large wind tunnels
  - Too expensive: build a throw-away passenger jet
  - Too slow: wait for climate or galactic evolution
  - Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
  (3) Use computers to simulate and analyze the phenomenon
  - Based on known physical laws and efficient numerical methods
  - Analyze simulation results with computational tools and methods beyond what is possible manually
[Diagram: Theory, Experiment, and Simulation as complementary pillars]
Data Driven Science
• Scientific data sets are growing exponentially
  - Ability to generate data is exceeding our ability to store and analyze
  - Simulation systems and some observational devices grow in capability with Moore's Law
• Petabyte (PB) data sets will soon be common:
  - Climate modeling: estimates for the next IPCC report run to 10s of petabytes
  - Genome: JGI alone will have 0.5 petabyte of data this year and double each year
  - Particle physics: LHC is projected to produce 16 petabytes of data per year
  - Astrophysics: LSST and others will produce 5 petabytes/year (via a 3.2 Gigapixel camera)
• Create scientific communities with "Science Gateways" to data

Some Particularly Challenging Computations
• Science
  - Global climate modeling
  - Biology: genomics; protein folding; drug design
  - Astrophysical modeling
  - Computational chemistry
  - Computational material sciences and nanosciences
• Engineering
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
  - Crash simulation
• Business
  - Financial and economic modeling
  - Transaction processing, web services and search engines
• Defense
  - Nuclear weapons: test by simulations
  - Cryptography
Economic Impact of HPC
• Airlines:
  - System-wide logistics optimization systems on parallel systems
  - Savings: approx. $100 million per airline per year
• Automotive design:
  - Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics
  - One company has a 500+ CPU parallel system
  - Savings: approx. $1 billion per company per year
• Semiconductor industry:
  - Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation
  - Savings: approx. $1 billion per company per year
• Energy:
  - Computational modeling improved performance of current nuclear power plants, equivalent to building two new power plants

$5B World Market in Technical Computing in 2004
[Figure: market share by segment, 1998-2003: Scientific Research and R&D; Mechanical Design/Engineering Analysis; Mechanical Design and Drafting; Imaging; Geoscience and Geoengineering; Electrical Design/Engineering Analysis; Economics/Financial; Digital Content Creation and Distribution; Classified Defense; Chemical Engineering; Biosciences; Simulation; Other Technical Management and Support. Source: IDC 2004, from NRC Future of Supercomputing Report]
What Supercomputers Do – Two Examples
• Climate modeling
  - simulation replacing experiment that is too slow
• Cosmic microwave background radiation
  - analyzing massive amounts of data with new tools

Global Climate Modeling Problem
• Problem is to compute: f(latitude, longitude, elevation, time) → "weather" = (temperature, pressure, humidity, wind velocity)
• Approach:
  - Discretize the domain, e.g., a measurement point every 10 km
  - Devise an algorithm to predict weather at time t+δt given t
• Uses:
  - Predict major events, e.g., El Nino
  - Use in setting air emissions standards
  - Evaluate global warming scenarios
Source: http://www.epm.ornl.gov/chammp/chammp.html

Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
  - Solve Navier-Stokes equations
  - Roughly 100 flops per grid point with a 1-minute timestep
• Computational requirements:
  - To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
  - Weather prediction (7 days in 24 hours) → 56 Gflop/s
  - Climate prediction (50 years in 30 days) → 4.8 Tflop/s
  - To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation is 8x to 16x
• State of the art models require integration of atmosphere, clouds, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this

High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
[Figure: simulated precipitation (millimeters/day) at high resolution.]
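The flop-rate requirements above all follow from one number, the roughly 5 x 10^11 flops needed per simulated minute. A sketch that reproduces the slide's arithmetic (small differences from the slide's 56 Gflop/s, 4.8 Tflop/s, and 288 Tflop/s come from rounding and the length assumed for a model year):

```python
# Reproduce the computational-requirement arithmetic from the slide above.
FLOPS_PER_SIM_MINUTE = 5e11   # ~100 flops per grid point, 1-minute timestep

def required_rate(simulated_days, wallclock_hours):
    """flop/s needed to simulate `simulated_days` of model time in `wallclock_hours`."""
    total_flops = simulated_days * 24 * 60 * FLOPS_PER_SIM_MINUTE
    return total_flops / (wallclock_hours * 3600)

print(f"real time:              {required_rate(1, 24) / 1e9:6.1f} Gflop/s")              # ~8
print(f"weather (7 d in 24 h):  {required_rate(7, 24) / 1e9:6.1f} Gflop/s")              # ~58
print(f"climate (50 y in 30 d): {required_rate(50 * 365, 30 * 24) / 1e12:6.1f} Tflop/s") # ~5
print(f"policy  (50 y in 12 h): {required_rate(50 * 365, 12) / 1e12:6.1f} Tflop/s")      # ~304
```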
U.S.A. Hurricane
[Figure: simulated hurricane over the U.S. Source: data from M. Wehner, visualization by Prabhat, LBNL]

NERSC User George Smoot wins 2006 Nobel Prize in Physics
• Smoot and Mather's 1992 COBE experiment showed the anisotropy of the CMB.

The Current CMB Map
• Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years
• Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
• Extracting these µKelvin fluctuations from inherently noisy data is a serious computational challenge.

Evolution Of CMB Data Sets: Cost > O(Np^3)

Experiment        | Nt      | Np     | Nb     | Limiting Data | Notes
COBE (1989)       | 2x10^9  | 6x10^3 | 3x10^1 | Time          | Satellite, Workstation
BOOMERanG (1998)  | 3x10^8  | 5x10^5 | 3x10^1 | Pixel         | Balloon, 1st HPC/NERSC
WMAP (2001) (4yr) | 7x10^10 | 4x10^7 | 1x10^3 | ?             | Satellite, Analysis-bound
Planck (2007)     | 5x10^11 | 6x10^8 | 6x10^3 | Time/Pixel    | Satellite, Major HPC/DA effort
POLARBEAR (2007)  | 8x10^12 | 6x10^6 | 1x10^3 | Time          | Ground, NG-multiplexing
CMBPol (~2020)    | 10^14   | 10^9   | 10^4   | Time/Pixel    | Satellite, Early planning/design

(Nt → Np → Nb: data compression)
Source: J. Borrill, LBNL
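The "Cost > O(Np^3)" in the slide title says the analysis cost grows at least like the cube of the number of pixels Np. A hedged sketch of what that scaling alone implies across the experiments in the table (relative costs only, ignoring all constants and the other problem dimensions):

```python
# Relative analysis cost if it scales like Np^3, per the slide title.
np_pixels = {   # Np values taken from the table above
    "COBE (1989)":      6e3,
    "BOOMERanG (1998)": 5e5,
    "WMAP (2001)":      4e7,
    "Planck (2007)":    6e8,
    "CMBPol (~2020)":   1e9,
}
base = np_pixels["COBE (1989)"] ** 3
for name, np_ in np_pixels.items():
    print(f"{name:18s} ~{np_**3 / base:.1e}x the cost of COBE")
# Planck comes out ~1e15 times COBE under this scaling, hence the need for HPC.
```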
Which commercial applications require parallelism?
What do commercial and CSE applications have in common?

Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)
[Figure, shown on both slides above: a heat map of 13 computational motifs (1 Finite State Machine, 2 Combinational, 3 Graph Traversal, 4 Structured Grid, 5 Dense Matrix, 6 Sparse Matrix, 7 Spectral (FFT), 8 Dynamic Programming, 9 N-Body, 10 MapReduce, 11 Backtrack/Branch & Bound, 12 Graphical Models, 13 Unstructured Grid) against application areas (Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser), colored by how heavily each area uses each motif.]
Analyzed in detail in the "Berkeley View" report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Outline
• Why powerful computers must be parallel processors
  - Including your laptops and handhelds
• Large CSE problems require powerful computers
  - Commercial problems too
• Why writing (fast) parallel programs is hard
  - But things are improving
• Structure of the course

Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity – how big should each parallel task be
• Locality – moving data costs more than arithmetic
• Load balance – don't want 1K processors to wait for one slow one
• Coordination and synchronization – sharing data safely
• Performance modeling/debugging/tuning
All of these things make parallel programming even harder than sequential programming.

"Automatic" Parallelism in Modern Machines
• Bit level parallelism
  - within floating point operations, etc.
• Instruction level parallelism (ILP)
  - multiple instructions execute per clock cycle
• Memory system parallelism
  - overlap of memory operations with computation

Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl's law
  - let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  - P = number of processors
  Speedup(P) = Time(1)/Time(P) ≤ 1/(s + (1-s)/P)
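A minimal sketch of Amdahl's law with the notation above (s = sequential fraction, P = number of processors); the particular values of s are illustrative:

```python
# Amdahl's law: Speedup(P) = Time(1)/Time(P) <= 1 / (s + (1-s)/P) <= 1/s
def amdahl_speedup(s, P):
    return 1.0 / (s + (1.0 - s) / P)

for s in (0.01, 0.05, 0.10):      # fraction of the work that stays sequential
    for P in (10, 100, 1000):
        print(f"s={s:.2f}  P={P:5d}  speedup <= {amdahl_speedup(s, P):7.1f}")
# Even 1% sequential work caps the speedup at 100x, no matter how many processors.
```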