University of Tartu Institute of Computer Science
Parallel Computing MTAT.08.020 PARALLEELARVUTUSED
[email protected] Liivi 2 - 324
2012/13 Fall term
2 Practical information about the course
Lectures: Thursdays Liivi 2 - 612 16:15 – Eero Vainikko Computer classes: Mondays Liivi 2 - 205 at 16:15 – Oleg Batrašev
• 6 EAP • Exam
3 Practical information about the course
In theory, there is no difference between theory and practice. But, in practice, there is. Jan L. A. van de Snepscheut
4 Practical information about the course
NB! To be admitted to the exam, all computer class exercises must be submitted, and the deadlines must be met! The final grade consists of:
1. Computer class exercises 50%
2. Oral exam at the blackboard 40%
3. Active participation in lectures 10%
Course homepage (http://www.ut.ee/~eero/PC/)
5 Course syllabus
Course content (preliminary, subject to change along the way):
1. Motivation for parallel computing
2. Classifications of parallel systems
3. Clos networks, Beowulf clusters, Grid, peer-to-peer networks
4. Benchmarks
5. Speedup, efficiency
6. Amdahl's law
7. Scaled efficiency, ways of increasing efficiency
8. Fortran 95, HPF
9. OpenMP
10. MPI
11. Assessment of parallel algorithms, definition of optimality
12. Main techniques in parallel algorithms:
13. Balanced trees
14. Pointer jumping
15. Divide and conquer
16. Partitioning
17. Pipelining
18. Accelerated cascading
19. Symmetry breaking
7 Course syllabus Computer class activities:
1. Fortran 90/95 2. OpenMP 3. Message Passing Interface 4. Solving various parallelisation problems
Literature: 1. A.Grama, A.Gupta, G.Karypis and V.Kumar. “Introduction to Parallel Computing”, Second Edition, Addison-Wesley, 2003. 2. L. Ridgway Scott, Terry Clark and Babak Bagheri. “Scientific Parallel Computing”, Princeton University Press, 2005.
3. Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon and Andy White. "Sourcebook of Parallel Computing", Morgan Kaufmann Publishers, 2003.
4. I. Foster. "Designing and Building Parallel Programs", Reading, MA: Addison-Wesley, 1995; http://www.mcs.anl.gov/dbpp.
5. L. Ridgway Scott, Terry Clark and Babak Bagheri. "Scientific Parallel Computing", Princeton University Press, 2005.
6. E. Vainikko. "Fortran95 ja MPI", TÜ Kirjastus, 2004.
7. Ed Akin. "Object-Oriented Programming via Fortran 90/95", Cambridge University Press, 2003.
8. A. C. Marshall. Fortran 90 1 Day Course (http://www.liv.ac.uk/HPC/F90Course1DayHomePage.html), [1997]
9. C.-K. Shene. "Fortran 90 Tutorial" (http://www.cs.mtu.edu/~shene/COURSES/cs201/NOTES/fortran.html), Department of Computer Science, Michigan Technological University, [2009]
9 Course syllabus 10. P.S.Pacheco. "Parallel Programming with MPI", Morgan Kaufmann Publ, 1997. 11. Joseph Ja’Ja’. "An Introduction to Parallel Algorithms", Addison-Wesley Publ., 1992. 12. Ralf Steinmetz, Klaus Wehrle (Eds.), Peer-to-Peer Systems and Applications, LNCS 3485, Springer, 2005.
10 Introduction
Why do we need parallel computing? First, form groups of 2! Then write down as many reasons as you can during 5 minutes!
11 Introduction
Parallel computing is a part of computer science and the computational sciences (hardware, software, applications, programming technologies, algorithms, theory and practice), with special emphasis on parallel computation and supercomputing.
1 Parallel Computing – motivation
The main questions in parallel computing:
• How is interprocessor communication organised?
• How is memory organised?
• Who takes care of parallelisation?
  ◦ the CPU?
  ◦ the compiler?
  ◦ ... or the programmer?
12 Introduction
1.1 History of computing
Pre-history: did parallel computing exist even before electronic computers? Yes – think of building the pyramids.
1929 – parallelisation of weather forecasts
≈1940 – parallel computations in the war industry using Felix-like calculating machines
1950 – emergence of the first electronic computers, based on vacuum tubes (ENIAC and others)
13 Introduction
1960 - Mainframe era. IBM
1.1 History of computing
14 Introduction
1970 - Era of minicomputers
1980 - Era of PCs
1990 - Era of parallel computers
2000 - Clusters / Grids
2010 - Clouds
1.1 History of computing
15 Introduction
1.2 Expert's predictions
Much-cited legend: in 1947 computer engineer Howard Aiken said that the USA would need at most 6 computers in the future.
1950: Thomas J. Watson likewise: 6 computers will be needed in the world.
1977: Seymour Cray: the Cray-1 computer will attract only 100 clients in the world.
1980: a study by IBM – at most about 50 Cray-class computers could be sold per year worldwide.
Reality: the processing power of many Cray-* machines sits in many of today's homes.
Gordon Moore's (co-founder of Intel) law (1965): the number of switches doubles every second year.
1975 refinement of the above: the number of switches on a CPU doubles every 18 months.
Continuing this way, by 2020 or 2030 we would reach the atomic level, or the quantum computer! But why? The reason is the speed-of-light limit (see Example 1.2 below!)
16 Introduction
1.2 Expert’s predictions
Flops:
  first computers           10^2    100 Flops            (a hundred op/s)
  desktop computers today   10^9    Gigaflops (GFlops)   (a billion op/s)
  supercomputers nowadays   10^12   Teraflops (TFlops)   (a trillion op/s)
  the aim of today          10^15   Petaflops (PFlops)   (a quadrillion op/s)
  next step                 10^18   Exaflops (EFlops)    (a quintillion op/s)
http://en.wikipedia.org/wiki/Orders_of_magnitude_ (numbers)
17 Introduction
1.3 Usage areas of a petaflop computer
Engineering applications
• Wind tunnels
• Combustion engine optimisation
• High-frequency circuits
• Buildings and mechanical structures (optimisation, cost, shape, safety, etc.)
• Effective and safe nuclear power station design
• Simulation of nuclear explosions
18 Introduction
1.3 Usage areas of a petaflop computer
Scientific applications
• Bioinformatics • Computational physics and chemistry (quantum mechanics, macromolecules, composite materials, chemical reactions etc).
• Astrophysics (development of galaxies, thermonuclear reactions, postprocessing of large datasets produced by telescopes)
• Weather modelling • Climate and environment modelling • Looking for minerals • Flood predictions
19 Introduction
1.3 Usage areas of a petaflop computer
Commercial applications
• Oil industry
• Market and monetary predictions
• Global economy modelling
• Car industry (e.g. crash tests)
• Aviation (design of modern aircraft and space shuttles)
Applications in computer systems
• Computer security (discovery of network attacks) • Cryptography
20 Introduction
• Embedded systems (for example, car) • etc
1.3 Usage areas of a petaflop computer
21 Introduction
1.4 Example 1.1
Why are speeds of the order of 10^12 flops not enough? Consider a weather forecast for Europe for the next 48 hours, from sea level up to 20 km – we need to solve a PDE (in x, y, z and t).
The volume of the region: 5000 * 4000 * 20 km^3. Stepsize in the xy-plane: 1 km (Cartesian mesh); in the z-direction: 0.1 km. Timestep: 1 min. Around 1000 flop per mesh point per timestep. As a result, ca

    5000 * 4000 * 20 km^3 × 10 mesh points per km^3 = 4 * 10^9 mesh points

and

    4 * 10^9 mesh points × 48 * 60 timesteps × 10^3 flop ≈ 1.15 * 10^15 flop.

If a computer can do 10^9 flops (today's PCs), it will take around

    1.15 * 10^15 flop / 10^9 flops = 1.15 * 10^6 seconds ≈ 13 days !!
22 Introduction
1.4 Example 1.1
But with 10^12 flops: 1.15 * 10^3 seconds < 20 min.
It is not hard to imagine "small" changes to the given scheme such that 10^12 flops is not enough:
• Reduce the mesh stepsize to 0.1 km in each direction and the timestep to 10 seconds, and the total time would grow from 20 min to 8 days.
• We could replace Europe with a whole-world model (the area grows from 2 * 10^7 km^2 to 5 * 10^8 km^2) and combine the model with an ocean model.
23 Introduction
1.4 Example 1.1
Therefore, we must say: the needs of science and technology grow faster than the available computing power – one only needs to change ε and h to become unsatisfied again!
But again, why do we need a parallel computer to achieve this goal?
24 Introduction
1.5 Example 1.2
Have a look at the following piece of code:
do i = 1, 1000000000000
   z(i) = x(i) + y(i)   ! i.e. 3 * 10^12 memory accesses
end do
Assuming that data travels at the speed of light, 3 * 10^8 m/s, to finish the operation in 1 second the average distance of a memory cell must be

    r = (3 * 10^8 m/s * 1 s) / (3 * 10^12) = 10^-4 m.

Typical computer memory is laid out on a rectangular mesh. The length of each edge would then be

    2 * 10^-4 m / sqrt(3 * 10^12) ≈ 10^-10 m

– the size of a small atom!
25 Introduction
1.5 Example 1.2
But why is parallel computing still not predominant? Three main reasons: hardware, algorithms and software.

Hardware: the speed of
• networking
• peripheral devices
• memory access
does not keep up with the growth in processor speed.

Algorithms: an enormous number of different algorithms exist, but
• problems start with their implementation in real-life applications.

Software: development is in progress;
• automatic parallelisers are not good enough
• everything is done by hand
• no easily portable libraries across different architectures
• does the parallelising effort pay off at all?
BUT (as explained above) parallel computing will take over
1.5 Example 1.2
27 Introduction
1.6 Example 1.3
Solving a sparse system of linear equations using MUMPS 4.3 (MUltifrontal Massively Parallel Solver) on the SUN computer class processors, for a medium-size problem (262144 unknowns, 1308672 nonzeros):

  #procs   time (s)   speedup
  1        84.3       –
  2        63.0       1.34
  4        53.6       1.57
  8        27.0       3.12
  12       59.3       1.47
  16       29.5       2.86
  20       57.5       1.47
  23       74.0       1.14
28 Introduction
1.7 Example 1.4
Approximation of π using the Monte Carlo method:

  #procs   time (s)   speedup
  1        107.8      –
  2        53.5       2.01
  4        26.9       4.00
  8        13.5       7.98
  12       9.3        11.59
  16       6.8        15.85
  20       5.5        19.59
  21       9.0        11.97
  23       29.2       3.69
2 Computer architectures and parallelism
Various ways to classify computers:
• Architecture
  ◦ single-processor computer
  ◦ multicore processor
  ◦ distributed system
  ◦ shared memory system
• Operating system
  ◦ UNIX
  ◦ Linux
  ◦ (OpenMosix)
  ◦ Plan 9
  ◦ WIN*
  ◦ etc.
• Usage area
  ◦ supercomputing
  ◦ distributed computing
  ◦ real-time systems
  ◦ mobile systems
  ◦ neural networks
  ◦ etc.
But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.
31 // Architectures
Part I
Parallel Architectures
2.1 Processor development
Instruction pipelining (similar to a car assembly line): different sub-operations are performed on objects moving through the pipeline.
This increases instruction throughput (the number of instructions that can be executed per unit of time).
32 // Architectures
2.1 Processor development
Overlapping of operations. The RISC processor pipeline stages:
• Instruction fetch • Instruction decode and register fetch
• Execute • Memory access • Register write back
Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in MEM stage, and the latest instruction has not yet entered the pipeline.
33 // Architectures
2.1 Processor development
Pipeline length? Limitations:
• pipeline speed is defined by the slowest stage
• usually every 5th or 6th instruction is a branch
• the cost of a mispredicted branch grows with the pipeline length (a larger number of sub-operations gets wasted)
One possibility: multiple pipelines (super-pipelined processors, superscalar execution) – in effect, execution of multiple instructions in a single cycle
34 // Architectures Example 2.1: Two pipelines for adding 4 numbers:
2.1 Processor development
35 // Architectures
2.1 Processor development
True data dependency – the result of one operation is an input to another.
Resource dependency – two operations compete for the same resource (e.g. the floating point unit).
Branch (procedural) dependency – scheduling past a conditional branch is not possible before the branch is resolved.

Instruction Level Parallelism (ILP)
• possibility of out-of-order instruction execution
  ◦ factor: size of the instruction window
• ILP is limited by:
  ◦ the parallelism present in the instruction window
  ◦ hardware parameters – the available functional units:
    · number of pipelines
    · pipeline lengths
    · pipeline properties
    · etc.
  ◦ it is not always possible to utilise all functional units (FUs)
    – if at a certain timestep none of the FUs is utilised – vertical waste
    – if only some FUs are utilised – horizontal waste
• Typically, processors support 4-way superscalar execution
37 // Architectures
2.1 Processor development
Very Long Instruction Word (VLIW) Processors
• Main design concern with superscalar processors – complexity and high price • VLIW – based on compile-time analysis - which commands to bind together for ILP
• These commands get packed together into one (long) instruction (giving the name)
• First used in Multiflow Trace machine (ca 1984) • Intel IA64 - part of the concept being implemented
38 // Architectures
2.1 Processor development
VLIW properties
• Simpler hardware side • Compiler has more context to find ILP • But compiler lacks all the run-time information (e.g. data-misses in cache), therefore, only quite conservative approach possible (just-in-time compilers might have benefit here!)
• More difficult prediction of branching and memory usage • VLIW performance depends highly on compiler abilities, like ◦ loop unrolling ◦ speculative execution ◦ branch prediction etc. • 4-way to 8-way parallelism in VLIW processors
39 // Architectures
2.2
2.2 Memory performance effect
Memory performance effect
• The memory system, not processor speed, often turns out to be the bottleneck
• 2 main parameters:
  latency – the time it takes from issuing a request until delivery from memory
  bandwidth – the amount of data delivered per unit of time (between the processor and memory)
40 // Architectures
2.2 Memory performance effect
Example 2.2: Memory latency
• Simplified architecture: 1 GHz processor
• memory fetch time 100 ns (no cache), block size 1 word
• 2 multiply-add units, i.e. 4 multiplication or addition operations per clock cycle => peak performance 4 Gflops
• Suppose we need to compute a dot product. Each memory access takes 100 clock cycles => 1 flop per 100 ns => actual performance is only 10 Mflops!
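A minimal Fortran sketch of the dot-product loop analysed above (names and sizes are illustrative, not from the slides); with no cache, every iteration has to wait for x(i) and y(i) to arrive from main memory, so the loop runs at memory speed rather than at the 4 Gflops peak:

program dot_example
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), y(n), s
  integer :: i
  x = 1.0
  y = 2.0
  s = 0.0
  do i = 1, n
     ! 2 flops per iteration, but also 2 words fetched from memory
     s = s + x(i)*y(i)
  end do
  print *, 's =', s
end program dot_example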
41 // Architectures
2.2 Memory performance effect
Memory latency hiding with the help of cache
• Cache – a small but fast memory between the processor and main memory (DRAM)
• acts as low-latency, high-throughput storage
• helps to decrease effective memory latency only if there is enough reuse of data
• the fraction of requests served by the cache without a main memory access – the cache hit ratio
• often the hit ratio is a major factor for performance
42 // Architectures
2.2 Memory performance effect
Example 2.3: Cache effect
• Cache size: 32 KB, with latency 1 ns
• operation C = AB, where A and B are 32 × 32 matrices (i.e. all matrices A, B and C fit simultaneously into the cache)
• We observe that:
  ◦ reading the 2 input matrices into cache (2K words) takes ca 200 µs
  ◦ multiplying two n × n matrices takes 2n^3 operations => 64K operations, which can be performed in 16K cycles (at 4 ops/cycle), i.e. 16 µs
  ◦ => total time (data access + calculations) = 200 µs + 16 µs = 216 µs
  ◦ => performance 64K ops / 216 µs ≈ 303 Mflops, much closer to peak
• Repeated access to the same data corresponds to temporal locality
43 // Architectures
2.2 Memory performance effect
• In given example O(n2 ) data accesses and O(n3 ) computations. Such asymptotic difference very good in case of cache
• Data reusage is of critical importance for performance!
44 // Architectures
2.3
2.3 Memory Bandwidth (MB) and performance
Memory Bandwidth (MB) and performance
• MB is governed by memory bus and memory properties • MB can be increased with enlarging memory block accessed Let system spend l time units to move b data units, where
• l – system latency • b – memory block size (also called cache line)
Example 2.4: effect of memory block size: dot product of two vectors
• In example 2.2 we had b = 1, words giving 10 Mflops
45 // Architectures
2.3 Memory Bandwidth (MB) and performance
• Increasing to b = 4 and assuming the data is stored linearly in memory => 8 flops (4 multiplications and 4 additions) per 200 clock cycles, corresponding to 1 flop per 25 ns and thus a performance of 40 Mflops. Observations:
• Block size does not affect system latency • Wide memory bus (like in the example: 4) expensive to implement in hardware • Usually, memory access is performed similarly to pipeline • Therefore, with MB increase, system peak performance increases • Locality of memory addresses IS important! • Good to reorder computations according to data location!
46 // Architectures
2.3 Memory Bandwidth (MB) and performance
Example 2.5: Effect of memory ordering
real :: A(1000,1000), rowsum(1000)
Version A
do i = 1, 1000
   rowsum(i) = 0.0
   do j = 1, 1000
      rowsum(i) = rowsum(i) + A(i,j)
   enddo
enddo
• Which one is faster: A or B?
Version B
do i = 1, 1000
   rowsum(i) = 0.0
enddo
do j = 1, 1000
   do i = 1, 1000
      rowsum(i) = rowsum(i) + A(i,j)
   enddo
enddo
47 // Architectures
2.3 Memory Bandwidth (MB) and performance
48 // Architectures
2.4 Other approaches for hiding memory latency
• Fortran rule: two-dimensional arrays (matrices) are stored column-wise!
• For example, the elements of a matrix B(1:3,1:2) are ordered in memory as B(1,1), B(2,1), B(3,1), B(1,2), B(2,2), B(3,2) (the leftmost index changes fastest).
• Hence Version B, whose inner loop runs over the first index i, accesses memory contiguously and is the faster one.
• To avoid spending time on optimising for memory ordering, use Fortran 95 array syntax (leaving the optimisation to the compiler)!
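As a small illustration of the array-syntax advice (a sketch, not from the original slides), the row sums above can be written with the intrinsic SUM, leaving the loop ordering to the compiler:

program rowsum_example
  implicit none
  real :: A(1000,1000), rowsum(1000)
  A = 1.0
  ! rowsum(i) = sum of A(i,:); the compiler chooses a memory-friendly order
  rowsum = sum(A, dim=2)
  print *, rowsum(1)
end program rowsum_example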
2.4
Other approaches for hiding memory latency
• Multithreading • Prefetching
49 // Architectures
2.4 Other approaches for hiding memory latency
Example 2.6: Multithreading
Preliminary code (C):
for (i = 0; i < n; i++) {
    c[i] = dot_product(get_row(a, i), b);
}
• each dot product appears to be independent, therefore it is possible to recode:
for (i = 0; i < n; i++) {
    c[i] = create_thread(dot_product, get_row(a, i), b);
}
50 // Architectures
2.4 Other approaches for hiding memory latency
• Possibility for performing computation in each clock cycle ◦ prerequisites: – memory system supporting multiple requests simultaneously – processor capable switching from one thread to another in each clock cycle (e.g. machines like HEP and Tera)
◦ program should explicitly specify independence in form of threads
51 // Architectures
2.4 Other approaches for hiding memory latency
Prefetching
• Cache misses make programs wait
• Why not take care of getting the data into the cache in advance?
• Drawbacks:
  ◦ need for larger caches
  ◦ if prefetched data happens to get overwritten, the situation is worse than before!

Drawbacks of latency-hiding techniques
In both cases, higher memory bandwidth between main memory and cache is needed! Each thread gets a proportionally smaller part of the cache to use. =>
• while possibly recovering from the latency problem, a bandwidth problem may occur
• higher demands on the hardware (cache size)
52 // Architectures
2.5 Classification of Parallel Computers
2.5.1 Granularity
In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events.
• fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing
53 // Architectures
2.5 Classification of Parallel Computers
– Implies high communication overhead and less opportunity for performance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
• operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently
2.5.2 Hardware
Pipelining (used in supercomputers, e.g. Cray-1). With N elements in the pipeline and L clock cycles per element, the calculation takes about L + N cycles; without pipelining it takes L * N cycles.
Example of good code for pipelineing:
do i = 1, k
   z(i) = x(i) + y(i)
end do
55 // Architectures
2.5 Classification of Parallel Computers
Vector processors: fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but e.g. recursion is hard to optimise for vector processors. Example: Intel MMX – a simple vector extension.

Processor arrays: most often 2-dimensional arrays (for example MasPar MP-2, a massively parallel computer with 128x128 = 16384 processors, each with 64 Kbytes of memory). Each processor is connected to its neighbours, and on the edges to the corresponding nodes on the opposite edge. The processors share a common clock. Programming such a computer is quite specific; a special language, MPL, was used, and one needs to think about communication between neighbours, or to all processors at once (which was slower).
56 // Architectures
2.5 Classification of Parallel Computers
Shared memory computer
Distributed systems (e.g. clusters) – the most widespread today. One of the main questions about parallel hardware: do the processors share a common clock or not?
57 // Architectures
2.5 Classification of Parallel Computers
58 // Architectures
2.6 Flynn's classification

                       Single Data   Multiple Data
  Single Instruction   SISD          SIMD
  Multiple Instruction (MISD)        MIMD

Abbreviations: S - Single, M - Multiple, I - Instruction, D - Data
(for example, SIMD = Single Instruction, Multiple Data stream)

SISD - single instruction, single data stream (e.g. a simple PC)
SIMD - the same instruction applied to multiple data (example: MasPar)
MISD - the same data used to perform multiple operations; vector processors have sometimes been considered to belong here, but it is most often said to be an empty class
MIMD - separate data and separate instructions (example: a computer cluster)
2.6.1 SISD
• A serial (non-parallel) computer • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle
• Single Data: Only one data stream is being used as input during any one clock cycle
• Deterministic execution • This is the oldest and even today, the most common type of computer • Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
2.6.2 SIMD
• A type of parallel computer • Single Instruction: All processing units execute the same instruction at any given clock cycle
• Multiple Data: Each processing unit can operate on a different data element • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution • Two varieties: Processor Arrays and Vector Pipelines • Examples:some early computers of this type: ◦ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
61 // Architectures
2.6 Flynn’s classification
◦ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• graphics cards • Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD instructions and execution units.
• possibility to switch off some processors (with mask arrays)
2.6.3 MISD
• A type of parallel computer • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units. • Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be: ◦ multiple frequency filters operating on a single signal stream ◦ multiple cryptography algorithms attempting to crack a single coded message.
2.6.4 MIMD
• A type of parallel computer • Multiple Instruction: Every processor may be executing a different instruction stream
• Multiple Data: Every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or nondeterministic
• Currently, the most common type of parallel computer - most modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components
2.6.5 Comparing SIMD with MIMD
• SIMD have less hardware units than MIMD (single instruction unit) • Nevertheless, as SIMD computers are specially designed, they tend to be expensive and timeconsuming to develop
• not all applications suitable for SIMD • Platforms supporting SPMD can be built from cheaper components
2.6.6 Flynn-Johnson classification
(picture by: Behrooz Parhami)
2.6 Flynn’s classification
66 // Architectures
2.7 Type of memory access
2.7.1 Shared memory
• common shared memory • Problem occurs when more than one process want to write to (or read from) the same memory address
• Shared memory programming models do deal with these situations
2.7.2 Distributed memory
• Networked processors with their private memory
2.7.3 Hybrid memory models
• E.g. distributed shared memory, SGI Origin 2000
2.7 Type of memory access
68 // Architectures
2.8 Communication model of parallel computers
2.8.1 Communication through a shared memory address space
• UMA (uniform memory access) • NUMA (non-uniform memory access) ◦ SGI Origin 2000 ◦ Sun Ultra HPC
69 // Architectures
2.8 Communication model of parallel computers
Comparing UMA and NUMA:
C - Cache, P - Processor, M - Memory. Which are UMA and which NUMA? ((a) and (b) are UMA; (c) is NUMA.)
2.8.2 Communication through messages
• Using some messaging libraries like MPI, PVM. • All processors are ◦ independent ◦ own private memory ◦ have unique ID • Communication is performed through exchanging messages
71 // Architectures
2.9 Other classifications
2.9.1 Algorithm realisation
• using only hardware modules
• mixed modules (hardware and software)
2.9.2 Control type
1. synchronous
2. dataflow-driven
3. asynchronous
2.9 Other classifications
2.9.3 Network connection speed
• network bandwidth
  ◦ can be increased e.g. through channel bonding
• network latency
  ◦ the time from the source sending a message to the destination receiving it
Which one is easier to increase: bandwidth or latency? Bandwidth is much easier to increase!
2.9.4 Network topology
• Bus-based networks
• Ring – one of the simplest topologies
• Array topology
  ◦ example: a cube (in the case of a 3D array)
• Hypercube
  ◦ in a ring the longest route between 2 processors is P/2; in a hypercube it is log P
Ring:
74 // Architectures
Hypercube:
2.9 Other classifications
75 // Architectures
2.9 Other classifications
Figure. How to construct a hypercube: take two copies of the lower-dimensional structure and connect the corresponding nodes (adding one bit to the node ID). Problem: large number of connections per node. A hypercube is easy to emulate on e.g. a MasPar array in log n time:
76 // Architectures
2.9 Other classifications
77 // Architectures
2.9 Other classifications
• Star topology
  ◦ speed depends very much on the switch properties (e.g. latency, bandwidth, backplane frequency) and its ability to cope with arbitrary communication patterns
• Clos network, for example Myrinet (quite popular around 2005)
• Cheaper but with higher latency: Gbit Ethernet., (10 Gbit Ethernet) • Nowadays, most popular low-latency network type – Infiniband
78 // Architectures
2.10 Clos networks
The theory of Clos networks originated with telephone networks – the paper: Charles Clos, "A Study of Non-Blocking Switching Networks", 1952.
2.10.1 Myrinet
Myrinet (http://www.myri.com) – a standardised network architecture. Aims:
• maximal throughput with scalability to a large number of processors
• low latency
• eliminate the need to allocate processors according to the network topology
  ◦ (historical remark: for the same reasons, RAM replaced sequential-access memory (disk, drum))
In general, a good network makes up about half of the cluster cost!
79 // Architectures Myrinet card and switch
128-port Myrinet Clos network switch
2.10 Clos-networks
80 // Architectures 2.10.2
2.10 Clos-networks
Topological concepts
• Minimum bisection (minimaalne poolitus) Definition. Minimal number of connections between arbitrary two equally sized sets of nodes in the network
• Describes minimal throughput, independently from communication pattern • Upper bound in case of N-node network is N/2 connections • Network topology achieving this upper bound is called full bisection (täispoolitus)
81 // Architectures Example. 8 nodes, network switch has 5 ports:
What is the minimal bisection in this case? It is 1 – i.e. very bad.
2.10 Clos-networks
82 // Architectures But, if to set the network up as follows:
Question: Does there exist full bisection? If yes, how?
2.10 Clos-networks
83 // Architectures
2.10 Clos-networks
Answer: the minimal bisection is 4. But not every possible communication pattern can be routed simultaneously! (find one!) => not a rearrangeable network
2.10.3 Rearrangeable networks (ümberjärjestatavad võrgud)
Definition. A network is called rearrangeable if it can route an arbitrary permutation communication pattern (i.e. whichever node wants to talk with whichever other node, there is always a way for all of them to do it simultaneously).
Theorem. Every rearrangeable network has full bisection. (The proof is based on Hall's theorem.)
Is the opposite also true or not? The opposite is not true: a network can have full bisection without being rearrangeable.
84 // Architectures
2.10 Clos-networks
Example: an 8-node Clos network built from 4-port switches: "leaf" switches and "spine" switches.
Main rule: from each leaf switch, as many connections go down to the nodes as go up to the spine switches (this enables rearrangeability).
Therefore, Clos networks are:
• scalable up to a large number of nodes
• rearrangeable (it has been proved that for each permutation there exists a routing table)
• provide multiple routes
85 // Architectures
2.10 Clos-networks
In case of too few ports on a switch, it is possible to compose a larger one:
86 // Architectures
2.10 Clos-networks
87 // Architectures
2.11 Beowulf clusters
• History
• 1993-1994 Donald Becker (NASA) – 16 Intel 486 DX4 66 MHz processors, Ethernet channel bonding, COTS (Commodity Off The Shelf) components – created at CESDIS (The Center of Excellence in Space Data and Information Sciences); 74 MFlops (4.6 MFlops/node).
• Beowulf was originally the project name, not a type of system. Dr Thomas Sterling: "What the hell, call it 'Beowulf.' No one will ever hear of it anyway."
• Beowulf – an Old English epic poem, based on a Scandinavian legend, about a hero who defeated the monster Grendel with his courage and strength
88 // Architectures
2.11 Beowulf clusters
• Network choice Ethernet 100Mbps, Gigabit Ethernet, Infiniband, Myrinet (Scali, pathscale, numalink, etc)
• OS – most often linux • Typical Beowulf cluster
Internet
Frontend
Switch
Nodes
89 // Architectures
• Building a cluster (rack/shelf)?
2.11 Beowulf clusters
90 // Architectures
2.11 Beowulf clusters
• console switch?
• which software?
• Security: the part of the system needing security measures is the frontend facing the Internet
• Cluster administration tools:
  ◦ ClusterMonkey
  ◦ Rocks
  ◦ Clustermagic
  ◦ etc.
91 // Performance
3 Parallel Performance
How to measure parallel performance?
3.1 Speedup
    S(N, P) := t_seq(N) / t_par(N, P)

where t_seq(N) is the time for solving the given problem with the best known sequential algorithm.
• In general, this is a different algorithm than the parallel one
  ◦ if the same algorithm is used (and a faster sequential one exists): relative speedup
• 0 < S(N, P) ≤ P
• If S(N, P) = P – linear (optimal) speedup (example: embarrassingly parallel algorithms)
92 // Performance
3.2 Efficiency
• It may happen that S(N, P) > P (swapping; cache effects) – superlinear speedup
• Often S(N, P) < 1 – is it speedup or slowdown?
3.2 Efficiency

    E(N, P) := t_seq(N) / (P · t_par(N, P))

Presumably, 0 < E(N, P) ≤ 1.
93 // Performance
3.3 Amdahl's law
In every algorithm there exist parts that cannot be parallelised.
• Let σ (0 < σ ≤ 1) be the sequential fraction
• Assume that the remaining 1 − σ is parallelised optimally
Then, in the best case:

    S(N, P) = t_seq / ((σ + (1 − σ)/P) · t_seq) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.
94 // Performance
3.3 Amdahl’s law
Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05). Then:

  P     max S(N, P)
  2     1.9
  4     3.5
  10    6.9
  20    10.3
  100   16.8
  ∞     20

Therefore, there is not much to gain with a huge number of processors!
Example 2. If σ = 0.67 (33% parallelisable), with P = 10:

    S(N, 10) = 1 / (0.67 + 0.33/10) ≈ 1.42
95 // Performance
3.3 Amdahl’s law
96 // Performance
3.4 Gustafson-Barsis' law
John Gustafson & Ed Barsis (Sandia Laboratory), 1988:
• On a 1024-processor nCube/10 they claimed to have beaten Amdahl's law!
• Their σ ≈ 0.004...0.008
• but they got S ≈ 1000
• (according to Amdahl, S should be at most 125...250)
How was this possible? Does Amdahl's law hold? Mathematically – yes. But in practice it is not a good idea to solve a problem of fixed size N on an arbitrary number of processors! In general, σ = σ(N) ≠ const; usually σ decreases as N grows. (With the problem scaled to the machine, the scaled speedup is S = P − σ(P − 1), which for σ ≈ 0.004...0.008 and P = 1024 gives S ≈ 1016...1020.) An algorithm is said to be effectively parallel if σ → 0 as N → ∞. To avoid misunderstandings, scaled efficiency is used:
97 // Performance
3.5 Scaled efficiency

    E_S(N, P) := t_seq(N) / t_par(P · N, P)
• Problem size increasing accordingly with adding new processors – does time remain the same?
• 0 < E_S(N, P) ≤ 1
• If E_S(N, P) = 1 – linear speedup
98 // Performance
3.6 Methods to increase efficiency
Factors influencing efficiency:
• communication time
• waiting time
• additional computations
• changing/improving the algorithm
Profiling parallel programs:
• MPE - jumpshot, LMPI, MpiP • Totalview, Vampir, Allinea OPT • Linux - gprof (compiler switch -pg) • SUN - prof, gprof, prism • Many other commercial applications
3.6 Methods to increase efficiency
3.6.1 Overlapping communication and computations
Example: parallel Ax-operation for sparse matrices, par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/par_sparse_mat.f90.html) ("Fortran95 ja MPI", pp. 119-120). The matrix is partitioned and divided between processors. Communication (nonblocking) is started first; calculations on the interior parts of the region proceed meanwhile => savings in waiting time.
3.6.2 Extra computations instead of communication
• Compute values in place instead of importing the results over the network
• Sometimes it pays off! Example: random number generation – broadcast only the seed and generate the numbers in parallel (a deterministic algorithm)
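A minimal MPI Fortran sketch of the broadcast-the-seed idea (illustrative; the variable names and the seeding scheme are assumptions, not from the slides):

program bcast_seed
  use mpi
  implicit none
  integer :: ierr, rank, seed
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) seed = 12345          ! the root process chooses the seed
  ! Send the single integer seed to all processes instead of sending numbers
  call MPI_BCAST(seed, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  ! ... each process now seeds its own generator (e.g. with seed + rank)
  ! and produces its random numbers locally ...
  call MPI_FINALIZE(ierr)
end program bcast_seed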
100 // Performance
3.7 Benchmarks
5 main HPC (High Performance Computing) benchmarks:
• NPB • Linpack
• Perf • IOzone • Graph 500
• HINT
3.7.1 Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB)
MPI, 8 programs (IS,FT,MG,CG,LU,SP,BT,EP) from Computational Fluid Dynamics (CFD) code
• Integer Sort (IS) - integer operations and communication speed. Latency of critical importance
• Fast Fourier Transform (FT). Remote communication performance; solution of a 3D PDE
101 // Performance
3.7 Benchmarks
• Multigrid (MG). Well-structured communication pattern test. 3D Poisson problem with constant coefficients
• Conjugate Gradient (CG). Irregular communication on unstructured discretisation grid. Finding smallest eigenvalue of a large positive definite matrix
• Lower-Upper diagonal (LU). Blocking communication operations with small granularity, using SSOR (Symmetric Successive Over-Relaxation) to solve sparse 5x5 lower and upper diagonal systems. Length of the message varying a lot
• Scalar pentadiagonal (SP) and block tridiagonal (BT). Balance test between computations and communication. Needs even number of processes. Solving a set of diagonally not dominant SP and BT systems. (Although similar, difference in the ratio of communication and computations – SP having more communication than BT
• Embarrassingly Parallel (EP). Gauss residual method with some specifics; showing the performance of the slowest processor.
• + some new tests in version 3, good with networked multicore processors
3.7.2 Linpack
Jack Dongarra. HPL – High Performance Linpack, using MPI and BLAS. Solves a dense system of linear equations. The aim is to fit a problem of maximal size (advisably, utilising about 80% of memory).
• Used for http://www.top500.org
  ◦ see also http://www.bbc.co.uk/news/10187248
• Rpeak – theoretical peak performance in Gflops
• Rmax – maximal achieved performance in Gflops
• N – size of the matrix giving the peak performance
• NB – block size
• In general, various locking mechanisms are needed
• A bit similar to the previous one (small granularity)
• benefit: no data ownership exists
111 //-programming models
4.4 Implicit Parallelism
• no process interaction is visible to the programmer • instead the compiler and/or runtime is responsible for performing it • most common with domain-specific languages where the concurrency within a problem can be more prescribed
112 //-programming models
4.4 Implicit Parallelism
Problem decomposition • Any parallel program is composed of simultaneously executing processes • problem decomposition relates to the way in which these processes are formulated.
• This classification may also be referred to as: ◦ algorithmic skeletons or ◦ parallel programming paradigms
113 //-programming models
4.1 Task parallel model
• focuses on processes, or threads of execution. These processes will often be behaviourally distinct, which emphasises the need for communication. Task parallelism is a natural way to express message-passing communication. It is usually classified as MIMD/MPMD (or MISD)
4.2 Data parallel model
• performing operations on a data set ◦ usually regularly structured in an array ◦ set of tasks will operate on this data – independently on separate partitions
• Data accessibility
114 //-programming models
4.2 Data parallel model
◦ shared memory system, the data accessible to all ◦ in a distributed-memory system the data divided between memories and worked on locally.
• Data parallelism usually classified as SIMD/SPMD. • Generally, granularity small • same operation on multiple computing devices • usually needs some clarification from the programmer, how data distributed between the processors
• communication/synchronisation is created automatically by compiler Example HPF (High Performance Fortran)
115 Shared Memory
5 Shared memory programming model
The Dining Philosophers problem: 5 plates, 5 forks; one can eat only with 2 forks. Eat and think! Avoid starvation (death by hunger)! Nobody should get too fat? No deadlocks, no fights; fairness? => need for some locking mechanism!
116 Shared Memory
5.1 PRAM (Parallel Random Access Machine) model
• consisting of P deterministic RAM processors (von Neumann machines) • processors work in parallel • communication and synchronisation through shared memory Variations
• EREW PRAM (Exclusive Read Exclusive Write) • CREW PRAM (Concurrent Read Exclusive Write) • CRCW PRAM (Concurrent Read Concurrent Write): ◦ common CRCW PRAM - write succeeds only if all have the same value ◦ arbitrary CRCW PRAM - arbitrary process succeeds ◦ priority CRCW PRAM - with predefined priorities
117 Shared Memory
5.2 OpenMP
OpenMP introduction (http://www.llnl.gov/computing/tutorials/openMP/)
Programming model:
• Thread-based parallelism
• A tool for the programmer (not an automatic paralleliser) – explicit parallelism
• Fork-Join model: execution starts with 1 process – the master thread; Fork, Join
118 Shared Memory
• Based on compiler directives; C/C++, Fortran*
• Multi-level parallelism (nested parallelism) is supported (although not all implementations may support it)
• Dynamic threads: the number of threads can be changed while the program is running
5.2 OpenMP
119 Shared Memory
5.2 OpenMP
• An existing sequential program can often be parallelised with only small changes to the source code
• OpenMP has become a standard
OpenMP history:
• Standardisation started in 1994, restarted in 1997
• Oct 1997: Fortran version 1.0
• End of 1998: C/C++ version 1.0
• June 2000: Fortran version 2.0
• April 2002: C/C++ version 2.0
• May 2008: version 3.0 standard
• July 2011: version 3.1
5.2 OpenMP
121 Shared Memory
5.2 OpenMP
Example (Fortran):

      PROGRAM HELLO
      INTEGER VAR1, VAR2, VAR3
!     Sequential part of the program
      ...
!     Parallel part of the program: new threads are created;
!     variable scopes are declared:
!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
!     Parallel section executed by all threads:
      ...
!     All threads join the master thread; end of the parallel section:
!$OMP END PARALLEL
!     The program finishes sequentially:
      ...
      END
122 Shared Memory
5.2 OpenMP
Example (C/C++):

#include <omp.h>

main() {
  int var1, var2, var3;
  /* Sequential part ... */

  /* Parallel part of the program: new threads are created;
     variable scopes are declared: */
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    /* Parallel section executed by all threads: */
    ...
    /* All threads join the master thread; end of the parallel section: */
  }
  /* The sequential end of the program ... */
}
123 Shared Memory
5.2 OpenMP
General rules:
• Comments cannot be placed on the same line as OpenMP directives
• In general, a compiler switch must be added when compiling Fortran so that the OpenMP comments are interpreted
• Several OpenMP commands are paired directives:

!$OMP directive
   [ program block ]
!$OMP end directive
124 Shared Memory
5.3 OpenMP directives: PARALLEL
!$OMP PARALLEL [clause ...]
      IF (scalar_logical_expression)
      PRIVATE (list)
      SHARED (list)
      DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
      FIRSTPRIVATE (list)
      REDUCTION (operator : list)
      COPYIN (list)
      NUM_THREADS (scalar-integer-expression)

   block

!$OMP END PARALLEL
125 Shared Memory
5.3 OpenMP direktiivid: PARALLEL
The number of threads can be set:
• with the NUM_THREADS clause
• with the library function omp_set_num_threads()
• with the environment variable OMP_NUM_THREADS
Restrictions:
• A parallel region in the program is a block that does not span subprogram or file boundaries
• Unsynchronised I/O has unpredictable results
• Branching into or out of a parallel region is forbidden
• Only a single IF clause is allowed
126 Shared Memory
5.3 OpenMP direktiivid: PARALLEL
Example: Hello World (F77)

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
C     Fork a team of threads, giving them their own copies of variables
!$OMP PARALLEL PRIVATE(TID)
C     Obtain and print thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL
      END
127 Shared Memory
The OpenMP work-sharing directives are: DO, SECTIONS and SINGLE
5.4 OpenMP work-sharing directives: DO
!$OMP DO [clause ...]
      SCHEDULE (type [, chunk])
      ORDERED
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      SHARED (list)
      REDUCTION (operator | intrinsic : list)
      COLLAPSE (n)

   do_loop

!$OMP END DO [NOWAIT]
128 Shared Memory
Clauses of the DO directive:
• SCHEDULE describes how the iterations are divided between the threads:
  ◦ STATIC – pieces of length chunk, or evenly divided if chunk is not given
  ◦ DYNAMIC – pieces of length chunk (default 1) are handed out dynamically; after finishing one piece a thread gets a new one
  ◦ GUIDED – the piece size is decreased exponentially
  ◦ RUNTIME – the schedule is decided at run time according to the environment variable OMP_SCHEDULE
  ◦ AUTO – the schedule is decided by the compiler and/or at run time
• ORDERED – the loop is executed in the same order as a sequential loop
• NOWAIT – after the end of the loop the threads do not synchronise but continue immediately with the following lines of the program
• COLLAPSE – how many nested inner loops are counted as one iteration space in the case of nested loops
129 Shared Memory
Restrictions:
• The DO loop cannot be a DO WHILE loop, and the iteration variable must be an integer!
• The correctness of the program must not depend on which thread executes which part of the loop
• Branching into or out of the loop is not allowed
• The ORDERED and SCHEDULE clauses may appear only once
130 Shared Memory
Example of the DO directive:

      PROGRAM VEC_ADD_DO
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I)
!     Arrays A, B, C will be shared by all threads
!     I -- private for each thread (each has a unique copy)
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
!     The iterations of the loop will be distributed
!     dynamically in CHUNK sized pieces
      DO I = 1, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      END
131 Shared Memory
5.5 OpenMP work-sharing directives: SECTIONS
SECTIONS makes it possible to define sections, each section going to a separate thread. The PRIVATE, FIRSTPRIVATE, LASTPRIVATE and REDUCTION clauses can be used to define how variables behave in the parallel part of the program. If the NOWAIT tag is used at the end, then after finishing its section a thread does not wait for the other threads to finish their work, but continues immediately.
!$OMP SECTIONS [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      REDUCTION (operator | intrinsic : list)
!$OMP SECTION
   block
!$OMP SECTION
   block
!$OMP END SECTIONS [NOWAIT]
132 Shared Memory
Example: SECTIONS

      PROGRAM VEC_ADD_SECTIONS
      INTEGER N, I
      PARAMETER (N=1000)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$OMP SECTIONS
!$OMP SECTION
      DO I = 1, N/2
        C(I) = A(I) + B(I)
      ENDDO
!$OMP SECTION
      DO I = 1+N/2, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL
      END
134 Shared Memory
5.6 OpenMP work-sharing directives: SINGLE
Only one thread executes the given block; needed for example for I/O, etc.

!$OMP SINGLE [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
   block
!$OMP END SINGLE [NOWAIT]

Restriction: branching into or out of a SINGLE block is not allowed.
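A small sketch of a typical SINGLE use (an illustration, not from the original slides): exactly one thread does the I/O while the others wait at the implicit barrier of END SINGLE:

      PROGRAM SINGLE_DEMO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(TID)
      TID = OMP_GET_THREAD_NUM()
!$OMP SINGLE
!     Exactly one thread (whichever arrives first) executes this block
      PRINT *, 'I/O done by thread ', TID
!$OMP END SINGLE
!     The other threads wait here (implicit barrier), then all continue
!$OMP END PARALLEL
      END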
135 Shared Memory
5.7 Combining OpenMP work-sharing directives
PARALLEL SECTIONS, PARALLEL DO
NB! If an OpenMP directive becomes too long for one line, it can be split and continued:

!$OMP PARALLEL DO
!$OMP& SHARED(a)

Example: PARALLEL DO
136 Shared Memory
      PROGRAM VECTOR_ADD
      INTEGER N, I, CHUNKSIZE, CHUNK
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& SHARED(A,B,C,CHUNK) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END PARALLEL DO
      END
137 Shared Memory
5.8 OpenMP synchronisation directives
Only the master thread:

!$OMP MASTER
   (block)
!$OMP END MASTER

In the same order (only 1 thread at a time) as in the non-parallelised program:

!$OMP ORDERED
   (block)
!$OMP END ORDERED

(usable only inside a DO or PARALLEL DO block)
Only 1 thread at a time (one after another):

!$OMP CRITICAL [name]
   block
!$OMP END CRITICAL

Example, CRITICAL: a program that counts how many threads are running. The value of X must not get corrupted, i.e. X may only be incremented by one thread at a time (this way the result is "protected"):

      PROGRAM CRITICAL
      INTEGER X
      X = 0
!$OMP PARALLEL SHARED(X)
!$OMP CRITICAL
      X = X + 1
!$OMP END CRITICAL
!$OMP END PARALLEL
      END
139 Shared Memory
Barrier:

!$OMP BARRIER

The ATOMIC directive (a mini critical section for only the single following statement, which also has its own restrictions, e.g. it cannot be a call statement, etc.):

!$OMP ATOMIC

The FLUSH directive – memory synchronisation:

!$OMP FLUSH (list)

(all threads write their local copies of the variables back to shared memory)
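A small sketch (an illustration, not from the original slides) combining ATOMIC, BARRIER and MASTER: ATOMIC protects the single update statement, and the barrier makes sure every thread has updated X before the master prints it:

      PROGRAM ATOMIC_DEMO
      INTEGER X
      X = 0
!$OMP PARALLEL SHARED(X)
!     ATOMIC protects only the single update statement that follows
!$OMP ATOMIC
      X = X + 1
!     Wait until every thread has incremented X
!$OMP BARRIER
!$OMP MASTER
      PRINT *, 'Number of threads = ', X
!$OMP END MASTER
!$OMP END PARALLEL
      END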
140 Shared Memory
5.9 OpenMP scope directives
PRIVATE - variables are private to each thread
SHARED - variables are shared by all threads
DEFAULT (PRIVATE | SHARED | NONE) - sets the default scope
FIRSTPRIVATE - private variables that are initialised with the value from before the region
LASTPRIVATE - after the loop, the shared variable gets the value from the last section or the last loop iteration
COPYIN - on entering the parallel region, the corresponding variable of every thread gets the value of the master thread's variable
REDUCTION - performs the given operation over all private copies of the given variable
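A small sketch (not from the original slides) illustrating FIRSTPRIVATE and LASTPRIVATE: every thread starts with OFFSET initialised to the value it had before the loop, and after the loop J carries the value from the sequentially last iteration:

      PROGRAM SCOPE_DEMO
      INTEGER I, J, OFFSET, A(100)
      OFFSET = 10
      J = 0
!$OMP PARALLEL DO FIRSTPRIVATE(OFFSET) LASTPRIVATE(J)
      DO I = 1, 100
!        each thread starts with its private OFFSET initialised to 10
         A(I) = I + OFFSET
!        after the loop, J keeps the value from the last iteration (I=100)
         J = I
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'J = ', J, ' A(100) = ', A(100)
      END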
141 Shared Memory
Example, REDUCTION – the dot product of two large vectors. As we know, one has to multiply the corresponding components of the vectors and then add up the results. We use a DO loop that multiplies and adds the product to the result. Such a program is easy to parallelise. The results of the different parallel pieces must be added together at the end; for that, REDUCTION(which operation : which variable) is used:
      PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT
      PRINT *, 'Final Result = ', RESULT
      END
143 Shared Memory
Clause availability: the clauses IF, PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, COPYIN, SCHEDULE, ORDERED and NOWAIT apply to the PARALLEL, DO, SECTIONS, SINGLE, PARALLEL DO and PARALLEL SECTIONS directives as listed in the syntax summaries of the individual directives above.

The following directives accept no clauses:
MASTER, CRITICAL, BARRIER, ATOMIC, FLUSH, ORDERED, THREADPRIVATE
144 Shared Memory
5.10 OpenMP library routines

SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expr)
INTEGER FUNCTION OMP_GET_NUM_THREADS()
INTEGER FUNCTION OMP_GET_MAX_THREADS()
INTEGER FUNCTION OMP_GET_THREAD_NUM()
INTEGER FUNCTION OMP_GET_NUM_PROCS()
LOGICAL FUNCTION OMP_IN_PARALLEL()
SUBROUTINE OMP_SET_DYNAMIC(scalar_logical_expression)
LOGICAL FUNCTION OMP_GET_DYNAMIC()
145 Shared Memory
SUBROUTINE OMP_SET_NESTED(scalar_logical_expression)
LOGICAL FUNCTION OMP_GET_NESTED
SUBROUTINE OMP_INIT_LOCK(var)
SUBROUTINE OMP_INIT_NEST_LOCK(var)
SUBROUTINE OMP_DESTROY_LOCK(var)
SUBROUTINE OMP_DESTROY_NEST_LOCK(var)
SUBROUTINE OMP_SET_LOCK(var)
SUBROUTINE OMP_SET_NEST_LOCK(var)
SUBROUTINE OMP_UNSET_LOCK(var)
SUBROUTINE OMP_UNSET_NEST_LOCK(var)
146 Shared Memory
LOGICAL FUNCTION OMP_TEST_LOCK(var)
INTEGER FUNCTION OMP_TEST_NEST_LOCK(var)
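A minimal sketch (not from the original slides) of the lock routines listed above, used to protect a shared counter; omp_lib and omp_lock_kind come from the OpenMP Fortran API:

program lock_demo
  use omp_lib
  implicit none
  integer(kind=omp_lock_kind) :: lck
  integer :: counter
  counter = 0
  call omp_init_lock(lck)
!$omp parallel shared(counter, lck)
  call omp_set_lock(lck)      ! only one thread at a time past this point
  counter = counter + 1
  call omp_unset_lock(lck)    ! release the lock for the next thread
!$omp end parallel
  call omp_destroy_lock(lck)
  print *, 'counter =', counter
end program lock_demo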
147 Shared Memory
5.11 OpenMP example programs
• mxmult.c (http://www.ut.ee/~eero/PC/naiteid/mxmult.c.html)
• omp_mm.f (http://www.ut.ee/~eero/PC/naiteid/omp_mm.f.html)
• omp_orphan.f (http://www.ut.ee/~eero/PC/naiteid/omp_orphan.f.html)
• omp_reduction.f (http://www.ut.ee/~eero/PC/naiteid/omp_reduction.f.html)
• omp_workshare1.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare1.f.html)
• omp_workshare2.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare2.f.html)
148 Fortran and HPF
6 Fortran and HPF
6.1 Fortran Evolution
• FORmula TRANslation • first compiler 1957 • 1st official standard 1972 – Fortran66 • 2nd standard 1980 – Fortran77 • 1991 – Fortran90 • Fortran95 • Fortran2003 • Fortran2008
6.1 Fortran Evolution
149 Fortran and HPF
High Performance Fortran
6.2 Concept
A Fortran 90 extension:
• SPMD (Single Program Multiple Data) model
• each process operates on its own part of the data
• HPF commands specify which processor gets which part of the data
• concurrency is defined by HPF commands on top of Fortran 90
• HPF directives are written as comments: !HPF$
Most of the commands are of declarative type, concerning data distribution between processes. The command INDEPENDENT (and its attribute NEW) is an exception in that it affects execution.
150 Fortran and HPF
6.3 PROCESSORS declaration
6.3 PROCESSORS declaration
Determines a conceptual processor array (which need not reflect the actual hardware structure). Example:

!HPF$ PROCESSORS, DIMENSION(4) :: P1
!HPF$ PROCESSORS, DIMENSION(2,2) :: P2
!HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3

• The number of processors needs to remain the same throughout the program
• If 2 different processor arrays have the same shape, one can assume that corresponding processors map to the same physical process
• !HPF$ PROCESSORS :: P – defines a scalar processor
151 Fortran and HPF
6.4 DISTRIBUTE-directive
6.4 DISTRIBUTE-directive
specifies how to distribute data between processors. Example:

      REAL, DIMENSION(50)    :: A
      REAL, DIMENSION(10,10) :: B, C, D
!HPF$ DISTRIBUTE (BLOCK) ONTO P1 :: A             ! 1-D
!HPF$ DISTRIBUTE (CYCLIC, CYCLIC) ONTO P2 :: B, C ! 2-D
!HPF$ DISTRIBUTE D(BLOCK, *) ONTO P1              ! alternative syntax

• the ranks of the array and the processor array need to conform
• A(1:50) on P1(1:4) – each processor gets ceil(50/4) = 13 elements, except P1(4), which gets the rest (11 elements)
• * – the dimension is ignored => in this example D is distributed row-wise
• CYCLIC – elements are dealt out one by one (like cards from a pile between players)
152 Fortran and HPF
6.4 DISTRIBUTE-directive
• BLOCK – array elements are distributed block-wise (each block contains elements from a contiguous range)
Example (a 9-element array distributed to 3 processors):
  CYCLIC: 1 2 3 1 2 3 1 2 3
  BLOCK:  1 1 1 2 2 2 3 3 3
153 Fortran and HPF
6.4 DISTRIBUTE-directive
Example: DISTRIBUTE block-wise:
      PROGRAM Chunks
      REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (BLOCK) ONTO P :: A
Example: DISTRIBUTE cyclically:
      PROGRAM Round_Robin
      REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (CYCLIC) ONTO P :: A
154 Fortran and HPF
6.4 DISTRIBUTE-directive
Example: DISTRIBUTE 2-dimensional layout:
      PROGRAM Skwiffy
      IMPLICIT NONE
      REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(2,2) :: P
!HPF$ DISTRIBUTE (BLOCK, CYCLIC) ONTO P :: A, B, C
      B = 1; C = 1; A = B + C
      END PROGRAM Skwiffy
block-wise in one dimension, cyclic in the other; the owning processor of each element is:

  (1,1) (1,2) (1,1) (1,2)
  (1,1) (1,2) (1,1) (1,2)
  (2,1) (2,2) (2,1) (2,2)
  (2,1) (2,2) (2,1) (2,2)
155 Fortran and HPF
6.4 DISTRIBUTE-directive
Example: DISTRIBUTE *:
      PROGRAM Skwiffy
      IMPLICIT NONE
      REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(4) :: Q
!HPF$ DISTRIBUTE (*, BLOCK) ONTO Q :: A, B, C
      B = 1; C = 1; A = B + C
      PRINT *, A
      END PROGRAM Skwiffy
Remarks about DISTRIBUTE
• Without ONTO, the default processor arrangement is used (given e.g. with program arguments)
• BLOCK – better if the algorithm uses many neighbouring elements of the array => less communication, faster
• CYCLIC – good for an even load distribution
• ignoring a dimension (with *) is good if calculations are done with a whole row or column at a time
156 Fortran and HPF
6.5 Distribution of allocatable arrays
• all scalar variables are by default replicated; updating – compiler task
6.5
Distribution of allocatable arrays
is similar, except that delivery happens right after the memory allocation
REAL, ALLOCATABLE, DIMENSION(:,:) :: A
INTEGER :: ierr
!HPF$ PROCESSORS, DIMENSION(10,10) :: P
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: A
...
ALLOCATE(A(100,20), stat=ierr)   ! A automatically distributed here
                                 ! block size in dim=1 is 10 elements
...
DEALLOCATE(A)
END
blocksize is determined right after ALLOCATE command
157 Fortran and HPF
6.6
6.6 HPF rule: Owner Calculates
HPF rule: Owner Calculates
Processor on the left of assignment performs the computations Example:
DO i = 1, n
  a(i-1) = b(i*6)/c(i+j) - a(i**i)
END DO
• the calculation is performed by the process owning a(i-1). NOTE that the rule is not obligatory, only advisory to the compiler!
• the compiler may (to reduce communication in the whole program) leave only the assignment to the owner of a(i-1)
158 Fortran and HPF
6.7
6.7 Scalar variables
Scalar variables
REAL, DIMENSION(100,100) :: X
REAL :: Scal
!HPF$ DISTRIBUTE (BLOCK,BLOCK) :: X
...
Scal = X(i,j)
...
the owner of X(i,j) assigns the value to Scal and sends it to the other processes (replication)
159 Fortran and HPF
6.8
6.8 Examples of good DISTRIBUTE subdivision
Examples of good DISTRIBUTE subdivision
Example:
A(2:99) = (A(:98) + A(3:))/2   ! neighbour calculations
B(22:56) = 4.0*ATAN(1.0)       ! section of B calculated
C(:) = SUM(D, DIM=1)           ! sum down a column
From the owner calculates rule we get:
!HPF$ DISTRIBUTE (BLOCK)  ONTO P :: A
!HPF$ DISTRIBUTE (CYCLIC) ONTO P :: B
!HPF$ DISTRIBUTE (BLOCK)  ONTO P :: C   ! or (CYCLIC)
!HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: D  ! or (*,CYCLIC)
160 Fortran and HPF
6.8 Examples of good DISTRIBUTE subdivision
Example (SSOR):
DO j = 2, n-1
  DO i = 2, n-1
    a(i,j) = (omega/4)*(a(i,j-1)+a(i,j+1)+ &
             a(i-1,j)+a(i+1,j)) + (1-omega)*a(i,j)
  END DO
END DO
Best is BLOCK-distribution in both dimensions
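A minimal sketch of how the suggested layout could be declared for this kernel (the size n, the value of omega and the processor arrangement are assumptions of the sketch):
PROGRAM ssor_layout
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 64
  REAL,    PARAMETER :: omega = 1.5
  REAL, DIMENSION(n,n) :: a
  INTEGER :: i, j
!HPF$ PROCESSORS, DIMENSION(2,2) :: P
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO P :: a   ! only block edges need communication
  a = 1.0
  DO j = 2, n-1
    DO i = 2, n-1
      a(i,j) = (omega/4)*(a(i,j-1)+a(i,j+1)+a(i-1,j)+a(i+1,j)) + (1-omega)*a(i,j)
    END DO
  END DO
END PROGRAM ssor_layout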
161 Fortran and HPF
6.9
6.9 HPF programming methodology
HPF programming methodology
• Need to find a balance between concurrency and communication • the more processes, the more communication • aims: ◦ balanced load based on the owner-calculates rule ◦ data locality • use array syntax and intrinsic functions on arrays • avoid deprecated Fortran features (like assumed size and memory association) It is easy to write a program in HPF but difficult to obtain good efficiency. Programming in HPF goes more or less like this: 1. Write a correctly working serial program, test and debug it 2. Add distribution directives, introducing as little communication as possible
162 Fortran and HPF
6.9 HPF programming methodology
It is advisable to add INDEPENDENT directives (where semantically valid) and to perform data alignment (with the ALIGN directive). The first thing, though, is to choose a good parallel algorithm! Issues reducing efficiency:
• Difficult indexing (may confuse the compiler about where particular elements are located)
• array syntax is very powerful ◦ but complex constructions may make the compiler fail to optimise • leave sequential cycles sequential (or replicate them) • object redistribution is time-consuming
163 Fortran and HPF
6.9 HPF programming methodology
Additional advice: • Use array syntax instead of loops where possible • Use intrinsic functions where possible (for potentially better optimisation) • Before parallelising, think whether the algorithm is parallelisable at all • Use INDEPENDENT and ALIGN directives • Block lengths can be given explicitly in BLOCK and CYCLIC if you are sure the algorithm needs them for efficiency; otherwise it is better to leave the decision to the compiler
164 Fortran and HPF
6.10 BLOCK(m) and CYCLIC(m)
6.10 BLOCK(m) and CYCLIC(m) predefine the block size; in general they make the code less efficient due to more complex bookkeeping of ownership
REAL, DIMENSION(20) :: A, B
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE A(BLOCK(9))  ONTO P
!HPF$ DISTRIBUTE B(CYCLIC(2)) ONTO P
2D example:
REAL, DIMENSION(4,9) :: A
!HPF$ PROCESSORS, DIMENSION(2) :: P
!HPF$ DISTRIBUTE (BLOCK(3),CYCLIC(2)) ONTO P :: A
165 Fortran and HPF
6.11
6.11 Array alignment
Array alignment
• improves data locality • minimises communication • distributes workload
Simplest example: A = B + C – with correct ALIGNment there is no communication. Two ways:
!HPF$ ALIGN (:,:) WITH T(:,:) :: A, B, C
equivalent to:
!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ ALIGN B(:,:) WITH T(:,:)
!HPF$ ALIGN C(:,:) WITH T(:,:)
166 Fortran and HPF
6.11 Array alignment
Example:
REAL, DIMENSION(10) :: A, B, C
!HPF$ ALIGN (:) WITH C(:) :: A, B
– only C then needs to appear in a DISTRIBUTE directive. Example with a symbol instead of ":":
!HPF$ ALIGN (j) WITH C(j) :: A, B
means: for ∀j align arrays A and B with array C. (“:” instead of j – stronger requirement)
167 Fortran and HPF
6.11 Array alignment
!HPF$ ALIGN A(i,j) WITH B(i,j)
Example 2 (2-dimensional case):
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,:) WITH B(:,:)
is a stronger requirement than the (i,j) form above, since it assumes arrays of the same size. Good for performing operations like A = B + C + B*C (everything local!). Transposed alignment:
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,:) WITH B(:,i)
(second dimension of A and first dimension of B with same length!)
168 Fortran and HPF
Similarly:
6.11 Array alignment
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,j) WITH B(j,:)
or:
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,j) WITH B(j,i)
good for performing the operation:
A = A + TRANSPOSE(B)*A   ! everything local!
169 Fortran and HPF
6.12
6.12 Strided Alignment
Strided Alignment
Example: Align elements of matrix D with every second element in E:
REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(1::2)
could be written also:
!HPF$ ALIGN D(i) WITH E(i*2-1)
Operation:
D = D + E(::2) ! local
170 Fortran and HPF
Example: reverse strided alignment:
6.12 Strided Alignment
REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(UBOUND(E)::-2)
could be written also:
!HPF$ ALIGN D(i) WITH E(2+UBOUND(E)-i*2)
171 Fortran and HPF
6.13
6.13 Example on Alignment
Example on Alignment
PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
  C = 1;  D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty
minimal (0) communication is achieved with:
!HPF$ ALIGN C(:) WITH D(::2)
!HPF$ ALIGN E(:) WITH D(::4)
!HPF$ DISTRIBUTE (BLOCK) :: D
172 Fortran and HPF
6.14
6.14 Alignment with Allocatable Arrays
Alignment with Allocatable Arrays
• alignment is performed together with memory allocation • an existing object cannot be aligned to an unallocated object Example:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)
then
ALLOCATE (B(100), stat=ierr)
ALLOCATE (A(100), stat=ierr)
is OK
ALLOCATE (B(100), A(100), stat=ierr)
also OK (allocation starts from left),
173 Fortran and HPF
6.14 Alignment with Allocatable Arrays
but,
ALLOCATE (A(100), stat=ierr)
ALLOCATE (B(100), stat=ierr)
or
ALLOCATE (A(100), B(100), stat=ierr)
give an error! A non-allocatable array cannot be aligned with an allocatable one:
REAL, DIMENSION(:) :: X
REAL, DIMENSION(:), ALLOCATABLE :: A
!HPF$ ALIGN X(:) WITH A(:)   ! ERROR
174 Fortran and HPF
6.14 Alignment with Allocatable Arrays
One more problem:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)
":" says that A and B should have the same length (but they do not!) But this is OK:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(i) WITH B(i)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)
– still: A cannot be larger than B.
175 Fortran and HPF
6.15 Dimension collapsing
One or several dimensions of an array can be collapsed onto a single element:
!HPF$ ALIGN (*,:) WITH Y(:) :: X
– each element of Y is aligned with a whole column of X (the first dimension of X is collapsed)
6.16 Dimension replication
!HPF$ ALIGN Y(:) WITH X(*,:)
– each processor that owns any element of X(:,i) also gets a copy of Y(i)
176 Fortran and HPF
6.16 Dimension replication
Example: 2D Gauss elimination kernel of the program:
...
DO j = i+1, n
  A(j,i) = A(j,i)/Swap(i)
  A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
  Y(j) = Y(j) - A(j,i)*Temp
END DO
Y(k) together with A(k,i) =>
!HPF$ ALIGN Y(:) WITH A(:,*)
Swap(k) together with A(i,k) =>
!HPF$ ALIGN Swap(:) WITH A(*,:)
No neighbouring elements of matrix A appear in the same expression => CYCLIC:
177 Fortran and HPF
!HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)
6.16 Dimension replication
178 Fortran and HPF
6.16 Dimension replication
Example: matrix multiplication
PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION(N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to the ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE (*,*) C
END
179 Fortran and HPF
6.17
6.17 HPF Intrinsic Functions
HPF Intrinsic Functions
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE – information about the physical hardware, needed for portability:
!HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS P2(4, 4, NUMBER_OF_PROCESSORS()/16)
!HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
!HPF$               0:NUMBER_OF_PROCESSORS(2)-1)
On a 2048-processor hypercube,
PRINT *, PROCESSORS_SHAPE()
would return:
2 2 2 2 2 2 2 2 2 2 2
180 Fortran and HPF
6.18
6.18 HPF Template Syntax
HPF Template Syntax
TEMPLATE – a conceptual object, uses no RAM, defined statically (like zero-sized arrays that are never assigned to) • templates are declared • distributed • can be used to align arrays Example:
REAL, DIMENSION(10) :: A, B
!HPF$ TEMPLATE, DIMENSION(10) :: T
!HPF$ DISTRIBUTE (BLOCK) :: T
!HPF$ ALIGN (:) WITH T(:) :: A, B
(here only T may be argument to DISTRIBUTE) Combined TEMPLATE directive:
181 Fortran and HPF
6.18 HPF Template Syntax
!HPF$ TEMPLATE, DIMENSION(100,100), &
!HPF$   DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)
which is equivalent to:
!HPF$ TEMPLATE, DIMENSION(100,100) :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO P
182 Fortran and HPF
6.18 HPF Template Syntax
Example:
PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
!HPF$ TEMPLATE, DIMENSION(8) :: T
!HPF$ ALIGN D(:) WITH T(:)
!HPF$ ALIGN C(:) WITH T(::2)
!HPF$ ALIGN E(:) WITH T(::4)
!HPF$ DISTRIBUTE (BLOCK) :: T
  C = 1;  D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty
(similar to the example of alignment with stride)
183 Fortran and HPF
6.18 HPF Template Syntax
More examples on using Templates:
• ALIGN A(:) WITH T1(:,*) – ∀i, element A(i) is replicated along row T1(i,:) • ALIGN C(i,j) WITH T2(j,i) – C is aligned transposed with T2 • ALIGN B(:,*) WITH T3(:) – ∀i, row B(i,:) is aligned with template element T3(i) • DISTRIBUTE (BLOCK,CYCLIC) :: T1, T2 • DISTRIBUTE T1(CYCLIC,*) ONTO P – the rows of T1 are distributed cyclically
184 Fortran and HPF
6.19 FORALL
6.19 FORALL Syntax:
FORALL (<triplet-list> [, <scalar-mask>]) <assignment>
example:
FORALL (i=1:n, j=1:m, A(i,j).NE.0) A(i,j) = 1/A(i,j)
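For comparison, a sketch of the same masked update written with ordinary loops (here the iterations are independent, so the FORALL above and this loop nest compute the same result):
DO i = 1, n
  DO j = 1, m
    IF (A(i,j) .NE. 0) A(i,j) = 1/A(i,j)
  END DO
END DO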
185 Fortran and HPF
6.19 FORALL
Circumstances, where Fortran90 syntax is not enough, but FORALL makes it simple:
• index expressions:
FORALL (i=1:n, j=1:n, i/=j) A(i,j) = REAL(i+j)
• intrinsic or PURE functions (which have no side effects):
FORALL (i=1:n:3, j=1:n:5) A(i,j) = SIN(A(j,i))
• subindexing:
FORALL (i=1:n, j=1:n) A(VS(i),j) = i + VS(j)
186 Fortran and HPF
6.19 FORALL
• unusual parts can be accessed:
FORALL (i=1:n) A(i,i) = B(i)     ! diagonal
! ...
DO j = 1, n
  FORALL (i=1:j) A(i,j) = B(i)   ! triangular
END DO
To parallelise, also add before the DO:
!HPF$ INDEPENDENT, NEW(i)
or write nested FORALL statements:
FORALL (j=1:n)
  FORALL (i=1:j) A(i,j) = B(i)   ! triangular
END FORALL
187 Fortran and HPF
FORALL statement execution: 1. evaluation of the triplet lists 2. evaluation of the scalar mask 3. for each index combination with mask value .TRUE., evaluation of the right-hand side 4. assignment of the right-hand-side values to the left-hand side. In HPF there is synchronisation between the steps!
6.19 FORALL
188 Fortran and HPF
6.20 PURE-procedures
6.20 PURE-procedures
PURE REAL FUNCTION F(x, y)
PURE SUBROUTINE G(x, y, z)
without side effects, i.e.:
• no external I/O and no ALLOCATE (communication is still allowed) • does not change the global state of the program • intrinsic functions are PURE! • can be used in FORALL and in PURE procedures • no PAUSE or STOP • FUNCTION formal parameters have the INTENT(IN) attribute
189 Fortran and HPF
6.20 PURE-procedures
Example (function):
PURE REAL FUNCTION F(x, y)
  IMPLICIT NONE
  REAL, INTENT(IN) :: x, y
  F = x*x + y*y + 2*x*y + ASIN(MIN(x/y, y/x))
END FUNCTION F
example of usage:
FORALL (i=1:n, j=1:n) &
  A(i,j) = b(i) + F(1.0*i, 1.0*j)
190 Fortran and HPF
6.20 PURE-procedures
Example (subroutine):
PURE SUBROUTINE G(x, y, z)
  IMPLICIT NONE
  REAL, INTENT(OUT), DIMENSION(:) :: z
  REAL, INTENT(IN),  DIMENSION(:) :: x, y
  INTEGER :: i
  INTERFACE
    PURE REAL FUNCTION F(x, y)
      REAL, INTENT(IN) :: x, y
    END FUNCTION F
  END INTERFACE
  ! ...
  FORALL (i=1:SIZE(z)) z(i) = F(x(i), y(i))
END SUBROUTINE G
191 Fortran and HPF
6.20 PURE-procedures
“MIMD” example:
REAL FUNCTION F(x, i)   ! PURE
  IMPLICIT NONE
  REAL,    INTENT(IN) :: x   ! element
  INTEGER, INTENT(IN) :: i   ! index
  IF (x > 0.0) THEN
    F = x*x
  ELSEIF (i==1 .OR. i==n) THEN
    F = 0.0
  ELSE
    F = x
  END IF
END FUNCTION F
192 Fortran and HPF
6.21 INDEPENDENT
6.21 INDEPENDENT Directly in front of DO or FORALL
!HPF$ INDEPENDENT
DO i = 1, n
  x(i) = i**2
END DO
In front of FORALL: no synchronisation needed between right hand side expression evaluation and assignment If INDEPENDENT loop...
• ...assigns more than once to a same element, parallelisation is lost! • ...includes EXIT, STOP or PAUSE command, iteration needs to execute sequentially to be sure to end in right iteration
• ...has jumps out of the loop or I/O => sequential execution
193 Fortran and HPF
Is independent:
6.21 INDEPENDENT
!HPF$ INDEPENDENT
DO i = 1, n
  b(i) = b(i) + b(i)
END DO
Not independent:
DO i = 1, n
  b(i) = b(i+1) + b(i)
END DO
Not independent:
DO i = 1, n
  b(i) = b(i-1) + b(i)
END DO
194 Fortran and HPF
6.21 INDEPENDENT
This is independent loop:
!HPF$ INDEPENDENT
DO i = 1, n
  a(i) = b(i-1) + b(i)
END DO
Question to ask: does a later iteration depend on a previous one?
195 Fortran and HPF
6.22 INDEPENDENT NEW command
6.22 INDEPENDENT NEW command – creates a private instance of the named variables on each process!
!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO
196 Fortran and HPF
6.22 INDEPENDENT NEW command
Rules for NEW-variable:
• cannot be used outside the loop without redefinition • cannot be used in FORALL • must not be pointers or formal parameters • cannot have the SAVE attribute Disallowed:
!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO
k = s1 + s2   ! not allowed!
197 Fortran and HPF
6.23 EXTRINSIC
Example (only outer cycles can be executed independently):
!HPF$ INDEPENDENT, NEW(i2)
DO i1 = 1, n1
!HPF$ INDEPENDENT, NEW(i3)
  DO i2 = 1, n2
!HPF$ INDEPENDENT, NEW(i4)
    DO i3 = 1, n3
      DO i4 = 1, n4
        a(i1,i2,i3) = a(i1,i2,i3) &
                    + b(i1,i2,i4)*c(i2,i3,i4)
      END DO
    END DO
  END DO
END DO
6.23 EXTRINSIC Used in INTERFACE-commands to declare that a routine does not belong to HPF
198 Fortran and HPF
7
MPI
See separate slides on course homepage!
199 // Alg.Design
Part III
Parallel Algorithms 8
Parallel Algorithm Design Principles • Identifying portions of the work that can be performed concurrently • Mapping the concurrent pieces of work onto multiple processes running in parallel
• Distributing the input, output and also intermediate data associated with the program
• Access management of data shared by multiple processes • Process synchronisation in various stages in parallel program execution
200 // Alg.Design
8.1
8.1 Decomposition, Tasks and Dependency Graphs
Decomposition, Tasks and Dependency Graphs
• Subdividing calculations into smaller components is called decomposition Example: Dense matrix-vector multiplication y = Ab
y[i] = ∑_{j=1}^{n} A[i, j] b[j]
([1])
• Computational Problem × Decomposition →Tasks
201 // Alg.Design
8.1 Decomposition, Tasks and Dependency Graphs
◦ in the case of y = Ab the tasks – the computations of the individual rows – are independent (see the sketch below)
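A minimal Fortran sketch of this row-wise decomposition (the array names are assumptions of the sketch); each iteration of the outer loop is one task, and the INDEPENDENT directive from the HPF part marks them as concurrent:
!HPF$ INDEPENDENT
DO i = 1, n              ! one task per row of y = A*b
  y(i) = 0.0
  DO j = 1, n
    y(i) = y(i) + A(i,j)*b(j)
  END DO
END DO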
([1]) Tasks and their relative order abstraction:
• Task Dependency Graph ◦ directed acyclic graph – nodes – tasks
202 // Alg.Design
8.1 Decomposition, Tasks and Dependency Graphs
– directed edges – dependences
◦ (can be disconnected) ◦ (edge-set can be empty) • Mapping tasks onto processors • Processors vs. processes
203 // Alg.Design
8.2
Decomposition Techniques
• Recursive decomposition
Example: Quicksort
8.2 Decomposition Techniques
204 // Alg.Design
• Data decomposition ◦ Input data ◦ Output data ◦ intermediate results ◦ Owner calculates rule • Exploratory decomposition Example: 15-square game
• Speculative decomposition • Hybrid decomposition
8.2 Decomposition Techniques
205 // Alg.Design
8.3
Tasks and Interactions
• Task generation ◦ static ◦ dynamic • Task sizes ◦ uniform ◦ non-uniform ◦ knowledge of task sizes ◦ size of data associated with tasks
8.3 Tasks and Interactions
206 // Alg.Design
• Characteristics of Inter-Task Interactions ◦ static versus dynamic ◦ regular versus irregular ◦ read-only versus read-write ◦ one-way versus two-way
8.3 Tasks and Interactions
207 // Alg.Design
8.4
8.4 Mapping Techniques for Load balancing
Mapping Techniques for Load balancing
• Static mapping ◦ Mappings based on data partitioning – array distribution schemes
· block distribution · cyclic and block-cyclic distribution · randomised block distribution – Graph partitioning
208 // Alg.Design
8.4 Mapping Techniques for Load balancing
• Dynamic mapping schemes ◦ centralised schemes – master-slave – self scheduling
· single task scheduling
· chunk scheduling
◦ distributed schemes – each process can send/receive work to/from any other process
  · how are sending and receiving processes paired together?
  · initiator – sender or receiver?
  · how much work is exchanged?
  · when is the work transfer performed?
209 // Alg.Design
8.4 Mapping Techniques for Load balancing
Methods for reducing interaction overheads
• maximising data locality ◦ minimise volume of data exchange ◦ minimise frequency of interactions • overlapping computations with interactions • replicating data or computation • overlapping interactions with other interactions
210 // Alg.Design
8.5
8.5 Parallel Algorithm Models
Parallel Algorithm Models
• data-parallel model • task graph model ◦ typically amount of data relatively large to the amount of computation • work pool model ◦ or task pool model ◦ pool can be – centralised – distributed – statically/dynamically created,
• master-slave model
211 // Alg.Design
8.5 Parallel Algorithm Models
• pipeline (or producer-consumer) model ◦ stream parallelism ◦ a stream of data triggers the computations ◦ a pipeline is a chain of producer-consumer processes ◦ the shape of the pipeline can be: – a linear or multidimensional array – a tree – a general graph with or without cycles
• Bulk-synchronous parallel (BSP) model ◦ synchronisation steps needed in a regular or irregular pattern – p2p synchronisation – and/or global synchronisation
212 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
An example of BSP model based parallelisation: 9
Solving systems of linear equations with sparse matrices
9.1
EXAMPLE: 2D FEM for Poisson equation
Consider as an example Poisson equation
−∆u(x) = f(x), ∀x ∈ Ω,    u(x) = g(x), ∀x ∈ Γ,
where the Laplacian ∆ is defined by
(∆u)(x) = (∂²u/∂x² + ∂²u/∂y²)(x),   x = (x, y)
213 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
u can be e.g.:
• temperature
• displacement of an elastic membrane fixed at the boundary under a transversal load of intensity f
• electro-magnetic potential
Divergence theorem:
∫_Ω div F dx = ∫_Γ F · n ds,
where F = (F₁, F₂) is a vector-valued function defined on Ω,
div F = ∂F₁/∂x₁ + ∂F₂/∂x₂,
and n = (n₁, n₂) is the outward unit normal to Γ. Here dx is the element of area in R² and ds the arc length along Γ.
Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) gives Green's first identity:
∫_Ω ∇u · ∇v dx = ∫_Γ v ∂u/∂n ds − ∫_Ω v∆u dx        (1)
214 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
we come to a variational formulation of the Poisson problem on V:
V = {v : v is continuous on Ω, ∂v/∂x₁ and ∂v/∂x₂ are piecewise continuous on Ω, and v = 0 on Γ}
Poisson problem in variational formulation: find u ∈ V such that
a(u, v) = (f, v),  ∀v ∈ V        (2)
where in the case of the Poisson equation
a(u, v) = ∫_Ω ∇u · ∇v dx
and (u, v) = ∫_Ω u(x)v(x) dx. The gradient ∇ of a scalar function f(x, y) is defined by
∇f = (∂f/∂x, ∂f/∂y)
215 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
Again, the variational formulation of the Poisson equation is:
∫_Ω (∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y) dx = ∫_Ω f v dx
With discretisation we replace the space V with V_h – the space of piecewise linear functions. Each function in V_h can be written as
v_h(x, y) = ∑_i η_i ϕ_i(x, y),
where ϕ_i(x, y) are the basis functions (or 'hat' functions).
(Figure: a piecewise linear function v_h and a 'hat' basis function ϕ_i at node N_i on the triangulation T of Ω.)
216 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions ϕ_i ∈ V_h, i.e.
∫_Ω ∇u_h · ∇ϕ_i dx = ∫_Ω f ϕ_i dx,   i = 1, ..., M
But we have
u_h = ∑_{j=1}^M ξ_j ϕ_j   =⇒   ∇u_h = ∑_{j=1}^M ξ_j ∇ϕ_j
giving M linear equations with respect to the unknowns ξ_j:
∑_{j=1}^M ξ_j ∫_Ω ∇ϕ_j · ∇ϕ_i dx = ∫_Ω f ϕ_i dx,   i = 1, ..., M
217 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
The stiffness matrix A = (a_ij) elements and the right-hand side b = (b_i) are computed as
a_ij = a(ϕ_i, ϕ_j) = ∫_Ω ∇ϕ_i · ∇ϕ_j dx
b_i = (f, ϕ_i) = ∫_Ω f ϕ_i dx
The integrals are computed only where the pairs ∇ϕ_i · ∇ϕ_j get in touch (have mutual support).
Example: two basis functions ϕ_i and ϕ_j for nodes N_i and N_j. Their common support is τ ∪ τ′, so that
a_ij = ∫_Ω ∇ϕ_i · ∇ϕ_j dx = ∫_τ ∇ϕ_i · ∇ϕ_j dx + ∫_{τ′} ∇ϕ_i · ∇ϕ_j dx
218 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
(Figure: neighbouring triangles τ and τ′ with nodes N_i, N_j, N_k and basis functions ϕ_i, ϕ_j.)
Element matrices
Consider a single element τ and pick two of its three basis functions, ϕ_i and ϕ_j. ϕ_k is piecewise linear, so denoting p_k(x, y) = ϕ_k|_τ:
p_i(x, y) = α_i x + β_i y + γ_i,   p_j(x, y) = α_j x + β_j y + γ_j
=⇒ ∇ϕ_i|_τ = (∂p_i/∂x, ∂p_i/∂y) = (α_i, β_i) =⇒ their dot product
∫_τ ∇ϕ_i · ∇ϕ_j dx = ∫_τ (α_i α_j + β_i β_j) dx
219 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
Finding the coefficients α and β – put the three points (x_i, y_i, 1), (x_j, y_j, 0), (x_k, y_k, 0) into the plane equation and solve the system
D (α_i, β_i, γ_i)ᵀ = (1, 0, 0)ᵀ,   where D = [ x_i y_i 1 ; x_j y_j 1 ; x_k y_k 1 ]
The integral is computed by multiplication with the triangle area ½|det D|:
a^τ_ij = ∫_τ ∇ϕ_i · ∇ϕ_j dx = (α_i α_j + β_i β_j) · ½ |det D|
The element matrix for τ is
A^τ = [ a^τ_ii  a^τ_ij  a^τ_ik ; a^τ_ji  a^τ_jj  a^τ_jk ; a^τ_ki  a^τ_kj  a^τ_kk ]
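As a quick worked check (not from the original slides; the reference right triangle is assumed), take τ with vertices N_i = (0,0), N_j = (1,0), N_k = (0,1). Then p_i = 1 − x − y, p_j = x, p_k = y, so (α_i, β_i) = (−1, −1), (α_j, β_j) = (1, 0), (α_k, β_k) = (0, 1), and |det D| = 1, i.e. the area is ½. The formula above gives
A^τ = ½ [ 2 −1 −1 ; −1 1 0 ; −1 0 1 ],
e.g. a^τ_ij = ((−1)·1 + (−1)·0)·½ = −½; note that every row sums to zero, as expected for an element stiffness matrix.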
220 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
Assembled stiffness matrix
• created by adding appropriately all the element matrices together • Different types of boundary values used in the problem setup result in slightly different stiffness matrices
• Most typical boundary conditions: ◦ Dirichlet’ ◦ Neumann • but also: ◦ free boundary condition ◦ Robin boundary condition
221 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
Right-hand side b: b_i = ∫_Ω f ϕ_i dx. Approximate calculation of the integral through the value of f in the middle of the triangle τ:
f^τ = f((x_i + x_j + x_k)/3, (y_i + y_j + y_k)/3)
• support – the area of the triangle
• ϕ_i is a pyramid (integrating ϕ_i over τ gives the pyramid volume)
b^τ_i = ∫_τ f^τ ϕ_i dx = (1/6) f^τ |det D^τ|
• Assembling procedure for the RHS b values from bτi (somewhat similarly to matrix assembly, but more simple)
• Introduction of Dirichlet boundary condition just eliminates the rows.
222 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
EXAMPLE (figure: the unit square in the (x, y) plane)
Consider the unit square.
V_h – the space of piecewise linear functions on Ω which are zero on Γ, with values defined at the inner nodes.
Discretised problem in variational formulation: find u_h ∈ V_h such that
a(u_h, v) = (f, v),  ∀v ∈ V_h        (3)
where in the case of the Poisson equation
a(u, v) = ∫_Ω ∇u · ∇v dx
223 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
• In FEM the equation (3) needs to be satisfied on a set of test functions ϕ_i = ϕ_i(x),
  ◦ which are defined such that ϕ_i(x) = 1 for x = x_i and ϕ_i(x) = 0 for x = x_j, j ≠ i
• and it is demanded that (3) is satisfied with each ϕ_i (i = 1, ..., N)
• As a result, the matrix of the linear equations is obtained
In the general case the n² × n² Laplace matrix has the form

      ( B  −I   0  ···  0 )
      (−I   B  −I   ⋱   ⋮ )
  A = ( 0  −I   B   ⋱   0 )        (4)
      ( ⋮   ⋱   ⋱   ⋱  −I )
      ( 0  ···  0  −I   B )
224 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
where the n × n matrix B is of the form

      ( 4  −1   0  ···  0 )
      (−1   4  −1   ⋱   ⋮ )
  B = ( 0  −1   4   ⋱   0 )
      ( ⋮   ⋱   ⋱   ⋱  −1 )
      ( 0  ···  0  −1   4 )
It means that most of the elements of matrix A are zero.
Poisson equation with a varying coefficient (different materials: varying conductivity, elasticity etc.):
−∇ · (λ(x, y)∇u) = f
• assume that λ is constant across each element τ: λ(x, y) = λ^τ, (x, y) ∈ τ
• =⇒ the element matrices A^τ get multiplied by λ^τ.
225 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
9.1.1
Sparse matrix storage schemes
9.1.2
Triple storage format
• an n × m matrix A: each nonzero is stored with 3 values – the integers i and j and (in most applications) the real matrix element a_ij =⇒ three arrays:
indi(1:nz), indj(1:nz), vals(1:nz)
of length nz – the number of nonzeroes of matrix A (a small concrete sketch is given after the list below). Advantages of the scheme:
• Easy to refer to a particular element • Freedom to choose the order of the elements Disadvantages :
• Nontrivial to find, for example, all nonzeroes of a particular row or column and their positions
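As a hedged illustration (the matrix and the variable names are made up for this sketch, not taken from the slides), a small symmetric tridiagonal matrix stored in triple format in Fortran could look like this:
! Sketch: the 3x3 matrix [4 -1 0; -1 4 -1; 0 -1 4] stored in triple format
INTEGER, PARAMETER :: nz = 7
INTEGER :: indi(nz) = (/ 1, 1,  2, 2, 2,  3, 3 /)   ! row index of each nonzero
INTEGER :: indj(nz) = (/ 1, 2,  1, 2, 3,  2, 3 /)   ! column index of each nonzero
REAL    :: vals(nz) = (/ 4.0, -1.0,  -1.0, 4.0, -1.0,  -1.0, 4.0 /)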
226 // Alg.Design 9.1.3
9.1 EXAMPLE: 2D FEM for Poisson equation
Column-major storage format
For each matrix A column j a vector col(j) – giving row numbers i for which ai j 6= 0.
• To store the whole matrix, each col(j) ◦ added into a 1-dimensonal array cols(1:nz) ◦ introduce cptr(1:M) referring to column starts of each column in cols.
cols(1:nz), cptr(1:M), vals(1:nz)
Advantages:
• Easy to find matrix column nonzeroes together with their positions
227 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
Disadvantages:
• Algorithms become more difficult to read • Difficult to find nonzeroes in a particular row 9.1.4
Row-major storage format
For each matrix A row i a vector row(i) giving column numbers j for which
ai j 6= 0. • To store the whole matrix, each row(j) ◦ added into a 1-dimensonal array rows(1:nz) ◦ introduce rptr(1:N) referring to row starts of each row in rows.
rows(1:nz), rptr(1:N), vals(1:nz)
Advantages:
228 // Alg.Design
9.1 EXAMPLE: 2D FEM for Poisson equation
• Easy to find matrix row nonzeroes together with their positions Disadvantages:
• Algorithms become more difficult to read. • Difficult to find nonzeroes in a particular column. 9.1.5
Combined schemes
Triple format enhanced with cols(1:nz), cptr(1:M), rows(1:nz), rptr(1:N). Here cols and rows refer to corresponding matrix A refer to values in triple format. Advantages:
• All operations easy to perform Disadvantages:
• More memory needed. • Reference through indexing in all cases
229 // Alg.Design
9.2 9.2.1
9.2 Solution methods for sparse matrix systems
Solution methods for sparse matrix systems Direct methods
LU-factorisation with minimal fill-in; phases (with the relative numbers from the slide layout):
• analysis – 100
• factorisation – 10
• back-substitution – 1
Discretisations of PDE-s in case of 2D regions – still OK 3D problems – not so efficient Implementations: MUMPS, UMFPACK, ...
230 // Alg.Design 9.2.2
9.2 Solution methods for sparse matrix systems
Krylov subspace method: Preconditioned Conjugate Gradient Method (PCG)
r(0) = b − A x(0)                 with given starting vector x(0)
Calculate for i = 1, 2, ...
    solve M z(i−1) = r(i−1)
    ρ_{i−1} = r(i−1)ᵀ z(i−1)
    if i == 1
        p(1) = z(0)
    else
        β_{i−1} = ρ_{i−1}/ρ_{i−2}
        p(i) = z(i−1) + β_{i−1} p(i−1)
    endif
    q(i) = A p(i);   α_i = ρ_{i−1} / (p(i)ᵀ q(i))
    x(i) = x(i−1) + α_i p(i);   r(i) = r(i−1) − α_i q(i)
    check convergence; continue if needed
end
231 // Alg.Design
9.3
9.3 Preconditioning
Preconditioning
Idea: replace Ax = b with the system M⁻¹Ax = M⁻¹b, i.e. apply CG to
Bx = c, where B = M⁻¹A and c = M⁻¹b. But how to choose M? The preconditioner M = Mᵀ is to be chosen such that
(i) the problem Mz = r is easy to solve,
(ii) the matrix B is better conditioned than A, meaning that
κ₂(B) < κ₂(A)        (5)
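One simple and commonly used choice satisfying (i) is the diagonal (Jacobi) preconditioner M = diag(A); a minimal Fortran sketch of applying it (array names assumed):
! Sketch: solve M z = r for the Jacobi preconditioner M = diag(A)
DO i = 1, n
  z(i) = r(i) / a(i,i)    ! assumes a(i,i) /= 0
END DO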
232 // Alg.Design
Then
#IT(5) = O(√κ₂(B)) < O(√κ₂(A)) = #IT(solving Ax = b)
• (In the extreme cases M = I or M = A)
• the Preconditioned Conjugate Gradient (PCG) method is obtained by taking M ≠ I in the previous algorithm
9.3 Preconditioning
233 // Alg.Design
9.4
9.4 Domain Decompossition Method
Domain Decompossition Method
Preconditioning and parallelisation simultaneously. The solution domain Ω is divided into P overlapping subdomains.
• Substructuring methods
• Schwarz methods
  ◦ Additive Schwarz
    M⁻¹ = ∑_{i=1}^{P} Rᵢᵀ Aᵢ⁻¹ Rᵢ
    – Rᵢ – restriction operator
    – Aᵢ⁻¹ – i-th subdomain solve
      · subdomain solvers – exact or inexact
      · in case of CG, symmetry of M⁻¹ is a prerequisite!
  ◦ Multiplicative Schwarz
• Two-level methods
234 // Alg.Design
9.5
9.5 Domain Decomposition on Unstructured Grids
Domain Decomposition on Unstructured Grids
Domain Decomposition on Unstructured Grids DOUG (University of Bath, University of Tartu) 1997 - 2012 I.G.Graham, M.Haggers, R. Scheichl, L.Stals, E.Vainikko, M.Tehver, K. Skaburskas, O. Batrašev, C. Pöcher Parallel implementation based on:
• MPI
• 2D & 3D problems
• UMFPACK
• 2-level Additive Schwarz method
• METIS • 2-level partitioning • BLAS • Automatic parallelisation and loadbalancing
• Systems of PDEs
• Automatic Coarse Grid generation • Adaptive refinement of the coarse grid
235 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
DOUG Strategies • Iterative solver based on Krylov subspace methods PCG, MINRES, BICGSTAB, 2-layered FPGMRES with left or right preconditioning
• Non-blocking communication where at all possible
:-) Dot-product: (x, y) – :-( Ax-operation: y := Ax –
• Preconditioner based on Domain Decomposition with 2-level solvers Applying the preconditioner M −1 : solve for z : Mz = r . • Subproblems are solved with a direct, sparse multifrontal solver (UMFPACK)
236 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
Aggregation-based Domain Decomposition methods • Had been analysed only upto some extent
• We are:
making use of strong connections
• Second aggregation for creating subdomains, or
• using rough aggregation before graph partitioner
• Major progress - development of the theory: sharper bounds
• Implementation based on Fortran95 http://dougdevel. org
1.
R. Scheichl and E. Vainikko, Robust Aggregation-Based Coarsening for Additive Schwarz in the Case of Highly Variable Coefficients, Proceddings of the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2006 (P. Wesseling, E. ONate, J. Periaux, Eds.), TU Delft, 2006.
2.
R. Scheichl and E. Vainikko, Additive Schwarz and Aggregation-Based Coarsening for Elliptic Problems with Highly Variable Coefficients, Computing, 2007, 80(4), pp 319-343.
237 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
Aggregation – forming groups of neighbouring graph nodes Key issues:
• how to find good aggregates? • Smoothing step(s) for restriction and interpolation operators Four (often conflicting) aims: 1. follow adequatly underlying physial properties of the domain 2. try to retain optimal aggregate size 3. keep the shape of aggregates regular 4. reduce communication => develop aggregates with smooth boundaries
238 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
Coefficient jump resolving property
239 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
Coefficient jump resolving property
240 // Alg.Design
9.5 Domain Decomposition on Unstructured Grids
Algorithm: Shape-preserving aggregation
Input: matrix A, aggregation radius r, strong connection threshold α.
Output: aggregate number for each node in the domain.
1. Scale A to a unit-diagonal matrix (all ones on the diagonal)
2. Find the set S of strong connections of A: S = ∪_{i=1}^n S_i, where S_i ≡ { j ≠ i : |a_ij| ≥ α max_{k≠i} |a_ik| }; unscale A; aggr_num := 0
3. aggr_num := aggr_num + 1; choose a seed node x from G or, if G = ∅, choose the first nonaggregated node x; level := 0
4. If (level ...

Asymptotic notation:
• T(n) = O(f(n)) – ∃ constants c > 0 and n₀ > 0 such that T(n) ≤ c f(n) for each n ≥ n₀
• T(n) = Ω(f(n)) – ∃ constants c > 0 and n₀ > 0 such that T(n) ≥ c f(n) for each n ≥ n₀
• T(n) = Θ(f(n)) – T(n) = O(f(n)) and T(n) = Ω(f(n)).
245 // Alg.Optimality
10.2
10.2 Definitions of Optimality
Definitions of Optimality
Consider computational problem Q. Denote T ∗ (n) – Q sequential time complexity ie.: ∃ an algorithm for solution of the problem Q with time O(T ∗ (n)) and it can be shown that the time cannot be improved Sequential algorithm with work time O(T ∗ (n)) is called time optimal 2 types of parallel algorithm optimality definitions (1st weaker than 2nd):
1. Parallel algorithm for solution Q is called optimal or optimal in weak sense or weakly optimal or cost optimal, if W (n) = Θ(T ∗ (n)). 2. Parallel algorithm is called work-time (W T ) optimal or optimal in the strong sense if it can be shown that the program parallel working time T (n) cannot be improved by any other optimal parallel algorithm
246 // Alg.Optimality
10.2 Definitions of Optimality
EXAMPLE: Algorithm Sum
Input: n = 2^k numbers in array A
Output: the sum S = ∑_{i=1}^n A(i)
begin
1. for i = 1:n pardo
       B(i) := A(i)
2. for h = 1:log n do
       for i = 1:n/2^h pardo
           B(i) := B(2i−1) + B(2i)
3. S := B(1)
end
Number of time units: T(n) = log n + 2 = O(log n).
Step 1: n operations; the j-th time step (h = j − 1 in part 2 of the algorithm) takes n/2^{j−1} operations (j = 2 : log n + 1)
=> W(n) = n + ∑_{j=1}^{log n} (n/2^j) + 1 = O(n)
Due to W(n) = O(n) and T*(n) = n, the algorithm is optimal (a Fortran sketch of the tree summation is given below). It can be shown that the speedup is optimal with p = O(n/log n) processors (see the book). (In Ja'Ja', Section 10, it is shown that the problem takes Ω(log n) time independently of the number of processors => the algorithm is also optimal in the strong sense.)
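A minimal Fortran sketch of the tree summation (the array names and the power-of-two size are assumptions of the sketch); the inner loop at each level corresponds to the pardo loop of the algorithm:
PROGRAM tree_sum
  IMPLICIT NONE
  INTEGER, PARAMETER :: k = 4, n = 2**k
  REAL, DIMENSION(n) :: A, B
  REAL    :: S
  INTEGER :: h, i
  A = 1.0
  B = A                                ! step 1 (pardo)
  DO h = 1, k                          ! log n levels
    DO i = 1, n/2**h                   ! step 2: independent within a level (pardo)
      B(i) = B(2*i-1) + B(2*i)
    END DO
  END DO
  S = B(1)                             ! step 3
  PRINT *, 'sum =', S                  ! prints 16.0 for this choice of n
END PROGRAM tree_sum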
247 // Algorithm Basic Techniques
11 11.1
11.1 Balanced Trees
Basic Techniques in Parallel Algorithms Balanced Trees
Prefix Sums. Find s_i = x_1 ∗ x_2 ∗ ... ∗ x_i, 1 ≤ i ≤ n, where ∗ is some (associative) operation, like *, +, ...
The easiest way: s_i = s_{i−1} ∗ x_i – sequential, T*(n) = O(n)
Figure: sequential and parallel
248 // Algorithm Basic Techniques
11.1 Balanced Trees
ALGORITHM Prefix Sums
Input: array of n = 2^k elements (x_1, x_2, ..., x_n), where k ≥ 0 is an integer
Output: prefix sums s_i, 1 ≤ i ≤ n
begin
1. if n = 1 then { set s_1 := x_1; exit }
2. for 1 ≤ i ≤ n/2 pardo
       set y_i := x_{2i−1} ∗ x_{2i}
3. Recursively find the prefix sums of {y_1, y_2, ..., y_{n/2}} and store them in z_1, z_2, ..., z_{n/2}
4. for 1 ≤ i ≤ n pardo
       { i even:    set s_i := z_{i/2}
         i = 1:     set s_1 := x_1
         i odd > 1: set s_i := z_{(i−1)/2} ∗ x_i }
end
249 // Algorithm Basic Techniques
11.1 Balanced Trees
Program work time T (n) and work W (n) estimation:
T (n) = T (n/2) + a W (n) = W (n/2) + bn, where a and b are constants. Solution of the recursion:
T (n) = O(log n) W (n) = O(n). Algorithm is (W T -optimal) or optimal in strong sense (Proof Ja’Ja’ Sec 10), EREW and CREW PRAM
250 // Algorithm Basic Techniques
11.2
11.2 Pointer Jumping Method
Pointer Jumping Method
Rooted Directed tree T is a directed graph with a special node r such that 1. for each v ∈ V − {r} outdegree is 1, outdegree of r is 0; 2. for each v ∈ V − {r} there exists a directed path from node v to the node r. The special node r is called root of tree T With pointer jumping it is possible to quickly manage data stored in rooted directed trees
251 // Algorithm Basic Techniques
11.2 Pointer Jumping Method
Task: Find all roots of forest Forest F :
• consists of rooted directed trees • given by array P(1 : n) : P(i) = j => edge from node i to j • If P(i) = i => i – root Task is to find root S( j) for each j, ( j = 1, ..., n). Simple algorithm (for example, reversing all arrows and thereafter root finding) – linear time At first take S(i) – successor of i-th node it’s P(i) In pointer jumping, each node successor is overwritten with successor’s successor; repeatedly => After k iterations distance between i and S(i) is 2k if S(i) not root
252 // Algorithm Basic Techniques
11.2 Pointer Jumping Method
ALGORITHM Pointer Jumping
Input: a forest of rooted directed trees with roots pointing to themselves, edges given as pairs (i, P(i)), i = 1, ..., n
Output: the root S(i) of the tree that node i belongs to
begin
  for i = 1, n pardo
      S(i) := P(i)
      while (S(i) ≠ S(S(i))) do
          S(i) := S(S(i))
end
CREW PRAM
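A minimal Fortran emulation of the pardo loop above (array names assumed; in the PRAM algorithm all updates of one sweep happen synchronously, here they are simply done in order, which still converges to the roots):
! Sketch: pointer jumping; S(i) ends up as the root of the tree containing i
S(1:n) = P(1:n)
DO WHILE (ANY(S(1:n) /= S(S(1:n))))    ! some node not yet pointing at a root
  DO i = 1, n                          ! pardo in the PRAM formulation
    S(i) = S(S(i))
  END DO
END DO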
253 // Algorithm Basic Techniques Example
11.2 Pointer Jumping Method
254 // Algorithm Basic Techniques
11.2 Pointer Jumping Method
Time and work estimate for the given algorithm:
h- maximal tree-length in forest • Time: stepsize between i and S(i) doubles in each iteration until S(S(i)) is root to the tree conatining i ◦ => number of iterations= O(log h) ◦ Each iteration can be performed in O(1) parallel time and O(n) operations ◦ => T (n) = O(log h) • W (n) = O(n log h) (W (n) non-optimal due to existence of O(n) sequential algorithm)
255 // Algorithm Basic Techniques
11.3
11.3 Divide and Conquer
Divide and Conquer
Divide and conquer strategy
• consists of 3 main steps: 1. Dividing input into multiple relatively equal parts 2. Solving recursively subproblems on given parts 3. Merging the found solutions of different subproblems into a solution of the overall problem
256 // Algorithm Basic Techniques
11.3 Divide and Conquer
Finding convex hull of a given set
• Let a set of points S = {p1 , p2 , ..., pn }, be given with coordinates pi = (xi , yi ). • Planar convex hull of set S is the smallest convex polygon, containing all n points of S • Polygon Q is called convex if for arbitrary 2 polygon Q points p and q the line segment (p, q) lies entirely in polygon Q Convex hull problem: Determine the ordered (say, clockwise) list CH(S) of the points of S, determining the boundary of the convex hull of S
257 // Algorithm Basic Techniques
11.3 Divide and Conquer
258 // Algorithm Basic Techniques
11.3 Divide and Conquer
• It can be shown that for a sequential algorithm T*(n) = Θ(n log n) (sorting can be reduced to the convex-hull problem). Let p and q be the points of S with the smallest and largest x-coordinate
• => p and q ∈ CH(S)
• => the points of CH(S) from p to q form the upper hull UH(S)
• and those from q to p the lower hull LH(S)
Remark. Sorting n numbers can be done in O(log n) time on an EREW PRAM using a total of O(n log n) operations (Ja'Ja', Chapter 4).
Assume for simplicity that no two points have the same x or y coordinate and that n is a power of 2. Sort the points by x-coordinate (O(log n) time and O(n log n) operations):
x(p_1) < x(p_2) < · · · < x(p_n)
Let S_1 = (p_1, p_2, ..., p_{n/2}) and S_2 = (p_{n/2+1}, ..., p_n), and suppose UH(S_1) and UH(S_2) have already been found. The upper common tangent of the sets S_i (i = 1, 2) needs to be found.
259 // Algorithm Basic Techniques
11.3 Divide and Conquer
260 // Algorithm Basic Techniques
11.3 Divide and Conquer
The upper common tangent of UH(S_1) and UH(S_2) can be found in O(log n) sequential time using binary search (Ja'Ja', Ch. 6).
Let UH(S_1) = (q_1, ..., q_s) and UH(S_2) = (q′_1, ..., q′_t)
• (q_1 = p_1 and q′_t = p_n)
• The upper common tangent is defined by the points (q_i, q′_j). UH(S) consists of the first i points of UH(S_1) and the last t − j + 1 points of UH(S_2):
UH(S) = (q_1, ..., q_i, q′_j, ..., q′_t) – can be determined in O(1) time with O(n) operations
261 // Algorithm Basic Techniques
11.3 Divide and Conquer
ALGORITHM (Simple Upper Hull)
Input: a set S of points in the plane, no two of which have the same x or y coordinate, such that x(p_1) < x(p_2) < · · · < x(p_n), where n = 2^k
Output: the upper hull of S
begin
1. If n ≤ 4, use the brute-force method to find UH(S) and exit
2. Let S_1 = (p_1, p_2, ..., p_{n/2}) and S_2 = (p_{n/2+1}, ..., p_n); recursively compute UH(S_1) and UH(S_2) in parallel
3. Find the upper common tangent between UH(S_1) and UH(S_2) and combine them into UH(S)
end
262 // Algorithm Basic Techniques
11.3 Divide and Conquer
Theorem. The algorithm correctly computes the upper hull of n points in the plane in O(log² n) time, using a total of O(n log n) operations.
Proof. Correctness follows by induction over n. T(n) and W(n)?
Step 1 – O(1) time
Step 2 – T(n/2) time using 2W(n/2) operations
Step 3 – common tangent: O(log n) time; combining: O(1) time and O(n) operations
=> recursions:
• T(n) ≤ T(n/2) + a log n,
• W(n) ≤ 2W(n/2) + bn,
with a and b positive constants. Solving the recursions gives T(n) = O(log² n) and W(n) = O(n log n). It follows that the convex hull of a set of n points can be found in O(log² n) time using a total of O(n log n) operations => the algorithm is optimal.
263 // Algorithm Basic Techniques
11.4
11.4 Partitioning
Partitioning
Partitioning strategy: 1. breaking up the given problem into p independent subproblems of almost equal sizes 2. solving p problems concurrently
• On contrary to divide-and-concuer strategy – the main task is partitioning, not combining the solution
264 // Algorithm Basic Techniques
11.4 Partitioning
Merging linearly ordered sets Given set S and partial order relation ≤
• Set S is linarly (or totally) ordered if for every pair a, b ∈ S either a ≤ b or b ≤ a Consider 2 nondecreasing sequences A = (a1 , a2 , ..., an ) and B = (b1 , b2 , ..., bn ) with elements from linearly ordered set S
• Task: merge the sequnces A and B into sorted sequence C = (c1 , c2 , ..., c2n ) Let X = (x1 , x2 , ..., xt ) set with elements from set S.
• Rank of element x ∈ S in set X , denoted with rank(x : X), is the number of set X elements less or equal to x. Let Y = (y1 , y2 , ..., ys ) be arbitrary array of elements in set S.
• Ranking of set Y in set X is an array rank(Y : X) = (r1 , r2 , ..., rs ), where ri = rank(yi : X). Ecample:
X = (26, −12, 27, 30, 53, 7); Y = (12, 28, −28). Then rank(Y : X) = (2, 4, 0).
265 // Algorithm Basic Techniques
11.4 Partitioning
For simplicity: all elements of both sorted sequences A and B to merge are distinct => merging problem solvable finding for each element x in A and B its rank in set A ∪ B.
• If rank(x : A ∪ B) = i, then ci = x. • As rank(x : A ∪ B) = rank(x : A) + rank(x : B), then the merging problem can be solved finding two integer arrays rank(A : B) and rank(B : A). So, consider the problem rank(B : A) Let bi – an arbitrary element of B As A– sorted, finding rank(bi , A) can be performed with binary search. (Compare bi with middle element and go on with lower or upper part etc. until we have found a j(i) < bi < a j(i)+1 and rank(bi , A) = j(i)). Sequential time is O(log n) => perform in parallel with each element of B. Work - O(n log n) , which is not optimal since there exists a linear sequential algorithm
266 // Algorithm Basic Techniques
11.4 Partitioning
Optimal merging algorithm Split sequences A and B into log n blocks => ca n/ log n elements into each approximately equal partition
267 // Algorithm Basic Techniques
11.4 Partitioning
Algorithm 2.7 (Partitioning) Input: Two arrays A = (a1 , ..., an ) and B = (b1 , ..., bm ) with elements in increasing order with log m and k(m) = m/ log m being integers Output: k(m) pairs (Ai , Bi ) of sequences A and B subsequences such that: (1) |Bi | = log m; (2) ∑i |Ai | = n and (3) each subsequence Ai and Bi element larger than all elements in subsequences Ai−1 and Bi−1 ∀ 1 ≤ i ≤ k(m) − 1. begin 1. j(0) := 0; j(k(m)) := n 2. for 1 ≤ i ≤ k(m) − 1 pardo 2.1 rank bi log m in sequence A using binary search and j(i) := rank(bi log m : A) 3. for 0 ≤ i ≤ k(m) − 1 pardo 3.1 Bi :=(bi log m+1 , ..., b(i+1) log m ) 3.2 Ai :=(a j(i)+1 , ..., a j(i+1) ) (Ai is empty if j(i) = j(i + 1))
268 // Algorithm Basic Techniques
end
Example:
A = (4, 6, 7, 10, 12, 15, 18, 20) B = (3, 9, 16, 21) Here m = 4, k(m) = 2. As rank(9 : A) = 3 => A0 = (4, 6, 7), B0 = (3, 9) A1 = (10, 12, 15, 18, 20), B1 = (16, 21)
11.4 Partitioning
269 // Algorithm Basic Techniques
11.4 Partitioning
Lemma 2.1: Let C be a sorted sequence obtained with merging two sorted sequences A and B (with lengths n and m). Then Algorithm 2.7 partitions sequences A and B into pairs (Ai , Bi ) such that |Bi | = O(log m), ∑i |Ai | = n and C = (C0 ,C1 , ...), where Ci is sorted sequence obtained through merging subsequences Ai and Bi . In addition, the algorithm works with O(log n) time using O(n + m) operations. Proof. Ensure that each element from subsequences Ai and Bi larger than each element in Ai−1 and Bi−1 . Step 1: O(1) time Step 2: O(log n) time (binary search for each one separately) Operation count: O((log n) × (m/ log m)) = O(m + n), as (m log n/ log m) < (m log(n + m)/ log m) ≤ n + m, if n, m ≥ 4. Step 3: Time: O(1), linear number of opeartions
270 // Algorithm Basic Techniques
11.4 Partitioning
Theorem 2.4 Let A and B be two sorted sequences, each of length n. Merging sequences A and B can be performed in O(log n) time with O(n) operations. Proof. 1. Using algorithm 2.7 divide A and B into pairs of subsequences (Ai , Bi ) (Time:O(log n), Work O(n)) 2. Could use a sequential algorithm for merging (Ai , Bi ) but Ti = O(|Bi | + |Ai |) =
O(log n+?) Therefore, if
|Ai | > O(log n) – 2a) split Ai into pieces of length log n and apply algorithm 2.7, with B replaced with Ai and A replaced with Bi – O(log log n) time using O(|Ai |) operations. |Ai | ≤ O(log n) 2b) use sequential algorithm for merging subsequences
271 // Algorithm Basic Techniques
11.5
11.5 Pipelining
Pipelining
In pipelining, process is divided into subprocesses in such a way that as soon the previous subprocess has finished an arbitrary algorithm phase, it can be followed by the next subprocess
2-3 Trees 2-3 tree is:
• rooted tree with each node having 2 or 3 children • path length from root to all leaves is the same Properties:
• number of leaves in 2-3 tree is between on 2h and 3h (h - tree height) => height of n-leave trees is Θ(log n) • being used in sorted lists (element search, adding, subtracting an element)
272 // Algorithm Basic Techniques Sorted list A = (a1 , a2 , ..., an ), where a1 < a2 < · · · < an as 2-3 tree leaves – elements ai with direction from left to right Inner nodes: L(ν) : M(ν) : R(ν) L(ν) - largest element of leften subtree M(ν) - middle subtree largest element R(ν) - largest element of right subtree (if it exists) Example
11.5 Pipelining
273 // Algorithm Basic Techniques
11.5 Pipelining
• 2-3-tree corresponding to a sorted list A is not defined uniformly • After elemental operations (adding, deleting a leave) the structure must remain the same
Searching: With given b find the leaf of T for which ai ≤ b < ai+1 . Work is proportional to the tree height (O(log n))
274 // Algorithm Basic Techniques Adding a leaf: First find the place for addition with previous algorithm
• (If b = ai – nothing needs to be done) • add the leaf and repair the tree
11.5 Pipelining
275 // Algorithm Basic Techniques
• Add root on new level, if needed
11.5 Pipelining
276 // Algorithm Basic Techniques
11.5 Pipelining
Adding k sorted elements b1 < b2 < · · · < bk to 2-3 tree Sequential algorithm: Time O(k log n), Work: O(k log n)
• With sequential algorithm add b1 and bk to the tree (Time: O(log n)) • In parallel, search places for each element bi • A. Simple case – between each pair ai and ai+1 at most 1 element b j : ◦ adding nodes in parallel (for repairing the tree) moving from bottom to top (never more than 6 subnodes!)
◦ Time: O(log n), ◦ Work: O(k log n). • B. More difficult case – if for example between ai and ai+1 a sequence bl , bl+1 , ..., b f , repair the tree with middle element, then subsequence middle elements etc. in parallel
• Total time: O(log k log n)
277 // Algorithm Basic Techniques
11.5 Pipelining
Pipelining can be used – operations in parallel in waves (each level has exactly one repairment of the tree)!
• Time: O(log k + log n) = O(log n), Work: O(k log n)
278 // Algorithm Basic Techniques
11.6
11.6 Accelerated Cascading
Accelerated Cascading
Consider:
• fast but nonoptimal algorithm and • slower but optimal algorithm Question: Could these be combned into a fast and optimal algorithm?
279 // Algorithm Basic Techniques 11.6.1
11.6 Accelerated Cascading
Constant time maximum algorithm
Algorithm 2.8 (maximum in O(1) time) Input: Array A, with all p elements being different Output: Array M of logical values such that M(i)=1 exactly iff A(i) is maximal element of array A begin 1. for 1 ≤ i, j ≤ p pardo if (A(i) ≥ A( j)) then B(i, j) := 1 else B(i, j) := 0 2. for 1 ≤ i ≤ p pardo M(i) := B(i, 1) ∧ B(i, 2) ∧ · · · ∧ B(i, p) end
CRCW PRAM model, O(1) time, O(p2 ) operations
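A minimal Fortran sketch of the two pardo loops of Algorithm 2.8 (array names assumed; here B and M are taken to be LOGICAL arrays instead of 0/1 values):
! Sketch: M(i) becomes .TRUE. exactly when A(i) is the maximum of A(1:p)
DO i = 1, p                  ! pardo
  DO j = 1, p                ! pardo
    B(i,j) = (A(i) >= A(j))
  END DO
END DO
DO i = 1, p                  ! pardo
  M(i) = ALL(B(i,1:p))
END DO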
280 // Algorithm Basic Techniques 11.6.2
11.6 Accelerated Cascading
Doubly logarithmic-time algorithm
Assume n = 2^(2^k) for some integer k.
Doubly logarithmic-depth n-leaf tree (DLT): the root has 2^(2^(k−1)) (= √n) children, each of which has 2^(2^(k−2)) children and, in general, each node on the i-th level has 2^(2^(k−i−1)) children, 0 ≤ i ≤ k − 1. Each level-k node has 2 children.
Example: 16-node doubly logarithmic-depth tree (k = 2)
281 // Algorithm Basic Techniques
11.6 Accelerated Cascading
• by induction it is easy to see that the number of nodes on the i-th level is 2^(2^k − 2^(k−i)),
• the number of nodes on level k is 2^(2^k − 2^0) = 2^(2^k − 1) = n/2
• depth of the tree: k + 1 = log log n + 1
• the DLT can be used for finding the maximum of n elements: each node gets the maximum of its subtree, starting from the bottom, with O(1) time per level using Algorithm 2.8
• => T(n) = O(log log n)
• W(n) = ? On the i-th level each node performs O((2^(2^(k−i−1)))²) operations, giving O((2^(2^(k−i−1)))² · 2^(2^k − 2^(k−i))) = O(2^(2^k)) = O(n) operations on level i
• => W(n) = O(n log log n)
• => the algorithm is not optimal!
282 // Algorithm Basic Techniques 11.6.3
11.6 Accelerated Cascading
Making fast algorithm optimal
logarithmic-depth binary tree – optimal, time O(log n)
Accelerated cascading:
1. Start with the optimal algorithm until the size of the problem is below a certain level
2. Switch over to the fast but nonoptimal algorithm
Problem of finding the maximum:
1. start with the binary-tree algorithm from the leaves, ⌈log log log n⌉ levels up. As the number of nodes reduces by a factor of two on each level, we are left with the maxima of n′ = O(n/log log n) nodes. Number of operations so far: O(n), time: O(log log log n).
2. DLT on the remaining n′ elements: time O(log log n′) = O(log log n), and W(n) = O(n′ log log n′) = O(n) operations.
=> the hybrid algorithm is optimal and works in O(log log n) time.
283 // Algorithm Basic Techniques
11.7
11.7 Symmetry breaking
Symmetry breaking
Let G = (V, E) be a directed cycle (indegree and outdegree of every node are 1, and ∀u, v ∈ V there is a directed path from node u to v).
A k-coloring of a graph is a mapping c : V → {0, 1, ..., k − 1} such that c(i) ≠ c(j) if ⟨i, j⟩ ∈ E.
284 // Algorithm Basic Techniques
11.7 Symmetry breaking
Consider the 3-coloring problem. The sequential algorithm is simple – alternate the colors 0 and 1 along the cycle and, at the end, pick a color from {1, 2} if needed. Let the edges be given by an array S(1:n) (there is an edge ⟨i, S(i)⟩ ∈ E); the previous node satisfies P(S(i)) = i. Suppose there is a coloring c(1:n) (one can always be found by taking, for example, c(i) = i). Consider the binary representation i = i_{t−1} · · · i_k · · · i_1 i_0; the k-th least significant bit is i_k.
Algorithm 2.9 (Basic coloring)
Input: a directed cycle with edges given by the array S(1:n) and a coloring c(1:n) of the nodes
Output: a new coloring c′(1:n)
begin
  for 1 ≤ i ≤ n pardo
    1. Find the least significant bit position k where c(i) and c(S(i)) differ
    2. c′(i) := 2k + c(i)_k
end
285 // Algorithm Basic Techniques
11.7 Symmetry breaking
v:  1    3    7    14   2    15   4    5    6    8    10   11   12   9    13
c:  0001 0011 0111 1110 0010 1111 0100 0101 0110 1000 1010 1011 1100 1001 1101
k:  1    2    0    2    0    0    0    0    1    1    0    0    0    2    2
c′: 2    4    1    5    0    1    0    1    3    2    0    1    0    4    5
286 // Algorithm Basic Techniques
11.7 Symmetry breaking
Lemma 2.3: Output c0 (1 : n) is valid coloring whenever input c(1 : n) is valid coloring. T (n) = O(1), W (n) = O(n). Proof:
• k can be found always, if c is valid coloring • Consider c0 (i) = c0 ( j), if there is an edge i → j => c0 (i) = 2k + c(i)k and c0 ( j) = 2l + c( j)l as c0 (i) = c0 ( j) => k = l => c(i)k = c( j)k , which is contradicting the definition of k . => c0 (i) 6= c0 ( j), whenever there is an edge i → j.
287 // Algorithm Basic Techniques 11.7.1
11.7 Symmetry breaking
Superfast 3-coloring algorithm
Let t > 3 (the number of bits in c) => c′(1:n) can be described with ⌈log t⌉ + 1 bits. => if the number of colors in c(1:n) is q, where 2^(t−1) < q ≤ 2^t, then the coloring c′ uses at most 2^(⌈log t⌉+1) = O(t) = O(log q) colors. => the number of colors reduces exponentially. A reduction takes place whenever t > ⌈log t⌉ + 1, i.e. t > 3. Applying the basic coloring algorithm with t = 3 gives at most 6 colors, as c′(i) = 2k + c(i)_k and 0 ≤ k ≤ 2 => 0 ≤ c′(i) ≤ 5.
288 // Algorithm Basic Techniques
11.7 Symmetry breaking
Number of iterations? Denote log^(i) x: log^(1) x = log x; log^(i) x = log(log^(i−1) x). Let log* x = min{i | log^(i) x ≤ 1}. The function log* x grows extremely slowly, being < 5 for every x ≤ 2^65536!
Theorem 2.6: 3-coloring of a directed cycle can be performed in O(log* n) time with O(n log* n) operations.
Proof: Applying the basic coloring algorithm, we reach at most 6 colors in O(log* n) iterations. The colors 3, 4 and 5 are removed with 3 further iterations – each such color is changed to the smallest possible color from the set {0, 1, 2} (depending on the neighbours). This works correctly because no two neighbouring nodes are recolored at the same time.
289 // Algorithm Basic Techniques 11.7.2
11.7 Symmetry breaking
Optimal 3-coloring algorithm
Remark. (Sorting integers) Let be given n integers in the interval [0, O(log n)]. These can be sorted in O(log n) time using linear number of operations. (Can be achieved through combination of radix-sort and prefix sum algorithms)
Algorithm 2.10 (3-coloring of a cycle)
Input: a directed cycle of length n with edges defined by the array S(1:n)
Output: a 3-coloring of the cycle nodes
begin
1. for 1 ≤ i ≤ n pardo C(i) := i
2. Apply Algorithm 2.9 once
3. Sort the nodes by color
4. for i = 3 : 2⌈log n⌉ do
       for each node of color i pardo
           color it with the smallest possible color from the set {0, 1, 2} that differs from its neighbours'
end
290 Scalability
12 12.1
12.1 Isoefficiency Metric of Scalability
Parallel Algorithm Scalability1 Isoefficiency Metric of Scalability
• For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for a given algorithm.
• For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant. 1
Based on [1]
291 Scalability
12.1 Isoefficiency Metric of Scalability
Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems
• What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?
292 Scalability
12.1 Isoefficiency Metric of Scalability
• This rate determines the scalability of the system. The slower this rate, the better
• Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem
• We can write the parallel runtime as:
  T_P = (W + T_o(W, p)) / p
• The resulting expression for speedup is:
  S = W / T_P = W·p / (W + T_o(W, p))
293 Scalability
12.1 Isoefficiency Metric of Scalability
• Finally, we write the expression for efficiency as
  E = S / p = W / (W + T_o(W, p)) = 1 / (1 + T_o(W, p)/W)
• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio T_o/W is maintained at a constant value.
• For a desired value E of efficiency,
  E = 1 / (1 + T_o(W, p)/W)
  T_o(W, p)/W = (1 − E)/E
  W = (E/(1 − E)) · T_o(W, p)
294 Scalability
12.1 Isoefficiency Metric of Scalability
• If K = E/(1 − E) is a constant depending on the efficiency to be maintained, then, since T_o is a function of W and p, we have
  W = K · T_o(W, p)
• The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.
• This function is called the isoefficiency function. • This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements
295 Scalability
12.1 Isoefficiency Metric of Scalability
Example: Adding n numbers on p processors
• The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p:
  ◦ The first stage of the algorithm (summing n/p numbers locally) takes n/p time.
  ◦ The second phase involves log p steps, each with one communication and one addition. Assuming a single communication takes unit time, this phase takes 2 log p time.
Hence
  T_P = n/p + 2 log p
  S = n / (n/p + 2 log p)
  E = 1 / (1 + (2p log p)/n)
296 Scalability
12.1 Isoefficiency Metric of Scalability
• Proceeding as before, but substituting T_o by 2p log p, we get W = K · 2p log p.
• Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).
• If the number of processing elements is increased from p to p′, the problem size (in this case, n) must be increased by a factor of (p′ log p′)/(p log p) to get the same efficiency as on p processing elements.

Efficiencies for different n and p (the sketch below the table recomputes these values):

  n      p = 1   p = 4   p = 8   p = 16   p = 32
  64     1.0     0.80    0.57    0.33     0.17
  192    1.0     0.92    0.80    0.60     0.38
  320    1.0     0.95    0.87    0.71     0.50
  512    1.0     0.97    0.91    0.80     0.62
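A quick numerical check of the table (a minimal Python sketch; the name efficiency_add is illustrative):

import math

def efficiency_add(n, p):
    # E = 1 / (1 + 2*p*log2(p)/n) for adding n numbers on p processing elements.
    if p == 1:
        return 1.0
    return 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

for n in (64, 192, 320, 512):
    print(n, [round(efficiency_add(n, p), 2) for p in (1, 4, 8, 16, 32)])

The printed values reproduce the table above (up to rounding).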
297 Scalability
12.1 Isoefficiency Metric of Scalability
Example 2: Let T_o = p^(3/2) + p^(3/4)·W^(3/4).
• Using only the first term of T_o in W = K·T_o(W, p), we get W = K·p^(3/2).
• Using only the second term yields the following relation between W and p:
  W = K·p^(3/4)·W^(3/4)
  W^(1/4) = K·p^(3/4)
  W = K^4·p^3
• The larger of these two asymptotic rates determines the isoefficiency, which is therefore Θ(p^3).
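As a numerical illustration (not from the slides), the relation W = K·T_o(W, p) can also be solved for W by a simple fixed-point iteration; with this T_o the computed W grows roughly like p^3, consistent with the Θ(p^3) isoefficiency. The function name and the choice K = 1 are arbitrary.

def isoefficiency_W(p, K=1.0, iters=200):
    # Fixed-point iteration for W = K * (p**1.5 + p**0.75 * W**0.75).
    W = 1.0
    for _ in range(iters):
        W = K * (p ** 1.5 + p ** 0.75 * W ** 0.75)
    return W

# isoefficiency_W(2 * p) / isoefficiency_W(p) approaches roughly 2**3 = 8 for large p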
298 Scalability
12.2 Cost-Optimality and the Isoefficiency Function
• A parallel system is cost-optimal if and only if
pTP = Θ(W )
• From this, we have:
  W + T_o(W, p) = Θ(W)
  T_o(W, p) = O(W)
  W = Ω(T_o(W, p))
=> A parallel system is cost-optimal if and only if its overhead function does not asymptotically exceed the problem size.
• If we have an isoefficiency function f (p), then it follows that the relation W = Ω( f (p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.
299 Scalability
12.2 Cost-Optimality and the Isoefficiency Function
Lower Bound on the Isoefficiency Function
• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.
• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.

Degree of Concurrency and the Isoefficiency Function
• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.
• If C(W ) is the degree of concurrency of a parallel algorithm, then for a problem of size W , no more than C(W ) processing elements can be employed effectively.
300 Scalability
12.2 Cost-Optimality and the Isoefficiency Function
Degree of Concurrency and the Isoefficiency Function: Example
• Consider solving a system of n equations in n variables by using Gaussian elimination (W = Θ(n^3)).
• The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n^2) computations.
• At most Θ(n^2) processing elements can be kept busy at any time.
• Since W = Θ(n^3) for this problem, the degree of concurrency C(W) is Θ(W^(2/3)).
• Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.
301 Graph Algorithms
12.2 Cost-Optimality and the Isoefficiency Function
Scalability of Parallel Graph Algorithms
12.1 Definitions and Representation
12.2 Minimum Spanning Tree: Prim’s Algorithm
12.3 Single-Source Shortest Paths: Dijkstra’s Algorithm
12.4 All-Pairs Shortest Paths
12.5 Connected Components
?? Algorithms for Sparse Graphs
302 Graph Algorithms
12.1 Definitions and Representation
• An undirected graph G is a pair (V, E), where V is a finite set of points called vertices and E is a finite set of edges.
• An edge e ∈ E is an unordered pair (u, v), where u, v ∈ V.
• In a directed graph, the edge e is an ordered pair (u, v). An edge (u, v) is incident from vertex u and is incident to vertex v.
• A path from a vertex v to a vertex u is a sequence ⟨v_0, v_1, v_2, ..., v_k⟩ of vertices where v_0 = v, v_k = u, and (v_i, v_{i+1}) ∈ E for i = 0, 1, ..., k − 1.
• The length of a path is defined as the number of edges in the path.
• An undirected graph is connected if every pair of vertices is connected by a path.
• A forest is an acyclic graph, and a tree is a connected acyclic graph • A graph that has weights associated with each edge is called a weighted graph
303 Graph Algorithms
12.1 Definitions and Representation
• Graphs can be represented by their adjacency matrix or by an edge (or vertex) list.
• Adjacency matrices have a value a_{i,j} = 1 if nodes i and j share an edge, and 0 otherwise. In case of a weighted graph, a_{i,j} = w_{i,j}, the weight of the edge.
• The adjacency list representation of a graph G = (V, E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] contains all vertices adjacent to v.
• For a graph with n nodes, the adjacency matrix takes Θ(n^2) space and the adjacency list takes Θ(|E|) space.
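A minimal Python sketch of the two representations for a small weighted graph (the variable names and the example edges are arbitrary):

# Weighted undirected graph with vertices 0..3 and edges (0,1,w=2), (1,2,w=5), (0,3,w=1)
n = 4
edges = [(0, 1, 2.0), (1, 2, 5.0), (0, 3, 1.0)]

INF = float("inf")
adj_matrix = [[INF] * n for _ in range(n)]   # Theta(n^2) space
adj_list = [[] for _ in range(n)]            # Theta(|E|) space
for u, v, w in edges:
    adj_matrix[u][v] = adj_matrix[v][u] = w
    adj_list[u].append((v, w))
    adj_list[v].append((u, w))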
304 Graph Algorithms
12.1 Definitions and Representation
• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G • In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph
• A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight
305 Graph Algorithms
12.2 Minimum Spanning Tree: Prim's Algorithm
• Prim’s algorithm for finding an MST is a greedy algorithm • Start by selecting an arbitrary vertex, include it into the current MST • Grow the current MST by inserting into it the vertex closest to one of the vertices already in current MST
306 Graph Algorithms
12.2 Minimum Spanning Tree: Prim’s Algorithm
procedure PRIM_MST(V, E, w, r)
begin
    V_T := {r};
    d[r] := 0;
    for all v ∈ (V − V_T) do
        if edge (r, v) exists then set d[v] := w(r, v);
        else set d[v] := ∞;
        endif
    while V_T ≠ V do
    begin
        find a vertex u such that d[u] := min{d[v] | v ∈ (V − V_T)};
        V_T := V_T ∪ {u};
        for all v ∈ (V − V_T) do
            d[v] := min{d[v], w(u, v)};
    endwhile
end PRIM_MST
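A serial Python sketch of the same procedure on an adjacency-matrix input (W[u][v] is the edge weight, or math.inf if the edge is missing); the function name prim_mst is illustrative:

import math

def prim_mst(W, r=0):
    # Sequential Prim's algorithm on an n x n weight matrix W.
    # Returns d, where d[v] is the weight of the MST edge attaching v (d[r] = 0).
    # Runs in Theta(n^2) time.
    n = len(W)
    in_tree = [False] * n
    d = [W[r][v] for v in range(n)]
    d[r] = 0.0
    in_tree[r] = True
    for _ in range(n - 1):
        # pick the vertex closest to the current tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v]:
                d[v] = min(d[v], W[u][v])
    return d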
307 Graph Algorithms
12.2 Minimum Spanning Tree: Prim’s Algorithm
Parallel formulation
• The algorithm works in n outer iterations - it is hard to execute these iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node.
• This node is inserted into the MST, and the choice is broadcast to all processors.
• Each processor updates its part of the d vector locally.
308 Graph Algorithms
12.2 Minimum Spanning Tree: Prim’s Algorithm
• The cost to select the minimum entry is O(n/p + log p).
• The cost of a broadcast is O(log p).
• The cost of the local update of the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is O(n^2/p + n log p):
  T_p = Θ(n^2/p)  [computation]  +  Θ(n log p)  [communication]
• Since the sequential run time is W = Θ(n^2):
  S = Θ(n^2) / (Θ(n^2/p) + Θ(n log p))
  E = 1 / (1 + Θ((p log p)/n))
309 Graph Algorithms
12.2 Minimum Spanning Tree: Prim’s Algorithm
• => For a cost-optimal parallel formulation, (p log p)/n = O(1).
• => Prim's algorithm can use only p = O(n/log n) processes in this formulation.
• The corresponding isoefficiency is O(p^2 log^2 p).
310 Graph Algorithms
12.3 Single-Source Shortest Paths: Dijkstra's Algorithm
• For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V • Dijkstra’s algorithm is similar to Prim’s algorithm. It maintains a set of nodes for which the shortest paths are known
• It grows this set by adding, at each step, the node that is closest to the source via nodes already in the current shortest-path set.
311 Graph Algorithms
12.3 Single-Source Shortest Paths: Dijkstra’s Algorithm
procedure DIJKSTRA_SINGLE_SOURCE_SP(V, E, w, s)
begin
    V_T := {s};
    for all v ∈ (V − V_T) do
        if (s, v) exists then set l[v] := w(s, v);
        else set l[v] := ∞;
        endif
    while V_T ≠ V do
    begin
        find a vertex u such that l[u] := min{l[v] | v ∈ (V − V_T)};
        V_T := V_T ∪ {u};
        for all v ∈ (V − V_T) do
            l[v] := min{l[v], l[u] + w(u, v)};
    endwhile
end DIJKSTRA_SINGLE_SOURCE_SP
The run time of Dijkstra's algorithm is Θ(n^2).
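A serial Python sketch in the same style as the Prim sketch above (dijkstra_sssp is an illustrative name; W is again an adjacency matrix with math.inf for missing edges):

import math

def dijkstra_sssp(W, s=0):
    # Sequential Dijkstra; returns l, where l[v] is the length of the shortest
    # path from s to v. Theta(n^2) time on an adjacency-matrix input.
    n = len(W)
    done = [False] * n
    l = [W[s][v] for v in range(n)]
    l[s] = 0.0
    done[s] = True
    for _ in range(n - 1):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        for v in range(n):
            if not done[v]:
                l[v] = min(l[v], l[u] + W[u][v])
    return l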
312 Graph Algorithms
12.4 All-Pairs Shortest Paths
Parallel formulation • Very similar to the parallel formulation of Prim’s algorithm for minimum spanning trees
• The weighted adjacency matrix is partitioned using the 1-D block mapping • Each process selects, locally, the node closest to the source, followed by a global reduction to select next node
• The node is broadcast to all processors and the l -vector updated. • The parallel performance of Dijkstra’s algorithm is identical to that of Prim’s algorithm
12.4 All-Pairs Shortest Paths
• Given a weighted graph G(V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi , v j ∈ V . • A number of algorithms are known for solving this problem.
313 Graph Algorithms
12.4 All-Pairs Shortest Paths
Matrix-Multiplication Based Algorithm • Consider the multiplication of the weighted adjacency matrix with itself - except, in this case, we replace the multiplication operation in matrix multiplication by addition, and the addition operation by minimization
• Notice that the min-plus product of the weighted adjacency matrix with itself returns a matrix that contains the shortest paths of length at most 2 (i.e. using at most two edges) between every pair of nodes.
• It follows from this argument that A^n contains all shortest paths.
• A^n is computed by repeated squaring, i.e. as A, A^2, A^4, A^8, ...
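A serial Python sketch of the min-plus product ("multiplication" replaced by addition, "addition" by minimization) and the repeated-squaring scheme; the names are illustrative, and A is an n x n matrix with A[i][i] = 0 and math.inf for missing edges:

import math

def min_plus_product(A, B):
    # (A (x) B)[i][j] = min over k of (A[i][k] + B[k][j]).
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_by_squaring(A):
    # Repeated squaring: A, A^2, A^4, ... until the power is at least n-1.
    n = len(A)
    D, length = A, 1
    while length < n - 1:
        D = min_plus_product(D, D)
        length *= 2
    return D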
314 Graph Algorithms
12.4 All-Pairs Shortest Paths
315 Graph Algorithms
12.4 All-Pairs Shortest Paths
• We need log n matrix multiplications, each taking time O(n^3).
• The serial complexity of this procedure is O(n^3 log n).
• This algorithm is not optimal, since the best known algorithms have complexity O(n^3).

Parallel formulation
• Each of the log n matrix multiplications can be performed in parallel.
• We can use n^3/log n processors to compute each matrix-matrix product in time log n.
• The entire process takes O(log^2 n) time.
Dijkstra’s Algorithm
316 Graph Algorithms
12.4 All-Pairs Shortest Paths
• Execute n instances of the single-source shortest path problem, one for each of the n source vertices.
• Complexity is O(n^3).

Parallel formulation
Two parallelization strategies: execute each of the n shortest path problems on a different processor (source partitioned), or use a parallel formulation of the shortest path problem to increase concurrency (source parallel).

Dijkstra's Algorithm: Source Partitioned Formulation
• Use n processors, each processor Pi finds the shortest paths from vertex vi to all other vertices by executing Dijkstra’s sequential single-source shortest paths algorithm.
• It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes).
317 Graph Algorithms
12.4 All-Pairs Shortest Paths
• The parallel run time of this formulation is Θ(n^2).
• While the algorithm is cost optimal, it can only use n processors. Therefore, the isoefficiency due to concurrency is Θ(p^3).

Dijkstra's Algorithm: Source Parallel Formulation
• In this case, each of the shortest path problems is further executed in parallel. We can therefore use up to n^2 processors.
• Given p processors (p > n), each single-source shortest path problem is executed by p/n processors.
• Using previous results, this takes time
  T_p = Θ(n^3/p)  [computation]  +  Θ(n log p)  [communication]
318 Graph Algorithms
12.4 All-Pairs Shortest Paths
• For cost optimality, we have p = O(n^2/log n) and the isoefficiency is Θ((p log p)^1.5).
Floyd's Algorithm
• Let G = (V, E, w) be the weighted graph with vertices V = {v_1, v_2, ..., v_n}.
• For any pair of vertices v_i, v_j ∈ V, consider all paths from v_i to v_j whose intermediate vertices belong to the subset {v_1, v_2, ..., v_k} (k ≤ n). Let p_{i,j}^(k) (of weight d_{i,j}^(k)) be the minimum-weight path among them.
• If vertex v_k is not in the shortest path from v_i to v_j, then p_{i,j}^(k) is the same as p_{i,j}^(k−1).
• If v_k is in p_{i,j}^(k), then we can break p_{i,j}^(k) into two paths - one from v_i to v_k and one from v_k to v_j. Each of these paths uses vertices only from {v_1, v_2, ..., v_{k−1}}.
319 Graph Algorithms
12.4 All-Pairs Shortest Paths
From our observations, the following recurrence relation follows:

  d_{i,j}^(k) = w(v_i, v_j)                                             if k = 0
  d_{i,j}^(k) = min{ d_{i,j}^(k−1), d_{i,k}^(k−1) + d_{k,j}^(k−1) }     if k ≥ 1

This equation must be computed for each pair of nodes and for k = 1, ..., n. The serial complexity is O(n^3).
procedure FLOYD_ALL_PAIRS_SP(A)
begin
    D^(0) := A;
    for k := 1 to n do
        for i := 1 to n do
            for j := 1 to n do
                d_{i,j}^(k) := min(d_{i,j}^(k−1), d_{i,k}^(k−1) + d_{k,j}^(k−1));
end FLOYD_ALL_PAIRS_SP
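A serial Python sketch of the same procedure, updating a single matrix D in place (a standard simplification; D[i][i] = 0 and math.inf marks missing edges):

import math
from copy import deepcopy

def floyd_all_pairs_sp(A):
    # Floyd's all-pairs shortest paths; returns the distance matrix. O(n^3) time.
    n = len(A)
    D = deepcopy(A)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D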
320 Graph Algorithms
12.4 All-Pairs Shortest Paths
Parallel formulation: 2D Block Mapping
• Matrix D^(k) is divided into p blocks of size (n/√p) × (n/√p).
• Each processor updates its part of the matrix during each iteration.
• To compute d_{l,r}^(k), processor P_{i,j} must get d_{l,k}^(k−1) and d_{k,r}^(k−1).
• In general, during the k-th iteration, each of the √p processes containing part of the k-th row sends it to the √p − 1 processes in the same column.
• Similarly, each of the √p processes containing part of the k-th column sends it to the √p − 1 processes in the same row.
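A small Python sketch of the ownership logic under this 2-D block mapping (pure bookkeeping, no communication; all names are illustrative). It assumes n is divisible by √p:

def block_owners(k, n, sqrt_p):
    # For iteration k, return the grid coordinates of the processes that hold
    # a segment of row k and of column k of an n x n matrix distributed in
    # (n/sqrt_p) x (n/sqrt_p) blocks over a sqrt_p x sqrt_p process grid.
    b = n // sqrt_p                 # block side length
    row_of_k = k // b               # process-grid row holding matrix row k
    col_of_k = k // b               # process-grid column holding matrix column k
    row_holders = [(row_of_k, j) for j in range(sqrt_p)]   # broadcast down their columns
    col_holders = [(i, col_of_k) for i in range(sqrt_p)]   # broadcast along their rows
    return row_holders, col_holders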
321 Graph Algorithms
12.4 All-Pairs Shortest Paths
322 Graph Algorithms
12.4 All-Pairs Shortest Paths
323 Graph Algorithms
12.4 All-Pairs Shortest Paths
procedure FLOYD_2DBLOCK(D^(0))
begin
    for k := 1 to n do
    begin
        each process P_{i,j} that has a segment of the k-th row of D^(k−1) broadcasts it to the P_{*,j} processes;
        each process P_{i,j} that has a segment of the k-th column of D^(k−1) broadcasts it to the P_{i,*} processes;
        each process waits to receive the needed segments;
        each process P_{i,j} computes its part of the D^(k) matrix;
    end
end FLOYD_2DBLOCK
• During each iteration of the algorithm, the kth row and kth column of processors perform a one-to-all broadcast along their rows/columns.
• The size of this broadcast is n/√p elements, taking time Θ((n log p)/√p).
• The synchronization step takes time Θ(log p).
324 Graph Algorithms
12.4 All-Pairs Shortest Paths
• The computation time is Θ(n^2/p).
• The parallel run time of the 2-D block mapping formulation of Floyd's algorithm is
  T_p = Θ(n^3/p)  [computation]  +  Θ((n^2/√p) log p)  [communication]
• The above formulation can use O(n^2/log^2 n) processors cost-optimally.
• The isoefficiency of this formulation is Θ(p^1.5 log^3 p).
• This algorithm can be further improved by relaxing the strict synchronization after each iteration.

Speeding things up by pipelining
325 Graph Algorithms
12.4 All-Pairs Shortest Paths
• The synchronization step in parallel Floyd’s algorithm can be removed without affecting the correctness of the algorithm.
• A process starts working on the k-th iteration as soon as it has computed the (k−1)-th iteration and has the relevant parts of the D^(k−1) matrix.

Figure: Communication protocol followed in the pipelined 2-D block mapping formulation of Floyd's algorithm. Assume that process 4 at time t has just computed a segment of the k-th column of the D^(k−1) matrix. It sends the segment to processes 3 and 5. These processes receive the segment at time t + 1 (where the time unit is the time it takes for a matrix segment to travel over the communication link between adjacent processes). Similarly, processes farther away from process 4 receive the segment later. Process 1 (at the boundary) does not forward the segment after receiving it.

• In each step, n/√p elements of the first row are sent from process P_{i,j} to P_{i+1,j}.
326 Graph Algorithms
12.4 All-Pairs Shortest Paths
• Similarly, elements of the first column are sent from process P_{i,j} to process P_{i,j+1}.
• Each such step takes time Θ(n/√p).
• After Θ(√p) steps, process P_{√p,√p} gets the relevant elements of the first row and first column in time Θ(n).
• The values of successive rows and columns follow after time Θ(n^2/p) in a pipelined mode.
• Process P_{√p,√p} finishes its share of the shortest path computation in time Θ(n^3/p) + Θ(n).
• When process P_{√p,√p} has finished the (n − 1)-th iteration, it sends the relevant values of the n-th row and column to the other processes.
• The overall parallel run time of this formulation is
  T_p = Θ(n^3/p)  [computation]  +  Θ(n)  [communication]
327 Graph Algorithms
12.4 All-Pairs Shortest Paths
• The pipelined formulation of Floyd's algorithm uses up to O(n^2) processes efficiently.
• The corresponding isoefficiency is Θ(p^1.5).
All-pairs Shortest Path: Comparison
328 Graph Algorithms
12.5 Connected Components
• The connected components of an undirected graph are the equivalence classes of vertices under the “is reachable from” relation
• A graph with three connected components: {1,2,3,4}, {5,6,7}, and {8,9}:
Depth-First Search (DFS) Based Algorithm • Perform DFS on the graph to get a forest - each tree in the forest corresponds to a separate connected component
• Part (b) is a depth-first forest obtained from depth-first traversal of the graph in part (a). Each of these trees is a connected component of the graph in part (a):
329 Graph Algorithms
12.5 Connected Components
Parallel Formulation
• Partition the graph across processors and run independent connected component algorithms on each processor. At this point, we have p spanning forests. • In the second step, spanning forests are merged pairwise until only one spanning forest remains.
330 Graph Algorithms
12.5 Connected Components

Computing connected components in parallel: the adjacency matrix of the graph G in (a) is partitioned into two parts (b).
Each process gets a subgraph of G ((c) and (e)).
Each process then computes the spanning forest of the subgraph ((d) and (f)). Finally, the two spanning trees are merged to form the solution.
331 Graph Algorithms
12.5 Connected Components
• To merge pairs of spanning forests efficiently, the algorithm uses disjoint sets of edges.
• We define the following operations on the disjoint sets: • find(x) ◦ returns a pointer to the representative element of the set containing x . Each set has its own unique representative.
• union(x, y) ◦ unites the sets containing the elements x and y. The two sets are assumed to be disjoint prior to the operation.
• For merging forest A into forest B, for each edge (u, v) of A, a find operation is performed to determine if the vertices are in the same tree of B. • If not, then the two trees (sets) of B containing u and v are united by a union operation.
332 Graph Algorithms
12.5 Connected Components
• Otherwise, no union operation is necessary.
• Hence, merging A and B requires at most 2(n − 1) find operations and (n − 1) union operations.
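A minimal Python union-find sketch with the two operations named above, plus a merge of forest A (given as a list of edges) into the structure for B; path compression is added for efficiency, and all names are illustrative:

def make_sets(n):
    return list(range(n))            # parent[i] == i means i is a representative

def find(parent, x):
    # Return the representative of the set containing x (with path halving).
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, x, y):
    # Unite the (disjoint) sets containing x and y.
    parent[find(parent, x)] = find(parent, y)

def merge_forest(parent, edges_of_A):
    # Merge forest A into the forest described by 'parent': for each edge (u, v)
    # of A, unite the trees containing u and v if they are still different.
    merged = []
    for u, v in edges_of_A:
        if find(parent, u) != find(parent, v):
            union(parent, u, v)
            merged.append((u, v))
    return merged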
Parallel 1-D Block Mapping
• The n × n adjacency matrix is partitioned into p blocks.
• Each processor can compute its local spanning forest in time Θ(n^2/p).
• Merging is done by embedding a logical tree into the topology. There are log p merging stages, and each takes time Θ(n). Thus, the cost due to merging is Θ(n log p).
• During each merging stage, spanning forests are sent between nearest neighbours. Recall that Θ(n) edges of the spanning forest are transmitted.
333 Graph Algorithms
12.5 Connected Components
• The parallel run time of the connected-component algorithm is
  T_p = Θ(n^2/p)  [local computation]  +  Θ(n log p)  [forest merging]
• For a cost-optimal formulation, p = O(n/log n). The corresponding isoefficiency is Θ(p^2 log^2 p).
334 Summary
Contents

1 Parallel Computing – motivation  11
  1.1 History of computing  12
  1.2 Expert's predictions  15
  1.3 Usage areas of a petaflop computer  17
  1.4 Example 1.1  21
  1.5 Example 1.2  24
  1.6 Example 1.3  27
  1.7 Example 1.4  28

2 Computer architectures and parallelism  29

I Parallel Architectures  31
  2.1 Processor development  31
  2.2 Memory performance effect  39
  2.3 Memory Bandwidth (MB) and performance  44
  2.4 Other approaches for hiding memory latency  48
  2.5 Classification of Parallel Computers  52
  2.6 Flynn's classification  58
  2.7 Type of memory access  66
  2.8 Communication model of parallel computers  68
  2.9 Other classifications  71
  2.10 Clos-networks  78
  2.11 Beowulf clusters  87

3 Parallel Performance  91
  3.1 Speedup  91
  3.2 Efficiency  92
  3.3 Amdahl's law  93
  3.4 Gustafson-Barsis' law  96
  3.5 Scaled efficiency  97
  3.6 Methods to increase efficiency  98
  3.7 Benchmarks  100

II Parallel Programming  105

4 Parallel Programming Models – an Overview  105
  4.1 Model of Tasks and Channels  106
  4.2 Message passing model  109
  4.3 Shared Memory model  110
  4.4 Implicit Parallelism  111
  4.1 Task parallel model  113
  4.2 Data parallel model  113

5 Shared memory programming model  115
  5.1 PRAM (Parallel Random Access Machine) model  116
  5.2 OpenMP  117
  5.3 OpenMP directives: PARALLEL  124
  5.4 OpenMP work-sharing directives: DO  127
  5.5 OpenMP work-sharing directives: SECTIONS  131
  5.6 OpenMP work-sharing directives: SINGLE  134
  5.7 Combining OpenMP work-sharing directives  135
  5.8 OpenMP: synchronisation constructs  137
  5.9 OpenMP scoping directives  140
  5.10 OpenMP library routines  144
  5.11 OpenMP example programs  147

6 Fortran and HPF  148
  6.1 Fortran Evolution  148
  6.2 Concept  149
  6.3 PROCESSORS declaration  150
  6.4 DISTRIBUTE-directive  151
  6.5 Distribution of allocatable arrays  156
  6.6 HPF rule: Owner Calculates  157
  6.7 Scalar variables  158
  6.8 Examples of good DISTRIBUTE subdivision  159
  6.9 HPF programming methodology  161
  6.10 BLOCK(m) and CYCLIC(m)  164
  6.11 Array alignment  165
  6.12 Strided Alignment  169
  6.13 Example on Alignment  171
  6.14 Alignment with Allocatable Arrays  172
  6.15 Dimension collapsing  175
  6.16 Dimension replication  175
  6.17 HPF Intrinsic Functions  179
  6.18 HPF Template Syntax  180
  6.19 FORALL  184
  6.20 PURE-procedures  188
  6.21 INDEPENDENT  192
  6.22 INDEPENDENT NEW command  195
  6.23 EXTRINSIC  197

7 MPI  198

III Parallel Algorithms  199

8 Parallel Algorithm Design Principles  199
  8.1 Decomposition, Tasks and Dependency Graphs  200
  8.2 Decomposition Techniques  203
  8.3 Tasks and Interactions  205
  8.4 Mapping Techniques for Load balancing  207
  8.5 Parallel Algorithm Models  210

9 Solving systems of linear equations with sparse matrices  212
  9.1 EXAMPLE: 2D FEM for Poisson equation  212
  9.2 Solution methods for sparse matrix systems  229
  9.3 Preconditioning  231
  9.4 Domain Decomposition Method  233
  9.5 Domain Decomposition on Unstructured Grids  234

10 Parallel Algorithm Optimality Definitions  243
  10.1 Assessing Parallel Algorithms  244
  10.2 Definitions of Optimality  245

11 Basic Techniques in Parallel Algorithms  247
  11.1 Balanced Trees  247
  11.2 Pointer Jumping Method  250
  11.3 Divide and Conquer  255
  11.4 Partitioning  263
  11.5 Pipelining  271
  11.6 Accelerated Cascading  278
  11.7 Symmetry breaking  283

12 Parallel Algorithm Scalability  290
  12.1 Isoefficiency Metric of Scalability  290
  12.2 Cost-Optimality and the Isoefficiency Function  298
  12.1 Definitions and Representation  302
  12.2 Minimum Spanning Tree: Prim's Algorithm  305
  12.3 Single-Source Shortest Paths: Dijkstra's Algorithm  310
  12.4 All-Pairs Shortest Paths  312
  12.5 Connected Components  328