Parallel Programming Methodology
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
October 3, 2013
Outline
Parallel programming
Dependency graphs
Influence of overheads on programming shared- vs distributed-memory systems
Foster’s design methodology
Parallel Programming
Steps:
- Identify work that can be done in parallel
- Partition work and perhaps data among tasks
- Manage data access, communication and synchronization
Dependency Graphs
Programs can be modeled as directed graphs:
- Nodes: at the finest granularity level, these are instructions ⇒ to reduce complexity, a node may be an arbitrary sequence of statements
- Edges: data dependency constraints among instructions in the nodes
Data Dependency Graphs
Dependency Graphs
read(A, B);
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);

/* independent iterations: each one reads and writes only entry i */
for (i = 0; i < N_ENTRIES; i++)
    x[i] = compX(x[i], y[i], z[i]);

/* dependent iterations: iteration i needs x[i-1] from iteration i-1 */
for (i = 1; i < N_ENTRIES; i++) {
    x[i] = solveX(x[i-1]);
    z[i] = x[i] + y[i];
}
. . .
. . .
finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
Types of Parallelism
[Figure: three dependency graphs illustrating Data Parallelism, Functional Parallelism, and Pipeline Parallelism]
Overheads
- Task creation/finish
- Data transfer
- Communication (synchronization)
- Load balancing
Shared vs Distributed Memory Systems
Overheads very different depending on type of architecture!

                 Shared    Distributed
Start/Finish       H            N
Data               H            N
Load               =            =
Comm               N            H
Shared vs Distributed Memory Systems

Tasks
SM: more dynamic creation of tasks, hence these can be more fine-grained.
DM: typically all tasks are active until the end, hence more coarse-grained tasks are required.

Data
SM: data partition is not an issue when defining tasks; however, caution is needed when accessing shared data: avoid races using mutually exclusive regions.
DM: data partition is critical for the performance of the application.

In both SM and DM:
- minimize synchronization points
- be careful about load balancing
Shared Memory Systems

Typical diagram of a parallel application under shared memory:

[Figure: fork/join parallelism. Over time, the master thread repeatedly forks a team of other threads and later joins them.]
Shared Memory Systems

Application is typically a single program, with directives to handle parallelism (see the sketch below):
- fork / join
- parallel loops
- private vs shared variables
- critical sections
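A minimal sketch of these directives in C with OpenMP (OpenMP is one example of such a directive-based model; the next class covers it in detail): a parallel region forks a team of threads, a work-sharing loop splits the iterations, the shared/private clauses control variable scope, and a critical section protects the shared accumulator (in practice a reduction clause would be preferred).

#include <stdio.h>

#define N_ENTRIES 1000

int main(void) {
    double x[N_ENTRIES], sum = 0.0;
    int i;

    /* fork: a team of threads is created; the join happens at the end of the region */
    #pragma omp parallel shared(x, sum) private(i)
    {
        /* parallel loop: iterations are divided among the threads */
        #pragma omp for
        for (i = 0; i < N_ENTRIES; i++)
            x[i] = 2.0 * i;

        #pragma omp for
        for (i = 0; i < N_ENTRIES; i++) {
            /* critical section: one thread at a time updates the shared sum */
            #pragma omp critical
            sum += x[i];
        }
    }

    printf("sum = %f\n", sum);
    return 0;
}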
Distributed Memory Systems

Cannot use fine granularity!

Each processor gets assigned a (large) task:
- static scheduling: all tasks start at the beginning of the computation
- dynamic scheduling: tasks start as needed

Application is typically also a single program! ⇒ the identification number of each task indicates what its job is (see the sketch below).
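A minimal sketch of this single-program style, assuming MPI as the message-passing layer (the slides do not prescribe a particular library): every process runs the same executable, and its rank, i.e. its identification number, decides which part of the job it performs.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identification number of this task */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */

    if (rank == 0) {
        /* task 0 could, for example, read the input and distribute the work */
        printf("master: coordinating %d tasks\n", size);
    } else {
        /* the other tasks work on the partition selected by their rank */
        printf("worker %d: processing its share of the data\n", rank);
    }

    MPI_Finalize();
    return 0;
}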
Task / Channel Model

Parallel programming for distributed memory systems uses the task / channel model: a parallel computation is represented as a set of tasks that may interact with each other by sending messages through channels.

- Task: program + local memory + I/O ports
- Channel: message queue that connects one task's output port with another task's input port

All tasks start simultaneously, and the finishing time is determined by the time the last task stops its execution.
Messages in the Task / Channel Model

- ordering of data in the channel is maintained
- the receiving task blocks until a value is available at the receiver
- the sender never blocks, independently of previous messages not yet delivered

In the task / channel model, receiving is a synchronous operation and sending is an asynchronous operation (see the sketch below).
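A minimal sketch of these channel semantics, written here with POSIX threads for illustration only (the names channel_t, chan_send and chan_recv are made up, and a real task/channel implementation would pass messages between separate address spaces): the queue preserves FIFO order, chan_recv blocks until a value is available, and chan_send never blocks because the queue is unbounded.

#include <pthread.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;              /* FIFO queue: ordering is maintained */
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} channel_t;

void chan_init(channel_t *c) {
    c->head = c->tail = NULL;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
}

/* sending is asynchronous: the sender only appends to the unbounded queue */
void chan_send(channel_t *c, int value) {
    node_t *n = malloc(sizeof *n);
    n->value = value;
    n->next = NULL;
    pthread_mutex_lock(&c->lock);
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

/* receiving is synchronous: the receiver blocks until a value is available */
int chan_recv(channel_t *c) {
    pthread_mutex_lock(&c->lock);
    while (c->head == NULL)
        pthread_cond_wait(&c->not_empty, &c->lock);
    node_t *n = c->head;
    c->head = n->next;
    if (c->head == NULL) c->tail = NULL;
    pthread_mutex_unlock(&c->lock);
    int v = n->value;
    free(n);
    return v;
}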
Foster's Design Methodology

Development of scalable parallel algorithms by delaying machine-dependent decisions to later stages.

Four steps:
- partitioning
- communication
- agglomeration
- mapping
Foster's Design Methodology

[Figure: the problem is partitioned into primitive tasks, communication among the tasks is identified, the tasks are agglomerated into larger tasks, and the result is mapped onto processors]
Foster's Design Methodology: Partitioning

Partitioning: the process of dividing the computation and data into many small primitive tasks.

Strategies (no single universal recipe...):
- data decomposition
- functional decomposition
- recursive decomposition

Checklist:
- at least 10 × P primitive tasks, where P is the number of processors
- minimize redundant computations and redundant data storage
- primitive tasks are roughly the same size
- the number of tasks grows naturally with the problem size
Recursive Decomposition

Suitable for problems solvable using divide-and-conquer.

Steps (see the sketch below):
- decompose a problem into a set of sub-problems
- recursively decompose each sub-problem
- stop decomposition when the minimum desired granularity is reached
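A minimal sketch of recursive decomposition, written here with OpenMP tasks for illustration (one possible realization, not prescribed by the slides): summing an array is split into two sub-problems until a granularity cutoff is reached, and each sub-problem becomes a task.

#include <stdio.h>

#define CUTOFF 1024   /* minimum desired granularity */

/* recursively decompose the sum of a[lo..hi) into two sub-problems */
long sum(const long *a, int lo, int hi) {
    if (hi - lo <= CUTOFF) {            /* stop decomposition: solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)       /* each sub-problem is a task */
    left = sum(a, lo, mid);
    #pragma omp task shared(right)
    right = sum(a, mid, hi);
    #pragma omp taskwait                /* combine the partial results */
    return left + right;
}

int main(void) {
    enum { N = 1 << 16 };
    static long a[N];
    for (int i = 0; i < N; i++) a[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single                  /* one thread starts the recursion */
    total = sum(a, 0, N);

    printf("total = %ld\n", total);
    return 0;
}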
Data Decomposition

Appropriate data partitioning is critical to parallel performance.

Steps:
- identify the data on which computations are performed
- partition the data across the various tasks

Decomposition can be based on:
- input data
- output data
- input + output data
- intermediate data
Input Data Decomposition
- Applicable if each output is computed as a function of the input
- May be the only natural decomposition if the output is unknown: e.g., the problem of finding the minimum in a set, or other reductions
- Associate a task with each input data partition: the task performs the computation on its part of the data, and subsequent processing combines the partial results from earlier tasks (see the sketch below)
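A minimal sketch of input data decomposition for the minimum-of-a-set example, assuming OpenMP for illustration (the slides do not prescribe an implementation): each task scans only its partition of the input and records a partial minimum, and a later step combines the partial results. The constant MAX_TASKS is an assumption of this sketch.

#include <stdio.h>
#include <limits.h>
#include <omp.h>

#define N 100000
#define MAX_TASKS 64   /* assumed upper bound on the number of threads */

int main(void) {
    static int data[N];
    int partial[MAX_TASKS];              /* one partial result per task */
    int ntasks = 1;

    for (int i = 0; i < N; i++)
        data[i] = (i * 37) % 1000 + 1;   /* some input data */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        #pragma omp single
        ntasks = nt;

        /* this task's partition of the input data */
        int lo = (int)((long)N * id / nt);
        int hi = (int)((long)N * (id + 1) / nt);

        int local_min = INT_MAX;
        for (int i = lo; i < hi; i++)
            if (data[i] < local_min)
                local_min = data[i];
        partial[id] = local_min;
    }

    /* subsequent processing combines the partial results */
    int global_min = INT_MAX;
    for (int t = 0; t < ntasks; t++)
        if (partial[t] < global_min)
            global_min = partial[t];

    printf("minimum = %d\n", global_min);
    return 0;
}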
Output Data Decomposition
- Applicable if each element of the output can be computed independently: the algorithm is based on one-to-one or many-to-one functions
- Partition the output data across tasks
- Have each task perform the computation for its outputs

Example: matrix-vector multiplication (see the sketch below)
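A minimal sketch of output data decomposition for matrix-vector multiplication, again using OpenMP for illustration: the output vector y is partitioned across the tasks, and each task independently computes its own entries y[i] = sum over j of A[i][j] * x[j].

#include <stdio.h>

#define N 512

int main(void) {
    static double A[N][N], x[N], y[N];

    /* some input data: A is twice the identity, x is all ones */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 : 0.0;
    }

    /* output decomposition: the entries of y are partitioned among the
       tasks, and each task computes its own outputs independently */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }

    printf("y[0] = %.1f  y[%d] = %.1f\n", y[0], N - 1, y[N - 1]);
    return 0;
}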
Foster's Design Methodology: Communication

Communication: identification of the communication pattern among the primitive tasks.

- local communication: values shared by a small number of tasks ⇒ draw a channel from the producing task to the consumer tasks
- global communication: values are required by a significant number of tasks ⇒ while important, not useful to represent in the task/channel model

Checklist:
- communication is balanced among tasks
- each task communicates with a small number of tasks
- tasks can perform their communications concurrently
- tasks can perform their computations concurrently
Foster's Design Methodology: Agglomeration

Agglomeration: the process of grouping primitive tasks into larger tasks.

Strategies:
- group tasks that have high communication with each other
- group sender tasks, and group receiver tasks
- group tasks to allow re-use of sequential code

Checklist:
- locality has been maximized
- replicated computations take less time than the communications they replace
- the amount of replicated data is small enough to allow the algorithm to scale
- tasks are balanced in terms of computation and communication
- the number of tasks grows naturally with the problem size
- the number of tasks is as small as possible, yet at least as great as P
- the cost of modifications to sequential code is minimized
Foster's Design Methodology: Mapping

Mapping: the process of assigning tasks to processors.

Strategies:
- maximize processor utilization (average % of time processors are active) ⇒ even load distribution
- minimize interprocessor communication ⇒ map tasks with channels among them to the same processor; take the network topology into account
Review
Parallel programming
Dependency graphs
Influence of overheads on programming shared- vs distributed-memory systems
Foster’s design methodology
Next Class
OpenMP