# Principles of Parallel Algorithm Design


Principles of Parallel Algorithm Design Alexandre David 1.2.05


Overview

Introduction to parallel algorithms:

• Tasks and decomposition.
• Processes and mapping.
• Processes vs. processors.

Decomposition techniques:

• Recursive decomposition.
• Exploratory decomposition.
• Hybrid decomposition.

27-02+03-03-2008, Alexandre David, MVP'08

Introduction 



Parallel algorithms have the added dimension of concurrency. Typical tasks:

• Identify concurrent work.
• Map it to processors.
• Distribute inputs, outputs, and other data.
• Manage shared resources.
• Synchronize the processors.

There are other courses specifically on concurrency. We won't treat the problems specific to concurrency, such as deadlocks, livelocks, and the theory of semaphores and synchronization. However, we will use these mechanisms and, when needed, apply techniques to avoid problems like deadlocks.

Decomposing Problems 

Decomposition into concurrent tasks:

• No unique solution.
• Tasks may have different sizes.
• Decomposition illustrated as a directed graph, the task dependency graph: nodes = tasks, edges = dependencies.

Many solutions are often possible, but few will yield good performance and be scalable. We have to consider the computational and storage resources needed to solve the problem.

Size of the tasks: the amount of work to do. It can be larger, smaller, or unknown; unknown sizes are common in search algorithms.

Dependency: all the results from incoming edges are required by the task at the current node.

We will not consider tools for automatic decomposition. They work fairly well only for highly structured programs or portions of programs.
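As an illustration of the task dependency graph, here is a minimal Python sketch. The graph `example_graph` and the function name `schedule_levels` are made up for this example. It repeatedly collects the tasks whose incoming results are all available; the tasks collected together do not depend on each other and could run concurrently.

```python
def schedule_levels(dependencies):
    """Group tasks into levels of mutually independent tasks.

    `dependencies` maps each task to the set of tasks whose results it needs
    (the incoming edges of its node in the task dependency graph).
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    levels = []
    while remaining:
        # A task is ready when all the results it depends on are available.
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle in task dependency graph")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


# Hypothetical graph: C needs the results of A and B, D needs the result of C.
example_graph = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
levels = schedule_levels(example_graph)
print(levels)  # [['A', 'B'], ['C'], ['D']]
```

Here A and B can run in parallel, while C and D have to wait for their inputs.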

Example: Matrix * Vector

Example: Database Query Processing

MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

The question is: How to decompose this into concurrent tasks? Different tasks may generate intermediate results that will be used by other tasks.


A Solution

Measure of concurrency? Nb. of processors? Optimal?

How much concurrency do we have here? How many processors to use? Is it optimal?
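One possible decomposition of the query can be sketched in Python as follows. The table `CARS` and its rows are invented for illustration, and the tasks are run one after another here; the point is only which tasks are independent and which consume intermediate results.

```python
# Hypothetical table of (id, model, year, color); the rows are made up.
CARS = [
    (1, "CIVIC", 2001, "GREEN"),
    (2, "CIVIC", 2001, "BLUE"),
    (3, "CIVIC", 2000, "WHITE"),
    (4, "ACCORD", 2001, "WHITE"),
    (5, "CIVIC", 2001, "WHITE"),
]


def select(predicate):
    """One task: scan the table and return the ids of the matching rows."""
    return {row[0] for row in CARS if predicate(row)}


# Four independent tasks: none needs the result of another.
civic = select(lambda r: r[1] == "CIVIC")
y2001 = select(lambda r: r[2] == 2001)
green = select(lambda r: r[3] == "GREEN")
white = select(lambda r: r[3] == "WHITE")

# Dependent tasks: each one combines intermediate results.
civic_2001 = civic & y2001
green_or_white = green | white
result = civic_2001 & green_or_white
print(sorted(result))  # [1, 5]
```

With this decomposition at most four tasks can run at the same time, and the final intersection has to wait for both intermediate tables.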


Another Solution

Better/worse?

Is it better or worse? Why?


Granularity 

Number and size of tasks.

Related: degree of concurrency (nb. of tasks executable in parallel).

• Maximal degree of concurrency.
• Average degree of concurrency.

• The previous matrix*vector example is fine-grained.
• The database example is coarse-grained.

Degree of concurrency: number of tasks that can be executed in parallel. The average degree of concurrency is the more useful measure. Assume that the tasks in the previous database examples all have the same granularity. What are their average degrees of concurrency? 7/3 = 2.33 and 7/4 = 1.75.

Common sense: a finer-grained decomposition exposes more concurrency, and using that concurrency to run more tasks in parallel increases performance. However, the nature of the problem itself limits how fine the granularity can get.

Coarser Matrix * Vector
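The effect of granularity on the matrix*vector example can be sketched as below. This is an illustrative Python version, not the course's implementation: the parameter `rows_per_task` and the sample data are assumptions, and Python threads only show the task structure (they do not give a real speed-up for this kind of computation).

```python
from concurrent.futures import ThreadPoolExecutor


def row_block_task(A, b, start, stop):
    """One task: compute y[start:stop] from rows start..stop-1 of A."""
    return [sum(a * x for a, x in zip(A[i], b)) for i in range(start, stop)]


def matvec(A, b, rows_per_task):
    """Return (y, number of tasks) for y = A * b with the given granularity."""
    n = len(A)
    blocks = [(s, min(s + rows_per_task, n)) for s in range(0, n, rows_per_task)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda blk: row_block_task(A, b, *blk), blocks)
        y = [value for part in parts for value in part]
    return y, len(blocks)


A = [[1, 2, 0, 0], [0, 1, 0, 3], [4, 0, 1, 0], [0, 0, 2, 1]]
b = [1, 2, 3, 4]
fine, n_fine = matvec(A, b, rows_per_task=1)      # one task per row
coarse, n_coarse = matvec(A, b, rows_per_task=2)  # one task per block of rows
print(fine, n_fine)      # [5, 14, 7, 10] 4
print(coarse, n_coarse)  # [5, 14, 7, 10] 2
```

Both decompositions give the same result; the coarser one has fewer, larger tasks and therefore a lower degree of concurrency.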


Granularity 







• Average degree of concurrency if we take into account varying amounts of work?
• Critical path = longest directed path between any start and finish nodes.
• Critical path length = sum of the weights of the nodes along this path.
• Average degree of concurrency = total amount of work / critical path length.

Weights on nodes denote the amount of work to be done on these nodes. Longest path → shortest time needed to execute in parallel.


Database Example

Critical path (3). Critical path length = 27. Av. deg. of concurrency = 63/27 = 2.33.

Critical path (4). Critical path length = 34. Av. deg. of concurrency = 64/34 = 1.88.
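These numbers can be checked with a short Python sketch. The graphs and node weights below are assumptions: they are chosen to be consistent with the totals quoted on the slide (63 and 64) and with the two decompositions of the database query, since the figures themselves are not reproduced here.

```python
def critical_path_length(weights, deps):
    """Length of the longest weighted path; deps[t] lists the tasks t depends on."""
    memo = {}

    def finish(task):
        # Earliest finish time of `task` given unlimited processors.
        if task not in memo:
            memo[task] = weights[task] + max(
                (finish(d) for d in deps[task]), default=0
            )
        return memo[task]

    return max(finish(t) for t in weights)


def average_concurrency(weights, deps):
    """Total amount of work divided by the critical path length."""
    return sum(weights.values()) / critical_path_length(weights, deps)


# Assumed graph for the first decomposition (critical path of 3 tasks).
deps_a = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [3, 4], 7: [5, 6]}
weights_a = {1: 10, 2: 10, 3: 10, 4: 10, 5: 6, 6: 9, 7: 8}

# Assumed graph for the second decomposition (critical path of 4 tasks).
deps_b = {1: [], 2: [], 3: [], 4: [], 5: [3, 4], 6: [2, 5], 7: [1, 6]}
weights_b = {1: 10, 2: 10, 3: 10, 4: 10, 5: 6, 6: 11, 7: 7}

print(critical_path_length(weights_a, deps_a))           # 27
print(round(average_concurrency(weights_a, deps_a), 2))  # 2.33
print(critical_path_length(weights_b, deps_b))           # 34
print(round(average_concurrency(weights_b, deps_b), 2))  # 1.88
```

Setting every weight to 1 in the same graphs gives critical path lengths of 3 and 4, i.e., the unweighted averages 7/3 and 7/4.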

Task interaction graph: nodes = tasks, edges = interactions, optional weights.

Another important factor is the interaction between tasks on different processors. Shared data implies synchronization protocols (mutual exclusion, etc.) to ensure consistency. Edges are generally undirected. When directed edges are used, they show the direction of the data flow (and the flow is unidirectional). A dependency between tasks implies an interaction between them.

Example: Sparse Matrix Multiplication


Sparse matrix: a significant number of its entries are zero, and the zeros do not conform to predefined patterns. Typically, we do not need to take the zeros into account. In the example, task i owns row i of A and element i of b. The amount of interaction depends on the work per task (i.e., granularity) and on the mapping of tasks to processors.
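A small Python sketch of how the interaction graph follows from the sparsity pattern, assuming (as in the notes) that task i owns row i of A and element i of b. The matrix below is invented for illustration.

```python
def interaction_graph(A):
    """Undirected edges (i, j) between tasks that must exchange data.

    Task i computes y[i] = sum over j of A[i][j] * b[j]. For every nonzero
    A[i][j] with j != i, it needs b[j], which is owned by task j.
    """
    n = len(A)
    edges = set()
    for i in range(n):
        for j in range(n):
            if i != j and A[i][j] != 0:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical sparse matrix.
A = [
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 4, 5, 0],
    [6, 0, 0, 7],
]
print(interaction_graph(A))  # [(0, 3), (1, 2)]
```

Task 1 has no off-diagonal nonzero in its own row, but it still interacts with task 2, which needs the element of b that task 1 owns.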

Processes and Mapping  





• Tasks run on processors.
• Process: processing agent executing the tasks. Not exactly like in your OS course.
• Mapping = assignment of tasks to processes.
• The API exposes processes; their binding to processors is not always controlled.

Good mapping?

Here we are not talking directly about the mapping to processors. A processor can execute two processes. Good mapping:
• Maximize concurrency by mapping independent tasks to different processes.
• Minimize interaction by mapping interacting tasks to the same process.
These goals can conflict; a good trade-off is the key to performance. Decomposition determines the degree of concurrency. Mapping determines how much of that concurrency is utilized and how efficiently.

Mapping Example


Notice that the mapping keeps one process from the previous stage because of dependency: We can avoid interaction by keeping the same process.
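The idea of keeping one process from the previous stage can be sketched with a simple greedy heuristic in Python. This is only an illustration under assumptions: the function `map_tasks` and the 7-task graph (a plausible shape for the first database decomposition) are not taken from the slides.

```python
def map_tasks(deps):
    """Map tasks to processes, stage by stage.

    Tasks of the same stage are independent and get different processes.
    A task reuses the process of one of its predecessors when that process
    is still free in the current stage, which avoids an interaction.
    """
    level = {}

    def depth(task):
        if task not in level:
            level[task] = 1 + max((depth(d) for d in deps[task]), default=-1)
        return level[task]

    for task in deps:
        depth(task)

    mapping = {}
    next_process = 0
    for lvl in range(max(level.values()) + 1):
        busy = set()
        for task in sorted(t for t in deps if level[t] == lvl):
            candidates = [mapping[d] for d in deps[task] if mapping[d] not in busy]
            free = [p for p in range(next_process) if p not in busy]
            if candidates:
                process = candidates[0]   # stay on a predecessor's process
            elif free:
                process = free[0]         # reuse an idle process
            else:
                process = next_process    # start a new process
                next_process += 1
            mapping[task] = process
            busy.add(process)
    return mapping


# Assumed dependency graph: deps[t] lists the tasks t depends on.
deps = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [3, 4], 7: [5, 6]}
print(map_tasks(deps))
# {1: 0, 2: 1, 3: 2, 4: 3, 5: 0, 6: 2, 7: 0}
```

Process 0 executes tasks 1, 5, and 7, so the results of tasks 1 and 5 never have to leave that process.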


Processes vs. Processors   



• Process = logical computing agent.
• Processor = hardware computational unit.
• In general there is a 1-1 correspondence, but this model gives a better abstraction.
• Useful for hardware supporting multiple programming paradigms.

Now the question remains: How do you decompose?

Example of hybrid hardware: a cluster of multiprocessor (MP) machines. Each node has shared memory and communicates with other nodes via MPI.

1. Decompose and map to processes for MPI.
2. Decompose again, this time in a way suitable for shared memory.

Decomposition Techniques 

• Recursive decomposition: divide-and-conquer.
• Data decomposition: large data structure.
• Exploratory decomposition: search algorithms (e.g., model-checker).
• Speculative decomposition: dependent choices in computations.

Recursive Decomposition 

Problem solvable by divide-and-conquer:

• Decompose into sub-problems (do it recursively).
• Combine the sub-solutions (do it recursively).

Concurrency: the sub-problems are solved in parallel.

A small problem is the start and the finish: only one process is active there.

Quicksort Example

[Figure: recursive partitioning of the array around successive pivots (5, 3, 9, 7, 10, 11).]

Recall the quicksort algorithm:
• Choose a pivot.
• Partition the array.
• Recursive calls.
• Combine results: nothing to do.
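A minimal Python sketch of the recursive decomposition of quicksort. The `depth` parameter (how many levels of recursion spawn a new thread) and the sample input are assumptions for illustration; Python threads show the task structure here rather than a real speed-up.

```python
import threading


def parallel_quicksort(data, depth=2):
    """Return a sorted copy of `data`.

    The two sub-problems created by each partition are independent tasks.
    Down to `depth` levels, one of them is handed to a new thread.
    """
    if len(data) <= 1:
        return list(data)
    pivot = data[0]
    smaller = [x for x in data[1:] if x < pivot]
    larger = [x for x in data[1:] if x >= pivot]
    if depth <= 0:
        # Deep in the recursion: solve both sub-problems sequentially.
        return parallel_quicksort(smaller, 0) + [pivot] + parallel_quicksort(larger, 0)

    results = {}

    def solve(key, part):
        results[key] = parallel_quicksort(part, depth - 1)

    worker = threading.Thread(target=solve, args=("smaller", smaller))
    worker.start()            # one sub-problem runs concurrently
    solve("larger", larger)   # the other one runs in the current thread
    worker.join()
    # Combine step: plain concatenation, no extra work.
    return results["smaller"] + [pivot] + results["larger"]


print(parallel_quicksort([5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

The first partition is done by a single thread before any concurrency is available.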

Minimal Number

Input: 4 9 1 7 8 11 2 12
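Finding the minimum can be written in divide-and-conquer form, as in the following Python sketch. The function name and the second return value (the number of combine levels, i.e., the length of the dependency chain) are additions made for illustration.

```python
def recursive_min(values):
    """Return (minimum, number of combine levels) for a non-empty list.

    The two halves are independent sub-problems that could be solved in
    parallel; only the final min() of each pair has to wait for both.
    """
    if len(values) == 1:
        return values[0], 0
    mid = len(values) // 2
    left_min, left_levels = recursive_min(values[:mid])
    right_min, right_levels = recursive_min(values[mid:])
    return min(left_min, right_min), 1 + max(left_levels, right_levels)


print(recursive_min([4, 9, 1, 7, 8, 11, 2, 12]))  # (1, 3)
```

For 8 inputs there are 7 comparisons in total but only 3 levels of them, so with enough processes the comparisons within one level can run at the same time.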

Data Decomposition 

2 steps:

• Partition the data.
• Induce a partition into tasks.

How to partition data?

• Partition output data: independent "sub-outputs".
• Partition input data: local computations, followed by combination.

Partitioning of input data is a bit similar to divide-and-conquer.

Matrix Multiplication

BTW: trivial with shared memory.

We can partition further for the tasks. Notice the dependency between tasks. What is the task dependency graph?
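A hedged Python sketch of output data partitioning for matrix multiplication: the result C is split into four blocks and each block is one task. It assumes square matrices with an even dimension; the function names are illustrative. The threads share A and B, in the spirit of the shared-memory remark on the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def output_block_task(A, B, rows, cols):
    """One task: compute the entries C[i][j] for i in rows and j in cols."""
    inner = range(len(B))
    return {
        (i, j): sum(A[i][k] * B[k][j] for k in inner)
        for i in rows
        for j in cols
    }


def block_matmul(A, B):
    """C = A * B with C partitioned into 4 blocks (n x n matrices, n even)."""
    n = len(A)
    half = n // 2
    halves = [range(0, half), range(half, n)]
    tasks = [(r, c) for r in halves for c in halves]  # C11, C12, C21, C22
    C = [[0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for block in pool.map(lambda rc: output_block_task(A, B, *rc), tasks):
            for (i, j), value in block.items():
                C[i][j] = value
    return C


print(block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each task writes a disjoint part of C, so the four tasks are independent; they only read the same input matrices.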


Intermediate Data Partitioning

Linear combination of the intermediate results.

Useful for our Model-checker.


Owner Compute Rule 

The process assigned to some data is responsible for all computations associated with it.

• Input data decomposition: all computations done on the (partitioned) input data are done by the process.
• Output data decomposition: all computations for the (partitioned) output data are done by the process.

Exploratory Decomposition

15-puzzle example

Suitable for search algorithms. Partition the search space into smaller parts and search them in parallel. We search for the solution with a tree-search technique.
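The 15-puzzle is too large for a short listing, so the following Python sketch uses a smaller stand-in problem that is an assumption of this example: find a subset of `items` whose sum is `target`. The search tree (take or skip each item) is partitioned by the first two decisions, each subtree is searched by its own task, and a shared event stops the other tasks once a solution is found.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def search_subtree(prefix, items, target, found):
    """Depth-first search below `prefix`, a tuple of 0/1 choices."""
    if found.is_set():
        return None  # another task already found a solution
    if len(prefix) == len(items):
        total = sum(x for x, take in zip(items, prefix) if take)
        if total == target:
            found.set()
            return prefix
        return None
    for choice in (0, 1):
        result = search_subtree(prefix + (choice,), items, target, found)
        if result is not None:
            return result
    return None


def exploratory_search(items, target):
    """Search the 4 subtrees below the first two choices in parallel."""
    found = threading.Event()
    prefixes = [(a, b) for a in (0, 1) for b in (0, 1)]
    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        futures = [
            pool.submit(search_subtree, p, items, target, found) for p in prefixes
        ]
        results = [f.result() for f in futures]
    return next((r for r in results if r is not None), None)


items = [3, 5, 7, 11, 13]
solution = exploratory_search(items, 21)
print(solution)  # one tuple of 0/1 choices whose selected items sum to 21
```

Which solution is returned can vary between runs, because it depends on which task reaches a solution first.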

Search

Happens in our Model-checker.


Performance Anomalies

Work depends on the order of the search!
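The anomaly can be made concrete with a simplified model in Python. The model is an assumption of this sketch: the search space consists of subtrees of known sizes, the serial search scans them one after another, and the parallel search gives one subtree to each task with all tasks advancing in lockstep until one of them reaches the solution.

```python
def serial_work(subtrees, solution):
    """Nodes visited by a search that scans the subtrees one after another.

    `solution` is (index of the subtree, 0-based position inside it).
    """
    tree, offset = solution
    return sum(subtrees[:tree]) + offset + 1


def parallel_work(subtrees, solution):
    """Total nodes visited by all tasks until one of them finds the solution."""
    tree, offset = solution
    steps = offset + 1
    return sum(min(size, steps) for size in subtrees)


m = 100
subtrees = [m, m, m, m]

# Solution at the very beginning of the third subtree:
print(serial_work(subtrees, (2, 0)), parallel_work(subtrees, (2, 0)))  # 201 4

# Solution at the very end of the first subtree:
print(serial_work(subtrees, (0, m - 1)), parallel_work(subtrees, (0, m - 1)))  # 100 400
```

In the first case the parallel search does far less work than the serial one; in the second case it does four times as much work for the same answer.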

Speculative Decomposition

Dependencies between tasks are not known a priori.

• How to identify independent tasks?
• Conservative approach: identify tasks that are guaranteed to be independent.
• Optimistic approach: schedule tasks even if we are not sure; may roll back later.

Not possible to identify independent tasks in advance. Conservative approaches may yield limited concurrency. Optimistic approach = speculative. Optimistic approach is similar to branch prediction algorithms in processors.


Speculative Decomposition Example

More aggregate work is done. The problem is to send inputs to the next stages speculatively. It could be that two different kinds of outputs are possible for A, in which case A could start C, D, E twice. Other approaches are possible that combine different techniques: hybrid decompositions.
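A minimal Python sketch of the optimistic approach, using a made-up computation: the stage that decides which branch is needed runs at the same time as both candidate branches, and the result of the branch that turns out not to be needed is thrown away. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decide(x):
    """Stage whose outcome determines which of the next stages is needed."""
    return x % 2 == 0


def branch_even(x):
    return x // 2


def branch_odd(x):
    return 3 * x + 1


def speculative_step(x):
    """Start both branches before the decision is known, then keep one."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        decision = pool.submit(decide, x)
        even = pool.submit(branch_even, x)  # speculative
        odd = pool.submit(branch_odd, x)    # speculative
        chosen, wasted = (even, odd) if decision.result() else (odd, even)
        wasted.cancel()  # only has an effect if the task has not started yet
        return chosen.result()


print(speculative_step(10))  # 5
print(speculative_step(7))   # 22
```

Three tasks are executed where a sequential version would execute two: this is the extra aggregate work paid for the shorter critical path.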