These are my notes for Week 1 of Parallel Programming in Java (Coursera, Rice University). Week 1 covers task-level parallelism using async/finish constructs and Java’s Fork/Join framework, and it introduces computation graphs and how to calculate speedup.
Task-level Parallelism (Week1)
Task Creation and Termination (Async, Finish)
finish {
    async S1; // asynchronously compute sum of the lower half of the array
    S2;       // compute sum of the upper half of the array in parallel with S1
}
S3; // combine the two partial sums after both S1 and S2 have finished
PS: A high-level, open-source Java 8 library is available for these constructs: PCDP (for Parallel, Concurrent, and Distributed Programming).
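Below is a minimal sketch of the two-way sum above using PCDP’s finish/async API. I’m assuming the static edu.rice.pcdp.PCDP.finish and edu.rice.pcdp.PCDP.async methods; check the PCDP docs for the exact signatures.

import static edu.rice.pcdp.PCDP.async;
import static edu.rice.pcdp.PCDP.finish;

public class TwoWaySum {
    static int sum1, sum2; // partial sums written by the two tasks

    public static void main(String[] args) {
        int[] X = {1, 2, 3, 4, 5, 6, 7, 8};
        finish(() -> {
            // S1: child task sums the lower half asynchronously
            async(() -> {
                for (int i = 0; i < X.length / 2; i++) sum1 += X[i];
            });
            // S2: parent task sums the upper half in parallel with S1
            for (int i = X.length / 2; i < X.length; i++) sum2 += X[i];
        });
        // S3: combining is safe only after the enclosing finish completes
        int sum = sum1 + sum2;
        System.out.println("sum = " + sum); // prints 36
    }
}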
Creating Tasks in Java’s Fork/Join Framework
// requires: import java.util.concurrent.RecursiveAction;
private static class ASum extends RecursiveAction {
    int[] A;    // input array
    int LO, HI; // subrange [LO, HI] to be summed
    int SUM;    // return value

    ASum(int[] A, int LO, int HI) {
        this.A = A; this.LO = LO; this.HI = HI;
    }

    @Override
    protected void compute() {
        SUM = 0;
        for (int i = LO; i <= HI; i++) SUM += A[i];
    } // compute()
}
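The compute() above sums its subrange sequentially. In the divide-and-conquer version, a large subrange is split in two, one half is forked as a child task, and the partial sums are combined after a join. Here is a runnable sketch under those assumptions (my own illustration; the THRESHOLD value and the main() driver are not course code):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ASumDemo {
    static final int THRESHOLD = 1_000; // assumed cutoff for switching to sequential work

    private static class ASum extends RecursiveAction {
        final int[] A;    // input array
        final int LO, HI; // subrange [LO, HI]
        int SUM;          // return value

        ASum(int[] A, int LO, int HI) { this.A = A; this.LO = LO; this.HI = HI; }

        @Override
        protected void compute() {
            if (HI - LO + 1 <= THRESHOLD) {
                for (int i = LO; i <= HI; i++) SUM += A[i]; // small subrange: sum sequentially
            } else {
                int mid = (LO + HI) / 2;
                ASum left = new ASum(A, LO, mid);
                ASum right = new ASum(A, mid + 1, HI);
                left.fork();     // like async: schedule the left half on the pool
                right.compute(); // keep the current thread busy with the right half
                left.join();     // like finish: wait for the forked child
                SUM = left.SUM + right.SUM;
            }
        }
    }

    public static void main(String[] args) {
        int[] A = new int[1_000_000];
        java.util.Arrays.fill(A, 1);
        ASum root = new ASum(A, 0, A.length - 1);
        ForkJoinPool.commonPool().invoke(root); // blocks until the whole task tree finishes
        System.out.println("sum = " + root.SUM); // prints 1000000
    }
}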
Resource: Fork/Join Tutorial
Computation Graphs, Work, Span, Ideal Parallelism
A CG consists of:
- A set of vertices or nodes, in which each node represents a step consisting of an arbitrary sequential computation.
- A set of directed edges that represent ordering constraints among steps. For fork–join programs, it is useful to partition the edges into three cases:
  - Continue edges that capture sequencing of steps within a task.
  - Fork edges that connect a fork operation to the first step of child tasks.
  - Join edges that connect the last step of a task to all join operations on that task.
CGs can also be used to reason about the ideal parallelism of a parallel program as follows:
Define WORK(G) to be the sum of the execution times of all nodes in CG G, and define SPAN(G) to be the length of a longest path in G, where a path’s length is the sum of the execution times of its nodes. The longest paths are known as critical paths, so SPAN also represents the critical path length (CPL) of G. The ideal parallelism of G is then the ratio WORK(G)/SPAN(G): the best speedup the program could achieve given an unbounded number of processors.
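Here is a minimal sketch (my own illustration, not course code) of computing WORK and SPAN for a CG given as an adjacency list, assuming nodes are numbered in topological order (which holds for fork–join CGs, since every edge points forward in execution order):

import java.util.Arrays;
import java.util.List;

public class CgMetrics {
    // WORK(G): sum of the execution times of all nodes.
    static long work(long[] time) {
        long sum = 0;
        for (long t : time) sum += t;
        return sum;
    }

    // SPAN(G): length of a longest (critical) path, computed in one pass
    // over the nodes in topological order; finish[i] accumulates the
    // longest path ending at node i.
    static long span(long[] time, List<List<Integer>> succ) {
        long[] finish = new long[time.length];
        long span = 0;
        for (int i = 0; i < time.length; i++) {
            finish[i] += time[i];
            span = Math.max(span, finish[i]);
            for (int j : succ.get(i))
                finish[j] = Math.max(finish[j], finish[i]);
        }
        return span;
    }

    public static void main(String[] args) {
        // Hypothetical 4-node CG for the finish/async sum example above:
        // node 0 (fork) -> 1 (S1) -> 3 (S3), and 0 -> 2 (S2) -> 3.
        long[] time = {1, 10, 10, 1};
        List<List<Integer>> succ = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3),
                Arrays.asList(3), Arrays.<Integer>asList());
        long w = work(time), s = span(time, succ);
        // prints WORK=22 SPAN=12 ideal parallelism=1.8333333333333333
        System.out.println("WORK=" + w + " SPAN=" + s
                + " ideal parallelism=" + (double) w / s);
    }
}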
Multiprocessor Scheduling, Parallel Speedup
We defined T_P as the execution time of a CG on P processors, and observed that T_∞ ≤ T_P ≤ T_1. We also saw examples for which there could be different values of T_P for different schedules.
We then defined the parallel speedup for a given schedule of a CG on P processors as Speedup(P) = T_1/T_P, and observed that Speedup(P) must be ≤ the number of processors P, and also ≤ the ideal parallelism, WORK(G)/SPAN(G).
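As a quick hypothetical example (my numbers, not from the lecture): if WORK(G) = T_1 = 100 and SPAN(G) = T_∞ = 10, the ideal parallelism is 100/10 = 10, so Speedup(P) ≤ min(P, 10). With P = 4 the speedup is at most 4, and no schedule on any number of processors can exceed a speedup of 10.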
Amdahl’s Law
If q ≤ 1 is the fraction of WORK in a parallel program that must be executed sequentially, then the best speedup that can be obtained for that program, for any number of processors P, is Speedup(P) ≤ 1/q.
Amdahl’s Law reminds us to watch out for sequential bottlenecks both when designing parallel algorithms and when implementing programs on real machines. As an example, if q = 10%, then Amdahl’s Law reminds us that the best possible speedup must be ≤ 10 (which equals 1/q), regardless of the number of processors available.
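As a small closing sketch (my own illustration, not course code): the finite-P form of Amdahl’s Law, Speedup(P) ≤ 1/(q + (1 − q)/P), combines both bounds and tends to 1/q as P grows.

public class SpeedupBounds {
    // Upper bound on speedup with P processors when a fraction q of the
    // WORK must run sequentially (standard finite-P form of Amdahl's Law).
    static double amdahlBound(double q, int p) {
        return 1.0 / (q + (1.0 - q) / p);
    }

    public static void main(String[] args) {
        double q = 0.10; // 10% sequential fraction, as in the example above
        for (int p : new int[]{1, 2, 4, 8, 16, 1024}) {
            System.out.printf("P=%4d  speedup bound=%.2f%n", p, amdahlBound(q, p));
        }
        // the bound climbs toward 1/q = 10 as P grows, but never exceeds it
    }
}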