1 Graham-Brent theorem (continued)
Two kinds of steps:
- complete steps: $\geq p$ tasks available
- incomplete steps: $< p$ tasks available
$T_p \leq T_1/p + T_\infty$
$T_1/p$: bounds the number of complete steps
$T_\infty$: bounds the number of incomplete steps
“with p processors, an application takes at most the sequential time divided by p plus the critical path time”
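As a quick sanity check with made-up numbers (not from the lecture), suppose $T_1 = 1024$, $T_\infty = 16$, and $p = 32$:

$$T_p \leq \frac{T_1}{p} + T_\infty = \frac{1024}{32} + 16 = 32 + 16 = 48$$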
1.1 Proof
The optimal p-processor time can't beat the work law ($T_p^* \geq T_1/p$) or the span law ($T_p^* \geq T_\infty$), so
$T_p^* \geq \max(T_1/p, T_\infty)$ (1)
and trivially
$T_1/p \leq \max(T_1/p, T_\infty)$ (2)
$T_\infty \leq \max(T_1/p, T_\infty)$ (3)
Starting from the theorem's bound:
$T_p \leq T_1/p + T_\infty$ (0)
$\quad \leq \max(T_1/p, T_\infty) + \max(T_1/p, T_\infty)$ (by (2) and (3))
$\quad = 2\max(T_1/p, T_\infty)$
$\quad \leq 2T_p^*$ (by (1))
So the greedy schedule takes no more than 2x the optimal time.
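Continuing the made-up numbers from above: (1) gives $T_p^* \geq \max(1024/32,\ 16) = 32$, and the greedy bound of 48 is indeed within the factor of two:

$$T_p \leq 48 \leq 2 \cdot 32 = 2\max(T_1/p, T_\infty) \leq 2T_p^*$$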
2 Work stealing
Have a task queue
If threads each have their own queue, they need a way to distribute the work among themselves
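A minimal sketch of this idea in C++ (the class name `WorkStealingPool`, the use of `std::function` tasks, and the per-queue mutexes are my own simplifications for illustration, not the Cilk++ runtime): each worker works off the bottom of its own deque and, when that is empty, steals from the top of another worker's deque.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical work-stealing pool: one deque of tasks per worker,
// each guarded by its own mutex.
class WorkStealingPool {
public:
    explicit WorkStealingPool(unsigned n) : queues_(n), mutexes_(n) {}

    // A worker pushes newly created tasks onto the bottom of its own deque.
    void push(unsigned worker, std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutexes_[worker]);
        queues_[worker].push_back(std::move(task));
    }

    // A worker pops from the bottom of its own deque (LIFO, good locality);
    // if that deque is empty, it steals from the top of another deque (FIFO).
    std::optional<std::function<void()>> pop_or_steal(unsigned worker) {
        {
            std::lock_guard<std::mutex> lock(mutexes_[worker]);
            if (!queues_[worker].empty()) {
                auto task = std::move(queues_[worker].back());
                queues_[worker].pop_back();
                return task;
            }
        }
        for (unsigned victim = 0; victim < queues_.size(); ++victim) {
            if (victim == worker) continue;
            std::lock_guard<std::mutex> lock(mutexes_[victim]);
            if (!queues_[victim].empty()) {
                auto task = std::move(queues_[victim].front());
                queues_[victim].pop_front();
                return task;   // stolen work
            }
        }
        return std::nullopt;   // no work anywhere right now
    }

private:
    std::vector<std::deque<std::function<void()>>> queues_;
    std::vector<std::mutex> mutexes_;
};
```

Real work-stealing schedulers (e.g., Cilk's THE protocol or Chase-Lev deques) avoid taking a lock on every operation; the point here is only the own-queue-first, steal-when-empty structure.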
2.1 Cilk++ language
A queue of tasks for each processor
(e.g., $T_1$ stole a task from $T_2$'s queue)
If a processor's own queue is empty, it pops from another processor's queue
If tasks live in stack frames (as in C), then some code needs to grow the stack and some doesn't
e.g.
```
main() {
  A()
}

A() {
  spawn B()
  spawn C()
  sync
}

B() {
  spawn D()
  spawn E()
  sync
  foo()  // regular function
}

C() {
  spawn F()
  spawn G()
  sync
}
```
This gives a cactus stack
In a concurrent program, more than one of these frames may be active at the same time.
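A rough sketch of why these frames form a tree rather than a linear stack (the `Frame` struct below is purely illustrative, not the actual Cilk++ runtime layout): after A spawns B and C, both children's frames are live at once and share A as their parent, and D/E/F/G extend different branches of the same tree.

```cpp
#include <string>

// Illustrative cactus-stack frame: each frame points at its parent, and
// several live children may share the same parent (hence a tree / "cactus").
struct Frame {
    std::string name;
    const Frame* parent;
};

int main() {
    Frame m{"main", nullptr};
    Frame a{"A", &m};
    // After "spawn B" and "spawn C" in A, both children can be live at once:
    Frame b{"B", &a};
    Frame c{"C", &a};
    // B's and C's own spawns extend different branches of the same tree.
    Frame d{"D", &b}, e{"E", &b};
    Frame f{"F", &c}, g{"G", &c};
    (void)d; (void)e; (void)f; (void)g;
    return 0;
}
```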