Wednesday, November 6, 2013 CC-BY-NC
Elimination stack, General approach to a lock-free data structure and OpenMP

Maintainer: admin

1 Continuation of linked list


Harris -> lazy solution -> mark nodes that are being deleted before physically deleting them (in Java: AtomicMarkableReference, which packs a mark bit together with a reference)

tryAdd(Node n, Node prev)
  n.next = prev.next
  return CAS(prev.next, n.next, false, n, false)  // (n.next, false) = expected (reference, mark)
                                                  // (n, false) = new (reference, mark)

tryRemove(Node n, Node prev)
  Node succ = n.next
  if (CAS(n.next, succ, false, succ, true))       // logical delete: set the mark bit on n's next pointer
    CAS(prev.next, n, false, succ, false)         // physical unlink; don't care if this succeeds
    return true
  return false
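
A minimal C11 sketch of how the five-argument CAS above could be realized, assuming nodes are at least 2-byte aligned so the low bit of a pointer is free to hold the mark (this packaging is what Java's AtomicMarkableReference provides; the names pack, ref_of, mark_of, and cas_mark are illustrative):

#include <stdatomic.h>
#include <stdint.h>

typedef struct Node Node;
struct Node {
    int data;
    _Atomic uintptr_t next;   // successor pointer with the mark in its low bit
};

static uintptr_t pack(Node *ref, int marked) {
    return (uintptr_t)ref | (marked ? 1u : 0u);
}
static Node *ref_of(uintptr_t v)  { return (Node *)(v & ~(uintptr_t)1); }
static int   mark_of(uintptr_t v) { return (int)(v & 1); }

// CAS(loc, expRef, expMark, newRef, newMark) from the pseudocode above:
// the reference and the mark are compared and swapped as one atomic word.
static int cas_mark(_Atomic uintptr_t *loc, Node *expRef, int expMark,
                    Node *newRef, int newMark) {
    uintptr_t expected = pack(expRef, expMark);
    return atomic_compare_exchange_strong(loc, &expected,
                                          pack(newRef, newMark));
}

Because the mark lives in the same word as the reference, marking a node and redirecting its next pointer cannot be interleaved, which is what the scenarios below rely on.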

P1: T0 tries to remove x while T1 tries to insert w between x and y:
H -> x -> y -> z -> T

T0
1) CAS(x.next, y, false, y, true)   // mark x as deleted
2) CAS(H.next, x, false, y, false)  // unlink x

T1
3) CAS(x.next, y, false, w, false)  // splice w in after x

If 1 happens before 3, x's next pointer is marked, so 3 fails and T1 retries the insert.
If 3 happens before 1, x.next is now w, so 1 fails and T0 retries the remove. Either way the list stays consistent.

P2: T0 removes x while T1 removes y:
H -> x -> y -> z -> T

T0
1) CAS(x.next, y, false, y, true)   // mark x as deleted
2) CAS(H.next, x, false, y, false)  // unlink x

T1
3) CAS(y.next, z, false, z, true)   // mark y as deleted
4) CAS(x.next, y, false, z, false)  // unlink y

If 4 happens before 1: x.next = z, so 1 fails and T0 retries.
If 1 happens before 4: 4 fails because x's next pointer is marked. That is fine; y is already logically deleted and will be physically removed later.

Something still has to clean up and physically delete marked nodes; the find() traversal does this as a side effect:

find(int data)
  restart: while (true)
    pred = H
    cur = pred.next
    while (cur != T)
      succ = cur.next
      while (cur.marked)                               // cur is logically deleted
        if (!CAS(pred.next, cur, false, succ, false))  // try to physically unlink it
          continue restart                             // pred changed underneath us; start over
        cur = succ
        succ = cur.next
      if (cur.data == data)
        return cur
      pred = cur
      cur = succ
    return null

2 OpenMP

OpenMP is typically regarded as a "language" for concurrent programming, although it is not really a separate language. Rather, it is a set of directives that sits on top of another language.

While the base language could theoretically be arbitrary, in practice OpenMP works with C/C++ and Fortran. There is also a version of OpenMP for Java, but since Java already has built-in concurrency, it is not very useful.

In C/C++, OpenMP works by inserting pragma directives. Since a compiler simply ignores pragmas it does not recognize, compiling such code with a non-OpenMP compiler normally yields an ordinary sequential program. On an OpenMP compiler, the pragmas are honored and tell the compiler how certain segments of code are to be executed in parallel.
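
For example, a minimal complete program (compiler invocations are illustrative):

#include <stdio.h>

int main(void) {
    // With an OpenMP compiler (e.g. gcc -fopenmp hello.c) a whole thread
    // team executes the printf; with a plain compiler the unknown pragma
    // is ignored and the program prints once, sequentially.
    #pragma omp parallel
    printf("Hello!\n");
    return 0;
}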

2.1 Structured Parallelism

OpenMP allows us to achieve structured parallelism. That is, starting with a for loop, the iterations are executed in parallel and joined at the end, following a fork/join model where the beginning and end of the parallel region are well defined.

It is not very flexible, but it is an easy way to parallelize programs.

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    ...
}
// A thread team is created and the iterations of the loop are divided among
// its threads. The pragma must directly precede the for statement.

Other times, we might want all the threads to do the same instruction(s).

#pragma omp parallel
{
    printf("Hello!\n");
}
// The team of threads all execute the above statement

The number of threads can be chosen by the system, but it may also be set for the whole program at the very beginning (or through the OMP_NUM_THREADS environment variable), or specified on a per-construct basis.
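
A sketch of these options (thread counts are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);              // program-wide request

    #pragma omp parallel                 // uses the setting above
    {
        #pragma omp single
        printf("team of %d threads\n", omp_get_num_threads());
    }

    #pragma omp parallel num_threads(2)  // per-construct override
    {
        #pragma omp single
        printf("team of %d threads\n", omp_get_num_threads());
    }

    // With neither, the OMP_NUM_THREADS environment variable (or the
    // system default) decides.
    return 0;
}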

2.2 Nested Parallelism

We can also have nested parallelism, that is, a parallel region inside a parallel region, in which case the total number of threads executing may not be obvious.
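
A sketch, using the classic omp_set_nested call to enable nesting (thread counts are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);                      // allow nested parallel regions

    #pragma omp parallel num_threads(2)     // outer team of 2
    {
        #pragma omp parallel num_threads(3) // each outer thread forks an inner team of 3
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    // Up to 2 * 3 = 6 threads execute the inner printf.
    return 0;
}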

2.3 Parallel sections

Separate code sections, each executed by a different thread, can be defined inside a sections construct.

#pragma omp parallel sections
{
    #pragma omp section
    {
        // Executed by one thread
    }

    #pragma omp section
    {
        // Executed by another
    }
}

This is somewhat like unstructured parallelism, although a bit awkward. The goal of OpenMP is not to give fine-grained control over concurrency, but rather to make it easy to achieve.

Each thread team has a master: one privileged thread (thread 0) which can be given specialized work. The following directives appear inside a parallel region.

#pragma omp master
{
    // Only the master thread executes this.
}

#pragma omp single
{
    // Only one thread executes this
}

A task model can also be used.
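
A minimal sketch of tasks, using the usual recursive Fibonacci illustration:

#include <stdio.h>

// Each recursive call becomes a task; taskwait joins the two children.
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main(void) {
    long r;
    #pragma omp parallel
    #pragma omp single        // one thread creates the root of the task tree
    r = fib(20);
    printf("%ld\n", r);
    return 0;
}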

2.4 Data Model

Out of OpenMP, we get a pretty neat data model. Shared data (heap/static) is the default, but we can also declare data to be private.

#pragma omp parallel private(x, y, z)

With this, each thread has its own copy (even if x, y, z have a non-local scope) and ends up using that copy. However, how is it initialized? There are several initialization models.

firstprivate
Each thread's copy is initialized with the value the variable had in the parent scope (with plain private, a default, indeterminate value is used).

What happens to the private data once the thread is finished? We could have it disappear, but why lose the work?

lastprivate
The thread writes its value back to the parent scope.

With multiple threads, the sequentially last one (for a loop, the thread that executes the last iteration) is the one that succeeds. However, a reduction operator can also be used: reduction(op : list).

Maybe 10 threads are used to compute some value, but we are interested in the sum. In this case, the operator is + and the per-thread values of the listed variables are summed. Each private copy starts at the operator's identity value: 0 for +, 1 for *.
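
For example, summing an array with reduction(+ : sum):

#include <stdio.h>

int main(void) {
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) a[i] = 1.0;

    // Each thread sums into its own private copy (initialized to 0);
    // the copies are combined with + when the loop ends.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("%f\n", sum);   // 1000.000000
    return 0;
}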

2.5 Synchronization

2.5.1 Master/Single

With master, only the master thread executes the block; with single, exactly one (arbitrary) thread executes it. Either is a simple way to run a piece of code once.

2.5.2 Critical

All the threads execute the critical section, but only one at a time.
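
For example:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;
    #pragma omp parallel
    {
        int local = omp_get_thread_num() + 1;  // stand-in for real per-thread work
        #pragma omp critical                   // one thread at a time updates the total
        total += local;
    }
    printf("%d\n", total);
    return 0;
}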

2.5.3 Barriers

All the threads in the team must arrive at that point before any of them can continue.
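
A sketch of the typical use, separating two phases of a computation (phase_one and phase_two are hypothetical):

#pragma omp parallel
{
    phase_one();         // hypothetical first phase
    #pragma omp barrier  // no thread starts phase two until all have finished phase one
    phase_two();         // hypothetical second phase
}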

2.5.4 Atomic

Atomic read:     ... = x;
Atomic write:    x = ...;
Atomic update:   x++;
Atomic capture:  ... = x++;   // kind of like fetch-and-add (FA)

N.B.: variables are not atomic by default; the above directives must be used.
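
In C, for example (x and v are illustrative):

int x = 0, v;

#pragma omp atomic read
v = x;

#pragma omp atomic write
x = 10;

#pragma omp atomic update
x++;

#pragma omp atomic capture
v = x++;   // read the old value and update, atomically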

2.5.5 Flush

#pragma omp flush(a, b, c) synchronizes a, b, and c with main memory. This ends up being the main mechanism for actually making values consistent between different threads.
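
A classic sketch of flush in a busy-waiting producer/consumer handshake (thread roles are fixed for illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {  // producer
            data = 42;
            #pragma omp flush(data)       // publish data before the flag
            flag = 1;
            #pragma omp flush(flag)
        } else {                          // consumer
            int ready;
            do {
                #pragma omp flush(flag)   // re-read flag from memory
                ready = flag;
            } while (!ready);
            #pragma omp flush(data)       // data is now guaranteed visible
            printf("%d\n", data);
        }
    }
    return 0;
}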