Wednesday, October 30, 2013 CC-BY-NC
Continuation of JMM, C/C++ Memory Model and Lock-free


1 C/C++ Memory Model

In languages like C and C++, which use PThreads for concurrency, what is the memory model? Historically, it was not specified; we could only get what the hardware provided.

PThreads used to be orthogonal in semantics, totally unconnected from the semantics of the language itself. Fortunately, inspired by the Java Memory Model approach, people came up with the C++ Memory Model (standardized in C++11 and C11), a simplified model which takes only the important bits of the JMM.

Correctly synchronized programs have sequential consistency (SC) semantics. Incorrectly synchronized programs have no semantics: in theory, any data race leads to undefined behavior. In practice, what happens depends on the optimizations a given compiler chooses to do.

1.1 DRF Model

Sequenced-before
A partial order, even within a thread (evaluation order is not fully specified). Kind of like happens-before (HB) in Java, but different, because C/C++'s semantics are not as tightly defined as Java's.

It is a partial order on memory actions (load/store, lock, unlock, atomic { load, store, read/modify/write }).

If it can be shown that there are no data races, then we have a correctly synchronized, sequentially consistent program. If not, the behavior is undefined. This may seem a bit draconian: why should a single data race cause completely undefined behavior? The following example illustrates how badly things can go wrong.

unsigned int i = x; // Assuming there is a data race on x
if (i < 2) {
    foo... // Lots of code, no effect on i
    switch(i) {
        case 0: ...
        case 1: ...
        default: ...
    }
}

The usual way for a compiler to implement the switch statement is as a jump table. The table would have two entries (one for 0, the other for 1) which contain the address of where the specific instructions are defined. Then, the program would jump to table[i].

Now, i starts out in a register (it was just defined). In foo, which is large, all the registers are likely to end up being used, which would normally force the compiler to spill i to memory and reload it afterwards. However, as far as the compiler can tell, x does not change between the time i is defined and the time it is used again. So, instead of storing and reloading i, it is more efficient to simply load x again once i is truly needed. In a single-threaded program this is totally fine.

However, if some other thread changes the value of x, loading x instead of i gives rise to unexpected behavior. Since the loaded value may be very different from what was expected (0 or 1), the jump to table[i] can end up somewhere arbitrary in memory.

That is part of the reason why any data race can cause totally unexpected behavior.

2 Additional tools

If we really know what we are doing, we can use things called fences or barriers. They are designed to make sure that certain writes become visible to other processors. These are very low-level assembly instructions which force pending writes out to memory, such as membar on SPARC and mfence on x86.

3 Lock-free

A lock-free data structure or algorithm avoids locking altogether, which incurs less overhead and allows faster, more concurrent access. We still need atomicity and a data-race-free (DRF) program. To achieve that, we typically spin in a retry loop built on low-level primitives like compare-and-swap (CAS).

A method is lock-free if it guarantees that infinitely often, some method call finishes in a finite number of steps.

It is basically a weaker form of wait-freedom (where every call finishes in a finite number of steps). Getting this right is tricky.

3.1 Concurrent stack example

To implement a concurrent stack, we can put locks around every access to the data structure, which will work. Instead, maybe we can put locks around smaller pieces, thus achieving more fine-grained concurrency. In doing that, however, we end up facing some interesting problems.

void push(Node n) {
    // try and push something onto the stack
    do {
        Node t = tos; // top of stack
        n.next = t;
    } while (!CAS(tos, t, n));
    // Spin lock until CAS works to make sure t is still the top of stack
    // and make n the new top of stack if it is
}

Node pop() {
    Node t, n;
    do {
        t = tos;
        if (t == null) return null; // empty stack
        n = t.next;
    } while (!CAS(tos, t, n));
    // Spin lock until CAS works to make sure t is still the top of stack
    // and make n the new top of stack if it is
    return t;
}

However, this code does not work (in C, at least).

3.2 Failure of the lock-free stack

To see why the lock-free stack from last lecture fails, let us consider the following scenario:

// Thread 0
pop(); // Tries to pop the top of the stack

// Thread 1
// Tries to delete the second element under the stack
pop();  // Pop the first
pop();  // Pop the second, delete that
push(); // Push back the first

Assume the current stack is [x, y, z]. The program can execute with the following interleaving:

Thread0 Thread1
Pop() { -
....t = tos; // x -
....n = t.next; // y -
- Pop() {
- ....t = tos; // x
- ....n = t.next; // y
- ....CAS(tos, x, y); // OK
- } // At this point, the stack is now [y, z]
- Pop() {
- ....t = tos; // y
- ....n = t.next; // z
- ....CAS(tos, y, z); // OK
- } // At this point, the stack is now [z]
- free(y); // Delete y
- Push(x) {
- ....t = tos; // z
- ....x.next = z;
- ....CAS(tos, z, x); // OK
- } // At this point, stack is now [x, z]
....CAS(tos, x, y); -

Going back to Thread0 after Thread1's operations makes the stack inconsistent. Thread0's CAS(tos, x, y) succeeds, since x is once again the top of stack, but y was freed in the meantime! The next access through tos touches unallocated memory, which leads to unexpected behavior.

3.3 The ABA Problem

The problem depicted above is the classic ABA problem, which shows up here for two reasons. Firstly, y was freed. The main reason, however, is that CAS is not as powerful as we might make it out to be: if a value goes from A to B and back to A again, CAS is completely oblivious to this. As far as CAS is concerned, the value is "still" A, but it has actually changed in between.

3.3.1 Load-link/store-conditional

Another primitive, called load-link/store-conditional (LL/SC), can be used instead of CAS to avoid this problem altogether. However, it is fairly expensive to implement and not all CPU architectures support it.

It is composed of two operations (PowerPC syntax presented here):

lwarx(x) {
    // Load from x, with a reservation on x
    temp = x;
    // x is reserved for this process
    return temp;
} // One big atomic operation

stwcx(x, y) {
    // If the reservation is still there, store can happen
    // Else, it should fail
    if (x is reserved for this process) {
        x = y;
        The reservation is cleared;
        return true;
    }
    return false;
}
3.3.1.1 Atomic increment example
redoit:                  # Label for looping
    lwarx     r1, 0, x;  # x is reserved
    addi      r1, r1, 1; # Put r1 + 1 in r1
    stwcx     r1, 0, x;  # works if x is still reserved
    bne       redoit     # jump back if stwcx returns false
3.3.1.2 Reservation cancellation

In order to work properly, a reservation has to be cancelled by many kinds of events. For example, if another thread writes to x, the reservation should be cancelled. If another LL/SC is done on x, the reservation should be cancelled as well.

Processors, to keep things efficient, cancel if:

  • any write to x's cache line occurs (x or close to it)
  • any context switch occurs
  • a hardware interrupt happens
  • another LL/SC happens, even on another variable

The above conditions may not affect x at all, but to keep things efficient, processors have to act in a very conservative way. As such, LL/SC may fail spuriously; it might get cancelled for no good reason. That also means that the further apart LL is from SC, the more likely it is to fail. If the gap is longer than a time slice, it will always fail (a context switch always intervenes).

3.3.1.3 Hardware support

LL/SC is found in hardware such as PowerPC, ARM and MIPS (possibly under different names). Intel does not support it, and Java does not expose it the way it exposes other primitives.

3.4 CAS fulfilling LL/SC's role

Now, we can solve the ABA problem with LL/SC, but in fact we do not necessarily need it. CAS can solve the problem as well, if we apply it to twice as many bits. Variables are treated as a <value, version> pair. Every update increments the version number, even when it writes the same value, and CAS checks both halves. Most architectures provide a double-wide CAS for exactly this purpose.

In Java, there is AtomicStampedReference, which is basically a <reference, stamp (version)> pair whose two components change atomically together. It is effectively a double-wide CAS inside Java.

One problem with counting versions in a variable is that the counter might wrap around after enough updates. In practice, 32 bits should be plenty, but in theory it means the problem is not entirely solved.

3.5 The ABA Problem and Java

In the textbook, there is a lock-free stack which goes as follows:

// Same code as we had, but without a loop
boolean tryPush(Node n) {
    Node oldTop = tos;
    n.next = oldTop;
    return CAS(tos, oldTop, n);
}

// Actual push has the loop
void push(T x) {
    Node n = new Node(x);
    while (true) {
        if (tryPush(n)) {
            return;
        } else {
            [...] // Something about a coefficient
        }
    }
}

Notice: this is basically the same as the push operation in our ABA example. The textbook assumes Java is being used, because in Java we do not get this form of the ABA problem: we cannot explicitly delete objects, and the garbage collector still sees a reference to y.

Thread0 Thread1
Grabs x -
- Pop x to put it back
- Pop y for deletion
- Push x back

As y still exists (Thread0 still holds a reference to it, so the garbage collector will not collect it), it becomes a "valid" top of stack. We end up with [y, z]. It is still not correct, but it does not break in the same way as in a C or C++ context.