# Monday, November 25, 2013 Transactional memory continued and Message passing

## 1Transactional programming¶

It shares idea with the concept of transactions in databases. Basic idea: that we have been using locks to try and enforce mutual-exclusion, but it's a low level process that is easy to mess up and hard to manage. Could we get a system that does locking for us?

// Language                                 // Translates to
atomic {                                    synchronized(global_lock) {
// Mutual exclusion with all threads        // Mutual exclusion with all threads
}                                           }


We can easily do the above, but the performance suffers. Transactional programming tries to achieve the same result, but differently. It does so with either a pessimistic or optimistic approach.

### 1.1Pessimistic approach (always going to work)¶

When translating the atomic section, a global lock can be allocated and be used to lock the section. This is not performant, as seen above.

Better, if we know what data is being accessed, it can be locked in a more fine-grained fashion:

atomic {                                                // For all x in A, x is locked
// Operations accessing the variables in the set A  // The operations are done
}                                                       // For all x in A, x is unlocked


This way, we end up with better performance because more than one thread can be in the critical section if they do not overlap their accesses.

Finding what data is being accessed is not always trivial to do:

atomic {
while (p.x != 0)
p = p.next;
}


Furthermore, it is required to order the data when locking it (remember the dining philosophers problem) to avoid deadlock.

Locking overhead tends to build up as well.

One observation is that the odd of needing locking is often low. This leads to the optimistic approach.

### 1.2Optimistic approach (usually going to work, but always right in the end)¶

Locking is not done at all. Instead, the atomic block is executed no matter what, but before committing it, conflicts are detected (where the complexity of this approach lies). If there is any, the execution is not committed but tried again.

#### 1.2.1Tentative execution¶

The atomic section is executed in isolation. That is, the writes are all buffered rather committed to main memory. Then, it is possible to check if any of the read variables were written by anyone else.

If not, there is no conflict, and so the buffer is committed.

### 1.3Nested transactions¶

atomic {
atomic {
// Is this useful?
}
}


According to the atomicity property, using the above is actually redundant, but there are many reasons why these sort of things should be allowed. For modularity for example:

void foo() {
atomic {
bar();
}
}

void bar() {
atomic {

}
}


#### 1.3.1Conditional synchronization¶

concurrency deals with atomicity and thread control (communication, coordinate)
atomicity: we have it.
thread control in tm: this is one area where transactional programming ends up being weaker, unless we do it properly.

With an atomic construct, it is impossible to to wait or notify like a monitor. However, some constructs allow to get a bit close to this idea.

atomic {
// beginning
if (...) {
retry // causes transaction to fail and restart at the beginning
// basically allows for a spin lock
}
}


Example: Producer/consumer

produce(data) {
atomic {
if (queue.full()) {
retry
}
}
}

consume() {
atomic {
if (queue.empty()) {
retry
}
data = queue.get()
}
}


This is closer to spinning than it is to waiting. However, we could have a compiler/runtime system that recognizes the boolean condition on retry and does a proper wait/notify under the cover; Rather than a single statement retry;, we can have retry(boolean); to get closer to the idea of wait/notify. With nested transactions, we get inner retries.

#### 1.3.2orElse¶

atomic {
// if this fails,
} orElse {
// execute that
} // the orElse is still atomic
// if orElse fails as a transaction, retry the whole thing


1) Conceptually simple: the programmer can just consider transactional memory as sequential code

void run() {
atomic {
if (c > 0)
i++;
}
}

atomic {
x = L1.pop();
L2.push();
}

2) better parallelism and performance
optimistic concurrency control
no lock/synchronization during transaction execution
3) less error-prone

#### 1.3.4Disadvantages to transactional based programming¶

transaction region cannot contain instructions with side-effects such as I/O
when an instruction does more than merely read or write memory. E.g. I/O, incurs interupts. Read/write to/from a port. Output to the screen.
Cannot rollback
transaction will rollback if buffering is overflown
in STM, transaction overhead is proportional to the number of memory accesses

#### 1.3.5What Transactional memory is suitable for¶

complex data structures
different transactions can work on different parts of the data structure
fine-grain locking is difficult
memory accesses with rare conflicts
most transactions commit
locks have higher contention and overhead

## 2Hardware transactional Memory¶

An atomic sequence of instructions for which either

• all instructions (commit) execute
• no instruction execute (rollback/abort) when the read/write data of thread is written by another

### 2.1Hardware implementation¶

Hardware transactional memory (HTM)
IBM Blue Gene/Q. Gives htm
Intel Haswell. Gives you machine code to code them (tsx).

These give 2 models of transactions:

• Restricted transactional memory (RTM): more flexible HTM implementation.
• Intel Transactional Synchronization Extensions (TSX): Hardware Lock Elision (HLE). Legacy compatible instructions. Good for transition.

### 2.2RTM¶

XBEGIN(offset)
// transaction
XEND


offset means the offset of the failure handler.

Inside the transaction you get two more instructions:
XBEGIN(offset)
XABORT // give up (and goto handle)
XTEST // check whether if in a transition
XEND

But you can't just put anything in the transition (well, you can, but that'll just make it abort and fail):
- cpuid, pause
- debug registers
- signals

Allowed:
- nesting (there's a max depth. beyond that the transition aborts). BTW the handler gets the error code (can act upon it).

The intel model doesn't give a progress guarantee. after a certain amount it'll ask you to stop and try something else (e.g. locking)

In bluegene/Q: has guarantee. if a transition fails a number of times, it'll acquire a lock and prevent all other transitions from moving til the current one is done (serialize everything)

on a conceptual level, hardware transaction isn't hard to do

          L1 cache
_____________
|           |
|______     |
-----|->   |     |
<-----|--   |     |
|xyzw_|_____|
^
need to detect conflict r/w, w/w


This model does have a problem: dealing with cache-line granularity
|x|y|z|w|
if someone reads x and someone writes y, conflict!
even though no real conflict (called false-sharing). simply because they're on the same cache line
false sharing isn't rly a big issue though
struct {
int x
int y
}
but if it ever happens, put a padding between int x and int y (fill the cache line with bullshit so that x and y get on different lines)

### 2.3HLE¶

XACQUIRE // lock-acquire
// whatever you're doing inside here, I'll pretend I'm doing it but not really. instead I'll (pretend) add the write to the transaction
// transaction
XRELEASE // lock-release


Idea: rather to actually lock (expensive), hardware pretend to lock. If someone else gets the lock, this fails and it'll now actually lock (aka, optimistically run without lock the first time). So this is a kind of transaction now
this is good for low-contention of course.

Idea: concurrent programming is hard, take advantage of otherwise idle cores.
Automatic parallelization
Normal sequential exec:

  __
|A|
| |
|_|__fork__> speculate that c will exec in parallel here correctly
|B|          |C|
| |          | |  isolated
|_|<_join___ |_|
|C|
| |
|_|


non-spec

            ____
|   <---> |   |
|         |mem|
|   <---> |___|
V


spec

___
|WB|<---|
|__|--->|
___     |
|RB|--->|
|__|    |
V


b starts at state 1
c starts at state 2
but when in parallel c actually starts at state 1

If b modifies x but c reads x before that, once we get to the join point we need to validate the state. Look at the read buffer and see if everything I read is the same as those in the current memory. If not, we give up. good thing that c is executed in complete isolation. Throw away thread C and execute as normally would do sequentially

Of course, ideally the read buffer matches main mem. in that case, just commit (aka send write buffer, written in isolation, to mem)

Soooo, when do we (auto) fork? static analysis, or some adaptive runtime heuristic

• loop-level speculation
for (...) {
fork
...
join
}


First iteration non-speculative, second iteration speculative.

• method level speculation
|
V
foo() -------> foo....
|
|
<---------------|
| bla
V


(that's normal exec)
now try foo() still non-speculative, try bla speculatively
to make this work better, try to guess the return value of foo (lol)

• null predictor (guess that the val is 0)
• last value predictor (same value as what we got last time. meaning we need to record the previous return value)
• stride predictor (diff between return value v1 and v2)
1, 2, 3, 4, 5 -> maybe 6's next
• hybrid predictor (combination)

#### 2.4.1Speculation model¶

Shat if more than one spec thread?

• out-of-order speculation: allow (only) non-speculative thread to fork multiple speculative threads. works well with method-level speculation. Join in reverse order that we speculated them in (thus the name)

• In-order speculation: join in the order the threads are created. A speculative thread forks one speculative thread