Wikinotes

Maintainer: admin

1 Hardware
2 Atomicity and Independence
- 2.1 Definitions
- 2.2 What happens when processes interact?

There are lots of different kinds of concurrent hardware.

Our desktops or laptops most likely have more that 1 processor, maybe even 8 or more.
Servers might have dozens, or hundreds.

Basic Uniprocessor Schematic
Modern, pipelined and superscalar architecture (multiple functional units).

Can issue and execute more than one instruction at a time.
Low-level parallelism.

On this, we can do multithreaded programming thanks to context switching between threads. In effect, the threads are time-sliced.

Multiple threads can be useful for hiding latency. However, true parallelism is impossible with a basic uniprocessor.

Notice the cache: it is an important part of modern computer architecture due to the locality principle.

Locality Principle: If a program accesses x, odds are that it will access x and nearby data soon.

However, multithreading interferes with locality because the data accesses of a certain thread interleave with the others'.

There is now more than one processor, with some kind of shared memory; there is more than one way to set these processors up.

There are multiple separate processors, but they are all the same.

SMP Schematic

Separate physical chips.
Communication between them occurs over a bus.

Communication over a bus is somewhat slow. If the CPUs are on the same physical chip (i.e. multicore), we can get faster communication.

CMP Schematic

Easier to share a cache.
Fast communication between cores.
Still, there are disadvantages: lots of programs are single-threaded.

If we don't have 2 active threads, half of the chip is idle. This is not an efficient use of resources!

Not really a multiprocessor; it's really just a single CPU with special hardware support for fast context-switching.

However, n-way CMT/FMT have support for n threads in hardware.

CMT Schematic

The hardware can context-switch easily (e.g. for each $x$ cycles).
There is no sacrifice to single threaded performance.
It gives the illusion of multithreading.

There is fine grained switching on every cycle. It is also called "barrel processing".

Barrel Processing Schematic

It works like rotating a barrel where each slat represents a thread. This method exists because regular switching is easy to understand and manage.

Tera was an early multithreaded architecture for the Cray.

It followed an FMT design, but with no data cache. Notably, it took 70 cycles just to access memory.

One efficient way to avoid slowdowns was to have at least 70 threads to keep to CPU busy during the waits. With this many threads, access can start on the cycle of a certain thread, and once the context switching arrives to this thread again, the access has completed.

It gave a good overall throughput, but the performance of each individual thread was not very high.

Susan Eggers et al. in 1994-95

Supports single threaded and multithreaded programs.
Allows true parallelism while also avoiding idleness.

SMT

There is a soft boundary between CPUs: If CPU₀ is not using its resources (i.e. its respective functional units), then CPU₁ can use them, and vice-versa.
In the case of single threaded programs, they see one big CPU with twice the resources.
Yet, multithreaded programs can run in parallel.

Theoretically, SMT (commercialized by Intel as Hyper-threading) works very well.

However, early devices hiccups meant an hyper-threaded processor wasn't as good as 2 processors.
It was also hardly even better than a single CPU.
It is much improved now, but it is still not quite as good as two separate cores (maybe 1.5x).

Suppose we have a CPU device with:

2 ALUs (Arithmetic Logic Units)
2 FPUs (Floating-Point Units)
1 LSU (Load-Store Unit)

It can issue 2 instructions per cycle.

Integer and floating-point operations take 1 cycle.
Load and store operations take 2 cycles.

Three threads are running on the CPU with the following program:

Thread 1	Thread 2	Thread 3
1. Add	1. Load	1. Add_FP
2. Load	2. Add	2. Add_FP
3. Add	3. Add	3. Add_FP

Where T1.3 depends on T1.2 and T2.2 depends on T2.1.

There are 2 instruction per cycle and all threads are in order.

Threads	Slot 1	Slot 2
Thread 1	1. Add	2. Load
Thread 1	-	-
Thread 1	3. Add	-
Thread 2	1. Load	-
Thread 2	-	-
Thread 2	2. Add	3. Add
Thread 3	1. Add_FP	2. Add_FP
Thread 3	3. Add_FP	-

There were 8 cycles in total, but 7 slots were wasted.

Processor 0 Slot 0	Processor 0 Slot 1	Processor 1 Slot 0	Processor 1 Slot 1
T1: 1. Add	T1: 2. Load	T2: 1. Load	-
-	-	-	-
T1: 3. Add	-	T2: 2. Add	T2: 3. Add
T3: 1. Add_FP	T3: 2. Add_FP	-	-
T3: 3. Add_FP	-	-	-

There were 5 cycles in total, but 11 slots were wasted. It is an improvement on the number of cycles, but it is also more wasteful.

Threads	Slot 1	Slot 2
Thread 1	1. Add	2. Load
Thread 2	1. Load	-
Thread 3	1. Add_FP	2. Add_FP
Thread 1	3. Add	-
Thread 2	2. Add	3. Add
Thread 3	3. Add_FP	-

There were 6 cycles in total, but only 3 slots wasted.

Processor 0 Slot 0	Processor 0 Slot 1	Processor 1 Slot 0	Processor 1 Slot 1
T1: 1. Add	T1: 2. Load	T2: 1. Load	-
-	-	-	-
T1: 3. Add	-	T2: 2. Add	T2: 3. Add
T3: 1. FPAdd	T3: 2. FPAdd	T3: 2. FPAdd	-

There were 4 cycles in total, and 7 slots wasted. SMT's ability to share the CPUs' resources for a single thread was only useful once (in the last line, with T3).

In concurrency, we are interested in the relations between one or more processors/threads. Still, it is possible to have 2 completely independent threads, but in such a case, concurrency is not very exciting.

Read-set: The read-set of a program/process/thread is the set of variables it reads, but does not write.
Write-set: The write-set of a program/process/thread is the set of variables it writes (and may read).
Independent threads: 2 processes/threads are independent if the write set of each is disjoint with the read and write set of the other.; As such, 2 processes/threads are not independent if one writes data the other reads and/or also writes.

For example, consider the line of code x = y = z = 0;.

After which Thread₁ executes x = y + z; and Thread₂ executes y = 1; z = 2;.

What is the final value of x? It depends on the order in which threads execute.

Scenario 1	Scenario 2	Scenario 3
T1: `x = y + z;`	T2: `y = 1;`	T2: `y = 1;`
T2: `y = 1;`	T2: `z = 2;`	T1: `x = y + z;`
T2: `z = 2;`	T1: `x = y + z;`	T2: `z = 2;`
Value of `x`: 0	Value of `x`: 3	Value of `x`: 1

Note: To interleave statements, we assume they either execute, or not. They are atomic.

Atomic: Indivisible. No partial state of the action is externally visible.

In fact, statements like x = y + z; are not usually atomic. They might look like a single line of code, but in fact, they decompose to a sequence of machine language instructions:

Load     r1, [y]     # Put the value of y in Register 1
Load     r2, [z]     # Put the value of z in Register 2
Add      r3, r1, r2  # Add the values of Registers 1 and 2, and put the result in Register 3
Store    r3, [x]     # Store the value of Register 3 in x

Interleaving can occur between any machine instruction.

Scenario 4	Values
T1 loads `y` into `R1`	`R1 = 0`
T2 executes its instructions	`y = 1`, `z = 2`
T1 loads `z` into `R2`	`R2 = 2`
T1 adds `R1` + `R2` into `R3`	`R3 = 2`
T1 stores `R3` into `x`	`x = 2`

Wednesday, September 11, 2013
Concurrent Hardware and Atomicity & Independence

1Hardware¶

1.1Basic Uniprocessor¶

1.1.1Cache¶

1.2Multiprocessors¶

1.2.1SMP: Symmetric Multiprocessors¶

1.2.2CMP: On-Chip Multiprocessor¶

1.2.3CMT/FMT: Coarse and fine-grained multithreading¶

1.2.3.1CMT¶

1.2.3.2FMT¶

1.2.3.3The example of Tera¶

1.2.4SMT: Simultaneous Multithreading¶

1.3Practical example¶

1.3.1Uniprocessor¶

1.3.2SMP/CMP (2 processors)¶

1.3.3FMT¶

1.3.4SMT¶

2Atomicity and Independence¶

2.1Definitions¶

2.2What happens when processes interact?¶

Wednesday, September 11, 2013 Concurrent Hardware and Atomicity & Independence