Final review: Memory and I/O

1 Introduction to memory

  • many different processes, have to fit them all into memory
    • many processes - might need more memory than is available
    • overlapping addresses, e.g. multiple copies of the same program running at once
    • but at any given point in time, the registers/etc contain values for just one process
    • how to allow multiple processes to run at once?
    • answer: see next section

1.1 Virtual memory

  • program addresses (virtual addresses) are mapped to physical addresses by the OS (kernel)
  • benefits:
    • multiple processes can share one processor (and thus one physical memory)
    • separates the program address space and the physical address space
      • so processors don't need to be geared for specific amounts of memory (e.g. 2 GB or 4 GB or whatever)
      • adds a layer of abstraction

1.2 Physical memory

1.2.1 Units of memory

Some units in the world of memory and their meanings:

kilobyte:
1024 ($2^{10}$) or 1000 ($10^3$) bytes, depending on the definition; the former is sometimes written KiB, but that notation hasn't really caught on
megabyte:
1048576 ($2^{20}$) or 1000000 ($10^6$) bytes
gigabyte:
1073741824 ($2^{30}$) or 1000000000 ($10^9$) bytes
terabyte:
same idea ($2^{40}$ or $10^{12}$ bytes)

Note that since one byte = 8 bits, one megabyte (1 MB) = 8 megabits (8 Mb).

MIPS memory is $2^{32}$ bytes, which is 4 GiB ($4 \times 2^{30}$ bytes, or about 4.3 GB).

1.2.2 Disk memory

Some types of disk memory:

  • floppy disks, 1.44 MB, magnetic - not really used anymore
  • CDs, less than 1 GB, optical
  • DVDs, several GB, optical
  • hard disk drives, often several hundred GB, magnetic

Access time for disk memory is slow, as there is an actual spinning disk involved, and since data can only be read/written at the read/write head, it may take millions of clock cycles to access the desired location in memory. Clearly, relying solely on disk memory would be extremely inefficient; consequently, there is another solution - internal memory, which we'll discuss later.

1.2.3 Flash memory

This is another type of external memory. There are no moving parts, and access is much faster than with disk memory (also, lookup times are uniform). Both of these types of memory are non-volatile (i.e. turning off the power won't erase anything).

1.2.4 Random access memory

This is internal memory. The "random" part means that lookup time is constant (uniform), independent of the address. There are two types of RAM: SRAM (static) and DRAM (dynamic); the former is more expensive but faster, and hence it is used in smaller quantities for the cache. DRAM is used for main memory.

1.3 Memory hierarchy

  • cache (SRAM) on top, since it's the fastest, but also the smallest
  • main memory (DRAM) beneath, pretty fast, moderately large
  • HDD/SSD on the bottom, slower, huge

1.4 Page tables

  • virtual and physical address spaces partitioned into chunks, called pages
  • a specific byte is then identified by its page number together with its page offset (e.g. a 12-bit offset within the page)
  • so the program address (of 32 bits) needs to encode that information somehow
  • it could use the lower bits (0-11) for the page offset, and the upper bits (12-31) for the page number
  • now, this virtual page number needs to correspond to a physical page number
  • this virtual-to-physical page number mapping is stored in a page table
  • use the virtual page number as the index into the page table (which can be thought of as an array, with the value being the physical page number)
  • each page table entry also has a valid bit
  • indicates whether it's in RAM (1) or on the hard disk (0)
  • an example: 1 GB of (D)RAM, so there are $2^{30}$ bytes and thus 30-bit physical addresses
    • let's say each page is 4 KB in size (so $2^{12}$ bytes)
    • then we would need $2^{30}/2^{12} = 2^{18}$ pages total
    • so we would need 18 bits to specify the (physical) page number
    • and 12 to specify the offset
  • page table may contain other info as well - if that page is read/write protected, last access time, etc
  • to keep the page tables small, kernel uses a "clever" data structure like a hash table
  • each process has its own page table, with up to $2^{20}$ entries (one per virtual page, given the 20-bit virtual page number)
  • ways of organising the page tables:
    • store it in a part of main memory that is reserved for the kernel
    • by keeping them in main memory, they don't themselves need to be paged
    • so we keep them in a non-paged section of main memory, and use a fixed translation from virtual kernel address to physical kernel address
    • there is an alternative way of storing them using paged memory, which will be introduced later
  • each process has its own associated page table; the kernel also has a process ID table, which stores the address of the page table for every process
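
To make the lookup concrete, here is a minimal C sketch of the translation, assuming 32-bit virtual addresses, 4 KB pages, and the flat one-table-per-process layout above (the names pte_t and translate are just for illustration):

    #include <stdint.h>

    #define OFFSET_BITS 12               /* 4 KB pages */
    #define NUM_VPAGES  (1u << 20)       /* 2^20 virtual pages in a 32-bit space */

    typedef struct {
        uint32_t ppn;    /* physical page number */
        uint8_t  valid;  /* 1 = page is in RAM, 0 = on the hard disk */
    } pte_t;

    /* Translate a virtual address via the page table; on a page fault,
       *fault is set and the kernel's page fault handler would take over. */
    uint32_t translate(const pte_t table[], uint32_t vaddr, int *fault)
    {
        uint32_t vpn    = vaddr >> OFFSET_BITS;              /* upper 20 bits */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1); /* lower 12 bits */

        *fault = !table[vpn].valid;
        return *fault ? 0 : (table[vpn].ppn << OFFSET_BITS) | offset;
    }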

1.4.1 Page faults

A page fault occurs when a program tries to access something in memory but the valid bit of the page it is on is 0 (indicating that the desired page is on the hard disk and not in main memory). When this happens, the page is swapped back into main memory, often replacing another page in the process. This is handled by the page fault handler, a kernel program that functions as an exception handler and takes care of updating and managing the page tables.

  • for the TLB miss handler to get the translation, it needs to access the system bus
  • if the bus is being used, the kernel may just suspend the process

1.4.2 Page tables in paged memory

  • say we have a 4 MB page table ($2^{20}$ entries, 4 bytes each - this is a simplification and possibly an overestimation)
  • split the table itself up into $2^{10}$ pages, with $2^{12}$ bytes (4 KB) per page
  • we then must use another table (the "page table table" lol - conventionally called a page directory) with a 10-bit index to keep track of where these pages are
    • each line in this table stores the physical address of the relevant page table page
    • there must also be a 10-bit offset, used to index this chunk of the page table
    • and a 12-bit offset to index into the physical page, which gives the original byte you're looking for
  • theoretically, all pages with content in the page table should be in main memory (otherwise you would generate a page fault just to access the page table)
  • there is one page table per process
  • the kernel then needs to keep track of memory usage and page tables for each process, decide when to swap etc
  • LRU policy: kernel keeps a list sorted by when things were last accessed; those at the back of the list get swapped out eventually
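
As a sketch of how a virtual address is carved up under this two-level scheme (10-bit index into the "page table table", 10-bit index into the page-table page, 12-bit byte offset; the function name is just for illustration):

    #include <stdint.h>

    /* Split a 32-bit virtual address for the two-level scheme above. */
    void split_vaddr(uint32_t vaddr,
                     uint32_t *dir_index,    /* 10 bits: which page-table page */
                     uint32_t *table_index,  /* 10 bits: which entry within it */
                     uint32_t *offset)       /* 12 bits: byte within the page */
    {
        *dir_index   = vaddr >> 22;
        *table_index = (vaddr >> 12) & 0x3FFu;
        *offset      = vaddr & 0xFFFu;
    }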

2 Caching

When discussing datapaths, we assumed that reading from and writing to memory was nearly instantaneous, such that it could be done within a single clock cycle. We now know that that is not the case - not only does accessing an entry in main memory actually require two accesses (one to get its physical address, the other to get the entry), but this process can be very slow if the page is not in main memory.

The solution to this is to use a cache - a small but extremely quick memory (SRAM, as opposed to the DRAM used for main memory) that holds frequently accessed data and instructions. Such caches can be accessed within a single clock cycle. We now proceed to translation lookaside buffers, a very fundamental type of cache.

2.1 Translation lookaside buffers

All memory accesses require first looking up the translation of a virtual memory address to a physical memory address. Since this is so commonly done, we can speed up this process by implementing a cache for the translations themselves. Here's how it integrates with the previously mentioned datapaths:

  • instruction fetch (current instruction is read from memory): TLB hit to get the physical memory address, then instruction cache hit using that address to get the actual instruction (assuming both cache accesses are hits and not misses, this can be done in one clock cycle)
  • data memory access (MEM stage): TLB to get the physical memory address, data cache hit using that address to get the data
    • note that in a pipelined system, the TLB would have to be split into two parts - one for instructions, one for data
    • this would be done for the same reason that we have separate instruction and data caches (i.e. so that different pipeline stages can access them independently)

2.1.1 TLB organisation

How is the TLB organised? In fact, TLBs are organised in a manner similar to hash tables. Let's say there are $2^9$ entries in the TLB, and that page numbers are 20 bits. Now, all the TLB really needs to do is provide the physical page number associated with any virtual page number. To speed this up a bit, we separate the 20-bit page number into a 9-bit index and an 11-bit tag. If the tag in the TLB at a certain 9-bit index matches the tag we're looking for, then the translation we need is indeed in the cache; else, it's a cache miss.
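
A small C sketch of this direct-mapped lookup (the entry layout and names are illustrative; the process ID and valid bit columns are explained next):

    #include <stdint.h>

    #define TLB_ENTRIES (1u << 9)   /* 2^9 entries, indexed by 9 bits */

    typedef struct {
        uint32_t tag;   /* upper 11 bits of the virtual page number */
        uint32_t ppn;   /* physical page number */
        uint32_t pid;   /* ID of the process this translation belongs to */
        uint8_t  valid;
    } tlb_entry_t;

    /* Return 1 and set *ppn on a TLB hit for a 20-bit virtual page number. */
    int tlb_lookup(const tlb_entry_t tlb[], uint32_t vpn,
                   uint32_t cur_pid, uint32_t *ppn)
    {
        uint32_t index = vpn & (TLB_ENTRIES - 1);  /* lower 9 bits */
        uint32_t tag   = vpn >> 9;                 /* upper 11 bits */

        if (tlb[index].valid && tlb[index].pid == cur_pid
                             && tlb[index].tag == tag) {
            *ppn = tlb[index].ppn;
            return 1;   /* hit */
        }
        return 0;       /* miss: the TLB miss handler takes over */
    }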

We also must consider that the TLB may contain translations from different processes. The standard way of distinguishing between different processes' translations is by adding a "Process ID" column, and only proceeding with the cache lookup if the process ID of the current process (stored in a special register) matches the one in the cache.

Each entry in the TLB also contains a "valid bit" column, which is set to 1 when the process is running and 0 once it has stopped running.

To summarise, the following conditions must be met for a cache hit to occur:

  • The process IDs must match
  • The valid bit must be 1
  • The 11-bit tag must match the upper 11 bits of the page number

The physical page number stored in the TLB is then concatenated with the page offset, and this physical address is used to access the relevant entry in memory. This all happens in one clock cycle.

Note that the TLB can only contain translations for addresses whose pages are held in main memory, not the hard disk. If a page is not in main memory when its number is being looked up in the TLB, a page fault will occur and the newly acquired address of the page in main memory will be stored in the TLB.

(A diagram of the circuit may be useful here, if there is time.)

2.1.2 TLB misses

  • if a TLB miss occurs, program jumps to exception handler, and a special kernel program (the TLB miss handler) takes over
    • program checks the page table of the current process, to see if the desired word/byte/etc is in memory (i.e. valid bit of 1)
    • if yes, TLB refill, sets TLB valid bit to 1, return control to program
    • else, the page is on the hard disk and not in main memory - the page fault handler is called, which swaps the page in and updates the page tables
    • upon page swap completion, return to the miss handler, which copies the page table entry into the TLB; then, return to the program

2.2 Data and instruction caches - direct mapping

The data cache and the instruction cache differ in that while the former is read/write, the latter is read-only during runtime (programs don't normally modify their own instructions).

Let the cache size be 128 KB ($2^{17}$ bytes). There are several ways of organising the cache:

2.2.1 One byte per entry

  • same concept as the direct mapping approach used for the TLB
  • here the input is a physical memory address and the output is the byte we want to access (of data, or part of an instruction)
  • the columns would be:
    • the 17-bit index
    • the upper 13 bits of the physical address used as the tag
    • the byte of data
    • valid bit - 1 if there is indeed something stored in here, 0 if the contents are junk (left over from a previous process for example)
    • dirty bit - 1 if the byte has been written to since it was brought into the cache
  • no process ID field needed, as multiple processes can and should have access to the same byte (e.g. an instruction)
  • here's what happens:
    • the lower 17 bits of the 30-bit physical address are used as an index into the cache
    • the upper 13 bits of the physical address are compared with the 13-bit tag
    • if the above match, and the valid bit is on, then we take the byte in the cache
    • otherwise, a cache miss occurs and the cache entry needs to be refilled
    • more on the use of the dirty bit later
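
The index/tag split for this byte-per-entry design, as a quick C sketch (field widths taken from the example above; the function name is illustrative):

    #include <stdint.h>

    /* 128 KB direct-mapped cache, one byte per entry:
       30-bit physical address = 13-bit tag + 17-bit index. */
    void split_byte_cache_addr(uint32_t paddr, uint32_t *tag, uint32_t *index)
    {
        *index = paddr & ((1u << 17) - 1); /* lower 17 bits pick the cache line */
        *tag   = paddr >> 17;              /* upper 13 bits must match the stored tag */
    }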

2.2.2 One word per entry

  • another method: same as above, except instead of storing a byte in each cache entry, store a word (since words are used so often)
  • if the size of the cache is the same (128 KB), then we would have $2^{15}$ lines in the cache
    • so we need 15 bits to index
    • and since there are $2^{28}$ words of physical memory, we need a 13-bit tag
  • byte offset: lowest 2 bits of an address; for the four bytes in a word, they are 00, 01, 10, 11 (in that order)
    • according to the diagram, they are actually separate bits; this is kind of ambiguous in the lecture notes though

2.2.3 One block per entry

  • the above took advantage of spatial locality through the use of words
  • another way of using spatial locality is through recognising that instructions are typically executed in sequence (branches/jumps are the exception, and they're less common)
  • so when an instruction is copied into the cache, we can copy the neighbours as well
  • we can do the same for data - this is useful for arrays, for instance
  • to implement this, we can allow each cache entry to hold a block, that is, a certain number of words that are consecutive in memory
  • a 128 KB cache can hold up to $2^{13}$ 4-word blocks
    • so we need 13 bits to index
    • and since there are $2^{26}$ such blocks, we again need a 13-bit tag
  • we also need a byte offset, and a block offset (to specify the word within the block)
    • again, whether these are part of the 13-bit index or whether they are independent of it is ambiguous in the lecture notes
    • in any case, to read a word out of memory, we use the block offset as a selector for one of four words
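
Here is one way to resolve that ambiguity, treating the block offset and byte offset as fields below the 13-bit index (a sketch, not the exact diagram from the notes):

    #include <stdint.h>

    /* 30-bit physical address for a 128 KB cache of 4-word (16-byte) blocks:
       13-bit tag | 13-bit index | 2-bit block offset | 2-bit byte offset */
    void split_block_cache_addr(uint32_t paddr, uint32_t *tag, uint32_t *index,
                                uint32_t *block_off, uint32_t *byte_off)
    {
        *byte_off  = paddr & 0x3u;           /* byte within the word */
        *block_off = (paddr >> 2) & 0x3u;    /* selects one of the four words */
        *index     = (paddr >> 4) & 0x1FFFu; /* which cache line */
        *tag       = paddr >> 17;            /* compared against the stored tag */
    }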

2.2.4 Cache hits and misses

  • instruction cache is fairly simple, same concept as the TLB cache hit/miss system
  • for the data cache, since we can read and write to it, it can often happen that the contents of the data cache aren't the same as those in memory
  • so there are two policies for dealing with this - write through and write back
    • write through: ensures cache block is consistent with corresponding memory block
      • updates the cache to reflect the value in memory if there is a miss
      • when writing a word, write through then copies the value both to the cache AND to the location in memory
      • if there is a miss while writing, the word is copied from main memory to the cache, then the above process occurs
      • which seems kind of silly but whatever
    • write back policy: avoids copying the updated cache block back to memory unless absolutely necessary
      • the cache holds the most updated version
      • we only write it back to main memory when some other block wants to use this entry (i.e. there is a miss)
      • the dirty bit is used to keep track of whether the cache is more updated than the version in main memory or not
      • in this scheme, "write hit" is cheaper and "read miss" is more expensive, because you have to write back the value of the thing currently occupying the cache line
      • write-back is usually better for large caches
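
A minimal sketch of the write-back bookkeeping (the structures here are illustrative): a write hit touches only the cache and sets the dirty bit, and the block is copied back to memory only when the line is evicted.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES 16   /* 4-word blocks */

    typedef struct {
        uint32_t tag;
        uint8_t  valid, dirty;
        uint8_t  data[BLOCK_BYTES];
    } cache_line_t;

    /* Write hit under write-back: cheap, since only the cache is updated. */
    void store_byte(cache_line_t *line, uint32_t offset, uint8_t value)
    {
        line->data[offset] = value;
        line->dirty = 1;   /* cache now holds a newer version than memory */
    }

    /* On a miss, the victim line is flushed to memory first iff it is dirty -
       this is what makes a read miss more expensive under write-back. */
    void evict(cache_line_t *line, uint8_t *old_block_in_memory)
    {
        if (line->valid && line->dirty)
            memcpy(old_block_in_memory, line->data, BLOCK_BYTES);
        line->valid = line->dirty = 0;
    }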

2.3 Fully associative caches

  • the direct mapping scheme above doesn't give the best performance
  • one reason: many different physical addresses that can map to one cache line (collisions)
    • but, there can only be one such entry in the cache at any given time
    • you could end up with two instructions, both used very frequently, but they must keep replacing each other in the cache
    • results in many cache misses
  • fully associative cache: pretty much the exact opposite; get rid of the index, and the tag is the entire 26-bit block address (for a 4-word block)
  • to find a block, we search through the entire cache, looking for the line with the same tag
  • this allows blocks to be anywhere in the cache
  • downsides: need special circuits to check if the tag matches at each cache line
    • to fit this design on the chip, we need to use fewer blocks, and hence store fewer things in the cache
    • we also need a scheme for deciding which items to remove from the cache (the least recently used, etc)

2.4 N-way set associative cache

  • compromise between direct mapping and fully associative cache
  • the idea is to use N direct mapped caches, so that a block can be stored in any of the N
  • or, you can think of it as having (number of blocks in cache) / N direct mapped caches (sets) - we'll use this definition
  • for example, if N = 2, and our cache can hold $2^{13}$ blocks, then we can have $2^{12}$ sets, each of which can hold 2 blocks
  • we can index this cache by taking the block number mod the number of sets (see the sketch below)
  • so either the block we want can be in a specific place in any of the N direct maps, or anywhere within one specific set
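
The lookup, sketched in C for N = 2 (names are illustrative):

    #include <stdint.h>

    #define NUM_SETS (1u << 12)   /* 2^12 sets of N = 2 blocks each */
    #define N_WAYS   2

    typedef struct {
        uint32_t tag;
        uint8_t  valid;
    } way_t;

    /* A block may live in any of the N ways of its set. */
    int set_assoc_hit(const way_t cache[][N_WAYS], uint32_t block_number)
    {
        uint32_t set = block_number % NUM_SETS;  /* which set (the index) */
        uint32_t tag = block_number / NUM_SETS;  /* remaining upper bits */

        for (int way = 0; way < N_WAYS; way++)   /* search all N ways */
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return 1;   /* hit */
        return 0;           /* miss: any way of this set may be refilled */
    }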

3 I/O

3.1 System bus

  • way of connecting CPU, main memory, and I/O devices
  • bus: set of wires for transmitting information, shared by many devices
  • assume there is only one system bus, for simplicity (modern computers have multiple)
  • advantage: reduces the number of connections in computer
    • no need for each I/O device to have a direct connection to the CPU/main memory
  • disadvantages:
    • only one signal can be carried at a time
    • reduces performance, because any one component must wait until the bus is free before it can start using it
    • communication about who gets to use the bus takes time too
    • requires a slower clock speed than the CPU, as components on the bus are far away from each other, due to the speed of light
      • e.g. if the processor is 3 GHz, voltages travel at most 10 cm per clock pulse, which isn't large enough
      • also doesn't take into account the time needed to read/write from/to the bus, or for circuits to stabilise
  • components:
    • data bus
    • address bus
    • control bus
  • how it works, with lw as an example:
    • CPU puts physical address of new block onto address bus
    • CPU sets control signals (ReadMem = 1, WriteMem = 0) on the control bus
    • main memory puts the requested block onto the data bus (takes several bus cycles, as every word is sent separately)
    • CPU reads the data bus, block written into the data cache

3.2 I/O controllers

  • electronics, inside the computer case, responsible for read/write with system bus
  • might contain: clock, registers, ROM circuits, memory, PC, etc
  • the software equivalent is device drivers, which are part of the kernel and which reside in main memory or on disk
  • external devices (peripheral) - outside the case, don't plug directly into the system bus, but instead into a port (part of the I/O controller sitting on the motherboard?)
  • need a tri-state buffer for the system bus, as only one device can write to it at a time
  • input devices:
    • keyboard
      • one byte of data entered at a time
      • address of I/O device is stored in an IOdeviceID register
      • would put its byte of data on the data bus IF the ReadIO control signal = 1 AND the address of the I/O device matches the lower 7 bits of the address on the bus AND a key is pressed
    • mouse
      • converts mechanical action into packets of data which are decoded by the controller
      • packets written into buffers, which are then sent over bus
  • output devices:
    • printers
      • often use postscript
      • instructions interpreted, converted into raster
      • doesn't put anything on data bus, so doesn't need a tri-state buffer
      • to check when incoming data is meant for the printer: read from all the buses, only take data if the control signal is set and the address matches the printer address
    • monitor
      • rectangular array of pixels (picture elements)
      • each has numbers stored: red, green, and blue
      • each number represents the brightness of that colour, from 0 to 255 (0 to FF, or, 00000000 to 11111111 so one byte per colour)
      • for a monitor of 1024 by 768 pixels, at 3 bytes per pixel, that's about 2.4 MB just for the image displayed on the monitor
      • these pixel values are stored in the frame buffer (we pretend it's in main memory, even though modern computers have it on the graphics card)
      • frame buffer is read at the monitor refresh rate (e.g. every 1/60th of a second)
      • so pixels would have to be read from memory at roughly 50 million pixels (about 150 MB) per second
      • which is unfeasible - system bus would have to spend much time just transferring pixels
      • one solution: dedicated, read-only bus ("video bus") to carry pixels
      • still, pixel values need to be computed for each frame - this had to be done by the CPU in the past, and the CPU still had to use the system bus to send the pixel values to the frame buffer, which clogged up the system bus and slowed down the CPU
      • solution to that: graphics processing unit (GPU)
      • second processor, specialised for making images
      • instructions having to do with graphics (e.g. to draw a line on the monitor or something) are sent to the GPU
      • GPU takes care of all the graphics stuff, has its own frame buffer and RAM (together with those, called the graphics card)
      • but the more complex the GPU is, the more complex the design has to be - own registers and controls, own FPU, own instructions; must be pipelined, etc; makes the cards expensive
  • two distinct methods of addressing I/O devices from an assembly language
    • first, isolated I/O: special instructions for I/O operations in the assembly language itself
    • second, memory-mapped I/O: addresses of the registers/memory for each I/O device are in a dedicated region of the virtual address space, so that you can use things like lw and sw to access them
      • this keeps instruction sets small, which is one of the goals of MIPS
    • note: syscall, while capable of doing I/O, is neither of the above; it simply branches to the kernel, which then determines whether isolated or memory-mapping is used
    • memory-mapped I/O in spim:
      • one input device (keyboard)
      • one output device (display)
      • each device has 2 registers - control, and data
        • control's LSB indicates whether the corresponding data register is ready (i.e. character has been typed, and is contained in the data register for the input, or that the output data register is ready to receive a byte from the CPU)
      • addresses in kernel's address space
      • this is fairly limited
      • real MIPS processors would have several I/O devices, many registers, addresses, local memories etc
    • to load some byte from the memory-mapped I/O region:
      • use lb etc
      • the hardware must recognise that the desired address is not part of the cache, but rather in the I/O device's memory
      • kind of like a cache miss, except the actual data is found within the I/O device
      • so the CPU puts the address of the required device on the address bus, sets controls on the control bus indicating the CPU wants data from that device
      • I/O controller reads the bus, responds with the data
      • quite similar in concept to getting data from main memory
      • same idea with output

3.3 Polling

  • wait until the input device is ready before getting its data
  • can be done by repeatedly checking the input control register bit until it's 1
  • use a fairly simple MIPS loop (lw, andi to clear all bits except the LSB, beq; if it's not zero, then read from the data register, at least in the case of memory-mapped I/O; the instructions are different for isolated I/O) - see the C sketch after this list
  • not efficient, with many cycles wasted just looping and waiting for the ready bit
  • more efficient: finite for loop
    • time spent looping depends on the number of other processes, the importance of having I/O etc
  • buffered input:
    • circular array to store characters entered by user
    • front and back registers, to hold number of characters entered and number read by CPU
    • when the char buffer is full, further keystrokes are ignored
    • when it's not full, ready to provide chars to CPU
    • could be a circuit to test for these conditions
    • can cause delays for the user - for example, if the system bus is being used by something else, the buffer fills up as the user keeps typing, but the CPU can't read from the input and send it to the output until the system bus is available
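
The polling loop from above, sketched in C against spim's memory-mapped receiver registers (the addresses are the ones spim uses; treat them as an assumption if your setup differs):

    #include <stdint.h>

    #define RECV_CTRL ((volatile uint32_t *)0xffff0000) /* bit 0 = ready */
    #define RECV_DATA ((volatile uint32_t *)0xffff0004) /* typed character */

    /* Spin until the keyboard has a character ready, then read it. */
    char poll_read_char(void)
    {
        while ((*RECV_CTRL & 1u) == 0)
            ;   /* many cycles wasted just waiting for the ready bit */
        return (char)(*RECV_DATA & 0xFFu);
    }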

3.4 Direct memory access

  • another I/O mechanism
  • communication between memory and I/O device
  • specialised set of circuits, registers, local memory, communicate with main memory over system bus
  • called DMA controller - so the CPU doesn't have to do this
  • the CPU only needs to tell the DMA controller what to do; then the CPU can do other stuff while the DMA controller uses the bus
  • for example, when a page fault occurs, the CPU tells the DMA controller of the hard disk (see the sketch after this list):
    • physical page address in main memory to swap out
    • physical page address on the hard disk
    • control signals (what to do)
    • these are set by placing values in various registers
    • CPU does this by putting address of register on address bus, and value on data bus, and control signals (for writing)
    • then, DMA controller can initiate the memory transfer, by copying the values from main memory (using the bus) to local memory, then writing to disk
  • how the CPU and DMA decide who gets to use the bus at any given time - must communicate
    • bus request (BR) line, DMA to CPU
    • bus granted (BG) line, CPU to DMA
    • if CPU is using the bus, BG = 0; else, BG = 1 (disconnects from bus - shuts down tri-state buffers, stops writing)
    • when the DMA controller is ready to use the bus, it sets BR = 1, and the CPU lets it take over afterwards
    • these are obviously not sent along the system bus - separate direct lines, must always be available
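
A hedged sketch of how the CPU might program such a DMA controller - the register names and addresses below are invented for illustration, not a real device:

    #include <stdint.h>

    /* Hypothetical memory-mapped registers of a disk DMA controller. */
    #define DMA_MEM_PAGE  ((volatile uint32_t *)0xffff1000) /* page address in RAM */
    #define DMA_DISK_PAGE ((volatile uint32_t *)0xffff1004) /* page address on disk */
    #define DMA_CTRL      ((volatile uint32_t *)0xffff1008) /* bit 0 = start, bit 1 = write */

    /* Tell the controller to swap a page out to disk; the CPU is then free
       to do other work while the controller requests the bus (BR/BG). */
    void dma_swap_out(uint32_t ram_page, uint32_t disk_page)
    {
        *DMA_MEM_PAGE  = ram_page;   /* source page in main memory */
        *DMA_DISK_PAGE = disk_page;  /* destination page on disk */
        *DMA_CTRL      = 0x3u;       /* start a write-to-disk transfer */
    }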

3.5 Hard disk performance

  • hard disk controller has local memory (DRAM), called disk cache
  • when pages are read from disk, first put in cache, then main memory; when written to disk, first written to cache, then moved to disk itself
  • could read neighbouring physical pages as well, at the same time, to improve performance
  • disk fragmentation:
    • after a while, the disk might not have large empty intervals, so you can't store large blocks contiguously
    • we can periodically defragment the disk, increasing the number of large empty gaps
  • instead of performing r/w in order of request (when there are several), sort the list so that they could be accessed in one spin

3.6 Interrupts

  • more efficient than polling at coordinating CPU and I/O
  • similar to bus requests, but different - bus request asks CPU to get off the bus so DMA can use it
  • interrupt asks CPU to stop what it's doing and do something for the DMA instead
  • can occur from input or output devices
  • example of input interrupt: mouse click, keyboard press, or Ctrl+Alt+Delete
  • output: printer runs out of paper, etc, so the CPU can send a message to the user
  • mechanism: similar to bus request, a control signal called IRQ is set to 1
    • if the CPU does not ignore it (which can occur sometimes), the CPU sets IACK to 1 (interrupt acknowledge)
    • CPU then stops writing on system bus (turns off tristate gates)
    • I/O observes that IACK is 1, so it can start writing to bus
  • could have separate IRQ and IACK lines for each I/O device, or have all the I/O devices share one IRQ line, then have the CPU ask all the I/O devices which one sent the interrupt
  • better method: daisy chaining, so I/O devices coordinate which one of them can interrupt the CPU at any given time (see the sketch after this list)
    • priority ordering, so lower priority devices cannot interrupt higher ones that are interrupting the CPU
    • physical ordering, based on OR gates
    • each I/O device: IACKin, IACKout lines (IACKin of one connected to the IACKout of the next higher priority device)
    • any device can interrupt, CPU responds with IACK of 1 to the highest priority I/O device
    • if a device requested the interrupt, IACKout set to 0; else, set to 1 (continues)
    • whenever IACKin switches from 0 to 1, sets IACKout = 0 if it requested interrupt, or 1 if it did not
    • this is done so that two I/O devices don't simultaneously write to the system bus
    • if a low priority device's interrupt is interrupted by a higher priority one, the lower one is killed off first: the high priority device changes its IACKout to 0 (mimicking the CPU ignoring it, basically), so the low priority device finishes up and sets IRQ to 0; the CPU then sets IACK back to 0, and the high priority device gets its turn
    • when a device gets acknowledged, it writes its address on the system bus
    • CPU reads it, takes appropriate action (may ignore it if it's low priority etc)
    • so, steps:
      • IRQ 0->1: I/O device sends interrupt signal
      • IACK 0->1: CPU responds, asking what the I/O device wants and what it is
      • System bus, device number written: the I/O device identifies itself, states what it wants (control line maybe?)
      • IACK 1->0: CPU will ignore interrupt request
      • Otherwise, if not set, it might send a message back on system bus; I/O device must tri-state to listen to CPU's response
      • Daisy chaining can be used with DMA, only instead of IRQ and IACK, use BR and BG signals
  • interrupt handlers:
    • when interrupts occur during program execution, process branches to exception handler in kernel
    • kernel examines the cause and status registers, and handles the interrupt if necessary
    • similar in nature to function calls
    • kernel disables other interrupts, and saves state information (EPC, etc) allowing you to go back to the prior state of execution
    • interrupts can be nested
    • priority ordering imposed on kernel routines and possible interrupts
    • for example, the kernel should not be interrupted while saving state information, so that code should be uninterruptible
    • also, lower priority I/O should not interrupt higher priority interrupt handlers, but the other way around is okay
    • so, kernel needs to:
      • disable interrupts while saving process state
      • then, enable higher-priority interrupts
      • deal with interrupt
      • restore process state
      • return from interrupt, enable interrupts again
    • bitwise, we can disable interrupts by turning off certain interrupt enable bits in the status register, using instructions like ori and and
  • process scheduling: CPU timer, keeps track of how many clock cycles the current process is using
    • when it has used enough, interrupt occurs, kernel saves state of current process, switches to another one
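
The daisy-chain acknowledgement logic from above, as a small runnable C model (device 0 is highest priority; a device that requested the interrupt absorbs IACK, otherwise it passes it along):

    #include <stdio.h>

    #define NUM_DEVICES 4

    int main(void)
    {
        int requesting[NUM_DEVICES] = {0, 0, 1, 1}; /* devices 2 and 3 raise IRQ */
        int iack = 1;                               /* CPU asserts IACK */

        for (int dev = 0; dev < NUM_DEVICES; dev++) {
            if (iack && requesting[dev])
                printf("device %d acknowledged; it now writes its address on the bus\n", dev);
            iack = iack && !requesting[dev];  /* IACKout = 0 if this device claimed it */
        }
        return 0;   /* only device 2, the highest-priority requester, is acknowledged */
    }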

3.7 Exceptions during pipelining

  • IF stage: page faults, memory protection violation
  • ID: invalid op code
  • ALU: arithmetic exception (div by 0, integer overflow)
  • MEM: page fault, mem protection violation
  • WB: none
  • if there is an exception at one stage, the instruction just ahead in the pipeline would be allowed to finish
  • then, process is paused, state of the process is saved
  • exception is handled, process state restored, pipeline continues (at the beginning)

3.8 System bus synchronisation

  • though the clock speed is lower than that of the CPU, there is still a clock, and things are done synchronously
  • but synchronous system buses don't work so well when there are large distances to travel
  • in that case, use asynchronous I/O
  • two methods: handshaking, serial I/O

3.8.1 Handshaking

  • source-initiated handshaking: sender sends a "data request" control signal, then puts data on the line (usually multiple parallel lines) - see the sketch after this list
    • receiver reads control, reads data, replies with "data acknowledge" control signal
    • source receives acknowledgement, stops writing data, resets data request line to 0, resets data acknowledge line to 0
  • destination-initiated handshaking: receiver initiates it by setting its data request control signal to 1
    • source reads it, puts data on line, sends data ready signal
    • receiver reads data ready signal, reads data, resets the data request signal to 0
    • sender turns off data ready signal, stops writing data
  • no clock involved - all asynchronous, waits for the other side to be done etc
  • example of this scheme: parallel port, for printing
    • a bunch of different wires, some of which are output only (e.g. source ready, data), others are input only (e.g. ACK, busy, no paper)
    • source-initiated handshaking method (so CPU tells printer controller when it needs to print something)
    • CPU sends data, the printer reads it, then sets ACK to 1, then the controller stops sending data; if the printer cannot keep up with the data ready signals, it sets BUSY to 1, and the controller waits before trying again
  • you can use an asynchronous, handshaking scheme for the system bus itself (source- or destination- initiated)
    • for a shared bus, one device must be the one deciding who gets to write to each line of the bus (CPU - master/slave)
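
A toy model of the source-initiated handshake (shared flags stand in for the control wires; the line names follow the notes, but the code is purely illustrative - note the data is put on the line before the request is raised, so it is valid when the receiver reads it):

    #include <pthread.h>
    #include <stdio.h>

    volatile int data_request = 0, data_ack = 0;
    volatile int data_line;

    static void *destination(void *arg)
    {
        (void)arg;
        while (!data_request) ;          /* wait for "data request" */
        printf("received %d\n", data_line);
        data_ack = 1;                    /* reply with "data acknowledge" */
        while (data_request) ;           /* wait for the source to finish */
        data_ack = 0;                    /* reset acknowledge line */
        return NULL;
    }

    int main(void)
    {
        pthread_t dest;
        pthread_create(&dest, NULL, destination, NULL);

        data_line = 42;                  /* source puts data on the line */
        data_request = 1;                /* ... then raises "data request" */
        while (!data_ack) ;              /* waits for acknowledgement */
        data_request = 0;                /* stops writing, resets request */

        pthread_join(dest, NULL);
        return 0;
    }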

3.8.2 Serial bus

  • completely different - sequence of signal pulses traveling down a wire
  • a clock is still necessary, since you have to time the reads of the pulses, but no clock signal is shared over the bus
  • the frequency of a serial bus is measured in bits per second (baud)
  • clock speed much slower than the CPU
  • how the receiver reads from the serial bus:
    • sender/receiver have agreed on a bit duration T, and the number to be transmitted
    • sender sets wire to 1 for the normal, non-transmitting state, then switches it to 0 (start bit) to indicate that data is about to be transferred
    • receiver waits half the bit duration, checks that the signal is still 0 (to make sure it wasn't just noise)
    • then, 1 T later, it samples the signal, repeatedly sampling every T for as many bits as it's expecting
    • samples from the middle of the pulse - avoids transition at boundaries
    • after reading all the expected bits, it reads a stop bit (1) - necessary to ensure some spacing between bytes
    • receiver only starts sampling again when it transitions from 1 to 0 (again)
    • sender and receiver each have their own clock, which can be of different speeds (clock cycles much shorter than T)
  • mechanisms for error-checking (noise, etc):
    • parity bits: with even parity, the bit is set so that the total number of 1s (data plus parity) is even, i.e. set to 1 if the data has an odd number of 1s; a single-bit error will always be detected (see the sketch at the end of this section)
  • hardware-wise:
    • sender must be able to convert bytes -> bit sequences (parallel-in serial-out shift register)
    • receiver: bit sequences -> bytes (serial-in parallel-out shift register)
    • UART: I/O controller that can convert serial/parallel to parallel/serial (does the bit/byte conversion)
      • has several registers, for communicating with other devices
      • also has special circuits to remove the start/stop bits, check for parity
  • USB (universal serial bus)
    • USB devices connect to a computer in a tree structure
    • root is the USB controller, inside the computer (host)
      • single host can have up to 127 devices connected to it
      • uses polling to communicate with devices
      • can handle very fast data transfer (480 Mbit/s for USB 2.0, for example)
    • drivers usually included in OS already, or downloaded automatically etc ("plug and play")
    • reasons for safely removing hardware:
      • any changes written to a file may not have been saved to the USB device itself (a more recent version may still be in the cache, not yet written back - only an issue under the write-back policy)
      • voltage fluctuations
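
The even-parity computation described above, as a final C sketch:

    #include <stdint.h>

    /* Even parity: the parity bit is chosen so that data plus parity
       contains an even number of 1s - i.e. it is 1 iff the byte has an
       odd number of 1s. */
    uint8_t even_parity(uint8_t byte)
    {
        uint8_t parity = 0;
        for (int i = 0; i < 8; i++)
            parity ^= (byte >> i) & 1u;  /* XOR of all bits */
        return parity;
    }

    /* Receiver-side check: any single flipped bit makes this fail. */
    int parity_ok(uint8_t byte, uint8_t parity_bit)
    {
        return even_parity(byte) == parity_bit;
    }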