Wikinotes

Maintainer: admin

Most of this information comes from Professor Langer's lecture notes, publicly available on his website. This page is simply a summarisation of the notes, combined with some additional information gleaned from various sources, and with some simple homemade diagrams. Please feel free to edit this page if you notice any errors or omissions. If you have questions or want to challenge the accuracy of something, please contact @dellsystem directly.

The information is not presented in the same order as that used in the slides, but I tried to keep it as close as possible while keeping with the organisation of this page.

basic review of instructions (don't need to memorise, but be familiar with them)

loading/saving from memory
- lw, sw, offsets
basic arithmetic
- add, addi, sub, subi
branching
- program counter (PC) register holds the address of the current instruction line
- incremented after each instruction
- how to do an if-else statement (basic concept)

each instruction is one word
upper (first) 6 bits contain "op code"
but there are more than 2⁶ = 64 instructions
3 different instruction formats
some instructions are actually "pseudoinstructions" in that the assembler converts them into others
- not really part of MIPS
- e.g. move $t0, $t1 is really just addi $t0, $t1, 0

field	opcode	rs	rt	rd	shamt	funct
number of bits	6	5	5	5	5	6

when there's no immediate value involved and you're doing something with 1 or more registers
first, 6-bit op code of 000000
3 registers max, 5 bits per reg (2⁵ = 32), so 15 bits for registers
order: source, other (source/destination), destination
"shamt" - shift amount (5 bits), used for srl/sll
"funct" - different variants of operation, to differentiate, 6 bits
for example, sub and add have same op code, but diff funct
64 different R-format instructions possible, due to "funct" (2⁶)
examples: add, sub, sll, or, slt, and, mult, div, jr

field	opcode	rs	rt	immediate
number of bits	6	5	5	16

when you want to specify some immediate value
6 bits opcode
10 bits, for two registers (rs, rt)
16 bits for an immediate value (signed, offset); 2^-15 to 2^15-1
for lw, the address is rs + offset; stored in register rt
for sw, the address is rs + offset; store in it contents of rt
for beq, compare rs and rt, jump to PC + offset if rs == rt
- so usually we store a label, i guess the label is the offset
- note that the offset is always a multiple of 4 bytes
- cuz, instructions are 4 bytes long etc
must be specified entirely by the opcode
things like blt, bgt, ble etc DON'T ACTUALLY EXIST IN MIPS
although there are pseudo-instructions like that
but those aren't really part of mips, translated by assembler
blt substituted by slt, beq (using \$t0 - WILL OVERWRITE VALUE)
to do something like blt: use slt - "set on less than" (R-format)
then, beq with \$0 (if the slt condition is true, will NOT be 0)
didn't make pseudo-instructions core to lower number of I-format
this way, hardware could be simpler, run faster etc
some R-formats have corresponding I-format instructions
- add --> addi, can put in a decimal number
- slt --> slti
examples: lw, sw, beq, bne, addi, lb, sb, lui

field	opcode	rs	rt	immediate
number of bits	6	5	5	16

6 bit opcode
26 bit address (so we can jump farther)
to be specific, 2²⁶ possible instructions we can jump to
examples: j, jal (the only ones)

addu, addiu, etc
the "u" in the name means that the immediate or source register is treated as an unsigned int
-50000, for example, would be treated as a negative number if not unsigned
sltu and slt can have different results if one has a MSB of 1, other 0
for sign extending: to put something in an ALU (for arithmetic/logic)
the sign must be extended, because ALUs operate on 32-bit numbers
- if unsigned: just add 16 0's to the left (upper 16 bits)
- if signed: copy the MSB (far left) to the upper 16 bits (fill em all)

There are 32 registers, \$0 to \$31. The following table does not need to be memorised, but it's probably a good idea to at least be familiar with the different types of registers out there and their purposes.

Number	Name	Description
\$0	\$zero	cannot be used, hardware prevents you from writing to it
\$1	\$at	temporary, used by the assembler
\$2-\$3	\$v0-\$v1	function return values
\$4-\$7	\$a0-\$a3	function arguments
\$8-\$15	\$t0-\$t7	temporary registers
\$16-\$23	\$s0-\$s7	"save" registers (child function should not overwrite)
\$24-\$25	\$t8-\$t9	more temporary registers
\$26-\$27	\$k0-\$k1	reserved for the kernel
\$28	\$gp	global pointer
\$29	\$sp	stack pointer
\$30	\$fp	frame pointer
\$31	\$ra	return address, automatically saved when calling `jal`

partitioned into "kernel" and "user" regions (2³¹ bytes each, so the total memory size is 2³² bytes)
each: "instructions" region of 2²⁸ bytes (2²⁶ words), "data" region of (2³¹ - 2²⁸) bytes
instructions are word-aligned
- so, multiple of 4 etc
- lower two bits of each instruction are 00
consequently, the jump instruction can jump to any instruction in the relevant section (kernel or user)

what if you want to store an immediate 32-bit value in a register?

this is possible, though not with one instruction
since li etc all take a 16-bit value per instruction ...
lui: loads upper two bytes with the 16 bits you specify
then, to fill in the lower 16 bits (currently all 0's), use ori
which ORs every bit, with an immediate (so the lower 16 become filled)
we could use addiu, but sometimes it doesn't work, so don't do that
this is really just a logical operation, so, use ori

Shifting bits to the left or right (multiplying/dividing by 2ⁿ)

sll: multiplies by 2, so moving to the left (lower bits => 0)
srl (shift right logical): divides by 2, upper bits => 0

For arrays of ints

can't transfer data directly from memory -> memory, so, registers
base address, offset is 4 * the index
must increment by 4 each because each int element takes up 4 bytes

each character is a byte, just because (so 256 possible characters)
lb, sb - loading/saving bytes, I-format
lb saves the byte (8 bits in a 32-bit register)
the loaded byte is then sign-extended
so the upper 24 bits are filled with the MSB of the loaded byte
the unsigned version (lbu) just fills the upper 24 with 0s
sb just ignores the upper 24 bits in the register
to compute the length of a string, do it the C way - stop at "\0"
- since the null character "\0" is 00000000 in binary, use beq with \$0
- differs from a regular array because each element uses 1 BYTE, not WORD
- so, instead of incrementing by 4 each time, just increment by 1

.data: for stuff to be put in the data part of memory
use a label like string: .asciiz "this is a string"
this label is translated into EITHER an offset or an absolute address
offset for conditional branches, absolute for jump etc
other things: .word, .byte, .space (reserves number of bytes, used in arrays)
.text: instruction part of memory
.align 2: does word align; this is sometimes necessary if you have a string declaration right before a float declaration, as word-alignment is expected for floats^confirm
.globl main: makes the "main" label global, seems to work without it
"main" label: like the main method for C/Java/etc; program starts here

la: pseudoinstruction, translated into "lui \$reg, address"
where address is the address of the thing label refers to (in data section)
this translation from label to address is done for you by the assembler

syscalls for doing I/O or ending the program etc (operating system stuff)
specify the thing to do with \$v0
1-4 to print: int, float, double, string
put in \$a0/\$f12 the relevant thing
5-8 to read: the above order
int is stored in \$v0, \$f12, or the address specified in \$a0 (length \$a1)
10 to exit. For example:

li $v0, 10
syscall

jal, stores the return address in \$ra (J format)
you can then return to the next line with jr \$ra (R format)
pass the child function arguments in \$a0-\$a3
typically, the returned values are in \$v0, \$v1
child function should not overwrite \$s0-\$s7 registers
instead, if it needs those, it should save the contents to the stack
the stack is explained in the next section

used when a child function wants to preserve values in the \$s0-\$s7 regs
special region of MIPS memory
grows downwards rather than upwards
the "base" of the stack is the address at the "top" (really the bottom) when stack is empty
the base is at a fixed address of 0x7fffffff (halfway point in mips memory)
(also note that the kernel begins at 0x80000000)
the address of the "top" of the stack is called the stack pointer, in \$sp
to push a word onto the stack, FIRST decrease \$sp by 4, then do sw
to pop, do lw, THEN increase \$sp by 4
if there are multiple things to store, then, put in a code example:

addi $sp, $sp, -8
sw $t0, ($sp)
sw $t1, 4($sp)

# Do things here, making use of the $t0 and $t1 registers if desired

# Now get the words back into $t0 and $t1
lw $t0, ($sp)
lw $t1 4($sp)
addi $sp, $sp, 8

A very simplified and not-to-scale diagram of the situation is below:

Stack and memory usage diagram

if a child function is also a parent, you'll probably have to push \$ra here
may also need to preserve argument registers or temp registers in that case
stack frames: the contiguous set of words on the stack used to store shit
per function. makes sense: say you decrease \$sp by 12, then, 3 words in your stack frame
if you don't know the stack frame size in advance (only determined during runtime) ...
for example, if the amount of memory allocated is based on I/O stuff ...
then you need to move the stack pointer as the stack frame changes size
to mark the beginning of the stack frame, use \$fp
don't know how the above really works or if it's ever actually necessary

since int multiplication takes two 32-bits and returns a 64 bit ...
the product of mult $t0, $t1 is located in a "product" register, which is read-only
accessed using two separate instructions
mfhi \$t0, stores the "hi" (upper) 32 bits from it in \$t0
mflo \$t1, stores the "lo" (lower) 32 bits from it in \$t1
i think there is a "hi" register and a "lo" one, so two separate regs
for division, call div
mfhi to get the remainder and store it in a reg
mflo to get the quotient and store it in a reg
mult, div, mflo, mfhi are all R-format

floating-point operations need to be done with special circuits/regs
located on a physically separate chip (no room on the original lol)
called the floating-point coprocessor/unit (FPU), coprocessor 1
has its own set of registers, \$f0 to \$f31
for double precision, two registers, starts with an even (\$f0, \$f2, etc)
only need to mention the first register, the second is used automatically
lwc1/swc1 to load/save to/from memory (address held in CPU register)
for doubles, there are pseudoinstructions to make it faster: l.d, s.d
mfc1/mtc1 to move from/to the CPU (specify the CPU register)
- note that the order of operands is reversed from the usual for mtc1
- FROM _ to _, not destination, source as it usually is
- e.g. mtc1 \$s0, \$f0
floating point ops: add.s, add.d, sub.s, sub.d, etc
have to use \$f? registers for the above
can't mix and match \$f and non-\$f registers here
similarly, can't use \$f registers for non-floating-point operations
multiplication and division are easier here, same style as add.s etc
mul.s, div.s, mul.d, div.d (no separate circuit, no remainder, etc)
conditional branching: c.__.s/d \$f0, \$f1/2; bc1t label
the result of the comparison is stored in a special D flip-flop
bc1t label: if the condition is true, branch to label
bc1f label: if the condition is FALSE, go to the label

All arithmetic operations (add, sub, mult, etc) have to be done with either integers for both inputs or floats for both inputs - you can't mix and match. So what do you do if you have to perform some arithmetic operation on an integer and a float? Depending on the type of the desired result, you can either convert the integer to a float or the float to an integer, and perform that operation on the appropriate coprocessor.

To convert from int to float:

mtc1 $t0, $f0
cvt.s.w $f0, $f0

To convert from float to int:

cvt.w.s $f0, $f0
mfc1 $t0, $f0

Note that the above just truncates the value, so 1.01, 1.99, 1.00 etc would all become 1. Also note that this method of type conversion differs from using trunc.s, round.s, ceil.s, etc as the output type is still a float, part of the FPU.

converting from single to double - tack on 0's but only to the significand part
converting from double to single - this rounds the figure, as opposed to only truncating it

done in coprocessor 0
an event that requires the program to stop, OS has to do something
has special registers, used only in kernel mode:
- EPC, \$14: return address, the one that follows the line that caused the exception
- cause, \$13: code for what kind of exception occurred
- BadVaddr, \$8: holds the address that led to an exception if user tried to access an illegal address
- Status, \$12: whether the program is running in kernel mode or in user mode
examples of things that cause errors/exceptions etc:
- overflow, e.g. adding 2³¹ to itself
- division by 0 (same error as above)
- illegal address, e.g. trying to jump to something in the data segment
- note that trying to load from the text segment doesn't cause an exception as mips can't detect (2 lines etc)
note that with floating points, certain things don't cause exceptions ...
- e.g. overflow, would result in +infinity in certain cases
- division by zero also results in +infinity
- except when dividing 0 by 0 - NaN
what actually happens is that the program jumps to the first address in the kernel, 0x80000000
so 0x80000000 is written into the PC by the PCSrc (the selector)

single cycle as in, one instruction per clock cycle
first, to fetch the current instruction out of memory, store its address in the PC (program counter) register
for a simple non-branching instruction, PC becomes PC + 4 at the end of the clock cycle
only at the END i believe (you use PC for this instruction)
then to get it out of memory, use a huge multiplexor circuit and PC, selects one of 2^30 words from memory
the opcode + funct code (for R-format only) control what happens next
for example, for add (R-format), after it is read out of memory:
- two ReadReg and one WriteReg is selected (the registers to read from/write to)
- the control signals for the ALU are determined ^{needs explanation?}
- RegData values are read from the source registers, sent as input to the ALU
- the ALU operation is performed
- result is written to the WriteReg
- the new value for the PC is written
- this can all be done in a single clock cycle, if it's long enough for the circuits to stabilise (it should be)
- (for example, the ALU operation involves carry values to ripple though, which takes time)
another example, for lw (I-format):
- first two are the same, ReadReg and WriteReg selected
- base address for memory word is read from the source register, fed to the ALU
- the provided offset is sign-extended, also fed to the ALU
- control signals are set for the ALU operation ^{needs explanation?}
- ALU computes the sum of the base address + offset
- the resulting address is used to select a word from memory, which is then written to the register
- PC + 4 is written
sw:
- first two are the same, two ReadReg are selected
- get the resulting memory address, offset + base using ALU etc
- the word in one of the registers is written to the specified address in memory
- PC + 4 is written
bne:
- in this case, PC might not take the value PC + 4 at the end of the clock cycle
- note: because instructions are word-aligned in memory, MIPS assumes lowest two bits are 00
- therefore it doesn't store those two bits in memory, so the 16-bit field is really 16 + 00 at the end
- no shift register is required for turning it into a 32-bit word eventually: 16 -> 30 etc (sign-extended)
- anyways, select between PC + 4 and PC + 4 + offset
- read from the registers, subtract, if the result is NotZero, select PC + 4 + offset; else, PC + 4
- this value is written back into PC
jump, J-format:
- 26 bits to specify address; as address is word-aligned, bits 0 and 1 are always 0 and 0
- so you specify bits 2-27 with the 26-bit address
- as for bits 28-31, based on the upper 4 bits of the current PC - 0000 for user, 1000 for kernel programs

merging datapaths (i.e. how they all work together)

Obviously, there is only one chip layout which has to accommodate all the different possible instructions
So now we merge together all the datapaths and illustrate how they function in tandem
The controls:
- RegDst: the register to be written to; bits 16-20 or 11-15 or not used at all, depending on the instruction
- ALUSrc: one of the ReadData registers is hardwired to be an input for the ALU
  - the other input either comes from a register (e.g. in add) or the immediate (addi, etc)
  - so we need a multiplexor to determine where the other input comes from
- MemToReg: selects where the thing to write to the register (if there is one) comes from (memory, or ALU)
- RegWrite: specifies whether or not we are writing to a register at all (also called WriteEnable in lecture notes)
- MemWrite: whether or not we are writing to memory
- MemRead: whether or not we are reading from memory
- ALUOp: specifies the operation that the ALU should perform (addition, subtraction, etc)
  - Recall from previous lectures that the ALU has two control inputs
  - One is the Binvert, which specifies whether or not we are doing subtraction
  - The other is the "op" field, which controls if it's AND, OR, or XOR (i.e. +, includes subtraction)
  - 10 for addition, 00 for AND, 01 for OR, 11 for slt (set less than; 1 if the difference is negative, 0 otherwise)
  - show the truth table, make sure it's correct somehow
  - If ALUOp is 10, then, use the funct field to determine Binvert and op
  - Otherwise, the opcode determines Binvert and op directly
  - Draw the circuit
- show a table linking the input (opcode) to the output (the values of the controls)
In the case of bne, there are two things to compute: the final address, and whether or not the inputs are equal
- Since the ALU can only be used for one of them, we use a separate adder circuit for the address calculation
- draw out a basic version, and explain what's going on
Datapath for jal
Datapath for jr
- jr is obviously R format since there is no intermediate; funct field is used here
Of course, this is all done in one clock cycle. We must choose a clock rate to ensure that the slowest operation gets carried out in that time
- The worst case is the lw instruction, and this is fairly slow
- But designing for the worst case doesn't seem like the best way of doing it
- Some instructions are quick, some are slow, but if they're all the same length they all become slow
- Which seems silly and inefficient
- Consequently, most MIPS implementations actually make use of multiple cycles, next section

Multi-cycle models: pipelining and microinstructions
- Continued from above: we break each instruction into short stages
- The clock cycle only has to be as long as the longest stage

The first method of doing this: microinstructions

Hold the PC constant during an instruction, only update the PC at the END of the instruction (after the stages are all complete)
Each stage in the instruction is defined by the control settings that carry the data from one datapath to the next
i.e. by a microinstruction
we can naturally partition each instruction into stages based on the main actions performed
that is, fetching the instruction, reading from the source register, ALU stuff, accessing memory, writing to the dest register
so, 5 stages max, per instruction (not all stages needed for each instruction)
we then have a bunch of different possible stages, each with its own microinstruction
certain specific stages are used in more than one instruction - for example, fetch instruction
- (in which the PC address is read from memory, etc)
- this microinstruction is used in every instruction
but we need a way of knowing which microinstruction we're on, and which is next
solution: microPC, which is just like PC except at a lower level, of sorts
each microinstruction/stage thing has 2 controls: next, and fetch
if "next" is 1, then the next microinstruction will just be the next address (presumably microPC + 4, if each is a word)
if next is 0, then the next instruction is "fetch instruction", so jump to there
fetching the next instruction is always the last real microinstruction for this instruction
the next microinstruction is the first one for the next instruction, if there is one
of course, we need a table to store the first microinstruction address for each instruction
this could be implemented as a read-only combinatorial circuit
PCWriteEnable is a control that specifies when to write to PC (at the end of the instruction)
the lecture notes indicate that this is set only when Next=Fetch=0, so presumably this happens at the same time as fetching?
Not really sure ...
anyway, although this method is an improvement over the single cycle, it has its downsides
for instance, most parts of the CPU are doing nothing in any given clock cycle
since only one thing is active at a time (only one microinstruction etc)
for ex, if the ALU is active, then all the memory-related stuff is just idling
so now we come to another method

the PC still changes at the end of each clock cycle
but more than one instruction is in the data path at any given time (sort of like parallel processing)
you still partition instructions into stages
but, you have more than one stage running per clock cycle
ideally, one instruction completes per clock cycle (this is the maximum efficiency)
if there are 5 stages in a pipeline, then there's a 5x speedup relative to the single cycle model
this reduces the time per clock cycle, AND ALSO the clock cycles per instruction (theoretically)
in any case, here's the process:
- each stage completes in one clock cycle (which is long enough for the longest stage)
- there are five stages:
  - IF: instruction fetch
  - ID: instruction decode AND register read
  - ALU: ALU operation
  - MEM: reading from/writing to memory
  - WB: writing data into the destination register
- we need new registers to control the flow from one stage to another
  - because new instructions are needed for ID after IF, etc
  - so there's IF/ID for that case, which contains the fetched instruction
  - and ID/ALU, ALU/MEM, MEM/WB, each of which contains data necessary to execute the next stage
  - these registers only hold their value for one clock cycle, they propagate along sort of, or overwritten by the next instruction
data hazards in pipelining:
- when a register is read from (second instruction) before it can be written to (first instruction)
- example: adding two numbers and storing it in \$t0, then subtracting \$t0 from something else right after
  - obviously this will produce the wrong result
  - as in, by the time add is in the ALU stage, sub will be in ID, and so will read the OLD value of \$t0
  - several ways to avoid this; one is to insert a "nop" (no operation) between add and sub
  - basically, stalling
  - but this is not the most desirable, as this shuts down the pipeline, which wastes a turn ...
  - another method: data forwarding
  - this way, the results of the addition could be carried forward to sub, so no useless write/read of registers is needed
  - we could put a new multiplexor in front of each ALU - if the one we're writing to is the same as the one we're reading from next
  - then, pass it backwards (not really forwards, physically) so that sub (in this case) can use the computed value for \$t0
- another example: lw, then add
  - we could stall by adding 3 nops
  - or, data forwarding, by stalling with one nop, then forward (pass backwards) the value read at MEM/WB as an ALU input
  - we could add multiplexors in this case, may draw diagrams later
  - another solution is to reorder, by taking other nearby instructions (that don't use the lw register etc) and putting them betwee
control hazards:
- unconditional jumps:
  - you only realise that you need to jump after the sequentially "next" instruction has been fed in
  - so you need to get rid of it once you've reached the ID stage and determined that you have to jump somewhere else
  - this is done by writing nop instead of that instruction (overwriting it)
  - stalls it by one clock cycle
  - not really sure how this is done ...
- conditional jumps (branching):
  - you can't really know until after the ALU stage whether or not you need to branch
  - one way is to stall until the ALU stage is done, after which, feed in the next instruction (whichever is necessary)
  - another: put in the sub the usual way, but change it to nop (also: change RegWrite to 0) if the branch is taken (not sure how)
  - or: we can reorder, by taking the instruction before the beq (if there is no clash etc) and putting it after
  - the instruction before is executed anyway, regardless of the branch condition, so it should only be after in the assembled code
  - i.e. it's all taken care of by the compiler
incidentally, for floating point ops, the "execute" stage (equivalent of ALU stage) is broken down into multiple stages
- so, it takes more than one clock cycle to finish the computation, though each individual stage is still one clock cycle
- this makes it more complicated etc, but we're not going to go into that
Comparing computer speeds
- comparing computer speeds based on their clock speeds is naive
- it's only one indicator of performance
- a more useful way is to see how two computers run a particular program on some input (high level --> assembly)
- three factors come into play: time per clock cycle, clock cycles per instructions, instructions in the program (all are variable)
  - (the last one is variable because the number of instructions in the program depends on the assembly language for the CPU)
  - (this is the number of instructions EXECUTED, not necessarily in memory ... takes into account loops, etc)
  - if the time per clock cycle is low (high clock speed), that's good, BUT, it might mean more cycles per intruction
  - of course, number of clock cycles per instruction can vary for multi-cycle models (always 1 for single-cycle implementation)
- also, for computers with multiple processors, it's harder to compare, depends on how work is divvied up among the processors

Final review: MIPS

1MIPS: the basics¶

1.1Types of instructions¶

1.2Instruction formats¶

1.2.1R-format instructions¶

1.2.2I-format instructions¶

1.2.3J-format instructions¶

1.3Signed vs unsigned instructions¶

1.4Register conventions¶

1.5Memory¶

2Various MIPS operations and actions¶

2.1Storing an immediate 32-bit value¶

2.2Bitshift operations¶

2.3Array operations¶

2.4Working with strings¶

2.5Assembler directives¶

2.6Loading addresses¶

2.7IO and system calls¶

2.8Functions¶

2.9Using the stack¶

2.10Multiplication and division¶

2.11Floating point operations¶

2.12Type conversion¶

2.13Exception handling¶

3Datapaths¶

3.1Single-cycle datapaths¶

3.1.1Merging datapaths¶

3.2Multi-cycle datapaths¶

3.2.1Microinstructions¶

3.2.2Pipelining¶