Most of this information comes from Professor Langer's lecture notes, publicly available on his website. This page is simply a summarisation of the notes, combined with some additional information gleaned from various sources, and with some simple homemade diagrams. Please feel free to edit this page if you notice any errors or omissions. If you have questions or want to challenge the accuracy of something, please contact @dellsystem directly.
The information is not presented in the same order as that used in the slides, but I tried to keep it as close as possible while keeping with the organisation of this page.
- 1 MIPS: the basics
- 2 Various MIPS operations and actions
- 2.1 Storing an immediate 32-bit value
- 2.2 Bitshift operations
- 2.3 Array operations
- 2.4 Working with strings
- 2.5 Assembler directives
- 2.6 Loading addresses
- 2.7 IO and system calls
- 2.8 Functions
- 2.9 Using the stack
- 2.10 Multiplication and division
- 2.11 Floating point operations
- 2.12 Type conversion
- 2.13 Exception handling
- 3 Datapaths
1MIPS: the basics¶
1.1Types of instructions¶
basic review of instructions (don't need to memorise, but be familiar with them)
- loading/saving from memory
- lw, sw, offsets
- basic arithmetic
- add, addi, sub, subi
- branching
- program counter (PC) register holds the address of the current instruction line
- incremented after each instruction
- how to do an if-else statement (basic concept)
1.2Instruction formats¶
- each instruction is one word
- upper (first) 6 bits contain "op code"
- but there are more than 26 = 64 instructions
- 3 different instruction formats
- some instructions are actually "pseudoinstructions" in that the assembler converts them into others
- not really part of MIPS
- e.g.
move $t0, $t1
is really justaddi $t0, $t1, 0
1.2.1R-format instructions¶
field | opcode | rs | rt | rd | shamt | funct |
---|---|---|---|---|---|---|
number of bits | 6 | 5 | 5 | 5 | 5 | 6 |
- when there's no immediate value involved and you're doing something with 1 or more registers
- first, 6-bit op code of 000000
- 3 registers max, 5 bits per reg (25 = 32), so 15 bits for registers
- order: source, other (source/destination), destination
- "shamt" - shift amount (5 bits), used for srl/sll
- "funct" - different variants of operation, to differentiate, 6 bits
- for example, sub and add have same op code, but diff funct
- 64 different R-format instructions possible, due to "funct" (26)
- examples: add, sub, sll, or, slt, and, mult, div, jr
1.2.2I-format instructions¶
field | opcode | rs | rt | immediate |
---|---|---|---|---|
number of bits | 6 | 5 | 5 | 16 |
- when you want to specify some immediate value
- 6 bits opcode
- 10 bits, for two registers (rs, rt)
- 16 bits for an immediate value (signed, offset); 2-15 to 215-1
- for lw, the address is rs + offset; stored in register rt
- for sw, the address is rs + offset; store in it contents of rt
- for beq, compare rs and rt, jump to PC + offset if rs == rt
- so usually we store a label, i guess the label is the offset
- note that the offset is always a multiple of 4 bytes
- cuz, instructions are 4 bytes long etc
- must be specified entirely by the opcode
- things like blt, bgt, ble etc DON'T ACTUALLY EXIST IN MIPS
- although there are pseudo-instructions like that
- but those aren't really part of mips, translated by assembler
- blt substituted by slt, beq (using \$t0 - WILL OVERWRITE VALUE)
- to do something like blt: use slt - "set on less than" (R-format)
- then, beq with \$0 (if the slt condition is true, will NOT be 0)
- didn't make pseudo-instructions core to lower number of I-format
- this way, hardware could be simpler, run faster etc
- some R-formats have corresponding I-format instructions
- add --> addi, can put in a decimal number
- slt --> slti
- examples: lw, sw, beq, bne, addi, lb, sb, lui
1.2.3J-format instructions¶
field | opcode | rs | rt | immediate |
---|---|---|---|---|
number of bits | 6 | 5 | 5 | 16 |
- 6 bit opcode
- 26 bit address (so we can jump farther)
- to be specific, 226 possible instructions we can jump to
- examples: j, jal (the only ones)
1.3Signed vs unsigned instructions¶
- addu, addiu, etc
- the "u" in the name means that the immediate or source register is treated as an unsigned int
- -50000, for example, would be treated as a negative number if not unsigned
- sltu and slt can have different results if one has a MSB of 1, other 0
- for sign extending: to put something in an ALU (for arithmetic/logic)
- the sign must be extended, because ALUs operate on 32-bit numbers
- if unsigned: just add 16 0's to the left (upper 16 bits)
- if signed: copy the MSB (far left) to the upper 16 bits (fill em all)
1.4Register conventions¶
There are 32 registers, \$0 to \$31. The following table does not need to be memorised, but it's probably a good idea to at least be familiar with the different types of registers out there and their purposes.
Number | Name | Description |
---|---|---|
\$0 | \$zero | cannot be used, hardware prevents you from writing to it |
\$1 | \$at | temporary, used by the assembler |
\$2-\$3 | \$v0-\$v1 | function return values |
\$4-\$7 | \$a0-\$a3 | function arguments |
\$8-\$15 | \$t0-\$t7 | temporary registers |
\$16-\$23 | \$s0-\$s7 | "save" registers (child function should not overwrite) |
\$24-\$25 | \$t8-\$t9 | more temporary registers |
\$26-\$27 | \$k0-\$k1 | reserved for the kernel |
\$28 | \$gp | global pointer |
\$29 | \$sp | stack pointer |
\$30 | \$fp | frame pointer |
\$31 | \$ra | return address, automatically saved when calling jal |
1.5Memory¶
- partitioned into "kernel" and "user" regions (231 bytes each, so the total memory size is 232 bytes)
- each: "instructions" region of 228 bytes (226 words), "data" region of (231 - 228) bytes
- instructions are word-aligned
- so, multiple of 4 etc
- lower two bits of each instruction are 00
- consequently, the jump instruction can jump to any instruction in the relevant section (kernel or user)
2Various MIPS operations and actions¶
2.1Storing an immediate 32-bit value¶
what if you want to store an immediate 32-bit value in a register?
- this is possible, though not with one instruction
- since li etc all take a 16-bit value per instruction ...
- lui: loads upper two bytes with the 16 bits you specify
- then, to fill in the lower 16 bits (currently all 0's), use ori
- which ORs every bit, with an immediate (so the lower 16 become filled)
- we could use addiu, but sometimes it doesn't work, so don't do that
- this is really just a logical operation, so, use ori
2.2Bitshift operations¶
Shifting bits to the left or right (multiplying/dividing by 2n)
- sll: multiplies by 2, so moving to the left (lower bits => 0)
- srl (shift right logical): divides by 2, upper bits => 0
2.3Array operations¶
For arrays of ints
- can't transfer data directly from memory -> memory, so, registers
- base address, offset is 4 * the index
- must increment by 4 each because each int element takes up 4 bytes
2.4Working with strings¶
- each character is a byte, just because (so 256 possible characters)
- lb, sb - loading/saving bytes, I-format
- lb saves the byte (8 bits in a 32-bit register)
- the loaded byte is then sign-extended
- so the upper 24 bits are filled with the MSB of the loaded byte
- the unsigned version (lbu) just fills the upper 24 with 0s
- sb just ignores the upper 24 bits in the register
- to compute the length of a string, do it the C way - stop at "\0"
- since the null character "\0" is 00000000 in binary, use beq with \$0
- differs from a regular array because each element uses 1 BYTE, not WORD
- so, instead of incrementing by 4 each time, just increment by 1
2.5Assembler directives¶
- .data: for stuff to be put in the data part of memory
- use a label like
string: .asciiz "this is a string"
- this label is translated into EITHER an offset or an absolute address
- offset for conditional branches, absolute for jump etc
- other things: .word, .byte, .space (reserves number of bytes, used in arrays)
- .text: instruction part of memory
.align 2
: does word align; this is sometimes necessary if you have a string declaration right before a float declaration, as word-alignment is expected for floatsconfirm- .globl main: makes the "main" label global, seems to work without it
- "main" label: like the main method for C/Java/etc; program starts here
2.6Loading addresses¶
- la: pseudoinstruction, translated into "lui \$reg, address"
- where address is the address of the thing label refers to (in data section)
- this translation from label to address is done for you by the assembler
2.7IO and system calls¶
- syscalls for doing I/O or ending the program etc (operating system stuff)
- specify the thing to do with \$v0
- 1-4 to print: int, float, double, string
- put in \$a0/\$f12 the relevant thing
- 5-8 to read: the above order
- int is stored in \$v0, \$f12, or the address specified in \$a0 (length \$a1)
- 10 to exit. For example:
li $v0, 10 syscall
2.8Functions¶
- jal, stores the return address in \$ra (J format)
- you can then return to the next line with jr \$ra (R format)
- pass the child function arguments in \$a0-\$a3
- typically, the returned values are in \$v0, \$v1
- child function should not overwrite \$s0-\$s7 registers
- instead, if it needs those, it should save the contents to the stack
- the stack is explained in the next section
2.9Using the stack¶
- used when a child function wants to preserve values in the \$s0-\$s7 regs
- special region of MIPS memory
- grows downwards rather than upwards
- the "base" of the stack is the address at the "top" (really the bottom) when stack is empty
- the base is at a fixed address of 0x7fffffff (halfway point in mips memory)
- (also note that the kernel begins at 0x80000000)
- the address of the "top" of the stack is called the stack pointer, in \$sp
- to push a word onto the stack, FIRST decrease \$sp by 4, then do sw
- to pop, do lw, THEN increase \$sp by 4
- if there are multiple things to store, then, put in a code example:
addi $sp, $sp, -8 sw $t0, ($sp) sw $t1, 4($sp) # Do things here, making use of the $t0 and $t1 registers if desired # Now get the words back into $t0 and $t1 lw $t0, ($sp) lw $t1 4($sp) addi $sp, $sp, 8
A very simplified and not-to-scale diagram of the situation is below:
- if a child function is also a parent, you'll probably have to push \$ra here
- may also need to preserve argument registers or temp registers in that case
- stack frames: the contiguous set of words on the stack used to store shit
- per function. makes sense: say you decrease \$sp by 12, then, 3 words in your stack frame
- if you don't know the stack frame size in advance (only determined during runtime) ...
- for example, if the amount of memory allocated is based on I/O stuff ...
- then you need to move the stack pointer as the stack frame changes size
- to mark the beginning of the stack frame, use \$fp
- don't know how the above really works or if it's ever actually necessary
2.10Multiplication and division¶
- since int multiplication takes two 32-bits and returns a 64 bit ...
- the product of
mult $t0, $t1
is located in a "product" register, which is read-only - accessed using two separate instructions
- mfhi \$t0, stores the "hi" (upper) 32 bits from it in \$t0
- mflo \$t1, stores the "lo" (lower) 32 bits from it in \$t1
- i think there is a "hi" register and a "lo" one, so two separate regs
- for division, call div
- mfhi to get the remainder and store it in a reg
- mflo to get the quotient and store it in a reg
- mult, div, mflo, mfhi are all R-format
2.11Floating point operations¶
- floating-point operations need to be done with special circuits/regs
- located on a physically separate chip (no room on the original lol)
- called the floating-point coprocessor/unit (FPU), coprocessor 1
- has its own set of registers, \$f0 to \$f31
- for double precision, two registers, starts with an even (\$f0, \$f2, etc)
- only need to mention the first register, the second is used automatically
- lwc1/swc1 to load/save to/from memory (address held in CPU register)
- for doubles, there are pseudoinstructions to make it faster: l.d, s.d
- mfc1/mtc1 to move from/to the CPU (specify the CPU register)
- note that the order of operands is reversed from the usual for mtc1
- FROM _ to _, not destination, source as it usually is
- e.g. mtc1 \$s0, \$f0
- floating point ops: add.s, add.d, sub.s, sub.d, etc
- have to use \$f? registers for the above
- can't mix and match \$f and non-\$f registers here
- similarly, can't use \$f registers for non-floating-point operations
- multiplication and division are easier here, same style as add.s etc
- mul.s, div.s, mul.d, div.d (no separate circuit, no remainder, etc)
- conditional branching: c.__.s/d \$f0, \$f1/2; bc1t label
- the result of the comparison is stored in a special D flip-flop
- bc1t label: if the condition is true, branch to label
- bc1f label: if the condition is FALSE, go to the label
2.12Type conversion¶
All arithmetic operations (add, sub, mult, etc) have to be done with either integers for both inputs or floats for both inputs - you can't mix and match. So what do you do if you have to perform some arithmetic operation on an integer and a float? Depending on the type of the desired result, you can either convert the integer to a float or the float to an integer, and perform that operation on the appropriate coprocessor.
To convert from int to float:
mtc1 $t0, $f0 cvt.s.w $f0, $f0
To convert from float to int:
cvt.w.s $f0, $f0 mfc1 $t0, $f0
Note that the above just truncates the value, so 1.01, 1.99, 1.00 etc would all become 1. Also note that this method of type conversion differs from using trunc.s, round.s, ceil.s, etc as the output type is still a float, part of the FPU.
- converting from single to double - tack on 0's but only to the significand part
- converting from double to single - this rounds the figure, as opposed to only truncating it
2.13Exception handling¶
- done in coprocessor 0
- an event that requires the program to stop, OS has to do something
- has special registers, used only in kernel mode:
- EPC, \$14: return address, the one that follows the line that caused the exception
- cause, \$13: code for what kind of exception occurred
- BadVaddr, \$8: holds the address that led to an exception if user tried to access an illegal address
- Status, \$12: whether the program is running in kernel mode or in user mode
- examples of things that cause errors/exceptions etc:
- overflow, e.g. adding 231 to itself
- division by 0 (same error as above)
- illegal address, e.g. trying to jump to something in the data segment
- note that trying to load from the text segment doesn't cause an exception as mips can't detect (2 lines etc)
- note that with floating points, certain things don't cause exceptions ...
- e.g. overflow, would result in +infinity in certain cases
- division by zero also results in +infinity
- except when dividing 0 by 0 - NaN
- what actually happens is that the program jumps to the first address in the kernel, 0x80000000
- so 0x80000000 is written into the PC by the PCSrc (the selector)
3Datapaths¶
3.1Single-cycle datapaths¶
- single cycle as in, one instruction per clock cycle
- first, to fetch the current instruction out of memory, store its address in the PC (program counter) register
- for a simple non-branching instruction, PC becomes PC + 4 at the end of the clock cycle
- only at the END i believe (you use PC for this instruction)
- then to get it out of memory, use a huge multiplexor circuit and PC, selects one of 2^30 words from memory
- the opcode + funct code (for R-format only) control what happens next
- for example, for add (R-format), after it is read out of memory:
- two ReadReg and one WriteReg is selected (the registers to read from/write to)
- the control signals for the ALU are determined needs explanation?
- RegData values are read from the source registers, sent as input to the ALU
- the ALU operation is performed
- result is written to the WriteReg
- the new value for the PC is written
- this can all be done in a single clock cycle, if it's long enough for the circuits to stabilise (it should be)
- (for example, the ALU operation involves carry values to ripple though, which takes time)
- another example, for lw (I-format):
- first two are the same, ReadReg and WriteReg selected
- base address for memory word is read from the source register, fed to the ALU
- the provided offset is sign-extended, also fed to the ALU
- control signals are set for the ALU operation needs explanation?
- ALU computes the sum of the base address + offset
- the resulting address is used to select a word from memory, which is then written to the register
- PC + 4 is written
- sw:
- first two are the same, two ReadReg are selected
- get the resulting memory address, offset + base using ALU etc
- the word in one of the registers is written to the specified address in memory
- PC + 4 is written
- bne:
- in this case, PC might not take the value PC + 4 at the end of the clock cycle
- note: because instructions are word-aligned in memory, MIPS assumes lowest two bits are 00
- therefore it doesn't store those two bits in memory, so the 16-bit field is really 16 + 00 at the end
- no shift register is required for turning it into a 32-bit word eventually: 16 -> 30 etc (sign-extended)
- anyways, select between PC + 4 and PC + 4 + offset
- read from the registers, subtract, if the result is NotZero, select PC + 4 + offset; else, PC + 4
- this value is written back into PC
- jump, J-format:
- 26 bits to specify address; as address is word-aligned, bits 0 and 1 are always 0 and 0
- so you specify bits 2-27 with the 26-bit address
- as for bits 28-31, based on the upper 4 bits of the current PC - 0000 for user, 1000 for kernel programs
3.1.1Merging datapaths¶
merging datapaths (i.e. how they all work together)
- Obviously, there is only one chip layout which has to accommodate all the different possible instructions
- So now we merge together all the datapaths and illustrate how they function in tandem
- The controls:
- RegDst: the register to be written to; bits 16-20 or 11-15 or not used at all, depending on the instruction
- ALUSrc: one of the ReadData registers is hardwired to be an input for the ALU
- the other input either comes from a register (e.g. in add) or the immediate (addi, etc)
- so we need a multiplexor to determine where the other input comes from
- MemToReg: selects where the thing to write to the register (if there is one) comes from (memory, or ALU)
- RegWrite: specifies whether or not we are writing to a register at all (also called WriteEnable in lecture notes)
- MemWrite: whether or not we are writing to memory
- MemRead: whether or not we are reading from memory
- ALUOp: specifies the operation that the ALU should perform (addition, subtraction, etc)
- Recall from previous lectures that the ALU has two control inputs
- One is the Binvert, which specifies whether or not we are doing subtraction
- The other is the "op" field, which controls if it's AND, OR, or XOR (i.e. +, includes subtraction)
- 10 for addition, 00 for AND, 01 for OR, 11 for slt (set less than; 1 if the difference is negative, 0 otherwise)
- show the truth table, make sure it's correct somehow
- If ALUOp is 10, then, use the funct field to determine Binvert and op
- Otherwise, the opcode determines Binvert and op directly
- Draw the circuit
- show a table linking the input (opcode) to the output (the values of the controls)
- In the case of bne, there are two things to compute: the final address, and whether or not the inputs are equal
- Since the ALU can only be used for one of them, we use a separate adder circuit for the address calculation
- draw out a basic version, and explain what's going on
- Datapath for jal
- Datapath for jr
- jr is obviously R format since there is no intermediate; funct field is used here
- Of course, this is all done in one clock cycle. We must choose a clock rate to ensure that the slowest operation gets carried out in that time
- The worst case is the lw instruction, and this is fairly slow
- But designing for the worst case doesn't seem like the best way of doing it
- Some instructions are quick, some are slow, but if they're all the same length they all become slow
- Which seems silly and inefficient
- Consequently, most MIPS implementations actually make use of multiple cycles, next section
3.2Multi-cycle datapaths¶
- Multi-cycle models: pipelining and microinstructions
- Continued from above: we break each instruction into short stages
- The clock cycle only has to be as long as the longest stage
3.2.1Microinstructions¶
The first method of doing this: microinstructions
- Hold the PC constant during an instruction, only update the PC at the END of the instruction (after the stages are all complete)
- Each stage in the instruction is defined by the control settings that carry the data from one datapath to the next
- i.e. by a microinstruction
- we can naturally partition each instruction into stages based on the main actions performed
- that is, fetching the instruction, reading from the source register, ALU stuff, accessing memory, writing to the dest register
- so, 5 stages max, per instruction (not all stages needed for each instruction)
- we then have a bunch of different possible stages, each with its own microinstruction
- certain specific stages are used in more than one instruction - for example, fetch instruction
- (in which the PC address is read from memory, etc)
- this microinstruction is used in every instruction
- but we need a way of knowing which microinstruction we're on, and which is next
- solution: microPC, which is just like PC except at a lower level, of sorts
- each microinstruction/stage thing has 2 controls: next, and fetch
- if "next" is 1, then the next microinstruction will just be the next address (presumably microPC + 4, if each is a word)
- if next is 0, then the next instruction is "fetch instruction", so jump to there
- fetching the next instruction is always the last real microinstruction for this instruction
- the next microinstruction is the first one for the next instruction, if there is one
- of course, we need a table to store the first microinstruction address for each instruction
- this could be implemented as a read-only combinatorial circuit
- PCWriteEnable is a control that specifies when to write to PC (at the end of the instruction)
- the lecture notes indicate that this is set only when Next=Fetch=0, so presumably this happens at the same time as fetching?
- Not really sure ...
- anyway, although this method is an improvement over the single cycle, it has its downsides
- for instance, most parts of the CPU are doing nothing in any given clock cycle
- since only one thing is active at a time (only one microinstruction etc)
- for ex, if the ALU is active, then all the memory-related stuff is just idling
- so now we come to another method
3.2.2Pipelining¶
- the PC still changes at the end of each clock cycle
- but more than one instruction is in the data path at any given time (sort of like parallel processing)
- you still partition instructions into stages
- but, you have more than one stage running per clock cycle
- ideally, one instruction completes per clock cycle (this is the maximum efficiency)
- if there are 5 stages in a pipeline, then there's a 5x speedup relative to the single cycle model
- this reduces the time per clock cycle, AND ALSO the clock cycles per instruction (theoretically)
- in any case, here's the process:
- each stage completes in one clock cycle (which is long enough for the longest stage)
- there are five stages:
- IF: instruction fetch
- ID: instruction decode AND register read
- ALU: ALU operation
- MEM: reading from/writing to memory
- WB: writing data into the destination register
- we need new registers to control the flow from one stage to another
- because new instructions are needed for ID after IF, etc
- so there's IF/ID for that case, which contains the fetched instruction
- and ID/ALU, ALU/MEM, MEM/WB, each of which contains data necessary to execute the next stage
- these registers only hold their value for one clock cycle, they propagate along sort of, or overwritten by the next instruction
- data hazards in pipelining:
- when a register is read from (second instruction) before it can be written to (first instruction)
- example: adding two numbers and storing it in \$t0, then subtracting \$t0 from something else right after
- obviously this will produce the wrong result
- as in, by the time add is in the ALU stage, sub will be in ID, and so will read the OLD value of \$t0
- several ways to avoid this; one is to insert a "nop" (no operation) between add and sub
- basically, stalling
- but this is not the most desirable, as this shuts down the pipeline, which wastes a turn ...
- another method: data forwarding
- this way, the results of the addition could be carried forward to sub, so no useless write/read of registers is needed
- we could put a new multiplexor in front of each ALU - if the one we're writing to is the same as the one we're reading from next
- then, pass it backwards (not really forwards, physically) so that sub (in this case) can use the computed value for \$t0
- another example: lw, then add
- we could stall by adding 3 nops
- or, data forwarding, by stalling with one nop, then forward (pass backwards) the value read at MEM/WB as an ALU input
- we could add multiplexors in this case, may draw diagrams later
- another solution is to reorder, by taking other nearby instructions (that don't use the lw register etc) and putting them betwee
- control hazards:
- unconditional jumps:
- you only realise that you need to jump after the sequentially "next" instruction has been fed in
- so you need to get rid of it once you've reached the ID stage and determined that you have to jump somewhere else
- this is done by writing nop instead of that instruction (overwriting it)
- stalls it by one clock cycle
- not really sure how this is done ...
- conditional jumps (branching):
- you can't really know until after the ALU stage whether or not you need to branch
- one way is to stall until the ALU stage is done, after which, feed in the next instruction (whichever is necessary)
- another: put in the sub the usual way, but change it to nop (also: change RegWrite to 0) if the branch is taken (not sure how)
- or: we can reorder, by taking the instruction before the beq (if there is no clash etc) and putting it after
- the instruction before is executed anyway, regardless of the branch condition, so it should only be after in the assembled code
- i.e. it's all taken care of by the compiler
- unconditional jumps:
- incidentally, for floating point ops, the "execute" stage (equivalent of ALU stage) is broken down into multiple stages
- so, it takes more than one clock cycle to finish the computation, though each individual stage is still one clock cycle
- this makes it more complicated etc, but we're not going to go into that
- Comparing computer speeds
- comparing computer speeds based on their clock speeds is naive
- it's only one indicator of performance
- a more useful way is to see how two computers run a particular program on some input (high level --> assembly)
- three factors come into play: time per clock cycle, clock cycles per instructions, instructions in the program (all are variable)
- (the last one is variable because the number of instructions in the program depends on the assembly language for the CPU)
- (this is the number of instructions EXECUTED, not necessarily in memory ... takes into account loops, etc)
- if the time per clock cycle is low (high clock speed), that's good, BUT, it might mean more cycles per intruction
- of course, number of clock cycles per instruction can vary for multi-cycle models (always 1 for single-cycle implementation)
- also, for computers with multiple processors, it's harder to compare, depends on how work is divvied up among the processors