one-zero-six

Intro to 106

Signed

  • Range (8-bit): -128 to 127

  • Leftmost bit is used to denote positive or negative

  • General range: -2^(n-1) to 2^(n-1) - 1

Unsigned

  • Range (8-bit): 0 to 255

  • All bits are used for the number

  • General range: 0 to (2^n) - 1

funny expressions

  • -2 < 1U etc etc

    • -2 is evaluated as signed.

    • 1U is evaluated as unsigned.

    • When the two are mixed, the signed -2 is converted to unsigned (a huge value), so -2 < 1U is false.

Funny expression -> To do

Signed with Unsigned ⇒ Treat every number as unsigned, then evaluate

Signed with Signed ⇒ Evaluate as is

Unsigned with Unsigned ⇒ Evaluate as is
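A minimal C sketch of these rules (the printed values follow standard C conversion behavior):

#include <stdio.h>

int main(void) {
    // Signed with unsigned: -2 is converted to unsigned and becomes
    // a huge value, so the comparison is false (prints 0).
    printf("%d\n", -2 < 1U);

    // Signed with signed: evaluated as is, so this is true (prints 1).
    printf("%d\n", -2 < 1);
    return 0;
}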

In the cheat sheet:

char is not signed by default; the rest are.

char  -> 0xAA (1 byte)
short -> 0xAAAA (2 bytes)
int   -> 0xAAAA AAAA (4 bytes)
long  -> 0xAAAA AAAA AAAA AAAA (8 bytes)

Sorting questions:

1) Find which values are negative when read as signed (a leading hex digit of 8 to F means the sign bit is set)

2) Sort
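A minimal C sketch of step 1, reading the same byte as signed vs unsigned (0xB4 is a made-up value):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    // Leading hex digit is B (in the 8 to F range), so the sign bit is set.
    uint8_t u = 0xB4;          // unsigned: 180
    int8_t  s = (int8_t)0xB4;  // signed: -76 in two's complement

    printf("unsigned: %d\n", u);
    printf("signed:   %d\n", s);
    return 0;
}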

Bits and Bytes

  • 2's complement allows us to convert subtraction into addition, which simplifies hardware design.

    • Ie. Addition: A + B

    • Ie. Subtraction: A + (~B + 1)

      • A adds the negation of B, which can be done simply by flipping the bits of B (⇒ ~B) and adding 1

1 hex digit = 4 bits, so 2 hex digits = 1 byte. Type sizes grow char -> short -> int -> long.
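A small C sketch of subtraction via two's complement (the operands are made up):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t a = 7, b = 3;

    // -b is ~b + 1, so a - b == a + ~b + 1 (mod 2^8).
    uint8_t diff = a + (uint8_t)(~b + 1);

    printf("%d\n", diff);             // 4
    printf("%d\n", (uint8_t)(a - b)); // same result: 4
    return 0;
}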

  • Shifting

    • Denoted by >> and <<

    • Always shift in bits. << 8 means shift 8 bits, which is 2 hex digits (1 byte).

    • If there is a shift of << then >>, it is extracting the rightmost byte of a field: the left shift discards the upper bits, and the right shift moves the byte back.

    • Logical vs Arithmetic Shift

Logical

  • For unsigned integers

  • Literal shifting: vacated bits are filled with 0, no extension or anything

Arithmetic

  • For signed integers

  • For dividing (right shift >>) or multiplying (left shift <<) by powers of 2

  • Right shift sign-extends the MSBit
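A small C sketch contrasting the two shifts and the <<-then->> idiom (0xAA and 0xD3C2 are made-up values; arithmetic right shift of a negative signed value is technically implementation-defined in C, but sign extension is the common behavior):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t u = 0xAA;          // 1010 1010
    int8_t  s = (int8_t)0xAA;  // -86 in two's complement

    // Logical right shift on unsigned: vacated bits filled with 0.
    printf("0x%02X\n", (unsigned)(u >> 2));  // 0x2A

    // Arithmetic right shift on signed: sign-extends the MSBit.
    printf("%d\n", s >> 2);                  // typically -22

    // << then >> extracts the rightmost byte of a 16-bit value
    // (the cast keeps only 16 bits, mimicking a 16-bit register).
    uint16_t x = 0xD3C2;
    uint16_t low = (uint16_t)(x << 8) >> 8;
    printf("0x%02X\n", (unsigned)low);       // 0xC2
    return 0;
}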

  • Overflow

    • Does it leave the range of the given datatype?

  • Bit Masks

    • Checking for MSB/LSB

      • B can stand for bit or byte here
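A small C sketch of both cases, masking for the MSBit and for the least significant byte (the values are made up):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t x = 0xB4;  // 1011 0100

    // Check the most significant bit with a mask.
    int msb_set = (x & 0x80) != 0;  // 1: the MSBit is set

    // Extract the least significant byte of a wider value.
    uint32_t word = 0x12345678;
    unsigned lsb = word & 0xFF;     // 0x78

    printf("%d 0x%02X\n", msb_set, lsb);
    return 0;
}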

Assembly

  • Most instructions update the condition flags, and some will NOT store their results in the destination register/memory address

    • leaq does not update flags; it does store its result (the computed address) in the destination register

    • cmpq is similar to subq; not addq

      • Updates flags, but does not store the result

    • testq is similar to a bitwise AND

      • Also updates flags, and does not store the result

  • Little Endian

    • Smallest byte ('Least Significant Byte') ends up in the lowest address

    • Ie. [Byte 2] [Byte 3] [Byte 4]

    • >> [Byte 4] [Byte 3] [Byte 2]

  • Big Endian

    • First byte ('Most significant byte') ends up in the lowest address

    • Ie. [Byte 2] [Byte 3] [Byte 4]

    • >> [Byte 2] [Byte 3] [Byte 4]
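A small C sketch that reveals the byte order by inspecting memory (0x04030201 is a made-up value; the little-endian line is what an x86 machine prints):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 0x04030201;
    uint8_t *p = (uint8_t *)&x;  // view the value byte by byte

    for (int i = 0; i < 4; i++)
        printf("address+%d: 0x%02X\n", i, (unsigned)p[i]);

    // Little endian: 0x01 0x02 0x03 0x04 (LSByte at lowest address)
    // Big endian:    0x04 0x03 0x02 0x01 (MSByte at lowest address)
    return 0;
}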

When doing questions: a > in the C code converts to a jump on ≤ in the assembly (the jump is taken, skipping the body, when the condition fails)

Caching

Only a program can have temporal locality and spatial locality (locality is a property of the program's access pattern, not of the data itself).

  • Good temporal locality + Spatial locality:

int sum = 0;
int array[5] = {1, 2, 3, 4, 5};

// Spatial locality: the elements are accessed one after another (stride-1).
for (int i = 0; i < 5; i++) {
    sum += array[i];
}

// maybe later
sum += array[2];

// Temporal locality: array[2] (and sum) were accessed recently,
// so they are probably still in cache and this second access is very fast.

  • Bad temporal locality:

int sum = 0;
int array[10000];

for (int i = 0; i < 10000; i++) {
    sum += array[i];
}

// Much later, after a lot of other work...
sum += array[2];

// A lot of time passes before accessing array[2],
// so array[2] might no longer be in the cache.

  • Program reuses data too late ⇒ Bad temporal locality

    • When checking temporal locality -> Are there repeated accesses to the same data?

Processor-Memory Bottleneck

  • Core 2 Duo (CPU + Reg)

    • Process 256 bytes/cycle

    • Bus bandwidth is only 2 bytes/cycle from main memory

⇒ So, the CPU spends a lot of time waiting for the bus to transport data from main memory to the Core 2 Duo for processing (also called slow memory access)

Cache

  • Small and fast storage device (compared to Main Memory)

  • Acts as a staging area for a subset of frequently or recently used instructions or data

    • Instructions are in i-cache

    • Data are in d-cache

Main Memory-Cache Interaction

  • Data is copied in blocks from Main Memory to Cache

Cache Hit

  • When CPU calls for block n, and block n is in cache.

Cache Miss

  • When CPU calls for block n, and block n is not in cache.

  • So, a request is then made to main memory to fetch block n

  • Here, based on the cache's Mapping Policy, the cache decides where block n will be placed

    • The Replacement Policy of the cache adds onto the mapping policy: if there is no free space left, this policy decides which block to evict from the cache so that block n can be placed there for CPU access.

Cache Storage

  • Is faster than main memory; when a program runs, it accesses memory in patterns, which let the cache predict what data the program will need next and store it in advance

    • Temporal pattern/locality (Time-based)

      • If the program uses data now, it will likely use it again soon

      • Recently referenced items are likely to be referenced again

    • Spatial pattern/locality (Nearby)

      • If the program accesses one piece of memory, it will likely access nearby memory

      • Items with nearby addresses tend to be referenced close together in time

        • Stride-1 access is good spatial locality

Cache Performance Metrics

Cache Hit Access Time = HT
Cache Miss Access Time = HT + MP

  • Hit Time (HT)

    • Time to deliver block in cache to CPU

  • Miss Penalty (MP)

    • Additional time required from querying the main memory, fetching the block, and returning to cache because of a cache miss

Miss Rate = Misses / Total Accesses = 1 - Hit Rate
  • Miss Rate (MR)

    • Percentage of times the CPU does not find the needed data in cache

For questions that require you to find the MR:

1) Find the number of sets and blocks per set of the cache

2) Find the datatype and how many of that datatype can fit in one block

3) If there is a loop, how many iterations of the loop hit the same block, based on the number from step 2?

4) In each iteration, count the memory operations (reads + writes) and the cache misses

5) MR = total misses / total memory operations (misses per repeat × repeats, over memory ops per iteration × iterations)
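A hypothetical worked example of this recipe, assuming a cold direct-mapped cache with 16-byte blocks and 4-byte ints (so 4 ints per block):

#include <stdio.h>

int a[16];  // 16 ints = 64 bytes = 4 cache blocks

int main(void) {
    int sum = 0;

    // Stride-1 loop: 16 reads of a[i] (sum stays in a register).
    for (int i = 0; i < 16; i++)
        sum += a[i];

    // Each block holds 4 ints: the first access to each block is a
    // cold miss, the next 3 are hits.
    // Misses = 16 / 4 = 4; memory ops = 16; MR = 4/16 = 25%.
    printf("%d\n", sum);
    return 0;
}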

  • AMAT

    • Average time to access memory considering both hits and misses

AMAT = HT + MR × MP

AMAT can be expressed in clock cycles, or picoseconds.

*Refer to Slide 31 of slides Caching Part 1.

  • To have the most significant decrease in AMAT, first reduce MR, then MP, then clock cycle.

  • Calculating Multi Cache AMAT

    • AMAT = L1 HT + L1 MR × (L2 HT + L2 MR × (… Ln HT + Ln MR × MP))
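A small C sketch of the multi-level formula (all numbers are made up):

#include <stdio.h>

int main(void) {
    // Hypothetical two-level cache:
    double l1_ht = 1.0, l1_mr = 0.10;  // L1 hit time (cycles), miss rate
    double l2_ht = 10.0, l2_mr = 0.05; // L2 hit time (cycles), miss rate
    double mp = 100.0;                 // main-memory miss penalty (cycles)

    // AMAT = L1 HT + L1 MR x (L2 HT + L2 MR x MP)
    double amat = l1_ht + l1_mr * (l2_ht + l2_mr * mp);
    printf("AMAT = %.2f cycles\n", amat);  // 1 + 0.10 x 15 = 2.50
    return 0;
}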

L1 and L2 Cache

  • L1 is closest to CPU, followed by L2

  • Adding an extra layer of cache improves performance by reducing AMAT

Mapping Policy

  • Block Size (K) -> allows one to find the Offset bits

    • Unit of transfer from MM to cache

  • Cache Size -> allows one to find out the Index

    • Amount of data in bytes that the cache can store

  • Tag

    • Identifies where the block came from in MM.

Find the nearest block-aligned starting address to determine which other addresses are loaded along with the requested address (see the address-breakdown sketch after this section).

  • Index

    • May map to one or multiple blocks

      • Direct mapped cache (E=1)

        • Each memory address maps to exactly one index

        • Conflicts in the cache happen often

      • Fully associative cache

        • Any block can be placed in any line (the whole cache is one set)

      • Set associative cache (or E-way set)/ Associativity

        • E is defined to be either 1, 2, 4 or 8, where the number dictates how many blocks there are in each set.

          • 8-way: meaning that there are 8 blocks in 1 set.

        • Increasing E decreases conflict misses and therefore tends to decrease the miss rate.

          • Think -> increasing E increases depth, so DFS vs BFS

          • *However, you need to consider the access patterns

            • If all values can be accessed in the same block, then E-way does not impact.

Question on largest/smallest memory range that starts at (address):

1) Calculate which set it is in

2) Find how many ways (E) the set has, and depending on E, calculate the largest memory range.
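A small C sketch of splitting an address into Tag/Index/Offset, for a hypothetical cache with 16-byte blocks (4 offset bits) and 8 sets (3 index bits):

#include <stdio.h>

int main(void) {
    unsigned addr = 0x1A2B;  // made-up address

    unsigned offset = addr & 0xF;        // low 4 bits select the byte
    unsigned index  = (addr >> 4) & 0x7; // next 3 bits select the set
    unsigned tag    = addr >> 7;         // remaining bits identify the block

    printf("tag=0x%X index=%u offset=%u\n", tag, index, offset);
    return 0;
}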

Collision

  • When different addresses map to the same cache index

Reading Tips

  • Read Little Endian

    • ie. a 2-byte read starting at byte 2 gives D3C2 (byte 3, byte 2), not C2D3 (byte 2, byte 3), because the byte at the higher address is more significant.

  • If no line has a matching valid tag, it is a cache miss: no bytes are returned, and one line is evicted according to the replacement policy

Setup of the Table

Set | Recently Used (RU) | Tag | Byte 0 | Byte 1
 0  |                    |     |        |
 1  |                    |     |        |

(one row per line in the set; shown here for Line 0)

Replacement Policy

  • Replacement is needed when a fully-associative set or E-way set is full, requiring eviction of one block so that there is space for the newly requested block.

  • FIFO

    • Oldest block is replaced

  • LRU

    • Least recently used

    • Best but hard to implement

  • Random

    • A random block is replaced

Cache Write-Hit

1) Write-Through

  • Immediately updates MM with the change when the cache is updated

  • Slow

2) Write-back

  • Cache keeps new data, does not update MM immediately

    • Ie. When adding new data to a block, the block is marked as "dirty" with the dirty bit in the cache.

  • MM is updated only when the cache block is replaced/evicted to make room for other blocks

    • Uses dirty bit to track which have been modified

  • Faster, but they may be out of sync for a while

Cache when Write-Miss

  • When the data you want to change is not in the cache

1) Write-Allocate

  • Loads data from MM to cache

  • Write new data to cache

  • Pro: Good when the same data is read and written many times afterwards

2) No-Write-Allocate

  • Directly writes to MM

  • Cache not updated

  • Pro: Good for data that will not be accessed again soon


Virtual Memory

  • Memory management technique

  • Allows programs to act as if there is more physical memory than there actually is in the computer

    • Does this via paging, which organises data into pages (large, fixed-size blocks of memory)

      • A page is larger than a cache block size

      • Fully associative and Write-back.

      • Complicated replacement policy

  • Each process has its own page table, which translates virtual address to physical address.

  • Not every virtual page is mapped to a physical page at any given time

Address Translation for VM

  • Each memory access requires translating a virtual address (VA) into a physical address (PA).

  • Processor allocates the virtual address and passes it onto the MMU, where it does a translation to provide a physical address in main memory.

    • For translation, it uses a lookup table called the Page Table (PT)

      • Indexed by VPN, where there is an entry for every virtual page

      • Each PT entry stores the PPN and management bits (Valid, Dirty)

    • MMU then sends the physical address to the cache/main memory

  • In the processor, the virtual address can be split into

    • Virtual Page Number (VPN) + Virtual Page Offset (VPO)

    • Because the physical page size equals the virtual page size, the Physical Page Offset equals the Virtual Page Offset
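A small C sketch of the VPN/VPO split and translation, assuming 4 KiB pages (12 offset bits) and a made-up page-table mapping:

#include <stdio.h>

int main(void) {
    unsigned va = 0x0003A7C4;  // made-up virtual address

    unsigned vpn = va >> 12;    // Virtual Page Number
    unsigned vpo = va & 0xFFF;  // Virtual Page Offset

    // Suppose the page table maps this VPN to PPN 0x15 (made up).
    unsigned ppn = 0x15;
    unsigned pa  = (ppn << 12) | vpo;  // PPO == VPO

    printf("VPN=0x%X VPO=0x%X PA=0x%X\n", vpn, vpo, pa);
    return 0;
}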

Page Hit

  • VM reference is in physical memory

Page Fault

  • VM reference is NOT in physical memory, which causes an exception

  • Page Fault handler then selects a victim page in physical memory to be evicted.

  • The faulting instruction is then restarted, and this time it results in a page hit

Translation Lookaside Buffer (TLB)

  • A cache, stored in the MMU

    • It caches page table entries.

  • Stores frequently used virtual-to-physical address mappings

  • If TLB miss, CPU must access page table in memory
