l19_handouts_4up - More (Oh YES!) on Memory

More (Oh YES!) on Memory

Online Course Evals (DO IT, DO IT, DO IT): Wed 23 Apr - Wed 7 May (FLEXGRADE!)
- http://www.engineering.cornell.edu/courseeval
- We listen to your feedback, so you can help next year's students

Mistake in Lecture 15 slides
- We're taking our mistake into account wrt grading
- Should have put the corrected datapath up on the website ASAP

What We've Covered Thus Far
- SRAM and DRAM
- Adding caches to reduce average memory latency
- Locality principles
- Memory hierarchy structure and relative latencies
- Direct-mapped vs. associative caches
- Replacement policies for associative caches
- Write policies

Reading: Hennessy & Patterson
- Read 7.4 (it's long, alas)
- Read 8.1-8.7 for next week

Copyright Sally A. McKee 2006

Other Cache Design Decisions
Write policy: how to deal with write misses?
- Write-through / no-allocate
  - Total traffic? read misses x block size + writes
  - Common for L1 caches backed by L2 (especially on-chip)
- Write-back / write-allocate
  - Needs a dirty bit to determine whether cache data differs from memory
  - Total traffic? (read misses + write misses) x block size + dirty-block evictions x block size
  - Common for L2 caches (memory bandwidth limited)
- Variation: write validate
  - Write-allocate without fetch-on-write
  - Needs a sub-blocked cache with valid bits for each word/byte
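The two traffic formulas above can be checked with a small calculation. The miss and write counts below are made-up illustrative numbers, not from the lecture:

```python
# Total memory traffic (in bytes) under the two write policies above.

BLOCK_SIZE = 16  # bytes per cache block, as in the lecture's examples

def write_through_traffic(read_misses, writes, word_size=4):
    # Write-through / no-allocate: fetch a block per read miss,
    # and every write goes to the next level (one word each).
    return read_misses * BLOCK_SIZE + writes * word_size

def write_back_traffic(read_misses, write_misses, dirty_evictions):
    # Write-back / write-allocate: fetch a block per miss (read or write),
    # plus a block written back for each dirty eviction.
    return (read_misses + write_misses) * BLOCK_SIZE + dirty_evictions * BLOCK_SIZE

print(write_through_traffic(read_misses=100, writes=1000))               # 5600
print(write_back_traffic(read_misses=100, write_misses=50, dirty_evictions=40))  # 3040
```

With a write-heavy workload the write-through traffic is dominated by the writes themselves, which is why write-back is preferred when memory bandwidth is the limit.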
Other Cache Design Decisions: Write Buffering
- Delay writes until bandwidth is available
  - Put them in a FIFO buffer
  - Only stall on a write if the buffer is full
  - Use bandwidth for reads first (since they have latency problems)
- Important for write-through caches (frequent write traffic)

The Unnecessary but Well Meant Slide
- Plan your time carefully for the final project
- Organize together with your team to make sure you can work on things TOGETHER and AT THE SAME TIME
- Otherwise, if you divide stuff up into portions, you're going to miss out on the learning of doing the other person's portion
- Think about the industrial perspective: remember that it's just as important to VALIDATE as to DESIGN, so switch roles amongst teammates

Write-Back Buffer
- Holds evicted (dirty) lines for write-back caches
- Gives reads priority on the L2 or memory bus
- Usually only needs to be a small buffer

Other Cache Design Decisions
- Why do we use the uppermost bits as the tag?
- Why do we use the middle bits as the cache index?

Today
- More examples
- Advanced cache designs
- Prefetching

Let's Do Some Examples
- What if you have a 64KB cache that's 2-way associative?
  - How big is each set? How big is each way?
  - WHAT QUESTION DO YOU NEED TO ASK BEFORE YOU CAN ANSWER?
- What if the cache is 512KB and 4-way?

Array Addresses
- An address splits into fields: tag | index | word offset | byte offset

Another Example
Consider the following MIPS code with our direct-mapped cache:
- A simple, direct-mapped cache: 128-byte cache, 16-byte (4-word) blocks/lines
- Addresses: x (tag = 1, index = 4), y (tag = 4, index = 0), s (tag = 5, index = 2)
- At the end of the 1st loop iteration, this is the cache state
- What if all the variables had the same index?
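The question you need to ask first is the block size. Assuming 16-byte blocks (the size the direct-mapped example uses), the geometry of both example caches, and the tag/index split for the 128-byte direct-mapped cache, can be sketched as:

```python
# Cache geometry and address splitting for the examples above.
# Assumes 16-byte blocks throughout.

def cache_geometry(cache_bytes, ways, block_bytes=16):
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    set_bytes = ways * block_bytes      # one block from each way
    way_bytes = cache_bytes // ways
    return sets, set_bytes, way_bytes

print(cache_geometry(64 * 1024, 2))    # (2048, 32, 32768): 2KB-entry, 32KB ways
print(cache_geometry(512 * 1024, 4))   # (8192, 64, 131072)

# The 128-byte direct-mapped cache has 8 blocks -> 3 index bits, 4 offset bits.
def split(addr, index_bits=3, offset_bits=4):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(split(0x0C0))  # (1, 4, 0): x's tag = 1, index = 4, as on the slide
```

The same helper reproduces y (tag = 4, index = 0, so y is at 0x200) and s (tag = 5, index = 2, so s is at 0x2A0), the addresses reused in the four-way example that follows.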
New Cache Organization: Four-Way Associative Cache
- Same example, new organization: a simple, 4-way associative cache
- 128-byte cache, 16-byte (4-word) blocks/lines, so 2 sets (S0 and S1)
- Each set holds 4 ways (way0-way3), each with a valid bit, tag, and LRU info
- Addresses: x (tag = 6, index = 0), y (tag = 16, index = 0), s (tag = 21, index = 0)

Four-Way Associative Cache: Addresses
- Address of x is 0x0C0: tag 000000000000000000000000110, index 0, word offset 00, byte offset 00
- Address of y is 0x200: tag 000000000000000000000010000, index 0
- Address of s is 0x2A0: tag 000000000000000000000010101, index 0
- Anybody see a performance problem coming here?

Four-Way Associative Cache: Request Sequence
- Requests: load x[0], load y[0], store s[0] - first iteration misses
- Requests: load x[1], load y[1] - second iteration hits
- Third iteration? Fourth iteration? What if the vectors were longer?
- Requests: load x[4], load y[4] - now x, y, and s map to index 1 (tags unchanged)

More Advanced Cache Organizations
- Want the advantages of associativity w/o the costs
- Victim caches (Jouppi [pronounced "Joepy"])
- Hash/rehash caches
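The miss/hit pattern on these slides can be replayed with a minimal 4-way set-associative model with LRU replacement (x = 0x0C0, y = 0x200, s = 0x2A0, 16-byte blocks, 2 sets):

```python
# Minimal 4-way set-associative cache with LRU replacement,
# replaying the slide's access pattern.

BLOCK = 16
SETS = 2
WAYS = 4

sets = [[] for _ in range(SETS)]  # each set: list of tags, most recent last

def access(addr):
    index = (addr // BLOCK) % SETS
    tag = addr // BLOCK // SETS
    ways = sets[index]
    if tag in ways:
        ways.remove(tag)
        ways.append(tag)       # refresh LRU position
        return "hit"
    if len(ways) == WAYS:
        ways.pop(0)            # evict the least recently used tag
    ways.append(tag)
    return "miss"

x, y, s = 0x0C0, 0x200, 0x2A0
# First loop iteration: load x[0], load y[0], store s[0]
print([access(a) for a in (x, y, s)])      # ['miss', 'miss', 'miss']
# Second iteration touches the same blocks (x[1], y[1], s[1])
print([access(a + 4) for a in (x, y, s)])  # ['hit', 'hit', 'hit']
```

All three tags (6, 16, 21) fit in set 0's four ways, so after the compulsory misses the rest of the block hits; in the direct-mapped version with conflicting indices, the same pattern would thrash.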
Adding a Victim Cache
- A direct-mapped L1 plus a small fully associative victim cache (e.g., 4 lines)
- A small victim cache adds associativity to "hot" lines
- Blocks evicted from the direct-mapped cache go to the victim cache
- Tag compares are made to both the direct-mapped cache and the victim cache
- Victim hits cause lines to swap between L1 and the victim cache
- Not very useful for associative L1 caches

Hash-Rehash Cache (direct mapped)
- Two probes per access: a primary (hash) probe, then a rehash probe using a second hash function
- Ref 11010011: primary miss, rehash miss - allocate
- Ref 01010011: primary miss (conflict), rehash probe at 01000011 also misses - allocate?
- Ref 11010011 again: primary miss, but the rehash probe at 11000011 hits!

Hash-Rehash Cache: Calculating Performance
- Primary hit time (normal direct mapped)
- Rehash hit time (sequential tag lookups)
- Block swap time?
- Hit rate comparable to two-way associative

Compiler Support for Caching
- Array merging (array of structs vs. two arrays)
- Loop interchange (row vs. column access)
- Structure padding and alignment (malloc())
- Cache-conscious data placement
  - Pack the working set into the same line
  - Map to non-conflicting addresses if packing is impossible

Prefetching
- Already done: loading an entire line assumes spatial locality
- Extend this...
- Next-line prefetch: bring in the next block in memory too on a miss
- Very good for the I-cache (why?)
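A hash-rehash lookup can be sketched as follows. This is a toy model with block addresses and 16 lines, and it assumes the rehash function flips the top index bit (one common choice; the slides do not pin down the exact hash functions):

```python
# Sketch of hash-rehash lookup in a direct-mapped cache: probe a primary
# index, then a rehash index, before declaring a miss.

INDEX_BITS = 4
LINES = 1 << INDEX_BITS

cache = [None] * LINES  # each line holds a full block address (or None)

def lookup(addr):
    primary = addr & (LINES - 1)
    rehash = primary ^ (1 << (INDEX_BITS - 1))  # assumed: flip top index bit
    if cache[primary] == addr:
        return "primary hit"
    if cache[rehash] == addr:
        # swap so the hot block moves to its primary slot
        cache[primary], cache[rehash] = cache[rehash], cache[primary]
        return "rehash hit"
    # miss: displace the primary occupant into the rehash slot, like an eviction
    cache[rehash] = cache[primary]
    cache[primary] = addr
    return "miss"

print(lookup(0b11010011))  # miss
print(lookup(0b01010011))  # conflicts at the primary slot, rehash also misses
print(lookup(0b11010011))  # found via the rehash probe -> rehash hit
```

The second probe is sequential, which is why a rehash hit is slower than a primary hit even though the overall hit rate approaches that of a two-way associative cache.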
Prefetching (continued)
- Software prefetch: loads to R0 have no data dependency
- Aggressive/speculative prefetch is useful for L2
- Speculative prefetch is problematic for L1

Virtual Memory
- How do we fit several programs in memory at once?
- How does the assembler/compiler know what addresses are OK to use?
- Multiprogramming allows >1 program to run "at the same time" (we've been over this one)
- Who keeps track of what's where?

Pages
- Divide memory into chunks
  - For caches: lines or blocks
  - For main memory: pages
- Assume virtual page size == physical page size; 4KByte pages for now
- 0x00000000 is invalid (remember why?)
- Valid pages start at 0x00001000, 0x00002000, ...

Virtual to Physical Mapping
- Virtual addresses (e.g., up to 0x40000000) map onto physical memory (DRAM)
- x, y, and s would have been in one of these virtual pages, so they would have had different physical addrs
- Example: virtual address 0x00001000 -> physical address 0x0000C000
  - Virtual page number 0x00001, page offset 0x000
  - Translation replaces the page number; the page offset doesn't change!

Page Table
- What if the page table entry is invalid? Page misses (Fig 7.22 greatly oversimplifies)
- What lives in an address space: code, static variables, stack, heap... the page table itself?
- Swapping (need a replacement policy, probably LRU)

Discussion of VM: Page Table Contents
- The page table maps a virtual page number (plus page offset) to a physical page number (plus the same page offset)
- Each entry holds the physical page number and bookkeeping info:
  - V: valid
  - D: dirty
  - R: read-only/writeable
  - X: executable
  - K: OS kernel access only

Page Table (cost)
- Each process's page table is in memory
- But memory is SLOW! What can we do?
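The translation step above can be sketched with 4KB pages and the slide's example mapping (virtual 0x00001000 to physical 0x0000C000). The dict here is a made-up stand-in for the real per-process page table:

```python
# Toy virtual-to-physical translation with 4KB pages (12 offset bits).

PAGE_SIZE = 4096
OFFSET_BITS = 12

page_table = {0x00001: 0x0000C}  # virtual page 1 -> physical page 0xC (assumed)

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)   # the page offset doesn't change
    ppn = page_table.get(vpn)
    if ppn is None:
        raise KeyError(f"page fault at virtual page {vpn:#x}")
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x00001000)))  # 0xc000
print(hex(translate(0x00001234)))  # 0xc234: new page number, same offset
```

Note that only the page number passes through the table; the offset bits are copied straight across, which is what lets the TLB lookup on the next slides overlap with the cache access.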
TLBs
- Size: 16-512 entries
- Tag is the virtual page number
- Need to track LRU information (this is the ref bit in the book)
- TLB lookup takes place in parallel with cache access
- Typically high hit rates (miss < 1% of the time)
- L1 cache indexed by virtual addresses
- L2 can be virtual or physical (if the TLB hits, we'll have the physical address in time for the lookup)

Translation Lookaside Buffer
- Why "TLB"? It's a cache for PTEs
- Looks a lot like a little page table: each entry has a tag, physical addr, and v/d/r/x/k bits
- Usually fully associative
- May have two levels (like a regular cache)

TLBs (continued)
- It's just a cache for our current working set of Page Table Entries
- Replacement can be hardware or software
  - Simple TLB hardware replacement is easy
- Can have other associativities, multiple levels
- When an entry is evicted, its bookkeeping bits must be updated in the page table

TLB Misses
- What if an entry is invalid? Not present?
- On a TLB miss, if the page is present, get the PTE from memory
- If not present, the OS handles the page fault
  - Can be very expensive! Many memory accesses
  - Need to transfer control to the OS
  - Possibly disk accesses or network accesses
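A small fully associative TLB with LRU replacement, sitting in front of the page-table walk, can be sketched as follows. The TLB size and VPN-to-PPN mappings are illustrative, not from the lecture:

```python
# Sketch of a fully associative TLB with LRU replacement caching PTEs.
from collections import OrderedDict

PAGE_TABLE = {0x1: 0xC, 0x2: 0x7, 0x3: 0x9}  # made-up VPN -> PPN mappings
TLB_ENTRIES = 2

tlb = OrderedDict()  # VPN -> PPN, ordered least- to most-recently used

def lookup_vpn(vpn):
    if vpn in tlb:
        tlb.move_to_end(vpn)     # refresh LRU position
        return tlb[vpn], "TLB hit"
    ppn = PAGE_TABLE[vpn]        # TLB miss: walk the page table (slow!)
    if len(tlb) == TLB_ENTRIES:
        tlb.popitem(last=False)  # evict the LRU entry
        # (a real TLB would write its bookkeeping bits back to the page table)
    tlb[vpn] = ppn
    return ppn, "TLB miss"

for vpn in (0x1, 0x2, 0x1, 0x3, 0x2):
    print(hex(vpn), lookup_vpn(vpn))
```

With only two entries, touching a third page (0x3) evicts the least recently used mapping, so the later re-reference to 0x2 hits while 0x1 must be refetched: the working-set behavior the slides describe.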