l18_handouts_4up - Today: Memory Hierarchies, More Advanced Topics



Today
- Memory Hierarchies: More Advanced Topics
- Quick review
- Tags: what they are, how they work
- Assume word-addressed memory, or that word size is a byte

(All slides: Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005)

Announcements
- Project 4a due Tuesday 4/10
- Homework 4 will be due 4/17; posted tonight sometime (No Homework 5!)
- Prelim 2 is Tuesday 4/24, 7:30 p.m. (same room split as last time); calculators allowed on Prelim 2
- Makeup: evening 4/19 OR evening 4/26. Let us know which prelim makeup works for you ASAP; you MUST contact both profs (we're not giving four makeups this time!)
- Cool guest lecture at end of semester!

SRAM vs. DRAM
- SRAM (static random access memory)
  - Faster than DRAM: 2-10 ns access time
  - Each storage cell is larger, so smaller capacity for the same area
- DRAM (dynamic random access memory)
  - Each storage cell is tiny (capacitance on a wire); can get 2 GB chips today
  - 50-70 ns access time
  - Leaky: need to periodically refresh data (what happens on a read?)
- For scale, CPU clock periods are ~0.2 ns-2 ns (5 GHz-500 MHz)

Non-Uniform DRAM Access
[Figure: DRAM array showing row address, column address, sense amplifiers (row buffer), precharge, and write-back of the open row]

Terminology
- Temporal locality: if memory location X is accessed, then it is more likely to be re-accessed in the near future than some random location Y
  - Caches exploit temporal locality by placing a memory element that has been referenced into the cache
- Spatial locality: if memory location X is accessed, then locations near X are more likely to be accessed in the near future than some random location Y
  - Caches exploit spatial locality by allocating a cache line of data (including data near the referenced location)
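Both kinds of locality show up in ordinary code. A minimal sketch (the 4-word line size and the array walk are illustrative assumptions, not from the slides): a sequential pass over 16 words touches only 4 distinct cache lines, so at most 4 of the 16 accesses can miss; the other 12 hit thanks to spatial locality.

```python
# Toy locality model: assumed 4-word cache lines (not from the slides).
LINE_WORDS = 4

addresses = list(range(16))                    # sequential walk over 16 words
lines_touched = {a // LINE_WORDS for a in addresses}

# Only 4 distinct lines are touched, so at most 4 misses for 16 accesses.
print(len(lines_touched))
```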
Modern DRAMs
- Add internal banking for parallelism (sense amps possibly shared among adjacent banks; separate chips form ranks)
- Can be in the process of performing operations on more than one bank; a pipelined, synchronous interface allows this
- Rambus uses a high memory clock rate, narrower channels, and fixed packet lengths

A Simple Fully Associative Cache
- Processor runs: Ld R1, M[1]; Ld R2, M[5]; Ld R3, M[1]; Ld R3, M[4]; Ld R2, M[0]
- Cache: 2 cache lines, 2-word blocks, 3-bit tag field, one valid bit per line
- Memory: 16 words; addresses 0-15 hold 100, 110, 120, ..., 250
- How many address bits? Four. What if we had fewer tag bits?

1st Access: Ld R1, M[1] misses. The block holding M[0]-M[1] is allocated (tag 0, data 100 and 110, lru bit updated) and R1 = 110. Misses: 1, Hits: 0.
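The five-load trace above can be replayed with a small model of this cache: 2 lines, 2-word blocks, fully associative, with LRU replacement (as the lru bit on the slides suggests). A sketch:

```python
# Fully associative cache: 2 lines, 2-word blocks, LRU replacement.
# Trace from the slides: Ld M[1], M[5], M[1], M[4], M[0].
BLOCK_WORDS = 2
NUM_LINES = 2

lines = []                # resident block tags, most recently used last
misses = hits = 0
for addr in [1, 5, 1, 4, 0]:
    tag = addr // BLOCK_WORDS        # block number = the stored tag
    if tag in lines:
        hits += 1
        lines.remove(tag)            # refresh this line's LRU position
    else:
        misses += 1
        if len(lines) == NUM_LINES:
            lines.pop(0)             # evict the least recently used line
    lines.append(tag)

print(misses, hits)   # 2 misses (the first M[1] and M[5]), then 3 hits
```

The first two loads miss, matching the Misses: 2, Hits: 0 state after the 2nd access on the slides; the remaining three loads all hit blocks already resident.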
1st and 2nd Access, with the address breakdown
- Ld R1, M[1] (Addr: 0001): tag 000, block offset 1; miss, as above. Misses: 1, Hits: 0
- Ld R2, M[5] (Addr: 0101): tag 010, block offset 1; miss. The block holding M[4]-M[5] (data 140 and 150) fills the second line and R2 = 150. Misses: 2, Hits: 0

Questions
- Why do valid bits never change once a block/line is loaded? (Or WHEN do they change?)
- Why can any address go anywhere in our (very small) cache?
- How do we flush the cache (e.g., on a context switch)?

Hennessy and Patterson (NOW)
- Read 7.1-7.3 ASAP, 7.6-7.8 before Prelim 2; the book will be very useful for Prelim 2
- I'm giving you my own Virtual Memory info: you're advanced enough to get the real info, and most books gloss over how things are actually implemented
- You can read the section on Virtual Memory, but you should follow my lecture notes w.r.t. the exam (like any textbook of its size, this one does include some errors, or some unnecessary simplifications)

Hennessy and Patterson (ASAP)
- Read 6.1-6.6, 6.8-6.12 (Pipelining)
- Remember that they give a different perspective; refer to last year's slides when in doubt re: your MIPS implementation
- You now have three different "perspectives" on pipelining; this should help you cement concepts
- It's all the same idea, but there are many ways to implement it (just like programming!)

Basic Cache Design
- Decide on the block size. How?
- Simulate lots of different block sizes and see which one gives the best performance
- Most systems use a block size between 32 bytes and 128 bytes
- Longer block sizes reduce overhead by reducing the number of tags and reducing the size of each tag
- Address fields: | Tag | Block Offset |

(Put the pipelining reading off until next week; read the memory material now.)

What about Stores?
- Where should you write the result of a store?
- If that memory location is in the cache: send it to the cache. Should we also send it to memory right away (write-through policy), or wait until we kick the block out (write-back policy)?
- If it is not in the cache: allocate the line, i.e., put it in the cache (write-allocate policy), or write it directly to memory without allocating (no-write-allocate policy)?

Two-Way Set Associative Cache
- Example address 01101 splits into a block offset (unchanged, 1 bit), a 1-bit set index, and a larger (3-bit) tag
- Cache: two sets, two ways per set; each line has a valid bit, dirty bit, tag, and 2-word data block
- Memory: 16 two-word blocks at addresses 00000-11110, holding 78/23, 29/218, 120/10, 123/44, 71/16, 150/141, 162/28, 173/214, 18/33, 21/98, 33/181, 28/129, 19/119, 200/42, 210/66, 225/74
- Rule of thumb: increasing associativity decreases conflict misses. A 2-way associative cache has about the same hit rate as a direct-mapped cache twice the size
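The address breakdown on the two-way slide (5-bit address, 1-bit block offset, 1-bit set index, 3-bit tag) can be checked directly. A sketch:

```python
# Split a 5-bit address into the fields of the 2-way example:
# | 3-bit tag | 1-bit set index | 1-bit block offset |
def split(addr):
    offset = addr & 0b1
    set_index = (addr >> 1) & 0b1
    tag = addr >> 2
    return tag, set_index, offset

# Address 01101 from the slide: tag 011, set 0, offset 1.
print(split(0b01101))
```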
Direct-Mapped Cache
- Example address 01011 splits into a block offset (1 bit), a line index (2 bits), and a tag (2 bits)
- Cache: four lines, each with a valid bit, dirty bit, tag, and 2-word data block (same 16-block memory as before)

Classifying misses
- Compulsory miss: first reference to a memory block
- Capacity miss: working set doesn't fit in the cache
- Conflict miss: working set maps to the same cache line

Effects of Varying Cache Parameters
- Total cache size = block size x number of sets x associativity
- Positives: should decrease miss rate
- Negatives: may increase hit time; probably increases area requirements (how are these related?)

Effects of Varying Cache Parameters: bigger block size
- Positives: exploits spatial locality, reducing compulsory misses; reduces tag overhead (bits); reduces transfer overhead (address, burst data mode)
- Negatives: fewer blocks for a given size, which increases conflict misses; increases miss transfer time (multi-cycle transfers); wastes bandwidth for non-spatial data

Effects of Varying Cache Parameters: replacement strategy (for associative caches)
- How is the evicted line chosen?
- 1. LRU: intuitive; difficult to implement with high associativity; worst-case performance can occur (N+1 element array)
- 2. Random: pseudo-random is easy to implement; performance close to LRU for high associativity; usually avoids pathological behavior (programmers HATE it!)
- 3. Optimal: replace the block whose next reference is furthest in the future (Belady replacement); hard/impossible to implement
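The total-size formula also fixes the address field widths. A sketch with assumed parameters (32-byte blocks, 128 sets, 2-way, 32-bit byte addresses; these numbers are illustrative, not from the slides):

```python
import math

# Assumed, illustrative parameters (not from the slides):
block_size = 32        # bytes per block
num_sets = 128
associativity = 2
addr_bits = 32         # byte-addressed, 32-bit addresses

# Total cache size = block size x number of sets x associativity.
total_size = block_size * num_sets * associativity   # 8192 bytes = 8 KB

# Field widths follow from the geometry:
offset_bits = int(math.log2(block_size))             # 5
index_bits = int(math.log2(num_sets))                # 7
tag_bits = addr_bits - index_bits - offset_bits      # 20

print(total_size, offset_bits, index_bits, tag_bits)
```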
Effects of Varying Cache Parameters: increasing associativity
- Positives: reduces conflict misses; low-associativity caches can have pathological behavior (very high miss rates)
- Negatives: increased hit time; more hardware requirements (comparators, muxes, bigger tags); improvements diminish past 4- or 8-way; Belady's anomaly (eventually more associativity = lower performance!)

Other Cache Design Decisions: write policy
- How to deal with write misses?
- Write-through / no-allocate
  - Total traffic? read misses x block size + writes
  - Common for L1 caches backed by an L2 (especially on-chip)
- Write-back / write-allocate
  - Needs a dirty bit to determine whether cache data differs from memory
  - Total traffic? (read misses + write misses) x block size + dirty-block evictions x block size
  - Common for L2 caches (memory-bandwidth limited)
- Variation: write validate
  - Write-allocate without fetch-on-write
  - Needs a sub-block cache with valid bits for each word/byte

Other Cache Design Decisions: write buffering
- Delay writes until bandwidth is available: put them in a FIFO buffer and only stall on a write if the buffer is full
- Use bandwidth for reads first (since reads have latency problems)
- Important for write-through caches (frequent write traffic)

Adding a Victim Cache
- A small, fully associative victim cache (e.g., 4 lines) sits beside a direct-mapped L1
- Blocks evicted from the direct-mapped cache go to the victim cache
- Tag compares are made to both the direct-mapped cache and the victim cache
- Victim hits cause lines to swap between L1 and the victim cache
- A small victim cache adds associativity to "hot" lines; not very useful for associative L1 caches
- Example references: 11010011, then 01010011 (both index the same direct-mapped line)
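The two write-policy traffic formulas can be compared with made-up workload counts (every number below is hypothetical, for illustration only; stores are assumed to be 4-byte words, an assumption the slide leaves implicit):

```python
# Hypothetical workload counts (not from the slides):
block_size = 32          # bytes per block
read_misses = 1000
write_misses = 200
writes = 5000            # total stores executed
dirty_evictions = 300

# Write-through / no-allocate: read misses fetch a block; every
# write also goes to memory (assumed 4-byte stores).
wt_traffic = read_misses * block_size + writes * 4

# Write-back / write-allocate: both kinds of miss fetch a block;
# dirty blocks are written back in full when evicted.
wb_traffic = (read_misses + write_misses) * block_size \
           + dirty_evictions * block_size

print(wt_traffic, wb_traffic)   # bytes of memory traffic each way
```

With these counts write-back wins (48,000 vs. 52,000 bytes), but a workload with few rewrites to the same block could tip the comparison the other way.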
Write-Back Buffer
- Holds evicted (dirty) lines for write-back caches
- Gives reads priority on the L2 or memory bus
- Usually only needs a small buffer

Other Cache Design Decisions
- Why do we use the uppermost bits as the tag?
- Why do we use the middle bits as the cache index?

Hash-Rehash Cache (direct mapped; walkthrough)
- Ref 11010011: misses; the line is allocated with tag 110
- Ref 01010011: misses in its primary line; the rehash probe (01000011) also misses. Allocate? The new block (tag 010) takes the primary line, and the old block (tag 110) moves to its rehash line
- Ref 11010011 again: misses in its primary line, but the rehash probe (11000011) hits!

Hash-Rehash Cache: calculating performance
- Primary hit time (normal direct mapped)
- Rehash hit time (sequential tag lookups)
- Block swap time?
- Hit rate comparable to two-way associative

Compiler Support for Caching
- Array merging (array of structs vs. two arrays)
- Loop interchange (row vs. column access)
- Structure padding and alignment (malloc())
- Cache-conscious data placement: pack the working set into the same line; map to non-conflicting addresses if packing is impossible

Calculating the Effects of Latency
- Does a cache miss reduce performance? It depends on whether critical instructions are waiting for the result
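The slides don't state the rehash function explicitly, but the probes shown are consistent with a direct-mapped lookup whose second probe flips the top set-index bit (8-bit address, 1-bit block offset, 16 sets: 11010011 probes set 1001, then set 0001). A sketch of that guess:

```python
# Guessed hash-rehash scheme, consistent with the slides' probes:
# 8-bit address, 1-bit block offset, 16 sets;
# rehash index = primary index with its top bit flipped.
def indices(addr):
    primary = (addr >> 1) & 0b1111   # drop offset, keep 4 index bits
    rehash = primary ^ 0b1000        # flip the top index bit
    return primary, rehash

print(indices(0b11010011))   # both example refs probe sets 9 then 1
print(indices(0b01010011))   # same two sets, different tag: a conflict
```

The two conflicting references on the slides map to the same primary set, which is exactly why the rehash probe (and the block swap) earns its keep.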
Prefetching
- Loading an entire line already assumes spatial locality; extend this...
- Next-line prefetch: on a miss, bring in the next block in memory too. Very good for the I-cache (why?)
- Software prefetch: loads to R0 have no data dependency
- Aggressive/speculative prefetch is useful for L2; speculative prefetch is problematic for L1

Calculating the Effects of Latency
- Depends on whether critical resources are held up
- Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict
- Non-blocking: allows later references to access the cache while a miss is being processed. Generally there is some limit to how many outstanding misses can be bypassed (e.g., eight or 16)

Section This Week + Homework
- These focus on practical, analytical problems (as in H&P): if we have this many ways in an associative cache, and the block size is thus, how many bits are in the tag?
- You need to practice these kinds of questions for the exam
- Unfortunately, we don't make you do cache-hierarchy design in your projects, so no creative cache solutions in homeworks or projects
- If you're interested in memory, come see me: it's a huge problem, and we're only hitting the tip of the iceberg (esp. if you want to learn something about parallel computing)
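Loop interchange, from the compiler-support slide above, as a sketch: both nests compute the same sum over a row-major 2-D array, but the first walks each row sequentially (one miss per cache line), while the second strides down columns (potentially one miss per element). The array contents are illustrative.

```python
N = 8
a = [[r * N + c for c in range(N)] for r in range(N)]   # row-major layout

# Row-major traversal: consecutive elements of a row are adjacent in
# memory, so this order exploits spatial locality.
good = sum(a[i][j] for i in range(N) for j in range(N))

# Column-major traversal of the same array: each access jumps a whole
# row ahead, defeating spatial locality (same result, more misses).
bad = sum(a[i][j] for j in range(N) for i in range(N))

print(good == bad)   # identical sums, very different cache behavior
```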

This note was uploaded on 09/01/2008 for the course ECE 3140 taught by Professor McKee/Long during the Spring '07 term at Cornell University (Engineering School).
