L15Mem2 - CS324 Computer Architecture: The Memory Hierarchy: Caches II


Direct Mapped Cache
Mapping: the address is taken modulo the number of blocks in the cache

[Figures: (1) an 8-block direct-mapped cache (indices 000-111) to which memory addresses 00001, 00101, 01001, ..., 11101 map by address mod 8; (2) a 1024-entry direct-mapped cache in which a 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset; the index selects one of the 1024 (valid, tag, data) entries and a hit is signaled when the stored tag matches the address tag.]

What kind of locality are we taking advantage of? Spatial locality.

Hits vs. Misses
Read hits
– this is what we want! The processor continues as if nothing happened.
Read misses
– stall the CPU, fetch the block from memory, deliver it to the cache, restart
Write hits:
– replace the data in both the cache and memory (write-through)
– write the data only into the cache, and write it back to memory later (write-back)
Write misses:
– read the entire block into the cache, then write the word

Handling Cache Misses
Data:
– stall the CPU & freeze all register contents
– wait until the data arrives (write the data to the cache)
– restart the instruction at the cycle that caused the miss
Instructions:
– send PC - 4 to the memory
– wait for memory to complete the read
– write the data into the cache
– restart the instruction at the first step, this time finding it in the cache

What to do on a write hit?
Write-through
– update the word in the cache block and the corresponding word in memory
– can be slow! Solution: use buffering
– works only if the rate at which writes are generated is smaller than the rate at which they can be processed
Write-back
– update the word in the cache block only, allowing the memory word to be "stale"
– add a 'dirty' bit to each block indicating that memory needs to be updated when the block is replaced
– the OS flushes the cache before I/O…
Performance trade-offs?
– write-through is easier to implement, but the write-generation rate can exceed the processing rate
– write-back addresses that problem, but is harder to implement

Block Size Tradeoff
Benefits of a larger block size
– Spatial Locality: if we access a given word, we're likely to access other nearby words soon (Another Big Idea)
– Very applicable with the Stored-Program Concept: if we execute a given instruction, it's likely that we'll execute the next few as well
– Works nicely in sequential array accesses too
But miss rate is NOT the only cache performance metric…

Block Size Tradeoff
Drawbacks of a larger block size
– A larger block size means a larger miss penalty: on a miss, it takes longer to load a new block from the next level
– If the block size is too big relative to the cache size, then there are too few blocks
Result: miss rate goes up
In general, minimize:
Average Access Time = Hit Time x Hit Rate + Miss Penalty x Miss Rate

Block Size Tradeoff
Hit Time = time to find and retrieve data from the current level cache
Miss Penalty = average time to retrieve data on a current-level miss (includes the possibility of misses on successive levels)
Hit Rate = % of requests that are found in the current level cache
Miss Rate = 1 - Hit Rate
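To make the trade-off concrete, here is a minimal Python sketch of the average-access-time formula defined above. The helper name and the specific hit times, miss penalties, and miss rates are illustrative assumptions, not numbers from the lecture.

```python
def average_access_time(hit_time, miss_penalty, miss_rate):
    """Average Access Time = Hit Time x Hit Rate + Miss Penalty x Miss Rate."""
    hit_rate = 1.0 - miss_rate
    return hit_time * hit_rate + miss_penalty * miss_rate

# Assumed example values, in cycles: a larger block might lower the miss rate
# (more spatial locality) but raise the miss penalty (more data per fetch).
small_block = average_access_time(hit_time=1, miss_penalty=100, miss_rate=0.05)  # 5.95 cycles
large_block = average_access_time(hit_time=1, miss_penalty=160, miss_rate=0.03)  # 5.77 cycles
print(small_block, large_block)
```

Whether the larger block wins depends entirely on whether the drop in miss rate outweighs the growth in miss penalty, which is exactly the trade-off described above.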
Extreme Example: One Big Block
[Figure: a cache consisting of a single entry: one valid bit, one tag, and a 4-byte data block B3 B2 B1 B0.]
Cache Size = 4 bytes, Block Size = 4 bytes
– only ONE entry in the cache!
If an item is accessed, it is likely to be accessed again soon
– but it is unlikely to be accessed again immediately!
– the next access will likely be a miss
– we continually load data into the cache but discard it (force it out) before we use it again
– a nightmare for the cache designer: the Ping-Pong Effect

Block Size Tradeoff Conclusions
[Figure: three sketch graphs versus block size. Miss penalty grows with block size. Miss rate first falls (exploits spatial locality) and then rises once there are too few blocks (compromises temporal locality). Average access time therefore has a minimum and increases at large block sizes because of the increased miss penalty and miss rate.]

Cache Performance
Assume:
– instruction cache miss rate = 2%, data cache miss rate = 4%
– CPI = 2 (without any memory stalls)
– miss penalty = 100 cycles
– frequency of loads and stores = 36%
How much faster would the processor run with a perfect cache (one that never misses)?
Let I = number of instructions.
Stall cycles = instructions x miss rate x miss penalty
Cycles due to instruction misses = I x .02 x 100 = 2I
How many instructions will yield a data miss? .36I
Cycles due to data misses = .36I x .04 x 100 = 1.44I
CPI = 2 + 2 + 1.44 = 5.44 with stalls
CPI = 2 if perfect
Speedup with a perfect cache = 5.44/2 = 2.72

Cache Performance
What happens if the processor is made faster, but no other changes are made?
– memory stalls take an increasing fraction of execution time
– assume the CPI is decreased from 2 to 1; then
CPI = 1 + 2 (for instruction misses) + 1.44 (for data misses) = 4.44 with stalls
CPI = 1 if perfect
4.44/1 = 4.44 times faster with a perfect cache
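The same arithmetic, written as a small Python check; all of the rates, penalties, and base CPIs are the ones given in the example above, and the function name is just an illustrative choice.

```python
def cpi_with_stalls(base_cpi, instr_miss_rate, data_miss_rate,
                    load_store_freq, miss_penalty):
    """Add memory-stall cycles per instruction to a base CPI."""
    instr_stalls = instr_miss_rate * miss_penalty                   # 0.02 * 100 = 2.0
    data_stalls = load_store_freq * data_miss_rate * miss_penalty   # 0.36 * 0.04 * 100 = 1.44
    return base_cpi + instr_stalls + data_stalls

stalled = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)   # 5.44
print(stalled, stalled / 2)                           # 2.72x slower than a perfect cache

faster = cpi_with_stalls(1, 0.02, 0.04, 0.36, 100)    # 4.44
print(faster, faster / 1)                             # 4.44x slower than a perfect cache
```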
Types of Cache Misses
Compulsory Misses
– occur when a program is first started
– the cache does not contain any of that program's data yet, so misses are bound to occur
– can't be avoided easily, so we won't focus on these
– besides, if a program executes thousands of instructions, these have a negligible effect

Types of Cache Misses
Conflict Misses
– a miss that occurs because two distinct memory addresses map to the same cache location
– two blocks (which happen to map to the same location) can keep overwriting each other
– a big problem in direct-mapped caches
– how do we lessen the effect of these?

Dealing with Conflict Misses
Solution 1: make the cache bigger
– fails at some point
Solution 2: increase the number of entries per cache index

One Extreme: Fully Associative Cache
Memory address fields:
– Tag: same as before
– Offset: same as before
– Index: non-existent (what was the point of the index?)
What does this mean?
– no "rows": any block can go anywhere in the cache
– must compare with all tags in the entire cache to see if the data is there

Fully Associative Cache
Fully associative cache (e.g., 32 B blocks)
– compare tags in parallel
[Figure: a fully associative cache in which the address is split into a 27-bit tag (bits 31-5) and a 5-bit byte offset (bits 4-0); the address tag is compared simultaneously against every (valid, tag, data) entry.]

Fully Associative Cache
Benefit of a fully associative cache
– no conflict misses (since data can go anywhere)
Drawbacks of a fully associative cache
– need a hardware comparator for every single entry: if we have 64KB of data in the cache with 4B entries, we need 16K comparators: infeasible

Third Type of Cache Miss
Capacity Misses
– a miss that occurs because the cache has a limited size
– a miss that would not occur if we increased the size of the cache
– a sketchy definition, so just get the general idea
This is the primary type of miss for fully associative caches.

N-Way Set Associative Cache
Memory address fields:
– Tag: same as before
– Offset: same as before
– Index: points us to the correct "row" (called a set in this case)
So what's the difference?
– each set contains multiple blocks
– once we've found the correct set, we must compare with all the tags in that set to find our data

N-Way Set Associative Cache
Given a memory address:
– find the correct set using the Index value
– compare the Tag with all the Tag values in the determined set
– if a match occurs, it's a hit; otherwise, a miss
– finally, use the Offset field as usual to find the desired data within the desired block
An implementation: 4-way
[Figure: a 256-set, 4-way set associative cache. The address supplies a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2); the index selects one of 256 sets, the tag is compared against the four tags in that set, and a 4-to-1 multiplexor selects the matching block's data on a hit.]

N-Way Set Associative Cache
Summary:
– the cache is direct-mapped with respect to sets
– each set is fully associative
– basically N direct-mapped caches working in parallel: each has its own valid bit and data

N-Way Set Associative Cache
What's so great about this?
– even a 2-way set associative cache avoids a lot of conflict misses
– the hardware cost isn't that bad: we only need N comparators
In fact, for a cache with M blocks,
– it's direct-mapped if it's 1-way set associative
– it's fully associative if it's M-way set associative
– so these two are just special cases of the more general set associative design

Block Replacement Policy
Direct-mapped cache: the index completely specifies which position a block can go in on a miss
N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
Fully associative: the block can be written into any position
Question: if we have the choice, where should we write an incoming block?

Block Replacement Policy
Solution:
– if there are any locations with the valid bit off (empty), then usually write the new block into the first one
– if all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss

Block Replacement Policy: LRU
LRU (Least Recently Used)
– idea: cache out the block which has been accessed (read or write) least recently
– pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
– con: with 2-way set associative, it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of this

Block Replacement Example
We have a 2-way set associative cache with a four-word total capacity and one-word blocks (how many sets are in this cache?). We perform the following word accesses (ignore bytes for this problem):
0, 2, 0, 1, 4, 0, 2, 3, 5, 4
How many hits and how many misses will occur with the LRU block replacement policy?

Block Replacement Example: LRU
Addresses 0, 2, 0, 1, 4, 0, ...
– 0: miss, bring into set 0 (loc 0)
– 2: miss, bring into set 0 (loc 1)
– 0: hit
– 1: miss, bring into set 1 (loc 0)
– 4: miss, bring into set 0 (loc 1, replacing 2)
– 0: hit
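Below is a small simulation sketch of this example in Python. The cache geometry (2 sets, 2 ways, one-word blocks) and the access sequence are exactly the ones above; the class and its method names are illustrative, not part of the lecture.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tiny N-way set-associative cache model with LRU replacement.
    One-word blocks, so the block address is the word address itself."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is an OrderedDict of tag -> None, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        index = addr % self.num_sets      # which set (address modulo number of sets)
        tag = addr // self.num_sets       # identifies the block within the set
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)            # hit: mark as most recently used
            return "hit"
        if len(s) >= self.ways:
            s.popitem(last=False)         # miss with a full set: evict the LRU block
        s[tag] = None
        return "miss"

# 2-way set associative, four-word total capacity, one-word blocks -> 2 sets.
cache = SetAssociativeCache(num_sets=2, ways=2)
trace = [0, 2, 0, 1, 4, 0, 2, 3, 5, 4]
results = [cache.access(a) for a in trace]
print(list(zip(trace, results)))
print(results.count("hit"), "hits,", results.count("miss"), "misses")
```

For the first six accesses the model reproduces the trace above (miss, miss, hit, miss, miss, hit); over the full sequence it reports 2 hits and 8 misses.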
Ways to reduce miss rate
1. Larger cache
– limited by cost and technology
– hit time of the first-level cache must stay below the cycle time
2. Increase associativity
– advantage: more places in the cache to put each block of memory
  fully associative: any block can go in any line
  k-way set associative: k places for each block
  direct-mapped: k = 1
– disadvantages:
  increases hit time (more places to look for a block)
  more complicated hardware

Big Idea
How do we choose between associativity, block size, replacement & write policy?
Design against a performance model
– minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
– influenced by technology & program behavior
Create the illusion of a memory that is large, cheap, and fast - on average
How can we improve the miss penalty?

Improving Miss Penalty
When caches first became popular, Miss Penalty ~ 10 processor clock cycles
Today: a 4 GHz processor (0.25 ns per clock cycle) and 100 ns to go to DRAM ⇒ 400 processor clock cycles!
[Figure: processor, first-level cache ($), second-level cache ($2), and DRAM main memory.]
Solution: another cache between the processor and main memory: the Second-Level (L2) Cache

Analyzing a Multi-level Cache Hierarchy
[Figure: processor with an L1 cache ($) and an L2 cache ($2) in front of DRAM, labeled with L1 hit time, L1 miss rate, L1 miss penalty, L2 hit time, L2 miss rate, and L2 miss penalty.]
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)
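Here is a minimal Python sketch of the two-level formula above. The formula is the one from the slide; the function name, the L1/L2 hit times, and the miss rates are assumed illustrative values, while the 400-cycle DRAM penalty comes from the Improving Miss Penalty slide (100 ns at 0.25 ns per cycle).

```python
def amat_two_level(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate, l2_miss_penalty):
    """Avg Mem Access Time = L1 Hit + L1 Miss Rate * (L2 Hit + L2 Miss Rate * L2 Miss Penalty)."""
    l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty
    return l1_hit_time + l1_miss_rate * l1_miss_penalty

# Assumed example values, in processor clock cycles:
# 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate,
# 400-cycle penalty to DRAM.
with_l2 = amat_two_level(1, 0.05, 10, 0.20, 400)   # 1 + 0.05 * (10 + 0.20 * 400) = 5.5 cycles
without_l2 = 1 + 0.05 * 400                        # every L1 miss goes straight to DRAM: 21 cycles
print(with_l2, without_l2)
```

Even a modest L2 cache cuts the average access time sharply, because most L1 misses no longer pay the full DRAM penalty.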