Cache

• Memory hierarchy is our solution for supplying seemingly unlimited fast accesses
  – at least 1 instruction fetch and maybe 1 data access per cycle; more for a superscalar
• Each level of the hierarchy is based (at least in part) on the principle of locality of reference
  – as we move higher in the hierarchy, each level gets faster but also more expensive, and is therefore more restricted in size
  – some issues are generic across the hierarchy, but each level has its own characteristics, technology, and solutions
  – we have already looked at registers; here we will study cache
• The higher levels of the hierarchy (which are faster) are more expensive, so we have less storage there, but we want to prevent accesses at the lower levels because of their longer access times

Effect on Performance

• Memory access time has a direct effect on CPU performance
  – CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time
  – memory stall cycles = IC * memory references per instruction * miss rate * miss penalty
  – memory references per instruction >= 1, since there is the instruction fetch itself plus possibly 1 or more data accesses
• Whenever an instruction or datum is not in registers, we must fetch it from cache; if it is not in cache, we accrue a miss penalty by having to access the much slower main memory
• A large enough miss penalty causes a substantial increase in CPU execution time. Consider for example:
  – CPI = 1.0 when all memory accesses are hits
  – only data accesses occur during loads and stores (50% of all instructions are loads or stores)
  – miss penalty is 25 clock cycles, miss rate is 2%
  – what is the impact on CPI?
  – CPI = 1.0 + 100%*2%*25 + 50%*2%*25 = 1.75
  – so we are 75% slower than we might be because of cache misses

Four Questions

• The general piece of memory will be called a block
  – blocks differ in size depending on the level of the memory hierarchy: cache block, memory block, disk block
• We ask the following questions:
  – Q1: where can a block be placed?
  – Q2: how is a block found?
  – Q3: which block should be replaced on a miss?
  – Q4: what happens on a write?
• The answers to questions 1–3 differ depending on the type of cache: direct mapped, fully associative, or set associative
• The answer to the last question is based on the write policy: write through or write back

Q1: Where can a block be placed?

• The cache type determines placement:
  – fully associative cache: any available block
  – direct mapped cache: a given memory block has only one location where it can be placed, determined by (block address) mod (number of blocks in the cache)
  – set associative cache: a given memory block maps to one set of blocks, determined by (block address) mod (number of sets), where the number of sets = (number of blocks in the cache) / associativity
• Example: a cache of 8 blocks and a memory of 32 blocks
  – to place memory block 12, we can put it in any block of a fully associative cache, in block 4 of a direct mapped cache (12 mod 8 = 4), and in block 0 or 1 (set 0) of a 2-way set associative cache (12 mod 4 = 0)

Q2: How is a block found in cache?

• A memory address is divided into a tag, a line number (or index), and a block offset
  – in a direct mapped cache, the line number dictates the line where a block must be placed and where it will be found; the tag is used to make sure that the line we have found holds the block we want
  – in a set associative cache, the line number selects a set of lines; the block must be placed in one of those lines, so there is some variability (which line should we put it in, which line will we find it in?)
    • we compare the tag of the address against the tags of every line in the selected set and use a multiplexor to select the matching line
  – in a fully associative cache, a line can go anywhere; since the fully associative cache is in effect one large set, we must compare all tags at once, which is done through associative memory (in effect, a large number of comparators working in parallel)
  – a short sketch of this address breakdown follows below
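As a concrete illustration of how the tag, line number, and block offset are extracted from an address, here is a minimal sketch in C. The geometry (64-byte blocks, 1024 lines, byte addressing) is assumed only for the example and is not taken from a particular slide.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry for illustration: 64-byte blocks, 1024 lines,
       i.e., a 64 KB direct-mapped cache. */
    #define BLOCK_BYTES 64      /* 6 offset bits  */
    #define NUM_LINES   1024    /* 10 index bits  */

    static void decompose(uint32_t addr) {
        uint32_t offset = addr % BLOCK_BYTES;   /* byte within the block          */
        uint32_t block  = addr / BLOCK_BYTES;   /* block address                  */
        uint32_t index  = block % NUM_LINES;    /* line number (Q1: placement)    */
        uint32_t tag    = block / NUM_LINES;    /* tag compared on lookup (Q2)    */
        printf("addr %u -> tag %u, line %u, offset %u\n", addr, tag, index, offset);
    }

    int main(void) {
        decompose(305419896u);   /* an arbitrary address */
        return 0;
    }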
Q3: Which block should be replaced?

• For the direct mapped cache
  – there is no choice of which line a new block goes into, so there is no need for a replacement strategy
• For set associative and fully associative caches
  – we have a choice; for instance, in a 2-way set associative cache we can place the new block in either line of the selected set
  – for the fully associative cache, our choice is any line at all
• There are three common replacement strategies:
  – random
  – FIFO
  – least recently used (LRU)
    • LRU best models locality of reference: it predicts which block is least likely to be used again in the near future by looking at how long ago each block was last accessed; however, exact LRU is difficult to implement, especially in hardware
    • instead we might use a variation such as an LRU approximation or least frequently used
  – figure C.4 on page C-10 compares the data miss rate when using FIFO, random, and LRU replacement
    • LRU is always the best, but the difference between the three is not great, and random is often as good as FIFO

Q4: What happens on a write?

• Writes occur only to data
  – once the datum is loaded into cache there are (at least) two copies: one in the cache and one in main memory (more if we have multiple levels of cache)
  – on a cache write, the datum in cache is modified; what happens to the now-stale value in memory?
• Write through cache
  – write the datum to both cache and memory at the same time
  – this is inefficient because a data access is a word, while typical data movement between cache and memory is a block, so the write uses only part of the bus for a transfer
  – notice that other words in the same block may also soon be updated, so waiting could pay off
• Write back cache
  – write to cache only; wait to write to memory until the entire block is removed from the cache
  – add a dirty bit to the cache to indicate that the cache value is right and memory is out of date
• Write through is easier to implement, since memory is always up to date and we don’t need dirty-bit mechanisms; write back is preferred to reduce memory traffic (a write stall occurs in write through if the CPU must wait for the write to take place) — a small sketch contrasting the two policies follows below
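To make the two write policies concrete, here is a minimal sketch of how a write hit might be handled with and without a dirty bit. The line structure and the write_word_to_memory stub are assumptions for illustration, not any particular processor's organization.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        bool     dirty;        /* used only by the write-back policy */
        uint32_t tag;
        uint32_t data[16];     /* one 64-byte block as 16 words (size assumed) */
    } cache_line_t;

    /* stand-in for the bus/memory controller */
    static void write_word_to_memory(uint32_t word_addr, uint32_t value) {
        (void)word_addr; (void)value;
    }

    /* write hit, write through: update the cache and memory together */
    void write_hit_write_through(cache_line_t *line, uint32_t word_addr, uint32_t value) {
        line->data[word_addr % 16] = value;
        write_word_to_memory(word_addr, value);
    }

    /* write hit, write back: update the cache only and mark the block dirty;
       memory is updated later, when the block is evicted */
    void write_hit_write_back(cache_line_t *line, uint32_t word_addr, uint32_t value) {
        line->data[word_addr % 16] = value;
        line->dirty = true;
    }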
Write Miss Example

• What happens on a write miss?
  – write allocate: the block is fetched on a miss and then the write proceeds as a write hit
  – no-write allocate: the block is modified in memory without being brought into the cache
• Consider a write-back cache that starts empty and executes the sequence of operations shown below
  – how many hits and how many misses occur with no-write allocate versus write allocate?

    Write [100]
    Write [100]
    Read  [200]
    Write [200]
    Write [100]

• Solution:
  – no-write allocate: the first two operations are misses (after the first write, 100 is still not loaded into cache), the read of 200 is a miss, the write of 200 is a hit (200 is now in cache), but the final write of 100 is again a miss — 4 misses, 1 hit
  – write allocate: the first access to each location is a miss, but from then on the block is in cache, so we have 2 misses (one each for 100 and 200) and 3 hits

Using a Write Buffer

• To alleviate the inefficiency of write through, we may add a write buffer
  – writes go to the cache and the buffer, and the CPU continues without stalling
  – the buffer is written to memory when it fills or when a line is filled, so the memory write happens in parallel with the processor continuing execution
  – stalls can still arise when using a write buffer, but they are less frequent
  – question: on a read or write miss, should we examine the write buffer? the buffer may still hold the value we need, letting us avoid a time-consuming memory access
• Example:
  – given the three memory operations below, where locations 512 and 1024 map to the same direct-mapped cache line, and a write-through cache with a 4-word write buffer that is not checked on a cache miss, will the value in R2 equal the value in R3 at the end? not necessarily

    SW R3, 512(R0)   // value written to write buffer, but not yet to memory
    LW R1, 1024(R0)  // cache miss, new block brought in (evicts the line for 512)
    LW R2, 512(R0)   // cache miss, fetch Mem[512], which might still be the
                     // old value since the write buffer may not have written yet!

Example: Opteron Data Cache

• Found in AMD Opteron processors
  – 2-way set associative, 512 blocks per way, 64 KB total in 64-byte blocks
  – write back, write allocate
  – 40-bit CPU address comprised of a 25-bit tag, 9-bit line number, and 6-bit byte offset
  – the tags of both blocks in the selected set are compared against the address tag, and a 2:1 MUX selects which (if either) datum to return to the CPU
  – a dirty bit denotes that a block has been modified, so that it can be written back to memory before being removed from the cache
  – on a miss, the cache notifies the processor (so it can stall) and the request continues down to the next level of the hierarchy, which takes 7 further cycles to return the first 8 bytes of the needed block
  – replacement is LRU, using a single bit per set to record which block was accessed most recently and replacing the other one

Split Versus Unified Caches

• The previous example was of a data cache only — why have separate data and instruction caches?
• Below are misses per 1000 instructions for two split caches versus a single unified cache
  – the unified cache should be twice the size of each split cache for a fair comparison, e.g., compare two 8 KB split caches against a 16 KB unified cache

    Size      Instruction cache   Data cache   Unified cache
    8 KB            8.16             44.0          63.0
    16 KB           3.82             40.9          51.0
    32 KB           1.36             38.4          43.3
    64 KB           0.61             36.9          39.4
    128 KB          0.30             35.3          36.2
    256 KB          0.02             32.6          32.9

• Why do you suppose the instruction cache performs so much better than a data cache of equal size?
• Note: this table does not show miss rate — it shows misses per 1000 instructions, not per 1000 accesses; dividing by 10 gives misses per 100 instructions (e.g., 6.3 for the 8 KB unified cache), and converting to a true miss rate also requires dividing by memory accesses per instruction (a conversion sketch follows below)
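The conversion from misses per 1000 instructions to a miss rate per access can be written out explicitly. This is a small helper sketch; the accesses-per-instruction figures are the ones the next example uses (1.0 for instruction fetches, 0.36 for data, 1.36 for the unified cache).

    #include <stdio.h>

    /* Convert misses per 1000 instructions into a miss rate per memory access. */
    static double miss_rate(double misses_per_1000_instr, double accesses_per_instr) {
        return (misses_per_1000_instr / 1000.0) / accesses_per_instr;
    }

    int main(void) {
        printf("16 KB instruction cache: %.5f\n", miss_rate(3.82, 1.0));   /* ~0.00382 */
        printf("16 KB data cache:        %.4f\n", miss_rate(40.9, 0.36));  /* ~0.1136  */
        printf("32 KB unified cache:     %.4f\n", miss_rate(43.3, 1.36));  /* ~0.0318  */
        return 0;
    }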
Example

• Compare 16 KB instruction and 16 KB data caches versus a 32 KB unified cache, assuming
  – 1 clock cycle hit time
  – 100 clock cycle miss penalty for the individual caches
  – add 1 clock cycle to the hit time for a load/store in the unified cache (36% of instructions are loads/stores), since a single cache cannot handle both an instruction fetch and a data access in the same cycle
  – write-through caches with a write buffer, no stalls on writes
• What is the average memory access time for each organization?
• To determine each cache’s performance, we compute memory access time:
  – average memory access time = hit time + miss rate * miss penalty
  – as noted above, the unified cache has the higher hit time because it needs two accesses at once when an instruction is a load or store

Solution

• We take misses per 1000 instructions from the table on the earlier slide and convert to miss rates:
  – instruction cache: (3.82 / 1000) / 1 = .00382
  – data cache: (40.9 / 1000) / .36 = .1136
  – unified cache: (43.3 / 1000) / 1.36 = .0318
• Of 136 accesses per 100 instructions, the percentage of instruction accesses = 100 / 136 = 74% and the percentage of data accesses = 36 / 136 = 26%
  – memory access time for the split caches = 74% * (1 + .00382 * 100) + 26% * (1 + .1136 * 100) = 4.24
  – memory access time for the unified cache = 74% * (1 + .0318 * 100) + 26% * (2 + .0318 * 100) = 4.44
• The separate caches perform better
  – note: there is a typo in the book — it states a 100 clock cycle miss penalty but uses 200 in its solution

Revised CPU Performance

• Recall our CPU time formula:
  – CPU time = (CPU cycles + memory stall cycles) * clock cycle time
  – assume memory stalls are caused by cache misses only, not by problems like bus contention, I/O, etc.
• Memory stall cycles
  – = memory accesses * miss rate * miss penalty
  – = reads * read miss rate * read miss penalty + writes * write miss rate * write miss penalty
• CPU time
  – = IC * (CPI + memory accesses per instruction * miss rate * miss penalty) * clock cycle time
  – = IC * (CPI * clock cycle time + memory accesses per instruction * miss rate * memory access time)
• The memory access time is independent of the clock speed, so an interesting tradeoff arises: the faster the clock, the better the CPU performance, but the higher the miss penalty in cycles!

Associativity Example

• What impact does cache organization (direct mapped vs. 2-way set associative) have on a CPU?
  – assume CPI with a perfect cache is 1.6, clock cycle time = .35 ns, 1.4 memory references per instruction, 128 KB data and instruction caches with 64-byte blocks
  – cache 1 is direct mapped with a miss rate of 2.1%; cache 2 is 2-way set associative with a miss rate of 1.9%
  – the direct-mapped cache permits a faster clock; assume the clock cycle time is lengthened by 35% for the 2-way set associative cache
  – cache miss penalty = 65 ns
  – CPU time with cache 1 = IC * (1.6 * .35 + 1.4 * .021 * 65) = 2.47 * IC
  – CPU time with cache 2 = IC * (1.6 * .35 * 1.35 + 1.4 * .019 * 65) = 2.49 * IC
  – so the CPU with cache 1 is 2.49 / 2.47 = 1.01 times faster
• Note: the cache miss penalty should actually be higher for cache 2 — why? (a code sketch of this computation follows below)
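A short sketch of the CPU-time formula used in the associativity example above; all constants are the ones given in the example.

    #include <stdio.h>

    /* CPU time per instruction (ns) = CPI_base * cycle_time
                                      + mem_refs_per_instr * miss_rate * miss_penalty_ns */
    static double cpu_time_per_instr(double cpi, double cycle_ns, double refs_per_instr,
                                     double miss_rate, double miss_penalty_ns) {
        return cpi * cycle_ns + refs_per_instr * miss_rate * miss_penalty_ns;
    }

    int main(void) {
        double c1 = cpu_time_per_instr(1.6, 0.35,        1.4, 0.021, 65.0); /* direct mapped   ~2.47 */
        double c2 = cpu_time_per_instr(1.6, 0.35 * 1.35, 1.4, 0.019, 65.0); /* 2-way set assoc ~2.49 */
        printf("cache 1: %.2f ns/instr, cache 2: %.2f ns/instr, speedup %.3f\n", c1, c2, c2 / c1);
        return 0;
    }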
Out of Order and Miss Penalty

• In our prior examples, cache misses caused the pipeline to stall, directly impacting CPI
  – in a multiple-issue, out-of-order execution architecture like Tomasulo’s, a data miss stalls a particular instruction and possibly others (because it ties up a reservation station or reorder buffer slot), but it may not impact overall CPI much
• How then do we determine the impact of cache misses on such architectures?
  – we might define memory stall cycles per instruction = misses per instruction * (total miss latency – overlapped miss latency)
    • total miss latency: the sum of all memory latencies attributed to an instruction’s misses
    • overlapped miss latency: the portion of that time during which the miss does not hurt performance because other instructions keep executing
  – these two terms are difficult to analyze, so we won’t cover this in any more detail
  – typically a multi-issue out-of-order architecture can hide some of the miss penalty, up to 30% as shown in an example on pages C-20 – C-21

Improving Cache Performance

• Having reviewed thousands of research papers on caches, the authors provide a number of approaches to improve cache performance
  – recall: average memory access time = hit time + miss rate * miss penalty
  – the approaches can be categorized by which part of that formula they try to reduce:
    • reduce hit time
    • reduce miss rate
    • reduce miss penalty (or increase cache bandwidth so that part of the miss penalty can be hidden)
    • use parallelism to reduce miss rate and/or miss penalty
• Comments:
  – the miss penalty is the biggest value in the equation, so it is the obvious target, but in fact little can be done to increase memory speed
  – reducing miss rate has many different approaches, but miss rates today are often less than 2% — can we continue to improve?
  – reducing hit time has the added benefit of allowing us to lower the clock cycle time as well

The 3 Cs

• Cache misses can be categorized as:
  – compulsory misses: the very first access to a block cannot hit because the process has just begun and nothing has yet been loaded into the cache
  – capacity misses: the cache cannot contain all of the blocks needed by the process
  – conflict misses: the block placement strategy only allows a block to be placed in certain locations, creating contention with other blocks for those same locations
• Some of the approaches we will see attempt to reduce one particular type of miss
  – figure C.8 on page C-23 shows the miss rates for various cache sizes and associativities
  – conflict misses are more common in direct-mapped caches and are zero in fully associative caches, whereas capacity misses are the most common cause of a cache miss

Solution 1: Larger Block Sizes

• Larger block sizes reduce compulsory misses
  – larger blocks take more advantage of spatial locality
  – but larger blocks increase the miss penalty, because it physically takes longer to transfer the block from main memory to cache
  – also, larger blocks mean fewer blocks in the cache, which can itself increase the miss rate; this depends on program layout and the size of the cache relative to the block size
• A block size of 64 to 128 bytes provides the lowest miss rates:

    Cache size   Block 16   Block 32   Block 64   Block 128   Block 256
    4 KB          8.57%      7.24%      7.00%      7.78%       9.51%
    16 KB         3.94%      2.87%      2.64%      2.77%       3.29%
    64 KB         2.04%      1.35%      1.06%      1.02%       1.15%
    256 KB        1.09%      0.70%      0.51%      0.49%       0.49%
Example: Impact of Block Size

• Assume memory takes 80 cycles to respond and then delivers 16 bytes every 2 clock cycles
• Which block size has the minimum average memory access time for each cache size?
  – average memory access time = hit time + miss rate * miss penalty
  – hit time = 1 cycle, miss rates from the previous slide
  – the miss penalty depends on block size (82 cycles for 16 bytes, 84 cycles for 32 bytes, etc.; for a k-byte block, miss penalty = 80 + (k / 16) * 2)
• Solution:
  – 16-byte block, 4 KB cache: 1 + (8.57% * 82) = 8.03 clock cycles
  – 256-byte block, 256 KB cache: 1 + (0.49% * 112) = 1.55 clock cycles

    Block size   Miss penalty   4 KB    16 KB   64 KB   256 KB
    16               82          8.03    4.23    2.67    1.89
    32               84          7.08    3.41    2.13    1.59
    64               88          7.16    3.32    1.93    1.45
    128              96          8.47    3.66    1.98    1.47
    256             112         11.65    4.69    2.29    1.55

• We must compromise: a bigger block size reduces the miss rate to some extent, but it also increases the miss penalty (a code sketch reproducing this table follows below)
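A small sketch that reproduces the table above from the miss rates on the previous slide; the miss-penalty formula (80 cycles plus 2 cycles per 16 bytes) is the one stated in the example.

    #include <stdio.h>

    int main(void) {
        int block_sizes[] = {16, 32, 64, 128, 256};
        const char *caches[] = {"4KB", "16KB", "64KB", "256KB"};
        /* miss rates (%) from the block-size slide: rows = block size, cols = cache size */
        double miss_rate[5][4] = {
            {8.57, 3.94, 2.04, 1.09},
            {7.24, 2.87, 1.35, 0.70},
            {7.00, 2.64, 1.06, 0.51},
            {7.78, 2.77, 1.02, 0.49},
            {9.51, 3.29, 1.15, 0.49},
        };
        for (int b = 0; b < 5; b++) {
            double penalty = 80.0 + (block_sizes[b] / 16) * 2.0;        /* cycles */
            printf("block %3d (penalty %3.0f):", block_sizes[b], penalty);
            for (int c = 0; c < 4; c++) {
                double amat = 1.0 + (miss_rate[b][c] / 100.0) * penalty; /* hit time = 1 */
                printf("  %s %.2f", caches[c], amat);
            }
            printf("\n");
        }
        return 0;
    }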
Solution 2: Larger Caches

• A larger cache reduces capacity misses simply by having more capacity, and also reduces conflict misses because more lines mean fewer conflicts
• This is an obvious solution with no apparent performance drawbacks, but
  – you must be careful where you put the larger cache: a larger on-chip cache may take space away from other hardware that could provide performance increases (registers, more functional units, logic for multiple issue, etc.)
  – more cache also means a more expensive machine
• Larger caches are often slower, as they need more multiplexing
  – the best place for a large cache is off the chip, where we have the space and where a slightly slower cache matters less given the larger miss penalty of going off chip anyway
  – the authors note that second-level caches in 2001-era computers are equal in size to main memories from 10 years earlier!

Solution 3: Higher Associativity

• Cache research points out the “2:1 cache rule of thumb”
  – a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2, so greater associativity yields smaller miss rates
• How big is the difference?
  – a large 8-way set associative cache has about a 0% conflict miss rate, meaning an 8-way associative cache is about as good as a fully associative cache, so we don’t need associativity beyond 8-way
• Why use direct mapped at all?
  – associativity always raises the hit time; as we saw in an earlier example, the 2-way set associative cache was 35% slower than the direct-mapped one
  – remember that the clock cycle is usually set equal to the cache hit time, so we wind up slowing down the entire computer when using an associative cache of some kind
• So, with this in mind, should we use direct mapped or set associative?

Example: Impact of Associativity

• Assume higher associativity increases clock cycle time as follows:
  – clock 2-way = 1.36 * clock direct mapped
  – clock 4-way = 1.44 * clock direct mapped
  – clock 8-way = 1.52 * clock direct mapped
  – assume a 1-cycle hit time for the direct-mapped organization and a miss penalty of 25 cycles to an L2 cache that never misses
• Determine which associativity gives the best average memory access time = hit time + miss rate * miss penalty, for example:
  – 4 KB direct mapped: 1 + .098 * 25 = 3.45
  – 512 KB 8-way: 1.52 + .006 * 25 = 1.67

    Size (KB)   Direct   2-way   4-way   8-way
    4            3.44     3.25    3.22    3.28
    8            2.69     2.58    2.55    2.62
    16           2.23     2.40    2.46    2.53
    32           2.06     2.30    2.37    2.45
    64           1.92     2.14    2.18    2.25
    128          1.52     1.84    1.92    2.00
    256          1.32     1.66    1.74    1.82
    512          1.20     1.55    1.59    1.66

• In most cases, higher associativity means a higher access time, so direct mapped is often the best

Solution 4: Multilevel Caches

• To improve performance we would like both
  – a faster cache to keep pace with the processor
  – a larger cache to lower the miss rate
• Which should we pick? Both
  – a small but fast first-level cache (L1)
  – a larger but slower second-level cache (L2) on the chip
    • since this second cache is larger it will be somewhat slower; we make it slower still by adding some associativity to improve its hit rate
  – this gives a new formula for average memory access time:
    • avg memory access time = hit time L1 + miss rate L1 * miss penalty L1
    • miss penalty L1 = hit time L2 + miss rate L2 * miss penalty L2
    • avg memory access time = hit time L1 + miss rate L1 * (hit time L2 + miss rate L2 * miss penalty L2)
• We redefine miss rate for the second cache:
  – local miss rate = number of misses in this cache / number of memory accesses reaching this cache
  – global miss rate = number of misses in this cache / number of memory accesses overall
  – these two values are the same for the first-level cache

Impact of L2 Cache

• The local miss rate of the second cache will be larger than that of the first cache, since the first cache skims the “cream of the crop”
  – the second-level cache is only accessed when the first level misses entirely
  – the global miss rate is more useful than the local miss rate for the second cache: it tells us how many misses there are across all accesses
• For example:
  – assume that in 1000 references, L1 has 40 misses and L2 has 20
  – local (and global) miss rate of L1 = 40/1000 = 4%
  – local miss rate of L2 = 20/40 = 50%, global miss rate of L2 = 20/1000 = 2%
  – the local miss rate of L2 is misleading; the global miss rate indicates how both caches perform overall
  – L1 hit time is 1, L2 hit time is 10, memory access time is 200 cycles — what is the average memory access time?
  – avg memory access time = 1 + 4% * (10 + 50% * 200) = 5.4 cycles
  – without L2, avg memory access time = 1 + 4% * 200 = 9 cycles, so the L2 cache gives us a speedup of about 9 / 5.4 ≈ 1.7 (a code sketch of this two-level computation follows below)
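A minimal sketch of the two-level average-memory-access-time formula, using the numbers from the example above.

    #include <stdio.h>

    /* AMAT = hitL1 + missL1 * (hitL2 + local_missL2 * memory_time) */
    static double amat_two_level(double hit_l1, double miss_l1,
                                 double hit_l2, double local_miss_l2, double mem_time) {
        return hit_l1 + miss_l1 * (hit_l2 + local_miss_l2 * mem_time);
    }

    int main(void) {
        double with_l2    = amat_two_level(1.0, 0.04, 10.0, 0.50, 200.0); /* 5.4 cycles */
        double without_l2 = 1.0 + 0.04 * 200.0;                           /* 9.0 cycles */
        printf("with L2: %.1f cycles, without: %.1f cycles, speedup %.2f\n",
               with_l2, without_l2, without_l2 / with_l2);
        return 0;
    }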
Associativity for L2 Cache

• Here we see the benefit of an associative second-level cache over a direct-mapped one
  – compare direct-mapped versus 2-way set associative L2 caches, assuming the 2-way set associative cache is 10% slower:
    • direct-mapped L2: hit time = 10 cycles, local miss rate = 25%
    • 2-way set associative L2: hit time = 10.1 cycles, local miss rate = 20%
    • L2 miss penalty = 200 cycles
  – direct-mapped L2: L1 miss penalty = 10 + .25 * 200 = 60 cycles
  – 2-way set associative L2: L1 miss penalty = 10.1 + .20 * 200 = 50.1 cycles
    • in practice we round 10.1 and 50.1 up to 11 and 51 respectively, since the L2 cache is still governed by the system clock (whole cycles)

Solution 5: Priority of Reads over Writes

• Reads occur with much greater frequency than writes
  – we want to make the common case fast, so we would like reads to be faster than writes
  – writes are slower because of the need to write to both cache and main memory and to perform the tag check before the write can start (on a read, the tag check can be done in parallel — if the tag is wrong, we just cancel the remainder of the read)
• We might use a write buffer with either write policy
  – a write-through cache writes to the write buffer first, and read misses are given priority over draining the write buffer to memory
  – a write-back cache writes dirty blocks to the write buffer, and the buffer is written to memory only when we are sure there is no conflict with a read miss
  – in either case, a read miss first examines the write buffer before going on to memory, potentially saving time (see the sketch below)
• So read misses have priority over writes; since read misses are more common, this makes the common case fast
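A sketch of the write-buffer check described above; the buffer layout (a small array of address/value pairs scanned newest-first) is an assumption for illustration, not the organization of any particular processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4   /* e.g., a 4-word write buffer */

    typedef struct {
        bool     valid;
        uint32_t addr;
        uint32_t value;
    } wb_entry_t;

    static wb_entry_t write_buffer[WB_ENTRIES];

    /* On a read miss, scan the write buffer before going to memory: if a pending
       write matches the address, return its value and skip the memory access. */
    bool read_miss_check_write_buffer(uint32_t addr, uint32_t *value_out) {
        for (int i = WB_ENTRIES - 1; i >= 0; i--) {       /* newest entries last */
            if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                *value_out = write_buffer[i].value;
                return true;   /* serviced from the buffer */
            }
        }
        return false;          /* not buffered; proceed to memory */
    }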
Solution 6: Avoid Address Translation

• The CPU generates an address and sends it to the cache
  – but the address generated by the CPU is a logical (virtual) address, not the physical address in memory; the two differ because of paging and virtual memory
  – caches normally store items based on their physical address (for instance, the line number is derived from the physical address, not the virtual address)
  – to obtain the physical address, the virtual address must first be translated using the TLB (or the page table if the entry isn’t in the TLB)
  – thus any memory access actually takes at least two lookups (TLB then cache), and we want to avoid this
• If we store virtual addresses in the cache we can skip the translation, but there are problems with this approach:
  – address translation includes protection checks to make sure a generated address is not touching some other process’ portion of memory (such as a user process touching the OS)
    • a solution is to copy the protection information into the cache and check it on every cache access
  – if a process is switched out of memory, the cache must be flushed
    • unless we add process id information to the tags, which like the previous solution means using more cache space for non-data

Another Solution

• A very simple way to avoid translating before the cache lookup is to make sure the line number bits are identical in the virtual and physical addresses
• Consider for instance a memory system with
  – 4G words (32-bit addresses, word-level rather than byte-level addressing)
  – a block size of 16 words
  – a 16K-word on-chip cache, which stores 1024 blocks and so needs 10 bits of index
  – so the address format is 18 bits of tag, 10 bits of line number, and 4 bits of word offset
  – assume a page size of 16K words, so the page offset is 14 bits and the page number is 18 bits
  – paging swaps the 18-bit page number for an 18-bit frame number, so the virtual address’ page offset occupies exactly the same bits as the physical address’ line number and word offset, and no address translation is necessary before the cache access can begin
• For this to work, the page offset must be at least as wide as the cache index plus block offset (equivalently, cache size / associativity <= page size)

Solution 7: Small and Simple Caches

• A cache access (for anything but a fully associative cache) requires
  – using the index part of the address to find the appropriate line in the cache
  – then comparing tags to see whether the entry is the right one
    • the tag comparison can be time consuming, especially with associative caches that have large tags or set associative caches where the comparisons need extra hardware to run in parallel
• Two ways to keep the cache fast are:
  – use direct-mapped caches
  – keep the tags on the chip for a quick tag check but move the data off the chip
    • this latter approach is sometimes used for L2 caches, but not L1
• Another idea is to use an L2 cache small enough to fit on the processor, to keep L1 miss penalties down
  – L1 cache sizes have not increased lately, but small L2 caches can now fit on the chip

Solution 8: Way Prediction

• The direct-mapped cache offers a faster access time than a 2-way set associative cache because it avoids the multiplexor that selects between the two ways
• Assume that an access to one way will be followed by another access to the same way
  – this prediction is called “way prediction” and can speed up access to a 2-way set associative cache by keeping a prediction bit per set
  – the bit is toggled when we have a misprediction and need to switch to the other way
• So we get the lower miss rate of the 2-way set associative cache and, whenever we predict correctly, the lower hit time of a direct-mapped cache
  – simulations indicate a prediction accuracy of 85% or more
  – the Pentium IV uses way prediction

Variation: Pseudo-Associative Cache

• We can give a direct-mapped cache some associativity as follows:
  – consult the direct-mapped cache as normal, which provides a fast hit time
  – on a miss, invert part of the address and try that second location (for instance, flip one bit of the line number)
  – the second access costs a higher hit time for that second attempt (and may also stall other accesses while it is being performed!)
  – thus the same address might be stored in one of two locations, giving some associativity
  – the pseudo-associative cache reduces conflict misses: a would-be miss may still become a hit
    • the first check is fast (the hit time of a direct-mapped cache)
    • the second check takes 1–2 cycles more, which is still faster than going to a second-level cache
  – a lookup sketch follows below
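A sketch of the two-probe lookup just described; the cache structure, its size, and the choice of which index bit to flip are assumptions for illustration (here the full block address is stored rather than just the tag, so the two candidate locations cannot be confused).

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 1024           /* direct-mapped cache with 1024 lines (assumed) */

    typedef struct { bool valid; uint32_t block_addr; } line_t;
    static line_t cache[NUM_LINES];

    /* Probe the primary line first; on a miss, probe the alternate line obtained by
       flipping the most significant index bit. Returns 1 = fast hit, 2 = slower
       second-probe hit, 0 = miss (go to the next level). */
    int pseudo_assoc_lookup(uint32_t block_addr) {
        uint32_t index = block_addr % NUM_LINES;
        if (cache[index].valid && cache[index].block_addr == block_addr)
            return 1;                                /* direct-mapped hit time */
        uint32_t alt = index ^ (NUM_LINES >> 1);     /* flip the MSB of the index */
        if (cache[alt].valid && cache[alt].block_addr == block_addr)
            return 2;                                /* costs 1-2 extra cycles */
        return 0;
    }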
Example

• Which organization provides the fastest average memory access time for 4 KB and 256 KB caches: direct mapped, 2-way set associative, or pseudo-associative (PAC)?
• Assume a hit time of 1 cycle for direct mapped and 1.36 cycles for 2-way set associative, a miss penalty of 50 cycles, and a 3-cycle hit time for the PAC’s second access
  – we alter the formula for the PAC because a miss does not necessarily accrue the 50-cycle penalty; it costs only 3 cycles if the item is found in the alternate position
  – we need two hit rates: the normal hit rate and the rate of finding the item in the second position (call this the alternative hit rate)
    • alternative hit rate = hit rate 2-way – hit rate 1-way = (1 – miss rate 2-way) – (1 – miss rate 1-way) = miss rate 1-way – miss rate 2-way
• Average memory access time for the PAC =
  – 1 + (miss rate 1-way – miss rate 2-way) * 3 + miss rate 2-way * miss penalty

Solution

• PAC, 4 KB: 1 + (.098 – .076) * 3 + (.076 * 50) = 4.866
• PAC, 256 KB: 1 + (.013 – .012) * 3 + (.012 * 50) = 1.603
• Direct mapped, 4 KB: 1 + .098 * 50 = 5.9
• Direct mapped, 256 KB: 1 + .013 * 50 = 1.65
• 2-way, 4 KB: 1.36 + .076 * 50 = 5.16
• 2-way, 256 KB: 1.36 + .012 * 50 = 1.96
• So the pseudo-associative cache outperforms both! (a code sketch follows below)
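A sketch of the three formulas used in the solution above, with the miss rates from the example.

    #include <stdio.h>

    int main(void) {
        double penalty = 50.0, pac_second_hit = 3.0;
        double mr1[2] = {0.098, 0.013};   /* direct-mapped miss rates: 4 KB, 256 KB */
        double mr2[2] = {0.076, 0.012};   /* 2-way miss rates:         4 KB, 256 KB */
        const char *sizes[2] = {"4KB", "256KB"};
        for (int i = 0; i < 2; i++) {
            double direct = 1.0  + mr1[i] * penalty;
            double twoway = 1.36 + mr2[i] * penalty;
            double pac    = 1.0  + (mr1[i] - mr2[i]) * pac_second_hit + mr2[i] * penalty;
            printf("%5s: direct %.3f, 2-way %.3f, PAC %.3f cycles\n",
                   sizes[i], direct, twoway, pac);
        }
        return 0;
    }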
Solution 9: Trace Caches

• A trace cache is an instruction cache that supports multiple issue by providing 4 or more independent instructions per cycle
  – cache blocks are dynamic, unlike normal caches where blocks are static copies of what is stored in memory
  – here a block is formed around branch prediction, branch folding, and trace scheduling
  – because of branch folding and trace scheduling, some instructions may appear multiple times in the cache, so it is somewhat more wasteful of cache space
• This type of cache directly supports a multiple-issue architecture
  – the Pentium 4 uses this approach, but most RISC processors do not, because the repetition of instructions and the high frequency of branches cause it to waste too much cache space

Solution 10: Pipelined Cache Accesses

• We can pipeline writes to the cache, since writes take longer than reads
  – the longer duration is because the tag check must finish before the write can begin, whereas a read can commence immediately and the item discarded if the tag turns out to be wrong
• The write takes two steps — tag comparison first, followed by the write (a write-back cache might add a third step that combines items in a buffer)
  – by pipelining writes we can partially speed up the process: the tag check of one write overlaps the data-write portion of the previous write (assuming the tags are correct)
  – in this way, the second write takes the same time as a read would
  – although this only helps with more than 1 consecutive write where all the writes are cache hits

Example

• Here we see the impact of pipelining cache accesses
  – the advantage is that it lets us reduce the clock cycle time; the disadvantage is that with a shorter clock, cache misses cost more cycles
  – compare the MIPS 5-stage pipeline vs. the MIPS R4000 8-stage pipeline
    • assume clock rates of 1 GHz for the MIPS and 1.8 GHz for the MIPS R4000
    • main memory access time is 50 ns (assume no second-level cache) and the cache miss rate is 5%
• Assuming no other source of stalls, which machine is faster?
  – MIPS miss penalty = 50 ns / (1 / 1) ns = 50 cycles
  – MIPS R4000 miss penalty = 50 ns / (1 / 1.8) ns = 90 cycles
  – CPU time MIPS = (1 + .05 * 50) * 1 ns = 3.5 ns per instruction
  – CPU time MIPS R4000 = (1 + .05 * 90) * (1 / 1.8) ns = 3.06 ns per instruction
• So the gain in clock speed from pipelining cache accesses more than offsets the increased miss penalty
  – to truly see whether this is advantageous, we would also have to factor in structural hazards and branch penalties: the longer the pipeline, the greater their impact (a code sketch of this comparison follows below)
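A small sketch of the comparison above; clock rates in GHz and memory time in ns are the values given in the example.

    #include <stdio.h>

    /* time per instruction (ns) = (1 + miss_rate * miss_penalty_cycles) * cycle_time_ns */
    static double time_per_instr(double clock_ghz, double mem_ns, double miss_rate) {
        double cycle_ns = 1.0 / clock_ghz;
        double penalty_cycles = mem_ns / cycle_ns;    /* 50 or 90 cycles */
        return (1.0 + miss_rate * penalty_cycles) * cycle_ns;
    }

    int main(void) {
        printf("MIPS  (5-stage, 1 GHz):   %.2f ns/instr\n", time_per_instr(1.0, 50.0, 0.05)); /* 3.50 */
        printf("R4000 (8-stage, 1.8 GHz): %.2f ns/instr\n", time_per_instr(1.8, 50.0, 0.05)); /* 3.06 */
        return 0;
    }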
Solution 11: Non-Blocking Caches

• When an ordinary cache misses and must retrieve the needed block from memory, it cannot respond to other requests
  – a more expensive alternative is a non-blocking cache, which continues to serve CPU requests while waiting for memory, as long as the new requests are to blocks that are present — sometimes referred to as “hit under miss”
  – this capability is essential for out-of-order execution architectures like Tomasulo’s, so that the cache does not cause stalls
  – a variation is a cache that permits hits to continue in spite of multiple outstanding misses (for this to make sense, memory must also be able to service multiple requests)
  – see figure 5.5 on page 297, which shows how various benchmarks perform with hit under 1 miss, hit under 2 misses, and hit under 64 misses
  – the non-blocking cache is also important for implementing some of the optimizations that follow

Solution 12: Multi-banked Caches

• A common memory optimization is to have multiple memory banks, each of which can be accessed independently
  – this allows parallel access to memory, either by different devices accessing memory simultaneously or by accessing multiple words of the same block simultaneously
• We can use the same idea in our caches by interleaving cache addresses across banks
  – the L2 of the AMD Opteron uses 2 banks; the Sun Niagara uses 4 banks
  – to make this easy to implement, blocks are interleaved across banks: for a 4-bank cache, blocks with block address mod 4 == 0 go in bank 0, blocks with block address mod 4 == 1 go in bank 1, and so on
• When would this approach be advantageous?
  – we could use it to implement a form of non-blocking cache where the other banks remain accessible during a miss (giving hit under miss in 3 out of 4 cases), more cheaply than a truly non-blocking cache

Solution 13: Early Restart

• On a cache miss, the memory system moves a block into the cache
  – moving a full block requires many memory accesses and bus transfers
• Rather than having the cache (and CPU) wait until the entire block is available, forward the requested word first so the cache access can complete as soon as that word arrives, and transfer the rest of the block in parallel with that access
• This relies on two ideas:
  – early restart: the cache transmits the requested word as soon as it arrives from memory
  – critical word first: memory returns the requested word first and the remainder of the block afterward (also known as wrapped fetch)
  – note: we could implement early restart without critical word first, but the improvement then depends on which word was needed — if the last word of the block is requested, early restart does nothing for us
• This optimization requires a non-blocking cache, since the cache must start responding to the request after only the first word has returned

Solution 14: Merging Write Buffer

• Recall that a write-through cache may use a write buffer to make memory accesses more efficient
  – the write buffer holds multiple items waiting to be written to memory
  – writes to memory are postponed until either the buffer is full or a modified line is being discarded
• We can make the write buffer more efficient as follows:
  – organize the write buffer in rows, one row per cache line
  – multiple writes to the same line are merged into the same buffer row
  – a write to memory then moves an entire block from the buffer, reducing the number of memory writes

Solution 15: Compiler Optimizations

• We have already seen that compiler optimizations can be used to improve hardware performance
• Compiler optimizations can improve cache performance by arranging array accesses so that they favor the lines currently in the cache. Specific techniques include (see the sketch after this list):
  – merging parallel arrays into an array of records, so that the fields of one element sit in consecutive memory locations and thus (hopefully) the same cache line
  – loop interchange: exchange the loops of a nested loop so that array elements are accessed in the order they appear in memory rather than the programmer-prescribed order
  – loop fusion: combine loops that access the same array locations so that all of those accesses are made within one iteration
  – blocking: execute code on one part of the array before moving on to another part, so array elements do not need to be reloaded into the cache
    • this is common for applications like image processing, where several different passes are made over a matrix
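As an illustration of loop interchange (one of the techniques listed above), here is a sketch: C stores arrays row-major, so iterating with the column index innermost walks memory sequentially and uses each cache line fully before it is evicted. The array size and function names are arbitrary choices for the example.

    #define N 512

    double x[N][N];

    /* Before: the inner loop strides through memory N elements at a time,
       touching a different cache line on nearly every access. */
    void scale_column_major(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] = 2.0 * x[i][j];
    }

    /* After loop interchange: the inner loop visits consecutive elements of a
       row, so each cache line brought in is fully used before eviction. */
    void scale_row_major(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2.0 * x[i][j];
    }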
Solution 16: Hardware Prefetching

• Prefetching can be controlled by hardware or by the compiler (see the next slide)
  – prefetching can operate on instructions, data, or both
  – one simple idea for instruction prefetching: on an instruction miss to block i, fetch block i and then also fetch block i + 1
    • this may be controlled by hardware outside of the cache
    • the second block is kept in an instruction stream buffer rather than moved into the cache
    • if a requested instruction is in the instruction stream buffer, the cache access is cancelled (and thus a potential miss is cancelled)
  – if prefetching places multiple blocks into the cache itself, the cache must be non-blocking
• See figure 5.10 on page 306, which shows the speedup of many SPEC 2000 benchmarks (mostly FP) when hardware prefetching is turned on (speedups range from 1.16 to 1.97)

Solution 17: Compiler Prefetching

• Another idea is to have the compiler insert prefetch instructions into the code (this is for data prefetching only)
• Consider the loop:

    for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];

  – with an 8 KB direct-mapped data cache with 16-byte blocks, and each element of a and b being an 8-byte double-precision float, we get 150 misses for array a and 101 misses for array b
  – by scheduling the code with prefetch instructions, we can reduce the misses
• The new loop becomes:

    for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);    /* prefetch for 7 iterations later */
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }

  – this new code has only 19 misses, improving performance to 6.2 times faster
  – see pages 307–308 for the rest of the analysis of this problem

Solution 18: Victim Caches

• Misses may arise when cache lines conflict with each other — one line is discarded for another, only to find the discarded line is needed again soon
• The victim cache is a small, fully associative cache placed between the cache and memory
  – it might hold 1–5 blocks, and it stores only blocks that are discarded from the cache when a miss occurs
  – the victim cache is checked on a miss before going on to main memory; if the block is found there, the block in the cache and the block in the victim cache are swapped
• The victim cache is most useful backing up a fast direct-mapped cache, adding some associativity to reduce the direct-mapped cache’s conflict miss rate
  – a 4-entry victim cache might remove 1/4 of the misses from a 4 KB direct-mapped data cache
  – the AMD Athlon uses an 8-entry victim cache
  – a lookup sketch follows below
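A sketch of the victim-cache check described above; the structures are simplified (block addresses only, no data) and the sizes are the ones mentioned on the slide.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAIN_LINES   64   /* e.g., a 4 KB direct-mapped cache of 64-byte blocks */
    #define VICTIM_LINES 4    /* small fully associative victim cache */

    typedef struct { bool valid; uint32_t block_addr; } line_t;
    static line_t cache[MAIN_LINES];
    static line_t victim[VICTIM_LINES];

    /* Returns 0 = hit in main cache, 1 = hit in victim cache (blocks swapped),
       2 = miss in both (must go on to the next level). */
    int lookup(uint32_t block_addr) {
        uint32_t idx = block_addr % MAIN_LINES;
        if (cache[idx].valid && cache[idx].block_addr == block_addr)
            return 0;
        for (int i = 0; i < VICTIM_LINES; i++) {
            if (victim[i].valid && victim[i].block_addr == block_addr) {
                line_t tmp = cache[idx];     /* swap: the evicted line becomes the new victim */
                cache[idx] = victim[i];
                victim[i]  = tmp;
                return 1;
            }
        }
        return 2;
    }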
Cache Optimization Summary

• Hardware complexity ranges from 0 (cheapest/easiest) to 3 (most expensive/hardest); “*” in the comments means widely used; “+” means the technique improves that factor, “–” means it hurts it

    Technique                        Penalty  BW   Rate  Hit  HW   Comments
    Larger block sizes                  –           +          0   Trivial; *
    Larger caches                                   +     –    1   *, especially L2 caches
    Higher associativity                            +     –    1   *
    Multilevel caches                   +                       2   *, costly; difficult if L1 block size != L2 block size
    Read miss priority over writes      +                       1   *
    No address translation                                +    1   *
    Small/simple caches                                   +    0   *, trivial
    Way prediction                                        +    1   Pentium IV
    Pseudo-associative cache                        +           2   Found in RISC

    Technique (continued)            Penalty  BW   Rate  Hit  HW   Comments
    Trace cache                                +               3   Pentium IV
    Pipelined cache accesses                   +               1   *
    Nonblocking caches                  +      +               3   Used with all out-of-order CPUs
    Multibanked caches                         +               1   For L2 caches
    Early restart                       +                      2   *
    Merging write buffer                +                      1   * for write-through caches
    Compiler techniques                             +          0   Software is challenging
    Hardware prefetching                +           +          2 (instr) or 3 (data)   * for instr; P4 & Opteron for data
    Compiler prefetching                +           +          3   *, needs nonblocking cache
    Victim caches                       +           +          2   AMD Athlon

Sample Problem #1

• A computer uses a fully associative write-back data cache; the block size is 64 bytes

    for (i = 1; i < 1024; i++) {
        x[i] = b[i] * y;      // S1
        c[i] = x[i] + z;      // S2
    }

• Given the code above, assume x[1024], b[1024], and c[1024] are all double-precision floating point arrays, with arrays b and c already in cache but not x
  – moving elements of x into the cache will not cause elements of b or c to be moved out of cache
  – x[0] is stored at the beginning of a block
• Questions:
  – a. how many misses arise with respect to accessing x if the cache uses a no-write allocate policy?
  – b. how many misses arise with respect to accessing x if the cache uses a write allocate policy?
  – c. redo part a assuming that statement S2 comes before statement S1
  – d. redo part b assuming that statement S2 comes before statement S1

Solution

• Each array stores 1024 doubles; a cache block stores 64 bytes, so exactly 8 array elements fit in each block
• Misses arise only from accesses to array x, since b and c are already in the cache
• No-write allocate: the write in S1 does not load x into the cache, so a write miss in S1 is followed by a read miss in S2
  – each read miss brings in a block holding 8 array elements, so a read miss occurs for 1 in 8 elements: 1024 / 8 = 128 read misses
  – each of those blocks also suffered a write miss first, giving a total of 256 misses
• Write allocate: the write miss in S1 loads the block into the cache, so there is no corresponding read miss in S2, for a total of 128 misses
• Reversing the order of S1 and S2 makes the read miss happen first, so no matter which write policy is used we have 128 read misses and no further write misses

Sample Problem #2

• Memory is organized as follows:
  – two on-chip caches (one data, one instruction)
  – off-chip cache
  – main memory
  – disk cache
  – disk (swap space)
• Assume miss rates and access times of:
  – data cache: 5%, 1 clock cycle
  – instruction cache: 1%, 1 clock cycle
  – off-chip cache: 10%, 10 clock cycles
  – main memory: 0.2%, 100 clock cycles
  – disk cache: 20%, 1000 clock cycles
  – swap space: 0%, 250000 clock cycles
• If 40% of all instructions are loads or stores, what is the effective memory access time for this machine?

Solution

• Average memory access time =
  – %instruction accesses * (hit time instr cache + miss rate instr cache * (hit time off-chip cache + miss rate off-chip cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))
  – + %data accesses * (hit time data cache + miss rate data cache * (hit time off-chip cache + miss rate off-chip cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))
• With 1.4 memory accesses per instruction, the percentage of instruction accesses = 1.0 / 1.4 = 71.4% and the percentage of data accesses = 0.4 / 1.4 = 28.6%
• Average memory access time =
  – 71.4% * (1 + .01 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000)))) + 28.6% * (1 + .05 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000)))) = 1.647 clock cycles
• A code sketch of this calculation follows below
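A short sketch of the nested calculation in Sample Problem #2, using the miss rates and access times from the problem statement.

    #include <stdio.h>

    int main(void) {
        /* access times (cycles) and miss rates from the problem statement */
        double l2 = 10, mem = 100, disk_cache = 1000, swap = 250000;
        double mr_instr = 0.01, mr_data = 0.05, mr_l2 = 0.10, mr_mem = 0.002, mr_dc = 0.20;

        /* penalty seen below the on-chip caches (identical for both of them) */
        double below = l2 + mr_l2 * (mem + mr_mem * (disk_cache + mr_dc * swap));

        double instr_time = 1.0 + mr_instr * below;   /* on-chip hit time is 1 cycle */
        double data_time  = 1.0 + mr_data  * below;

        /* 1.4 accesses per instruction: 1 instruction fetch + 0.4 data accesses */
        double frac_instr = 1.0 / 1.4, frac_data = 0.4 / 1.4;
        printf("effective access time = %.3f cycles\n",
               frac_instr * instr_time + frac_data * data_time);   /* ~1.647 */
        return 0;
    }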