CS324: Computer Architecture
Lecture 14: The Memory Hierarchy: Caches


Memory Hierarchy

Storage in computer systems:
– Processor: holds data in the register file (~100 bytes); registers are accessed on a nanosecond timescale.
– Memory (we'll call it "main memory"): more capacity than registers (~GBytes); access time ~50-100 ns; hundreds of clock cycles per memory access?!
– Disk: HUGE capacity (virtually limitless); VERY slow: runs in ~milliseconds.

Motivation: Why We Use Caches (written $)

[Chart: processor vs. memory performance, 1980-2000. CPU (µProc) performance grows ~60%/yr while DRAM performance grows ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
– 1989: first Intel CPU with a cache on chip.
– 1998: Pentium III has two levels of cache on chip.

Memory Caching

– The mismatch between processor and memory speeds leads us to add a new level: a memory cache.
– The cache is implemented with the same IC processing technology as the CPU (usually integrated on the same chip): faster but more expensive than DRAM memory.
– The cache is a copy of a subset of main memory.
– Most processors have separate caches for instructions and data.

Memory Hierarchy

[Figure: pyramid of levels, with the processor at the top and Level 1, Level 2, Level 3, ..., Level n below it. Increasing distance from the processor means decreasing speed and increasing size at each level.]
As we move to deeper levels, the latency goes up and the price per bit goes down.

If a level is closer to the processor, it is:
– smaller
– faster
– a subset of the lower levels (it contains the most recently used data)

The lowest level (usually disk) contains all available data (or does it go beyond the disk?). The memory hierarchy presents the processor with the illusion of a very large, very fast memory.

How is the hierarchy managed?

– Registers <-> memory: by the compiler (programmer?)
– Cache <-> memory: by the hardware
– Memory <-> disk: by the hardware and operating system (virtual memory), and by the programmer (files)

Memory Hierarchy Analogy: Library

You're writing a term paper at a table in the Dwight Library.
– The library is equivalent to disk: essentially limitless capacity, but very slow to retrieve a book.
– The table is main memory: smaller capacity, which means you must return a book when the table fills up, but it's easier and faster to find a book there once you've already retrieved it.
– The open books on the table are the cache: smaller capacity still (very few open books fit on the table; again, when the table fills up, you must close a book), but much, much faster to retrieve data.

Illusion created: the whole library is open on the tabletop.
– Keep as many recently used books open on the table as possible, since you're likely to use them again.
– Also keep as many books on the table as possible, since that's faster than going to the stacks.

Why the Hierarchy Works

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference plotted over the address space, from 0 to 2^n - 1.]

Memory Hierarchy Basis

– The cache contains copies of data in memory that are being used.
– Memory contains copies of data on disk that are being used.
– Caches work on the principles of temporal and spatial locality:
  – Temporal locality: if we use it now, chances are we'll want to use it again soon.
  – Spatial locality: if we use a piece of memory, chances are we'll use the neighboring pieces soon.
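Both flavors of locality show up directly in ordinary code. Below is a minimal C sketch (the array size and loop structure are illustrative, not from the slides): the row-major traversal touches consecutive bytes, so each block brought into the cache is fully used before eviction, while the column-major traversal strides across rows and wastes most of each fetched block.

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static int grid[ROWS][COLS];

    int main(void) {
        long sum = 0;

        /* Row-major order: consecutive iterations touch adjacent bytes,
           so each cache block fetched from memory is fully used
           (spatial locality); sum stays in a register (temporal locality). */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += grid[i][j];

        /* Column-major order: consecutive iterations are COLS*sizeof(int)
           bytes apart, so most of each fetched block goes unused before
           it is evicted -- typically far more cache misses. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += grid[i][j];

        printf("%ld\n", sum);
        return 0;
    }

Both loops compute the same sum; only the access order, and therefore the cache behavior, differs.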
Memory Hierarchy: Terminology

Hit: the data appears in some block in the upper level (e.g., Block X).
– Hit rate: the fraction of memory accesses found in the upper level.
– Hit time: time to access the upper level, which consists of the RAM access time + the time to determine hit/miss.

Miss: the data must be retrieved from a block in a lower level (e.g., Block Y).
– Miss rate = 1 - hit rate.
– Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.

Hit time << miss penalty. Why?
[Figure: the processor exchanges words with an upper-level memory holding Blk X; the upper level exchanges blocks with a lower-level memory holding Blk Y.]

Memory Hierarchy of a Modern Computer System

The processor (control, datapath, registers, on-chip cache) is backed by a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk):

  Level                        Speed (ns)                 Size (bytes)
  Registers                    1s                         100s
  Caches (on-chip, SRAM L2)    10s                        Ks
  Main memory (DRAM)           100s                       Ms
  Secondary storage (disk)     10,000,000s (10s ms)       Gs
  Tertiary storage (disk)      10,000,000,000s (10s sec)  Ts

By taking advantage of the principle of locality, the hierarchy can:
– present the user with as much memory as is available in the cheapest technology, and
– provide access at the speed offered by the fastest technology.

Cache Design

– How do we organize the cache?
– Where does each memory address map to? (Remember that the cache is a subset of memory, so multiple memory addresses map to the same cache location.)
– How do we know which elements are in the cache?
– How do we quickly locate them?

Direct-Mapped Cache

– There is exactly one location to which a given memory block can be moved.
– Mapping: address modulo the number of blocks in the cache.
[Figure: an 8-block cache (indices 000-111); each memory address maps to the cache index given by its low-order 3 bits, e.g., addresses 00001, 01001, 10001, 11001 all map to index 001.]

– Each memory address maps to only one block in the cache.
– The block is the unit of transfer between cache and memory.
– We need look in only a single location in the cache for the data, if it exists in the cache.
[Figure: memory addresses 0-F mapped onto a 4-entry cache (indices 0-3).]
– Cache location 0 can be occupied by data from memory locations 0, 4, 8, ...; in general, from any memory location that is a multiple of 4.

Issues with Direct Mapping

– Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
– What if we have a block size > 1 byte?
– Result: divide the memory address into three fields:

    ttttttttttttttttt iiiiiiiiii oooo
    tag: to check that we have the correct block (given the possible blocks for this location)
    index: to select the block (row) of the cache
    offset: the byte offset within the block

Direct-Mapped Cache Terminology

– All fields are read as unsigned integers.
– Index: specifies the cache index (which "row" of the cache we should look in).
– Offset: once we've found the correct block, specifies which byte within the block we want.
– Tag: the remaining bits after the offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location.

Direct-Mapped Cache Example

Suppose we have a 16 KB direct-mapped cache with 4-word blocks. Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture. (The mapping rule is sketched in code below; the field sizes are worked out after it.)
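Before working the sizing exercise, the modulo mapping rule can be made concrete in a few lines of C. This is a sketch with a toy geometry (8 blocks of 4 bytes, not the 16 KB cache of the exercise; all names are mine, not from the lecture):

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative geometry: 8 blocks of 4 bytes each. */
    #define BLOCK_SIZE 4u
    #define NUM_BLOCKS 8u

    /* Direct-mapped placement: a block can live in exactly one slot,
       chosen as (block number) modulo (number of blocks in the cache). */
    static uint32_t cache_index(uint32_t addr) {
        uint32_t block_number = addr / BLOCK_SIZE;
        return block_number % NUM_BLOCKS;
    }

    int main(void) {
        /* Addresses 0x00, 0x20, 0x40 all map to index 0, so they
           compete for the same cache slot. */
        printf("%u %u %u\n",
               cache_index(0x00), cache_index(0x20), cache_index(0x40));
        return 0;
    }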
Offset
– Need to specify the correct byte within a block.
– A block contains 4 words = 16 bytes = 2^4 bytes.
– So we need 4 bits to specify the correct byte.

Index
– Need to specify the correct row in the cache.
– The cache contains 16 KB = 2^14 bytes; a block contains 2^4 bytes (4 words).
– # rows/cache = # blocks/cache (since there's one block per row)
  = (bytes/cache) / (bytes/row) = 2^14 bytes/cache / 2^4 bytes/row = 2^10 rows/cache.
– So we need 10 bits to specify this many rows.

Tag
– Use the remaining bits as the tag.
– Tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits.
– So the tag is the leftmost 18 bits of the memory address.

Accessing Data in a Direct-Mapped Cache

Example: the same 16 KB direct-mapped cache with 4-word blocks. Read 4 addresses:
– 0x00000014, 0x0000001C, 0x00000034, 0x00008014

Memory values (only the cache/memory level of the hierarchy is considered):

  Address (hex)  Value of word
  00000010       a
  00000014       b
  00000018       c
  0000001C       d
  00000030       e
  00000034       f
  00000038       g
  0000003C       h
  00008010       i
  00008014       j
  00008018       k
  0000801C       l

The 4 addresses, divided (for convenience) into tag, index, and byte-offset fields:

  Tag                 Index       Offset
  000000000000000000  0000000001  0100    (0x00000014)
  000000000000000000  0000000001  1100    (0x0000001C)
  000000000000000000  0000000011  0100    (0x00000034)
  000000000000000010  0000000001  0100    (0x00008014)

As we go through these accesses in this cache (16 KB, direct-mapped, 4-word blocks), we will see three types of events:
– cache miss: nothing is in the cache in the appropriate block, so fetch it from memory;
– cache hit: the cache block is valid and contains the proper address, so read the desired word;
– cache miss, block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory.

16 KB Direct-Mapped Cache, 16-byte Blocks

A valid bit determines whether anything is stored in a row (when the computer is first turned on, all entries are invalid). Each of the 1024 rows (indices 0-1023) holds a valid bit, a tag, and four words (bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f); initially every valid bit is 0.

Read 0x00000014 = tag 0...00, index 0...001, offset 0100:
– The index field says to read cache block 1.
– Block 1 has no valid data, so this is a miss.
– Load the block containing 0x00000014 (words a, b, c, d) into row 1, setting the tag and valid bit.
– Read from the cache at offset 0x4 and return word b.

(The remaining three reads are traced below, after the sketch.)
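First, a small C sketch that checks the field split just derived (the struct and function names are mine, not from the lecture): offset = address bits 3-0, index = bits 13-4, tag = bits 31-14.

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 4   /* 16-byte blocks -> 4 offset bits */
    #define INDEX_BITS 10   /* 1024 rows      -> 10 index bits */
    /* tag = 32 - 10 - 4 = 18 bits */

    typedef struct { uint32_t tag, index, offset; } fields;

    static fields split(uint32_t addr) {
        fields f;
        f.offset = addr & ((1u << OFFSET_BITS) - 1);
        f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        return f;
    }

    int main(void) {
        uint32_t addrs[] = {0x00000014, 0x0000001C, 0x00000034, 0x00008014};
        for (int i = 0; i < 4; i++) {
            fields f = split(addrs[i]);
            printf("0x%08X -> tag=%u index=%u offset=0x%X\n",
                   addrs[i], f.tag, f.index, f.offset);
        }
        return 0;
    }

Running it prints tags 0, 0, 0, 2; indices 1, 1, 3, 1; and offsets 0x4, 0xC, 0x4, 0x4, matching the field breakdown above.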
Read 0x0000001C = tag 0...00, index 0...001, offset 1100:
– The index field again selects block 1.
– The data is valid and the tag matches, so read at offset 0xC and return word d: a cache hit.

Read 0x00000034 = tag 0...00, index 0...011, offset 0100:
– The index field says to read cache block 3.
– Block 3 has no valid data, so this is a miss.
– Load that cache block (words e, f, g, h) into row 3, set the tag and valid bit, and return word f.

Read 0x00008014 = tag 0...10, index 0...001, offset 0100:
– The index field says to read cache block 1. The data there is valid, but the tag does not match (0 != 2).
– This is a miss, so replace block 1 with the new data (words i, j, k, l) and the new tag (2), and return word j.

Direct-Mapped Cache (for MIPS)

[Figure: a MIPS direct-mapped cache with 1024 one-word entries. Address bits 31-12 (20 bits) form the tag, bits 11-2 (10 bits) the index, and bits 1-0 the byte offset; a hit is signaled when the indexed entry is valid and its stored tag matches the address tag, and the 32-bit data word is returned.]
What kind of locality are we taking advantage of?

Direct-Mapped Cache: Taking Advantage of Spatial Locality

[Figure: a direct-mapped cache with 4K entries of four words each. Address bits 31-16 (16 bits) form the tag, bits 15-4 (12 bits) the index, bits 3-2 the block offset (selecting one of the four 32-bit words via a multiplexor), and bits 1-0 the byte offset; each entry holds a valid bit, a 16-bit tag, and 128 bits of data.]
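The whole four-access walkthrough can be replayed with a toy simulator that tracks only valid bits and tags (a sketch: it models the 16 KB, 16-byte-block cache above but stores no data, and all names are mine):

    #include <stdio.h>
    #include <stdint.h>

    #define ROWS 1024        /* 16 KB cache / 16-byte blocks */

    static uint8_t  valid[ROWS];
    static uint32_t tags[ROWS];

    /* Simulate one read in a 16 KB direct-mapped cache with 16-byte
       blocks; report a hit, a cold miss, or a miss with replacement. */
    static void read_addr(uint32_t addr) {
        uint32_t index = (addr >> 4) & (ROWS - 1);
        uint32_t tag   = addr >> 14;

        if (!valid[index])
            printf("0x%08X: miss (row %u invalid), load block\n", addr, index);
        else if (tags[index] != tag)
            printf("0x%08X: miss (row %u tag %u != %u), replace block\n",
                   addr, index, tags[index], tag);
        else
            printf("0x%08X: hit (row %u)\n", addr, index);

        valid[index] = 1;
        tags[index]  = tag;
    }

    int main(void) {
        /* The four reads from the slides. */
        uint32_t addrs[] = {0x00000014, 0x0000001C, 0x00000034, 0x00008014};
        for (int i = 0; i < 4; i++)
            read_addr(addrs[i]);
        return 0;
    }

It reports a cold miss, a hit, another cold miss, and a miss with block replacement, matching the trace above.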
Hits vs. Misses

Read hits:
– This is what we want!

Read misses:
– Stall the CPU, fetch the block from memory, deliver it to the cache, and restart.

Write hits:
– Can replace the data in cache and memory (write-through), or
– write the data only into the cache and write it back to memory later (write-back).

Write misses:
– Read the entire block into the cache, then write the word.

Handling Cache Misses

Data:
– Stall the CPU and freeze all register contents.
– Wait until the data arrives (write the data to the cache).
– Restart the instruction at the cycle that caused the miss.

Instructions:
– Send PC - 4 to the memory.
– Wait for memory to complete the read.
– Write the data into the cache.
– Restart the instruction at the first step, this time finding it in the cache.

Block Size Tradeoff

Benefits of a larger block size:
– Spatial locality: if we access a given word, we're likely to access other nearby words soon (another Big Idea).
– Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well.
– Works nicely in sequential array accesses too.
But miss rate is NOT the only cache performance metric...

Drawbacks of a larger block size:
– A larger block size means a larger miss penalty: on a miss, it takes longer to load the new block from the next level.
– If the block size is too big relative to the cache size, then there are too few blocks; the result is that the miss rate goes up.

In general, minimize

  Average Access Time = Hit Time x Hit Rate + Miss Penalty x Miss Rate

(a worked sketch of this formula appears at the end of the lecture), where:
– Hit time = time to find and retrieve data from the current-level cache.
– Miss penalty = average time to retrieve data on a current-level miss (includes the possibility of misses at successive levels).
– Hit rate = % of requests that are found in the current-level cache.
– Miss rate = 1 - hit rate.

Extreme Example: One Big Block

Cache size = 4 bytes, block size = 4 bytes:
– There is only ONE entry in the cache (one valid bit, one tag, data bytes B3 B2 B1 B0)!
– If an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately.
– The next access will therefore likely be a miss: we continually load data into the cache but discard it (force it out) before we use it again.
– A nightmare for the cache designer: the Ping-Pong Effect.

Block Size Tradeoff Conclusions

[Figure: three curves plotted against block size. Miss penalty rises steadily as block size grows. Miss rate first falls, exploiting spatial locality, then rises again once there are too few blocks, which compromises temporal locality. Average access time therefore has a sweet spot; beyond it, the increased miss penalty and miss rate drive it up.]

Things to Remember

We would like to have the capacity of disk at the speed of the processor; unfortunately, this is not feasible. So we create a memory hierarchy:
– Each successively lower level contains the "most used" data from the next higher level.
– It exploits temporal locality and spatial locality.
– Make the common case fast, and worry less about the exceptions (a design principle of MIPS).

Locality of reference is a Big Idea.
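As a closing sketch, here is the Block Size Tradeoff formula applied to some numbers (the timings and rates below are made up for illustration, not from the lecture):

    #include <stdio.h>

    /* Average Access Time = Hit Time x Hit Rate + Miss Penalty x Miss Rate,
       as defined on the Block Size Tradeoff slides.
       All values below are illustrative. */
    int main(void) {
        double hit_time     = 1.0;    /* ns: search + read the current level */
        double miss_penalty = 100.0;  /* ns: average time to go to next level */
        double hit_rate     = 0.95;
        double miss_rate    = 1.0 - hit_rate;

        double amat = hit_time * hit_rate + miss_penalty * miss_rate;
        printf("average access time = %.2f ns\n", amat);  /* prints 5.95 ns */
        return 0;
    }

Even a 5% miss rate dominates the average here, which is why hit time << miss penalty makes high hit rates so important.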