12LecSp11CacheIIx6 - CS 61C Spring 2011, Lecture #12 (2/24/11): Caches II


CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Caches II
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
2/24/11 – Spring 2011 – Lecture #12

New-School Machine Structures (It's a bit more complicated!)
Software and hardware harness parallelism to achieve high performance:
•  Parallel Requests – assigned to computer, e.g., Search "Katz" (Warehouse Scale Computer)
•  Parallel Threads – assigned to core, e.g., Lookup, Ads (Smart Phone, Computer, Core)
•  Parallel Instructions – >1 instruction @ one time, e.g., 5 pipelined instructions (Instruction Unit(s))
•  Parallel Data – >1 data item @ one time, e.g., Add of 4 pairs of words (Functional Unit(s): A0+B0, A1+B1, A2+B2, A3+B3)
•  Hardware descriptions – all gates @ one time (Logic Gates)
•  Today's lecture covers the Memory (Cache) level of this picture, between the Core and Main Memory / Input/Output

Review (Lecture #11)
•  Principle of Locality for Libraries / Computer Memory
•  Hierarchy of Memories (speed/size/cost per bit) to exploit locality
•  Cache – copy of data from a lower level in the memory hierarchy
•  Direct Mapped: find a block in the cache using the Tag field, with a Valid bit to signal a Hit
•  Larger caches reduce Miss rate via Temporal and Spatial Locality, but can increase Hit time
•  Larger blocks reduce Miss rate via Spatial Locality, but increase Miss penalty
•  AMAT (Average Memory Access Time) helps balance Hit time, Miss rate, and Miss penalty

Agenda
•  Cache Hits and Misses, Consistency
•  Administrivia
•  Cache Performance and Size
•  Technology Break
•  Designing Memory Systems for Caches (if time permits – cache blocking with video!)

Handling Cache Misses (Single Word Blocks)
•  Read misses (I$ and D$)
–  Stall execution, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, and then let execution resume
•  Write misses (D$ only)
–  Write allocate: stall execution, fetch the block from the next level in the memory hierarchy, install it in the cache, write the word from the processor to the cache, also update memory, then let execution resume; or
–  No-write allocate: skip the cache write and just write the word to memory (but must invalidate the cache block, since it will now hold stale data)

Cache-Memory Consistency? (1/2)
•  Need to make sure cache and memory have the same value: 2 policies
1) Write-Through Policy: write the cache and write through the cache to memory
–  Every write eventually gets to memory
–  Too slow, so include a Write Buffer to allow the processor to continue once the data is in the buffer; the buffer updates memory in parallel with the processor

Cache-Memory Consistency? (2/2)
•  Need to make sure cache and memory have the same value: 2 policies
2) Write-Back Policy: write only to the cache, and write the cache block back to memory when the block is evicted from the cache
–  Writes are collected in the cache; only a single write to memory per block
–  Include a bit to record whether the block was written, and only write back if that bit is set
–  Called the "Dirty" bit (writing makes it "dirty")
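To make the two policies concrete, here is a minimal C sketch of how a store could be handled under each one, assuming a direct-mapped, one-word-block D$ modeled as a plain array. All of the names (line_t, dcache, mem_write, ...) are made up for illustration; this is not lecture or project code.

    #include <stdint.h>
    #include <stdbool.h>

    #define NBLOCKS 1024                        /* 1024 one-word (4-byte) blocks */
    typedef struct { bool valid, dirty; uint32_t tag, data; } line_t;
    static line_t dcache[NBLOCKS];

    extern void mem_write(uint32_t addr, uint32_t value);   /* next level of hierarchy */

    static uint32_t idx_of(uint32_t a) { return (a >> 2) & (NBLOCKS - 1); }
    static uint32_t tag_of(uint32_t a) { return a >> 12; }

    /* Write-through: update the cache on a hit and always send the write on to
       memory (hardware would put it in a write buffer rather than stalling).
       On a miss this version simply skips the cache, i.e., no-write allocate. */
    void store_write_through(uint32_t addr, uint32_t val) {
        line_t *l = &dcache[idx_of(addr)];
        if (l->valid && l->tag == tag_of(addr))
            l->data = val;                      /* hit: keep the cached copy current */
        mem_write(addr, val);                   /* every write reaches memory */
    }

    /* Write-back with write allocate: write only the cache and set the dirty bit;
       memory is updated later, when a dirty block is evicted. */
    void store_write_back(uint32_t addr, uint32_t val) {
        line_t *l = &dcache[idx_of(addr)];
        if (!(l->valid && l->tag == tag_of(addr))) {        /* write miss */
            if (l->valid && l->dirty)                       /* evicting a dirty block: */
                mem_write((l->tag << 12) | (idx_of(addr) << 2), l->data); /* write it back */
            l->tag = tag_of(addr);                          /* install the new block */
            l->valid = true;
        }
        l->data  = val;                         /* write only the cache ...            */
        l->dirty = true;                        /* ... memory is stale until eviction  */
    }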
Sources of Cache Misses (3 C's)
•  Compulsory (cold start, first reference):
–  1st access to a block; a "cold" fact of life, not a lot you can do about it
–  If running billions of instructions, compulsory misses are insignificant
–  Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
•  Capacity:
–  Cache cannot contain all blocks accessed by the program
–  Solution: increase cache size (may increase access time)
•  Conflict (collision):
–  Multiple memory locations mapped to the same cache location
–  Solution 1: increase cache size (may increase hit time)
–  Solution 2: (later in semester) increase associativity (may increase hit time)

Average Memory Access Time (AMAT)
•  Average Memory Access Time (AMAT) is the average time to access memory, considering both hits and misses
•  AMAT = Time for a hit + Miss rate x Miss penalty
•  How can we reduce the Miss penalty?

Reducing Cache Miss Rates
•  Use multiple $ levels
•  With advancing technology, there is more room on the die for bigger L1 caches and for a second-level cache – normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache
•  E.g., CPIideal of 2, 100-cycle miss penalty (to main memory), 25-cycle miss penalty (to L2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% L2$ miss rate:
–  CPIstalls = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54 (vs. 5.44 with no L2$)
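The CPI arithmetic in that example can be checked mechanically. The short C program below (illustrative only, not lecture code) recomputes CPIstalls for those parameters, with and without the L2$.

    #include <stdio.h>

    int main(void) {
        /* Parameters from the example above */
        double cpi_ideal = 2.0, ld_st_frac = 0.36;
        double l1i_miss = 0.02, l1d_miss = 0.04, l2_miss = 0.005;
        double l2_penalty = 25.0, mem_penalty = 100.0;

        /* With an L2$: L1 misses pay 25 cycles, L2 misses pay 100 cycles */
        double cpi_l2 = cpi_ideal
                      + l1i_miss * l2_penalty                /* 0.02 x 25          */
                      + ld_st_frac * l1d_miss * l2_penalty   /* 0.36 x 0.04 x 25   */
                      + l2_miss * mem_penalty                /* 0.005 x 100        */
                      + ld_st_frac * l2_miss * mem_penalty;  /* 0.36 x 0.005 x 100 */
        printf("CPI with L2$    = %.2f\n", cpi_l2);          /* prints 3.54 */

        /* Without an L2$: every L1 miss pays the full 100-cycle memory penalty */
        double cpi_no_l2 = cpi_ideal
                         + l1i_miss * mem_penalty
                         + ld_st_frac * l1d_miss * mem_penalty;
        printf("CPI without L2$ = %.2f\n", cpi_no_l2);       /* prints 5.44 */
        return 0;
    }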
Local vs. Global Miss Rates
•  Local miss rate – the fraction of references to one level of a cache that miss
–  Local Miss rate L2$ = L2$ Misses / L1$ Misses
•  Global miss rate – the fraction of references that miss in all levels of a multilevel cache
–  Global Miss rate L2$ = L2$ Misses / Total Accesses
   = (L2$ Misses / L1$ Misses) × (L1$ Misses / Total Accesses)
   = Local Miss rate L2$ × Local Miss rate L1$
•  The L2$ local miss rate is >> the global miss rate
•  AMAT = Time for a hit + Miss rate × Miss penalty
•  AMAT = Time for an L1$ hit + (local) Miss rate L1$ × (Time for an L2$ hit + (local) Miss rate L2$ × L2$ Miss penalty)

Multilevel Cache Design Considerations
•  Different design considerations for the L1$ and the L2$
–  L1$ focuses on minimizing hit time for a shorter clock cycle: smaller $ with smaller block sizes
–  L2$(s) focus on reducing miss rate to reduce the penalty of long main memory access times: larger $ with larger block sizes
•  Miss penalty of the L1$ is significantly reduced by the presence of the L2$, so the L1$ can be smaller/faster but with a higher miss rate
•  For the L2$, hit time is less important than miss rate
–  L2$ hit time determines the L1$'s miss penalty

Agenda
•  Cache Hits and Misses, Consistency
•  Administrivia
•  Cache Performance and Size
•  Technology Break
•  Memory Performance for Caches

Administrivia
•  Lab #6 posted
•  Project #2 due Sunday @ 11:59:59
•  No homework this week!
•  Midterm in less than 2 weeks:
–  Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
•  Split: A-Lew in 145, Li-Z in 155
–  Covers everything through the lecture of March 3
–  Closed book; can bring one sheet of notes, both sides
–  Copy of the Green Card will be supplied
–  No phones, calculators, …; just bring pencils & eraser
–  TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB

Getting to Know Profs: Family
•  Dadʼs family Scotch-Irish, Momʼs family Swedish
•  Grew up in Torrance, CA
•  (Still) married to high school sweetheart
•  1st to graduate from college
•  Liked it so much, didnʼt stop until the PhD
•  Spend 1 week/summer hosting the Patterson Family Reunion
–  27 people: 2 parents, 3 siblings, 2 sons, 7 nieces and nephews, 7 spouses, 3 grandchildren, 1 grandnephew, 1 grandniece, 6 dogs …
[Photos: 1974 and 2009]

Getting to Know Profs: (old) Friends
•  2 to 3 times/year, spend a weekend with old friends I went to high school (South Torrance) with, to play poker, watch the Superbowl, go bodysurfing, and talk about life
•  Old friends become even more valuable as you age
[Photo: 1999]
Improving Cache Performance (1 of 3)
1. Reduce the time to hit in the cache
–  Smaller cache
–  1-word blocks (no multiplexor/selector to pick the word)
2. Reduce the miss rate
–  Bigger cache
–  Larger blocks (16 to 64 bytes typical)
–  (Later in semester: more flexible placement by increasing associativity)

Improving Cache Performance (2 of 3)
3. Reduce the miss penalty
–  Smaller blocks
–  Use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading
–  Check the write buffer on a read miss – may get lucky
–  Faster backing store / improved memory bandwidth (later in lecture)
–  Use multiple cache levels
•  L2 cache not tied to processor clock rate

The Cache Design Space (3 of 3)
•  Several interacting dimensions
–  Cache size
–  Block size
–  Write-through vs. write-back
–  Write allocation
–  (Later) Associativity
–  (Later) Replacement policy
•  Optimal choice is a compromise
–  Depends on access characteristics
•  Workload
•  Use (I-cache, D-cache)
–  Depends on technology / cost
•  Simplicity often wins
[Figure: the cache design space as axes of Cache Size, (Associativity), and Block Size, with Good/Bad regions as Factor A (less) trades off against Factor B (more)]

Fields within an Address
•  For a direct-mapped cache with 2^n blocks, n bits are used for the index
•  For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word: together, the block offset
•  Size of the tag field = Address size – index size – block offset size
–  For a 32-bit byte address: 32 – n – (m+2) tag bits
•  Address layout: Tag<32-n-(m+2)> | Index<n bits> | Block offset<m+2 bits>

Cache Sizes
•  The number of bits in a direct-mapped cache includes both the storage for data and for the tags + valid bit + dirty bit (if needed)
•  Each of the 2^n entries holds: Valid<1> | Dirty<1> | Tag<30-n-m> | Data in block
•  Total number of bits in the cache is then
–  2^n x (block size + tag field size + valid field size + dirty field size, if needed)
•  Why don't we need to store the Block Offset in the cache? Why not the Index? (Student Roulette)
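These formulas are easy to turn into a few lines of C. The sketch below (hypothetical helper names, not course code) computes the tag/index/offset split and the total bit count for a direct-mapped cache from its data capacity, block size, and write policy; plugging in the 16 KB, 4-word-block configuration reproduces the field split used in the peer-instruction questions that follow.

    #include <stdio.h>

    /* floor(log2(x)) for a power of two */
    static int log2i(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        unsigned data_bytes  = 16 * 1024;  /* 16 KB of data                 */
        unsigned block_words = 4;          /* 4-word (16-byte) blocks       */
        int      dirty_bit   = 0;          /* write-through: no dirty bit   */

        unsigned block_bytes = block_words * 4;
        unsigned blocks      = data_bytes / block_bytes;   /* 2^n blocks    */
        int n      = log2i(blocks);                        /* index bits    */
        int offset = log2i(block_bytes);                   /* m + 2 bits    */
        int tag    = 32 - n - offset;                      /* tag bits      */

        /* total bits = 2^n x (data bits + tag + valid [+ dirty]) */
        unsigned long total = (unsigned long)blocks *
                              (block_bytes * 8 + tag + 1 + dirty_bit);

        printf("Tag <%d> | Index <%d> | Block Offset <%d> bits\n", tag, n, offset);
        printf("Total = %lu bits (about %lu Kbits)\n", total, total / 1024);
        return 0;
    }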
Peer Instruction
•  Assuming a direct-mapped, write-through cache with 16 KB of data and 4-word blocks, how should a 32-bit byte address be divided to access the cache?
A (red):    Tag <14 bits> | Index <14 bits> | Block Offset <4 bits>
B (orange): Tag <16 bits> | Index <14 bits> | Block Offset <2 bits>
C (green):  Tag <18 bits> | Index <10 bits> | Block Offset <4 bits>
D:          Tag <20 bits> | Index <10 bits> | Block Offset <2 bits>
E (pink):   Valid <1> | Tag <14 bits> | Index <14 bits> | Block Offset <4 bits>
F (blue):   Valid <1> | Dirty <1> | Tag <14 bits> | Index <14 bits> | Block Offset <4 bits>
G (purple): Valid <1> | Tag <18 bits> | Index <10 bits> | Block Offset <4 bits>
H (teal):   Valid <1> | Dirty <1> | Tag <18 bits> | Index <10 bits> | Block Offset <4 bits>

Peer Instruction
•  How many total bits are required for that cache? (Round to the nearest Kbits)
–  Direct-mapped, write-through, 16 KBytes of data, 4-word (16-byte) blocks, 32-bit address
–  Tag <18 bits> | Index <10 bits> | Block Offset <4 bits>
A (red): 16 Kbits      E (pink): 139 Kbits
B (orange): 18 Kbits   F (blue): 146 Kbits
C (green): 128 Kbits   G (purple): 147 Kbits
                       H (teal): 148 Kbits

[Figure: CPI, miss rates, and DRAM accesses for SpecInt2006 – instructions and data vs. data only]

Reading Miss Penalty: Memory Systems that Support Caches
•  The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
[Figure: CPU with on-chip Cache, connected by a bus (32-bit data & 32-bit addr per cycle) to DRAM Memory]
•  Assume a one-word-wide organization (one-word-wide bus and one-word-wide memory), and:
–  1 memory bus clock cycle to send the address
–  15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time); 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (subsequent column access time) – note the effect of latency!
–  1 memory bus clock cycle to return a word of data
•  Memory-Bus to Cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle

(DDR) SDRAM Operation
•  After a row is read into the SRAM register:
–  Input CAS as the starting "burst" address along with a burst length
–  Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
–  The memory bus clock controls the transfer of successive words in the burst
[Figure: N rows x N cols DRAM array with M bit planes and an N x M SRAM row buffer; RAS/CAS timing diagram showing Row Address, Col Address, cycle time, and the 1st-4th M-bit accesses on the M-bit output]
One Word Wide Bus, One Word Blocks
•  If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
–  1 memory bus clock cycle to send the address
–  15 memory bus clock cycles to read DRAM
–  1 memory bus clock cycle to return the data
–  17 total clock cycles miss penalty
•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle

One Word Wide Bus, Four Word Blocks
•  What if the block size is four words and each word is in a different DRAM row?
–  1 cycle to send the 1st address
–  4 x 15 = 60 cycles to read DRAM
–  1 cycle to return the last data word
–  62 total clock cycles miss penalty
•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per clock

One Word Wide Bus, Four Word Blocks
•  What if the block size is four words and all words are in the same DRAM row?
–  1 cycle to send the 1st address
–  15 + 3 x 5 = 30 cycles to read DRAM
–  1 cycle to return the last data word
–  32 total clock cycles miss penalty
•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per clock

Interleaved Memory, One Word Wide Bus
[Figure: CPU and on-chip Cache connected by a one-word bus to four DRAM memory banks (bank 0 through bank 3)]
•  For a block size of four words:
–  1 cycle to send the 1st address
–  15 cycles to read the DRAM banks
–  4 x 1 = 4 cycles to return the last data word
–  20 total clock cycles miss penalty
•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per clock
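Using the cycle counts above (1-cycle bus transfers, a 15-cycle DRAM row access, and a 5-cycle column access within an open row), the four miss penalties and bandwidths can be recomputed with a few lines of C. This is only the slide arithmetic, not a memory-system model.

    #include <stdio.h>

    /* Print miss penalty and bus bandwidth for one memory organization. */
    static void show(const char *org, int cycles, int bytes) {
        printf("%-36s %2d cycles, %.3f bytes/cycle\n",
               org, cycles, (double)bytes / cycles);
    }

    int main(void) {
        show("1-word block, 1-word bus",          1 + 15 + 1,         4);  /* 0.235 */
        show("4-word block, words in diff. rows", 1 + 4 * 15 + 1,     16); /* 0.258 */
        show("4-word block, same row (burst)",    1 + 15 + 3 * 5 + 1, 16); /* 0.500 */
        show("4-word block, 4 interleaved banks", 1 + 15 + 4 * 1,     16); /* 0.800 */
        return 0;
    }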
DRAM Memory System Observations
•  It is important to match the cache characteristics
–  Caches access one block at a time (usually more than one word)
1) With the DRAM characteristics
–  Use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
2) With the memory-bus characteristics
–  Make sure the memory bus can support the DRAM access rates and patterns
–  With the goal of increasing the Memory-Bus to Cache bandwidth

Performance Programming: Adjust software accesses to improve miss rate
•  Now that we understand how caches work, we can revise programs to improve cache utilization
–  Cache size
–  Block size
–  Multiple levels

Performance of Loops and Arrays
•  Array performance is often limited by memory speed
•  It is OK to access memory in a different order as long as you get the correct result
•  Goal: increase performance by minimizing traffic from cache to memory
–  That is, reduce the Miss rate by getting better reuse of data already in the cache
•  One approach is called Cache Blocking: "shrink" the problem by performing multiple iterations within smaller cache blocks

Matrix Multiplication
[Figure: C = A * B]
•  Use Matrix Multiply as the example: next lab and Project 3

The simplest algorithm
•  Assumption: the matrices are stored as 2-D NxN arrays

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

•  Computes c_ij = a_i* · b_*j (row i of A times column j of B)
•  Advantage: code simplicity
•  Disadvantage: marches through memory and caches
•  Simple Matrix Multiply video – www.youtube.com/watch?v=yl0LTcDIhxc

Note on Matrix in Memory
•  A matrix is a 2-D array of elements, but memory addresses are "1-D"
•  Conventions for matrix layout
–  By column, or "column major" (Fortran default): A(i,j) at A + i + j*n
–  By row, or "row major" (C default): A(i,j) at A + i*n + j
[Figure: the same 5 x 4 matrix laid out in memory column-major vs. row-major]

Improving reuse via Blocking: 1st "Naïve" Matrix Multiply
•  {implements C = C + A*B}

    for i = 1 to n
      {read row i of A into cache}
      for j = 1 to n
        {read c(i,j) into cache}
        {read column j of B into cache}
        for k = 1 to n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        {write c(i,j) back to main memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j) for a 100 x 100 matrix and a cache of 1000 blocks, 1 word/block; the blue row of the matrix is stored in the red cache blocks]

Linear Algebra to the Rescue!
•  Instead of multiplying two, say, 6 x 6 matrices of elements, consider A, B, and C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
•  Thus, you can get the same result as a multiplication of a set of submatrices

Blocked Matrix Multiply

    for i = 1 to N
      for j = 1 to N
        {read block C(i,j) into cache}
        for k = 1 to N
          {read block A(i,k) into cache}
          {read block B(k,j) into cache}
          C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
        {write block C(i,j) back to main memory}

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) for a 100 x 100 matrix, 1000 cache blocks, 1 word/block, 30 x 30 blocks]
•  Blocked Matrix Multiply video – www.youtube.com/watch?v=IFWgwGMMrh0

Another View of "Blocked" Matrix Multiplication
[Figure: C, A, and B each partitioned into a 4 x 4 grid of r x r sub-blocks, with N = 4*r]
•  C22 = A21 B12 + A22 B22 + A23 B32 + A24 B42 = Σk A2k * Bk2
•  Main point: each multiplication operates on small "block" matrices, whose size may be chosen so that they fit in the cache

Maximum Block Size
•  The blocking optimization only works if the blocks fit in the cache
•  That is, 3 blocks of size r x r must fit in memory (for A, B, and C)
•  M = size of cache (in elements/words)
•  We must have 3r^2 ≈ M, or r ≈ √(M/3)
•  Ratio of cache misses, blocked vs. unblocked, is up to ≈ √M

•  Simple Matrix Multiply, whole thing – www.youtube.com/watch?v=f3-z6t_xIyw: 1x1 blocks, 1,020,000 misses (read A once, read B 100 times, read C once)
•  Blocked Matrix Multiply, whole thing – www.youtube.com/watch?v=tgpmXX3xOrk: 30x30 blocks, 90,000 misses (read A and B four times, read C once)
•  "Only" 11X vs. 30X because the matrix is small enough that a row of A in the simple version fits completely in cache (among other things)
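The blocked pseudocode above translates directly into C. Below is one possible (unoptimized) rendering for row-major double-precision matrices, with the block size R chosen so that three R x R blocks fit in the cache (r ≈ √(M/3)); it is a sketch of the technique, not the reference code for the lab or Project 3.

    /* Cache-blocked matrix multiply: C = C + A*B for row-major N x N matrices.
       Each (i0, j0, k0) iteration multiplies R x R sub-blocks, so the three
       blocks currently being touched can stay resident in the cache. */
    #define N 100
    #define R 30   /* block size: want 3*R*R words <= cache size */

    static inline int min(int a, int b) { return a < b ? a : b; }

    void blocked_matmul(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i0 = 0; i0 < N; i0 += R)
            for (int j0 = 0; j0 < N; j0 += R)
                for (int k0 = 0; k0 < N; k0 += R)
                    /* multiply blocks A(i0,k0) and B(k0,j0) into block C(i0,j0) */
                    for (int i = i0; i < min(i0 + R, N); i++)
                        for (int j = j0; j < min(j0 + R, N); j++) {
                            double cij = C[i][j];
                            for (int k = k0; k < min(k0 + R, N); k++)
                                cij += A[i][k] * B[k][j];
                            C[i][j] = cij;
                        }
    }

Setting R = N recovers the simple algorithm; shrinking R trades loop overhead for better reuse of the blocks already in the cache.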
Review
•  To access the cache, the Memory Address is divided into 3 fields: Tag, Index, Block Offset
•  Cache size is Data + Management (tags, valid, dirty bits)
•  Write misses are trickier to implement than reads
–  Write-back vs. write-through
–  Write allocate vs. no-write allocate
•  Cache Performance Equations:
–  CPU time = IC × CPIstall × CC = IC × (CPIideal + Memory-stall cycles) × CC
–  AMAT = Time for a hit + Miss rate × Miss penalty
•  If you understand caches, you can adapt software to improve cache performance, and thus program performance