cs420: speed with complexity

Caches and Cache Performance
•  Remember the von Neumann model
[Diagram: CPU (registers, ALU), cache, memory]

Why and how does a cache help?
•  Only because of the principle of locality
   –  Programs tend to access the same and/or nearby data repeatedly
   –  Temporal and spatial locality
•  Size of cache
•  Multiple levels of cache
•  Performance impact of caches
   –  Designing programs for good sequential performance

Reality today: multi-level caches
•  Remember the von Neumann model
[Diagram: CPU, caches, memory]

Example: Intel's Nehalem chip
•  Nehalem architecture, Core i7 chip:
   –  32 KB L1 instruction and 32 KB L1 data cache per core
   –  256 KB L2 cache (combined instruction and data) per core
   –  8 MB L3 cache (combined instruction and data), "inclusive", shared by all cores
•  Still, even an L1 cache access takes several cycles
   –  L1: reportedly 4 cycles, increased from 3 before
   –  L2: 10 cycles

Intel's latest chip: Haswell
http://wccftech.com/review/intel-haswell-core-i7-4770kreview-dz87kl75k-motherboard/
[Image: Haswell]

Example program
•  Imagine a sequential program running using a large array, A
•  For each i, A[i] = A[i] + A[some other index]
•  How long should the program take, if each addition takes a nanosecond?
•  What is the performance difference you expect, depending on how the other index is chosen?

    for (i = 0; i < size - 1; i++) {
        A[i] += A[i+1];
    }

    for (i = 0, index2 = 0; i < size; i++) {
        index2 += SOME_NUMBER;  // smaller than size
        if (index2 >= size)     // wrap around; >= keeps index2 inside the array
            index2 -= size;
        A[i] += A[index2];
    }

A little bit about microprocessor architecture
•  Architecture over the last 2-3 decades was driven by the need to make the clock cycle go faster and faster
   –  Pipelining developed as an essential technique early on.
   –  Each instruction execution is pipelined:
      •  Fetch, decode, and execute stages, at least
      •  In addition, floating point operations, which take longer to calculate, have their own separate pipeline
•  L1 cache accesses in Nehalem are pipelined: even though it takes 4 cycles to get the result, you can keep issuing a new load every cycle, and you would (almost) not notice a difference if they are all found in L1 cache...
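
The example program from the slides can be tried out directly. The following is a minimal sketch in C, not the course's own code: the function names (neighbor_pass, strided_pass, time_strided_pass) and the double element type are assumptions made for illustration, and the stride parameter stands in for SOME_NUMBER. The wrap-around test uses >= so that index2 can never equal size, which would read one element past the end of the array. Timing relies on the standard clock() call, which is coarse, so the measurements are only a rough guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* First loop from the slide: each element adds its right-hand neighbor,
 * so memory is touched in strictly sequential order. */
void neighbor_pass(double *a, size_t size)
{
    for (size_t i = 0; i + 1 < size; i++) {
        a[i] += a[i + 1];
    }
}

/* Second loop from the slide: the second operand jumps ahead by `stride`
 * elements per iteration and wraps around at the end of the array.
 * `stride` plays the role of SOME_NUMBER and must satisfy 0 < stride < size. */
void strided_pass(double *a, size_t size, size_t stride)
{
    assert(stride > 0 && stride < size);
    size_t index2 = 0;
    for (size_t i = 0; i < size; i++) {
        index2 += stride;
        if (index2 >= size) {
            index2 -= size;
        }
        a[i] += a[index2];
    }
}

/* Hypothetical helper: CPU seconds spent in one strided pass over a
 * freshly initialized array. Returns -1.0 if allocation fails. */
double time_strided_pass(size_t size, size_t stride)
{
    double *a = malloc(size * sizeof *a);
    if (a == NULL) {
        return -1.0;
    }
    for (size_t i = 0; i < size; i++) {
        a[i] = 1.0;
    }

    clock_t start = clock();
    strided_pass(a, size, stride);
    clock_t stop = clock();

    /* Read one result so the compiler cannot discard the loop entirely. */
    volatile double sink = a[size / 2];
    (void)sink;

    free(a);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To compare access patterns, call time_strided_pass from a small main with an array much larger than the last-level cache (for example, size = 1 << 24) and different strides, such as 1 and a value in the thousands. A stride of 1 behaves like the first loop. The principle of locality suggests the widely strided run will be slower, but the actual gap depends on the machine, the compiler flags, and hardware prefetching.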

This note was uploaded on 03/16/2014 for the course CS 420 taught by Professor Kale, L. during the Fall '08 term at University of Illinois, Urbana Champaign.
