Low Power Design, Intel Processors

Computer Organization and Design: The Hardware/Software Interface



CS152 Computer Architecture and Engineering
Lecture 25: Low Power Design, Advanced Intel Processors
May 3, 2004
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Recap: I/O Summary
° I/O performance limited by weakest link in chain between OS and device
° Queueing theory is important
• 100% utilization means very large latency
• Remember, for M/M/1 queue (exponential source of requests/service)
- queue size goes as u/(1-u)
- latency goes as Tser × u/(1-u)
• For M/G/1 queue (more general server, exponential sources)
- latency goes as m1(z) × u/(1-u) = Tser × {1/2 × (1+C)} × u/(1-u)
° Three Components of Disk Access Time:
• Seek Time: advertised to be 8 to 12 ms. May be lower in real life.
• Rotational Latency: 4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
• Transfer Time: 2 to 50 MB per second
° I/O device notifying the operating system:
• Polling: it can waste a lot of processor time
• I/O interrupt: similar to exception except it is asynchronous
° Delegating I/O responsibility from the CPU: DMA, or even IOP
5/03/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec25.2

Low Power Design
Slides borrowed from Bob Brodersen
[The Low Power Design slides that follow (Lec25.3 onward) are figures whose content did not survive extraction; the only text recovered is the computation 3/4 × 1/4 = 3/16 on slide Lec25.12.]
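As a rough numeric illustration of the two queueing relations in the I/O recap, the sketch below evaluates them at a few utilizations. The function names and the 10 ms service time are assumptions chosen for illustration, not values from the lecture, and C is assumed to be the squared coefficient of variation of the service time, which is the usual reading of the (1+C)/2 factor.

```python
def mm1_latency(t_ser, u):
    """Queueing latency for an M/M/1 queue: Tser * u / (1 - u)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization u must be in [0, 1)")
    return t_ser * u / (1.0 - u)


def mg1_latency(t_ser, u, c):
    """Queueing latency for an M/G/1 queue: Tser * (1 + C)/2 * u / (1 - u).

    c is assumed to be the squared coefficient of variation of service time;
    c = 1 (exponential service) reduces this to the M/M/1 formula.
    """
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization u must be in [0, 1)")
    return t_ser * 0.5 * (1.0 + c) * u / (1.0 - u)


if __name__ == "__main__":
    t_ser_ms = 10.0  # hypothetical service time, not from the slides
    for u in (0.5, 0.9, 0.99):
        print(f"u={u}: M/M/1 {mm1_latency(t_ser_ms, u):.1f} ms, "
              f"M/G/1 with C=0 {mg1_latency(t_ser_ms, u, 0.0):.1f} ms")
```

The loop makes the "100% utilization means very large latency" bullet concrete: the u/(1-u) factor grows without bound as u approaches 1, whatever the service time.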
Back to original goal: Processor Usage Model
° Single-user system not always computing
° System Optimizations:
• Maximize Peak Throughput
• Minimize Average Energy/operation (maximize computation per battery life)
[Figure: desired throughput vs. time. Labels: compute-intensive and low-latency processes; background and high-latency processes; ceiling set by top speed of the processor.]

Typical Usage
° Wake up → Compute ASAP → Go to idle/sleep mode
° Always high throughput, always high energy/operation
[Figure: delivered throughput vs. time, with peak and excess throughput marked.]

Another approach: Reduce Frequency
° Frequency set by user (PowerBook Control Panel: Slow / Fast)
° Energy/operation remains unchanged while throughput scales down with fCLK
° Problems:
• Circuits designed to be fast are now "wasted".
• Demand for peak throughput not met.
[Figure: delivered throughput vs. time with fCLK reduced, against peak.]

Alternative: Dynamic Voltage Scaling
° Reduce throughput & fCLK, reduce energy/operation
° Dynamically scale energy/operation with throughput
° Key: Process scheduler determines operating point.
° Extend battery life by up to 10x with the same hardware!

Can we encode information in a way that takes less power?
• Do this on chip?!
• Trying to reduce total number of transitions
• Can we reduce total number of transitions on buses by sophisticated bus drivers?

Reasoning
° Increasing importance of wires relative to transistors
• Spend transistors to drive wires more efficiently?
• Try to reduce transitions over wires
° Orthogonal to other power-saving techniques, i.e.
• voltage reduction, low-swing drive
• clock gating
• Parallelism (like vectors!)

Huffman-based Compression
[Figure: Input → Encoder → Encoded Version → Decode → Output]
° Variable bit length – problem!
° Possible soln: macro clock

What about bus transitions?
[Figure: Input → Encoder → … → Decode → Output]
° Fewer bits != fewer transitions

Context-based encoder
° Detects repeated values going across the bus
• Shift-register finds short-term frequent values
• Frequency table holds long-term values

Just the Shift-register: "window-based"
° Focus on shift-register
• 8 or 16 entries
° Careful design can break even!
• Register bus results
• 16 entries break even: 7.0 mm for .13µm, 2.7 mm for .07µm

Administrivia
° Pending schedule:
• Wednesday 5/5: Midterm II. 5:30 – 8:30, 306 Soda hall
- No class that day (I will be having office hours)
- 1 page of handwritten notes, both sides
- Fair topics: Pipelining; Memory Systems; I/O, Disks, Queueing Theory; Power

7 Talk Commandments for a Bad Talk
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
III. Thou shalt not print large.
IV. Thou shalt not use color.
V. Thou shalt not skip slides in a long talk.
VI. Thou shalt cover thy naked slides.
VII. Thou shalt not practice.
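Returning to the bus-encoding slides earlier in this section: the observation that fewer bits do not mean fewer transitions can be checked by counting bit flips between successive bus words. The sketch below is only an illustration; the function name and both word sequences are hypothetical and do not come from the lecture or from any particular encoder.

```python
def bus_transitions(words):
    """Count bit flips on the bus wires across a sequence of words.

    Each wire that changes value between two consecutive words contributes
    one transition, so the count is the popcount of prev XOR cur.
    """
    return sum(bin(prev ^ cur).count("1") for prev, cur in zip(words, words[1:]))


# Hypothetical traffic on an 8-bit bus: only the lowest wire toggles.
raw_words = [0x00, 0x01, 0x00, 0x01]

# Hypothetical 4-bit "compressed" code for the same traffic: half the bits
# per word, but every wire toggles on every transfer.
encoded_words = [0x0, 0xF, 0x0, 0xF]

if __name__ == "__main__":
    print("raw:", bus_transitions(raw_words))
    print("encoded:", bus_transitions(encoded_words))
```

Since dynamic energy is spent on transitions rather than on bits as such, an encoder aimed at power would be judged by a count like this one instead of by the compressed length.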
Administrivia (continued)
° Pending schedule, continued:
• Wednesday 5/5 midterm: Pizza at LaVal’s afterwards
• Monday 5/10 (wrap up, evaluations, etc)
• Thursday 5/13: Oral reports: Times TBA
- Signup sheet will be on my office door next week
- Project reports must be submitted via web by 5pm on 5/10
• Monday 5/17: Final project reports due?
° Oral Report
• Powerpoint
• 20 minute presentation, 5 minutes for questions

Following all the commandments
We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably [...] parallelism in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. Our compiling strategy is to exploit coarse-grain parallelism at function application level: and the function application level parallelism is implemented by fork-join mechanism. The compiler translates source programs into control flow graphs based on analyzing flow of control, and then serializes instructions within graphs according to flow arcs such that function applications, which have no control dependency, are executed in parallel. A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modelling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modelling is used. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. We assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain control flow execution. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.

Alternatives to a Bad Talk
° Practice, Practice, Practice!
• Use cassette tape recorder to listen, practice
• Try videotaping
• Seek feedback from friends
° Use phrases, not sentences
• Notes separate from slides (don’t read slide)
° Pick appropriate font, size (~ 24 point to 32 point)
° Estimate talk length
• ~ 2 minutes per slide
• Use extras as backup slides (Question and Answer)
° Use color tastefully (graphs, emphasis)
° Don’t cover slides
• Use overlays or builds in powerpoint
° Go to room early to find out what is WRONG with setup
• Beware: PC projection + dark rooms after meal!

Include in your final presentation
° Who is on team, and who did what
• Everyone should say something
° High-level description of what you did and how you combined components together
• Use block diagrams rather than detailed schematics
• Assume audience knows Chapters 6 and 7 already
° Include novel aspects of design
• Did you innovate? How?
• Why did you choose to do things the way that you did?
° Give Critical Path and Clock cycle time
• Bring paper copy of schematics in case there are detailed questions.
• What could be done to improve clock cycle time?
° Description of testing philosophy!
° Mystery program statistics: instructions, clock cycles, CPI, why stalls occur (cache miss, load-use interlocks, branch mispredictions, ...)
° Lessons learned, what you might do differently next time

Review: Road to Faster Processors
° Time = Instr. Count x CPI x Clock cycle time
° How to get a shorter Clock Cycle Time?
° Can we get CPI < 1?
° Can we reduce pipeline stalls for cache misses, hazards, … ?
° IA-32 P6 microarchitecture (µarchitecture): Pentium Pro, Pentium II, Pentium III
° IA-32 “Netburst” µarchitecture (Pentium 4, …)
° IA-32 AMD Athlon, Opteron µarchitectures
° IA-64 Itanium I and II microarchitectures

Tying it all together: Pentium-IV

Dynamic Scheduling in Pentium Pro, II, III
° P6 doesn’t pipeline 80x86 instructions
° P6 decode unit translates the Intel instructions into 72-bit "micro-operations" (~ MIPS instructions)
° Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations
° Most instructions translate to 1 to 4 micro-operations
° Sends micro-operations to reorder buffer & reservation stations

Dynamic Scheduling in P6 (Pentium Pro, II, III)
° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
° 10 stage pipeline for micro-operations
° 14 clocks in total pipeline

P6 Pipeline
° 14 clocks in total (~3 state machines)
° 8 stages are used for in-order instruction fetch, decode, and issue
• Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations (uops)
° 3 stages are used for out-of-order execution in one of 5 separate functional units
° 3 stages are used for instruction commit
[Figure: Instr Fetch (16B/clk) → 16B → Instr Decode (3 instr/clk) → 6 uops → Renaming (3 uops/clk) → Reservation Station / Reorder Buffer → Execution units (5) → Graduation (3 uops/clk)]

P6 Block Diagram
° IP = PC
° Simple Decoder:
• 1 µop
° Complex Decoder:
• 1-4 µops
° Rename: 40 regs
From: http://www.digitlife.com/articles/pentium4/

Dynamic Scheduling in P6
Parameter                              80x86   microops
Max. instructions issued/clock         3       6
Max. instr. complete exec./clock               5
Max. instr. committed/clock                    3
Window (instrs in reorder buffer)              40
Number of reservation stations                 20
Number of rename registers                     40
No. integer functional units (FUs)             2
No. floating point FUs                         1
No. SIMD Fl. Pt. FUs                           1
No. memory FUs                                 1 load + 1 store

Pentium III Die Photo
° EBL/BBL - Bus logic, Front, Back
° MOB - Memory Order Buffer
° Packed FPU - MMX Fl. Pt. (SSE)
° IEU - Integer Execution Unit
° FAU - Fl. Pt. Arithmetic Unit
° MIU - Memory Interface Unit
° DCU - Data Cache Unit
° PMH - Page Miss Handler
° DTLB - Data TLB
° BAC - Branch Address Calculator
° RAT - Register Alias Table
° SIMD - Packed Fl. Pt.
° RS - Reservation Station
° BTB - Branch Target Buffer
° IFU - Instruction Fetch Unit (+I$)
° ID - Instruction Decode
° ROB - Reorder Buffer
° MS - Micro-instruction Sequencer
1st Pentium III, Katmai: 9.5 M transistors, 128 mm**2 in 0.25-micron

P6 Performance: uops/x86 instr
200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus
[Bar chart: uops per x86 instruction (axis 1 to 1.7) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5.]
1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer)

P6 Performance: Speculation rate (% instructions issued that do not commit)
[Bar chart: axis 0% to 60%, same benchmarks.]
1% to 60% instructions do not commit: 20% avg (30% integer)

P6 Performance: µops commit/clock
[Stacked bar chart: axis 0% to 100%, same benchmarks; categories 0, 1, 2, 3 uops commit.]
Average: 0: 55%, 1: 13%, 2: 8%, 3: 23%
Integer: 0: 40%, 1: 21%, 2: 12%, 3: 27%

P6 Dynamic Benefit? Sum of parts CPI vs. Actual CPI
[Stacked bar chart: axis 0 to 6, same benchmarks; categories uops, instruction cache stalls, resource capacity stalls, branch mispredict penalty, data cache stalls, with actual CPI marked.]
Ratio of sum of parts vs. actual CPI: 1.38X avg. (1.29X integer)
0.8 to 3.8 Clock cycles per instruction: 1.68 avg (1.16 integer)

Pentium 4 features
° Called “NetBurst” Microarchitecture
• Still translate from 80x86 to micro-ops
• Instruction Cache (Execution Trace Cache)
• Out-of-Order (OOO) execution engine
• Double-pumped Arithmetic Logic Unit
• Memory Subsystem (L1 access in 2 CP)
° Floating Point/Multi-Media performance
• Multimedia instructions 128 bits wide vs. 64 bits wide for P6 ⇒ 144 new instructions
• When used by programs??
• Faster Floating Point: execute 2 64-bit Fl. Pt. per clock
• Memory FU: 1 128-bit load, 1 128-bit store /clock to MMX regs
° P4 has better branch predictor
• BTB: 4096 vs 512 (Intel: 1/3 misprediction improvement)
° Instruction Cache holds micro-operations vs. 80x86 instructions
• no decode stages of 80x86 on cache hit
• called “trace cache” (TC)

Pentium 4 features (Continued)
° Faster memory bus: initially 400 MHz v. 133 MHz
° Caches
• Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
• Pentium 4: L1I 12K µops, L1D 8 KB, L2 256 KB
• Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
° Initial P4 Clock rates:
• Pentium III 1 GHz v. Pentium IV 1.5 GHz
• 14 stage pipeline vs. 24 stage pipeline
° Using RAMBUS DRAM
• Bandwidth faster, latency same as SDRAM
• Later changed to support DDR SDRAM
° Rename registers: 128 vs 40 (although 8 for architectural state)

Registers
[Figure: register sets. GPR: 8 general purpose registers, 32-bit. SEG: 6 segment registers, 16-bit. EFLAGS register and control register. FPU: 8 floating point registers, 80-bit. MMX: 8 registers, 64-bit. XMM: 8 registers, 128-bit. The MMX and XMM sets carry the labels "MMX/SSE (FP/Int…) Registers" and "SSE2 Registers".]

SIMD: Single Instruction Multiple Data
° Beginning with Pentium II, “SIMD” instructions added
° “Partitions” ALU to do multiple narrow data operations in 1 clock cycle by breaking carry chain:
• 64 bits => 2 32-bit int ops OR 4 16-bit ops OR 8 8-bit ops
° SSE2 added in Pentium 4
• 128 bits => 2 64-bit Fl. Pt. OR 4 32-bit Fl. Pt. OR ...

Instructions             Packed Data          Registers MMX 64-bit / XMM 128-bit     APPS
MMX (57), Pentium II     INT B,W,Q            Yes / ---                              Imaging, MM, comm.
SSE (70), Pentium III    SP Float             Yes, -- (column order unclear)         3-D geo/rendering, video en/decode
SSE2 (144), Pentium 4    INT, SP/DP Float     Yes / Yes                              4-D graphics, Scientific Comp

Pentium 4 Cache
Level         Capacity                    Associativity   Line Size (bytes)     Latency int/float (clocks)   Write Update Policy
First Data    8KB                         4               64                    2/9                          write through
Trace Cache   12K µops                    8               N/A                   N/A                          N/A
Second        256KB, 512KB                8               128 read, 64 write    7/7                          write back
Third         0, 512KB or 1MB or 2 MB     8               128 read, 64 write    14/14                        write back

Pentium 4 basic block diagram
[Figure: block diagram, not recoverable from this preview.]

Pentium 4 Trace Cache 1/4
• Trace Cache: L1 instruction cache after the Instruction Fetch.
• Arranges decoded instructions (µops) into mini-programs that are ready to be used whenever there is an L1 cache hit.
• The trace cache can send up to 3 µops directly to the execution engine.
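The "breaking the carry chain" idea from the SIMD slide above can be imitated in software on an ordinary integer. The sketch below is an illustration only, with hypothetical function names; it is not how MMX hardware is built, and it covers only wrap-around integer addition on a 64-bit word split into 8-, 16-, or 32-bit lanes.

```python
def lane_msb_mask(lane_bits, width=64):
    """Mask with a 1 in the most significant bit of every lane."""
    mask = 0
    for lane in range(width // lane_bits):
        mask |= 1 << (lane * lane_bits + lane_bits - 1)
    return mask


def packed_add(a, b, lane_bits, width=64):
    """Add a and b lane by lane; carries never cross a lane boundary.

    The low (lane_bits - 1) bits of each lane are added normally, which can
    carry at most into the lane's own top bit. The top bits are then combined
    with XOR, so the carry out of each lane is dropped instead of propagating.
    """
    full = (1 << width) - 1
    msb = lane_msb_mask(lane_bits, width)
    low = full & ~msb
    partial = (a & low) + (b & low)
    return (partial ^ ((a ^ b) & msb)) & full


if __name__ == "__main__":
    # 0xFF + 0x01 wraps to 0x00 inside an 8-bit lane...
    print(hex(packed_add(0x00FF, 0x0001, 8)))
    # ...but carries into the next byte when the lanes are 16 bits wide.
    print(hex(packed_add(0x00FF, 0x0001, 16)))
```

The same 64-bit adder therefore serves as 2, 4, or 8 narrower adders depending only on where the carry is cut, which is the partitioning the slide describes.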
Source for the Pentium 4 block diagram: “The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

Trace Cache Example
Example: code in memory
      cmp
      br T1
      ...
  T1: sub
      br T2
      ...
  T2: mov
      sub
      br T3
      ...
  T3: add
      sub
      mov
      br T4
      ...
  T4: add
Decoded into 6 µop traces in the trace cache:
  cmp, br T1, sub, br T2, mov, sub
  br T3, add, sub, mov, br T4, add

Pentium 4 Trace Cache 2/4
• What happens when there is a Trace Miss?
• A trace miss happens when the L1 cache misses, so the code must be fetched from the L2 cache. This results in 8 pipeline stages in order to translate and decode the instructions.
• Trace cache operates in two modes:
1) Execute mode: trace cache -> execution logic -> executed. This is the mode the trace cache normally runs in when there is no cache miss.
2) Trace segment build mode: happens on an L1 cache miss. Fetch code from L2 cache, translate to µops, build trace segment, load segment into trace cache.

Pentium 4 Trace Cache 3/4
• Trace cache applies branch prediction when building a trace. It places the code from the branch direction it predicts the program will take right behind the code that it knows the program will execute.
• x86 code with branch: the trace cache builds a trace from the instructions up to and including the branch instruction, then picks a branch direction.
• Conventional way: branch predictor figures out which branch to speculatively execute, then loads that branch. Takes up to 1 cycle of delay after every conditional branch instruction.
• With trace cache: the branch code is within the trace segment, so there is no delay associated with bringing in the branch code.

Pentium 4 Trace Cache 4/4
• Most x86 instructions decode into 2 or 3 µops
• Rare long instructions could decode into 100s of µops. PIII and P4 use a microcode ROM to process these instructions, so the regular decoder can handle the normal, smaller instructions.
• The trace cache puts a tag in the trace segment when it sees a long instruction. The tag points to the section of microcode ROM that contains the µop sequence.
• When the trace cache encounters the tag in execute mode, it lets the microcode ROM stream the proper sequence of µops into the instruction stream for the execution engine

Full Block diagram (Intel)
[Figure: block diagram, not recoverable from this preview.]

Out-of-Order Execution -- Pipeline
Pentium III processor misprediction pipeline (10 stages): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec
Pentium 4 processor misprediction pipeline (20 stages; labels as extracted): TC Fetch, TC Fetch, Drive, Alloc, Rename, Que, Sch, Sch, Sch, Disp, Disp, FR, FR, Ex, Flgs, BrCk, Drive

Comparison of two architectures
Stage   Pentium 4 pipeline stage (work)            Pentium 3 pipeline stage
1       Trace Cache next instruction pointer       Fetch
2       Trace Cache next instruction pointer       Fetch
3       Trace Cache fetch                          Decode
4       Trace Cache fetch                          Decode
5       Drive (wire latency)                       Decode
6       Allocation                                 Rename
7       Rename                                     ROB Rd
8       Rename                                     Rdy/Sch
9       Queue                                      Dispatch
10      Schedule                                   Exec
11      Schedule
12      Schedule
13      Dispatch
14      Dispatch
15      Register Files
16      Register Files
17      Execute
18      Flags
19      Branch Check
20      Drive (wire latency)

Register Renaming: Pentium III vs NetBurst
° Pentium III ties names of registers to reorder buffer slots (Implicit?)
• Values copied to architectural register file (RRF) on commit
° Pentium IV performs explicit register renaming
• 128 physical registers
• Commit of translations

Staggered ALU Add
° Add pipeline can operate at 2X clock speed
° Split into 3 fast cycles with forwarding (16bits x 2 + flags)
° L1 data cache access starts with bottom 16 bits

Pentium 4 Speeds & Feeds
[Figure: bandwidths and latencies between Regs., Exec, L1 Data (8KB), L2 (256/512KB), Trace Cache, and PC800 RDRAM memory, for a 2.4GHz CPU with a 400MHz FSB. Labels as extracted: 1 W (load) CP; 1 W (store) CP; 4 W CP on die; 1W 6 CP @400MHz FSB; memory latencies 2 CP, 2-7 CP, ~90 CP; 32B wide; ~4 CP (3uops/CP stream). Legend as extracted: W, PF Word (64 bit); Int, Integer (64 bit); CP, Clock Period. Line size L1/L2 = 32/64 bytes.]

Pentium 4 Basic Features
° 42 million transistors (256 KB L2 cache), 55 watts @ 1.5 GHz, 217 mm**2 (0.18u)
° 55 million transistors (512 KB L2 cache), 82 watts @ 3.0 GHz, 131 mm**2 (0.13u)
° Xeon (server): 160 million transistors (512 KB L2 cache + 2048 KB L3), 65 watts @ 2.0 GHz, 211 mm**2 (0.13u)
° 400/533/800 MHz Front Side Bus
• Bus to Memory Hub, which connects to DRAM, AGP graphics bus, and I/O Hub

Pentium-4 die floor plan
[Figure: die photo. Labels: L1 Dcache, L2 cache; annotation 100 x 100 x 8 = 80 KB.]

Performance Comparison
[Figure: benchmark charts with ratio labels 1.5X, 1.2X, 1.7X.]
Scott Wasson, “Intel’s Pentium 4 Processor, Radical Chic”, www.tech-report.com/reviews/2001q3/pentium4-2ghz/

SPEC 2000 Performance 3/2001
Source: Microprocessor Report

Conclusion: Intel
° OOO processors
• HW translation to RISC operations
• Superpipelined P4 with 22-24 stages vs. 12 stage Opteron
• Trace cache in P4
• SSE2 increasing floating point performance

Conclusion: Power
° Best way to save power or energy: do nothing!
° Most important equations to remember:
• Energy = CV^2
• Power = CV^2 f
° Slowing clock rate does not reduce energy for fixed operation!
° Ways of reducing energy:
• Pipelining with reduced voltage
• Parallelism with reduced voltage
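The two equations in the power conclusion explain why slowing the clock alone does not save energy on a fixed task, while lowering the voltage does. The sketch below works through that arithmetic; the capacitance, voltage, frequency, and operation count are made-up illustrative values, and the function names are hypothetical.

```python
import math


def energy_per_op(c, v):
    """Switching energy per operation: E = C * V^2."""
    return c * v * v


def power(c, v, f):
    """Dynamic power: P = C * V^2 * f."""
    return c * v * v * f


def task_energy(c, v, f, n_ops):
    """Energy for a fixed task of n_ops operations: power times run time."""
    run_time = n_ops / f
    return power(c, v, f) * run_time


if __name__ == "__main__":
    c, v, f, n_ops = 1e-9, 1.8, 200e6, 1e6  # hypothetical values

    base = task_energy(c, v, f, n_ops)
    half_clock = task_energy(c, v, f / 2, n_ops)
    half_clock_half_volt = task_energy(c, v / 2, f / 2, n_ops)

    # Halving f halves power but doubles run time: task energy is unchanged.
    print(math.isclose(base, half_clock))
    # Halving V as well cuts task energy to a quarter (V^2 scaling).
    print(math.isclose(half_clock_half_volt, base / 4))
```

This is the arithmetic behind dynamic voltage scaling and behind the "pipelining or parallelism with reduced voltage" bullets: throughput lost to the lower voltage is recovered structurally, while energy per operation falls with V^2.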

This note was uploaded on 01/29/2008 for the course CS 152 taught by Professor Kubiatowicz during the Spring '04 term at University of California, Berkeley.
