Lecture 24 - P4 Hyperthreading (2010-04-08)

Single-Stream Performance vs Costs

[Chart: SPECInt, SPECFP, die size, and power, each relative to the 486 (scale 0-25), for the 486, Pentium, Pentium III, and Pentium 4 processors. Single-stream performance gains come at steeply rising die-size and power costs.]

Multi-Processor Systems: Thread-Level Parallelism

[Diagram: four processors (Proc 0 - Proc 3) on a shared system bus connected to DRAM and I/O.]

Parallelism Design Space

- Instruction-Level Parallelism (single-stream performance): faster frequency, wider superscalar, more out-of-order speculation
- Thread-Level Parallelism (multi-stream performance): more CPUs, multi-core, HW multi-threading

Long-Latency DRAM Accesses Need Memory-Level Parallelism (MLP)

[Chart: peak instructions issuable during one DRAM access (scale 0-1400), growing from the Pentium (66 MHz) through the Pentium Pro (200 MHz), Pentium III (1.1 GHz), and Pentium 4 (2 GHz) toward future CPUs.]

Today's Software

- Servers: multi-threaded and highly scalable on multi-processor systems
- High-end desktop and workstation: increasingly multi-threaded
- Desktop: an increasing number of support threads; multiple unrelated applications running simultaneously; Windows XP* allows MP on the desktop

Opportunity to exploit thread-level parallelism in today's software and usage models.

*Third-party brands and names are the property of their respective owners.
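The "peak instructions during DRAM access" chart above can be approximated with a back-of-envelope calculation: instructions per miss ≈ core frequency × issue width × DRAM latency. A minimal Python sketch, where the ~100 ns round-trip latency and the per-CPU issue widths are assumptions rather than slide data:

```python
# Back-of-envelope estimate of how many instructions a CPU could issue
# while one DRAM access is outstanding. Frequencies match the slide's
# CPUs; the latency and issue widths below are assumed, not from the slide.
DRAM_LATENCY_NS = 100  # assumed DRAM round-trip latency

cpus = {
    "Pentium (66 MHz)":      (66e6, 2),   # assumed 2-wide issue
    "Pentium Pro (200 MHz)": (200e6, 3),  # assumed 3-wide issue
    "Pentium III (1.1 GHz)": (1.1e9, 3),
    "Pentium 4 (2 GHz)":     (2e9, 3),
}

def peak_instructions(freq_hz, issue_width, latency_ns=DRAM_LATENCY_NS):
    """Instructions issuable during one DRAM access: stall cycles x width."""
    stall_cycles = freq_hz * latency_ns * 1e-9
    return int(stall_cycles * issue_width)

for name, (freq, width) in cpus.items():
    print(f"{name}: ~{peak_instructions(freq, width)} instructions per miss")
```

The qualitative point survives any choice of constants: the instruction opportunity lost per miss scales with frequency, which is why faster single-stream CPUs need more memory-level parallelism to stay busy.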
Approaches to Multi-Threading

- Superscalar issue: one thread at a time; issue slots go unused when the thread lacks independent work
- Chip Multiprocessor (CMP): two complete CPUs (CPU0, CPU1) on one die
- Fine-grained time-slicing multi-threading: alternate between threads cycle by cycle
- Switch-on-Event Multi-Threading (SOE-MT): switch threads on long-latency events such as cache misses
- Simultaneous Multi-Threading (SMT): issue operations from both threads in the same cycle

SMT achieves maximum utilization of function units by independent operations.

Hyper-Threading Technology is SMT

- Executes two tasks simultaneously: two different applications, or two threads of the same application
- The CPU maintains architecture state for two processors: two logical processors per physical processor
- Demonstrated on a prototype Intel® Xeon™ Processor MP: two logical processors for < 5% additional die area; a power-efficient performance gain; the result of significant research, design effort, and validation

Hyper-Threading Technology brings Simultaneous Multi-Threading (SMT) to Intel Architecture.

Resources: Replicated vs Shared

[Diagram: a multiprocessor replicates everything per CPU -- architecture state, L1 D-cache and D-TLB, L2/L3 caches and cache control, trace cache, BTB and I-TLB, decoder, rename/alloc, uop queues, schedulers, reorder/retire, integer and FP register files, ALUs, load/store AGUs, FP units (FP load/store, FMul, FAdd, MMX, SSE), and uCode ROM. Under Hyper-Threading, only the architecture state is duplicated; both logical processors share the caches, front end, and execution resources.]
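The "two threads of the same application" case can be sketched in a few lines. Note that the operating system simply sees each logical processor as a CPU; in Python, `os.cpu_count()` reports logical processors, so on a Hyper-Threading machine it returns twice the physical core count. A minimal illustrative sketch:

```python
import os
import threading

def worker(name, results):
    # Each thread computes independently; on an SMT processor the two
    # hardware threads can occupy one core's execution units together.
    results[name] = sum(i * i for i in range(100_000))

results = {}
t1 = threading.Thread(target=worker, args=("thread-0", results))
t2 = threading.Thread(target=worker, args=("thread-1", results))
t1.start(); t2.start()
t1.join(); t2.join()

print(f"{os.cpu_count()} logical processors visible to the OS")
print("both threads finished:", len(results) == 2)
```

Whether the two software threads land on two logical processors of one core or on two separate cores is the OS scheduler's decision, invisible to this code.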
Multi-processors replicate execution resources; Hyper-Threading Technology shares them.

Changes for Hyper-Threading

- Replicate resources:
  - All per-CPU architectural state
  - Instruction pointers, renaming logic
  - Some smaller resources (e.g., return stack predictor, ITLB)
- Partition resources (share by splitting in half per thread):
  - Several buffers (re-order buffer, load/store buffers, queues, etc.)
- Share most resources:
  - Out-of-order execution engine
  - Caches

Out-of-Order Execution Pipeline

[Diagram: I-Fetch -> Queue -> Rename -> Queue -> Sched -> Register Read -> Execute -> L1 Cache -> Register Write -> Retire -> Store Buffer; supported by the instruction pointer, trace cache, register renamer, re-order buffer, register files, and L1 D-cache.]

Hyper-Threading Pipeline

[Diagram: the same pipeline with the per-thread resources -- instruction pointers, renaming tables, and the partitioned queues and buffers -- duplicated for two threads.]

Thread-Selection Points

[Diagram: the same pipeline again, marking the stages where the hardware selects which thread advances.]
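The replicate / partition / share split can be made concrete with a toy model of a statically partitioned buffer, the policy the slides describe for the re-order and load/store buffers. The class name, sizes, and behavior here are illustrative assumptions, not real Pentium 4 parameters:

```python
# Toy model of the "partition" sharing policy: a fixed-size buffer is
# split statically in half between two hardware threads, so neither
# thread can starve the other of entries. Sizes are illustrative only.

class PartitionedBuffer:
    """A buffer split in half between two threads, in the spirit of the
    P4's partitioned re-order buffer and load/store buffers."""

    def __init__(self, total_entries):
        self.capacity = total_entries // 2  # each thread owns half
        self.used = [0, 0]                  # entries in use per thread

    def allocate(self, thread_id):
        if self.used[thread_id] < self.capacity:
            self.used[thread_id] += 1
            return True
        return False  # this thread stalls; it cannot steal the other half

rob = PartitionedBuffer(8)

# Thread 0 fills its half completely...
while rob.allocate(0):
    pass

# ...yet thread 1 can still allocate, because its half is untouched.
print("thread 0 entries:", rob.used[0])
print("thread 1 can still allocate:", rob.allocate(1))
```

Static partitioning trades some flexibility (a lone thread only gets half the buffer) for isolation: one stalled thread can never monopolize a resource the other thread needs to make progress.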