

ECE4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of ECE, Georgia Institute of Technology
February 11, 2003

Why study P6 from last millennium?
- A paradigm shift from Pentium
- A RISC core disguised as a CISC
- Huge market success:
  - Microarchitecture
  - And stock price
- Architected by former VLIW and RISC folks
  - Multiflow (pioneer in VLIW architecture for super-minicomputers)
  - Intel i960 (Intel's RISC for graphics and embedded controllers)
- NetBurst (P4's microarchitecture) is based on it

P6 Basics
- One implementation of the IA32 architecture
- Super-pipelined processor
- 3-way superscalar
- In-order front-end and back-end
- Dynamic execution engine (restricted dataflow)
- Speculative execution
- P6 microarchitecture family processors include:
  - Pentium Pro
  - Pentium II (PPro + MMX + 2x caches: 16KB I / 16KB D)
  - Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
  - Celeron (without MP support)
  - Later P-II/P-III/Celeron parts all have on-die L2 cache

x86 Platform Architecture
[Diagram: the P6 core and L1 cache connect over a back-side bus to the L2 cache (on-die or on-package SRAM); the front-side bus connects the processor to the MCH, which links the AGP graphics processor (with local frame buffer), system memory (DRAM), and the ICH for PCI, USB, and other I/O.]

Pentium III Die Map
- EBL/BBL - External/Backside Bus Logic
- MOB - Memory Order Buffer
- Packed FPU - Floating Point Unit for SSE
- IEU - Integer Execution Unit
- FAU - Floating Point Arithmetic Unit
- MIU - Memory Interface Unit
- DCU - Data Cache Unit (L1)
- PMH - Page Miss Handler
- DTLB - Data TLB
- BAC - Branch Address Calculator
- RAT - Register Alias Table
- SIMD - Packed Floating Point Unit
- RS - Reservation Station
- BTB - Branch Target Buffer
- TAP - Test Access Port
- IFU - Instruction Fetch Unit and L1 I-Cache
- ID - Instruction Decode
ISA Enhancements (on top of Pentium)
- CMOVcc / FCMOVcc r, r/m
  - Conditional (predicated) move instructions
  - Based on condition codes (cc)
- FCOMI/P: compare FP stack and set integer flags
- RDPMC/RDTSC instructions
- Uncacheable Speculative Write-Combining (USWC): a weakly ordered memory type for graphics memory
- MMX in Pentium II
  - SIMD integer operations
- SSE in Pentium III
  - Prefetches (non-temporal: nta; temporal: t0, t1, t2)

P6 Pipelining
[Pipeline diagram: the in-order front end spans stages 11-17 (IFU1, IFU2, IFU3, DEC1, DEC2, RAT/ROB); single-cycle instructions go through RS schedule, dispatch, and execute/writeback (stages 20-22), multi-cycle instructions through stages 31-33 and beyond after an RS scheduling delay. Memory µops take the MOB path (stages 81-83: MOB dispatch, block, write) with AGU, MOB wakeup, and two DCache stages on both the non-blocking and blocking memory pipelines, after a MOB scheduling delay; stage 81 is memory/FP writeback, 82 integer writeback, 83 data writeback. Retirement (stages 91-93: ROB read, RRF write, retirement pointer write) forms the in-order back-end boundary.]

P6 Microarchitecture
[Block diagram: the bus cluster (bus interface unit at the chip boundary, on the external bus) feeds the instruction fetch cluster (IFU with BTB/BAC, handling control flow). The instruction decoder and microcode sequencer feed the issue cluster (Register Alias Table and Allocator), which feeds the out-of-order cluster (Reservation Station, ROB and Retire RF) implementing restricted dataflow. Execution units (AGU, IEU/JEU, MMX, FEU) connect through the MIU to the memory cluster: the Data Cache Unit (L1) and Memory Order Buffer.]
Instruction Fetch Unit
[Diagram: the Next PC mux selects among fetch requests; the linear address accesses the instruction cache (backed by a victim cache and streaming buffer) and the instruction TLB; the ILD produces length marks, the BTB produces prediction marks, and the instruction rotator aligns bytes into the instruction buffer according to how many bytes the decoders consumed.]
- IFU1: initiate fetch, requesting 16 bytes at a time
- IFU2: instruction length decoder (ILD) marks instruction boundaries; the BTB makes its prediction
- IFU3: align instructions to the 3 decoders in 4-1-1 format

Dynamic Branch Prediction
- Similar to a 2-level PAs design: a 4-bit branch history register (BHR), speculatively updated and repaired with the actual branch result, indexes a 16-entry pattern history table (PHT) of 2-bit saturating counters
- Associated with each entry of the 512-entry, 4-way (W0-W3) BTB; 4 branch predictions per cycle
- With a 16-entry Return Stack Buffer
- Static prediction is provided by the Branch Address Calculator when the BTB misses (see next slide)

Static Branch Prediction
[Decision flowchart: on a BTB hit, use the BTB's decision. On a BTB miss: an indirect jump (not PC-relative, not a return) is predicted taken; a return is predicted taken; a PC-relative unconditional branch is predicted taken; a PC-relative conditional branch is predicted taken if it branches backwards and not taken if forwards.]
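The two-level scheme above (a per-branch history register indexing a table of 2-bit saturating counters) can be sketched in a few lines of Python. This is an illustrative model of the general PAs idea, not Intel's exact implementation; table sizes follow the slide (4-bit history, 16 counters).

```python
class TwoLevelPredictor:
    """Toy 2-level PAs-style predictor: BHR of recent outcomes -> PHT of 2-bit counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                               # per-branch history register
        self.pht = [1] * (1 << history_bits)       # 2-bit counters, start weakly not-taken

    def predict(self):
        return self.pht[self.bhr] >= 2             # MSB of the saturating counter

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.bhr = ((self.bhr << 1) | taken) & self.mask  # shift in the outcome

p = TwoLevelPredictor()
for outcome in [1, 0, 1, 0, 1, 0, 1, 0]:   # an alternating branch trains quickly
    p.update(outcome)
print(p.predict())  # True: after history ...1010 the counter says taken
```

An always-2-bit-counter predictor would mispredict an alternating branch every time; the history-indexed table learns the pattern, which is why P6 pairs a BHR with each BTB entry.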
x86 Instruction Decode
- 4-1-1 decoder: one complex decoder (instructions of 1-4 µops) plus two simple decoders (1 µop each), feeding a 6-µop instruction decoder queue; the microinstruction sequencer (MS) handles longer instructions
- Decode rate depends on instruction alignment (S = simple, C = complex; the complex decoder is the first of the three):
    Pattern   #Inst decoded
    S,S,S     3
    S,S,C     first 2
    S,C,S     first 1
    S,C,C     first 1
    C,S,S     3
    C,S,C     first 2
    C,C,S     first 1
    C,C,C     first 1
- DEC1: translate x86 instructions into micro-operations (µops)
- DEC2: move decoded µops into the ID queue
- The MS performs translations either way:
  - Generate the entire µop sequence from the microcode ROM, or
  - Receive 4 µops from the complex decoder and the rest from the microcode ROM

Allocator
- The interface between the in-order and out-of-order pipelines
- Allocates:
  - "3-or-none" µops per cycle into the RS and ROB
  - "all-or-none" entries in the MOB (load buffer and store buffer)
- Generates a physical destination (Pdst) from the ROB and passes it to the Register Alias Table (RAT)
- Stalls upon shortage of resources

Register Alias Table (RAT)
[Diagram: logical sources index the integer RAT array (and an FP RAT array adjusted by the FP top-of-stack); integer and FP overrides and an in-order queue produce the physical sources (PSrc), while the allocator supplies physical ROB pointers. Renaming example (RRF bit, PSrc): EAX -> ROB entry 25, EBX -> ROB entry 2, ECX -> RRF (retired copy), EDX -> ROB entry 15.]
- Register renaming for the 8 integer registers, 8 floating-point (stack) registers, and flags: 3 µops per cycle
- 40 80-bit physical registers are embedded in the ROB (hence 6 bits specify a PSrc)
- The RAT looks up physical ROB locations for renamed sources based on the RRF bit
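The RAT lookup and allocation described above can be modeled compactly. This is a minimal sketch under simplifying assumptions (only four integer registers, sequential ROB allocation, no retirement or override paths); the entry format (RRF vs. ROB pointer) follows the slide.

```python
ROB_SIZE = 40  # physical registers embedded in the P6 ROB

class RAT:
    """Toy register alias table: each architectural register maps either to
    the retirement register file (RRF) or to an in-flight ROB entry."""

    def __init__(self):
        self.map = {r: ('RRF', r) for r in ['EAX', 'EBX', 'ECX', 'EDX']}
        self.next_rob = 0

    def rename(self, dst, srcs):
        psrcs = [self.map[s] for s in srcs]       # read sources before the write
        pdst = ('ROB', self.next_rob)             # allocator hands out a ROB entry
        self.next_rob = (self.next_rob + 1) % ROB_SIZE
        self.map[dst] = pdst                      # later readers of dst see the ROB entry
        return pdst, psrcs

rat = RAT()
pdst1, psrcs1 = rat.rename('EAX', ['EAX', 'EBX'])  # ADD EAX, EBX
pdst2, psrcs2 = rat.rename('ECX', ['ECX', 'EAX'])  # SUB ECX, EAX
print(pdst1, psrcs2)  # EAX's second read now points at ROB entry 0, not the RRF
```

Note how the second instruction's EAX source resolves to the first instruction's Pdst: that pointer chain is exactly what lets the out-of-order core wake dependents by ROB entry rather than by architectural name.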
Partial Register Width Renaming
[Diagram: each integer RAT entry holds a size field (2 bits), an RRF bit, and a PSrc (6 bits); the integer RAT is split into a low bank (32-bit/16-bit/low-byte accesses, 8 entries) and a high bank (high-byte accesses, 4 entries).]
- Example µop sequence: µop0: MOV AL = (a); µop1: MOV AH = (b); µop2: ADD AL = (c); µop3: ADD AH = (d)
- 32/16-bit accesses: read from the low bank, write to both banks
- 8-bit RAT accesses: depend on which bank is being written

Partial Stalls due to RAT
- Partial register stalls: occur when a write to a smaller register (e.g. 8/16-bit) is followed by a read of a larger (e.g. 32-bit) register:
    MOVB AL, m8
    ADD  EAX, m32   ; stall
- Idiom fixes: clearing the full register first avoids the stall:
    (1) XOR EAX, EAX        (2) SUB EAX, EAX
        MOVB AL, m8             MOVB AL, m8
        ADD EAX, m32 ; no stall ADD EAX, m32 ; no stall
- Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction wrote:
    (1) TEST EBX, EBX       (2) CMP EAX, EBX
        LAHF ; stall            INC ECX
                                JBE XX ; stall
  - LAHF loads the low byte of EFLAGS
  - JBE reads both ZF and CF, while INC affects only (ZF,OF,SF,AF,PF)

Reservation Stations
[Diagram: five dispatch ports. Port 0: IEU0, Fadd, Fmul, Imul, Div, Pfadd, Pfshuf, Pfmul; Port 1: IEU1, JEU, LDA; Port 2: AGU0 (load address) to the MOB; Ports 3/4: AGU1 (store address, STA) and store data (STD). Two writeback buses return results; loaded data comes from the DCU; retired data flows from the ROB to the RRF.]
- Gateway to execution: binds up to 5 µops per cycle, at most one per port
- 20-entry µop buffer bridging the in-order and out-of-order engines
- RS fields include the µop opcode, data valid bits, Pdst, Psrc, source data, branch prediction info, etc.
- Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
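The oldest-first selection policy on the reservation-station slide above reduces to a one-line rule: among entries whose sources are all valid, pick the one allocated earliest. A minimal sketch (entry fields are illustrative, loosely following the slide's list):

```python
def pick_oldest_ready(rs_entries):
    """Select the oldest RS entry whose source operands are all valid."""
    ready = [e for e in rs_entries if all(e['src_valid'])]
    return min(ready, key=lambda e: e['age']) if ready else None

rs = [
    {'uop': 'add', 'age': 3, 'src_valid': [True, True]},
    {'uop': 'mul', 'age': 1, 'src_valid': [True, False]},  # still waiting on a source
    {'uop': 'sub', 'age': 2, 'src_valid': [True, True]},
]
print(pick_oldest_ready(rs)['uop'])  # 'sub': oldest entry that is actually ready
```

The 'mul' entry is older but not ready, so it does not block the younger ready µops; that is the difference between oldest-first selection and strict in-order issue.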
ReOrder Buffer
- A 40-entry circular buffer
  - Similar to that described in [Smith & Pleszkun, 1985]
  - 157 bits wide
  - Provides the 40 aliased physical registers
- Out-of-order completion; exceptions are deposited in each entry
- Retirement (or de-allocation):
  - After resolving prior speculation
  - Handles exceptions through the MS (microcode assist)
  - Clears OOO state when a mispredicted branch or exception is detected
  - 3 µops per cycle, in program order
  - For multi-µop x86 instructions: none or all retire (atomic)
[Diagram: the allocator and RAT write entries into the ROB; the RS completes them out of order; retirement copies results to the RRF, with exceptions invoking microcode assist through the MS.]

Memory Execution Cluster
[Diagram: the RS/ROB issues LD, STA, and STD µops to the load buffer and store buffer; loads and store addresses access the DTLB and DCU, with fill buffers (FB) toward the EBL.]
- Manages data memory accesses
- Fill buffers in the DCU (similar to MSHRs [Kroft, 1981]) handle cache misses
- Address translation is non-blocking
- Detects violations of memory access ordering

Memory Order Buffer (MOB)
- Allocated by the ALLOC stage; a second-order RS for memory operations
- 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
- MOB structures:
  - 16-entry load buffer (LB)
  - 12-entry store address buffer (SAB)
  - The SAB works in unison with the store data buffer (SDB) in the MIU and the physical address buffer (PAB) in the DCU
  - Store Buffer (SB) = SAB + SDB + PAB
- Senior stores:
  - Upon STD/STA retirement from the ROB, the SB marks the store "senior"
  - Senior stores are committed to memory in program order when the bus is idle or the SB is full
- Prefetch instructions in the P-III have senior-load behavior
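In-order retirement from the circular ROB is worth seeing concretely: completion may happen in any order, but de-allocation stops at the first incomplete entry. A toy sketch (field names are illustrative; the 3-per-cycle limit follows the slide):

```python
def retire(rob, max_retire=3):
    """Retire up to max_retire completed µops from the head, in program order."""
    retired = []
    while len(retired) < max_retire and rob and rob[0]['done']:
        retired.append(rob.pop(0)['uop'])   # head of the circular buffer
    return retired

rob = [{'uop': 'a', 'done': True}, {'uop': 'b', 'done': True},
       {'uop': 'c', 'done': False}, {'uop': 'd', 'done': True}]
retired = retire(rob)
print(retired)  # ['a', 'b']: 'd' has completed but must wait behind 'c'
```

Because 'd' cannot retire past the incomplete 'c', a mispredicted branch or exception detected at 'c' can still discard 'd' cleanly, which is precisely why retirement is kept in order.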
Store Coloring
    x86 instruction        µops                        store color
    mov (0x1220), ebx      std (ebx) / sta 0x1220      2
    mov (0x1110), eax      std (eax) / sta 0x1110      3
    mov ecx, (0x1220)      ld                          3
    mov edx, (0x1280)      ld                          3
    mov (0x1400), edx      std (edx) / sta 0x1400      4
    mov edx, (0x1380)      ld                          4
- The ALLOC assigns a Store Buffer ID (SBID) to each store in program order
- The ALLOC tags each load with the most recent SBID (its "color")
- Loads are checked for potential address conflicts against stores up to and including their color (i.e. all stores ahead of them in program order)
- The SDB forwards the data if a conflict is detected

Memory Type Range Registers (MTRR)
- Control registers written by the system (OS)
- Supported memory types:
  - Uncacheable (UC)
  - Uncacheable Speculative Write-Combining (USWC or WC)
    - Uses a fill buffer entry as the WC buffer
  - WriteBack (WB)
  - Write-Through (WT)
  - Write-Protected (WP)
    - E.g. supports copy-on-write in UNIX: saves memory by letting child processes share pages with their parents, creating new pages only when a child attempts to write
- Page Miss Handler (PMH):
  - Looks up the MTRRs while supplying physical addresses
  - Returns memory types and physical addresses to the DTLB

Intel NetBurst Microarchitecture
- Pentium 4's microarchitecture, a new post-P6 generation
- Original target market: graphics workstations, but the major competitor stumbled on its own
- Design goals:
  - Performance, performance, performance, ...
  - Unprecedented multimedia/floating-point performance
    - Streaming SIMD Extensions 2 (SSE2)
  - Reduced CPI
    - Low-latency instructions
    - High-bandwidth instruction fetching
    - Rapid execution of arithmetic & logic operations
  - Reduced clock period
    - New pipeline designed for scalability

Innovations Beyond P6
- Hyperpipelined technology
- Streaming SIMD Extension 2
- Enhanced branch predictor
- Execution trace cache
- Rapid execution engine
- Advanced Transfer Cache
- Hyper-Threading Technology (in Xeon and Xeon MP)
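The store-coloring check above can be sketched as a search of the store buffer: a load may only conflict with stores whose SBID does not exceed its color, and among matches the youngest such store supplies the data. This is an illustrative model of the idea, not the actual MOB logic (which compares addresses in hardware, handles partial overlaps, and stalls on unknown store addresses).

```python
def find_forwarding_store(stores, load_addr, load_color):
    """Return data forwarded from the youngest prior store to the same address,
    or None if the load should read from the cache.
    stores: list of (sbid, address, data) tuples."""
    for sbid, addr, data in sorted(stores, reverse=True):  # youngest eligible first
        if sbid <= load_color and addr == load_addr:
            return data
    return None

# Stores from the slide's example, tagged with their SBIDs
stores = [(2, 0x1220, 'ebx'), (3, 0x1110, 'eax')]
print(find_forwarding_store(stores, 0x1220, 3))  # 'ebx': SDB forwards the store data
print(find_forwarding_store(stores, 0x1280, 3))  # None: no conflict, read the cache
```

The color cut-off is what keeps the check correct under out-of-order execution: a store younger than the load may sit in the buffer, but its SBID exceeds the load's color, so it is never consulted.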
Pentium 4 Fact Sheet
- IA-32 fully backward compatible
- Available at speeds ranging from 1.3 to ~3 GHz
- Hyperpipelined (20+ stages)
- 42+ million transistors
- 0.18µm process for 1.7 to 1.9 GHz; 0.13µm for 1.8 to 2.8 GHz; die size of 217 mm²
- Consumes 55 watts of power at 1.5 GHz
- 400 MHz (850 chipset) and 533 MHz (850E chipset) system bus
- 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s to L1 at 2.8 GHz)
- 1MB or 512KB L3 cache (in Xeon MP)
- 144 new 128-bit SIMD instructions (SSE2)
- Hyper-Threading Technology (only enabled in Xeon and Xeon MP)

Recent Intel IA-32 Processors
[Table of recent IA-32 processors not included in this preview.]

Building Blocks of NetBurst
[Block diagram: the front end (fetch/decode, execution trace cache, µROM, and BTB/branch predictor) feeds the out-of-order engine (OOO logic and retirement), which drives the integer and FP execution units and the L1 data cache; the memory subsystem comprises the L2 cache and bus unit on the system bus; branch history updates flow from retirement back to the BTB.]

Pentium 4 Microarchitecture
[Diagram: a 4K-entry BTB and I-TLB/prefetcher feed the IA32 decoder (backed by the µcode ROM), which fills the Execution Trace Cache (with its own 512-entry trace BTB) and µop queue. The allocator/register renamer feeds a memory µop queue and an INT/FP µop queue, drained by the memory scheduler and the fast, slow/general-FP, and simple-FP schedulers. The INT register file/bypass network serves two 2x-pumped ALUs, a slow ALU, and the load/store AGUs; the FP register file/bypass serves the FP move and FP execute (MMX, SSE/SSE2) units. The 256KB 8-way unified L2 (128-byte lines, WB, ~48 GB/s at 1.5 GHz) backs an 8KB 4-way L1 data cache (64-byte lines, WT, 1 read + 1 write port); the quad-pumped 400/533 MHz system bus delivers 3.2/4.3 GB/s through the BIU over 64 bits.]
Pipeline Depth Evolution
- P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
- P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
- NetBurst microarchitecture: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive (20+ stages in all)

Execution Trace Cache
- Primary first-level I-cache, replacing a conventional L1 I-cache
  - Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  - The branch misprediction penalty is horrible: 20 pipeline stages lost vs. 10 stages in P6
- Advantages:
  - Caches post-decode µops
  - High-bandwidth instruction fetching
  - Eliminates x86 decoding overheads
  - Reduces branch recovery time on a TC hit
- Holds up to 12,000 µops
  - 6 µops per trace line
  - Many (?) trace lines in a single trace

Execution Trace Cache (cont.)
- Delivers 3 µops per cycle to the OOO engine
- x86 instructions are read from L2 on a TC miss (7+ cycle latency)
- TC hit rate is comparable to an 8KB to 16KB conventional I-cache
- Simplified x86 decoder
  - Only one complex instruction decoded per cycle
  - Instructions of more than 4 µops are executed from the microcode ROM (P6's MS)
- Branch prediction is performed in the TC
  - 512-entry BTB + 16-entry RAS
  - Together with the BP in the x86 IFU, mispredictions are reduced by about 1/3 compared to P6
  - Intel did not disclose the details of the BP algorithms used

Out-of-Order Engine
- Similar design philosophy to P6; uses:
  - Allocator
  - Register Alias Table
  - 128 physical registers
  - 126-entry ReOrder Buffer
  - 48-entry load buffer
  - 24-entry store buffer

Register Renaming Schemes
- P6: the RAT maps the 8 architectural registers into the 40-entry ROB, whose entries hold both data and status; on retirement, values are copied to the separate RRF
- NetBurst: a front-end RAT and a retirement RAT both map into a separate 128-entry physical register file, allocated sequentially; the 126-entry ROB holds status only, and data stays in the RF at retirement
Micro-op Scheduling
- µop FIFO queues:
  - A memory queue for loads and stores
  - A non-memory queue
- µop schedulers:
  - Several schedulers fire instructions to execution (the analogue of P6's RS)
  - 4 distinct dispatch ports
  - Maximum dispatch: 6 µops per cycle (2 fast-ALU µops each from ports 0 and 1 per cycle; 1 each from the load and store ports)
- Port bindings:
  - Exec Port 0: Fast ALU (2x pumped): add/sub, logic; FP Move: FP/SSE move, FP/SSE store, FXCH
  - Exec Port 1: Fast ALU (2x pumped): add/sub; INT Exec: shift, rotate, LEA, branches; FP Exec: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
  - Load Port: loads, prefetches
  - Store Port: stores (store data)

Data Memory Accesses
- 8KB 4-way L1 + 256KB 8-way L2 (with a hardware prefetcher)
- Load-to-use speculation:
  - Dependent instructions are dispatched before the load finishes, owing to the high frequency and deep pipeline
  - The scheduler assumes loads always hit L1
  - On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data (a mis-speculation)
  - Replay logic re-executes the dependents when mis-speculated
  - Independent instructions are allowed to proceed
- Up to 4 outstanding load misses (matching the 4 fill buffers in the original P6)
- Store-to-load forwarding buffer

Streaming SIMD Extension 2
- P-III SSE (Katmai New Instructions: KNI)
  - Eight 128-bit wide xmm registers (new architectural state)
  - Single-precision 128-bit SIMD FP
    - Four 32-bit FP operations in one instruction
    - Broken into 2 µops for execution (the ROB holds only 80-bit data)
  - 64-bit SIMD MMX (uses the 8 mm registers, which map onto the FP stack)
  - Prefetch (nta, t0, t1, t2) and sfence
- P4 SSE2 (Willamette New Instructions: WNI)
  - Supports double-precision 128-bit SIMD FP
    - Two 64-bit FP operations in one instruction
    - Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD take 69 cycles, non-pipelined)
  - Enhanced 128-bit SIMD MMX using the xmm registers
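The load-to-use speculation and replay described above can be sketched as a simple event log. This is an illustrative model only: the real replay system tracks dependence chains through the schedulers and re-dispatches from a replay queue, none of which is modeled here.

```python
def dispatch_with_replay(load_hits_l1, dependents):
    """Log the dispatch of a load and its dependents under hit speculation.
    On a miss, dependents consumed stale data and must be replayed."""
    log = [('load', 'dispatched')]
    # Scheduler assumes an L1 hit, so dependents go out before the data is known
    log += [(d, 'dispatched speculatively') for d in dependents]
    if not load_hits_l1:
        log += [(d, 'replayed') for d in dependents]  # re-execute with correct data
    return log

print(dispatch_with_replay(True, ['add']))
print(dispatch_with_replay(False, ['add', 'sub']))  # both dependents replay
```

The key trade-off: assuming a hit keeps the load-to-use latency minimal in the common case, at the cost of wasted issue slots (the replayed µops) whenever the L1 misses.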
Examples of Using SSE
[Diagram, with xmm1 = (X3,X2,X1,X0) and xmm2 = (Y3,Y2,Y1,Y0):]
- Packed SP FP operation (e.g. ADDPS xmm1, xmm2): all four lanes operate, giving (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
- Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): only the low lane operates, giving (X3, X2, X1, X0 op Y0)
- Shuffle FP operation with an 8-bit immediate (e.g. SHUFPS xmm1, xmm2, 0xf1): the two low result lanes are selected from xmm1 and the two high lanes from xmm2; with imm8 = 0xf1 the result is (Y3, Y3, X0, X1)

Examples of Using SSE2
[Diagram, with xmm1 = (X1,X0) and xmm2 = (Y1,Y0):]
- Packed DP FP operation (e.g. ADDPD xmm1, xmm2): both lanes operate, giving (X1 op Y1, X0 op Y0)
- Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): only the low lane operates, giving (X1, X0 op Y0)
- Shuffle DP operation with a 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): the low result lane selects X1 or X0, the high lane Y1 or Y0

HyperThreading
- In the Intel Xeon and Intel Xeon MP processors
- Enables Simultaneous Multi-Threading (SMT)
  - Exploits ILP through TLP (Thread-Level Parallelism)
  - Issues and executes multiple threads in the same snapshot
- A single P4 Xeon appears as 2 logical processors
- The logical processors share the same execution resources
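The packed/scalar/shuffle semantics above are easy to pin down with a pure-Python lane model (lists hold the four single-precision lanes, index 0 = low lane). This models only the data movement of ADDPS, ADDSS, and SHUFPS, not their encodings or rounding behavior.

```python
def addps(x, y):
    """Packed add: every lane operates."""
    return [a + b for a, b in zip(x, y)]

def addss(x, y):
    """Scalar add: only the low lane operates; upper lanes pass through from x."""
    return [x[0] + y[0]] + x[1:]

def shufps(x, y, imm8):
    """Shuffle: result lanes 0-1 pick from x, lanes 2-3 from y,
    each selected by a 2-bit field of imm8 (low bits select lane 0)."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]

x = [1.0, 2.0, 3.0, 4.0]     # lanes X0..X3
y = [10.0, 20.0, 30.0, 40.0]  # lanes Y0..Y3
print(addps(x, y))            # [11.0, 22.0, 33.0, 44.0]
print(addss(x, y))            # [11.0, 2.0, 3.0, 4.0]
print(shufps(x, y, 0xf1))     # [2.0, 1.0, 40.0, 40.0], i.e. (Y3, Y3, X0, X1) high-to-low
```

The 0xf1 case reproduces the slide's SHUFPS example: fields 01,00,11,11 select X1, X0, Y3, Y3 for lanes 0 through 3.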
Multithreading (MT) Paradigms
[Diagram: execution time vs. four functional units (FU1-FU4) for a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), a chip multiprocessor (CMP), and simultaneous multithreading; shading distinguishes unused issue slots from threads 1-5.]

More SMT Commercial Processors
- Intel Xeon Hyper-Threading
  - Supports 2 replicated hardware contexts: the PC (or IP) and the architectural registers
  - New directions of usage:
    - Helper (or assisted) threads (e.g. speculative precomputation)
    - Speculative multithreading
- Clearwater (once called XStream Logic): an 8-context SMT "network processor" designed by DISC architects (the company no longer exists)
- Sun: a 4-SMT-processor CMP?

Speculative Multithreading
- SMT can justify a wider-than-ILP datapath
- But the datapath is only fully utilized by multiple threads
- How can a single-threaded program be sped up using multiple threads, and what can be done with the spare resources?
  - Execute both sides of hard-to-predict branches
    - Eager execution or polypath execution
    - Dynamic predication
  - Send another thread to scout ahead to warm up the caches and BTB
    - Speculative precomputation
    - Early branch resolution
  - Speculatively execute future work
    - Multiscalar or dynamic multithreading
    - E.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work
  - Run a dynamic compiler/optimizer on the side
  - Dynamic verification
    - DIVA or the Slipstream Processor

This note was uploaded on 09/21/2010 for the course ECE 4180 taught by Professor Staff during the Spring '08 term at Georgia Institute of Technology.
