Performance: what's the effect of a preferred thread?


• SMT makes sense only with a fine-grained implementation; what's the impact of fine-grained scheduling on single-thread performance?
– What's the effect of a preferred thread (one with high execution priority)?
– Unfortunately, with a preferred thread*, the processor is likely to sacrifice some throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Clock cycle time must not be affected, especially in
– Instruction issue: more candidate instructions need to be considered
– Instruction completion: choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
* if it has the highest priority, like coarse-grained multithreading

IBM Power Processor Architecture: History
• Power1: multichip, Feb. 1990, 1 FPU (2 clock cycles delay), issuing one compound floating-point multiply/add (MADD)
– Jan. 1992, single chip
• Power2: Sept. 1993, 2 integer units, 2 FPUs (multiply/add)
– Oct. 1996, single chip
• PowerPC: Sept. 1993
– IBM, Motorola, Apple partnership
• Power3: Oct. 1998

Power4
• 2 cores. Each core has:
– 2 integer units
– 2 FPUs
– 2 L/S (load/store) units
– 1 branch unit
– 1 unit for logical operations on the condition register
– Up to 5 instructions in a group are issued simultaneously; a group always ends with a branch

Power 4
2 Power3 processors
Single-threaded predecessor to Power 5.
8 execution units in the out-of-order engine; each may issue an instruction each cycle.
1 processor shown here
Common last stage
4 pipelines, one execution unit not shown (like FX)

Power 5
2 threads: 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)

Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance ...
Relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they “owned” the machine.

Changes in Power 5 to support SMT
• Increased associativity of L1 instruction cache and the instruction address translation buffers (TLBs)
• Per-thread load and store queues
• Increased cache sizes for L2 (1.92 vs. 1.44 MB) & L3
• Separate instruction prefetch & buffering per thread
• Increased the number of virt...
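
The Power4 grouping rule mentioned earlier (up to 5 instructions issued as a group, with a branch ending the group) can be illustrated with a small sketch. This is a simplified reading of the slide, not IBM's actual group-formation logic: the function name `form_groups`, the constant `MAX_GROUP_SIZE`, and the set of branch mnemonics are assumptions made for illustration, and real hardware applies further slot and dependency constraints that are not modeled here.

```python
from typing import List

MAX_GROUP_SIZE = 5  # from the slide: up to 5 instructions per group

# Hypothetical set of branch mnemonics, for illustration only.
BRANCH_OPS = frozenset({"b", "bc", "bl"})


def form_groups(mnemonics: List[str]) -> List[List[str]]:
    """Split a stream of instruction mnemonics into issue groups.

    Simplified rule (an assumption): a group is closed as soon as a
    branch is added to it, or when it reaches MAX_GROUP_SIZE entries.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for op in mnemonics:
        current.append(op)
        if op in BRANCH_OPS or len(current) == MAX_GROUP_SIZE:
            groups.append(current)
            current = []
    if current:  # trailing partial group with no branch
        groups.append(current)
    return groups


stream = ["add", "lwz", "bc", "fmadd", "stw", "add", "or", "b", "lwz", "add"]
for i, group in enumerate(form_groups(stream)):
    print(i, group)
```

Under this rule a branch is always the last instruction of its group, so a branch-heavy stream yields many short groups and leaves issue slots unused. That is one way to see why group formation matters for throughput.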

This document was uploaded on 02/09/2014.
