10_multithreading


CIS 501 (Martin): Multithreading 1
CIS 501 Computer Architecture
Unit 10: Hardware Multithreading
Slides originally developed by Amir Roth with contributions by Milo Martin at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

CIS 501 (Martin): Multithreading 2
This Unit: Multithreading (MT)
• Why multithreading (MT)?
• Utilization vs. performance
• Three implementations
  • Coarse-grained MT
  • Fine-grained MT
  • Simultaneous MT (SMT)
[Figure: system stack — Application, OS, Firmware, Compiler, CPU, I/O, Memory, Digital Circuits]

CIS 501 (Martin): Multithreading 3
Readings
• Textbook (MA:FSPTCM): Section 8.1
• Paper: Tullsen et al., "Exploiting Choice…"

CIS 501 (Martin): Multithreading 4
Performance and Utilization
• Even a moderate superscalar (e.g., 4-way) is not fully utilized
  • Average sustained IPC: 1.5–2 → < 50% "utilization"
  • Utilization is (actual IPC / peak IPC)
• Why so low? Many "dead" cycles, due to:
  • Mis-predicted branches
  • Cache misses, especially misses to off-chip memory
  • Data dependences
• Some workloads are worse than others, for example databases
  • Big data, lots of instructions, hard-to-predict branches
• Having resources sit idle is wasteful… how can we better utilize the core?
• Multithreading (MT): improve utilization by multiplexing multiple threads on a single core
  • If one thread cannot fully utilize the core, maybe 2 or 4 can
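The utilization figure on the slide above is simple arithmetic; a minimal sketch, using the slide's own numbers (a 4-issue core sustaining 1.5–2 IPC):

```python
# Utilization of a superscalar core = actual IPC / peak IPC.
# Numbers match the slide: a 4-way core sustaining 1.5-2 IPC.

def utilization(actual_ipc: float, peak_ipc: float) -> float:
    """Fraction of issue slots doing useful work."""
    return actual_ipc / peak_ipc

peak = 4.0  # 4-way (4-issue) superscalar
for ipc in (1.5, 2.0):
    print(f"sustained IPC {ipc}: utilization {utilization(ipc, peak):.1%}")
# 1.5/4 = 37.5% and 2.0/4 = 50.0% -- at or below 50%, as the slide notes
```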
CIS 501 (Martin): Multithreading 5
Superscalar Under-utilization
• Time evolution of issue slots
• 4-issue processor
[Figure: superscalar issue-slot timeline; a cache miss leaves issue slots empty]

CIS 501 (Martin): Multithreading 6
Simple Multithreading
• Time evolution of issue slots
• 4-issue processor
• Fill in dead cycles with instructions from another thread
• Where does it find a thread?
  • Same problem as multi-core
  • Same shared-memory abstraction
[Figure: superscalar vs. multithreading issue-slot timelines; the cache-miss cycles are filled by another thread's instructions]

CIS 501 (Martin): Multithreading 7
Latency vs. Throughput
• MT trades (single-thread) latency for throughput
  – Sharing the processor degrades the latency of individual threads
  + But improves the aggregate latency of both threads
  + Improves utilization
• Example
  • Thread A: individual latency = 10s, latency with thread B = 15s
  • Thread B: individual latency = 20s, latency with thread A = 25s
  • Sequential latency (first A then B, or vice versa): 30s
  • Parallel latency (A and B simultaneously): 25s
  – MT slows each thread by 5s
  + But improves total latency by 5s
• Different workloads have different parallelism
  • SpecFP has lots of ILP (can use an 8-wide machine)
  • Server workloads have TLP (can use multiple threads)

CIS 501 (Martin): Multithreading 8
MT Implementations: Similarities
• How do multiple threads share a single processor?
• Different sharing mechanisms for different kinds of structures
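The arithmetic in the latency-vs-throughput example above (slide 7) can be sketched directly; the times are the slide's illustrative numbers, not measurements:

```python
# Latency vs. throughput trade-off, using the slide's example (seconds).
# Alone: A takes 10s, B takes 20s. Sharing the core under MT slows each:
# A takes 15s, B takes 25s.

a_alone, b_alone = 10, 20
a_shared, b_shared = 15, 25

sequential = a_alone + b_alone      # run A to completion, then B
parallel = max(a_shared, b_shared)  # both done once the slower thread finishes

print(f"sequential latency:  {sequential}s")                              # 30s
print(f"parallel latency:    {parallel}s")                                # 25s
print(f"per-thread slowdown: {a_shared - a_alone}s, {b_shared - b_alone}s")  # 5s each
print(f"total-latency win:   {sequential - parallel}s")                   # 5s
```

Each thread individually gets worse, but the pair finishes 5s sooner, which is exactly the trade the slide describes.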

This note was uploaded on 10/19/2011 for the course CIS 501 taught by Professor Martin during the Fall '10 term at UPenn.
