Lecture 24 - Alpha 21464 (2010-04-08)


EV8: The Post-Ultimate Alpha
Dr. Joel Emer, Intel Fellow
Intel Architecture Group, Intel Corporation

Alpha Microprocessor Overview
- Higher performance, lower cost over time
- [Figure: Alpha roadmap by first system ship, 1998-2003: 21264 EV6 (0.35 µm), 21264 EV67 (0.28 µm), 21264 EV68 (0.18 µm); EV7 (0.18 µm), EV78 (0.125 µm), EV8 (0.125 µm)]

Goals
- Leadership single-stream performance
- Extra multistream performance with multithreading
  - Without major architectural changes
  - Without significant additional cost

EV8 Architecture Overview
- Aggressive instruction fetch unit
- 8-wide superscalar execution unit
- 4-way simultaneous multithreading (SMT)
- Large on-chip L2 cache
- Direct RAMBUS interface
- On-chip router for system interconnect
  - Glueless, directory-based ccNUMA with up to 512-way multiprocessing

System Block Diagram
[Figure: mesh of EV8 nodes, each with memory (M) and I/O attached, connected through the on-chip routers]

Instruction Issue
- Reduced function-unit utilization due to dependencies

Superscalar Issue
- Superscalar leads to more performance, but lower utilization

Predicated Issue
- Adds to function-unit utilization, but results are thrown away

Chip Multiprocessor
- Limited utilization when only running one thread

Fine-Grained Multithreading
- Intra-thread dependencies still limit performance

Simultaneous Multithreading
- Maximum utilization of function units by independent operations

Basic Out-of-Order Pipeline
Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire
(PC, register map, register files, Icache, Dcache)

SMT Pipeline
- Same stages as the basic out-of-order pipeline; most of the pipeline is thread-blind
- (per-thread PCs and register maps; shared Icache, Dcache, and physical register files)

Architectural Abstraction
- 1 CPU with 4 Thread Processing Units (TPUs)
- Shared hardware resources across the TPUs
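The issue-model comparison above (superscalar vs. chip multiprocessor vs. fine-grained vs. simultaneous multithreading) can be sketched with a toy utilization model. The 2-IPC per-thread ceiling is an illustrative assumption, not an EV8 measurement; only the 8-wide issue width comes from the slides.

```python
# Toy model of issue-slot utilization on a wide superscalar core.
# Assumption (illustrative, not from the slides): dependencies limit
# each thread to about 2 instructions per cycle on its own.
ISSUE_WIDTH = 8          # EV8 is 8-wide
PER_THREAD_IPC = 2.0     # assumed single-thread IPC ceiling

def utilization(num_threads: int) -> float:
    """Fraction of issue slots filled when `num_threads` run together.

    SMT lets independent threads fill the slots a single thread leaves
    empty, so aggregate IPC grows until the issue width saturates.
    """
    aggregate_ipc = min(ISSUE_WIDTH, num_threads * PER_THREAD_IPC)
    return aggregate_ipc / ISSUE_WIDTH

for t in (1, 2, 3, 4):
    print(f"{t} thread(s): {utilization(t):.0%} of issue slots used")
```

Under these assumptions one thread fills only a quarter of the slots, while four independent threads saturate the machine, which is the whole argument for SMT over a wider single-thread core.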
[Figure: four TPUs (TPU 0-3) sharing the Icache, TLB, Scache, and Dcache]

Key Design Principles
- High-throughput single-stream design
- Enhancements for SMT

Little's Law
Throughput (T) = Average Number of Tasks in Region (N) / Average Latency in Region (L)

Little's Law for Instruction Fetch
- L = fixed pipe length + average memory latency
- N = number of instructions fetched (in flight)
- T = N / L

Instruction Fetch Unit
- Wider fetch
  - Fetch more statically consecutive instructions
  - Limited by "trace" length
- Trace cache
  - Build sequences of dynamically consecutive instructions
  - Significantly greater complexity
- Double fetch
  - Fetch two non-consecutive blocks of instructions

Instruction Fetch Unit (block diagram)
[Figure: Icache with address latches, line predictor, misprediction calculation, branch predictor, jump address predictor, return predictor, rate matching buffer, and collapsing logic]

Instruction Fetch Characteristics
- Two 8-instruction fetches per cycle
- 16 branch predictions per cycle
- Jump target prediction
- Return address prediction
- Rate matching buffer of fetched instructions
- Collapse fetched instructions into groups of 8

Execution Unit
[Figure: issue queue feeding the register files, function units, and Dcache]

Execution Unit Characteristics
- Single issue queue
  - 8-wide
  - 112+ entries
- Register file
  - 512 registers
  - 16 read ports, 8 write ports
- Function units
  - 8 integer ALUs
  - 4 floating-point ALUs
  - 4 memory operations per cycle (2 read, 2 write)

Little's Law for Execution Unit
- L (min) = number of cycles in the pipe (13)
- T (desired) = desired instructions per cycle (8)
- T = N / L, so 8 = N / 13, i.e. N = 104 instructions must be in flight

[Figure: "Little's Law for the IQ" -- max IPC vs. average queue-chunk lifetime (13-60 cycles)]
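The Little's Law sizing argument above can be checked directly; a minimal sketch, where the 13-cycle minimum pipe depth and the 8-IPC target are taken from the slides:

```python
# Little's Law: T = N / L, so to sustain T instructions per cycle the
# machine must keep N = T * L instructions in flight.
PIPE_CYCLES = 13        # minimum cycles an instruction spends in the pipe (L)
TARGET_IPC = 8          # desired instructions per cycle (T)

required_in_flight = TARGET_IPC * PIPE_CYCLES
print(f"instructions that must be in flight: {required_in_flight}")  # 104

# The same law caps throughput as queue residency grows (e.g. while
# instructions wait on dependent loads): with a fixed in-flight window,
# a longer average lifetime means lower achievable IPC -- the shape of
# the "Little's Law for the IQ" curve.
for lifetime in (13, 26, 52):
    max_ipc = required_in_flight / lifetime
    print(f"avg lifetime {lifetime:2d} cycles -> max IPC {max_ipc:.1f}")
```

The 104-instruction requirement also explains the 112+ entry issue queue: the window has to be at least as large as the in-flight count Little's Law demands.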
Key Design Principles (revisited)
- High-throughput single-stream design
- Enhancements for SMT

Additions for SMT
- Replicated resources (required per thread)
  - Program counters
  - Register file (architectural space)
  - Register maps
  - ...
- Sharable resources
  - Register file (rename space)
  - Instruction queue
  - Branch predictor
  - First- and second-level caches
  - Translation buffers
  - ...

Approaches
- Replicated resources used for:
  - All per-TPU state (except the register file)
  - Some sharable resources where the design is easier (*), e.g. the return stack predictor
- Shared resources used for:
  - The register file (*)
  - All other sharable resources (*)
(*) A policy may be needed to make priority decisions

Choosing Policy
- FIFO: trivial
- Round robin: easy
- Proportional: special case
- Icount-style: fair

Why Does Icount Make Sense?
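A minimal sketch of an Icount-style chooser helps make the "fair" label concrete (the counter interface and tie-breaking rule are assumptions for illustration; the real EV8 chooser is hardware):

```python
# Icount-style chooser: fetch for the thread with the fewest
# instructions in the front end and queue. Threads making fast
# progress drain their counts and get fetched again soon; a stalled
# thread stops accumulating queue slots instead of hogging them.
def icount_choose(in_flight_counts: dict[int, int]) -> int:
    """Return the TPU id with the fewest in-flight instructions.

    Ties broken by lowest TPU id (an assumption; any fixed order works).
    """
    return min(in_flight_counts, key=lambda tpu: (in_flight_counts[tpu], tpu))

# TPU 2 is stalled and holding many queue slots; Icount steers fetch
# toward the threads that are actually retiring work.
counts = {0: 24, 1: 30, 2: 61, 3: 24}
print(icount_choose(counts))  # -> 0 (tied with TPU 3; lower id wins)
```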
By Little's Law, T = N / L. If Icount keeps each of the four threads at about N/4 instructions in the machine, each thread receives a fair quarter of the throughput:

  T = N / L
  T/4 = (N/4) / L

Choosers
[Figure: fetch/execute pipeline (address latches, line predictor, ..., retire) annotated with a chooser at each decision point: fetch, map, LD/ST number assignment, retire, load miss/store handling]
Chooser policies by decision point:
- Fetch chooser: Icount
- Map chooser: Icount
- LD/ST number chooser: proportional
- Retire chooser: round robin
- Load miss chooser: round robin
- Store buffer chooser: FIFO

Area Cost of SMT Support
- Total overhead: ~6%

Multiprogrammed Workload
[Figure: relative performance for 1T-4T on SpecInt, SpecFP, and mixed Int/FP workloads; up to ~250% of single-thread performance]

Decomposed SPEC95 Applications
[Figure: relative performance for 1T-4T on Turb3d, Swm256, and Tomcatv; up to ~250%]

Multithreaded Applications
[Figure: relative performance for 1T, 2T, and 4T on Barnes, Chess, Sort, and TP; up to ~300%]

Acknowledgements
Tryggve Fossum, Chuan-Hua Chang, George Chrysos, Steve Felix, Chris Gianos, Partha Kundu, Jud Leonard, Matt Mattina, Matt Reilly, ...
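The deck assigns four different chooser policies to different decision points; as a rough sketch (the data shapes and TPU-id interface are illustrative assumptions, not the hardware design), the policies differ only in what they order candidates by:

```python
from collections import deque

# The four chooser policies from the slides, as selection rules.

def fifo(request_queue: deque) -> int:
    """FIFO: trivial -- service requests strictly in arrival order."""
    return request_queue.popleft()

def round_robin(ready: list[int], last_served: int) -> int:
    """Round robin: easy -- next ready TPU after the last one served."""
    n = 4  # four TPUs
    for step in range(1, n + 1):
        tpu = (last_served + step) % n
        if tpu in ready:
            return tpu
    raise ValueError("no ready TPU")

def proportional(shares: dict[int, float], served: dict[int, int]) -> int:
    """Proportional: pick the TPU furthest behind its configured share."""
    return min(served, key=lambda t: served[t] / shares[t])

def icount(in_flight: dict[int, int]) -> int:
    """Icount: pick the TPU with the fewest instructions in flight."""
    return min(in_flight, key=in_flight.get)

# Per the slides: fetch and map use Icount, LD/ST number assignment is
# proportional, retire and load miss are round robin, store buffer is FIFO.
print(round_robin([0, 2, 3], last_served=0))  # -> 2
print(fifo(deque([3, 1, 0])))                 # -> 3
```

The split matches the cost/fairness trade-off in the slides: cheap policies (FIFO, round robin) where fairness matters little, Icount where queue occupancy must stay balanced.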

This note was uploaded on 08/22/2010 for the course CDA 5106 taught by Professor Staff during the Spring '08 term at University of Central Florida.
