Lecture 19: The Future of High Performance Computing

Art of Parallel Programming


Today's lecture
- Architectural directions
- Blue Gene/L
- STI CELL, including a brief introduction to instruction-level parallelism

Blue Gene/L
- IBM / US Dept. of Energy collaboration
- 64K dual-processor nodes: 180 (360) TF peak
- One CPU on each node dedicated to communication (but it may be used for computation)
- Low power
- Relatively slow processors: PowerPC 440
- Small memory (256 MB)
- High-performance interconnect
- Lightweight kernel runs on each node
  - Single user, single 2-thread process
  - No context switching, no demand paging
- SDSC has 3072 nodes (6144 processors)

Blue Gene/L Interconnect
- 3D toroidal mesh (end-around connections)
- 175 MB/sec peak bidirectional bandwidth
- Rapid combining/broadcast network: 350 MB/sec
- 1.5 µs latency (one-way)
- Fast barriers (1.5 µs)
[Image courtesy of IBM: http://www.research.ibm.com/journal/rd/492/gara4.gif]

Blue Gene/L Compute chip
- Shared L3 cache
- L1 caches not coherent
(G. Almasi et al., 2002 IEEE)

STI CELL

A brief introduction to Instruction Level Parallelism

What is instruction-level parallelism?
- The potential for executing certain instructions in parallel, because they are independent
- Any technique for identifying and exploiting such opportunities
  - Static: can be implemented by a compiler
  - Dynamic: requires hardware support

How does it manifest itself?
- Basic blocks: sequences of instructions that appear between branches
  - Usually no more than 5 or 6 instructions!
- Loops:
    for (i=0; i<N; i++) x[i] = x[i] + s;
- We can only realize ILP by finding sequences of independent instructions
- Dependent instructions must be separated by a sufficient amount of time, namely the functional unit latency

SIMD Instruction Set
- Operations defined on vectors
- Modern architectural extensions
  - Intel SSE = Streaming SIMD Extensions
  - AltiVec: Apple Computer, IBM and Freescale Semiconductor
- Both support 128-bit vector registers, which can represent 16 x 8-bit chars, ..., 4 x 32-bit integers, floating point, ...
[Figure: an array of processing elements (PEs) driven by a single control unit and connected by an interconnect]

Limitations of ILP
- Recall the loop:
    for (i=0; i<N; i++) x[i] = x[i] + s;

    L:  LD    F0, 0(R1)     ; F0 is the vector element
        ADDD  F4, F0, F2    ; add the scalar
        SD    0(R1), F4     ; store the result
        SUBI  R1, R1, #8    ; decrement by 8 (to previous word)
        BNEZ  R1, L         ; branch if R1 != 0
        NOP                 ; delayed branch slot

Delays
- for (i=0; i<N; i++) x[i] = x[i] + s;
- Assumed functional unit latencies:

    Unit         Latency   Initiation interval
    Integer      0         1
    Memory       1         1
    FP Add       3         1
    FP Multiply  6         1
    FP Div       24        24

[Figure: pipeline with IF and ID stages feeding an integer EX unit, an FP/integer multiplier (M1-M7), an FP adder (A1-A4), and an FP/integer divider (DIV), followed by MEM and WB]

- With these latencies the loop stalls:

    L:  LD    F0, 0(R1)
        stall
        ADDD  F4, F0, F2
        stall
        stall
        SD    0(R1), F4
        SUBI  R1, R1, #8
        BNEZ  R1, L
        stall
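As an aside that is not part of the original slides, the SIMD extensions named above can be reached from C through compiler intrinsics. The sketch below rewrites the x[i] = x[i] + s loop with Intel SSE; the function name add_scalar_sse and the use of unaligned loads are our own choices, and the scalar tail handles lengths that are not a multiple of four.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add the scalar s to every element of x, four floats per instruction. */
    void add_scalar_sse(float *x, int n, float s)
    {
        __m128 vs = _mm_set1_ps(s);           /* broadcast s into all 4 lanes */
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 v = _mm_loadu_ps(&x[i]);   /* load 4 elements */
            v = _mm_add_ps(v, vs);            /* 4 independent adds at once */
            _mm_storeu_ps(&x[i], v);          /* store 4 elements */
        }
        for (; i < n; i++)                    /* scalar cleanup for the remainder */
            x[i] = x[i] + s;
    }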
Dynamic scheduling
- Idea: modify the processor to permit instructions to execute as soon as their operands become available
- This is known as out-of-order execution
- Complications
  - Dynamically scheduled instructions also complete out of order
  - Increased hardware complexity: reservation stations and execution units, bookkeeping, buffering
  - Slows down the processor

Dynamic scheduling (Tomasulo's Algorithm)
[Figure: Tomasulo hardware. Instructions arrive from the instruction unit into a floating-point operation queue; load buffers (6) and store buffers (3) connect to memory; reservation stations (3 in front of the FP adders, 2 in front of the FP multipliers) receive operands from the FP registers over operand and operation buses; results are broadcast on the common data bus (CDB)]
- Assumed latencies:

    Unit         Latency
    Integer      0
    Memory       1
    FP Add       2
    FP Multiply  10
    FP Div       40

The three steps of instruction execution
- Issue
  - Get the instruction from the fetch queue
  - If a reservation station is available, buffer the register operands at the station
  - If there is no such station, stall the instruction
- Execute
  - Monitor the internal interconnect for any needed operands
  - Intercept those operands and write them to the station
  - When both operands are present, execute the operation
- Write result
  - Send the result to the registers and to any waiting units

Automatic, dynamic loop unrolling
- Consider the loop body:

    L:  LD     F0, 0(R1)
        MULTD  F4, F0, F2
        SD     0(R1), F4
        SUBI   R1, R1, #8
        BNEZ   R1, L

- The slides build up the following table one step at a time: the instructions of iterations 1 and 2 all issue, and the two independent loads execute before the earlier multiplies complete

    Instruction         Iteration   Issue   Exec
    LD     F0, 0(R1)    1           X       X
    MULTD  F4, F0, F2   1           X
    SD     0(R1), F4    1           X
    LD     F0, 0(R1)    2           X       X
    MULTD  F4, F0, F2   2           X
    SD     0(R1), F4    2           X

Cell Overview
- Chip multiprocessor for multithreaded applications
- Adapted PowerPC: 8 SPEs + 1 PPE
- Shared address space
- SIMD (VMX)
- Software-managed resources on the SPE
- Optimize the common case

Block diagram
[Image: http://domino.watson.ibm.com/comm/research.nsf/pages/r.arch.innovation.html]

CELL block diagram showing detail
[Figure, courtesy Sam Sandbote: eight SPEs, each with 4 FPUs, 4 ALUs, a 128 x 128-bit register file, and 256 KB of local memory, connected over the EIB to the DMA and I/O controllers and to the PPE, a 64-bit SMT Power core (2-way in-order superscalar) with I$/D$ and 512 KB of L2]
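Returning to the dynamic loop unrolling table above: the hardware overlaps independent iterations on its own. As an illustration that is not taken from the slides, the same independence can also be exposed statically by unrolling the multiply loop by hand in C, giving the processor four operations per trip that do not depend on one another. The function name scale_unrolled is hypothetical.

    /* Scale every element of x by s, unrolled four ways by hand. */
    void scale_unrolled(double *x, int n, double s)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            x[i]     = x[i]     * s;   /* these four multiplies are */
            x[i + 1] = x[i + 1] * s;   /* independent of one another, */
            x[i + 2] = x[i + 2] * s;   /* so their latencies can */
            x[i + 3] = x[i + 3] * s;   /* overlap in the pipeline */
        }
        for (; i < n; i++)             /* remainder loop */
            x[i] = x[i] * s;
    }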
SPE - Synergistic processing element
- 256 KB local store
- 128-bit x 128-entry register file
- DMAs to the local store via the EIB: 4 x 128 bits wide, 200 GB/sec
- Rambus XDR DRAM interface: 3.2 Gbit/sec per SPE, 25.6 Gb/s aggregate
[Image courtesy Keith O'Conor: http://isg.cs.tcd.ie/oconork/presentations/CellBroadbandEngine.ppt]

Inside the SPE
- SIMD (VMX)
  - Stacked operations
  - Unified vector and scalar operations
- Software-managed resources: optimize the common case
  - In-order issue
  - Parallelism in VMX ops and across SPEs rather than ILP
  - Aligned accesses
  - Branch hints
- No cache on the SPEs: local memory instead
- Large register files
- Scheduling of SPEs

SPE Local Store
- Scheduling is an issue, handled mostly in software
- Want high utilization
- Multiple simultaneous requests
- DMA: 16 x 16 KB
- Priority: DMA, then load/store, then instruction fetch
[Image courtesy Keith O'Conor: http://isg.cs.tcd.ie/oconork/presentations/CellBroadbandEngine.ppt]

Usage scenarios
- Pipeline
- Multithreaded
- Function offload
[Image courtesy Keith O'Conor: http://isg.cs.tcd.ie/oconork/presentations/CellBroadbandEngine.ppt]
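Because the SPE local store is software managed, keeping the SPE busy usually means double buffering: fetch the next chunk of data with DMA while computing on the current one. The sketch below shows that pattern only in outline; dma_get, dma_wait, and compute are hypothetical placeholders rather than the actual Cell SDK calls, and CHUNK is an assumed tile size.

    #define CHUNK 4096   /* assumed bytes per DMA transfer */

    /* Hypothetical helpers standing in for the real DMA interface:
       dma_get() starts an asynchronous transfer from main memory into the
       local store, dma_wait() blocks until the transfer tagged 'tag' has
       completed, and compute() processes one chunk already resident locally. */
    extern void dma_get(void *local, unsigned long long remote, int size, int tag);
    extern void dma_wait(int tag);
    extern void compute(char *local, int size);

    void process_stream(unsigned long long remote, int nchunks)
    {
        static char buf[2][CHUNK];              /* two local-store buffers */
        int cur = 0;

        dma_get(buf[cur], remote, CHUNK, cur);  /* prefetch the first chunk */
        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                /* start fetching chunk i+1 ... */
                dma_get(buf[next],
                        remote + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next);
            dma_wait(cur);                      /* ... while waiting only for chunk i */
            compute(buf[cur], CHUNK);           /* compute overlaps the next DMA */
            cur = next;
        }
    }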

This note was uploaded on 02/14/2008 for the course CSE 160 taught by Professor Baden during the Fall '06 term at UCSD.
