or register spilling)
•  Configurable L1/scratchpad 16K+48K
•  Larger L2 cache
•  No ROP units (?)
http://

NVidia’s Fermi vs Tesla/GT200
http://

AMD “Cypress” GPU Hardware Architecture
AMD 5870 – Cypress
•  20 SIMD engines
•  16 SIMD units per core
•  5 multiply-adds per functional unit (VLIW processing)
•  2.72 Teraflops Single Precision
•  544 Gigaflops Double Precision
Source: Introductory OpenCL SAAHPC2010, Benedict R. Gaster
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

SIMD Engine
•  A SIMD engine consists of a set of “Stream Cores”
•  Stream cores arranged as a five way Very Long Instruction Word (VLIW) processor
   –  Up to five scalar operations can be issued in a VLIW instruction
   –  Scalar operations executed on each processing element
•  Stream cores within compute unit execute same VLIW instruction
   –  The block of work-items that are executed together is called a wavefront.
   –  64 work items for 5870
[Figure labels: One SIMD Engine; One Stream Core; Instruction and Control Flow; T-Processing Element; Branch Execution Unit; General Purpose Registers; Process...]
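
The peak-throughput figures quoted for the 5870 can be cross-checked with simple arithmetic. The sketch below is only an illustration. The unit counts and the wavefront size come from the slides, but the 850 MHz engine clock is an assumption (the slides do not state a clock), as is counting one multiply-add as two floating-point operations. The function names are hypothetical.

```python
# Back-of-the-envelope check of the AMD 5870 ("Cypress") peak figures.
SIMD_ENGINES = 20              # from the slides
UNITS_PER_ENGINE = 16          # "16 SIMD units per core" on the slides
MADS_PER_UNIT = 5              # five-way VLIW: 5 multiply-adds per unit
FLOPS_PER_MAD = 2              # assumption: a multiply-add counts as 2 flops
CLOCK_MHZ = 850                # assumption: engine clock, not given in the slides
WAVEFRONT_SIZE = 64            # work-items per wavefront on the 5870


def peak_sp_gflops() -> float:
    """Peak single-precision GFLOPS under the assumptions above."""
    mads_per_cycle = SIMD_ENGINES * UNITS_PER_ENGINE * MADS_PER_UNIT
    return mads_per_cycle * FLOPS_PER_MAD * CLOCK_MHZ / 1000


def cycles_per_wavefront() -> int:
    """Cycles to push one wavefront through one SIMD engine,
    assuming each unit handles one work-item per cycle."""
    return WAVEFRONT_SIZE // UNITS_PER_ENGINE


if __name__ == "__main__":
    sp = peak_sp_gflops()
    print("single precision, TFLOPS:", sp / 1000)
    # The quoted double-precision figure is one fifth of the single-precision one.
    print("double precision at 1/5 rate, GFLOPS:", sp / 5)
    print("cycles per wavefront per engine:", cycles_per_wavefront())
```

Under these assumptions the product 20 × 16 × 5 × 2 × 850 MHz reproduces the quoted 2.72 Teraflops, and the quoted 544 Gigaflops double-precision figure corresponds to one fifth of that rate.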
