Reads/writes supported at 32-bit dword rates private


or register spilling)
•  Configurable L1/scratchpad 16K+48K
•  Larger L2 cache
•  No ROP units (?)
http://

NVidia’s Fermi vs Tesla/GT200
http://

AMD “Cypress” GPU Hardware Architecture
AMD 5870 – Cypress
•  20 SIMD engines
•  16 SIMD units per core
•  5 multiply-adds per functional unit (VLIW processing)
•  2.72 Teraflops Single Precision
•  544 Gigaflops Double Precision
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Source: Introductory OpenCL SAAHPC2010, Benedict R. Gaster

SIMD Engine
•  A SIMD engine consists of a set of “Stream Cores”
•  Stream cores arranged as a five-way Very Long Instruction Word (VLIW) processor
   –  Up to five scalar operations can be issued in a VLIW instruction
   –  Scalar operations executed on each processing element
•  Stream cores within compute unit execute same VLIW instruction
   –  The block of work-items that are executed together is called a wavefront.
   –  64 work items for 5870
[Figure labels: One SIMD Engine; One Stream Core; Instruction and Control Flow; T-Processing Element; Branch Execution Unit; General Purpose Registers; Process...]
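
As a rough check on the Cypress figures quoted above, the peak rates can be reproduced from the unit counts. The sketch below is illustrative only and rests on assumptions that the slide does not state: an 850 MHz engine clock, a multiply-add counted as two floating-point operations, "16 SIMD units per core" read as 16 stream cores per SIMD engine, and one work-item issued per stream core per cycle. All names are hypothetical.

```python
# Figures quoted in the slide for the AMD 5870 ("Cypress").
SIMD_ENGINES = 20
STREAM_CORES_PER_ENGINE = 16   # assumed reading of "16 SIMD units per core"
LANES_PER_STREAM_CORE = 5      # five-way VLIW, one multiply-add per lane
WAVEFRONT_SIZE = 64            # work-items per wavefront on the 5870
QUOTED_SP_GFLOPS = 2720        # 2.72 Teraflops single precision
QUOTED_DP_GFLOPS = 544         # 544 Gigaflops double precision

# Assumptions NOT taken from the slide.
CLOCK_MHZ = 850                # assumed engine clock
FLOPS_PER_MULTIPLY_ADD = 2     # one multiply plus one add


def total_lanes():
    """Number of scalar lanes across the whole chip."""
    return SIMD_ENGINES * STREAM_CORES_PER_ENGINE * LANES_PER_STREAM_CORE


def peak_sp_gflops():
    """Peak single-precision rate in GFLOPS under the assumptions above."""
    return total_lanes() * FLOPS_PER_MULTIPLY_ADD * CLOCK_MHZ / 1000


def cycles_per_wavefront_instruction():
    """Cycles for one SIMD engine to issue a VLIW instruction for a full
    wavefront, assuming one work-item per stream core per cycle."""
    return WAVEFRONT_SIZE // STREAM_CORES_PER_ENGINE


if __name__ == "__main__":
    print("scalar lanes:", total_lanes())
    print("peak SP GFLOPS:", peak_sp_gflops())
    print("quoted DP/SP ratio:", QUOTED_DP_GFLOPS / QUOTED_SP_GFLOPS)
    print("cycles per wavefront instruction:", cycles_per_wavefront_instruction())
```

Under these assumptions the computed single-precision peak matches the quoted 2.72 Teraflops, and the quoted double-precision figure works out to one fifth of it. The peak is only reached when every VLIW instruction fills all five lanes with multiply-adds, so it should be read as an upper bound.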

This document was uploaded on 03/18/2014 for the course CO 332 at Imperial College.
