Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03



can use SMT to hide cache access latency, and maybe even main memory latency
–  Control is at a premium:
•  How to launch >10,000 threads?
•  What if they branch in different directions?
•  What if they access random memory blocks/banks?
•  This is the “manycore” world
•  Driven by the gaming market, but with many other applications

NVidia G80
Sketchy information on graphics primitive processing
No L2 cache coherency problem, since data can be in only one cache
NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE; Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym (IEEE Micro, March-April 2008)
16 cores
Each with 8 “SP” units
16x8=128 threads execute in parallel
Each core issues instructions in “warps” of 32
Each core up to 24-way SMT
All caches are small
Texture cache does interpolation
ROP does Z-buffering & alpha blending

NVidia’s TESLA microarchitecture
•  Designed to do rendering
•  Designed to do general-purpose computing
–  But to manage thousands of threads, a new programming model is needed, called CUDA
–  CUDA is proprietary, but essentially the same model lies behind OpenCL, an open standard with implementations for multiple vendors’ GPUs
•  GPU evolved from hardware designed specific...
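The thread counts quoted for the G80 can be checked with a little arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and do not come from the slides. It assumes that "24-way SMT" means up to 24 warps resident per core, and that each SP unit handles one thread per cycle, which is how the 128-thread figure is read here.

```python
# Figures taken from the G80 slide text.
CORES = 16         # cores on the chip
SP_PER_CORE = 8    # "SP" units per core
WARP_SIZE = 32     # threads per warp
SMT_WAYS = 24      # assumption: up to 24 warps resident per core

# Threads executing in parallel across the chip (the slide's 16x8=128).
threads_executing = CORES * SP_PER_CORE

# Assumption: one thread per SP per cycle, so a 32-thread warp
# needs several cycles to pass through the 8 SP units of a core.
cycles_per_warp_instruction = WARP_SIZE // SP_PER_CORE

# Threads that can be resident (interleaved by SMT) per core and per chip.
resident_threads_per_core = SMT_WAYS * WARP_SIZE
resident_threads_total = CORES * resident_threads_per_core

print("executing in parallel:", threads_executing)
print("cycles per warp instruction:", cycles_per_warp_instruction)
print("resident threads per core:", resident_threads_per_core)
print("resident threads on chip:", resident_threads_total)
```

Under these assumptions the chip holds 12,288 resident threads while only 128 execute in any one cycle, which is consistent with the slide's question about launching more than 10,000 threads.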