Ch07 Advanced Computer Architecture: Manycores and GPUs (Paul Kelly)



…ed memory
•  Find out which iteration this thread is doing
•  This thread loads its share of the block into shared memory
•  Each thread executes a loop
•  Two phases:
   –  Load next plane
   –  Compute
•  syncthreads() is needed to ensure all threads in this block have completed the phase before moving on

Common pattern (see the CUDA sketch at the end of this preview):
•  Threads cooperate to load data into shared memory in parallel
•  syncthreads()
•  Use it, and repeat

GT200
•  10 Thread Processing Clusters (TPCs)
•  3 Streaming Multiprocessors (SMs) per TPC
•  8 32-bit FPUs per SM
•  1 64-bit FPU per SM
•  16K 32-bit registers per SM
•  Up to 1024 threads / 32 warps per SM
•  16KB shared memory per SM
•  8 64-bit memory controllers (512-bit wide memory interface)

G80
•  8 Thread Processing Clusters (TPCs)
•  2 Streaming Multiprocessors (SMs) per TPC
•  8 32-bit FPUs per SM
•  8K 32-bit registers per SM
•  Up to 768 threads / 24 warps per SM
•  16KB shared memory per SM
•  6 64-bit memory controllers (384-bit wide memory interface)

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

NVidia's Fermi
•  2010-generation GPU specifically targeted at the general-purpose compute market
•  Read-write L1 caches (f…
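The two-phase load/compute pattern above maps directly onto CUDA. What follows is a minimal sketch, not code from the slides: the kernel name plane_sweep, the tile size, the neighbour-averaging computation, and the assumption that nx and ny are exact multiples of TILE are all illustrative choices. It shows the essential structure: each iteration of the loop loads one z-plane tile cooperatively, synchronises, computes using values loaded by other threads, then synchronises again before the tile is overwritten.

#define TILE 16

// Sketch of the load / syncthreads / compute pattern (illustrative).
// Assumes nx and ny are multiples of TILE, so no bounds checks are needed;
// an early return before __syncthreads() would be unsafe, since every
// thread in the block must reach each barrier.
__global__ void plane_sweep(const float *in, float *out,
                            int nx, int ny, int nz)
{
    __shared__ float plane[TILE][TILE];   // one z-plane tile per block

    // Find out which element this thread is doing.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    float acc = 0.0f;
    for (int z = 0; z < nz; ++z) {
        // Phase 1: threads cooperate to load the next plane into shared memory.
        plane[threadIdx.y][threadIdx.x] = in[((size_t)z * ny + y) * nx + x];
        __syncthreads();   // all loads complete before anyone computes

        // Phase 2: compute, using a value loaded by another thread
        // (here, an illustrative average with the left neighbour in the tile).
        int xl = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
        acc += 0.5f * (plane[threadIdx.y][threadIdx.x] + plane[threadIdx.y][xl]);
        __syncthreads();   // all computes done before the tile is overwritten
    }
    out[(size_t)y * nx + x] = acc;
}

// Example launch, one thread per output element:
//   plane_sweep<<<dim3(nx/TILE, ny/TILE), dim3(TILE, TILE)>>>(d_in, d_out, nx, ny, nz);

Note the second __syncthreads(): without it, a fast thread could start Phase 1 of the next iteration and overwrite tile entries that a slower thread is still reading in Phase 2.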