Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03


...rp – different warps execute independently
•  Minimising divergence is important for performance, but not correctness (cf. cache lines)

TPC memory pipeline
•  Load and store instructions are generated in the SMs
   –  Address calculation (register + offset)
   –  Virtual to physical address translation
•  Issued a warp at a time – executed in half-warp groups (i.e. 16 accesses at a time)
•  Memory coalescing and alignment
   –  Threads with adjacent indices should access adjacent memory locations (i.e. thread K should access the Kth data word)
   –  Accesses should be aligned for half-words
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

CUDA example
•  Tobias Brandvik and Graham Pullan’s automatic program generator for jet-engine fluid dynamics
•  http://www.industrialmath.net/CUDA09_talks/pullan.pdf
•  Compute on structured 3D mesh
•  Update each element using data from its neighbours
•  Use CUDA shared memory to buffer neighbour data
•  Allocate (small!) 3D buffer in shar...
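
The coalescing and alignment advice under "TPC memory pipeline" can be illustrated with a small toy model. The sketch below is plain Python, not CUDA, and it is not a description of any particular GPU: it assumes 4-byte data words and 64-byte aligned memory segments (both values are assumptions chosen for illustration, as are the function names), and it counts how many distinct segments the 16 threads of one half-warp touch. Fewer segments suggests fewer memory transactions.

```python
HALF_WARP = 16       # accesses serviced together, as stated in the slides
WORD_BYTES = 4       # assumed size of one data word
SEGMENT_BYTES = 64   # assumed size of one aligned memory segment


def half_warp_addresses(base, stride_words):
    """Byte address accessed by each thread K of a half-warp (hypothetical helper)."""
    return [base + k * stride_words * WORD_BYTES for k in range(HALF_WARP)]


def segments_touched(addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments covered by the given addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Thread K reads word K, starting on a segment boundary.
coalesced = segments_touched(half_warp_addresses(base=0, stride_words=1))

# Same adjacent pattern, but shifted by one word so it straddles a boundary.
misaligned = segments_touched(half_warp_addresses(base=4, stride_words=1))

# Thread K reads word 16*K, so every thread lands in its own segment.
strided = segments_touched(half_warp_addresses(base=0, stride_words=16))

print(coalesced, misaligned, strided)  # prints "1 2 16"
```

Under these assumptions, the adjacent and aligned pattern touches a single segment, a one-word misalignment doubles that, and a stride of 16 words spreads the half-warp over 16 segments. Real hardware rules differ between GPU generations, so the counts are indicative only.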

This document was uploaded on 03/18/2014 for the course CO 332 at Imperial College.
