rp – different warps execute independently
•  Minimising divergence is important for performance, but not correctness (cf. cache lines)

TPC memory pipeline
•  Load and store instructions are generated in the SMs
   –  Address calculation (register + offset)
   –  Virtual to physical address translation
•  Issued a warp at a time – executed in half-warp groups (i.e. 16 accesses at a time)
•  Memory coalescing and alignment
   –  Threads with adjacent indices should access adjacent memory locations (i.e. thread K should access Kth data word)
   –  Accesses should be aligned for half-words
http://

CUDA example
•  Tobias Brandvik and Graham Pullan's automatic program generator for jet-engine fluid dynamics
•  http:// CUDA09_talks/pullan.pdf
•  Compute on structured 3D mesh
•  Update each element using data from its neighbours
•  Use CUDA shared memory to buffer neighbour data
•  Allocate (small!) 3D buffer in shar...
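
The coalescing and alignment guideline in the memory pipeline notes (thread K should access the Kth data word, 16 accesses at a time) can be illustrated with a small host-side model. This is a sketch in plain C++, not a CUDA kernel and not a description of the actual hardware: the function name `segments_touched`, the 4-byte word size, and the 64-byte segment size are assumptions chosen for illustration. It counts how many distinct aligned segments one half-warp touches for a given starting word and stride; fewer segments stands in for fewer memory transactions.

```cpp
#include <cstddef>
#include <set>

// Hypothetical model parameters (assumptions, not taken from the slides,
// apart from the half-warp size of 16).
constexpr std::size_t kHalfWarp = 16;      // accesses serviced together
constexpr std::size_t kWordBytes = 4;      // assumed 32-bit data words
constexpr std::size_t kSegmentBytes = 64;  // assumed aligned segment: 16 words x 4 bytes

// Count the distinct aligned segments touched by one half-warp when
// thread k reads the word at index (first_word + k * stride).
std::size_t segments_touched(std::size_t first_word, std::size_t stride) {
    std::set<std::size_t> segments;
    for (std::size_t k = 0; k < kHalfWarp; ++k) {
        const std::size_t byte_addr = (first_word + k * stride) * kWordBytes;
        segments.insert(byte_addr / kSegmentBytes);  // index of the aligned segment
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(0, 1)` is 1 (adjacent and aligned), `segments_touched(1, 1)` is 2 (adjacent but misaligned by one word), and `segments_touched(0, 16)` is 16 (every thread lands in its own segment). Real devices differ in segment sizes and coalescing rules, so the numbers indicate the trend only.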
