cs8803SC_lecture14

Page 1
CS8803SC Software and Hardware Cooperative Computing: CUDA Optimization
Prof. Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

Review:
• Branch instruction handling in G80
• Divergent branch
• Predication
• Error bounds
• IEEE floating point format
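As a quick illustration of the review topics above, a warp diverges when its threads disagree on a data-dependent branch, forcing the hardware to serialize both paths; a branch whose condition is uniform across the warp costs nothing extra. A minimal sketch (kernel and variable names are illustrative, not from the slides):

```cuda
__global__ void divergence_demo(float *out, const float *in) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent branch: even and odd threads of the SAME warp take
    // different paths, so the warp executes both sides serially.
    if (tid % 2 == 0)
        out[tid] = in[tid] * 2.0f;
    else
        out[tid] = in[tid] + 1.0f;

    // Warp-uniform branch: all 32 threads of a warp share the same
    // value of (tid / 32), so no divergence occurs here.
    if ((tid / 32) % 2 == 0)
        out[tid] += 1.0f;
}
```

For very short divergent bodies like the one above, the compiler may instead use predication: both instructions are issued to every thread and the per-thread predicate simply discards the unwanted result.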
Page 2
Today's Goal
• CUDA optimization techniques
  – Slides are from a presentation at SC2007 by Mark Harris (NVIDIA Developer Technology)
• An example from image convolution

CUDA Optimization Strategies
• Optimize algorithms for the GPU
• Optimize memory access coherence
• Take advantage of on-chip shared memory
• Use parallelism efficiently
Page 3
Optimize Algorithms for the GPU
• Maximize independent parallelism
• Maximize arithmetic intensity (math/bandwidth)
• Sometimes it's better to recompute than to cache
  – GPU spends its transistors on ALUs, not memory
• Do more computation on the GPU to avoid costly data transfers
  – Even low-parallelism computations can sometimes be faster than transferring back and forth to the host

Optimize Memory Coherence
• Coalesced vs. non-coalesced = order of magnitude
  – Global/local device memory
• Optimize for spatial locality in cached texture memory
• In shared memory, avoid high-degree bank conflicts
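The coalesced vs. non-coalesced gap can be seen by comparing two copy kernels; the kernel names and the stride parameter below are illustrative, not from the slides:

```cuda
// Coalesced: thread k of each warp touches word k of a contiguous
// region, so the warp's 32 reads combine into a few wide transactions.
__global__ void copy_coalesced(float *dst, const float *src) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Non-coalesced: a stride between neighboring threads scatters the
// warp's reads across memory, costing roughly one transaction per
// thread instead of per warp.
__global__ void copy_strided(float *dst, const float *src, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];
}
```

Both kernels move the same amount of useful data; only the access pattern differs, which is why the slides attribute an order-of-magnitude effect to coalescing alone.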
Page 4
Take Advantage of Shared Memory
• Hundreds of times faster than global memory
• Threads can cooperate via shared memory
• Use one / a few threads to load / compute data shared by all threads
• Use it to avoid non-coalesced access
  – Stage loads and stores in shared memory to re-order addressing that cannot be coalesced directly

Use Parallelism Efficiently
• Partition your computation to keep the GPU multiprocessors equally busy
  – Many threads, many thread blocks
• Keep resource usage low enough to support multiple active thread blocks per multiprocessor
  – Registers, shared memory
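The classic example of staging through shared memory to re-order addressing is a matrix transpose: a naive transpose must make either its reads or its writes strided, but routing the tile through shared memory lets both global accesses stay coalesced. A sketch, assuming a square matrix whose width is a multiple of the tile size (names and the 16x16 tile size are illustrative):

```cuda
#define TILE 16

__global__ void transpose(float *out, const float *in, int width) {
    // +1 column of padding shifts each row to a different bank,
    // avoiding bank conflicts on the transposed (column-wise) read.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read

    __syncthreads();  // all loads complete before any thread stores

    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices: the
    y = blockIdx.x * TILE + threadIdx.y;  // re-ordering happened on chip
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

The re-ordering (reading the shared tile column-wise) happens entirely in on-chip memory, where non-unit strides cost nothing as long as bank conflicts are avoided.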
Page 5
Global Memory Reads/Writes
• Highest-latency instructions: 400–600 clock cycles
• Likely to be the performance bottleneck
• Optimizations can greatly increase performance
  – Coalescing: up to 10x speedup

Coalescing
• A coordinated read by a warp
• A contiguous region of global memory:
  – 128 bytes: each thread reads a word (int, float, ...)
  – 256 bytes: each thread reads a double-word (int2, float2, ...)
  – 512 bytes: each thread reads a quad-word (int4, float4, ...)
• Additional restrictions:
  – The starting address of the region must be a multiple of the region size
  – The k-th thread in a warp must access the k-th element in the block being read
  – Exception: not all threads must participate
    • Predicated access, divergence within a warp
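The alignment restriction above is easy to violate by accident with a simple offset. A sketch of the effect (kernel name and the offset parameter are illustrative):

```cuda
__global__ void read_with_offset(float *out, const float *in, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // offset == 0: the warp reads a 128-byte region whose start is
    // 128-byte aligned (cudaMalloc returns suitably aligned pointers),
    // so the access coalesces.
    // offset == 1: the same warp's region now straddles an alignment
    // boundary, so on G80-class hardware the access does not coalesce
    // even though thread k still reads consecutive element k.
    out[i] = in[i + offset];
}
```

Note that both conditions must hold at once: consecutive per-thread addressing alone is not enough if the region's starting address is misaligned.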
Page 6
Coalesced Access: Reading Floats
(Figure: threads t0, t1, t2, t3, ... of a warp each read one consecutive float from a contiguous region of global memory starting at address 128.)

This note was uploaded on 10/06/2010 for the course CS 8803 taught by Professor Staff during the Spring '08 term at Georgia Tech.
