07 Optimizing CUDA (NVIDIA)

Optimizing CUDA
© NVIDIA Corporation 2008

Outline
- Overview
- Hardware
- Memory Optimizations
- Execution Configuration Optimizations
- Instruction Optimizations
- Summary

Optimize Algorithms for the GPU
- Maximize independent parallelism
- Maximize arithmetic intensity (math per unit of memory bandwidth)
- Sometimes it is better to recompute than to cache: the GPU spends its transistors on ALUs, not memory
- Do more computation on the GPU to avoid costly data transfers; even low-parallelism computations can sometimes be faster than transferring data back and forth to the host (see the pipeline sketch after these notes)

Optimize Memory Access
- Coalesced vs. non-coalesced access to global/local device memory: roughly an order of magnitude difference in effective bandwidth (see the coalescing sketch after these notes)
- Optimize for spatial locality in cached texture memory
- In shared memory, avoid high-degree bank conflicts
- Partition camping: global memory accesses are not evenly distributed among the memory partitions; problem-size dependent

Take Advantage of Shared Memory
- Hundreds of times faster than global memory
- Threads can cooperate via shared memory
- Use one or a few threads to load or compute data shared by all threads
- Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (see the transpose sketch after these notes)

Use Parallelism Efficiently
- Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks (see the launch-configuration sketch after these notes)
- Keep resource usage (registers, shared memory) low enough to support multiple active thread blocks per multiprocessor

Outline
- Overview
- Hardware
- Memory Optimizations
- Execution Configuration Optimizations
- Instruction Optimizations
- Summary

10-Series Architecture
- 240 thread processors execute kernel threads
- 30 multiprocessors, each containing 8 thread processors, one double-precision unit, and shared memory that enables thread cooperation

Execution Model (software-to-hardware mapping)
- Thread -> thread processor: threads are executed by thread processors
- Thread block -> multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
- Grid -> device: a kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time

Warps and Half Warps
- A thread block consists of 32-thread warps
- A warp is executed physically in parallel (SIMD) on a multiprocessor
- A half-warp of 16 threads can coordinate global memory accesses into a single transaction

Memory Architecture (diagram)
- Host: CPU, chipset, DRAM
- Device: DRAM holding local, global, constant, and texture memory; GPU with multiple multiprocessors, each with registers and shared memory, plus constant and texture caches
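Pipeline sketch. The "do more computation on the GPU" point, as a minimal host-side sketch. The kernels stepOne and stepTwo, the buffer name, and the block size are hypothetical; the point is only that the intermediate result stays in device memory and never crosses the PCIe bus between stages.

    #include <cuda_runtime.h>

    // Hypothetical kernels standing in for two stages of a computation.
    __global__ void stepOne(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f;
    }

    __global__ void stepTwo(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] + 1.0f;
    }

    // Keep intermediate results on the device between kernels:
    // one transfer in, one transfer out.
    void runPipeline(const float* h_in, float* h_out, int n)
    {
        float* d_buf = 0;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void**)&d_buf, bytes);
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);   // single transfer to the device

        int block = 256;
        int grid = (n + block - 1) / block;
        stepOne<<<grid, block>>>(d_buf, n);                        // intermediate data stays in
        stepTwo<<<grid, block>>>(d_buf, n);                        // device DRAM between stages

        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);   // single transfer back
        cudaFree(d_buf);
    }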
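Coalescing sketch. The order-of-magnitude claim on the Optimize Memory Access slide comes from coalescing: consecutive threads of a half-warp reading consecutive addresses can be combined into a single memory transaction. A minimal sketch contrasting the two patterns; the kernel names and the stride parameter are illustrative, not from the slides.

    // Coalesced: consecutive threads touch consecutive addresses.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i handles element i
        if (i < n)
            out[i] = in[i];                              // half-warp accesses combine into one transaction
    }

    // Non-coalesced: threads touch addresses 'stride' elements apart,
    // so the half-warp can no longer combine its accesses.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = i * stride;
        if (j < n)
            out[j] = in[j];
    }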
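Transpose sketch. The staging idea from the Take Advantage of Shared Memory slide, shown with the common shared-memory transpose pattern: both the global read and the global write use coalesced addresses, and the reordering happens in on-chip shared memory. TILE_DIM, the padding by one element, and the kernel name are illustrative choices, not taken from the slides.

    #define TILE_DIM 16   // one half-warp per tile row (illustrative)

    __global__ void transposeShared(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 pad avoids shared-memory bank conflicts

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global read

        __syncthreads();                                           // all loads visible to the block

        // Swap block indices so the write also hits consecutive addresses.
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced global write
    }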
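Launch-configuration sketch. A host-side illustration of "many threads, many thread blocks": launch enough blocks that each of the 30 multiprocessors on a 10-series GPU has several resident blocks, so other warps can run while some wait on memory. It reuses the copyCoalesced kernel from the coalescing sketch; the block size of 256 is an illustrative choice (a multiple of the 32-thread warp).

    void launchCopy(const float* d_in, float* d_out, int n)
    {
        int threadsPerBlock = 256;                                  // multiple of the warp size (32)
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // one thread per element
        // For large n this yields hundreds of blocks, enough to spread work
        // evenly across all multiprocessors and hide memory latency.
        copyCoalesced<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    }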