cs8803SC_lecture12

CS8803SC Software and Hardware Cooperative Computing
G80 Architecture (control flow)
Prof. Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

Quick terminology review
• Thread: concurrent code and its associated state, executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA
• Warp: a group of threads executed physically in parallel in G80
• Block: a group of threads that are executed together and form the unit of resource assignment
• Grid: a group of thread blocks that must all complete before the next phase of the program can begin

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC
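The thread/block/grid hierarchy above can be sketched in plain Python. This is an illustrative model, not device code; the block and grid sizes below are assumptions for the example, and `global_thread_id` mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
# Plain-Python sketch of the CUDA thread hierarchy from the terminology review.
# The sizes below are illustrative assumptions, not values from the slides.
BLOCK_DIM = 256     # threads per block (assumed)
GRID_DIM = 4        # blocks per grid (assumed)

def global_thread_id(block_idx, thread_idx):
    """Global index of a thread across the grid,
    as in blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * BLOCK_DIM + thread_idx

# First thread of block 1 comes right after the 256 threads of block 0.
print(global_thread_id(1, 0))            # 256
# Total threads launched by this (assumed) grid.
print(GRID_DIM * BLOCK_DIM)              # 1024
```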
How thread blocks are partitioned
• Thread blocks are partitioned into warps
– Thread IDs within a warp are consecutive and increasing
– Warp 0 starts with thread ID 0
• The partitioning is always the same
– Thus you can use this knowledge in control flow
– However, the exact size of warps may change from generation to generation
• However, DO NOT rely on any ordering between warps
– If there are any dependencies between threads, you must use __syncthreads() to get correct results

Control Flow Instructions
• The main performance concern with branching is divergence
– Threads within a single warp take different paths
– Different execution paths are serialized in G80: the control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: divergence can be avoided when the branch condition is a function of the thread ID
– Example with divergence:
• if (threadIdx.x > 2) { }
• This creates two different control paths for threads in a block
• Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
– Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) { }
• This also creates two different control paths for threads in a block
• Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC
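The two branch conditions above can be checked with a small Python sketch (not G80 hardware code): a warp diverges exactly when its threads disagree on the branch condition. WARP_SIZE = 32 is assumed here for G80.

```python
# Sketch: does a branch condition make the threads of one warp diverge?
# A warp's threads have consecutive IDs, so warp w covers thread IDs
# [w * WARP_SIZE, (w + 1) * WARP_SIZE). WARP_SIZE = 32 assumed for G80.
WARP_SIZE = 32

def diverges(warp, cond):
    """True if the threads of `warp` disagree on `cond`,
    forcing G80 to serialize the two control paths."""
    outcomes = {cond(t) for t in range(warp * WARP_SIZE, (warp + 1) * WARP_SIZE)}
    return len(outcomes) > 1

divergent_cond = lambda tid: tid > 2               # slide's divergent example
uniform_cond   = lambda tid: tid // WARP_SIZE > 2  # slide's divergence-free example

print(diverges(0, divergent_cond))  # True: threads 0-2 split off from the rest
print(diverges(0, uniform_cond))    # False: all of warp 0 takes the same path
print(diverges(3, uniform_cond))    # False: all of warp 3 takes the same path
```

Only warp 0 ever diverges under the first condition; under the second, the branch boundary falls exactly between warps 2 and 3, so every warp stays uniform.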
Parallel Reduction
• Given an array of values, “reduce” them to a single value in parallel
• Examples:
– Sum reduction: sum of all values in the array
– Max reduction: maximum of all values in the array
• Typical parallel implementation:
– Recursively halve the number of threads, add two values per thread
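The tree-style reduction above can be sketched sequentially in Python: each step halves the number of active "threads", and each active thread combines two values one stride apart. The `op` parameter is an addition for illustration, letting the same loop express both the sum and max reductions mentioned on the slide.

```python
# Sequential sketch of the tree-style parallel reduction described above.
# Each while-iteration models one parallel step; each t models one active thread.
def reduce_parallel(values, op):
    data = list(values)            # length assumed to be a power of two
    stride = len(data) // 2
    while stride > 0:              # recursively halve the number of active threads
        for t in range(stride):    # thread t combines two values, stride apart
            data[t] = op(data[t], data[t + stride])
        stride //= 2
    return data[0]

print(reduce_parallel([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # 36
print(reduce_parallel([3, 1, 4, 1, 5, 9, 2, 6], max))                 # 9
```

On a GPU the inner `for` loop is what actually runs in parallel, one thread per `t`, with a __syncthreads() between steps since each step reads results written by the previous one.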