Carnegie Mellon Parallel Computing Notes on Lecture 6

Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2014
Lecture 6: Programming for Performance, Part 1: Work Distribution
Tunes: The Heavy, "Colleen" (Great Vengeance and Furious Fire)

"Colleen? Ha, that wasn't about a girl. We wrote that one about the dangers of premature program optimization. It burns everyone, and it's certainly burned me." - Kelvin Swaby
Today:
- Review mapping of CUDA programming to GPUs
- Solver example in the message passing model
- Begin discussing techniques for optimizing parallel programs
Finishing up a few CUDA concepts...
Review: executing warps on GTX 680

Functional units on an SMX core:
- SIMD functional unit, control shared across 32 units (1 MUL-ADD per clock)
- "special" SIMD functional unit, control shared across 32 units (operations like sin/cos)
- SIMD load/store unit (handles warp loads/stores, gathers/scatters)

SMX core operation each clock:
- Select up to four runnable warps from up to 64 resident on core (thread-level parallelism)
- Select up to two runnable instructions per warp (instruction-level parallelism)
- Execute instructions on available groups of SIMD ALUs, special-function ALUs, or LD/ST units
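Taken together, the per-clock numbers above imply the scale of an SMX's scheduling problem. A quick back-of-the-envelope check (my own sketch, not lecture material):

```python
# Back-of-the-envelope numbers for one GTX 680 SMX, taken from the slide above.
WARP_SIZE = 32                # CUDA threads per warp
MAX_RESIDENT_WARPS = 64       # warps resident on one SMX core
WARPS_SELECTED_PER_CLOCK = 4  # thread-level parallelism
INSTR_PER_WARP_PER_CLOCK = 2  # instruction-level parallelism

# Up to 2048 CUDA threads are resident (interleaved) per SMX ...
resident_threads = MAX_RESIDENT_WARPS * WARP_SIZE

# ... of which at most 8 instructions can be issued in any one clock.
peak_issue = WARPS_SELECTED_PER_CLOCK * INSTR_PER_WARP_PER_CLOCK

print(resident_threads)  # 2048
print(peak_issue)        # 8
```

The gap between 2048 resident threads and 8 issued instructions per clock is the latency-hiding headroom: most resident warps are expected to be stalled at any moment.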
Review: scheduling threads in a thread block

Imagine a thread block with 256 CUDA threads, but only 4 warps worth of parallel execution in HW. Why not just have a pool of four "worker" warps? (e.g., run 4x32=128 threads in the block to completion, then run the next 128 threads in the block)

```cuda
#define THREADS_PER_BLK 256

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperatively load the block's input region into shared memory
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }

    __syncthreads();

    float result = 0.0f;  // thread-local variable
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];

    output[index] = result;
}
```

CUDA kernels may create dependencies between threads in a block; the simplest example is __syncthreads(). When dependencies exist, threads in a block cannot be executed by the system in just any order. CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable, it will eventually be run! (no deadlock)
CUDA execution semantics

Thread blocks can be scheduled in any order by the system
- System assumes no dependencies between blocks
- A lot like ISPC tasks, right?

Threads in a block DO run concurrently
- When a block begins execution, all of its threads are running concurrently (these semantics impose a scheduling constraint on the system)
- A CUDA thread block is itself an SPMD program (like an ISPC gang of program instances)
- Threads in a thread block are concurrent, cooperating "workers"

CUDA implementation:
- A Kepler GPU warp has performance characteristics akin to an ISPC gang of instances (but
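These semantics are exactly why the "pool of four worker warps running threads to completion" idea from the earlier slide fails. A toy Python model (an illustration I'm adding, not lecture code) of a block with a barrier, under run-to-completion vs. concurrent scheduling:

```python
# Toy model of scheduling a thread block that contains a barrier.
# Threads are generators that yield when they hit __syncthreads().

def make_thread(tid, events):
    events.append(("pre", tid))    # work before __syncthreads()
    yield "barrier"                # models __syncthreads()
    events.append(("post", tid))   # work after __syncthreads()

def run_to_completion_first(num_threads):
    """Broken policy: finish thread 0 before starting the others."""
    events = []
    threads = [make_thread(t, events) for t in range(num_threads)]
    next(threads[0])               # thread 0 reaches the barrier...
    # ...but it cannot pass until ALL threads in the block arrive, and this
    # policy never starts the others: the block is deadlocked.
    arrived = 1
    return arrived < num_threads   # True means stuck

def run_concurrently(num_threads):
    """What CUDA semantics require: every runnable thread makes progress."""
    events = []
    threads = [make_thread(t, events) for t in range(num_threads)]
    for t in threads:              # everyone reaches the barrier
        next(t)
    for t in threads:              # barrier satisfied; everyone finishes
        for _ in t:
            pass
    return events

print(run_to_completion_first(4))  # True: deadlocked
print(len(run_concurrently(4)))    # 8: four "pre" and four "post" events
```

Real hardware avoids the deadlock the same way the concurrent scheduler does: all of a block's warps are resident at once, so a warp blocked at __syncthreads() simply waits while the others run.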