Carnegie Mellon Parallel Computing Notes on Lecture 6

Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2014
Lecture 6: Programming for Performance, Part 1: Work Distribution
CMU 15-418, Spring 2014
Tunes: The Heavy, “Colleen” (Great Vengeance and Furious Fire)
“Colleen? Ha, that wasn’t about a girl. We wrote that one about the dangers of premature program optimization. It burns everyone, and it’s certainly burned me.” - Kelvin Swaby
Today
- Review mapping of CUDA programming to GPUs
- Solver example in the message passing model
- Begin discussing techniques for optimizing parallel programs
Finishing up a few CUDA concepts...
Review: executing warps on GTX 680
Functional units (legend from the slide diagram):
- SIMD functional unit, control shared across 32 units (1 MUL-ADD per clock)
- “Special” SIMD functional unit, control shared across 32 units (operations like sin/cos)
- SIMD load/store unit (handles warp loads/stores, gathers/scatters)
SMX core operation each clock:
- Select up to four runnable warps from up to 64 resident on the core (thread-level parallelism)
- Select up to two runnable instructions per warp (instruction-level parallelism)
- Execute instructions on available groups of SIMD ALUs, special-function ALUs, or LD/ST units
Review: scheduling threads in a thread block
Imagine a thread block with 256 CUDA threads, but only 4 warps’ worth of parallel execution in hardware. Why not just have a pool of four “worker” warps? (e.g., run 4x32=128 threads in the block to completion, then run the next 128 threads in the block)

#define THREADS_PER_BLK 256

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK+2];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }

    __syncthreads();

    float result = 0.0f;  // thread-local variable
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];

    output[index] = result;
}

CUDA kernels may create dependencies between threads in a block; the simplest example is __syncthreads(). When dependencies exist, threads in a block cannot be executed by the system in arbitrary order. CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable, it will eventually be run (no deadlock).
CUDA execution semantics
Thread blocks can be scheduled in any order by the system
- System assumes no dependencies between blocks
- A lot like ISPC tasks, right?
Threads in a block DO run concurrently
- When a block begins execution, all of its threads are running concurrently (these semantics impose a scheduling constraint on the system)
- A CUDA thread block is itself an SPMD program (like an ISPC gang of program instances)
- Threads in a thread block are concurrent, cooperating “workers”
CUDA implementation:
- A Kepler GPU warp has performance characteristics akin to an ISPC gang of instances (but unlike an ISPC gang, the warp concept is not a CUDA programming model concept)
- All warps in a thread block are scheduled onto the same core, allowing for high-bandwidth/low-latency communication through shared memory variables