Carnegie Mellon Parallel Computing (15-418): Notes on Lecture 6



From the lecture, a 1D convolution kernel that stages its inputs in per-block shared memory. The opening of the kernel is cut off in this fragment; the signature, the __shared__ qualifier, and THREADS_PER_BLK = 128 are restored here to match the lecture's running example:

    #define THREADS_PER_BLK 128

    __global__ void convolve(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + 2];
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Cooperatively load this block's slice of the input, plus a
        // two-element halo, into shared memory.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2) {
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
        }
        __syncthreads();  // wait until all loads complete

        float result = 0.0f;  // thread-local variable
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result;
    }

- Imagine a thread block with 256 CUDA threads, but only 4 warps' worth of parallel execution in the hardware.
- Why not just have a pool of four "worker" warps? (e.g., run 4 x 32 = 128 threads in the block to completion, then run the next 128 threads in the block)
- Because CUDA kernels may create dependencies between threads in a block; the simplest example is __syncthreads(). When dependencies exist, threads in a block cannot be executed by the system in just any order.
- CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable, it will eventually be run (no deadlock).
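The lecture fragment does not show the host side. As a minimal sketch, assuming the reconstructed signature above and an assumed problem size N = 1024, a launch might look like this (note the input buffer needs N + 2 elements so the halo loads in the last block stay in bounds):

    #include <cuda_runtime.h>

    #define THREADS_PER_BLK 128

    __global__ void convolve(int N, float* input, float* output);  // kernel above

    int main() {
        int N = 1024;  // assumed problem size, a multiple of THREADS_PER_BLK
        float *input, *output;
        cudaMalloc(&input,  (N + 2) * sizeof(float));  // +2 for the halo
        cudaMalloc(&output, N * sizeof(float));
        // ... initialize input on the device (omitted) ...
        convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input, output);
        cudaDeviceSynchronize();
        cudaFree(input);
        cudaFree(output);
        return 0;
    }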
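To see concretely why the "run 128 threads to completion, then the next 128" scheme breaks, here is a hypothetical kernel (names assumed, not from the lecture) with the same produce/barrier/consume shape. If the first 128 threads ran to completion before the second 128 ever started, the first group would wait at __syncthreads() forever, because the barrier requires all 256 threads in the block to arrive:

    #define BLOCK_SIZE 256

    __global__ void neighbor_read(float* out) {
        __shared__ float buf[BLOCK_SIZE];
        buf[threadIdx.x] = (float)threadIdx.x;         // phase 1: every thread produces
        __syncthreads();                               // all 256 threads must arrive here
        int neighbor = (threadIdx.x + 1) % BLOCK_SIZE; // wraps 255 -> 0
        out[threadIdx.x] = buf[neighbor];              // phase 2: consume another thread's value
    }

This is why the hardware interleaves all of a block's warps rather than retiring some threads early: the concurrency guarantee is part of the CUDA programming model, not an optimization.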

