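CUDA Execution Model

•  CUDA is a C extension
   –  Serial CPU code
   –  Parallel GPU code (kernels)
•  A GPU kernel is a C function
   –  Each thread executes the kernel code
   –  A group of threads forms a thread block (1D, 2D or 3D)
   –  Thread blocks are organised into a grid (1D or 2D; also 3D on devices of compute capability 2.0 and later)
   –  Threads within the same thread block can synchronise execution and share access to local scratchpad memory (CUDA shared memory)

Key idea: a hierarchy of parallelism, to handle thousands of threads

Example: Matrix addition

    // N is assumed to be a compile-time constant (the matrix dimension).
    __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        // Global 2D index of this thread within the grid
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        // Guard: the grid may overshoot when N is not a multiple of the block size
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // A, B and C are assumed to be N x N float arrays already in device memory.

        // Kernel setup: 16 x 16 = 256 threads per block;
        // round the block count up so that the grid covers all N x N elements.
        // (Host variables are named differently from the built-in blockDim/gridDim.)
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

        // Kernel invocation
        matAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }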
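The slide states that threads within one thread block can synchronise and share scratchpad memory, but the matrix addition example does not use either feature. The following is a minimal sketch of both, not taken from the original slides: a per-block sum in which each block reduces its slice of the input in `__shared__` memory, with `__syncthreads()` as the barrier between steps. The names `blockSum`, `sumOnGpu` and `BLOCK_SIZE` are hypothetical, the block size of 256 is an assumption, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Threads per block (assumed value; the reduction loop below needs a power of two).
constexpr int BLOCK_SIZE = 256;

// Each block adds up BLOCK_SIZE consecutive inputs and writes one partial sum.
__global__ void blockSum(const float *in, float *partial, int n)
{
    // Scratchpad visible to all threads of this block only.
    __shared__ float scratch[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread; pad with zero past the end of the input.
    scratch[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads a neighbour

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            scratch[tid] += scratch[tid + stride];
        __syncthreads();  // barrier reached by every thread in the block
    }

    // Thread 0 publishes the block's result.
    if (tid == 0)
        partial[blockIdx.x] = scratch[0];
}

// Hypothetical host helper: sums a non-empty vector using one kernel launch,
// then adds the per-block partial sums on the CPU.
float sumOnGpu(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // round up

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, numBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_partial, n);

    // This copy waits for the kernel to finish before reading the results.
    std::vector<float> h_partial(numBlocks);
    cudaMemcpy(h_partial.data(), d_partial, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : h_partial)
        total += p;
    return total;
}

int main()
{
    std::vector<float> data(1000, 1.0f);  // 1000 ones, so the total should be 1000
    printf("sum = %.1f\n", sumOnGpu(data));
    return 0;
}
```

Synchronisation here is only between threads of the same block; different blocks do not wait for each other inside the kernel, which is why the partial sums are combined on the host afterwards.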
