GPU Programming Lecture 2: GPU History & CUDA Programming Basics

Outline of CUDA Basics
- Basic Kernels and Execution on GPU
- Basic Memory Management
- Coordinating CPU and GPU Execution
- See the Programming Guide for the full API

BASIC KERNELS AND EXECUTION ON GPU

CUDA Programming Model
- Parallel code (a kernel) is launched on a device and executed by many threads
- Launches are hierarchical: threads are grouped into blocks, and blocks are grouped into grids
- Familiar serial code is written for a single thread; each thread is free to execute a unique code path
- Built-in thread and block ID variables tell each thread where it sits in the hierarchy (see the sketch below)
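
A minimal sketch of these ideas, with an illustrative kernel name and launch shape that are not from the slides: each thread combines the built-in blockIdx, blockDim, and threadIdx variables to compute a unique global index.

// Illustrative kernel: every thread runs the same serial-looking code,
// but the built-in ID variables give each thread a unique global index.
__global__ void whereAmI(int *out)
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalIdx] = globalIdx;
}

// Hierarchical launch: a grid of 4 blocks, each of 64 threads.
// whereAmI<<<4, 64>>>(d_out);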

High-Level View
[Figure: the GPU (SMs with shared memory, plus global memory) connected to the CPU and its chipset over PCIe]

Blocks of Threads Run on an SM
[Figure: a thread block mapped onto a streaming multiprocessor (SM) built from streaming processors; each thread has its own registers and per-thread memory, and the whole block shares per-block shared memory (SMEM)]

Whole Grid Runs on the GPU
[Figure: many blocks of threads spread across the GPU's SMs, all sharing access to global memory]

Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- Grid = all blocks for a given launch
- A thread block is a group of threads that can synchronize their execution and communicate via shared memory (see the sketch below)
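
A hedged sketch of those two block-level capabilities; the kernel below is illustrative and assumes blocks of exactly 256 threads. Threads stage data in per-block shared memory, then wait at __syncthreads() before reading values written by other threads in the same block.

// Illustrative kernel: reverse each block's slice of d_data in place.
__global__ void reverseInBlock(int *d_data)
{
    __shared__ int tile[256];        // per-block shared memory (assumes blockDim.x == 256)
    int base = blockIdx.x * blockDim.x;
    int t = threadIdx.x;

    tile[t] = d_data[base + t];      // each thread writes one slot
    __syncthreads();                 // synchronize: wait until the whole block has written

    // Communicate: read a value written by a different thread in this block
    d_data[base + t] = tile[blockDim.x - 1 - t];
}

// Launch with 256 threads per block to match the tile size:
// reverseInBlock<<<numBlocks, 256>>>(d_data);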

Memory Model
[Figure: sequential kernel launches (Kernel 0, Kernel 1, …) all reading and writing the same per-device global memory]
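
A brief sketch of what the figure implies; the kernel and buffer names here are hypothetical. Values one kernel leaves in global memory are visible to the next kernel launched on the same device, with no copy in between.

// Kernel 0 writes per-device global memory...
__global__ void produce(float *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = 1.0f;
}

// ...and Kernel 1, launched afterwards, sees those same values.
__global__ void consume(float *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] += 2.0f;                  // reads the 1.0f written by produce
}

// Sequential launches share d_buf through global memory:
// produce<<<N/256, 256>>>(d_buf);
// consume<<<N/256, 256>>>(d_buf);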

Memory Model
[Figure: host memory alongside separate per-device memories (Device 0 memory, Device 1 memory); data moves between host and device with cudaMemcpy()]

Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

// Device code:
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host code:
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Example: Host Code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;    // h_C = … (left empty to receive the result)

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, N * sizeof(float));
cudaMalloc((void**)&d_B, N * sizeof(float));
cudaMalloc((void**)&d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Example: Host Code for vecAdd (2)

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy result back to host memory
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

// do something with the result…

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
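
Stitched together, the kernel and the two host-code slides form one complete program. Below is a minimal runnable sketch; the value of N, the initialization of h_A and h_B, and the final printf are assumptions that fill in what the slides elide (N only needs to be a multiple of 256 for this launch shape).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same kernel as on the slides: one pair-wise addition per thread.
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 1024;              // assumption: any multiple of 256 works here
    const size_t bytes = N * sizeof(float);

    // allocate and initialize host (CPU) memory (values are illustrative)
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy host memory to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

    // copy result back to host memory
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // do something with the result: spot-check one element
    printf("h_C[10] = %f (expected %f)\n", h_C[10], 3.0f * 10);

    // free device (GPU) memory, then host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}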

Kernel Variations and Output

__global__ void kernel(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = 7;
}

Output: every element of a covered by the launch is set to 7.
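
The output image from this slide did not survive extraction. As a hedged reconstruction, varying only what the kernel writes makes the launch geometry visible; the <<<4, 4>>> launch shape and the kernel names below are assumptions for illustration.

// Same indexing, different payloads (hypothetical kernel names):
__global__ void writeBlockIdx(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;             // which block am I in?
}

__global__ void writeThreadIdx(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;            // which thread within my block?
}

// With a <<<4, 4>>> launch over a 16-element array:
//   a[idx] = 7            ->  7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
//   a[idx] = blockIdx.x   ->  0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
//   a[idx] = threadIdx.x  ->  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3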