06b CUDA Basics (NVIDIA)

CUDA Basics
© NVIDIA Corporation 2009

CUDA: A Parallel Computing Architecture for NVIDIA GPUs

    Supports standard languages and APIs:
        C
        OpenCL
        Fortran (PGI)
        DX Compute

    Supported on common operating systems:
        Windows
        Mac OS
        Linux
Arrays of Parallel Threads

    A CUDA kernel is executed by an array of threads
    All threads run the same code
    Each thread has an ID that it uses to compute memory addresses and make control decisions

    [Diagram: eight threads, threadID = 0 through 7, each running the snippet below]

        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
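The snippet above is schematic: func stands for any per-element operation. Wrapped in a complete kernel it might look like the following sketch (apply_func and the doubling inside func are illustrative names, not from the slides):

    // Hypothetical per-element operation standing in for func on the slide
    __device__ float func(float x)
    {
        return 2.0f * x;
    }

    __global__ void apply_func(const float *input, float *output)
    {
        // Each thread computes its global ID from its block and thread indices
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }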
Example: Increment Array Elements

CPU program:

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        .....
        increment_cpu(a, b, N);
    }

CUDA program:

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        .....
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
    }
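The host code above elides allocation and transfers. A complete, runnable version might look like the sketch below; the names h_a and d_a, the choices N = 1000, blocksize = 256, b = 1.0f, and the initialization are assumptions, not part of the slide:

    #include <stdio.h>
    #include <math.h>

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        int N = 1000;
        int blocksize = 256;
        size_t nbytes = N * sizeof(float);

        // Allocate and initialize host memory
        float *h_a = (float*)malloc(nbytes);
        for (int i = 0; i < N; i++)
            h_a[i] = (float)i;

        // Allocate device memory and copy the input across the PCIe bus
        float *d_a = 0;
        cudaMalloc((void**)&d_a, nbytes);
        cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);

        // Launch with enough blocks to cover all N elements
        dim3 dimBlock(blocksize);
        dim3 dimGrid((int)ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(d_a, 1.0f, N);

        // Copy the result back and check one element
        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
        printf("h_a[0] = %f\n", h_a[0]);

        cudaFree(d_a);
        free(h_a);
        return 0;
    }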
Outline of CUDA Basics

    Basic Memory Management
    Basic Kernels and Execution on GPU
    Coordinating CPU and GPU Execution
    Development Resources

See the Programming Guide for the full API.
Basic Memory Management
Memory Spaces

    CPU and GPU have separate memory spaces:
        Data is moved across the PCIe bus
        Use functions to allocate/set/copy memory on the GPU
        Very similar to the corresponding C functions
    Pointers are just addresses:
        Can't tell from the pointer value whether the address is on the CPU or the GPU
        Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa
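A two-line sketch of the hazard (the commented-out line shows what not to do):

    int *d_p = 0;
    cudaMalloc((void**)&d_p, 100 * sizeof(int));   // d_p now holds a device address
    // *d_p = 7;   // WRONG: dereferencing a device pointer on the CPU will likely crash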
GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory:

    cudaMalloc(void **pointer, size_t nbytes)
    cudaMemset(void *pointer, int value, size_t count)
    cudaFree(void *pointer)

    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);
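These calls return error codes that the slide ignores. A minimal sketch of checking them (cudaError_t, cudaSuccess, and cudaGetErrorString are part of the CUDA runtime API):

    #include <stdio.h>

    int main()
    {
        int n = 1024;
        int nbytes = n * sizeof(int);
        int *d_a = 0;

        cudaError_t err = cudaMalloc((void**)&d_a, nbytes);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaMemset(d_a, 0, nbytes);   // set every byte of the device buffer to 0
        cudaFree(d_a);
        return 0;
    }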
Data Copies

    cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
        Returns after the copy is complete
        Blocks the CPU thread until all bytes have been copied
        Doesn't start copying until previous CUDA calls complete

    enum cudaMemcpyKind:
        cudaMemcpyHostToDevice
        cudaMemcpyDeviceToHost
        cudaMemcpyDeviceToDevice

    Non-blocking memcopies are provided (see the sketch below)
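A sketch of the typical round trip, reusing the h_a/d_a/nbytes names from the surrounding slides; the cudaMemcpyAsync line shows the non-blocking variant, which takes an extra stream argument:

    // Host -> device before the kernel runs, device -> host after
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    // ... launch kernels that read and write d_a ...
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    // Non-blocking variant: returns to the CPU immediately; the host buffer
    // must be page-locked for the copy to actually overlap with GPU work
    cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, 0);   // 0 = default stream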
Code Walkthrough 1

    Allocate CPU memory for n integers
    Allocate GPU memory for n integers
    Initialize GPU memory to 0s
    Copy from GPU to CPU
    Print the values
Code Walkthrough 1

    #include <stdio.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx * sizeof(int);

        int *d_a = 0, *h_a = 0;   // device and host pointers
Code Walkthrough 1

    #include <stdio.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx * sizeof(int);

        int *d_a = 0, *h_a = 0;   // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc((void**)&d_a, num_bytes);

        if (0 == h_a || 0 == d_a) {
            printf("couldn't allocate memory\n");
            return 1;
        }
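The preview ends before the final build of this slide. Completing the program according to the five steps listed above (the cudaMemset, cudaMemcpy, print, and cleanup lines are a reconstruction, not the original slide) gives:

    #include <stdio.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx * sizeof(int);

        int *d_a = 0, *h_a = 0;   // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc((void**)&d_a, num_bytes);

        if (0 == h_a || 0 == d_a) {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset(d_a, 0, num_bytes);                             // initialize GPU memory to 0s
        cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);   // copy from GPU to CPU

        for (int i = 0; i < dimx; i++)
            printf("%d ", h_a[i]);                                 // print the values (all 0s)
        printf("\n");

        free(h_a);
        cudaFree(d_a);
        return 0;
    }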