cs8803SC_lecture6 - CS8803SC Software and Hardware...

CS8803SC Software and Hardware Cooperative Computing
Introduction to CUDA Programming
Prof. Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

Today’s Goal
• CUDA!
• Many slides are from UIUC Prof. Hwu & Dr. Kirk’s class slides

CUDA
• “Compute Unified Device Architecture”
• Available for GeForce 8 Series, Quadro FX5600/4600, and Tesla solutions
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
• CUDA provides general DRAM memory addressing (just like a CPU)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

An Example of Physical Reality Behind CUDA
Figure labels: CPU (host); GPU w/ local DRAM (device)
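To make the host/device split concrete, here is a minimal sketch of moving data into the GPU's own DRAM and back using the CUDA runtime API. The helper name `roundTrip` and the buffer size are illustrative assumptions, not something taken from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the slides): copies n floats from host
// memory into device DRAM and back again. Returns true only if every
// runtime call succeeds.
bool roundTrip(const float *src, float *dst, size_t n) {
    float *d_buf = nullptr;                 // pointer into device memory
    const size_t bytes = n * sizeof(float);

    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) return false;

    bool ok =
        cudaMemcpy(d_buf, src, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dst, d_buf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_buf);
    return ok;
}

int main() {
    const int n = 8;                        // illustrative size
    float in[n], out[n];
    for (int i = 0; i < n; ++i) { in[i] = 1.5f * i; out[i] = -1.0f; }

    if (!roundTrip(in, out, n)) {
        printf("CUDA runtime call failed (is a CUDA device present?)\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        if (out[i] != in[i]) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("round trip ok\n");
    return 0;
}
```

The device pointer `d_buf` is an ordinary pointer value on the host side, but it refers to device DRAM, so the host only hands it to runtime calls or kernels and never dereferences it directly.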
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
Figure labels: GeForce 8800, Tesla S870, Tesla D870

Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve (float *image) {
      __shared__ float region[M];
      ...
      region[threadIdx.x] = image[i];
      __syncthreads();
      ...
      image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void **)&myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>> (myimage);
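The `convolve` fragment above leaves its body elided. The following is one possible runnable sketch of the same shape, not the original authors' kernel: the 3-tap filter, the tile size `M = 10`, the clamping at tile edges, and the host helper `runConvolve` are all assumptions made for illustration. It uses `threadIdx.x` and the two-argument `cudaMalloc(void **, size_t)` form of the runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 3    // filter taps (assumed value)
#define M 10   // threads per block and tile size (assumed value)

__device__ float filter[N];

// Each block stages its M-element tile in shared memory, waits for all
// loads to finish, then filters within the tile (edges are clamped).
// Assumes the kernel is launched with exactly M threads per block.
__global__ void convolve(float *image) {
    __shared__ float region[M];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    region[threadIdx.x] = image[i];
    __syncthreads();               // all loads done before any thread reads

    float result = 0.0f;
    for (int k = 0; k < N; ++k) {
        int t = (int)threadIdx.x + k - N / 2;
        t = min(max(t, 0), M - 1); // clamp to the tile (illustrative choice)
        result += filter[k] * region[t];
    }
    image[i] = result;
}

// Hypothetical host helper: h_image holds blocks * M floats, filtered in place.
bool runConvolve(float *h_image, int blocks, const float *h_filter) {
    const size_t bytes = (size_t)blocks * M * sizeof(float);
    float *d_image = nullptr;
    bool ok =
        cudaMalloc((void **)&d_image, bytes) == cudaSuccess &&
        cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpyToSymbol(filter, h_filter, N * sizeof(float)) == cudaSuccess;
    if (ok) {
        convolve<<<blocks, M>>>(d_image);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_image, d_image, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_image);
    return ok;
}

int main() {
    const int blocks = 100;        // 100 blocks, 10 threads per block
    static float image[blocks * M];
    for (int i = 0; i < blocks * M; ++i) image[i] = 1.0f;
    const float taps[N] = {0.25f, 0.5f, 0.25f};

    if (!runConvolve(image, blocks, taps)) {
        printf("CUDA runtime call failed\n");
        return 1;
    }
    printf("image[0] = %f\n", image[0]);
    return 0;
}
```

Writing back into `image` in place is safe in this sketch only because each block reads its own tile into shared memory before the barrier and never reads another block's elements. A filter that reaches across tile boundaries would need a separate output buffer.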

CUDA Programming Model: A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    – Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    – Multi-core CPU needs only a few

G80 Characteristics
• 367 GFLOPS peak performance (25-50 times of current high-end microprocessors)
• Massively parallel, 128 cores, 90W
• Massively threaded, sustains 1000s of threads per app
• 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
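As an illustration of a data-parallel portion running as a kernel across many lightweight threads, the sketch below launches one thread per array element. The kernel `vecAdd`, the helper `addOnDevice`, the block size of 256, and the problem size are assumptions chosen for the example, not values from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per element: thread i computes c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partial
}

// Hypothetical host helper: adds two host arrays on the device (n > 0).
bool addOnDevice(const float *a, const float *b, float *c, int n) {
    const size_t bytes = (size_t)n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok =
        cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_c, bytes) == cudaSuccess &&
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;                         // assumed block size
        const int blocks = (n + threads - 1) / threads;  // round up
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}

int main() {
    const int n = 1 << 20;           // about a million threads in one launch
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    if (!addOnDevice(a.data(), b.data(), c.data(), n)) {
        printf("CUDA runtime call failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3.0f) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("all %d sums correct\n", n);
    return 0;
}
```

The host code never creates or joins threads one by one. The launch configuration alone describes the whole grid, which is what makes it practical to request far more threads than a CPU program would use.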

This note was uploaded on 10/06/2010 for the course CS 8803 taught by Professor Staff during the Spring '08 term at Georgia Institute of Technology.
