lec06-cuda-memory-part1

GPU Programming: Programming Massively Parallel Processors
Lecture 6: CUDA Memories
© nVidia 2009

G80 Implementation of CUDA Memories

• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory

[Figure: CUDA memory hierarchy. The host reads/writes global and constant memory; within the grid, each block has its own shared memory, and each thread has its own registers.]
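
The following minimal kernel sketch touches each of these memory spaces; the names (demoKernel, blockBuf, bias) are illustrative rather than from the slides, and a launch with 128 threads per block is assumed.

    #include <cuda_runtime.h>

    __constant__ float bias;                      // per-grid constant memory (read-only in kernels)

    __global__ void demoKernel(float *globalBuf)  // globalBuf points into per-grid global memory
    {
        __shared__ float blockBuf[128];           // per-block shared memory
        int tid = threadIdx.x;                    // automatic variable: a per-thread register
        int i = blockIdx.x * blockDim.x + tid;

        blockBuf[tid] = globalBuf[i];             // global -> shared
        __syncthreads();                          // make the block's writes visible to all its threads
        globalBuf[i] = blockBuf[tid] + bias;      // shared (+ constant) -> global
    }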

CUDA Variable Type Qualifiers

Variable declaration                        Memory    Scope   Lifetime
__device__ __local__    int LocalVar;       local     thread  thread
__device__ __shared__   int SharedVar;      shared    block   block
__device__              int GlobalVar;      global    grid    application
__device__ __constant__ int ConstantVar;    constant  grid    application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory
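
For concreteness, here is a sketch of where each declaration from the table goes in a .cu file. Note that the table reflects 2009-era CUDA: in current toolkits __local__ is gone (automatic arrays simply land in local memory), and __shared__ variables are declared inside a kernel, as shown here. The kernel name QualifierDemo is illustrative.

    __device__ int GlobalVar;                     // global memory: grid scope, application lifetime
    __device__ __constant__ int ConstantVar;      // constant memory: grid scope, application lifetime

    __global__ void QualifierDemo(int *out)
    {
        int RegVar = threadIdx.x;                 // unqualified automatic variable: a register
        int LocalArr[8];                          // automatic array: per-thread local memory
        __shared__ int SharedVar;                 // shared memory: block scope and lifetime

        LocalArr[0] = RegVar;
        if (threadIdx.x == 0)
            SharedVar = GlobalVar + ConstantVar;  // one thread initializes the shared variable
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalArr[0];
    }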

Where to Declare Variables?

• Can the host access it?
  – Yes: global or constant memory; declare outside of any function
  – No: register (automatic), shared, or local memory; declare in the kernel

Variable Type Restrictions

• Pointers can only point to memory allocated or declared in global memory:
  – Allocated in the host and passed to the kernel:
    __global__ void KernelFunc(float* ptr)
  – Obtained as the address of a global variable:
    float* ptr = &GlobalVar;
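
Both legal patterns fit in a short sketch; KernelFunc and GlobalVar follow the slide, while AddressOfGlobal, d_ptr, and the host-side setup are assumed for illustration.

    #include <cuda_runtime.h>

    __device__ float GlobalVar;                   // a variable in global memory

    __global__ void KernelFunc(float *ptr)        // pattern 1: pointer allocated on the host
    {                                             //            and passed to the kernel
        *ptr = 1.0f;
    }

    __global__ void AddressOfGlobal(void)
    {
        float *ptr = &GlobalVar;                  // pattern 2: address of a global variable
        *ptr = 2.0f;
    }

    int main(void)
    {
        float *d_ptr;
        cudaMalloc((void **)&d_ptr, sizeof(float));  // host-side allocation in global memory
        KernelFunc<<<1, 1>>>(d_ptr);
        AddressOfGlobal<<<1, 1>>>();
        cudaDeviceSynchronize();
        cudaFree(d_ptr);
        return 0;
    }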

A Common Programming Strategy

• Global memory resides in device memory (DRAM): much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  – Partition data into subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory back to global memory
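
As a concrete sketch of this load/compute/store tiling pattern, consider a 1D adjacent-difference kernel; the names (adjDiff, TILE_SIZE) are illustrative and not from the slides.

    #include <cuda_runtime.h>

    #define TILE_SIZE 256                         // one block handles one tile

    __global__ void adjDiff(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];            // 1) load the subset into shared memory
        __syncthreads();                          //    wait until the whole tile is loaded

        if (i > 0 && i < n) {
            float left = (threadIdx.x > 0)
                       ? tile[threadIdx.x - 1]    // 2) compute from fast shared memory
                       : in[i - 1];               //    (block boundary falls back to global)
            out[i] = tile[threadIdx.x] - left;    // 3) copy the result back to global memory
        }
    }

Launched with TILE_SIZE threads per block, every in-tile read after the initial load hits shared memory instead of DRAM.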

A Common Programming Strategy (Cont.)

• Constant memory also resides in device memory (DRAM): much slower access than shared memory
• But… cached!
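
A minimal sketch of the constant-memory pattern, assuming a small read-only coefficient table; the names (coeffs, setCoeffs, scale) are illustrative.

    #include <cuda_runtime.h>

    __constant__ float coeffs[16];                // per-grid, read-only in kernels, cached on chip

    void setCoeffs(const float *hostCoeffs)
    {
        // Constant memory is written from the host before the kernel launches.
        cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
    }

    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 16];            // repeated reads hit the constant cache
    }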
