Lecture 11: Parallel Patterns I

Getting out of the trenches

So far, we've concerned ourselves with low-level details of kernel programming:
- Mapping of threads to work
- Launch grid configuration
- __shared__ memory management
- Resource allocation

Lots of moving parts; it's hard to see the forest for the trees.
CUDA Madlibs

__global__ void foo(...)
{
    extern __shared__ int smem[];
    int i = ???  // now what???
}

...
int B = ???  // how many blocks?
int N = ???  // how many threads per block?
int S = ???  // how much shared memory?
foo<<<B, N, S>>>();
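One hypothetical way the blanks might be filled in, for a trivial elementwise kernel. Every name and size here (the doubling computation, 256 threads per block, the input size) is an illustrative assumption, not from the slides:

```cuda
#include <cuda_runtime.h>

// Hypothetical fill-in of the madlib above for an elementwise kernel.
__global__ void foo(float *data, int n)
{
    extern __shared__ float smem[];                 // dynamically sized shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        smem[threadIdx.x] = data[i];                // stage one element per thread
        data[i] = 2.0f * smem[threadIdx.x];         // placeholder computation
    }
}

int main()
{
    const int n = 1 << 20;            // assumed input size
    int N = 256;                      // threads per block
    int B = (n + N - 1) / N;          // enough blocks to cover all elements
    int S = N * sizeof(float);        // dynamic shared memory bytes per block

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    foo<<<B, N, S>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The `(n + N - 1) / N` idiom rounds up so a partial final block still covers the tail of the input; the in-kernel bounds check keeps those extra threads from reading out of range.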
Parallel Patterns

Think at a higher level than individual CUDA kernels:
- Specify what to compute, not how to compute it
- Let the programmer worry about the algorithm
- Defer pattern implementation to someone else
Common Parallel Computing Scenarios

- Many parallel threads need to generate a single result → Reduce
- Many parallel threads need to partition data → Split
- Many parallel threads produce variable output per thread → Compact / Expand
Primordial CUDA Pattern: Blocking

Partition data to operate in well-sized blocks:
- Small enough to be staged in shared memory
- Assign each data partition to a thread block
- No different from cache blocking!

Provides several performance benefits:
- Have enough blocks to keep processors busy
- Working in shared memory cuts memory latency dramatically
- Likely to have coherent access patterns on load/store to shared memory
Primordial CUDA Pattern: Blocking

Partition data into subsets that fit into shared memory.
Primordial CUDA Pattern: Blocking

Handle each data subset with one thread block.
Primordial CUDA Pattern: Blocking

Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.
Primordial CUDA Pattern: Blocking

Perform the computation on the subset from shared memory.
Primordial CUDA Pattern: Blocking

Copy the result from shared memory back to global memory.
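Taken together, these steps (partition into tiles, load the tile cooperatively, compute out of shared memory, write results back) map onto a kernel skeleton like the following. This is a sketch, not code from the slides: the element type, tile size, and the squaring computation are all assumptions standing in for a real workload.

```cuda
#define TILE 256  // assumed: threads per block == elements per tile

__global__ void blockedKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    // Step 1: load this block's subset from global to shared memory.
    // One element per thread; adjacent threads touch adjacent addresses,
    // giving coalesced global loads (memory-level parallelism).
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();  // tile must be fully loaded before anyone computes

    // Step 2: compute on the subset out of fast shared memory.
    // (Placeholder computation: square each element.)
    float result = 0.0f;
    if (i < n)
        result = tile[threadIdx.x] * tile[threadIdx.x];

    // Step 3: copy the result back to global memory, again coalesced.
    if (i < n)
        out[i] = result;
}
```

The `__syncthreads()` barrier between load and compute is the essential part of the pattern: it guarantees every thread in the block sees a fully populated tile before any thread reads a neighbor's element.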
Primordial CUDA Pattern: Blocking

All CUDA kernels are built this way:
- Blocking may not matter for a particular problem, but you're still forced to think about it
- Not all kernels require __shared__ memory
- All kernels do require registers

All of the parallel patterns we'll discuss have CUDA implementations that exploit blocking in some fashion.
Reduction

Reduce a vector to a single value via an associative operator (+, *, min/max, AND/OR, …).

CPU: sequential implementation.