lec07-cuda-memory-part2

GPU Programming: Programming Massively Parallel Processors
Lecture 7: CUDA Memories, Part 2
© nVidia 2009
How about performance on G80?

[Figure: CUDA memory model — host and global/constant memory feeding a grid of blocks, each block with its own shared memory and per-thread registers.]

• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 bytes of memory bandwidth per FLOP
  – 4 × 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  – 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

© nVidia 2009
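For reference, a minimal sketch of the kind of straightforward kernel this analysis assumes: every operand of the inner loop is read from global memory, so each multiply-add performs two 4-byte loads. The kernel body is a reconstruction, not taken from the previewed pages; the names Md, Nd, Pd, and Width follow the slides, and square row-major matrices are assumed.

```cuda
// Sketch of the untiled kernel assumed by the bandwidth analysis above.
// Each thread computes one element of Pd, reading Md and Nd directly
// from global memory: two 4-byte loads per multiply-add (4 B per FLOP).
__global__ void MatrixMulSimple(const float* Md, const float* Nd,
                                float* Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < Width && Col < Width) {
        float Pvalue = 0.0f;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];  // 2 global loads, 2 FLOPs
        Pd[Row * Width + Col] = Pvalue;
    }
}
```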
Tiled Multiply

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) select the Pd_sub tile that one thread block computes.]

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

© nVidia 2009
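A hypothetical host-side launch configuration matching the diagram: one TILE_WIDTH × TILE_WIDTH thread block per tile of Pd. The names launchTiledMultiply and MatrixMulTiled are placeholders (the kernel itself is sketched after the "Breaking Md and Nd into Tiles" slide), TILE_WIDTH = 16 is an assumed value, and Width is assumed to be a multiple of TILE_WIDTH.

```cuda
#define TILE_WIDTH 16   // assumed tile size; the slides leave it symbolic

__global__ void MatrixMulTiled(const float* Md, const float* Nd,
                               float* Pd, int Width);   // sketched later

// One thread block per TILE_WIDTH x TILE_WIDTH tile of Pd.
void launchTiledMultiply(const float* Md, const float* Nd,
                         float* Pd, int Width)
{
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}
```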
A Small Example

[Figure: a 4 × 4 Pd with the 2 × 2 tile Pd0,0, Pd1,0, Pd0,1, Pd1,1 highlighted, together with the two rows of Md (Md0,0 … Md3,1) and the two columns of Nd (Nd0,0 … Nd1,3) used to compute it.]

© nVidia 2009
Every Md and Nd element is used exactly twice in generating a 2×2 tile of P

Access order (top to bottom), one column per thread:

           P0,0 (thread 0,0)   P1,0 (thread 1,0)   P0,1 (thread 0,1)   P1,1 (thread 1,1)
   1       M0,0 * N0,0         M0,0 * N1,0         M0,1 * N0,0         M0,1 * N1,0
   2       M1,0 * N0,1         M1,0 * N1,1         M1,1 * N0,1         M1,1 * N1,1
   3       M2,0 * N0,2         M2,0 * N1,2         M2,1 * N0,2         M2,1 * N1,2
   4       M3,0 * N0,3         M3,0 * N1,3         M3,1 * N0,3         M3,1 * N1,3

© nVidia 2009
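A hedged observation that follows from this table but is not stated on the previewed pages: if the loaded elements are kept in shared memory, a TILE_WIDTH × TILE_WIDTH tile lets every Md and Nd element be reused TILE_WIDTH times, so the required global-memory bandwidth drops by that factor:

$$
\frac{1386~\text{GB/s}}{\text{TILE\_WIDTH}} \;=\; \frac{1386}{16} \;\approx\; 86.6~\text{GB/s} \qquad (\text{TILE\_WIDTH} = 16),
$$

which is roughly the G80's available 86.4 GB/s, i.e. a 16 × 16 tile is about where the kernel stops being limited by global-memory bandwidth.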
Breaking Md and Nd into Tiles

[Figure: the small example's Md, Nd, and Pd with their elements grouped into 2 × 2 tiles.]

• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase
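Below is a sketch of the phased (tiled) kernel this slide describes, reconstructed along the lines of the standard Kirk/Hwu formulation rather than copied from the lecture: in each phase the block cooperatively loads one tile of Md and one tile of Nd into shared memory, synchronizes, and then every thread consumes the tile from shared memory before moving to the next phase. Width is assumed to be a multiple of TILE_WIDTH and the matrices are assumed row-major.

```cuda
#define TILE_WIDTH 16   // assumed tile size

__global__ void MatrixMulTiled(const float* Md, const float* Nd,
                               float* Pd, int Width)
{
    // Per-block tiles of Md and Nd, shared by all threads in the block.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    // The Pd element this thread is responsible for.
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0.0f;

    // One iteration per phase: walk across a row of Md tiles and down a
    // column of Nd tiles.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Each thread loads exactly one Md element and one Nd element.
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                       // tile fully loaded

        // Partial inner product using only shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                       // done before tile is overwritten
    }

    Pd[Row * Width + Col] = Pvalue;
}
```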