14.1-gpus-4 - Announcements GPU systems Please send an...

Announcements
  GPU systems
    Please send an email to me by Wed w/ subject ‘need gpu for CS6230 project’
  Working our way through the proposals
  Schedule
12 Oct 10 [email protected]

CS6230–HPC Tools and Applications
Heterogeneous Computing with GPUs
Jeffrey S. Vetter
Computational Science and Engineering, College of Computing, Georgia Institute of Technology
http://ft.ornl.gov/~vetter
[email protected]

OPTIMIZING MATRIX MULTIPLY

Tiled Multiply
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd.
[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, partitioned into TILE_WIDTH x TILE_WIDTH tiles. Labels: block indices bx, by; thread indices tx, ty running over 0, 1, 2, ..., TILE_WIDTH-1; output tile Pdsub.]

A Small Example
[Figure: a 4 x 4 result matrix with elements Pd 0,0 through Pd 3,3, shown with the Md elements Md 0,0 through Md 3,1 and the Nd elements Nd 0,0 through Nd 1,3.]

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order (top to bottom):

  P 0,0            P 1,0            P 0,1            P 1,1
  thread 0,0       thread 1,0       thread 0,1       thread 1,1
  M 0,0 * N 0,0    M 0,0 * N 1,0    M 0,1 * N 0,0    M 0,1 * N 1,0
  M 1,0 * N 0,1    M 1,0 * N 1,1    M 1,1 * N 0,1    M 1,1 * N 1,1
  M 2,0 * N 0,2    M 2,0 * N 1,2    M 2,1 * N 0,2    M 2,1 * N 1,2
  M 3,0 * N 0,3    M 3,0 * N 1,3    M 3,1 * N 0,3    M 3,1 * N 1,3

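The reuse count in the 2x2 case generalizes. As a rough sketch, assuming square WIDTH x WIDTH matrices with WIDTH divisible by TILE_WIDTH (the symbols G_naive and G_tiled are introduced here for illustration and do not appear in the slides), the number of global memory loads with and without tiling can be estimated as:

```latex
\begin{align*}
G_{\text{naive}} &= \underbrace{\text{WIDTH}^2}_{\text{elements of Pd}} \cdot \underbrace{2\,\text{WIDTH}}_{\text{loads per element}} = 2\,\text{WIDTH}^3 \\
G_{\text{tiled}} &= \underbrace{\left(\frac{\text{WIDTH}}{\text{TILE\_WIDTH}}\right)^{2}}_{\text{thread blocks}} \cdot \underbrace{\frac{\text{WIDTH}}{\text{TILE\_WIDTH}}}_{\text{phases}} \cdot \underbrace{2\,\text{TILE\_WIDTH}^2}_{\text{loads per phase}} = \frac{2\,\text{WIDTH}^3}{\text{TILE\_WIDTH}} \\
\frac{G_{\text{naive}}}{G_{\text{tiled}}} &= \text{TILE\_WIDTH}
\end{align*}
```

With TILE_WIDTH = 2 the ratio is 2, which matches the table: each loaded element serves two of the four threads in the tile.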
Breaking Md and Nd into Tiles
  Break up the inner product loop of each thread into phases.
  At the beginning of each phase, load the Md and Nd elements that every thread in the block needs during the phase into shared memory.
  All threads in the block then access the Md and Nd elements from shared memory during the phase.
[Figure: the Md, Nd, and Pd elements of the small example, with Md and Nd divided into tiles.]

Each phase of a thread block uses one tile from Md and one from Nd

Phase 1
         Step 4               Step 5               Step 6
  T 0,0  Md 0,0 -> Mds 0,0    Nd 0,0 -> Nds 0,0    PValue 0,0 += Mds 0,0 * Nds 0,0 + Mds 1,0 * Nds 0,1
  T 1,0  Md 1,0 -> Mds 1,0    Nd 1,0 -> Nds 1,0    PValue 1,0 += Mds 0,0 * Nds 1,0 + Mds 1,0 * Nds 1,1
  T 0,1  Md 0,1 -> Mds 0,1    Nd 0,1 -> Nds 0,1    PValue 0,1 += Mds 0,1 * Nds 0,0 + Mds 1,1 * Nds 0,1
  T 1,1  Md 1,1 -> Mds 1,1    Nd 1,1 -> Nds 1,1    PValue 1,1 += Mds 0,1 * Nds 1,0 + Mds 1,1 * Nds 1,1

Phase 2
         Step 4               Step 5               Step 6
  T 0,0  Md 2,0 -> Mds 0,0    Nd 0,2 -> Nds 0,0    PValue 0,0 += Mds 0,0 * Nds 0,0 + Mds 1,0 * Nds 0,1
  T 1,0  Md 3,0 -> Mds 1,0    Nd 1,2 -> Nds 1,0    PValue 1,0 += Mds 0,0 * Nds 1,0 + Mds 1,0 * Nds 1,1
  T 0,1  Md 2,1 -> Mds 0,1    Nd 0,3 -> Nds 0,1    PValue 0,1 += Mds 0,1 * Nds 0,0 + Mds 1,1 * Nds 0,1
  T 1,1  Md 3,1 -> Mds 1,1    Nd 1,3 -> Nds 1,1    PValue 1,1 += Mds 0,1 * Nds 1,0 + Mds 1,1 * Nds 1,1

Time runs from Phase 1 to Phase 2.

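To see that the two phases together produce the full inner product, it may help to substitute the shared-memory contents back into the accumulation for thread T 0,0. The grouping below is a worked illustration using the slide's own index convention, not a formula taken from the slides:

```latex
\begin{align*}
\mathrm{Pd}_{0,0}
 &= \underbrace{\mathrm{Md}_{0,0}\,\mathrm{Nd}_{0,0} + \mathrm{Md}_{1,0}\,\mathrm{Nd}_{0,1}}_{\text{Phase 1: } \mathrm{Mds}_{0,0}\mathrm{Nds}_{0,0} \,+\, \mathrm{Mds}_{1,0}\mathrm{Nds}_{0,1}}
 \;+\; \underbrace{\mathrm{Md}_{2,0}\,\mathrm{Nd}_{0,2} + \mathrm{Md}_{3,0}\,\mathrm{Nd}_{0,3}}_{\text{Phase 2: } \mathrm{Mds}_{0,0}\mathrm{Nds}_{0,0} \,+\, \mathrm{Mds}_{1,0}\mathrm{Nds}_{0,1}}
\end{align*}
```

The shared-memory expression is identical in both phases; only the tile contents change, because Mds and Nds are reloaded from the next tile of Md and Nd at the start of Phase 2.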
Tiled Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x;
    [listing truncated in the source]

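The kernel listing is cut off after its first few lines, so the following is only a sketch of how a tiled kernel of this shape is commonly completed, not the slide's own code. It assumes row-major storage, square matrices, Width divisible by TILE_WIDTH, and a tile size of 16; the names Row, Col, Pvalue, m, and k, and the whole host-side main, are illustrative choices. Error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed value; the slides do not fix a tile size

// Sketch of a tiled matrix multiply: Pd = Md * Nd, all Width x Width, row-major.
// Assumes Width is a multiple of TILE_WIDTH, so no boundary checks are needed.
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x,  by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    // Row and column of the Pd element this thread computes.
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    // One loop iteration per phase: load one tile of Md and one of Nd.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // wait before the tiles are overwritten in the next phase
    }
    Pd[Row * Width + Col] = Pvalue;
}

int main()
{
    const int Width = 64;  // assumed size, a multiple of TILE_WIDTH
    const size_t bytes = (size_t)Width * Width * sizeof(float);

    float* M = (float*)malloc(bytes);
    float* N = (float*)malloc(bytes);
    float* P = (float*)malloc(bytes);
    for (int i = 0; i < Width * Width; ++i) {
        M[i] = (float)(i % 7) * 0.5f;   // arbitrary test data
        N[i] = (float)(i % 5) * 0.25f;
    }

    float *Md, *Nd, *Pd;
    cudaMalloc((void**)&Md, bytes);
    cudaMalloc((void**)&Nd, bytes);
    cudaMalloc((void**)&Pd, bytes);
    cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel is done

    // Compare against a straightforward CPU triple loop.
    float maxErr = 0.0f;
    for (int r = 0; r < Width; ++r)
        for (int c = 0; c < Width; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < Width; ++k)
                ref += M[r * Width + k] * N[k * Width + c];
            maxErr = fmaxf(maxErr, fabsf(ref - P[r * Width + c]));
        }
    printf("max abs error vs CPU reference: %g\n", maxErr);

    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    free(M); free(N); free(P);
    return 0;
}
```

The two __syncthreads() calls bracket each phase: the first keeps any thread from reading a tile before it is fully loaded, and the second keeps any thread from overwriting a tile that others are still reading.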

This note was uploaded on 11/04/2010 for the course CSE 6530 taught by Professor Jeffrey Vetter during the Fall '10 term at Georgia Tech.
