lec04-cuda-threads-part1

GPU Programming Lecture 4: A Simple Example, Tools, and CUDA Threads (©nVidia)

Step 1: Matrix Multiplication – A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

(Diagram: row i of M combined with column j of N produces element P[i][j]; all matrices are Width x Width.)

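A minimal harness for exercising the host version (my own sketch, not part of the lecture; the fill values are arbitrary, chosen so the expected result is easy to check):

#include <stdio.h>
#include <stdlib.h>

/* MatrixMulOnHost as defined in Step 1 */

int main(void)
{
    int Width = 4;
    size_t bytes = Width * Width * sizeof(float);
    float* M = (float*)malloc(bytes);
    float* N = (float*)malloc(bytes);
    float* P = (float*)malloc(bytes);

    // Fill M with ones and N with the identity, so P should equal M.
    for (int i = 0; i < Width * Width; ++i) {
        M[i] = 1.0f;
        N[i] = (i / Width == i % Width) ? 1.0f : 0.0f;
    }

    MatrixMulOnHost(M, N, P, Width);
    printf("P[0] = %f (expected 1.0)\n", P[0]);

    free(M); free(N); free(P);
    return 0;
}
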
Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

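The slide omits error handling for brevity. cudaMalloc and cudaMemcpy both return a cudaError_t, so a checked variant could wrap each call (a sketch; the CHECK macro name is my own invention):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage, mirroring the slide's allocations:
//   CHECK(cudaMalloc((void**)&Md, size));
//   CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));
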
Step 3: Output Matrix Data Transfer (Host-side Code)

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

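Since cudaMemcpy on the default stream waits for the preceding kernel launch to finish, reading Pd back here is safe without an explicit synchronization call. One way to sanity-check the device path against the Step 1 reference (my sketch; the tolerance is an arbitrary choice):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* MatrixMulOnHost as defined in Step 1 */

// Compare the device result P against a host-computed reference.
void CheckResult(float* M, float* N, float* P, int Width)
{
    float* Pref = (float*)malloc(Width * Width * sizeof(float));
    MatrixMulOnHost(M, N, Pref, Width);
    for (int i = 0; i < Width * Width; ++i) {
        if (fabsf(P[i] - Pref[i]) > 1e-3f) {
            printf("Mismatch at element %d: %f vs %f\n", i, P[i], Pref[i]);
            break;
        }
    }
    free(Pref);
}
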
Step 4: Kernel Function

// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

(Diagram: thread (tx, ty) walks row ty of Md and column tx of Nd to produce one element of Pd; all matrices are WIDTH x WIDTH.)

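For reference, the kernel from the two Step 4 slides assembled into one listing, with the index roles noted in comments (same code, merely consolidated):

// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    float Pvalue = 0;                                  // this thread's Pd element
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];  // walk row threadIdx.y of Md
        float Nelement = Nd[k * Width + threadIdx.x];  // walk column threadIdx.x of Nd
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;    // one write per thread
}
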
Step 5: Kernel Invocation (Host-side Code)

    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

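Putting Steps 2, 3, and 5 together, the complete host function reads as follows (assembled from the slides above; error checking omitted, as in the lecture):

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy M, N to the device
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // 2. Launch one block of Width x Width threads
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
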
Only One Thread Block Used

• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread:
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements
  – Compute-to-off-chip-memory-access ratio is close to 1:1 (not very high)
• Size of matrix is limited by the number of threads allowed in a thread block

(Diagram: Grid 1 contains a single Block 1; Thread (2, 2) combines a row of Md with a column of Nd to produce one element of Pd.)

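To make the size limit concrete (hedged: the 1024-threads-per-block cap below applies to compute capability 2.0 and later; the G80-era hardware this lecture targets allowed 512):

// With one block of Width x Width threads, Width is capped by the
// per-block thread limit: 32 * 32 = 1024 is the largest square block
// on current GPUs, so Width <= 32 in this single-block version.
dim3 dimBlock(32, 32);      // OK: exactly 1024 threads
// dim3 dimBlock(33, 33);   // 1089 threads – the launch would fail
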
Step 7: Handling Arbitrary-Sized Square Matrices

• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
• You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH exceeds the maximum grid size

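The tiled kernel itself appears later in the lecture series; the following is only my sketch of the index arithmetic the slide describes, assuming Width is a multiple of TILE_WIDTH (the kernel name and the TILE_WIDTH value are my own choices):

#define TILE_WIDTH 16

// Each thread computes one Pd element, now offset by its block's tile.
__global__ void MatrixMulKernelTiled(float* Md, float* Nd, float* Pd, int Width)
{
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Host-side launch: a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks.
//   dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
//   dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
//   MatrixMulKernelTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);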