g80-application-performance

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign

ECE 498AL Lecture 12: Application Lessons (When the tires hit the road)

Objective

Putting CUDA performance knowledge to work:

- Plausible strategies may or may not lead to performance enhancement, and different constraints dominate in different application situations.
- Case studies help to establish intuition, idioms, and ideas: algorithm patterns that can result in both better efficiency and better hardware utilization.

This lecture covers simple case studies of useful strategies for tuning CUDA application performance on G80.

Some Performance Lessons from Matrix Multiplication

Multiply Using Several Blocks

- One block computes one square sub-matrix Psub of P, of size BLOCK_WIDTH x BLOCK_WIDTH.
- One thread computes one element of Psub.
- Assume that M and N are square and that their dimensions are multiples of BLOCK_WIDTH.

[Figure: M, N, and P tiled into BLOCK_WIDTH x BLOCK_WIDTH sub-matrices. Block indices (bx, by) select the sub-matrix Psub of P; thread indices (tx, ty), each in 0..bsize-1, select the element within it.]

A sketch of this kernel appears at the end of this section. (Note that the slides use BLOCK_WIDTH, TILE_WIDTH, and BLOCK_SIZE interchangeably for the tile dimension.)

First-order Size Considerations

- Each thread block should have a minimum of 96 (768/8) threads, since an SM runs at most 768 threads spread over at most 8 blocks.
- A TILE_WIDTH of 16 gives 16*16 = 256 threads per block.
- There should be a minimum of 64 thread blocks; a 1024x1024 P matrix at TILE_WIDTH 16 gives 64*64 = 4096 thread blocks.
- Each thread block performs 2*256 = 512 float loads from device memory for 256*(2*16) = 8,192 mul/add operations, i.e. 16 operations per load.

Shared Memory Usage

- Each SM has 16KB of shared memory.
- Each thread block uses 2*256*4B = 2KB of shared memory, so up to 8 thread blocks can potentially be executing at once.
- For BLOCK_WIDTH = 16, this allows up to 8*512 = 4,096 pending loads. In practice, only about half of this will probably be achieved, because of scheduling constraints in keeping the SPs busy.
- The next BLOCK_WIDTH, 32, would use 2*32*32*4B = 8KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time. (This arithmetic is reproduced in the short program at the end of this section.)

Instruction Mix Considerations

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

There are very few mul/add operations in this loop relative to the branch and address-calculation instructions executed in each iteration. (A sketch of one standard remedy, loop unrolling, closes this section.)
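To make the tiling concrete, here is a minimal sketch of the kernel described above, assuming row-major storage and a square Width x Width problem whose dimension is a multiple of TILE_WIDTH; the kernel and parameter names are illustrative, not taken from the slides.

    #define TILE_WIDTH 16   // the 16x16 = 256-thread configuration above

    __global__ void MatrixMulKernel(const float* M, const float* N,
                                    float* P, int Width)
    {
        // The two shared-memory tiles: 2 * 256 * 4B = 2KB per block.
        __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x,  by = blockIdx.y;
        int tx = threadIdx.x, ty = threadIdx.y;

        // Row and column of the P element this thread owns.
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0.0f;

        // Walk matching tiles of M and N across the shared dimension.
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Each thread loads one element of each tile: the two
            // device-memory loads per thread counted in the slides.
            Ms[ty][tx] = M[Row * Width + m * TILE_WIDTH + tx];
            Ns[ty][tx] = N[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();

            // The inner loop quoted under "Instruction Mix Considerations".
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Ms[ty][k] * Ns[k][tx];
            __syncthreads();
        }

        P[Row * Width + Col] = Pvalue;
    }

A launch matching the numbers above would use dim3 dimBlock(TILE_WIDTH, TILE_WIDTH) and dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH): 4096 blocks of 256 threads each for Width = 1024.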
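The shared-memory arithmetic is easy to reproduce. The small host-side snippet below (an illustration, not part of the lecture) prints the per-block usage and the shared-memory-only block limit for the two tile widths discussed; thread-count and register limits, which can bind first, are deliberately ignored here.

    #include <stdio.h>

    int main(void)
    {
        const int smem_per_sm = 16 * 1024;   // 16KB shared memory per SM on G80
        const int tiles[] = { 16, 32 };

        for (int i = 0; i < 2; ++i) {
            int tile = tiles[i];
            // Two tile x tile float arrays (Ms and Ns) per thread block.
            int smem_per_block = 2 * tile * tile * 4;
            printf("TILE_WIDTH %2d: %4dB per block -> at most %d block(s) per SM\n",
                   tile, smem_per_block, smem_per_sm / smem_per_block);
        }
        return 0;
    }

This prints 2KB and 8 blocks for TILE_WIDTH 16, and 8KB and 2 blocks for TILE_WIDTH 32, matching the figures above.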

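The instruction-mix observation is that the quoted loop issues one multiply-add against several branch and index-arithmetic instructions per iteration. One standard remedy is loop unrolling; the sketch below illustrates the idea and is not taken from this preview. Because TILE_WIDTH is a compile-time constant, the compiler can unroll the loop completely.

    // Ask the compiler to unroll fully (TILE_WIDTH is known at compile time).
    #pragma unroll
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

    // The same idea unrolled by hand by a factor of 4
    // (assumes TILE_WIDTH is a multiple of 4).
    for (int k = 0; k < TILE_WIDTH; k += 4) {
        Pvalue += Ms[ty][k]     * Ns[k][tx];
        Pvalue += Ms[ty][k + 1] * Ns[k + 1][tx];
        Pvalue += Ms[ty][k + 2] * Ns[k + 2][tx];
        Pvalue += Ms[ty][k + 3] * Ns[k + 3][tx];
    }

Either form spends a larger fraction of the issued instructions on mul/add work, which is exactly the ratio the slide is measuring.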
