583L21b - Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, and NVIDIA Corporation
PPoPP 2008 – February 21, 2008

MRI-FHD Performance
[Figure: bar chart of GFLOPS (left axis) and runtime in minutes (right axis) across loop unrolling factors for successive implementations: CPU.DP, CPU.SP, GPU.Base, GPU.RegAlloc, GPU.Coalesce, GPU.ConstMem, GPU.FastTrig, GPU.Tune, GPU.Multi. GFLOPS climb from roughly 1 on the CPU to several hundred on the tuned GPU versions, while runtime falls from tens of minutes to well under a minute.]
How do we get to here?
Principles
• Leverage zero-overhead thread scheduling
• Inter-thread communication is possible locally, not globally
• Optimize use of on-chip memory
• Group threads to avoid SIMD penalties and memory port/bank conflicts
• Further optimization involves tradeoffs between resources
Managing Memory Latency
• Global memory latency is 200+ cycles; at 8 instructions/cycle, roughly 1600 independent instructions must be in flight to avoid stalling
• Decompose work into a fine granularity for TLP
• ILP and MLP within each thread have a multiplicative effect

Matrix multiplication example: each thread computes one result element, so a 1024x1024 matrix yields 1M threads.

    Ctemp = 0;
    for (i = 0; i < widthA; i++) {
        Ctemp += A[indexA] * B[indexB];
        indexA++;
        indexB += widthB;
    }
    C[indexC] = Ctemp;
Global Bandwidth Saturation
The matrix multiplication inner loop shown above issues 2 global loads for every 6 instructions, which would require more than 2X the available off-chip bandwidth. Inter-thread data reuse through local scratchpad memory can reduce global bandwidth usage.
Memory Access Pattern
[Figure: access pattern of the matrix multiplication kernel over the input matrices; only the axis label WIDTH survives the extraction.]
Reducing Memory Bandwidth Usage

Before (one global load per operand, per iteration):

    Ctemp = 0;
    for (i = 0; i < widthA; i++) {
        Ctemp += A[indexA] * B[indexB];
        indexA++;
        indexB += widthB;
    }
    C[indexC] = Ctemp;

After (tiled version using __shared__ memory; cut off in the preview):

    Ctemp = 0;
    for (...) {
        __shared__

This note was uploaded on 12/26/2011 for the course EECS 583 taught by Professor Flinn during the Fall '08 term at University of Michigan.

