Parallelism for memory is most important

Most codes don't achieve peak FP throughput because of:
- Stalls waiting on memory (latency not completely hidden)
- Execution of non-FP instructions (indexing, control flow, etc.)
- NOT because of a lack of independent FP math

GK104: compared to Fermi, needs ~2x concurrent accesses per SM to saturate memory bandwidth
- Memory bandwidth comparable to Fermi
- 8 SMs, while Fermi had 16 SMs

This doesn't necessarily mean twice the occupancy of your Fermi code:
- If the Fermi code already exposed more than sufficient parallelism, the increase is less than 2x
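
One common way to expose more concurrent memory accesses per SM without raising occupancy is to give each thread several independent loads. The CUDA sketch below is illustrative only (the kernel name, ELEMS_PER_THREAD, and sizes are assumptions, not from the slides):

#define ELEMS_PER_THREAD 4

__global__ void scale_ilp(float *out, const float *in, float a, int n)
{
    // Each block covers blockDim.x * ELEMS_PER_THREAD elements.
    int i = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;

    float v[ELEMS_PER_THREAD];

    // All loads are issued before any arithmetic, and they are independent,
    // so their latencies can overlap instead of stalling one by one.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = i + k * blockDim.x;
        if (idx < n) v[k] = in[idx];
    }

    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = i + k * blockDim.x;
        if (idx < n) out[idx] = a * v[k];
    }
}

Because the loads in the first loop do not depend on each other, the hardware can overlap their latencies, and consecutive threads still touch consecutive addresses, so the accesses stay coalesced.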
Kepler SM Improvements for Occupancy

2x registers (both GK104 and GK110)
- 64K registers per SM (Fermi had 32K)
- Code whose occupancy is limited by registers will readily achieve higher occupancy (run more concurrent warps)

2x thread blocks (both GK104 and GK110)
- Up to 16 thread blocks per SM (Fermi had 8)

1.33x more threads (both GK104 and GK110)
- Up to 2048 threads per SM (Fermi had 1536)
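
Whether these higher limits actually translate into more resident warps for a given kernel can be checked with the CUDA occupancy API (added in CUDA 6.5, i.e. after these 2012 slides). A minimal host-side sketch, with my_kernel standing in for your own kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void my_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    int block_size = 256;
    int blocks_per_sm = 0;

    // Predicted resident blocks per SM for this kernel and block size,
    // taking its register and shared-memory usage into account.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (float)(blocks_per_sm * block_size) /
                      prop.maxThreadsPerMultiProcessor;
    printf("Predicted occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}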
Increased Shared Memory Bandwidth

Both GK104 and GK110. To benefit, code must access 8-byte words:
- No changes needed for double-precision codes
- Single-precision or integer codes should group accesses into float2 or int2 structures to get the benefit
- Refer to Case Study 6 for a use-case sample
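
A minimal sketch (kernel name and tile size are illustrative, not from the slides) of what "group accesses into float2" looks like in practice; each shared-memory access then moves an 8-byte word:

#define TILE 256   // launch with blockDim.x == TILE

__global__ void smem_float2_demo(float2 *out, const float2 *in)
{
    // Staging buffer of 8-byte elements: each access is one 64-bit word,
    // which is what Kepler's wider shared-memory banks are designed for.
    __shared__ float2 tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];      // one 8-byte load/store per thread
    __syncthreads();

    // Toy use of the staged data: swap the two components on the way out.
    float2 v = tile[threadIdx.x];
    out[i] = make_float2(v.y, v.x);
}

On Kepler the host can also request the 8-byte bank mode explicitly with cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte); the call is ignored or deprecated on newer architectures.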
SM Improvements Specific to GK110

More registers per thread
- A thread can use up to 255 registers (Fermi had 63)
- Improves performance for some codes that spilled a lot of registers on Fermi (or GK104)
- Note that higher per-thread register use still has to be weighed against lower occupancy

Ability to use the read-only cache for accessing global memory
- Improves performance for some codes with scattered access patterns; lowers the overhead due to replays

Warp-shuffle instruction (tool for ninjas)
- Enables threads in the same warp to exchange values without going through shared memory
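
The sketch below (an illustrative kernel, not from the slides) combines two of these features: __ldg() pulls data through the read-only cache (compute capability 3.5+), and warp shuffle sums 32 values within a warp without touching shared memory. The slides' era spelled the intrinsic __shfl_down(); CUDA 9 and later use __shfl_down_sync() with an explicit lane mask, as written here.

__global__ void warp_sums(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load through the read-only (texture) cache path.
    float v = (i < n) ? __ldg(&in[i]) : 0.0f;

    // Tree reduction inside the warp: each step pulls a value from the
    // lane 'offset' positions higher, halving the active span each time.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Lane 0 of every warp now holds that warp's sum.
    // 'out' must have one slot per launched warp.
    if ((threadIdx.x & 31) == 0)
        out[i >> 5] = v;
}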
Considerations for Dynamic Parallelism

GPU threads are able to launch work for the GPU (GK110-specific feature)

Same considerations as for launches from the CPU:
- The same requirements for exposing sufficient parallelism apply as for "traditional" launches (where the CPU launches work for the GPU)
- A single launch doesn't have to saturate the GPU: the GPU can execute up to 32 different kernel launches concurrently
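
For concreteness, a minimal device-side launch might look like the sketch below (both kernels are hypothetical). It needs a GK110-class device (compute capability 3.5+) and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt.

__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float *data, int n)
{
    // One thread decides how much work to launch. The launch syntax is the
    // same as on the host, and the device can run up to 32 such launches
    // from different threads concurrently.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data, n);
}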
Conclusion

When programming and optimizing, think about:
- Exposing sufficient parallelism
- Coalescing memory accesses
- Having coherent control flow within warps

Use profiling tools when analyzing performance:
- Determine performance limiters first
- Diagnose memory access patterns
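
As a reminder of what "coalescing memory accesses" means in code, here is an illustrative pair of kernels (not from the slides): the first lets consecutive threads touch consecutive addresses, the second strides and wastes bandwidth.

// Coalesced: a warp's 32 loads fall into contiguous 128-byte segments.
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: each thread jumps 'stride' elements, so a warp touches many
// separate segments and most of each fetched line goes unused.
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}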