NVIDIA Corporation 2012: nvprof Profile Data

© NVIDIA Corporation 2012

nvprof Profile Data Export/Import

Produce a profile into a file using -o
    $ nvprof -o profile.out <app> <app args>
Import into Visual Profiler
    File menu -> Import nvprof Profile…
Import into nvprof to generate textual outputs
    $ nvprof -i profile.out
    $ nvprof -i profile.out --print-gpu-trace
    $ nvprof -i profile.out --print-api-trace

nvprof MPI

Each rank must output to a separate file
Launch the nvprof wrapper with mpirun
Set the output file name based on rank
Limit which ranks are profiled
Example script in the nvvp help for OpenMPI and MVAPICH2
Remember to disable profiling at start if using cudaProfilerStart()/cudaProfilerStop()
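
The example script from the nvvp help is not reproduced on the slide. The following is only a minimal sketch of what such a wrapper might look like, not that script. It assumes the launcher exports the rank to each process (Open MPI as OMPI_COMM_WORLD_RANK, MVAPICH2 as MV2_COMM_WORLD_RANK). The function names, the PROFILE_RANKS variable, and the profile.<rank>.out naming are hypothetical and chosen for illustration.

```shell
#!/bin/sh
# Sketch of a per-rank nvprof wrapper (not the script from the nvvp help).

# Rank of this process as exported by the launcher:
# Open MPI sets OMPI_COMM_WORLD_RANK, MVAPICH2 sets MV2_COMM_WORLD_RANK.
# Falls back to 0 when neither is set (assumption for non-MPI runs).
current_rank() {
    echo "${OMPI_COMM_WORLD_RANK:-${MV2_COMM_WORLD_RANK:-0}}"
}

# Print the command line that rank $1 should run.
# PROFILE_RANKS (hypothetical variable) is a space-separated list of the
# ranks to profile; ranks not listed run the application without nvprof.
rank_command() {
    rank="$1"
    shift
    for r in ${PROFILE_RANKS:-0}; do
        if [ "$r" = "$rank" ]; then
            # One output file per rank, named after the rank.
            echo "nvprof -o profile.$rank.out $*"
            return 0
        fi
    done
    echo "$*"
}
```

A wrapper script built on these helpers would end by running the printed command for its own rank, for example `exec $(rank_command "$(current_rank)" "$@")`, and would itself be started by mpirun in place of the application. Because the command is passed as a plain string, this sketch only suits arguments without embedded spaces.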

EXPOSING SUFFICIENT PARALLELISM

Kepler: Level of Parallelism Needed

To saturate instruction bandwidth:
    FP32 math: ~1.7K independent instructions per SM
    Lower for other, lower-throughput instructions
    Keep in mind that a Kepler SM can track up to 2048 threads
To saturate memory bandwidth:
    100+ independent lines per SM
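
To make these figures concrete, the arithmetic can be sketched in a few lines of shell. The constants are the ones quoted on the slides (about 1.7K instructions, 2048 threads, 100 lines of 128 bytes). The function names are hypothetical, and the rounding up is a simplification for illustration, not a statement about how the hardware schedules work.

```shell
# Figures quoted on the slides.
INSTR_PER_SM=1700        # ~1.7K independent FP32 instructions per SM
MAX_THREADS_PER_SM=2048  # threads a Kepler SM can track
LINES_PER_SM=100         # 100+ independent lines per SM
LINE_BYTES=128           # one request is a 128-byte line

# Independent instructions each thread must supply (rounded up)
# when $1 threads are resident on the SM.
instr_per_thread() {
    echo $(( (INSTR_PER_SM + $1 - 1) / $1 ))
}

# Bytes that must be in flight per SM to reach the memory figure.
bytes_in_flight_per_sm() {
    echo $(( LINES_PER_SM * LINE_BYTES ))
}

instr_per_thread "$MAX_THREADS_PER_SM"   # prints 1
instr_per_thread 1024                    # prints 2
instr_per_thread 512                     # prints 4
bytes_in_flight_per_sm                   # prints 12800
```

With all 2048 threads resident, roughly one independent instruction per thread is enough. At half or a quarter of that thread count, each thread has to supply proportionally more independent work.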
Memory Parallelism Achieved

[Plot: Kepler memory throughput as a function of the number of independent requests per SM]
    Request: 128-byte line

Exposing Sufficient Parallelism

What hardware ultimately needs:
    Arithmetic pipes: a sufficient number of independent instructions (to accommodate multi-issue and latency hiding)
    Memory system: sufficient requests in flight to saturate bandwidth (Little’s Law)
Two ways to increase parallelism:
    More independent work within a thread (warp): ILP for math, independent accesses for memory
    More concurrent threads (warps)
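
Little’s Law says the amount of data that must be in flight equals latency times bandwidth. The sketch below turns that into a count of 128-byte requests. The function name is hypothetical, and the latency and bandwidth inputs in the example call are placeholders chosen for illustration, not measured Kepler values.

```shell
# Little's Law: data in flight = latency x bandwidth.
# With latency in ns and bandwidth in GB/s the product is in bytes,
# because 1 GB/s (10^9 bytes/s) equals 1 byte per ns.
lines_in_flight() {
    latency_ns="$1"
    bandwidth_gbps="$2"
    line_bytes="${3:-128}"   # request size, 128-byte line by default
    bytes=$(( latency_ns * bandwidth_gbps ))
    # Round up to whole lines.
    echo $(( (bytes + line_bytes - 1) / line_bytes ))
}

# Hypothetical inputs, for illustration only.
lines_in_flight 400 250   # prints 782
```

Dividing the result by the number of SMs gives the per-SM request count that either more independent accesses per thread or more concurrent threads has to provide.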