cs8803SC_lecture8

CS8803SC Software and Hardware Cooperative Computing
CUDA Programming
Prof. Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

Thinking Parallel: 15-puzzle problem
• The 15-puzzle consists of 15 tiles numbered 1 through 15 and one blank tile placed in a 4x4 grid.
• …
• The objective is to determine any sequence, or a shortest sequence, of moves that transforms the initial configuration into the final configuration.
• Think first.
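To make the search space concrete, here is a minimal sketch of how the configurations reachable in one move could be generated. The representation is an assumption and not from the slides: a row-major tuple of 16 entries with 0 standing for the blank. The function name `successors` is hypothetical.

```python
def successors(state):
    """Return every configuration reachable by sliding one tile into the blank.

    Assumed representation: a tuple of 16 ints in row-major order, 0 = blank.
    """
    blank = state.index(0)
    row, col = divmod(blank, 4)
    result = []
    # The blank can trade places with its up, down, left, or right neighbor.
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 4 and 0 <= c < 4:
            other = r * 4 + c
            nxt = list(state)
            nxt[blank], nxt[other] = nxt[other], nxt[blank]
            result.append(tuple(nxt))
    return result


# Final configuration: tiles 1..15 in order, blank in the last cell.
goal = tuple(range(1, 16)) + (0,)
for s in successors(goal):
    print(s)
```

Each successor is computed without reference to the others, which is one place to start looking for parallelism; whether that is the decomposition intended in the lecture is not stated in the preview.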
Questions
• Why not recursive? Function inlining.
• Number of DRAM memory banks? It's not clear.
• DRAM page miss, virtual page miss: next lecture.
• Writing to the same location?
  "When each thread in the warp receives unique data values, there are no collisions at all, and no additional actions need to be done. However, when two or more threads collide trying to write to the same location, the hardware performs shared memory write combining, that results in the acceptance of the tagged counter from one of the threads, and the rejection from all the other pending threads." (http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf)
• Shared memory: no miss!
• Constant/texture cache size? Manual, Appendix A: the cache working set for constant memory/texture memory is 8 KB per multiprocessor.
• The first interleaved MT machine's name: Heterogeneous Element Processor (HEP).

(Review) Programming Model: Square Matrix Multiplication Example
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread handles one element of P
  – M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
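The untiled scheme can be mimicked on the CPU to check the "loaded WIDTH times" claim. The following is a plain Python sketch, not CUDA code: each (row, col) loop iteration stands in for one thread, and the counter `global_loads` is a hypothetical stand-in for reads from global memory.

```python
def matmul_untiled(M, N, width):
    """Compute P = M * N with one simulated thread per element of P."""
    P = [[0.0] * width for _ in range(width)]
    global_loads = 0  # element reads from M and N ("global memory" in this model)
    for row in range(width):          # each (row, col) pair models one thread
        for col in range(width):
            acc = 0.0
            for k in range(width):
                acc += M[row][k] * N[k][col]
                global_loads += 2     # one element of M and one of N per multiply-add
            P[row][col] = acc
    return P, global_loads


width = 4
M = [[float(i * width + j) for j in range(width)] for i in range(width)]
N = [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]  # identity

P, loads = matmul_untiled(M, N, width)
print(P == M)                           # multiplying by the identity leaves M unchanged
print(loads // (2 * width * width))     # average number of times each input element is read
```

In this model the total is 2 * WIDTH^3 loads for 2 * WIDTH^2 input elements, so every element of M and N is read WIDTH times.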
How about performance?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 bytes of memory bandwidth per FLOP
  – 86.4 GB/s limits the code to 21.6 GFLOPS (86.4 GB/s ÷ 4 B/FLOP)
• The actual code should run at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
[Figure: CUDA device memory hierarchy. The device contains multiprocessors 1 through N; each multiprocessor has an instruction unit, processors 1 through M with their own registers, shared memory, a constant cache, and a texture cache. Device memory holds the global, constant, and texture memories.]

Idea: Use Shared Memory to reuse global memory data
• Each input element is read by WIDTH threads.
• If we load each element into Shared Memory and have several threads use the local version, we can drastically reduce the global memory bandwidth required
  – Load all the matrix?
  – Tiled algorithms
• Pattern
  – Copy data from global to shared memory
  – Synchronization
  – Computation (iteration)
  – Synchronization
  – Copy data from shared to global memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
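The copy / synchronize / compute / synchronize / copy-back pattern can be sketched on the CPU as follows. This is a sequential Python model under stated assumptions, not a CUDA kernel: the local lists `Ms` and `Ns` stand in for shared-memory tiles, the `tile` parameter and the `global_loads` counter are hypothetical, and the synchronization steps appear only as comments because the loops already run in order.

```python
def matmul_tiled(M, N, width, tile):
    """Compute P = M * N tile by tile, counting reads from "global memory"."""
    assert width % tile == 0, "this sketch assumes tile divides width"
    P = [[0.0] * width for _ in range(width)]
    global_loads = 0
    for by in range(0, width, tile):            # one simulated thread block
        for bx in range(0, width, tile):        # per tile of P
            acc = [[0.0] * tile for _ in range(tile)]   # per-thread accumulators
            for ph in range(0, width, tile):
                # Copy data from global to "shared" memory: one element per thread.
                Ms = [[M[by + ty][ph + tx] for tx in range(tile)] for ty in range(tile)]
                Ns = [[N[ph + ty][bx + tx] for tx in range(tile)] for ty in range(tile)]
                global_loads += 2 * tile * tile
                # Synchronization point: both tiles must be complete before use.
                for ty in range(tile):          # computation on the local tiles
                    for tx in range(tile):
                        for k in range(tile):
                            acc[ty][tx] += Ms[ty][k] * Ns[k][tx]
                # Synchronization point: finish computing before tiles are overwritten.
            # Copy results back to global memory (here from the accumulators).
            for ty in range(tile):
                for tx in range(tile):
                    P[by + ty][bx + tx] = acc[ty][tx]
    return P, global_loads


width, tile = 4, 2
M = [[float(i + j) for j in range(width)] for i in range(width)]
N = [[float(i * j) for j in range(width)] for i in range(width)]
reference = [[sum(M[r][k] * N[k][c] for k in range(width)) for c in range(width)]
             for r in range(width)]

P, loads = matmul_tiled(M, N, width, tile)
print(P == reference)
print(loads, 2 * width ** 3)    # tiled load count next to the untiled count
```

In this model the load count drops from 2 * WIDTH^3 to 2 * WIDTH^3 / tile, so a larger tile means proportionally fewer global reads, bounded in practice by how much shared memory a block can use.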
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n / N is called