cs8803SC_lecture10

CS8803SC Software and Hardware Cooperative Computing
G80 Architecture (micro-architecture)
Prof. Hyesoon Kim, School of Computer Science, Georgia Institute of Technology

Grids and Blocks: CUDA Review
- A kernel is executed as a grid of thread blocks.
- All threads in a grid share the same device memory space.
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of blocks, and Kernel 2 on Grid 2; Block (1,1) is expanded to show its 5x3 array of threads.]
Courtesy: NVIDIA. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC
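The grid/block hierarchy above gives every thread a pair of coordinates. As a plain-C sketch of how those coordinates flatten to a unique global thread ID (in a real CUDA kernel the inputs come from the built-ins `blockIdx`, `threadIdx`, `gridDim`, and `blockDim`; the helper name here is hypothetical):

```c
#include <assert.h>

/* Hypothetical flattening of a 2-D grid of 2-D blocks (as in the
 * Grid 1 figure above) into a unique global thread ID. In CUDA the
 * coordinate values are supplied by hardware built-ins. */
static int global_thread_id(int block_x, int block_y,
                            int thread_x, int thread_y,
                            int grid_dim_x, int block_dim_x, int block_dim_y)
{
    int block_id = block_y * grid_dim_x + block_x;    /* which block in the grid   */
    int local_id = thread_y * block_dim_x + thread_x; /* which thread in the block */
    return block_id * (block_dim_x * block_dim_y) + local_id;
}
```

For Grid 2's expanded block layout (3 blocks per grid row, 5x3 threads per block), thread (0,0) of block (0,0) gets ID 0, and thread (4,2) of block (2,1) gets the last ID, 89.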
GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPCs); each TPC holds a TEX unit and two Streaming Multiprocessors (SMs); each SM has instruction fetch/dispatch, an instruction L1, a data L1, shared memory, eight SPs, and two SFUs.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

CUDA Processor Terminology
- SPA: Streaming Processor Array (size varies across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); a multithreaded processor core and the fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; a scalar ALU for a single CUDA thread
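The terminology above implies the chip's total scalar-processor count. A small arithmetic check, using the GeForce 8800 figures quoted above (8 TPCs, 2 SMs per TPC, 8 SPs per SM):

```c
#include <assert.h>

/* Total SP count implied by the SPA -> TPC -> SM -> SP hierarchy. */
static int total_sps(int tpcs_per_spa, int sms_per_tpc, int sps_per_sm)
{
    return tpcs_per_spa * sms_per_tpc * sps_per_sm;
}
```

Plugging in the 8800's numbers gives 8 * 2 * 8 = 128 SPs, matching the 128 FPUs on the next slide.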
GeForce 8800 GTX
- 16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Figure: host -> input assembler -> thread execution manager, feeding eight pairs of SMs, each pair with a parallel data cache and texture unit, with load/store paths to global memory.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

SM Executes Blocks
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads: e.g., 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.
- Threads run concurrently
  - The SM assigns and maintains thread IDs
  - The SM manages and schedules thread execution
[Figure: two SMs (SM 0, SM 1), each with an MT issue unit, SPs, and shared memory, executing blocks of threads t0, t1, t2, ..., tm; a texture L1 and an L2 sit between the SMs and memory.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC
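The two limits above (8 blocks per SM, 768 threads per SM) determine how many blocks one G80 SM can host for a given block size. A sketch of that calculation, ignoring the other resource limits (registers, shared memory) that can lower the count further:

```c
#include <assert.h>

/* Blocks resident on one G80 SM, limited by the 768-thread cap and
 * the 8-block cap quoted on the slide. Register and shared-memory
 * pressure, not modeled here, can reduce this further. */
static int blocks_per_sm(int threads_per_block)
{
    int by_threads = 768 / threads_per_block; /* thread-count limit  */
    return by_threads < 8 ? by_threads : 8;   /* 8-block hardware cap */
}
```

This reproduces the slide's examples: 256 threads/block allows 3 blocks, 128 threads/block allows 6, and small blocks (64 threads or fewer) hit the 8-block cap first, leaving part of the 768-thread budget unused.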
Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
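Since the SM schedules in warp units, a block's scheduling cost is its warp count, and a block whose size is not a multiple of 32 still occupies a full final warp. A sketch of that rounding, assuming the 32-thread warp width stated above:

```c
#include <assert.h>

#define WARP_SIZE 32 /* warp width on G80, per the slide */

/* Warps the SM schedules for one block: a partial final warp
 * still takes a full scheduling slot. */
static int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

For example, a 256-thread block is 8 warps, while a 33-thread block costs 2 warps even though it barely uses the second one.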