memory_arch - ECE 498AL Lecture 8: Memory Hardware in G80

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

CUDA Device Memory Space: Review
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Figure: the CUDA device memory spaces. Each thread has its own registers and local memory, each block has its own shared memory, and global, constant, and texture memory are shared across the whole grid and accessible from the host.]
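To make this review concrete, here is a minimal CUDA sketch (not from the lecture; the kernel name scale and the buffer names in, out, and coeff are assumptions, and texture memory is omitted) showing where each memory space appears in code:

    // Hypothetical illustration of the CUDA memory spaces listed above.
    __constant__ float coeff[16];                 // per-grid constant memory (read-only in kernels)

    __global__ void scale(const float *in, float *out, int n)
    {
        __shared__ float tile[256];               // per-block shared memory
        float scratch[64];                        // large per-thread arrays may be placed in local memory (DRAM)
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalars like i normally live in per-thread registers

        if (i < n) {
            tile[threadIdx.x] = in[i];            // read per-grid global memory
            scratch[i % 64] = tile[threadIdx.x] * coeff[0];
            out[i] = scratch[i % 64];             // write per-grid global memory
        }
    }

    // Host side: global and constant memory are written before the launch, e.g.
    //   cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    //   cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    //   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);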

Parallel Memory Sharing
• Local Memory: per-thread (slow, in DRAM)
  – Private per thread
  – Auto variables, register spill
• Shared Memory: per-block (fast)
  – Shared by threads of the same block
  – Inter-thread communication
• Global Memory: per-application
  – Shared by all threads
  – Inter-grid communication
[Figure: each thread has local memory, each block has shared memory, and sequential grids in time (Grid 0, Grid 1, ...) communicate through global memory.]

HW Overview
[Figure: the Streaming Processor Array consists of eight Texture Processor Clusters (TPCs); each TPC contains a TEX unit and Streaming Multiprocessors (SMs); each SM contains instruction fetch/dispatch logic, an instruction L1 cache, a data L1 cache, shared memory, eight Streaming Processors (SPs), and two Special Function Units (SFUs).]
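As a sketch of how these three scopes are used in practice (the names partial_sums, in, and block_sums are assumptions, not something from the slides): threads of one block communicate through shared memory, protected by barriers, while a later, separately launched grid picks up the per-block results from global memory.

    // Assumed to be launched with 256-thread blocks (a power of two).
    __global__ void partial_sums(const float *in, float *block_sums, int n)
    {
        __shared__ float buf[256];                // per-block shared memory (fast)
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                          // barrier: inter-thread communication

        // Tree reduction within the block, entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = buf[0];      // global memory: survives into the next grid
    }

    // A second kernel launched afterwards (a separate grid, sequential in time)
    // would read block_sums from global memory to combine the per-block results.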
SM Memory Architecture
• Threads in a block share data and results
  – In global memory and shared memory
  – Synchronize at barrier instructions
• Per-block shared memory allocation
  – Keeps data close to the processor
  – Minimizes trips to global memory
  – The SM's shared memory is dynamically allocated to blocks and is one of the limiting resources
[Figure: two SMs (SM 0, SM 1), each running blocks of threads t0 t1 t2 … tm, with per-SM shared memory, multithreaded instruction units, a texture L1 cache, and an L2 cache in front of device memory. Courtesy: John Nicols, NVIDIA]

SM Register File
• Register File (RF)
  – 32 KB
  – Provides 4 operands/clock
• The TEX pipe can also read/write the RF
  – 2 SMs share 1 TEX
• The Load/Store pipe can also read/write the RF
[Figure: SM datapath with instruction L1 cache, multithreaded instruction buffer, register file, constant L1 cache, shared memory, operand select, and MAD and SFU execution units.]
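Because an SM's shared memory is handed out to its resident blocks, the amount a block asks for directly limits how many blocks can be co-resident. A minimal sketch, assuming a kernel named smooth and using the documented 16 KB of shared memory per G80 SM (a figure not shown in this excerpt):

    __global__ void smooth(const float *in, float *out, int n)
    {
        extern __shared__ float window[];          // size is fixed at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        window[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage data close to the processor
        __syncthreads();                           // every thread reaches the barrier

        if (i < n)
            out[i] = window[threadIdx.x];          // placeholder computation
    }

    // The third launch-configuration parameter is the per-block shared memory size:
    //   smooth<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
    //
    // G80 provides 16 KB of shared memory per SM, so a block that requested,
    // say, 5 KB could have at most 3 co-resident blocks per SM on account of
    // shared memory alone (3 x 5 KB = 15 KB <= 16 KB).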
Programmer View of Register File
• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in a block can only access the registers assigned to it
[Figure: the same register file holding 4 blocks or 3 blocks, depending on how many registers each block needs.]
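A hedged worked example of this partitioning (only the 8192-register total comes from the slide; the block shape and register counts are assumed): a 16x16 block has 256 threads, so at 10 registers per thread a block needs 2,560 registers and three blocks fit on one SM (3 x 2,560 = 7,680 <= 8,192), while at 11 registers per thread a block needs 2,816 registers and only two blocks fit. The compiler's per-thread register count can be queried from host code:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data)        // placeholder kernel, name assumed
    {
        if (threadIdx.x == 0) data[0] = 1.0f;
    }

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);  // reports registers used per thread
        printf("registers per thread: %d\n", attr.numRegs);

        // With 8192 registers per G80 SM and 256-thread blocks, the register file
        // alone limits residency to:
        int regsPerBlock = attr.numRegs * 256;
        if (regsPerBlock > 0)
            printf("blocks per SM (register limit): %d\n", 8192 / regsPerBlock);
        return 0;
    }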
