cs8803SC_lecture11

CS8803SC Software and Hardware Cooperative Computing
G80 Architecture (memory system)
Prof. Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

Think Parallel
• Parallel reduction
• How can we do it in CUDA?
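The parallel reduction the slide asks about can be sketched as a tree-based sum in CUDA. This is a minimal illustrative sketch, not the lecture's own code; the kernel and variable names (reduceSum, sdata) are assumptions.

```cuda
// Tree-based parallel reduction: each block reduces blockDim.x
// elements in shared memory down to one partial sum.
__global__ void reduceSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];          // one slot per thread
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;      // load into shared memory
    __syncthreads();

    // Halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];           // one result per block
}
// Launch (host side), e.g. with 256 threads per block:
//   reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
// The per-block partial sums can then be reduced again, or summed on the host.
```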
Constants
• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip, in an L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp
  – Extremely efficient way of accessing a value that is common for all threads in a block!
[Figure: SM pipeline — I$ L1, multithreaded instruction buffer, RF, C$ L1, shared mem, operand select, MAD, SFU]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

Textures
• Textures are 2D arrays of values stored in global DRAM
• Textures are cached in L1 and L2
• Read-only access
• Caches are optimized for 2D access:
  – Threads in a warp that follow 2D locality will achieve better memory performance
https://users.ece.utexas.edu/~merez/new/pmwiki.php/EE382VFa07/Schedule?action=download&upname=EE382V_Fa07_Lect13_G80Mem.pdf
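The warp-wide broadcast from the constant cache can be sketched as follows; the array and kernel names (coef, scaleAll) are illustrative assumptions, not from the slides.

```cuda
// Constant memory: stored in DRAM, cached in the per-SM constant cache (C$ L1).
__constant__ float coef[16];

__global__ void scaleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in the warp reads the same coef[0], so the cached
    // value is broadcast to the whole warp in a single access.
    if (i < n)
        data[i] *= coef[0];
}
// Host side: constants are written with
//   cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef));
```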
Exploiting the Texture Samplers
• Designed to map textures onto 3D polygons
• Specialty hardware pipelines for:
  – Fast data sampling from 1D, 2D, 3D arrays
  – Swizzling of 2D, 3D data for optimal access
  – Bilinear filtering in zero cycles
• Arrays indexed by u,v,w coordinates – easy to program
• Extremely well suited for multigrid & finite difference methods – example later
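A sketch of how a kernel of that era would sample a 2D array through the texture path, using the G80-generation texture reference API (tex2D); the reference and kernel names are assumptions for illustration.

```cuda
// Texture reference, bound on the host to a cudaArray holding the 2D grid.
texture<float, 2, cudaReadModeElementType> texGrid;

__global__ void sampleGrid(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // With cudaFilterModeLinear set on the reference, this single
        // fetch returns a bilinearly filtered sample from the hardware.
        out[y * w + x] = tex2D(texGrid, x + 0.5f, y + 0.5f);
}
```

The 2D-optimized texture cache is what makes warps with 2D locality faster here than plain global loads.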
Shared Memory
• Each SM has 16 KB of shared memory
  – 16 banks of 32-bit words
• CUDA uses shared memory as shared storage visible to all threads in a thread block
  – read and write access
• Not used explicitly for pixel shader programs
  – we dislike pixels talking to each other
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

Sample CUDA / PTX Programs
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC
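Shared memory as block-visible scratch space, sized against the 16 banks of 32-bit words above, can be sketched with a tiled transpose; the names (TILE, transposeTile) and the padding trick are illustrative, not from the slides.

```cuda
#define TILE 16   // matches both the half-warp width and the 16 banks

__global__ void transposeTile(float *out, const float *in, int w) {
    // +1 pads each row so that a column of the tile falls in
    // different banks, avoiding bank conflicts on the second access.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * w + x];    // coalesced read
    __syncthreads();   // every thread in the block now sees the full tile

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * w + tx] = tile[threadIdx.x][threadIdx.y]; // conflict-free via padding
}
```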
PTX Virtual Machine and ISA
• Parallel Thread eXecution (PTX)
• Virtual machine and ISA
  – Programming model
  – Execution resources and state
• An intermediate-language ISA
  – Variable declarations
  – Instructions and operands
• Translator is an optimizing compiler
  – Translates PTX to target code
  – Runs at program install time
• Driver implements the VM runtime
  – Coupled with the translator
[Figure: toolchain — C/C++ application → C/C++ compiler → PTX code → PTX-to-target translator → target code (e.g., G80 GPU); an ASM-level library programmer can also emit PTX code directly]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, UIUC

In PTX World
• CTA (= block in the CUDA programming domain): Cooperative Thread Array
• Special registers
  – %ctaid: each CTA has a unique CTA id
  – %nctaid: number of CTAs per grid dimension (1D, 2D, 3D)
  – %gridid: each grid has a unique temporal grid id

This note was uploaded on 10/06/2010 for the course CS 8803 taught by Professor Staff during the Spring '08 term at Georgia Tech.
