Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03

•  Kernel invocation ("<<<…>>>") corresponds to the enclosing loop nest, managed by hardware
•  Explicitly split into a 2-level hierarchy: blocks (which share "shared" memory) and the grid
•  A kernel commonly consists of just one iteration, but could be a loop
•  Multiple tuning parameters trade off register pressure, shared-memory capacity and parallelism (see the launch sketch below)

CUDA Memory Model

–  Local memory – private to each thread (slow if off-chip, fast if register-allocated)
–  Shared memory – shared between threads in a thread block (fast, on-chip)
–  Global memory – shared between thread blocks in a grid (off-chip DRAM)
–  Constant memory – small, read-only
–  Texture memory – read-only; cached, stored in global memory

•  This diagram is misleading: it shows logical association, not hardware locality
•  "Local memory" is non-cached (in Tesla); it is stored in global DRAM
•  The critical thing is that "shared" memory is shared among all threads in a block, since they all run on the same SM (see the memory-space sketch below)

Mapping from CUDA to TESLA

•  Array of streaming multiprocessors (SMs)
   –  we might call them "cores" when comparing to conventional multicore; each SM is an instruction-fetch-execution engine
•  CUDA thread blocks get mapped to SMs
•  SMs have thread processors, private registers, shared memory, etc. (see the device-query sketch below)
•  Ea…
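
A minimal launch sketch, not from the slides: the <<<grid, block>>> invocation below stands in for the enclosing two-level loop nest over blocks and threads that the hardware manages. The kernel name "scale" and the problem size "n" are illustrative assumptions.

    #include <cuda_runtime.h>

    // Kernel body = one "iteration" of the implicit loop nest; blockIdx and
    // threadIdx recover the loop indices that the hardware manages.
    __global__ void scale(float *x, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                            // guard: the grid may overshoot n
            x[i] = alpha * x[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));    // global memory (off-chip DRAM)

        int block = 256;                      // tuning parameter: threads per block
        int grid  = (n + block - 1) / block;  // number of blocks in the grid
        scale<<<grid, block>>>(x, 2.0f, n);   // 2-level hierarchy: grid of blocks
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }

Choosing "block" is exactly the kind of tuning parameter the slides mention: larger blocks expose more parallelism per SM, but raise the per-block register and shared-memory demand.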
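
A memory-space sketch (assumed, not from the slides) showing where each space above appears in source code. Kernel and variable names are illustrative; texture memory is omitted because it needs extra host-side setup.

    __constant__ float coeff[16];             // constant memory: small, read-only

    // Assumes a launch with 256 threads per block.
    __global__ void spaces(const float *g_in, float *g_out)  // g_*: global memory
    {
        __shared__ float tile[256];           // shared memory: on-chip, per block

        float t = g_in[threadIdx.x];          // t lives in a register if possible;
                                              // spills go to off-chip local memory
        tile[threadIdx.x] = t * coeff[0];
        __syncthreads();                      // every thread in the block sees the
                                              // same tile: they all run on one SM
        g_out[threadIdx.x] = tile[255 - threadIdx.x];
    }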
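
To make the CUDA-to-TESLA mapping concrete, a small host-side device-query sketch (my addition, using the standard CUDA runtime API) prints the number of SMs onto which blocks are scheduled, plus the per-block resource limits behind the register/shared-memory/parallelism trade-off:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // properties of device 0
        printf("streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers per block:     %d\n", prop.regsPerBlock);
        return 0;
    }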