Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03


...truction each issue
  –  ¼ wavefront (16 threads) on each clock of 4 clocks (T0-15, T16-31, T32-47, T48-63)

With thanks to AMD, with permission

Local Data Share (LDS)
•  This is AMD's implementation of the "shared" memory of the CUDA model
•  32 banks to allow concurrent access by all threads

AMD GPU Memory Architecture
[Figure labels: SIMD Engine; LDS, Registers; L1 Cache; Compute Unit to Memory X-bar; L2 Cache; Write Cache; Atomic Path]
•  Memory per compute unit
  –  Local data store (on-chip)
  –  Registers
  –  L1 cache (8KB for 5870) per compute unit
•  L2 cache shared between compute units (512KB for 5870)
•  Fast path for only 32-bit operations
•  Complete path for atomics and < 32-bit operations

Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD

Nvidia Memory Hierarchy
[Figure labels: Registers; Thread Block]
•  L1 cache per SM configurable to support shared memory and caching of global memory
  –  48 KB Shared / 16 KB of L1 cache
  –  16 KB Shared / 48 KB of L1 cache
•  Data shared between work items of a group using shared memory
•  Each SM has a 32K register bank
•  L2 cache (768KB) that services all operations (l...
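
The quarter-wavefront issue pattern and the 32-bank LDS described in the AMD slides can be illustrated with a small model. This is a minimal sketch, not a description of the real hardware: the 4-byte bank width, the word-interleaved address-to-bank mapping, and the helper names (`issue_schedule`, `bank_of`, `conflict_degree`) are assumptions introduced here for illustration, and the model ignores any broadcast of identical addresses.

```python
from collections import Counter

WAVEFRONT_SIZE = 64    # threads per wavefront (T0..T63 in the slide)
LANES_PER_CLOCK = 16   # a quarter wavefront issues on each clock
NUM_BANKS = 32         # LDS banks, as stated in the slide
BANK_WIDTH_BYTES = 4   # ASSUMPTION: 32-bit, word-interleaved banks


def issue_schedule():
    """Return the thread-ID range issued on each of the 4 clocks."""
    return [range(start, start + LANES_PER_CLOCK)
            for start in range(0, WAVEFRONT_SIZE, LANES_PER_CLOCK)]


def bank_of(byte_address):
    """Map a byte address to a bank under the assumed interleaving."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(thread_ids, stride_words):
    """Largest number of threads hitting one bank when thread t
    reads the 32-bit word at index t * stride_words."""
    hits = Counter(bank_of(t * stride_words * BANK_WIDTH_BYTES)
                   for t in thread_ids)
    return max(hits.values())


if __name__ == "__main__":
    for clock, threads in enumerate(issue_schedule()):
        print(f"clock {clock}: T{threads[0]}-T{threads[-1]}")
    quarter = issue_schedule()[0]
    for stride in (1, 2, 4, 32):
        print(f"stride {stride}: conflict degree",
              conflict_degree(quarter, stride))
```

Under these assumptions, a degree of 1 means every thread in the quarter wavefront reaches a different bank, while a higher degree means that many accesses to one bank would have to be serialized.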