For loads and stores shared memory l1 cache l2 cache

ing Elements

Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 28
Instructions dispatched in "clauses", split on main-memory accesses
With thanks to AMD, with permission

SIMD Engine
"Warps"

•  SIMD Engine can process Wavefronts from multiple kernels concurrently
•  Thread divergence within a Wavefront is enabled with Lane Masking and Branching
   –  Enabling each Thread in a Wavefront to traverse a unique program execution path
•  Full hardware barrier support for up to 8 Work Groups per SIMD Engine (for thread data sharing)
•  Each Stream Core receives up to the following per VLIW instruction issue
   –  5 unique ALU Ops, or 4 unique ALU Ops with a LDS Op (Up to 3 operands per thread)
•  LDS and Global Memory access for byte, ubyte, short, ushort reads/writes supported at 32bit dword rates
•  Private Loads and read only texture reads via Read Cache
•  Unordered shared consistent loads/stores/atomics via R/W Cache
•  Wavefront length of 64 threads where each thread executes a 5 way VLIW Ins...
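The lane-masking point above can be illustrated with a small host-side simulation. This is a minimal sketch in plain Python, not AMD's hardware mechanism or any OpenCL API: the function name run_divergent_branch, the sample branch (halve even values, otherwise compute 3v + 1), and the "passes" counter are all assumptions made for illustration. Only the wavefront length of 64 comes from the slide. The idea shown is that a wavefront shares one instruction stream, so a divergent branch is handled by walking each side of the branch in turn while a per-lane mask decides which threads commit results.

```python
WAVEFRONT_SIZE = 64  # wavefront length stated on the slide


def run_divergent_branch(values):
    """Simulate one wavefront executing a two-sided branch with lane masks.

    Hypothetical example branch (an assumption, not from the slides):
        if v is even: v = v // 2
        else:         v = 3 * v + 1
    Returns the per-lane results and the number of branch sides executed.
    """
    assert len(values) == WAVEFRONT_SIZE
    result = list(values)

    # Every lane evaluates the condition; the outcome is the lane mask.
    mask = [v % 2 == 0 for v in values]
    passes = 0

    # "Then" side: issued once for the whole wavefront if any lane needs it.
    # Lanes whose mask bit is clear are disabled and keep their old value.
    if any(mask):
        passes += 1
        for lane in range(WAVEFRONT_SIZE):
            if mask[lane]:
                result[lane] = values[lane] // 2

    # "Else" side: same idea with the inverted mask.
    if not all(mask):
        passes += 1
        for lane in range(WAVEFRONT_SIZE):
            if not mask[lane]:
                result[lane] = 3 * values[lane] + 1

    return result, passes


if __name__ == "__main__":
    mixed, mixed_passes = run_divergent_branch(list(range(WAVEFRONT_SIZE)))
    uniform, uniform_passes = run_divergent_branch([2] * WAVEFRONT_SIZE)
    print("divergent wavefront passes:", mixed_passes)
    print("uniform wavefront passes:", uniform_passes)
```

In this sketch a wavefront whose lanes all agree on the condition executes only one side of the branch, while a wavefront with mixed outcomes executes both sides with part of its lanes idle each time. That is the usual reason divergence within a wavefront costs throughput even though each thread still follows its own path.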
