Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03

For loads and stores: shared memory, L1 cache, L2 cache.
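As a concrete illustration of that load/store path, here is a minimal OpenCL C sketch; the kernel name, arguments, and the scale factor are assumptions for illustration, not taken from the notes. Accesses to a __local buffer go to the on-chip shared memory (LDS), while __global loads and stores are serviced through the L1 and L2 caches on their way to main memory.

    __kernel void scale_by_two(__global const float *in,
                               __global float *out,
                               __local  float *tile)   /* hypothetical names */
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        /* Global load: serviced through the L1/L2 cache hierarchy;
           the result is stored into on-chip shared memory (LDS). */
        tile[lid] = in[gid];

        barrier(CLK_LOCAL_MEM_FENCE);     /* make the LDS store visible to the work-group */

        /* Local load from LDS, then a global store back out through the caches. */
        out[gid] = 2.0f * tile[lid];
    }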

…ing Elements
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide; Perhaad Mistry & Dana Schaa, Northeastern Univ. Computer Architecture Research Lab, with Ben Gaster, AMD. With thanks to AMD, with permission.

Instructions are dispatched in "clauses", split on main-memory accesses.

SIMD Engine "Warps"
•  A SIMD Engine can process Wavefronts from multiple kernels concurrently
•  Thread divergence within a Wavefront is enabled with Lane Masking and Branching, enabling each Thread in a Wavefront to traverse a unique program execution path (see the sketch after this list)
•  Full hardware barrier support for up to 8 Work Groups per SIMD Engine (for thread data sharing)
•  Each Stream Core receives up to the following per VLIW instruction issue:
   –  5 unique ALU Ops, or
   –  4 unique ALU Ops with an LDS Op (up to 3 operands per thread)
•  LDS and Global Memory access for byte, ubyte, short, ushort reads/writes supported at 32-bit dword rates
•  Private Loads and read-only texture reads via the Read Cache
•  Unordered shared consistent loads/stores/atomics via the R/W Cache
•  Wavefront length of 64 threads, where each thread executes a 5-way VLIW instruction …
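To make the divergence, barrier, and LDS points above concrete, here is a short OpenCL C work-group reduction. It is a generic sketch under stated assumptions (hypothetical kernel name and arguments, and a work-group size of 64, i.e. one wavefront, chosen at launch), not code from the slides. On each step of the loop only half of the active lanes take the if branch, so the wavefront diverges and the hardware masks off the inactive lanes; the barrier() calls are the work-group barriers mentioned above.

    __kernel void wavefront_reduce(__global const float *in,
                                   __global float *partial_sums,
                                   __local  float *scratch)   /* LDS scratch, sized at launch */
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        scratch[lid] = in[gid];             /* each work-item stages one element in LDS */
        barrier(CLK_LOCAL_MEM_FENCE);       /* work-group barrier for data sharing */

        /* Tree reduction: each step halves the number of lanes doing work,
           so lanes that fail the test are masked off (wavefront divergence). */
        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)                       /* only lane 0 writes the work-group result */
            partial_sums[get_group_id(0)] = scratch[0];
    }

With a work-group size of 64 the whole group maps onto a single wavefront, so every branch inside the loop is resolved by lane masking rather than by scheduling separate wavefronts.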

