Ch07-AdvCompArch-ManycoresAndGPUs-PaulKelly-V03


…sed on their type and "fairness"
•  Scheduler issues a warp instruction with highest priority

A hypothetical GPU:
•  Four warp contexts T0–T3
•  20 cycle compute
•  50 cycle memory access

K. Fatahalian, M. Houston. "A Closer Look at GPUs," Commun. ACM, 51:10, pp. 50–57.

SM instruction issue
•  To optimise for power and performance, GPU units are in multiple clock domains
   –  Core clock (e.g. 600 MHz) – "slow" cycles
   –  FU clock (e.g. 1300 MHz) – "fast" cycles
   –  Memory clock (e.g. 1100 MHz)
•  Issue logic can issue every slow cycle (every 2 fast cycles)
•  FUs can typically be issued instructions every 4 fast cycles (1 cycle per 8 threads from a warp)
•  "Dual" issue to independent FUs on alternate slow cycles

SM branch management
•  Threads in a warp start executing from the same address
•  A warp that executes a branch instruction waits until the target address for every thread in the warp is computed
•  Threads in a warp can take the same branch path
•  Threads in a warp can diverge to different paths
   –  the warp serially executes each path, disabling threads that are not on that path;
   –  when all paths complete, the threads reconverge
•  Divergence only occurs within a wa…
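The latency-hiding effect of multiple warp contexts can be illustrated with a small simulation. This is a hedged sketch, not the real scheduler: it assumes a single in-order core that always picks the lowest-numbered ready warp (a stand-in for the priority rule above), each warp running a 20-cycle compute phase, a 50-cycle memory access, and a final 20-cycle compute phase, matching the hypothetical GPU's parameters.

```python
# Sketch of warp scheduling hiding memory latency. Assumptions (not from
# real hardware): one warp issues at a time, lowest-index ready warp wins,
# and memory accesses proceed in the background while other warps compute.

def simulate(num_warps, compute=20, mem=50):
    """Each warp runs: compute, memory access, compute.
    Returns (total cycles, cycles the core spent computing)."""
    phases = [[("c", compute), ("m", mem), ("c", compute)]
              for _ in range(num_warps)]
    ready_at = [0] * num_warps      # cycle at which each warp is next runnable
    cycle = busy = 0
    while any(phases):
        runnable = [w for w in range(num_warps)
                    if phases[w] and ready_at[w] <= cycle]
        if not runnable:            # every live warp is waiting on memory
            cycle = min(ready_at[w] for w in range(num_warps) if phases[w])
            continue
        w = runnable[0]
        kind, length = phases[w].pop(0)
        if kind == "c":             # core is occupied for the compute phase
            cycle += length
            busy += length
            ready_at[w] = cycle
        else:                       # memory access overlaps other warps' work
            ready_at[w] = cycle + length
    return cycle, busy

print(simulate(1))   # (90, 40)   -> core busy ~44% of the time
print(simulate(4))   # (160, 160) -> memory latency fully hidden
```

With one warp context the core idles through every 50-cycle memory access; with four, there is always another warp ready to compute, so utilisation reaches 100%.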
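The "4 fast cycles per warp" figure follows directly from the warp width and the number of FU lanes. A quick back-of-the-envelope check, assuming 32-thread warps and 8 FU lanes per issue group (the numbers implied by "1 cycle per 8 threads"):

```python
# Back-of-the-envelope check of the issue rates quoted above.
# Assumed figures: 32-thread warps, 8 FU lanes, clocks from the slide.
WARP_SIZE = 32
FU_LANES = 8                     # threads entering the FUs per fast cycle

fast_cycles_per_warp = WARP_SIZE // FU_LANES
print(fast_cycles_per_warp)      # 4: one warp occupies the FUs for 4 fast cycles

# Issue logic runs on the slow (core) clock; FUs run on the fast clock.
CORE_CLK, FU_CLK = 600e6, 1300e6
fast_per_slow = round(FU_CLK / CORE_CLK)
print(fast_per_slow)             # 2: one slow cycle covers ~2 fast cycles
```

Since a warp ties up the FUs for 4 fast cycles but issue happens only every 2 fast cycles, alternate slow cycles are free to "dual" issue to a different, independent FU group.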
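The serialise-then-reconverge behaviour of a divergent branch can be modelled with an explicit active mask. A minimal sketch, using plain Python lists in place of SIMD lanes; the names `run_divergent_branch`, `then_fn`, and `else_fn` are illustrative, not a real GPU API:

```python
# Sketch of SIMT branch divergence: execute each path serially under an
# active mask, then reconverge with all threads enabled again.

def run_divergent_branch(warp, cond, then_fn, else_fn):
    mask = [cond(x) for x in warp]       # per-thread branch outcome
    out = list(warp)
    # Pass 1: threads taking the branch; the rest are disabled.
    for i, x in enumerate(warp):
        if mask[i]:
            out[i] = then_fn(x)
    # Pass 2: the remaining threads. If the mask was uniform, one of the
    # two passes does no work, i.e. there is no divergence penalty.
    for i, x in enumerate(warp):
        if not mask[i]:
            out[i] = else_fn(x)
    return out                           # reconvergence: all threads active

print(run_divergent_branch(list(range(8)),
                           lambda x: x % 2 == 0,
                           lambda x: x * 2,
                           lambda x: x + 100))
# [0, 101, 4, 103, 8, 105, 12, 107]
```

Note the cost model this implies: a warp whose threads split across both paths pays for both passes, which is why divergence within a warp (but not across warps) hurts throughput.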

This document was uploaded on 03/18/2014 for the course CO 332 at Imperial College.
