its own queue. If that is empty, it randomly chooses another thread and tries to steal a task from that thread's queue. Because each thread first tries to take a new task from its own queue, there is no single hot spot, and this scheme eliminates the performance degradation. CUDA supports atomic operations on shared memory as well as on global memory. Since shared memory is local to each core on the GPU, we can implement local distributed queues using atomic operations on shared memory. Figure 6 illustrates this. Instead of passing the shared queue head in global memory (q.head in Figure 2) to atomicAdd, we pass the local queue head defined in shared memory (localQ_head in Figure 6).

__shared__ int localQ[MAX_SIZE];
__shared__ int localQ_head;
__shared__ int globalQ_index;
// put vertexID into the local queue
int index = atomicAdd(&localQ_head, 1);
localQ[index] = vertexID;
// Master thread in a thread block obtains the index of global queue
// where the local queue contents need to be copied to
if (threadIdx.x==0) globalQ_index = atomicAdd(q.head, localQ_head);
// Copy the local queue contents to the global queue...
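To make the steps above concrete, here is a minimal sketch of a complete kernel built around the fragment in Figure 6. It assumes each thread produces at most one vertex, read here from a hypothetical `frontier` array; the kernel name, parameter names, and `MAX_SIZE` value are illustrative, not from the original source. Note the `__syncthreads()` barriers: one is needed before the master thread reads `localQ_head` (so all appends are visible), and another before the cooperative copy (so `globalQ_index` is visible to every thread).

```cuda
#define MAX_SIZE 1024

__global__ void flushLocalQueue(const int *frontier, int n,
                                int *gQ, int *g_head)
{
    __shared__ int localQ[MAX_SIZE];
    __shared__ int localQ_head;
    __shared__ int globalQ_index;

    // Master thread initializes the local queue head.
    if (threadIdx.x == 0) localQ_head = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int vertexID = frontier[tid];
        // Cheap shared-memory atomic: contention stays within the block.
        int index = atomicAdd(&localQ_head, 1);
        localQ[index] = vertexID;
    }
    __syncthreads();

    // One global-memory atomic per block reserves a contiguous slot
    // in the global queue for the whole local queue.
    if (threadIdx.x == 0)
        globalQ_index = atomicAdd(g_head, localQ_head);
    __syncthreads();

    // All threads cooperatively copy the local queue into that slot.
    for (int i = threadIdx.x; i < localQ_head; i += blockDim.x)
        gQ[globalQ_index + i] = localQ[i];
}
```

The design replaces per-thread atomics on `q.head` in global memory with per-thread atomics on `localQ_head` in shared memory, so only one global atomic is issued per thread block rather than one per thread.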