if (threadIdx.x < localQ_head)
    q.Queue[globalQ_index + threadIdx.x] = localQ[threadIdx.x];
}

Figure 6. Local distributed queue in CUDA using atomicAdd on shared memory

As mentioned before, the CUDA framework implicitly handles load balancing in its runtime. This comes at the cost of following the CUDA programming model, in which any results held in local shared memory must be copied to global memory before subsequent kernels can reuse them. Accordingly, Figure 6 copies the local queue contents to global memory. The second atomicAdd() obtains the index into the global queue. However, this mutually exclusive access to global memory is performed only by a master thread in each thread block, which results in very little contention.

Example: Fibonacci number computing

We will use an example that has been shown in . The example computes the Fibonacci number of a given number n in parallel. As shown in Figure 7 and Figure 8, if this number n is smaller than some threshold, we compute Fibonac...
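
Returning to the local queue of Figure 6, the two-level scheme can be sketched as a complete kernel. This is a minimal illustration under stated assumptions, not the original figure: the `GlobalQueue` struct, the `head` counter, `BLOCK_SIZE`, the kernel name `enqueueEven`, the "enqueue even values" predicate, and the host helper `countEnqueued` are all hypothetical; only `localQ`, `localQ_head`, `globalQ_index`, and `q.Queue` come from the excerpt above.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; also the local queue capacity

// Hypothetical global queue layout: item storage plus a head counter,
// both living in device global memory.
struct GlobalQueue {
    int *Queue;  // item storage
    int *head;   // number of items enqueued so far
};

// Each thread inspects one input item and enqueues it if it is even.
// The predicate is an arbitrary stand-in for "this thread produced work".
__global__ void enqueueEven(const int *in, int n, GlobalQueue q)
{
    __shared__ int localQ[BLOCK_SIZE];
    __shared__ int localQ_head;
    __shared__ int globalQ_index;

    if (threadIdx.x == 0) localQ_head = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0) {
        // First atomicAdd: reserve a slot in the block-local queue (shared memory).
        int slot = atomicAdd(&localQ_head, 1);
        localQ[slot] = in[i];
    }
    __syncthreads();

    // Second atomicAdd: only the master thread reserves a range in the global queue.
    if (threadIdx.x == 0)
        globalQ_index = atomicAdd(q.head, localQ_head);
    __syncthreads();

    // Copy the local queue into the reserved global range.
    if (threadIdx.x < localQ_head)
        q.Queue[globalQ_index + threadIdx.x] = localQ[threadIdx.x];
}

// Host helper: runs the kernel on the values 0..n-1 and returns the
// number of items that ended up in the global queue.
int countEnqueued(int n)
{
    assert(n > 0);
    int *h_in = new int[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in;
    GlobalQueue q;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&q.Queue, n * sizeof(int));
    cudaMalloc(&q.head, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(q.head, 0, sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    enqueueEven<<<blocks, BLOCK_SIZE>>>(d_in, n, q);

    int count = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&count, q.head, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(q.Queue); cudaFree(q.head);
    delete[] h_in;
    return count;
}
```

In this sketch every enqueuing thread contends only on the shared-memory counter, while the global counter sees exactly one atomicAdd per thread block, which is the source of the low contention described above. Error checking of the CUDA runtime calls is omitted for brevity.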
This document was uploaded on 03/17/2014 for the course CS 4800 at Northeastern.
- Fall '12