Lecture 13 Notes

The fork method is defined in task class shown in figure



    if (threadIdx.x < localQ_head)
        q.Queue[ globalQ_index + threadIdx.x ] = localQ[threadIdx.x];
}

Figure 6. Local distributed queue in CUDA using atomicAdd on shared memory

As mentioned before, the CUDA framework implicitly handles load-balancing in its runtime. This comes at the expense of following the CUDA programming model, in which any results held in local shared memory must be copied to global memory to be reused in subsequent kernels. To follow this model, we copy the local queue contents to global memory in Figure 6. The second atomicAdd() obtains the index into the global queue. However, this mutually exclusive access to global memory is performed only by a master thread in each thread block, which results in very little contention.

Example: Fibonacci number computing

We will use an example that was shown in [1]. The example computes, in parallel, the Fibonacci number of a given number n. As shown in Figure 7 and Figure 8, if this number n is smaller than some threshold, we compute Fibonac...
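To make the local-queue pattern from Figure 6 concrete, here is a minimal, self-contained sketch of how the two atomicAdd() calls might fit together. Only the last two lines of the kernel come from the fragment in these notes. Everything else is an assumption made for illustration: the GlobalQueue struct layout, the kernel name enqueueEven, the "keep even numbers" filter, and the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; also the local queue capacity

// Hypothetical global queue layout (only the name q.Queue appears in the notes).
struct GlobalQueue {
    int *Queue;  // queue storage in global memory
    int *head;   // number of items stored so far
};

// Each thread inspects one input element and enqueues it if it is even.
// The even-number filter is a stand-in for whatever produces work items.
__global__ void enqueueEven(const int *in, int n, GlobalQueue q)
{
    __shared__ int localQ[BLOCK_SIZE];  // per-block queue in shared memory
    __shared__ int localQ_head;         // number of items in localQ
    __shared__ int globalQ_index;       // start of this block's range in q.Queue

    if (threadIdx.x == 0) localQ_head = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0) {
        // First atomicAdd: reserve a slot in the shared-memory queue.
        int slot = atomicAdd(&localQ_head, 1);
        localQ[slot] = in[i];
    }
    __syncthreads();

    // Second atomicAdd: only the master thread touches the global counter,
    // reserving a contiguous range for the whole block.
    if (threadIdx.x == 0)
        globalQ_index = atomicAdd(q.head, localQ_head);
    __syncthreads();

    // Copy the local queue contents to global memory (as in Figure 6).
    if (threadIdx.x < localQ_head)
        q.Queue[globalQ_index + threadIdx.x] = localQ[threadIdx.x];
}

int main()
{
    const int n = 1024;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_queue, *d_head;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_queue, n * sizeof(int));
    cudaMalloc(&d_head, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_head, 0, sizeof(int));

    GlobalQueue q = { d_queue, d_head };
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    enqueueEven<<<blocks, BLOCK_SIZE>>>(d_in, n, q);
    cudaDeviceSynchronize();

    int count = 0;
    cudaMemcpy(&count, d_head, sizeof(int), cudaMemcpyDeviceToHost);
    printf("items enqueued: %d\n", count);

    cudaFree(d_in);
    cudaFree(d_queue);
    cudaFree(d_head);
    return 0;
}
```

In this sketch the global counter sees one atomic operation per block rather than one per item, which is the source of the low contention noted above. The order of items in the global queue is not deterministic, since it depends on which threads and blocks win each atomicAdd().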

This document was uploaded on 03/17/2014 for the course CS 4800 at Northeastern.
