Last time I had determined that I could see how many threads had been launched on my GTS 450. However, each one of those threads was making an atomicAdd() call to a pointer in host memory. Two things that are quite bad.
If I want the host to be able to read what the threads are doing, then I need to keep the memory location in host. However, I don't want to have every single thread updating that memory location. Thankfully, I don't have to. Blocks are loaded onto individual SMPs at once and retired all at once. Not only that, but I can get the block size quite easily from inside the kernel with the blockDim variable. Implementing this change, my new kernel looks like this:
__global__ void wasteTime(int nIter, int *threads){ int blkThreads = 0; if( threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0){ blkThreads = blockDim.x * blockDim.y * blockDim.z; atomicAdd(threads, blkThreads); } int j; float data = 1.0; for (j = 0; j < nIter; j++){ data = cos(data) + data; } if( threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0){ atomicSub(threads, blkThreads); } }
I also made a modification to the main routine after the kernel call to be able to make more updates as to how many threads have been launched on the device:
wasteTime<<<grid, block, 0, stream>>>(nIter, threads); struct timespec delay; delay.tv_sec = 0; delay.tv_nsec = 200000000L; printf("%d threads running\n", *threads); nanosleep(&delay, NULL); while (*threads > 0){ printf("%d threads running\n", *threads); nanosleep(&delay, NULL); } cudaFree(threads);
This also requires a #include <time.h> up at the top of the source. One last change is that I swapped the grid to be 4x4x4, to launch more blocks. The full code is here. Running the code generated the following results:
0 threads running
2048 threads running
2048 threads running
2048 threads running
1024 threads running
As you can see, the first time I called the printf statement, no threads hadn't been spawned. The total number of threads launched was actually (4 x 4 x 4) x (4 x 4 x 4) = 4096 , but I apparently got a maximum number of threads (for this problem and GPU) that could be resident on the GPU as being 2048. Another good thing, however, was that I can see the number of threads decreasing at the right right before it finishes.
There is a small problem in the kernel, especially with small block sizes though. Any time there is a divergence in threads within a warp (set of 32 threads), all of the threads in that warp perform all of the commands for all of the divergent paths with the non-applicable threads having their results masked out. What this means for my code is that threads 1-31 will all wait and twiddle their thumbs while thread 0 goes off and does its DMA atomicAdd and atomicSub. With the size of my block equaling 64, this means that a full half of my threads are affected by the time delay associated with 1/64 of my threads.
I can attack this in a few ways. The first is to do the atomic commands only on a single thread from the entire grid. Another is to set the value of *threads prior to the launch and remove the atomicAdd, but leave in the atomicSub. I think I want to leave the atomicSub at the end of the kernel anyway (and calculate blkThreads within that if statement) because I like seeing individual blocks retired. Changing the kernel to have only block 0,0,0 thread 0,0,0 atomicAdd the number of threads for the grid and then have thread 0,0,0 for each block atomicSub it back down (code here) has the following results:
0 threads running
4096 threads running
3072 threads running
2048 threads running
1024 threads running
The count starts off at 4096 and goes down as the blocks retire. Having the host increment the number of threads prior to launch, though, may be even better. This would allow me to see how many threads have been scheduled for launch, which is what I really want to know for load balancing purposes.
To avoid race conditions I'll have to still use atomicAdd, just from the host, especially if I branch out into threading, where we may have multiple threads trying to do this all at once. Simple right? Well, you can try it, but I'll spoil it, atomicAdd is a "device" function, it can only be called from the device. Obviously the idea of atomics aren't new to CUDA though, so we can look else where.
I could try to use the C++11 standard library atomics, but there are a few but that requires me to create the variables as a std::atomic<int> object. This is problematic because I also need to allocate the memory using cudaMallocHost to get my DMA access. Hmm, well, I can also use intrinsics. GCC has an intrinsic __sync_fetch_and_add. This works in C (which means I can drop it into my runningThreads.cu file), but it is, as I said, intrinsic to GCC, making it less portable. This could bite me later if I want to use the Intel or PGI compilers.
I implemented the increment with __sync_fetch_and_add and changed the loop so that it would launch a new grid (with a total of 3 grids being launched) when there was less than 30% of the threadsPerGrid still on the device. This could just as easily be set based on some percentage of cudaDeviceProp::maxThreadsPerMultiProcessor * cudaDeviceProp::multiProcessorCount. In any case, the code can be found here, and the output was:
16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
20480 threads running
20480 threads running
16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
20480 threads running
20480 threads running
16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
4096 threads running
4096 threads running
Looking pretty good. Next time I'll throw some OpenMP threads into the mix and maybe an idea I just had about a way I could use the C++11 atomics to capture the number of currently launched threads, but I haven't even tried to implement that yet.