Tuesday, April 30, 2013

CUDA - Figuring out how many threads have launched Part 1

I've talked about multiple threads sharing a single GPU once before. That got me thinking, though: what if I don't want all of them using the GPU simultaneously? What if the GPU actually becomes the bottleneck in my computational code? In all likelihood, anything I implement on a GPU will probably also have a CPU implementation somewhere, because I would like my code to run on any system. If it can run on an accelerated system, all the better.

Now that I've laid out my motivation, it's time to figure out what I can do about it. It is pretty simple to figure out how much of the device's memory has been utilized with the function cudaMemGetInfo(). But what about the number of blocks or threads? Now, it may be that my Google-Fu is weak, but I couldn't find a built-in way to determine the current number of running threads, so I decided to see if I could build one myself.
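Just for reference, the memory query is a one-liner. A minimal sketch (error checking omitted; build it with nvcc so the CUDA runtime header is pulled in automatically):

    #include <stdio.h>

    int main(void){
        size_t freeBytes, totalBytes;

        // Reports the free and total device memory, in bytes.
        cudaMemGetInfo(&freeBytes, &totalBytes);

        printf("%zu of %zu bytes free\n", freeBytes, totalBytes);
        return 0;
    }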

I started off using the same code I used last time, though I modified my wasteTime routine so that it would cut back a bit on the registers used. The first iteration of the new wasteTime function looks like this:

    __global__ void wasteTime(int nIter, int *threads){
        // Count this thread as launched. Atomic, since every thread
        // in every block is hitting the same counter.
        atomicAdd(threads, 1);

        // Spin on some floating point math to keep the thread busy.
        // Note that data is never written anywhere, so an optimizing
        // build could drop this loop; it survives in a Debug build.
        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }
    }

In the main function, I got rid of the OpenMP (for now) to focus on getting this call right.

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

        // Pinned host memory: the GPU can write to this directly over DMA.
        int *threads;
        cudaMallocHost(&threads, sizeof(int));

        *threads = 0;

        dim3 grid(2, 1, 1);
        dim3 block(4, 4, 4);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // The launch is asynchronous; control returns to the host immediately.
        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);

        printf("%d threads started\n", *threads);
        cudaStreamDestroy(stream);
        printf("%d threads started\n", *threads);

        // Pinned memory is released with cudaFreeHost, not cudaFree.
        cudaFreeHost(threads);
    }
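For what it's worth, if you want to build this outside of an IDE, something along these lines should work (assuming the source file is named runningThreads.cu; sm_21 matches the GTS 450's compute capability):

    nvcc -arch=sm_21 -o runningThreads runningThreads.cu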

A few notes before I run this:
  1. The threads pointer in main is the same threads pointer in the CUDA code; this works via pinned memory on the host and direct memory access (DMA) from the GPU to the host.
  2. DMA from the device to the host is incredibly expensive.
  3. DMA is not available on all compute cards, though it is on all Fermi-based cards and above, and honestly, Fermi is old enough now that you can assume anyone who is going to accelerate their code will have it. (There's a sketch below of how to set the mapping up explicitly on older cards.)
  4. Atomic operations are expensive because they get serialized.
  5. I have changed the grid and block dimensions for launch so that they can be a bit more general. I get the feeling I'll be using this general project as a framework for a bunch of different tests. If I don't, no biggie.
  6. The source code up to this point is here.
I've left in the streams from last time because I want to parallelize this later.
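As an aside on note 3: on older cards, or any time you don't want to rely on unified addressing, you have to set up the host-to-device mapping explicitly. A rough sketch, where devThreads is just my name for the device-side alias you'd hand to the kernel:

    int *threads, *devThreads;

    // Must be set before any CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned host memory, explicitly mapped into the device address space.
    cudaHostAlloc(&threads, sizeof(int), cudaHostAllocMapped);

    // Device-side pointer to the same allocation; pass this to the
    // kernel instead of the host pointer.
    cudaHostGetDevicePointer(&devThreads, threads, 0);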

Building and running (still on my 4-core AMD with a GTS 450), I get:

0 threads started
71 threads started

Hmm, interesting. Doing some quick math: we have 2 blocks, and each block has 4*4*4 = 64 threads, for 128 total. The first time printf is called, no threads have actually launched yet, or at least none of them have incremented the counter in host memory yet. We then destroy the stream, which I had thought would synchronize the device and the host, but when we call printf again we only see 71 threads: more than 1 block, but not fully 2.
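(It turns out cudaStreamDestroy doesn't block the host: if work is still queued, the call returns immediately and the stream's resources are released once the device finishes.) If I wanted the count to be final at that point, one option would be to block on the stream explicitly before reading the counter, something like:

        // Block the host until everything queued on the stream has finished;
        // only then is the counter guaranteed to have its final value.
        cudaStreamSynchronize(stream);
        printf("%d threads started\n", *threads);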

Now we'll go back and synchronize the device with the host. This is a simple change: add a call to cudaDeviceSynchronize() right after the second printf call, with another printf right after that to see how things have changed. I've also modified the wasteTime function, giving it an atomicAdd(threads, -1); at the end, so that instead of seeing how many threads have started, I can see how many threads are currently running, and I've updated the printf statements to reflect this. The code for the new version is here.
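Spelled out, the modified kernel and the tail of main look roughly like this (reconstructed from the description above, so treat it as a sketch):

    __global__ void wasteTime(int nIter, int *threads){
        // Count this thread in as it starts...
        atomicAdd(threads, 1);

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        // ...and back out as it finishes.
        atomicAdd(threads, -1);
    }

and in main:

        printf("%d threads running\n", *threads);
        cudaStreamDestroy(stream);
        printf("%d threads running\n", *threads);

        // Block until the device has finished everything; by now every
        // thread should have decremented the counter back to zero.
        cudaDeviceSynchronize();
        printf("%d threads running\n", *threads);

The results from 3 separate runs are: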

ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
113 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
77 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
68 threads running
0 threads running

As you can see, there is a fair amount of variability in the number of threads "caught" running by the middle printf, which sits between the stream destroy and the device synchronize.

I definitely have more to add to this, such as not having every thread run the atomic functions (sketched below), and spawning multiple OpenMP threads as well. I'll follow this post up with future posts detailing my thought process and approach to solving this problem I may or may not have invented out of thin air.
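For the first of those, the idea would be to have only one thread per block touch the counter, adding and subtracting the whole block's thread count in a single atomic each. A rough, untested sketch:

    __global__ void wasteTime(int nIter, int *threads){
        int blockThreads = blockDim.x * blockDim.y * blockDim.z;
        bool leader = (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0);

        // One atomic per block instead of one per thread.
        if (leader) atomicAdd(threads, blockThreads);

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        // Make sure the whole block is done before the leader
        // subtracts everyone back out.
        __syncthreads();
        if (leader) atomicAdd(threads, -blockThreads);
    }

The trade-off is granularity: the counter now only moves in whole-block increments, which for this purpose is probably fine.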

In fact, you may be wondering why I even care about this at all. The reason is that I may soon be working on a project that will have literally thousands of concurrent tasks, each of which may or may not be able to saturate a GPU for a short time. Even if I am running on an accelerated system, it will still have a high CPU-to-GPU ratio on each node. Spawning some of these tasks onto the GPU to free up CPU time will be beneficial, but if I end up stuck waiting for the GPU to crunch through every task I've given it because I haven't paid attention to load balancing, that would be the opposite of acceleration, now wouldn't it?

On a more pragmatic tangent (the "why wouldn't you just develop for CPUs and then accelerate with a Xeon Phi?" question): one of the nice things about how NVIDIA has set up their ecosystem is that I can do these kinds of tests and experiments at home. I can tinker with my low-cost system, then port that code and expertise over to the production systems at work with minimal effort. While I'd love to get my hands on an Intel Xeon Phi and do some work with it (a minimum $2k proposition at the low end, given that my current desktop couldn't even run one), I would pretty much have to purchase the Intel compilers to go with it in order to build for it. Contrast that with my $65 video card and free compilers, and the choice isn't hard at the moment. That, and I find this fun.

Still, let me reiterate that: I would love to get my hands on an Intel Xeon Phi.

Finally, on a non-GPU related note, I want to thank Tod for giving me some space on his server. He set up my account a while ago for D&D stuff, but I don't think he'll mind if I put some code up for others to try out. If you want me to take it down (or actually link to you, other than tod.net) let me know.
