Wednesday, April 24, 2013

Sharing a (nVidia) GPU between OpenMP Threads

In case you couldn't tell from the title, this will be somewhat code-heavy.

I was thinking about one of my problems at work and decided it would be worthwhile to know whether or not I could share a GPU between different threads using OpenMP. I thought it should be possible, since the threads would all share the same context (I'm still not entirely sure what that means, but I do know that MPI processes do not share a context, and that they can't share a GPGPU on pre-Kepler hardware).

A couple of disclaimers: I only write in CUDA when I'm doing GPGPU work, which I know ties me to nVidia hardware, but that's the way it is. Also, this code has no error checking, because it was a quick and dirty test this evening. I'm pretty sure I spent more time writing this blog post than I did writing the code.

At work I have access to most compilers as well as a GTX 680 (in my workstation, it's sweet), some K10s, a K20, a system where I can get up to 4 nodes with each node having one M2050 (Fermi), and a system with multiple M2070s per node (Fermi). At home though, well, I'll let some of the device properties speak for themselves.

 
==== Device 0 ====
                  name: GeForce GTS 450
    concurrent kernels: T
                   SMP: 4
   Max Threads per SMP: 1536
 Max Threads per Block: 1024
            Global Mem: 1.00 G
               Compute: 2.1
  Asynchronous Engines: 1

So yeah, not the most powerful GPU out there, but it has compute capability 2.1, which was why I bought it. It only cost ~$65 at the time. I think my next one might be in the GTX Titan family, but that is a different post.

As an aside, if anybody wants the code that generates the above query, it is here.
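
If you just want the gist of it, the heart of that query is nothing more than a loop over cudaGetDeviceProperties(). A minimal sketch (the labels match my output above, not official field names) looks something like this:

    #include <stdio.h>

    int main(void){
        int nDevices = 0;
        cudaGetDeviceCount(&nDevices);

        // Print a handful of properties for every device the runtime can see.
        for (int i = 0; i < nDevices; i++){
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);

            printf("==== Device %d ====\n", i);
            printf("                  name: %s\n", prop.name);
            printf("    concurrent kernels: %c\n", prop.concurrentKernels ? 'T' : 'F');
            printf("                   SMP: %d\n", prop.multiProcessorCount);
            printf("   Max Threads per SMP: %d\n", prop.maxThreadsPerMultiProcessor);
            printf(" Max Threads per Block: %d\n", prop.maxThreadsPerBlock);
            printf("            Global Mem: %.2f G\n",
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
            printf("               Compute: %d.%d\n", prop.major, prop.minor);
            printf("  Asynchronous Engines: %d\n", prop.asyncEngineCount);
        }
        return 0;
    }

Compile that with nvcc and it walks every device on the machine.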

The first thing I decided I needed was a kernel that could waste some serious time, so that one thread's kernel couldn't possibly finish before the launch overhead of the next one had even started. I didn't want to deal with allocating and freeing memory, though, so I came up with a very simple and useless kernel:
 
    __global__ void wasteTime(int nIter){
        int j;
        float data = 1.0;
        // The result is never stored anywhere; the loop just keeps the GPU busy.
        for (j = 0; j < nIter; j++){
            data = cos(data) + sin(data);
        }
    }

Next, I set up a simple program to call the kernel from a bunch of different threads.
 
    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            wasteTime<<<1,128>>>(nIter);
        }
    }

I created this project using nVidia Nsight Eclipse Edition (I'm running on the Linux box), which has nvvp (a handy-dandy GPGPU profiler) built in, so I can easily see whether I was able to overlap my kernel calls. Here is a profile of that first attempt:
Figure 1.
So the good news is that my kernel takes more than half a second, which makes it a supreme time waster for this situation. Also, at first glance, it appears as though I have 3 kernels overlapping. First glance is wrong, though, and my initial hint that it was wrong is that I have 4 cores on my CPU, so I should see 4 kernels if everything were working correctly. Looking a bit closer, I'm pretty sure there is actually only 1 kernel running there. To check, I put a printf statement inside the parallel section; if more than 1 thread is running, I should see more than one output line.

    #include <stdio.h>   // for printf

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            wasteTime<<<1,128>>>(nIter);
            printf("done.\n");
        }
    }

Figure b.

Hmm, only one 'done.' That's a problem. And then I remembered: I have to pass in some flags to make sure that it actually compiles the #pragma omp parallel statement. I didn't see any warnings about not being able to handle the #pragma statements, which I usually get. That may have something to do with nvcc, or it may have something to do with me only checking whether the build succeeded and not looking through the console output. It took me a bit to figure out where the flags go when using Nsight Eclipse Edition (I often roll my own Makefiles), but I'll give you a couple of screenshots so that you can easily find them if you do the same thing. You can get the dialog below by selecting 'Project' -> 'Properties'.
Figure iii.
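
For anyone rolling their own Makefile instead of using the IDE, the equivalent is to forward the OpenMP flag to the host compiler. I haven't checked exactly what command Nsight generates, but something along these lines should do it (assuming gcc as the host compiler):

    # -Xcompiler hands -fopenmp to the host compiler; -lgomp links the OpenMP runtime
    nvcc -Xcompiler -fopenmp ompTest.cu -o ompTest -lgomp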

Note: I only opened the dialog once; I did not have 2 dialogs stacked on top of each other, I just thought I would make that clear. I then had to go into my Debug folder and run 'make clean' (though there was probably another way to do this), rebuild, and reprofile, yielding the following results:

Figure we're getting closer.

Well, we got 'done.' to show up 4 times, and I can see the 4 individual threads in the profiler as well. The bad news is that there is absolutely no overlap of the time-wasting kernels. There should be plenty of room on the GPU; after all, there are 4 SMPs, each one can handle 1.5k threads, and I've already established that the card can handle concurrent kernels.

And that's when I remember it: streams. At least, I think that is the answer. Basically, all of the threads are launching into the GPU's default stream (which appears to be stream 2 according to the profiler), and work in a single stream always executes in order; on top of that, the default stream doesn't behave as asynchronously as other streams. To get concurrent kernels, I have to have each thread launch into its own stream. That should be easy enough.

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            // Declared inside the parallel region so each thread gets its own stream.
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            wasteTime<<<1,128, 0, stream>>>(nIter);
            cudaStreamDestroy(stream);
            printf("done.\n");
        }
    }

Figure Whoo popcorn superposition (kernel overlap)!

I've collapsed the CPU threads since they aren't all that important for what I want to test. However, as we can see, stream 7 and stream 10 are both executing at the same time. The red line in there is apparently overhead directly associated with the profiler, and I may look into that some more. I also ran (from the Debug directory) /usr/bin/time -p ./ompTest, which gave me an elapsed time of 1.07 seconds. That is obviously more than 1x the time to run wasteTime(), but less than 2x, so the profiler might be messing up some other stuff in there too.
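
If I wanted a sanity check on the timing without leaning on the profiler, one option would be to synchronize each stream and time the whole parallel region with omp_get_wtime(). This is just a sketch of that idea, not what's in the linked source file:

    #include <stdio.h>
    #include <omp.h>

    // Same useless time-wasting kernel as above.
    __global__ void wasteTime(int nIter){
        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + sin(data);
        }
    }

    int main(int argc, char *argv[]){
        int nIter = 100000;
        double start = omp_get_wtime();

    #pragma omp parallel
        {
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            wasteTime<<<1,128, 0, stream>>>(nIter);
            // Wait for this thread's kernel to finish so the wall-clock
            // time below actually includes the GPU work.
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        }

        printf("elapsed: %f s\n", omp_get_wtime() - start);
        return 0;
    }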

Finally, the final version of my (somewhat small) source code file is here. If you've made it this far, hopefully this post has been of some use to you.
