Tuesday, April 30, 2013

CUDA - Figuring out how many threads have launched Part 1

I've talked about multiple threads sharing a single GPU once before. That got me thinking, though: what if I don't want all of them using the GPU simultaneously? What if the GPU actually becomes the bottleneck in my computational code? In all likelihood, anything I implement on a GPU will also have a CPU implementation somewhere, because I would like my code to run on any system. If it can run on an accelerated system, all the better.

Now that I've laid out my motivation, it's time to figure out what I can do about it. It is pretty simple to figure out how much of the device's memory has been utilized with the function cudaMemGetInfo(). But what about the number of blocks or threads? Now, it may be that my Google-Fu is weak, but I couldn't find a built-in way to determine the current number of threads running, so I decided to see if I could build one myself.
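For completeness, a minimal sketch of that memory query (no error checking, in keeping with the rest of this post):

    #include <cstdio>

    int main(void){
        size_t freeBytes, totalBytes;
        cudaMemGetInfo(&freeBytes, &totalBytes);  // current free and total device memory, in bytes
        printf("%zu of %zu bytes free\n", freeBytes, totalBytes);
        return 0;
    }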

I started off using the same code I used last time, though I modified my wasteTime routine so that it would cut back a bit on the registers used. The first iteration of the new wasteTime function looks like this:

    __global__ void wasteTime(int nIter, int *threads){

        atomicAdd(threads, 1);              // count this thread as launched

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){        // just burn some time on the GPU
            data = cos(data) + data;
        }
    }

In the main function, I got rid of the OpenMP (for now) to focus on getting this call right.

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;
        int *threads;
        cudaMallocHost(&threads, sizeof(int));  // pinned host memory the kernel can write to

        *threads = 0;

        dim3 grid(2, 1, 1);
        dim3 block(4, 4, 4);                    // 2 blocks of 64 threads each

        cudaStream_t stream;

        cudaStreamCreate(&stream);

        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);

        printf("%d threads started\n", *threads);
        cudaStreamDestroy(stream);
        printf("%d threads started\n", *threads);

        cudaFreeHost(threads);                  // pinned allocations pair with cudaFreeHost, not cudaFree
        return 0;
    }

A few notes before I run this:
  1. The threads pointer in main is the same threads pointer in the CUDA code; this works via pinned memory on the host and direct memory access (DMA) from the GPU to the host (see the sketch after these notes for the more explicit way to set that up).
  2. DMA from the device to the host is incredibly expensive.
  3. DMA is not available for all compute cards, though it is for all Fermi-based cards and above, and honestly, Fermi is old enough now that you should be able to assume that if people are going to accelerate their code, they'll have it.
  4. Atomic operations are expensive because they get serialized.
  5. I have changed the grid and block dimensions for launch so that they can be a bit more general. I get the feeling I'll be using this general project as a framework for a bunch of different tests. If I don't, no biggie.
  6. The source code up to this point is here.
I've left in the streams from last time because I want to parallelize this later.
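For notes 1 through 3: on a 64-bit system with a compute 2.0+ card, unified virtual addressing is what lets the kernel dereference the cudaMallocHost pointer directly, and that is what I'm leaning on here. The more explicit (pre-UVA) way to set up a host buffer the GPU can write into is zero-copy mapped memory; a rough sketch, with the variable names chosen just for illustration:

    int *h_threads, *d_threads;
    cudaSetDeviceFlags(cudaDeviceMapHost);                 // must happen before the context is created
    cudaHostAlloc((void**)&h_threads, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_threads, h_threads, 0);
    *h_threads = 0;
    // pass d_threads to the kernel; the host keeps reading h_threads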

Building and running (still on my 4-core AMD with a GTS 450), I get an answer of:

0 threads started
71 threads started

Hmm, interesting. Doing some quick math: we have 2 blocks, and each block has 4*4*4 = 64 threads, for 128 in total. The first time the printf statement is called, we haven't actually launched any threads yet, or at least they haven't incremented the counter yet. We then destroy the stream, which I had thought would synchronize the device and the host, and when we call printf again, we only have 71 threads: more than 1 block, but not fully 2.

Now we'll go back and synchronize the device to the host. This is a simple change: add a call to cudaDeviceSynchronize() right after the second printf call, and put another printf right after that to see how things have changed. Also, I've modified the wasteTime function and given it an atomicAdd(threads, -1); at the end, so that instead of seeing how many threads have started, I can see how many threads are currently running. I've also modified the printf statements to reflect this. The code for the new version is here.
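For reference, the relevant changes look roughly like this (a sketch, not necessarily line-for-line what is in the linked file):

    __global__ void wasteTime(int nIter, int *threads){

        atomicAdd(threads, 1);          // this thread is now running

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        atomicAdd(threads, -1);         // ...and now it isn't
    }

And at the end of main:

        printf("%d threads running\n", *threads);
        cudaStreamDestroy(stream);
        printf("%d threads running\n", *threads);
        cudaDeviceSynchronize();        // block the host until the kernel actually finishes
        printf("%d threads running\n", *threads);

The results from 3 separate runs are: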

ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
113 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
77 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
68 threads running
0 threads running

As you can see, there is a fair amount of variability in the number of threads "caught" running by the middle printf, which sits between the stream destroy and the device synchronize.
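If I wanted more than a single snapshot, I could poll the counter before tearing the stream down. A rough sketch (the volatile cast is so the host actually re-reads the pinned value on every pass; as written it will print a lot, so a short sleep inside the loop would be sensible):

    // after the kernel launch, before cudaStreamDestroy()
    volatile int *running = threads;
    while (cudaStreamQuery(stream) == cudaErrorNotReady){   // returns cudaSuccess once the kernel is done
        printf("%d threads running\n", *running);
    }
    cudaDeviceSynchronize();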

I definitely have some more to add to this, such as not having every thread run the atomic functions, and spawning into multiple OpenMP threads as well. I'll be following this post up with some future posts detailing my thought process and approach to solving this problem I may or may not have invented out of thin air.
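For the first of those, the idea is to have a single thread in each block do the bookkeeping for the whole block, so we pay for two atomics per block instead of two per thread (the counter then moves in block-sized steps). A rough, untested sketch:

    __global__ void wasteTime(int nIter, int *threads){
        int blockThreads = blockDim.x * blockDim.y * blockDim.z;
        bool firstThread = (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0);

        if (firstThread)
            atomicAdd(threads, blockThreads);    // count the whole block in at once

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        __syncthreads();                         // wait for every thread in the block to finish the loop
        if (firstThread)
            atomicAdd(threads, -blockThreads);   // and count the whole block back out
    }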

In fact, you may be wondering why I even care about this at all. The reason is that I may be working on a project soon which will have literally thousands of concurrent tasks, each of which may or may not be able to saturate a GPU for a short time. Even if I am running on an accelerated system, it will still have a high CPU-to-GPU ratio on each node. Spawning some of these tasks onto the GPU to free up CPU time will be beneficial, but if I end up stuck waiting for the GPU to crunch through all of the tasks I've given it, because I haven't paid attention to load balancing, that would be the opposite of accelerated functionality.

On a more pragmatic tangent ("why wouldn't you just develop for CPUs and then accelerate with a Xeon Phi?"), the fact that I can do these kinds of tests and experiments at home is one of the nice things about how NVIDIA has set up their system. I can tinker with my low-cost system, then port that code and expertise over to the production systems at work with minimal effort. While I'd love to get my hands on an Intel Xeon Phi and do some work with it (a minimum $2k proposition at the low end, including the fact that my current desktop couldn't run it), I would pretty much have to purchase the Intel compilers to go with it in order to build for it. Contrast that with my $65 video card and free compilers, and the choice isn't that hard at the moment. That, and I find it fun.

Still, let me reiterate that: I would love to get my hands on an Intel Xeon Phi.

Finally, on a non-GPU related note, I want to thank Tod for giving me some space on his server. He set up my account a while ago for D&D stuff, but I don't think he'll mind if I put some code up for others to try out. If you want me to take it down (or actually link to you, other than tod.net) let me know.

Saturday, April 27, 2013

Thoughts on Spring Cleaning

I feel like my wife and I were pretty productive today. First I finally got around to mowing the lawn for the first time this season. There was one point where the grass was so thick and I was moving too fast (at a walk) that I actually stalled the mower. Going over that patch a bit slower let me take it out though.

After that, and lunch, we started in on what we refer to as the office. This is where our computers are and it was in pretty bad shape. Truth be told it still isn't actually clean, but I can actually see the surfaces of both of the desks that make up my corner. I also raised my two monitors (having experienced 2 monitors, I far prefer the increased real estate to a single monitor) up onto a mostly unused shelf that was sitting behind them.

This made an incredible difference for my entire desk. Not only did it make the space they were sitting in usable, it made the mostly unused space behind them (and under the shelf) usable as well. Now I'll just have to get used to the raised position of the monitors. While I am slouching pretty bad right now, for the most part I have found that I have had better posture with them standing higher.

I'm sure this post has been pretty boring so far, so here are some amusing tidbits from our adventures in cleaning.

  1. We found a small LEGO Creator kit. I built the race car, and it is now sitting on the shelf with my monitors; I may or may not have been playing with it all evening, to the annoyance of my wife.
  2. We filled a medium sized trash bin twice with recyclables.
  3. We found a birthday card that was supposed to go to somebody a few weeks ago (technically this was found in the kitchen, and they did get the present at the appropriate time).
  4. We found a hospital bill that was over a year old.
  5. Pants which had a loose button, and had therefore been on the sewing table for over a year, were fixed.
  6. ???
  7. Profit
Now to get ready for the D&D adventure I'm running tonight.

Wednesday, April 24, 2013

Sharing a (nVidia) GPU between OpenMP Threads

In case you couldn't tell from the title, this will be somewhat code heavy.

I was thinking about one of my problems for work and thought that at some point it may be worthwhile for me to know whether or not I could share a GPU between different threads using OpenMP. I thought it was possible, since they would all share the same context (still not sure exactly what that means, but I know that MPI processes do not share the same context, and that they can't share a GPGPU on pre-Kepler hardware).

A couple of disclaimers: I only write in CUDA when I'm doing GPGPU work, which I know ties me to nVidia hardware, but that's the way it is. Also, this code has no error checking, because it was a quick and dirty test this evening. I'm pretty sure I spent more time writing this blog post than I did writing the code.

At work I have access to most compilers as well as a GTX 680 (in my workstation, it's sweet), some K10s, a K20, a system where I can get up to 4 nodes with each node having one M2050 (Fermi), and a system with multiple M2070s per node (Fermi). At home though, well, I'll let some of the device properties speak for themselves.

 
==== Device 0 ===
                  name: GeForce GTS 450
    concurrent kernels: T
                   SMP: 4
   Max Threads per SMP: 1536
 Max Threads per Block: 1024
            Global Mem: 1.00 G
               Compute: 2.1
  Asynchronous Engines: 1

So yeah, not the most powerful GPU out there, but it has compute capability 2.1, which was why I bought it. It only cost ~$65 at the time. I think my next one might be in the GTX Titan family, but that is a different post.

As an aside, if anybody wants the code that generates the query above, it is here.
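The heart of it is just cudaGetDeviceProperties(); a stripped-down sketch along the lines of what generates that output (the formatting here is approximate):

    #include <cstdio>

    int main(void){
        int nDevices;
        cudaGetDeviceCount(&nDevices);
        for (int i = 0; i < nDevices; i++){
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("==== Device %d ===\n", i);
            printf("                  name: %s\n", prop.name);
            printf("    concurrent kernels: %c\n", prop.concurrentKernels ? 'T' : 'F');
            printf("                   SMP: %d\n", prop.multiProcessorCount);
            printf("   Max Threads per SMP: %d\n", prop.maxThreadsPerMultiProcessor);
            printf(" Max Threads per Block: %d\n", prop.maxThreadsPerBlock);
            printf("            Global Mem: %.2f G\n",
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
            printf("               Compute: %d.%d\n", prop.major, prop.minor);
            printf("  Asynchronous Engines: %d\n", prop.asyncEngineCount);
        }
        return 0;
    }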

The first thing I decided I had to do was create a kernel which could waste some serious time, so that there was no chance one thread's kernel would finish before the next thread's launch overhead had even started. I didn't want to deal with allocating and freeing memory, though. I came up with a very simple and useless kernel:
 
    __global__ void wasteTime(int nIter){
        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + sin(data);
        }
    }

Next, I set up a simple program to call the kernel from a bunch of different threads.
 
    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            wasteTime<<<1,128>>>(nIter);
        }
    }

I created this project using the nVidia Nsight Eclipse Edition (I'm running on the Linux box), which has nvvp (a handy-dandy GPGPU profiler) built in, allowing me to easily see whether I was able to overlap my kernel calls. Here is a profile of that first attempt.
Figure 1.
So the good news is that my kernel takes more than half a second, which makes it a supreme time waster for this situation. Also, at first glance, it appears as though I have 3 kernels overlapping. First glance is wrong, though, and my initial hint was that I have 4 cores on my CPU, so I should see 4 kernels if everything were working correctly. Looking a bit closer, I'm pretty sure I actually have only 1 kernel running there. To check, I put a printf statement inside the parallel section. If I do have more than 1 thread running, I should see more than one output line.

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            wasteTime<<<1,128>>>(nIter);
            printf("done.\n");
        }
    }

Figure b.

Hmm, only one 'done.' That's a problem. And then I remember: I have to pass in some flags to make sure that it actually compiles the #pragma omp parallel statement (with gcc as the host compiler, that means passing -fopenmp through nvcc's -Xcompiler option). I didn't see any warnings about not being able to handle the #pragma statements, which I usually get. That may have something to do with nvcc, or it may have something to do with me only checking whether it successfully built, and not looking through the console output. It took me a bit to figure out where the flags go when using Nsight Eclipse Edition (I often roll my own Makefiles), but I'll give you a couple of screenshots so that you can easily find them if you do the same thing. You can get the dialog below by selecting 'Project' -> 'Properties'.
Figure iii.

Note: I only opened the dialog once; I did not have 2 dialogs one on top of the other, I just thought I would make sure that was clear. I then had to go into my Debug folder and call 'make clean' (though there was probably another way to do this), rebuild, and reprofile, yielding the following results:

Figure we're getting closer.

Well, we got the 'done' to show up 4 times, and I can see the 4 individual threads in the profiler as well. The bad news is that there is absolutely no overlap of the time-wasting kernels. There should be plenty of room on the GPU: after all, there are 4 SMPs, each one can handle 1.5k threads, and I've already established that the card can handle concurrent kernels.

And that's when I remember it: streams. At least, I think that is the answer. Basically, all of the threads are calling into the GPU's default stream (which appears to be stream 2 according to the profiler). Not only that, the default stream doesn't do things quite as asynchronously as other streams. To get concurrent kernels I have to have each of the threads call into a different stream. That should be easy enough.

    int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;

    #pragma omp parallel
        {
            cudaStream_t stream;    // declared inside the parallel region so each thread gets its own
            cudaStreamCreate(&stream);
            wasteTime<<<1, 128, 0, stream>>>(nIter);
            cudaStreamDestroy(stream);
            printf("done.\n");
        }
    }

Figure Whoo popcorn superposition (kernel overlap)!

I've collapsed the CPU threads as they aren't all that important for what I want to test. However, as we can see, stream 7 and stream 10 are both executing at the same time. The red line in there is apparently overhead directly associated with the profiler, and I may look into that some more. I also ran /usr/bin/time -p ./ompTest (from the Debug directory), which gave me an elapsed time of 1.07 seconds. That is obviously more than 1x the time to run wasteTime(), but less than 2x, so the profiler might be messing with some other stuff in there too.

Finally, the final version of my (somewhat small) source code file is here. If you've made it this far, hopefully this post has been of some use to you.

Saturday, April 20, 2013

Thoughts on exercise

I don't get to exercise as much as I would like to. Or rather, I don't make as much time for exercise as I ought to. I enjoy exercise in general. In the past I have enjoyed soccer, ultimate, and even a bit of basketball (though I cannot claim to ever have had much skill in that last one). For solo exercise, I actually somewhat enjoy using an elliptical, and I think that I would like to run at least a half-marathon at some point. Maybe I should make myself an official 'bucket list', but that is not what this post is about.

The primary problems for me are motivation and time management. In terms of motivation, I almost always enjoy the exercise while I am doing it, and I typically feel pretty good afterward as well. For some reason, though, it can be hard to drag myself out of bed or away from the computer to actually start the activity.

The issue of time management has become more pronounced lately. I used to hit the gym on Monday, Wednesday, and Friday mornings. However, about a month ago we got a dog. What this means, though, is that I typically take him on a walk every morning and evening, which takes about 20 minutes. So while I do get some exercise every day, it is not very intense. I have considered running a bit on these walks, and I may start doing that.

Another option is to start going to the gym in the evenings, but that takes time away from my family. So this post comes to an end without any real resolution, but rather just some thoughts which have been bouncing around my head lately.

Thursday, April 11, 2013

Thoughts on loading ... loading ... loading ... done

Almost all of the coding that I do is command line based. I work in an environment where pretty much everything has to be accessed via the command line, and even when we do work with a GUI, it typically is to set up a command line string to be passed to a system call. That is just the nature of the work when you operate in a batch environment.

Anyway, back to my point: sometimes I have to watch a screen doing some boring stuff, like telling me how many of my 12k files have been copied over to the work space before the job can start. For things like this I would kind of like a more interesting screen than:

100 of 12224 files copied
200 of 12224 files copied
300 of you get the picture.

I think I may start to work on a library (written in C, probably using ncurses) to do some text-based animations. There are a few interesting ones in a thread over at Stack Overflow which could be kind of neat, but few of them actually give information about the progress of the process.

I had some ideas, like a bulge going through a pipe and filling up a bucket, along with some information about the progress.  An example might look something like:

 
---------/ \----\
---------\_/---\ \
             |     |
             |     |
 x of y      |-----|
 items       |     |
             +-----+

Another option would be something along the lines of files flying toward something, maybe a giant mouth that wants to eat it:

 
                           /-----\
               ____       / o     \
        --    /   /       +--\     \
       --    /___/            |    |
________________________  +--/     /
     #.# % done           \       /
     x of y items          \-----/


Another one which I won't draw out here at all, but I thought was funny was a cow which eats some grass on one side of the screen, then moves over to the other side of the screen and, um, deposits the grass in a slightly altered form.

I figure the API would look something along the lines of:
 
    int pipe(int progress, // the number of items / tasks which have finished
             int max,      // the total number of items to be done
             int substeps, // the number of steps to generate for an animation
             const char* itemName); // replaces 'items' above

Of course, I am by no means tied to this API, and it may even make sense to make the library in C++ so that a single animation could be considered an object, making it easier to have state.
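To make the intent concrete, here is roughly how I picture the call site (copyFile() is just a stand-in for the real work, and pipe() is the hypothetical function above, which should probably get a name that doesn't collide with the POSIX pipe()):

    int total = 12224;
    for (int i = 0; i < total; i++){
        copyFile(i);                    // stand-in for the real work
        pipe(i + 1, total, 4, "files"); // redraw the animation after each item
    }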

Finally, while this was inspired by something at work, it is by no means appropriate for me to develop it while at work. Though if I do get it developed on my own time, and like the results, I have no problem with incorporating it into my projects at work. I'm open to other ideas for animations, the API, or any other form of help/support.

Sunday, April 7, 2013

Thoughts on running a D&D adventure

Again, not a coding post, but I didn't feel like working on one of those tonight. I'd rather talk a bit about the D&D adventure I ran last night.

So I'm the DM for a (generally) biweekly gaming group.  We are spread out geographically, but all in the Eastern time zone. We use a couple of tools for communication, mostly Skype for voice and roll20 for map-support. Anyway, one of the members of my group is a paladin for Garl Glittergold. The group is at level 4 now, and I decided that Glittergold (also known as Watchful Protector, The Joker, Priceless Gem, and Sparkling Wit) has all of his paladins undergo a quest at around this point in their development.

As it turns out I also knew that the group was going to head through a portal near the very beginning of this particular adventure. They decided to all jump through simultaneously, which made it really easy. "As you all are jumping into the portal, you notice it change from a glowing blue to a glowing yellow."

They found themselves standing in a 30' x 30' room, with a 10' wide ethereal glowing scroll on the south side of the room that looked something like this:
They then saw that the northern "wall" was more like a shimmering, glowing ethereal curtain. When they walked through that, they found themselves in a 50' x 50' room where the north, east, and west "walls" were all ethereal curtains. The north wall and the west wall were glowing; the east wall was dark. Essentially they had walked into a giant 5 x 5 grid, where each tile was 50' x 50'. If they followed a path laid out by the glowing light, it would eventually take them to a room where all of the walls were either stone or dark, but in that room would be a Large-sized lantern archon (I know they aren't Large size, but Glittergold enlarged this one; that's my story and I'm sticking to it) in the northeast corner of the room. Not that any of my PCs have Knowledge (the planes), but an NPC cleric with them did.

Anyway, when any body part of the party got into the same square as part of the archon, it said "Eeee" and left, leaving 2 new walls glowing.  The party had found the puzzle.

Following the archon, the players found that it repeated a pattern of positions. Actually, they solved the function of this part of the puzzle before it got through the pattern the first time. The pattern was:

Black lines indicate rooms, faint blue lines indicate 10' squares, and yes, that is a screen capture of Excel. While the paladin was investigating the 2nd position, they kicked all of the non-gnomes out of that "room". Around this time they also discovered they could ask the archon questions, and it would answer, though not always in a useful manner. For example: Q: "When is the puzzle over?" A: "When you have solved it or have given up." 

The non-gnomes headed south, where one of them noticed some very faint lines on the ground, lines which formed an "F". They then moved to the room (marked above with a 9 and a 6) where they found the lines associated with the letter L. It was at this point that one of my players very quickly pieced together that the rooms correlated directly with the alphabet, starting in the upper left and reading like a book. He then told the other party members to have the archon move. When they had gotten to the 4th position, they were trying to figure out what "HATO" meant. They finally asked the archon if its position within each room mattered, to which it responded with an actually helpful "yes".

From that they quickly recalled where it had shown up the previous 3 times, and generated the string "HeAdToOv", where the capital letter is the room, and the lowercase letter is the location within the room. They eventually got the string "HeAdToOvErLoOkHiLl" which is pretty obviously "Head to Overlook Hill". 

There were some other small things which they thought might have been important (only the rooms with the letters C, D, F, I, L, O, Q, T, U, X, and Y had glowing letters on the floor), and they seemed convinced that the poem itself had some sort of clue to solving the riddle. Of course, the full puzzle isn't over yet. They don't even know where Overlook Hill is. And, as the last thing they did, they all stepped back into a portal, not knowing what lies on the other side >=).

If I were to run it again, I think I would have tracked how many questions the group asked the archon, and once they had asked 20, the only way the archon would respond is "I have already answered 20 questions". Another option might be to have the archon only answer yes/no questions. One of my favorite exchanges was "Are your movements spelling out a word?", to which the archon answered "no". This was truthful because the archon's movements were spelling out multiple words.

So there you have it. Feel free to drop this puzzle into a game if you are a GM; it can fit in quite a few places whenever you want your players to spend some time working out a message. Heck, if you have a paladin of Garl Glittergold in your group and want to use the scroll and the cheesy poem, feel free.

Thursday, April 4, 2013

Thoughts on Connections


So this is non-alcoholic beer, right?
This is another blog post about something not related to computers or code, though I've got a couple of those posts kicking around in my head. I'll try to get one up this weekend, but no promises.

So a couple of days ago my wife shared a photo by Elise Andrew on her Facebook page (IFLS, I'm going to try not to swear here). Rather than that post, I'll link what she linked, which is here. Yes, that is an awesome looking jet pack on a toddler. You could probably say that the link is relevant to my interests.


Anyway, we decided that we wanted to do it too. I mean, it's awesome, why wouldn't we want to make one? But to make one we would need some 2-liter bottles. We don't just buy 2-liter bottles, though; we buy soda in 2-liter bottles. I decided to buy some of the soda seen at right.

Now, this soda is fairly tasty on its own. But it can be improved, and quite simply. "How?" you may ask, but I believe that you, dear reader, already know the answer.

If you have not figured it out by the time you read this line, then the images are probably broken. But yes, we added vanilla ice cream to our already tasty root beer so that we might consume our root beer floats.


Tastiness ensued. And that is how a toddler rocket-pack craft project turned into a root beer float.

Monday, April 1, 2013

Thoughts on April 1st

I'm not a big April Fool's Day prankster. My wife has pulled off a few good ones in the past, and I have sometimes helped with the wording and/or the initial idea, but for some reason I never seem to come up with any myself. Also, at this point I don't think I could really pull them on my children. One of them wouldn't even know what is going on, and the other takes things so literally that he doesn't get the idea of a practical joke yet.

But this does raise a question (note: it does not beg a question) for me. Why do we pull pranks on people in the first place? I feel like the purpose of pranks is to make people look foolish, silly, and/or stupid. I have realized for a long time that I am not a fan of schadenfreude. Laughing at other people's misfortunes is not for me. People who know me in real life can also attest that I'm not really a fan of sitcoms because much of the humor is derived from people being in awkward situations. I wouldn't want to be in that situation, why would I laugh at somebody who is in that situation?

It may very well be that I don't know how to take a joke. When I look at what I do find funny, I gravitate toward puns, some physical comedy, and general silliness. Perhaps interestingly, I'm typically not good at crafting the type of humor I enjoy.

I often tell people that I don't enjoy movies or TV shows with plot. However, I can watch documentaries, sports, stand-up comedy, and game shows very easily. I seem to have a dislike for tension build-up. Unfortunately, tension build-up seems to be a requirement for drama. This is in no way a criticism of drama, as obviously people have enjoyed it for thousands of years, but more of an introspective on myself.

This dislike for tension, of course, generates tension of its own. A common social activity is to watch a movie, either in the theater or on a couch at home. If I am always against watching movies, that doesn't bode well for interacting well socially, and this has created problems in the past, and will likely cause problems in the future.

Maybe I'll come to appreciate tension build-up, hoaxes, pranks, and the like. Maybe I'm just a fool.