Tuesday, April 29, 2014

SCIENCE! continued

This is a continuation/update from my last post.

First off, as I said in the update to the previous post, the overnight time lapse we took to try to capture the dye's progression didn't work out so well. But I did add a new image to that post showing some of the dye appearing on the outside of the unopened blossoms.

The science experiment continues, and it has definitely produced some interesting results. My son's enthusiasm has waned a bit, but I think that is to be expected of a 5-year-old. Of course, today he proudly proclaimed "I have a hypothesis! I think the leaves are going to turn purple!" (The show Dinosaur Train (warning: sound on that web page) has a character that makes lots of hypotheses.)

[Figure: yes our counter is messy, for Science! (hush)]

One of the things that my wife and I found really neat was that the dye seems to be separating out, and we can see some blue highlights. The edges look a bit more purple to me today than they did previously, which is pretty neat. Plus, there now seems to be some of the dye showing up on the interior of some of the petals, where previously it had seemed to be confined to the edges. Also, one of the blossoms that had previously been closed has started to open, which was something I wasn't expecting.

[Figure: this is my favorite picture from this project so far]

Finally, because this flower has just started blooming, I've now set it up with the webcam and ChronoLapse to see if I can get a time lapse video of it opening. I think I may have to add some more water to the vase tomorrow though. It isn't in danger of drying out, but I want to make sure that it won't. I also think we added enough dye the first time that we won't have to add any more even if we double the current volume of water.

Sunday, April 27, 2014

SCIENCE!

I started a science experiment with my older son today.

Let me go back a few days though. One day last week, I was able to get out of work a bit earlier than normal. I decided since I actually had a bit of time before getting home, I would pick up some flowers for my wife, because she's awesome and sometimes I like to get her flowers to let her know she's awesome. They had lilies, which my wife is fond of, so that was the flower of choice.

Today, I remembered, or thought I remembered at least, that if you add food coloring to the water in a vase, then the coloring would eventually end up showing in the flower petals. After asking my wife's permission (they were her lilies after all) I got my older son to help me add some maroon food coloring and a bit of extra water to the vase. I also re-cut the stems, taking about 1 cm off the end of each, in the possibly mistaken belief that this would help speed up the uptake of the new water.

We also started a table to keep track of our observations, seen below. You can tell that I am a terrible speller without a spell checker.

[Figure: teaching my son bad science habits, like using pencil instead of pen]

As you can see from our second observation, we actually started to see some change around the edges of the flower petals in the form of red "tick" marks. Since I now carry a camera around with me everywhere, just like almost everybody else in this country, I decided to take a picture:

[Figure: you know you're looking at sex organs, right?]

It's hard to see in the full picture, so here is a zoomed-in view near the tip of the flower petal on the bottom:

[Figure: my phone takes better pictures than I thought it could, this was simply a crop]

One thing that really surprised me was that the dye first appeared at the tips of the petals. I was expecting to see "rays" of red emanating from the center of the flower. But then, that is why you do the experiment! :)

Now I am running the program ChronoLapse and have set the lily up in front of my webcam to see if I can get a time lapse video out of it that shows something cool, hopefully the pigment either getting darker or spreading out more. If the video turns out well I'll probably post it. However, my webcam is hard to focus and has a resolution of 640 x 480, so cropping won't give me the zoom-in effect like it does on my phone. Still, if dramatic changes do occur, it should be a pretty cool video.

There is also always the possibility that Bigfoot will break into our house to smell the color-changing lilies. Or the two-year-old could become involved.

The best part of all of this? My son is super excited to make observations and see how it changes. Yay science!

UPDATE: The time lapse video didn't show much in the way of color spread. It did, however, show what happens during the sunrise which was pretty cool. The "ticks" on the petal edges are now darker, and we saw some red splotches on the outside of the closed flowers as well.

[Figure: I'm even less witty in the morning]

Thursday, January 23, 2014

An Intersection of Politics and Optimization

Something today jogged my memory about a thought I had a few weeks ago: gerrymandering is one of the few political problems that most people actually agree on. I've seen a few ways to try to combat it, from getting an independent (if you can find one) panel to divvy up a state, to just creating square blocks within a state and calling that good.

I have my own idea, but first I need to set it up so that it makes sense. As I see it, there are a number of competing, and possibly mutually exclusive, objectives when creating districts, and that is before even thinking about the political persuasion of the constituents. These objectives include:
  • Equal population in each district
  • District lines fall on county boundaries
  • Non-gerrymandered
  • Human input into the final decision (nobody likes to be told what to do by a computer)
There may be others, but I think these are a good start. It is almost guaranteed that fully optimizing for one objective will leave the others sub-optimal. As it is, we are currently optimized only on the last point, which we really can't assign an optimization function to anyway.

The first three, however, I think we can assign optimization values to. The first one is pretty easy: for a state of population P with N districts, where the i-th district has population p_i, minimize the value of Sum(|p_i - P/N|). You can use a squared value instead of the absolute value if you want; that's just using the L2 norm instead of the L1 norm. Either way, minimize that sum and the district populations will be equalized. For population data, use census data by precinct, and assume the population is uniform across each precinct (inaccurate, I know, but a good first guess).
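
Just as a minimal sketch of that first objective in plain C (the pops array of district populations is hypothetical, standing in for the census data):

    #include <math.h>

    /* L1 population-equality objective: Sum(|p_i - P/N|).
     * pops[i] is the (hypothetical) population of district i. */
    double populationObjective(const long *pops, int nDistricts){
        long total = 0;                              /* P */
        for (int i = 0; i < nDistricts; i++)
            total += pops[i];
        double target = (double)total / nDistricts;  /* P/N */
        double sum = 0.0;
        for (int i = 0; i < nDistricts; i++)
            sum += fabs((double)pops[i] - target);   /* square this instead for the L2 norm */
        return sum;
    }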

The second one is a bit harder, but I think one way to express it mathematically is to minimize the area which can be drawn between a district boundary and a county line. For a county that is split this way, count only the smaller area (otherwise the county could be cut any way you like). I'm sure others could come up with a way to optimize this that might make a bit more sense.

The third one could be mathematically imposed by minimizing the sum, over all districts, of the difference between each district's convex hull and the district itself. If all of the districts are convex, this sum will be 0, and you will also have a non-gerrymandered state. For states such as Maryland, with its jagged southern border, it would probably be impossible to get this sum to 0, but that's still OK; we just want to minimize it for any given state.
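
In the same sketchy spirit, if you assume the hull and district areas have already been computed from the map geometry (that's the hard part, and a job for a GIS library), the objective itself is trivial:

    /* Compactness objective: Sum(hullArea_i - districtArea_i).
     * Both area arrays are assumed to be precomputed elsewhere. */
    double compactnessObjective(const double *hullArea, const double *districtArea,
                                int nDistricts){
        double sum = 0.0;
        for (int i = 0; i < nDistricts; i++)
            sum += hullArea[i] - districtArea[i];    /* 0 when the district is convex */
        return sum;
    }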

Now, as I said before, these three cannot be completely optimized at the same time, and different people will put different weights on each of the objectives. Some people might believe that population equality is most important, while others might believe that districts following county lines is paramount. This is almost where the human element comes in. Even though people might disagree on what is most important, what can be done before assigning weights is to optimize over all of the objectives at once. What results is something called a Pareto frontier; each point on the Pareto frontier represents a different weighting scheme. Some reasonable limits might also be enforced, maybe that every p_i has to fall between 0.75 * P/N and 1.25 * P/N.
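
To make the Pareto idea concrete, here is a small sketch (the MapScore struct is mine, purely for illustration): a candidate map sits on the frontier exactly when no other candidate dominates it.

    #include <stdbool.h>

    /* One candidate map, scored on the three objectives (lower is better). */
    typedef struct { double fPop, fCounty, fCompact; } MapScore;

    /* a dominates b if a is no worse on every objective and strictly better
     * on at least one. The Pareto frontier is the set of candidate maps
     * that nothing else dominates. */
    bool dominates(MapScore a, MapScore b){
        bool noWorse = a.fPop <= b.fPop && a.fCounty <= b.fCounty &&
                       a.fCompact <= b.fCompact;
        bool strictlyBetter = a.fPop < b.fPop || a.fCounty < b.fCounty ||
                              a.fCompact < b.fCompact;
        return noWorse && strictlyBetter;
    }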

A multi-objective optimization program can then find a bunch of points on the Pareto frontier; in this case, each point is actually a specific map of the state in question. A set of these can then be handed off to a state legislature, or some other deciding body, which can select the map it likes most, bringing the human element back into it.

So there it is: a possible way to reduce gerrymandering while still leaving the final decision in the hands of people, possibly even state legislatures. I even think that this is a non-partisan idea. Also, I'd be happy to hear other (non-partisan) objectives that such a program might want to optimize for. It would be interesting if multiple states did this, to see the relative weights each state chose for its district mapping.

Thursday, January 2, 2014

Goals

So yeah, this thing that I never update.

Interesting thing about last year's posts, besides the fact that I didn't make very many of them: the most popular one was Building gcc 4.8, mostly because it actually got linked in a Stack Overflow answer about the same thing. That was a point of pride for me :).

Looking forward to the future though, I have a few goals. Posting more on here is not one of them. I may post more, I may not. Last year I kind of stopped posting because I moved in July, and then I moved again in November. Between packing, unpacking, and unpacking again (movers did the packing for the second move), this kind of got left by the wayside. Anyway, on to some of my actual goals for 2014:

Fitness:

  • Do 2000 pushups. This actually seems pretty low, but if I hit it, I can always up the count next year
  • Spend 2000 minutes on exercise machines, probably ellipticals, though I think I'll count an actual bicycle toward this as well
  • Run a 5K, no walking. My eventual goal is a half-marathon (and maybe a full marathon someday, but we'll start with the half). But I want achievable goals for 2014, and I'm not sure the half-marathon is achievable for me

Household:

  • Build a 6 foot tall bookshelf. I built a solid pine 4' x 4' bookshelf a couple years ago, and it is the strongest, best looking bookshelf we own. It cost a lot more than the cheapo particle board ones, but the end result, and the experience, were worth it.
  • Play 36 board games during the year with my son. It is a way for me to spend quality time with my child as well as build his love for a hobby I love.

Technology/Professional:

  • Get a MySQL certification. I have to do more and more with databases, so getting a certification is probably a good idea, at least from a resume perspective.
  • Submit at least one conference or journal paper where I am the lead author. I've got one in mind, but I'm not sure it is publication-worthy
  • Complete at least 1 online course and get the certificate. I'm signed up for Convex Optimization through Stanford online; you can join me there if you'd like.
Well, there they are. They should be achievable. I'll start on them tomorrow, or this weekend at the latest :)

Friday, May 3, 2013

CUDA - Figuring out how many threads have launched Part 2

Part one is here.

Last time I determined that I could see how many threads had been launched on my GTS 450. However, every one of those threads was making an atomicAdd() call to a pointer in host memory: the atomics get serialized, and every write crosses over to the host. Two things that are quite bad.

If I want the host to be able to read what the threads are doing, then I need to keep the memory location on the host. However, I don't want every single thread updating that memory location. Thankfully, I don't have to. Blocks are loaded onto individual SMs all at once and retired all at once. Not only that, but I can get the block size quite easily from inside the kernel with the blockDim variable. Implementing this change, my new kernel looks like this:

    __global__ void wasteTime(int nIter, int *threads){

        // Thread (0,0,0) of each block adds the whole block's thread count once...
        int blkThreads = 0;
        if(     threadIdx.x == 0 &&
                threadIdx.y == 0 &&
                threadIdx.z == 0){
            blkThreads = blockDim.x * blockDim.y * blockDim.z;
            atomicAdd(threads, blkThreads);
        }

        // Busy work to keep the kernel resident for a while
        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        // ...and subtracts it again as the block retires
        if(     threadIdx.x == 0 &&
                threadIdx.y == 0 &&
                threadIdx.z == 0){
            atomicSub(threads, blkThreads);
        }
    }

I also modified the main routine after the kernel call so that it prints periodic updates on how many threads are currently running on the device:

        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);

        struct timespec delay;
        delay.tv_sec = 0;
        delay.tv_nsec = 200000000L;

        printf("%d threads running\n", *threads);
        nanosleep(&delay, NULL);

        while (*threads > 0){
            printf("%d threads running\n", *threads);
            nanosleep(&delay, NULL);
        }

        cudaFreeHost(threads);  // pinned memory from cudaMallocHost is freed with cudaFreeHost

This also requires a #include <time.h> up at the top of the source. One last change is that I swapped the grid to be 4x4x4, to launch more blocks.  The full code is here. Running the code generated the following results:


0 threads running
2048 threads running
2048 threads running
2048 threads running
1024 threads running

As you can see, the first time I called the printf statement, no threads had been spawned yet. The total number of threads launched was actually (4 x 4 x 4) x (4 x 4 x 4) = 4096, but I apparently got a maximum number of threads (for this problem and GPU) that could be resident on the GPU of 2048. Another good thing, however, was that I could see the number of threads decreasing right before the kernel finished.

There is a small problem in the kernel, though, especially with small block sizes. Any time there is a divergence in threads within a warp (a set of 32 threads), all of the threads in that warp perform all of the commands for all of the divergent paths, with the non-applicable threads having their results masked out. What this means for my code is that threads 1-31 will all wait and twiddle their thumbs while thread 0 goes off and does its DMA atomicAdd and atomicSub. With my block size of 64, a full half of my threads are affected by the time delay associated with 1/64 of them.

I can attack this in a few ways. The first is to do the atomic commands only on a single thread from the entire grid. Another is to set the value of *threads prior to the launch and remove the atomicAdd, but leave in the atomicSub. I think I want to leave the atomicSub at the end of the kernel anyway (and calculate blkThreads within that if statement) because I like seeing individual blocks retired. Changing the kernel so that only block (0,0,0) thread (0,0,0) atomicAdds the number of threads for the grid, and thread (0,0,0) of each block atomicSubs it back down (code here), has the following results:

0 threads running
4096 threads running
3072 threads running
2048 threads running
1024 threads running
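
For reference, here is a rough sketch of what that grid-level version of the kernel looks like, reconstructed from the description above (the linked file has the actual code):

    __global__ void wasteTime(int nIter, int *threads){

        // Only thread (0,0,0) of block (0,0,0) adds the grid's full thread count
        if(     blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0 &&
                threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0){
            int gridThreads = gridDim.x * gridDim.y * gridDim.z *
                              blockDim.x * blockDim.y * blockDim.z;
            atomicAdd(threads, gridThreads);
        }

        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }

        // Thread (0,0,0) of every block subtracts its block's count as it retires
        if(     threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0){
            int blkThreads = blockDim.x * blockDim.y * blockDim.z;
            atomicSub(threads, blkThreads);
        }
    }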

As the output above shows, the count starts off at 4096 and goes down as the blocks retire. Having the host increment the number of threads prior to launch, though, may be even better. That would let me see how many threads have been scheduled for launch, which is what I really want to know for load balancing purposes.

To avoid race conditions I'll still have to use atomicAdd, just from the host, especially if I branch out into threading, where multiple threads may be trying to do this all at once. Simple, right? Well, you can try it, but I'll spoil it: atomicAdd is a "device" function, it can only be called from the device. Obviously the idea of atomics isn't unique to CUDA though, so we can look elsewhere.

I could try to use the C++11 standard library atomics, but that requires me to create the variable as a std::atomic<int> object, which is problematic because I also need to allocate the memory using cudaMallocHost to get my DMA access. Hmm. Well, I can also use intrinsics: GCC has an intrinsic __sync_fetch_and_add. This works in C (which means I can drop it into my runningThreads.cu file), but it is, as I said, intrinsic to GCC, making it less portable. This could bite me later if I want to use the Intel or PGI compilers.
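
As a sketch, the host-side bump before each launch would look something like this (using the grid, block, and threads variables from the earlier snippets):

        // Host-side atomic increment of the counter before the kernel launches;
        // the kernel then only atomicSubs as each block retires.
        int gridThreads = grid.x * grid.y * grid.z *
                          block.x * block.y * block.z;
        __sync_fetch_and_add(threads, gridThreads);

        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);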

I implemented the increment with __sync_fetch_and_add and changed the loop so that it would launch a new grid (with a total of 3 grids being launched) when there was less than 30% of the threadsPerGrid still on the device. This could just as easily be set based on some percentage of cudaDeviceProp::maxThreadsPerMultiProcessor * cudaDeviceProp::multiProcessorCount. In any case, the code can be found here, and the output was:

16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
20480 threads running
20480 threads running
16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
20480 threads running
20480 threads running
16384 threads running
16384 threads running
12288 threads running
12288 threads running
8192 threads running
8192 threads running
4096 threads running
4096 threads running

Looking pretty good. Next time I'll throw some OpenMP threads into the mix, and maybe try out an idea I just had about using the C++11 atomics to capture the number of currently launched threads, though I haven't even tried to implement that yet.

Tuesday, April 30, 2013

CUDA - Figuring out how many threads have launched Part 1

I've talked about multiple threads sharing a single GPU once before. That got me thinking, though: what if I don't want all of them to be using the GPU simultaneously? What if the GPU actually becomes the bottleneck in my computational code? In all likelihood, anything I implement on a GPU will probably also have a CPU implementation somewhere, because I would like my code to run on any system. If it can run on an accelerated system, all the better.

Now that I've laid out my motivation, it's time to figure out what I can do about it. It is pretty simple to figure out how much of the device's memory has been utilized with the function cudaMemGetInfo(). But what about the number of blocks or threads? Now, it may be that my Google-Fu is weak, but I couldn't find a built-in way to determine the current number of threads running, so I decided to see if I could build one myself.

I started off using the same code I used last time, though I modified my  wasteTime routine so that it would cut back on the registers used a bit.  The first iteration of the new wasteTime function looks like this:

    __global__ void wasteTime(int nIter, int *threads){

        // Every single thread bumps the counter in pinned host memory via DMA
        atomicAdd(threads, 1);

        // Busy work to keep the kernel resident for a while
        int j;
        float data = 1.0;
        for (j = 0; j < nIter; j++){
            data = cos(data) + data;
        }
    }

In the main function, I got rid of the OpenMP (for now) to focus on getting this call right.

     int main(int argc, char *argv[]){
        int nIter;
        nIter = 100000;
        int *threads;
        cudaMallocHost(&threads, sizeof(int));

        *threads = 0;

        dim3 grid(2, 1, 1);
        dim3 block(4, 4, 4);

        cudaStream_t stream;

        cudaStreamCreate(&stream);

        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);

        printf("%d threads started\n", *threads);
        cudaStreamDestroy(stream);
        printf("%d threads started\n", *threads);

        cudaFreeHost(threads);  // pinned memory from cudaMallocHost is freed with cudaFreeHost
    }

A few notes before I run this:
  1. The threads pointer in main is the same threads pointer in the CUDA code, this is done via pinned memory on the host and direct memory access (DMA) from the GPU to the host.
  2. DMA from the device to the host is incredibly expensive.
  3. DMA is not available for all compute cards, though it is for all Fermi-based cards and above, and honestly, Fermi is old enough now that you should be able to assume that if people are going to accelerate their code, they'll have it.
  4. Atomic operations are expensive because they get serialized.
  5. I have changed the grid and block dimensions for launch so that they can be a bit more general. I get the feeling I'll be using this general project as a framework for a bunch of different tests. If I don't, no biggie.
  6. The source code up to this point is here.
I've left in the streams from last time because I want to parallelize this later.

Building and running (still on my 4 core AMD with a GTS 450) I get an answer of:

0 threads started
71 threads started

Hmm, interesting. Doing some basic math real quick, we know that we have 2 blocks, and each block has 4*4*4 = 64 threads. The first time that the printf statement is called, we haven't actually launched any threads yet, or at least they haven't modified the global memory yet. We then destroy the stream, which I had thought would synchronize the device and the host, and when we call printf again, we only have 71 threads, more than 1 block, but not fully 2.

Now we'll go back and synchronize the device to the host. This will be a simple change where we just add a call to cudaDeviceSynchronize() right after the second printf call, and we'll put another printf right after that to see how things have changed. Also, I've modified the wasteTime function and given it an atomicAdd(threads, -1); at the end, so that instead of seeing how many threads have started, I can see how many threads are currently running.  I've also modified the printf statements to reflect this. The code for the new version is here. The results from 3 separate runs are:

ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
113 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
77 threads running
0 threads running
ross@kronix:~/dev/cuda/runningThreads$ Debug/runningThreads 
0 threads running
68 threads running
0 threads running

As you can see, there is a fair amount of variability in the number of threads "caught" running by the middle printf, which falls between the stream destroy and the device synchronize.
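
For reference, the tail of the modified main looks roughly like this (reconstructed from the description above; the linked file is the real thing):

        wasteTime<<<grid, block, 0, stream>>>(nIter, threads);

        printf("%d threads running\n", *threads);  // before the stream destroy; usually 0
        cudaStreamDestroy(stream);
        printf("%d threads running\n", *threads);  // whatever is "caught" running
        cudaDeviceSynchronize();                   // actually wait for the kernel to finish
        printf("%d threads running\n", *threads);  // back to 0 once everything retires

        cudaFreeHost(threads);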

I definitely have some more to add to this, such as not having every thread run the atomic functions, and spawning into multiple OpenMP threads as well. I'll be following this post up with some future posts detailing my thought process and approach to solving this problem I may or may not have invented out of thin air.

In fact, you may be wondering why I even care about this at all. The reason is I may be working on a project soon which will literally have thousands of concurrent tasks each one of which may or may not be able to saturate a GPU for a short time. If I am even running on an accelerated system, it will still have a high CPU to GPU ratio on each node. While spawning some of these tasks onto the GPU to free up CPU time will be beneficial, if I end up being stuck waiting for the GPU to crunch through all of the tasks I've given it, because I haven't paid attention to load balancing, well, that would be the opposite of accelerated functionality, now wouldn't it. That last sentence is long and has a lot of commas, but I'm going to leave it like that anyway.

On a more pragmatic tangent, like "why wouldn't you just develop for CPUs and then accelerate with a Xeon Phi?": the fact that I can do these kinds of tests and experiments at home is one of the nice things about how NVIDIA has set up their system. I can tinker with my low-cost system, but port that code and expertise over to the production systems at work with minimum effort. While I'd love to get my hands on an Intel Xeon Phi and do some work with it (a minimum $2k proposition at the low end, including the fact that my current desktop couldn't run it), I would pretty much have to purchase the Intel compilers to go with it to be able to build for it. Contrast that with my $65 video card and free compilers, and the choice isn't that hard at the moment. That, and I find it fun.

Still, let me reiterate that: I would love to get my hands on an Intel Xeon Phi.

Finally, on a non-GPU related note, I want to thank Tod for giving me some space on his server. He set up my account a while ago for D&D stuff, but I don't think he'll mind if I put some code up for others to try out. If you want me to take it down (or actually link to you, other than tod.net) let me know.

Saturday, April 27, 2013

Thoughts on Spring Cleaning

I feel like my wife and I were pretty productive today. First I finally got around to mowing the lawn for the first time this season. There was one point where the grass was so thick and I was moving too fast (at a walk) that I actually stalled the mower. Going over that patch a bit slower let me take it out though.

After that, and lunch, we started in on what we refer to as the office. This is where our computers are and it was in pretty bad shape. Truth be told it still isn't actually clean, but I can actually see the surfaces of both of the desks that make up my corner. I also raised my two monitors (having experienced 2 monitors, I far prefer the increased real estate to a single monitor) up onto a mostly unused shelf that was sitting behind them.

This made an incredible difference for my entire desk. Not only did it make the space they were sitting in usable, it made the mostly unused space behind them (and under the shelf) usable as well. Now I'll just have to get used to the raised position of the monitors. While I am slouching pretty badly right now, for the most part I have found that I have better posture with them standing higher.

I'm sure this post has been pretty boring so far, so here are some amusing tidbits from our adventures in cleaning.

  1. We found a small LEGO Creator kit. I built the race car, which is now sitting on the shelf with my monitors, and I may or may not have been playing with it all evening, to the annoyance of my wife.
  2. We filled a medium sized trash bin twice with recyclables.
  3. We found a birthday card that was supposed to go to somebody a few weeks ago (technically this was found in the kitchen, and they did get the present at the appropriate time).
  4. We found a hospital bill that was over a year old.
  5. Pants which had a loose button, and therefore had been on the sewing table for over a year, were fixed.
  6. ???
  7. Profit
Now to get ready for the D&D adventure I'm running tonight.