CUDA 101: Get Ahead of the CUDA Curve with Practice!

After a recent talk I gave called “CUDA 101: Intro to GPU Computing”, a student asked, “What’s the best way for me to get experience in parallel programming and CUDA?” This is a question I struggled with a lot when I was in college, and one I still ask myself about various topics today. The first step is to realize that it’s hard to get useful experience without already having some skill in an area. This is where practice comes in. Practice is not something people look for on your résumé, but it is a critical step in gaining experience. Let’s talk about some ways to get practice with parallel programming on the CUDA platform without having to tackle a large research or programming project.

First, if you’ve done any school work that involves programming, you should have a plethora of old programming assignments, books with sample problems, and old tests. This is a great bounty of problems to try to solve with parallelism. The first step is to solve the problem using only the CPU (or start from an existing CPU-only solution). Time this program with a wall clock (better yet, use the Unix time command or a precise software timer) and see how fast you can get it to run while still producing accurate results. Now try to port the computationally intensive parts of the code to the GPU using OpenACC, CUDA C/C++ or CUDA Fortran. Can you make the program run faster? If you can, you’ve just taken a step toward teaching your brain to think in parallel. If you’ve tried everything you can think of and cannot make your parallelized program run faster, don’t despair! Learning which algorithms cannot be made faster with parallelism is just as valuable a lesson.
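As a minimal sketch of the "CPU first, then time it" step, the C++ below pairs a CPU-only baseline with a small software timer built on std::chrono::steady_clock. The SAXPY workload and the names saxpy_cpu and time_ms are assumptions chosen for illustration; substitute whatever assignment you are revisiting.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// CPU-only baseline: y[i] = a * x[i] + y[i] (SAXPY).
// Hypothetical example workload, not taken from any particular assignment.
void saxpy_cpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as time_ms([&] { saxpy_cpu(2.0f, x, y); }) gives you the number to beat. Timing the GPU port through the same helper keeps the comparison fair, provided the timed region also covers any host-to-device data transfers.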

The second method is to make use of the large number of questions on forums (like Stack Overflow or the NVIDIA CUDA forums). The trick is to read the original question without looking at the answers. Once you’ve read the question, take these steps:

1. If you don’t understand the question, go do some research and figure it out!
2. Once you understand what the person is asking, come up with a solution. I don’t mean just reason it out in your head. Type it out. If required, write the code, compile it, and test it. The problem with doing it only in your head is that it’s too easy to hand-wave and skip over key details. Writing it out forces you to collect and organize your thoughts, which is very important.
3. Now the fun part. There’s probably already an answer posted for the question, so check your work against it! Did you come up with a similar answer? If not, figure out why. Feel free to contact the person who posted the answer for clarification.
4. If there is no answer, post yours! Pretty soon you’ll have a number of posted answers, which make a great experience item; we’ll talk about that in a later post.

Answering others’ questions is a great way to learn, and I often make this point during my presentations: one of the best ways to learn is by teaching. There are many ways you can practice teaching parallel programming with CUDA. If you are a member of a computer club (e.g. an IEEE or ACM student chapter), you can hold tutoring sessions and take turns presenting on an aspect of CUDA, or on some CUDA code. I’ve also found that helping to debug other people’s code (ethically, of course) is a great way to get exposed to common problems and workarounds. If you already have some experience, you might even be a good teaching assistant for a parallel programming class that uses CUDA. Teaching is not just a great way to practice; it’s also a way to build valuable experience for your résumé. I’ll talk more about this in my next post.

Next up is one of my favorite practice methods from school. As you’re solving problems from mathematics, physics, biology, and other classes, see if you can write a parallel program to solve them as well. You can then check the accuracy of your program against the answers you worked out by hand. Try writing the entire program to execute as fast as possible (while remaining accurate) on the CPU. Now compare the two: can you make it execute faster by leveraging the GPU’s parallelism?
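Here is one way that accuracy check might look, sketched in C++ on the CPU. The calculus problem (the integral of sin(x) from 0 to pi, which works out to exactly 2 by hand) and the names integrate_sin and matches_reference are assumptions picked for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Composite trapezoid rule for the integral of sin(x) over [a, b]
// using n equal-width intervals.
double integrate_sin(double a, double b, std::size_t n) {
    assert(n > 0);
    const double h = (b - a) / static_cast<double>(n);
    double sum = 0.5 * (std::sin(a) + std::sin(b));
    for (std::size_t i = 1; i < n; ++i) {
        sum += std::sin(a + static_cast<double>(i) * h);
    }
    return sum * h;
}

// True when the computed value is within `tol` of the hand-derived answer.
bool matches_reference(double computed, double reference, double tol) {
    return std::fabs(computed - reference) <= tol;
}
```

Each term of the sum is independent of the others, so the loop is a reduction, which makes it a reasonable candidate for a GPU port. The same matches_reference check can then be applied to the GPU result, with a tolerance loose enough to absorb the different floating-point summation order.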

You know that Data Structures or Algorithms book on your bookshelf? This is another great source for practice problems. Find an old algorithm that has been around for ages and is extremely optimized for sequential execution. By now I think you know where this is headed. Rewrite the performance-critical parts to run on the GPU and see if you can get it to go faster. Or better yet, use a completely different algorithm to get the same, correct output.
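To illustrate the "completely different algorithm, same output" idea, the sketch below computes an inclusive prefix sum two ways in C++. The choice of prefix sum is my own example, and the function names are hypothetical. The second version follows the Hillis-Steele scan formulation and runs on the CPU here only to show its structure; it is not a GPU implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Classic sequential inclusive prefix sum: each step depends on the one before it.
std::vector<int> scan_sequential(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Hillis-Steele style inclusive scan. Within one pass, every element is
// computed only from the previous pass, so the inner loop has no
// iteration-to-iteration dependency.
std::vector<int> scan_hillis_steele(const std::vector<int>& in) {
    std::vector<int> cur = in;
    std::vector<int> next(in.size());
    for (std::size_t offset = 1; offset < in.size(); offset *= 2) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            next[i] = (i >= offset) ? cur[i] + cur[i - offset] : cur[i];
        }
        cur.swap(next);
    }
    return cur;
}

// Aborts (in debug builds) if the two formulations disagree on `in`.
void check_scans_agree(const std::vector<int>& in) {
    assert(scan_sequential(in) == scan_hillis_steele(in));
}
```

The second version does more total work (roughly n log n additions instead of n), which is exactly the kind of trade-off worth measuring: extra work can still pay off when each pass can be spread across many threads.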

My last suggestion is to compare your algorithms against code compiled with OpenACC. Use the compiler directives to try to parallelize some loops. Then see if you can match or beat the speed-up gained by the compiler by writing the code in CUDA. This is a good way to get practice both with OpenACC and CUDA.
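For a sense of what the directive approach looks like, here is a minimal C++ sketch of a loop annotated with OpenACC. The SAXPY loop and the name saxpy_acc are assumptions for illustration. Whether the loop is actually offloaded, and how fast it runs, depends on your compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SAXPY with an OpenACC directive. With an OpenACC-enabled compiler the loop
// may be offloaded to an accelerator; compilers without OpenACC support
// ignore the pragma and run the loop serially on the CPU.
void saxpy_acc(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Once you have a timing for this version, a hand-written CUDA kernel for the same loop gives you a direct target to match or beat.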

We’ve covered several methods to practice and develop your CUDA programming skills. As beneficial as practice is, it’s just a stepping stone toward solid experiences to put on your résumé. In my next post I’ll cover ways to go about getting the experience you need!

Follow me on Twitter: @CudaHamster

Note: If you liked this post, you may also enjoy “What are your favorite parallel programming references?”