Commit 471460ca authored by Erik Strand's avatar Erik Strand

Explain basic GPU concepts

parent a81cb56e
# cuda
CUDA is the programming model used for general purpose programming of NVIDIA GPUs. It's an extension
of C++ that adds support for transferring data between the CPU (host) and GPU (device), and using
the GPU's massive parallelism for arbitrary computation.
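
To make that concrete, here's a minimal sketch of the host/device workflow: allocate device memory, copy data over, launch a kernel, and copy the results back. The `scale` kernel and the specific sizes are hypothetical examples, and compiling this requires NVIDIA's `nvcc` plus a CUDA-capable GPU.

```cuda
#include <cstdio>

// Kernel: runs on the device, one thread per array element.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1024;
    float host_data[n];
    for (int i = 0; i < n; ++i) host_data[i] = (float)i;

    // Allocate device memory and copy the data over (host -> device).
    float* device_data;
    cudaMalloc(&device_data, n * sizeof(float));
    cudaMemcpy(device_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel: 4 blocks of 256 threads each (4 * 256 = 1024 threads).
    scale<<<4, 256>>>(device_data, 2.0f, n);

    // Copy the results back (device -> host) and clean up.
    cudaMemcpy(host_data, device_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device_data);

    printf("host_data[10] = %f\n", host_data[10]);
    return 0;
}
```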
## Why CUDA?
There are a couple layers to this. First, why are we considering GPUs next to microcontrollers like
the humble SAMD11? While GPUs started as a tool for actual graphics processing, they quickly
became popular in the high performance and scientific computing worlds for their sheer FLOPs
(floating point operations per second). Nowadays
many machine learning applications are moving their way out of the datacenter and into physical
devices, to accelerate tasks like object detection or speech interfaces. As a result there are an
increasing number of small GPU
modules designed for
integration into robotics, autonomous systems, etc. They're still a little pricey now, but they get
cheaper each year.
Second, why CUDA? It's a proprietary system that only works for NVIDIA hardware.
OpenCL is the open equivalent. It works on pretty much any GPU, and unlike CUDA you can look at all
the code behind it. This is a major win for OpenCL in my book, but at this point in time CUDA is
still the de facto standard for most scientific and machine learning GPU code. NVIDIA was the first
to market in that space, and they've kept their lead since then. Hopefully we'll start to see a more
diverse ecosystem develop in the coming years.
## How to think about GPU programming
The main point of a GPU is to run many threads at once. Often this is thousands at a time -- far
more threads than you can run on any single CPU. But the way threads behave on a GPU is different
from how they behave on a CPU, so writing GPU code is a lot different than writing CPU code, even if
you have a big compute cluster and could launch an equivalent number of CPU threads.
The biggest difference is that GPU threads are less independent than CPU threads. On a CPU, two
threads might compete for resources -- like RAM, or a floating point unit -- but they can execute
completely separate applications. On (almost all) GPUs, threads run in groups of 32 called *warps*.
For the most part, all threads in a warp have to execute the same instruction at the same
time! So if you have some code like this:

```cpp
if (condition) {
    do_this();
} else {
    do_that();
}
```

Each thread won't call `do_this()` or `do_that()` independently. Instead, first all the threads that
need to take the first branch will execute `do_this()`, and the other threads just wait. Then all
the threads that need to take the second branch will execute `do_that()` while the first group of
threads waits. This takes some getting used to.
## Warps, Blocks, and Latency Hiding
Conceptually, knowing that threads run in groups rather than individually is the most important
thing to understand. But if you want to write the fastest GPU code, you need to know some more about
how the hardware works. There's a whole hierarchy between individual threads and the GPU's global
resources (like VRAM), and structuring your code to fit in this hierarchy neatly is how you get the
best performance.
Warps are grouped together into *blocks*. Each block shares some memory and other resources. All the
threads in a block can use this memory to communicate with each other; communication between threads
in separate blocks is much more limited. When you launch a GPU kernel, you have to say how many
blocks you want to run, and how many threads you want in each block (and this latter number is
almost always a multiple of 32, otherwise you'll end up wasting threads in some warps).
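
In code, those two numbers appear in CUDA's triple-angle-bracket launch syntax. A common sketch (the sizes and the `my_kernel` name are hypothetical) rounds the block count up so every element gets a thread:

```cuda
// Hypothetical problem size and launch configuration.
const int n = 1000000;              // total elements to process
const int threads_per_block = 256;  // a multiple of 32, so no warp is partially used

// Round up, so every element gets a thread even when n isn't an
// exact multiple of the block size.
const int num_blocks = (n + threads_per_block - 1) / threads_per_block;

my_kernel<<<num_blocks, threads_per_block>>>(/* args */);
```

Because of the rounding, the last block may contain threads with no element to process, which is why kernels conventionally start with a bounds check like `if (i < n)`.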
Finally, all threads can access global memory. Global memory is much larger than the shared
resources in each block, but it's also much slower. For many GPU applications, basically all of the
runtime comes from transferring data from global memory and back again; the processing each thread
does takes negligible time. Certain memory access patterns are much more efficient than others, so
these sorts of details become very important for writing fast GPU code. (Spoiler alert: generally
you want all threads in a warp to access adjacent memory locations, so that the warp as a whole
loads one contiguous block of data.)
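
As a sketch of the difference (the kernel here is a made-up example, not from the original notes): in the coalesced case, the 32 threads of a warp read 32 consecutive floats, which the hardware can serve with a few wide memory transactions; in the strided case, each thread's address lands in a different region, so the same warp triggers many separate transactions.

```cuda
__global__ void copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Coalesced: consecutive threads in a warp touch consecutive
    // addresses, so the warp loads one contiguous chunk of data.
    out[i] = in[i];

    // Strided (slow): consecutive threads would touch addresses far
    // apart, scattering the warp's loads across memory.
    // out[i] = in[(i * 32) % n];
}
```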
The last important concept to understand is latency hiding. At the end of the day, threads are
executed by CUDA *cores* that reside in *streaming multiprocessors*. These multiprocessors can
quickly switch between running different warps. So while most GPUs have a thousand or so physical
cores, you'll commonly launch kernels that run tens of thousands if not millions of threads.
Execution of these threads is interleaved, so at any given moment the GPU is likely to be working on
several times as many threads as it has cores. The point of this is to not have to wait. One warp
might execute a costly (i.e. slow) read from global memory. Rather than wait 100 clock cycles
(roughly speaking) before executing that warp's next instruction, the streaming multiprocessor tries
to find a different warp that's ready to execute its next instruction right away. So (within reason)
it's a good thing to run a lot of threads at once, so that the streaming multiprocessor has the best
chance of always finding something useful to do.
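
A memory-bound kernel makes this concrete (this `saxpy` example is illustrative, not from the original notes): each thread issues a slow global load, and the oversubscription is what gives the scheduler other warps to run in the meantime.

```cuda
// Each thread does two global loads, a little arithmetic, and one
// store. While one warp waits (roughly hundreds of cycles) for its
// loads, the streaming multiprocessor issues instructions from other
// resident warps, keeping the cores busy.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Launching ~1M threads on a GPU with ~1000 cores is normal; the
// oversubscription is exactly what latency hiding depends on.
// saxpy<<<4096, 256>>>(2.0f, x, y, 1 << 20);
```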