Commit 1ed3a695 authored by Erik Strand

Update GPU explanation

parent 471460ca
@@ -9,7 +9,7 @@ the GPU's massive parallelism for arbitrary computation.
There are a couple layers to this. First, why are we considering GPUs next to microcontrollers like
the humble [SAMD11](https://gitlab.cba.mit.edu/pub/hello-world/atsamd11)? While GPUs started as a
tool for actual graphics processing, they quickly became popular in the high performance and
scientific computing worlds for their sheer [FLOPS](https://en.wikipedia.org/wiki/FLOPS). Nowadays
many machine learning applications are making their way out of the datacenter and into physical
devices, to accelerate tasks like object detection or speech interfaces. As a result there are an
increasing number of small GPU
@@ -36,7 +36,7 @@ The biggest difference is that GPU threads are less independent than CPU threads.
threads might compete for resources -- like RAM, or a floating point unit -- but they can execute
completely separate applications. On (almost all) GPUs, threads run in groups of 32 called *warps*.
For the most part, all threads in a warp have to execute the same instruction at the same
time! So say you have some code like this:
```
if (condition) {
    do_this();
} else {
    do_that();
}
```
If some threads need to `do_this()`, and others need to `do_that()`, they won't call these functions
independently. Instead, first all the threads that need to take the first branch will execute
`do_this()`, and the other threads just wait. Then all the threads that need to take the second
branch will execute `do_that()` while the first group of threads waits. In total it takes as much
time as if all threads executed both branches. So you don't want to have large blocks of code that
only a few threads need to execute, or loops that a few threads will run way more times than others,
since these things effectively hold the remaining threads hostage. On the other hand, if all the
threads in the warp end up taking the same branch, then the warp can skip the other branch
completely.
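
As a loose sketch of the difference (the kernel names here are made up for illustration), compare a branch that splits the threads within a warp against one that depends only on the block index, so every thread in a warp agrees:

```
// In divergent_kernel, even and odd threads in the same warp take different
// branches, so each warp has to run both paths one after the other.
__global__ void divergent_kernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] *= 2.0f;   // half the warp runs this while the other half waits
    } else {
        data[i] += 1.0f;   // then the roles swap
    }
}

// In uniform_kernel, the condition depends only on the block index, so all 32
// threads in a warp take the same branch and the other path is skipped entirely.
__global__ void uniform_kernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}
```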
## Warps, Blocks, and Latency Hiding
@@ -59,28 +64,39 @@ how the hardware works. There's a whole hierarchy between individual threads and
resources (like VRAM), and structuring your code to fit in this hierarchy neatly is how you get the
best performance.
Warps are grouped together into *blocks*. Each block shares some memory, cache, and other resources.
All the threads in a block can use this *shared memory* to communicate with each other;
communication between threads in separate blocks is much more limited. When you launch a GPU kernel,
you have to say how many blocks you want to run, and how many threads you want in each block (and
this latter number is almost always a multiple of 32, otherwise you'll end up wasting threads in
some warps).
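
For instance, here's a minimal sketch (with a made-up kernel and sizes) of what specifying that configuration looks like in CUDA's triple-bracket launch syntax:

```
#include <cuda_runtime.h>

// A made-up kernel: each thread scales one element of the array.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard the extra threads in the last block
        data[i] *= factor;
    }
}

int main()
{
    int n = 1 << 20;
    float* data;
    cudaMalloc(&data, n * sizeof(float));

    // The launch configuration: how many blocks, and how many threads per block.
    // Threads per block is a multiple of 32 so no warp is left partially filled.
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale<<<blocks, threads_per_block>>>(data, n, 2.0f);

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```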
Finally, all threads can access global memory. Global memory is much larger than the shared memory
in each block, but it's also much slower. For data intensive applications, basically all of the
runtime comes from transferring data from global memory and back again; the processing each thread
does takes negligible time. Certain memory access patterns are much more efficient than others --
generally speaking you want all threads in a warp to access adjacent memory locations at the same
time, so that the warp as a whole can load one contiguous block of data. This is often the single
most important thing to get right if you want to write fast GPU code.
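
As a rough sketch (kernel names made up), the difference between a friendly and an unfriendly access pattern can be as small as the indexing arithmetic:

```
// In coalesced_copy, thread 0 reads element 0, thread 1 reads element 1, and so
// on, so a warp's 32 loads cover one contiguous chunk of memory.
__global__ void coalesced_copy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// In strided_copy, adjacent threads read elements 32 floats apart, so the
// warp's loads scatter across memory and take many separate transactions.
__global__ void strided_copy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[(i * 32) % n];   // deliberately strided, inefficient read
    }
}
```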
The last important concept to understand is latency hiding. At the end of the day, threads are
executed by CUDA *cores* that reside in *streaming multiprocessors*. These multiprocessors can
quickly switch between running different warps. So while most GPUs have a thousand or so physical
cores, you'll commonly launch kernels with tens of thousands if not millions of threads spread
across many blocks.
Execution of these threads is interleaved, so at any given moment the GPU is likely to be working on
several times as many threads as it has cores. The point of this is to not have to wait. Say one
warp executes a costly (i.e. slow) read from global memory. Rather than wait 100 clock cycles
(roughly speaking) before executing that warp's next instruction, the streaming multiprocessor tries
to find a different warp that's ready to execute its next instruction right away. When the first
warp's data finally arrives, the streaming multiprocessor will pick it back up again and execute its
next instruction.
So generally speaking you want to run a whole lot of threads at once, so that the streaming
multiprocessor has the best odds of always finding some warp that's ready to do something useful.
But the streaming multiprocessor only has so many registers and so much shared memory (these get
divvied up among all the blocks it's currently processing). So if every block needs a lot of
resources, each streaming multiprocessor will only be able to handle a few blocks at a time, and
you'll end up in situations where no warp is ready to execute its next instruction. This reduces the
streaming multiprocessor's *occupancy* (roughly, the fraction of its warp slots that are actually
filled with active warps), and with it the fraction of time it spends doing useful work. Balancing
the number of threads against per-block resource usage to keep occupancy high is one of the most
important concerns for writing really fast GPU code.
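
To see this effect directly, the CUDA runtime can report how many blocks of a given kernel fit on one streaming multiprocessor at once. The sketch below (kernel and sizes made up) compares a small and a large shared memory budget per block:

```
#include <cstdio>
#include <cuda_runtime.h>

// A made-up kernel that stages data through dynamically sized shared memory.
__global__ void my_kernel(float* data)
{
    extern __shared__ float scratch[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = scratch[threadIdx.x] * 2.0f;
}

int main()
{
    int threads_per_block = 256;
    int blocks_small = 0;
    int blocks_large = 0;

    // Ask how many blocks can be resident on one streaming multiprocessor,
    // first with 1 KB of shared memory per block, then with 48 KB per block.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_small, my_kernel, threads_per_block, 1024);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_large, my_kernel, threads_per_block, 48 * 1024);

    printf("resident blocks per SM: %d with 1 KB shared, %d with 48 KB shared\n",
           blocks_small, blocks_large);
    return 0;
}
```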