Notes taken while reading [PMPP](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311), supplemented by the official [programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#). If you're learning CUDA from scratch you probably want to read those; the latter is particularly readable. These notes are for jogging my memory when I forget details 6 months from now.

---

**CUDA C++** is an extension to C++, compiled by [**NVCC**](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), for writing code which runs on Nvidia GPUs.

Your machine, the CPU, is the **host**. Each of your GPUs is a **device**. A device is a collection of **streaming multiprocessors** (**SM**s), all capable of running in parallel.

A **kernel** is a function you call to spawn a **grid** on a device. A grid is a collection of **thread block**s, which are each a collection of **thread**s which will run together on a streaming multiprocessor.

> [!tip] tiny illustrative image
>
> ![[cuda-sms-blocks.png.png]]
>
> ([source](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model-automatic-scalability))

When you call a kernel on a device you specify how many blocks to spawn and how many threads to spawn per block. These will be scheduled to run on the device. Thread blocks should not try to coordinate: blocks might run in any order, in series or in parallel, depending on how many streaming multiprocessors your GPU has.

The threads within a block are more able to coordinate; they have block-level [shared memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory) as well as [synchronization primitives](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups). All the threads of a block will be co-located on the same SM, sharing resources, so each block can have no more than 1024 threads (["maximum number of threads per block"](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability)).

---

CUDA code includes some function annotations:

- `__host__`: a function which runs on the host (optional; host is the default)
- `__global__`: a kernel, a function which spawns a new grid
- `__device__`: a function which can be called from inside a CUDA thread
- a function can be both `__host__` and `__device__`; NVCC will build both versions

Kernel and device functions have access to some built-in variables:

- `dim3 blockIdx`, the ID of the current block within the grid
- `dim3 blockDim`, the number of threads in each thread block
- `dim3 threadIdx`, the ID of the current thread within the current thread block
- `dim3` here is [a uint 3-tuple](https://github.com/shinpei0208/gdev/blob/master/cuda/runtime/ocelot/cuda/interface/cuda_runtime.h#L131), for your convenience, in case your data is multi-dimensional

Pulling these together:

```c++
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    // One thread per matrix element; each thread derives its element's
    // coordinates from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
```

The guard (`if (i < N && j < N)`) exists because you generally won't spawn exactly as many threads as there are units of work: threads are issued in warps of 32 and you round the grid size up to cover every element, so your work probably doesn't divide evenly into the threads you launch.
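For concreteness, a sketch of the host-side launch (the launch syntax itself is explained just below). The 16×16 block shape is an illustrative choice rather than anything from the book, and `A`, `B`, `C` are assumed to already point at device-accessible memory:

```c++
// Round the grid up with ceiling division so every element gets a thread;
// the kernel's bounds check discards the overshoot.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
```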
**A kernel is called** with some unusual syntax:

```c++
Func<<< Dg, Db, Ns, S >>>(parameter);
```

where:

- `dim3 Dg` is the number of blocks in the grid
- `dim3 Db` is the number of threads in each block
- `size_t Ns` is the amount of shared memory to allocate to each block (defaults to 0)
- `cudaStream_t S` is which CUDA stream to send this kernel to (defaults to 0)

---

**Small glossary**:

- **arithmetic intensity**: Roughly, how many operations you perform for each byte of memory accessed. Each GPU has some theoretical maximum memory bandwidth and some theoretical maximum FLOP/s. In order to make full use of your GPU and saturate _both_, you must maintain a specific ratio. This is often illustrated graphically via a roofline chart.
- **bank conflict**: memory (also, registers!) is divided into banks. Each logical memory address resides on memory stored in some physical bank. Each bank has limited bandwidth; a warp which accesses multiple addresses in the same bank will have its requests serialized.
    - [official best practices guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-and-memory-banks)
    - [more](https://github.com/NervanaSystems/maxas/wiki/SGEMM#calculating-c-register-banks-and-reuse)
    - [more](https://github.com/Kobzol/hardware-effects-gpu/blob/master/bank-conflicts/README.md)
    - [more](https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md#shared-memory-layouts)
    - [more by Lei Mao](https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/)
- **CUDA graph**: when dealing with smaller grids it's possible for the GPU to spend most of its time waiting for the CPU to send the next instruction. CUDA graphs are instruction streams which are sent to the GPU all at once. Without special effort CUDA graphs hard-code addresses, which means the inputs and outputs must always be at the same locations; if you want concurrency you have to build multiple graphs.
- **SASS**: a low-level assembly [with a slightly different instruction set for each GPU generation](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-reference). Halfway between PTX and machine code.
    - You can view the SASS for your kernels with [`nvdisasm`](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#nvdisasm)
    - There are also [_assemblers_](https://github.com/daadaada/turingas?tab=readme-ov-file), if you feel like writing this stuff.
- **stream**: [A work queue](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams) allowing devices to be driven asynchronously. You can schedule data transfers and kernels to streams knowing they will eventually complete. The device pulls work from the stream as it can. There are functions to [poll](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g2021adeb17905c7ec2a3c1bf125c5435) or [block on](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g82b5784f674c17c6df64affe618bf45e) results from the stream.
- **occupancy**: $\frac{\text{active warps per SM}}{\text{maximum warps the SM supports}}$. Each SM provides limited resources; if each warp requires a large amount of e.g. shared memory then occupancy will suffer. This can hurt performance: for example, high occupancy allows the SM to hide latency by running non-blocked warps while others are blocked on memory I/O.
    - [siboehm](https://siboehm.com/articles/22/CUDA-MMM) has some good notes on occupancy, including the following:
        - pages 51 and 52 of [this document](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf) describe a tradeoff between arithmetic intensity and occupancy. At very high or very low arithmetic intensity, high occupancy is not necessary for high performance.
- **PTX**: the NVIDIA GPU equivalent of bytecode/assembly. [You can include inline assembly](https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html) in your kernels, and when you do, you write PTX statements. NVCC compiles your CUDA code down to PTX; the device driver compiles this into actual binary code at load time.
- **thread block cluster**: As of H100 (compute capability 9.0), the thread hierarchy has an additional level.
    - Thread < Thread Block < Thread Block Cluster < Grid
    - Thread blocks in the same cluster are guaranteed to be scheduled together, and their threads can [synchronize](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cluster-group) as well as [share memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#distributed-shared-memory) across blocks.
    - [Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters)
- **warp**: A group of 32 threads. This is the smallest unit of work which can be scheduled on an SM. While CUDA is single-program-multiple-data (SPMD), warps execute in single-instruction-multiple-thread (SIMT) fashion. This means conditionals can cause [**warp divergence**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture), which is a performance issue (see the sketch at the end of these notes).

**Resources**:

- Official [best practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/contents.html)
- [A list of all the C++ extensions NVCC makes available](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions)
- As of at least 2018 there was room to understand the hardware better than NVCC's optimizing compiler does and write a 15% faster matmul [by patching the binary](https://arxiv.org/pdf/1804.06826). ([Turing follow-up](https://arxiv.org/pdf/1903.07486))
- [More](https://github.com/SunsetQuest/CudaPAD) on the difference between PTX and SASS and tools for working with both.
- Tim Dettmers, [which GPU to use for deep learning](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) (2023)
- How to write a fast [SGEMM](https://siboehm.com/articles/22/CUDA-MMM) (2022)
    - excellent read, lots of great stuff in here!
- How to write a fast [SGEMM](https://github.com/NervanaSystems/maxas/wiki/SGEMM) (2017)
- How to write a fast [SGEMM](https://cnugteren.github.io/tutorial/pages/page1.html) (2014)
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens/tree/main) - fast kernels
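---

**Warp divergence sketch** (referenced from the glossary above): a minimal illustrative kernel (not from PMPP or the guide), contrasting a branch that diverges within a warp with one that is uniform across each warp:

```c++
__global__ void branches(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: lanes of the same warp disagree on the condition, so the
    // warp executes both branches back to back with inactive lanes masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;

    // Warp-uniform: threads 0-31 form warp 0, 32-63 form warp 1, and so on,
    // so every lane of a given warp takes the same path and there is no
    // divergence penalty.
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] += 1.0f;
    else
        out[i] -= 1.0f;
}
```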