**amazon bedrock** - hosted foundation models (text, image, embeddings). Instead of doing any of your own serving you can configure popular models (prompting, knowledge bases for RAG, agent pipelines) in the AWS console and forget all the implementation details
- [Supported models](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)
**bfloat16 “brain floating point”** - a google brain invention. An fp32 but with 16 fewer mantissa bits. Alt: an fp16 but with 3 more exponent bits and 3 fewer mantissa bits. They claim: the missing precision is expensive to compute and doesn’t seem to matter much for deep learning, while the extra dynamic range definitely does. A quick emulation sketch follows the links below
- [a nice diagram](https://cloud.google.com/tpu/docs/bfloat16)
- [they're heavily used in TPUs](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus)
- [careful, it's not a panacea](https://arxiv.org/pdf/2112.11446) (pg 53)
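- a quick way to see both claims (a sketch, not any library’s real converter — hardware typically rounds-to-nearest-even rather than truncating):
  ```python
  import numpy as np

  def to_bfloat16(x):
      # emulate fp32 -> bfloat16 by zeroing the low 16 bits (i.e. dropping 16 mantissa bits)
      bits = np.asarray(x, dtype=np.float32).view(np.uint32)
      return (bits & np.uint32(0xFFFF0000)).view(np.float32)

  x = np.array([3.14159265, 1e38, 1e-38], dtype=np.float32)
  print(x)               # full fp32 values
  print(to_bfloat16(x))  # only ~2-3 decimal digits survive, but the full fp32 exponent range does
  ```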
**broadcasting** - conventions for making it easier to perform binary operations on arrays without needing to first massage them into matching shapes; a small numpy example follows the links
- [numpy](https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules)
- [pytorch](https://pytorch.org/docs/stable/notes/broadcasting.html)
- [onnx](https://github.com/onnx/onnx/blob/main/docs/Broadcasting.md)
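- a minimal numpy illustration of those rules (pytorch and onnx follow the same conventions):
  ```python
  import numpy as np

  a = np.ones((3, 1, 5))           # shape (3, 1, 5)
  b = np.arange(4).reshape(4, 1)   # shape    (4, 1)

  # shapes are compared right-to-left: a missing dim counts as size 1,
  # and any size-1 dim is stretched to match, so (3, 1, 5) + (4, 1) -> (3, 4, 5)
  c = a + b
  print(c.shape)  # (3, 4, 5)
  ```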
**convolution** - what deep learning frameworks call convolution is what signal processing calls cross-correlation (the kernel isn’t flipped); a quick numpy check follows the links
- though, [it's kind of arbitrary which one you use](https://leimao.github.io/blog/Cross-Correlation-VS-Convolution/)
- [more on the difference](https://www.youtube.com/watch?v=xbO-iIzkBy0)
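- a tiny 1-D sanity check of the difference (numpy's `convolve` flips the kernel, `correlate` doesn't):
  ```python
  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0])
  k = np.array([1.0, 0.0, -1.0])

  # true convolution = cross-correlation with a flipped kernel
  print(np.convolve(x, k, mode="valid"))         # [2. 2.]
  print(np.correlate(x, k[::-1], mode="valid"))  # [2. 2.]
  ```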
**cuBLAS** - CUDA-powered-[BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms), linear algebra routines which run on the GPU over GPU memory. cuBLAS calls can be added to CUDA streams and CUDA graphs.
- [official docs](https://docs.nvidia.com/cuda/cublas/index.html)
**cuDNN (CUDA Deep Neural Network)** - a collection of libraries (c [and c++](https://github.com/NVIDIA/cudnn-frontend) interface) implementing common primitives (convolution, attention, gelu, pooling, etc). Includes an API for declaring the computational graph, which is then [compiled into kernels at runtime](https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#generic-runtime-fusion-engines)
- [the constellation of shared libraries](https://docs.nvidia.com/deeplearning/cudnn/latest/api/overview.html#backend-api-overview)
**fused multiply-add (FMA, fmadd)** - the operation `d = a + b*c`
- [ptx reference](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions-fma) (over floats)
- `fma` and `mad` refer to _nearly_ the same operation: `fma` keeps the intermediate `b*c` product exact and rounds only once at the end, while `mad` may round or truncate that intermediate first, depending on which datatype you're using and compute capability you're on. An illustration of the single rounding follows.
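- a CPU-side illustration of why the single rounding matters (`math.fma` needs Python 3.13+; on the GPU the same idea shows up as the `fma` vs `mad` distinction above):
  ```python
  import math

  a = 1.0 + 2.0**-30
  b = 1.0 - 2.0**-30
  # exactly, a*b = 1 - 2**-60, which rounds to 1.0 as a standalone double

  print(a * b - 1.0)           # 0.0       (product rounded before the add)
  print(math.fma(a, b, -1.0))  # ~-8.7e-19 (= -2**-60; product kept exact, one rounding at the end)
  ```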
**general matrix multiplication (GEMM)** - the operation $Y = \alpha AB + \beta C$, where $\alpha$ and $\beta$ are scalars, $A$, $B$, and $C$ are matrices, and $A$ and $B$ may each optionally be transposed; a reference numpy version follows the links
- [onnx docs](https://github.com/onnx/onnx/blob/main/docs/Operators.md#gemm)
- [other onnx docs](https://onnx.ai/onnx/operators/onnx__Gemm.html)
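- the same operation as a throwaway numpy reference (the hard part of a real GEMM library is tiling and scheduling, not the formula):
  ```python
  import numpy as np

  def gemm(A, B, C, alpha=1.0, beta=1.0, trans_a=False, trans_b=False):
      # reference GEMM: Y = alpha * op(A) @ op(B) + beta * C
      A = A.T if trans_a else A
      B = B.T if trans_b else B
      return alpha * (A @ B) + beta * C

  A, B, C = np.random.randn(4, 3), np.random.randn(3, 5), np.random.randn(4, 5)
  print(gemm(A, B, C, alpha=2.0, beta=0.5).shape)  # (4, 5)
  ```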
**loss-scaling** - when training a model you’d much rather use fp16 than fp32: arithmetic gets faster, and memory (and memory bandwidth) is at a premium! However, activation gradients tend to be quite small, so small that most of them round to 0 when stored in fp16 (see figure 3 in the first link). If the loss is scaled up before starting the backward pass (and the weight gradients are unscaled before the optimizer applies them), the gradients stay within the range fp16 can accurately represent. A PyTorch sketch follows the links.
- [original paper](https://arxiv.org/pdf/1710.03740) (ICLR 2018)
- [adaptive loss scaling](https://arxiv.org/pdf/1910.12385) (2019) removes the need to manually tune the loss-scale, it uses gradient statistics to discover a reasonable scale factor for each layer.
- **bfloat16** values have the same dynamic range as your fp32 weights so they do not require loss-scaling
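- roughly what this looks like in PyTorch (a sketch with a placeholder model; the `GradScaler` owns the scale factor, grows/shrinks it automatically, and unscales before the optimizer step):
  ```python
  import torch

  model = torch.nn.Linear(128, 10).cuda()
  opt = torch.optim.SGD(model.parameters(), lr=1e-2)
  scaler = torch.cuda.amp.GradScaler()

  x = torch.randn(32, 128, device="cuda")
  y = torch.randint(0, 10, (32,), device="cuda")

  opt.zero_grad()
  with torch.autocast("cuda", dtype=torch.float16):
      loss = torch.nn.functional.cross_entropy(model(x), y)
  scaler.scale(loss).backward()  # backward runs on the scaled loss
  scaler.step(opt)               # grads are unscaled (and checked for inf/nan) before the update
  scaler.update()                # scale factor is adjusted if an overflow was seen
  ```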
**multiply-accumulate (MAC)** - the operation `a = a + (b*c)`
**NCCL (“nickel”)** - a collection of kernels for communication between interlinked (e.g. NVLink) NVIDIA GPUs and a C library for invoking those kernels. There are functions for directly invoking these kernels, convenience functions for controlling multiple GPUs at once, and there’s the possibility of adding these kernels [to CUDA graphs](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/cudagraph.html).
- [official docs](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html)
- [supported primitives](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html)
- a “rank” identifies one GPU within your clique of communicating GPUs, much like an MPI rank: rank 0 is the first participant, rank 1 the second, and so on (it need not coincide with the CUDA device index). A `torch.distributed` sketch follows.
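- a sketch of how most people actually reach NCCL, via `torch.distributed` (assumes one process per GPU, launched with something like `torchrun --nproc_per_node=2 allreduce.py`):
  ```python
  import torch
  import torch.distributed as dist

  dist.init_process_group(backend="nccl")  # RANK / WORLD_SIZE / MASTER_ADDR come from the launcher
  rank = dist.get_rank()
  torch.cuda.set_device(rank)

  t = torch.full((4,), float(rank), device="cuda")
  dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank ends up holding the sum across ranks
  print(rank, t)

  dist.destroy_process_group()
  ```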
**olive** - takes a computational graph (such as an `.onnx`) and applies configurable optimizations and specializes it for your target hardware
- [main repo](https://github.com/microsoft/Olive)
- [example pipelines](https://microsoft.github.io/Olive/examples.html)
**ONNX** - a common interchange format for computational graphs. You can for example export an `.onnx` from pytorch (sketch below) and then run it in **TensorRT**
- [main repo](https://github.com/onnx/onnx)
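- e.g. exporting a toy pytorch module (file and tensor names here are just for illustration):
  ```python
  import torch

  model = torch.nn.Linear(4, 2).eval()
  dummy = torch.randn(1, 4)  # example input used to trace the graph
  torch.onnx.export(model, dummy, "linear.onnx",
                    input_names=["x"], output_names=["y"])
  ```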
**ONNX runtime** - an abstraction layer which takes your onnx model and runs it with the appropriate execution provider for your current hardware, e.g. on NVIDIA hardware it can use **TensorRT**; a minimal usage sketch follows the link
- [Supported Execution Providers](https://onnxruntime.ai/docs/execution-providers/#summary-of-supported-execution-providers)
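- a minimal sketch, reusing the toy `linear.onnx` from above (which providers are actually available depends on how onnxruntime was built):
  ```python
  import numpy as np
  import onnxruntime as ort

  sess = ort.InferenceSession(
      "linear.onnx",
      providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back left-to-right
  )
  (out,) = sess.run(None, {"x": np.random.randn(1, 4).astype(np.float32)})
  print(out.shape)  # (1, 2)
  ```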
**prompt engineering** - an iterative, vibes-based process of finding the right input to coax the intended output from a model. Instruct models are trained to make certain behaviors easier to elicit, but many other behaviors do not come naturally. This can be difficult work. Projects like [dspy](https://github.com/stanfordnlp/dspy), [outlines](https://github.com/outlines-dev/outlines) and [[Sequential Monte Carlo Steering|SMC steering]] are all attempts to avoid it where possible.
- [lil'log](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)
**singular value decomposition (SVD)** - an important matrix factorization commonly used for dimensionality reduction. Among other things, it provides an optimal low-rank approximation to any matrix (numpy sketch below the links). Closely related to principal component analysis (PCA).
- [Steve Brunton](https://databookuw.com/page-2/page-4/)
- [Jonathon Shlens](https://arxiv.org/pdf/1404.1100)
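- the low-rank claim in a few lines of numpy (Eckart–Young: the best rank-k approximation keeps the top k singular triplets, and its 2-norm error is the next singular value):
  ```python
  import numpy as np

  A = np.random.randn(100, 50)
  U, S, Vt = np.linalg.svd(A, full_matrices=False)

  k = 10
  A_k = (U[:, :k] * S[:k]) @ Vt[:k]        # best rank-k approximation of A
  print(np.linalg.norm(A - A_k, 2), S[k])  # these two numbers match
  ```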
**tensor core** - NVIDIA GPUs are a collection of **streaming multiprocessors (SM)** running in parallel. Each **SM** is endowed with a number of heterogeneous cores; at each cycle some subset of the available cores is issued an instruction. Tensor Cores are one type of available core
- They implement a matrix **fused multiply-add**, operations of the form `D = A*B+C`
- They were first introduced in Volta, capability 7.0, sm_70
- [Functions for using tensor cores in kernels](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-matrix-functions)
- [Only small matrices are supported](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-type-sizes), up to 16x16x16 (though larger matrices are easily handled [via tiling](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html))
**TensorRT** - takes a model computation graph (e.g. inferred from pytorch tracing, parsed from onnx, manually specified with their API), compiles the graph, and executes it (NVIDIA only)
- [Docs](https://docs.nvidia.com/deeplearning/tensorrt/)
- [Example notebook](https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb)
**trainium** - the AWS accelerator, their version of a GPU/TPU. Very few architectural details seem to be available
- [high-level architecture](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html)
**triton** - a DSL (most commonly used from inside python) which compiles down to GPU kernels; a small example kernel follows the links
- [original paper](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf) (2019) - the pitch here is: it gets within 90% performance of hand-crafted **cuBLAS** routines. So, you still want to use **cuBLAS** when there’s an available routine but when the thing you’re doing is too new for that you can write it in high-level triton instead of low-level CUDA.
- [official docs](https://triton-lang.org/main/index.html)
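- what a kernel looks like (the vector-add from triton's own tutorials, lightly commented; needs a CUDA device):
  ```python
  import torch
  import triton
  import triton.language as tl

  @triton.jit
  def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
      pid = tl.program_id(axis=0)                            # which block of the 1-D grid am I?
      offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements                            # guard the ragged final block
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      tl.store(out_ptr + offsets, x + y, mask=mask)

  x = torch.randn(4096, device="cuda")
  y = torch.randn(4096, device="cuda")
  out = torch.empty_like(x)
  grid = (triton.cdiv(x.numel(), 1024),)
  add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
  assert torch.allclose(out, x + y)
  ```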
**triton inference server** - server for taking requests, scheduling them (including batching) across some number of NVIDIA GPUs, using models written in any of a variety of frameworks, and returning results. Includes niceties like tracing and a health/metrics endpoint
- [Architecture](https://github.com/triton-inference-server/server/blob/44bc109df0e780d050856bb58fbbfba9476e9f26/docs/user_guide/architecture.md)