Ultra-Scale Playbook vol-1 - Single GPU

Introduction

Optimally scaling LLM training across multiple GPUs requires finding the right tradeoffs between memory, computational efficiency, and communication overhead.

Each of the three, memory, compute efficiency, and communication overhead, needs careful tuning, because improving one typically comes at the cost of another.

Communication

Communication between GPUs and between nodes can leave GPUs idle while they wait for data. That’s detrimental to overall efficiency because we want our GPUs to spend as much time as possible computing. Reducing the idle time means optimising intra-node and inter-node bandwidth usage, minimising data transfers, and cutting the time GPUs spend waiting for or syncing with each other.

Memory

As for memory, storing all activations is expensive: their memory footprint grows quadratically with the sequence length $seq$. Instead, store only the activations that are expensive to recompute, discard the rest, and recompute the discarded ones in the backward pass. This is called gradient checkpointing, or activation recomputation, and it trades compute for memory.

Attention scores and matrices are an example of activations worth recomputing. FlashAttention recomputes the attention scores and matrices in the backward pass out of the box.
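To make the store-versus-recompute trade concrete, here is a minimal pure-Python sketch of gradient checkpointing on a toy chain of scalar `tanh` layers. Everything in it (the `layer` function, the `every` parameter, the way stored values are counted) is a hypothetical illustration, not how any particular framework implements it.

```python
import math

def layer(w, x):
    # Toy "layer": a scalar tanh unit.
    return math.tanh(w * x)

def layer_grad_x(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    t = math.tanh(w * x)
    return w * (1.0 - t * t)

def backward_full(ws, x0):
    """Store the input of every layer, then backprop d(output)/d(x0)."""
    inputs = [x0]
    for w in ws[:-1]:
        inputs.append(layer(w, inputs[-1]))
    grad = 1.0
    for w, x in zip(reversed(ws), reversed(inputs)):
        grad *= layer_grad_x(w, x)
    return grad, len(inputs)

def backward_checkpointed(ws, x0, every):
    """Store only every `every`-th layer input; recompute the rest per segment."""
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(ws):
        x = layer(w, x)
        if (i + 1) % every == 0 and (i + 1) < len(ws):
            checkpoints[i + 1] = x
    grad = 1.0
    peak_stored = len(checkpoints)
    end = len(ws)
    # Walk the segments from the last one to the first.
    for start in sorted(checkpoints, reverse=True):
        # Recompute the layer inputs inside this segment from its checkpoint.
        seg = [checkpoints[start]]
        for w in ws[start:end - 1]:
            seg.append(layer(w, seg[-1]))
        peak_stored = max(peak_stored, len(checkpoints) + len(seg) - 1)
        for j in reversed(range(end - start)):
            grad *= layer_grad_x(ws[start + j], seg[j])
        end = start
    return grad, peak_stored

ws = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.6, -1.2]  # 8 toy layers
x0 = 0.4
grad_full, stored_full = backward_full(ws, x0)
grad_ckpt, stored_ckpt = backward_checkpointed(ws, x0, every=4)
print(grad_full, grad_ckpt, stored_full, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version keeps fewer values alive at once and pays for it with an extra forward pass through each segment.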

A consideration regarding GPU architecture: GPUs have a limited amount of high-speed memory, and accessing memory is typically slower than performing computations.

Compute-efficiency

Activation memory still scales linearly with the batch size $bs$, which is why we use gradient accumulation: sum (in practice, average) the gradients across $k$ forward and backward passes, each with batch size $bs/k$, and only then take an optimiser step. This requires gradient buffers that persist across the $k$ passes of a training step.

However, gradient accumulation is no free lunch: the same $bs$ samples now take $k$ sequential forward and backward passes at the smaller per-pass batch size $bs/k$, which adds computational overhead. Taking an optimiser step only once every $k$ passes does not offset the cost of the additional forward and backward passes.
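The accumulation scheme can be checked on a toy problem. The sketch below assumes a hypothetical one-parameter model `y_hat = w * x` with a mean-squared-error loss and equally sized micro-batches; under those assumptions, averaging the $k$ micro-batch gradients reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, k):
    """Average the gradients of k equally sized micro-batches."""
    bs = len(xs)
    assert bs % k == 0, "this sketch assumes bs is divisible by k"
    micro = bs // k
    grad_buffer = 0.0  # persists across the k passes of one training step
    for i in range(k):
        lo, hi = i * micro, (i + 1) * micro
        # One forward + backward pass on a micro-batch of size bs / k.
        grad_buffer += grad_mse(w, xs[lo:hi], ys[lo:hi]) / k
    return grad_buffer

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)                  # one pass with batch size 8
accum = accumulated_grad(w, xs, ys, k=4)    # four passes with batch size 2
print(full, accum)

# Only after all k passes is the optimiser step taken (plain SGD here).
lr = 0.01
w_new = w - lr * accum
```

The two gradients agree up to floating-point rounding. Only one micro-batch is processed at a time, while `grad_buffer` has to stay alive for the whole step.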

What’s neat is that we can parallelise the $k$ forward and backward passes. They are independent of one another, and their gradients only need to be summed to accumulate the final gradient. However, running them in parallel needs more VRAM.
