Training vs. Inference Compute
The very different hardware demands of building vs. running a model
What it is
Training requires storing not just the model weights but also gradients, optimizer state, and the intermediate activations needed for backpropagation, which together take roughly 3-4x the memory of the weights alone. Training also requires higher-precision floating-point arithmetic, whereas inference can run on quantized (lower-precision) weights.
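The gap can be sketched with a back-of-envelope calculator. The byte counts below are illustrative assumptions, not figures from the source: fp16 weights and gradients (2 bytes per parameter each), an Adam-style optimizer keeping fp32 master weights plus two fp32 moment buffers (12 bytes per parameter), and activation memory roughly equal to the weight memory (in practice this varies with batch size and sequence length).

```python
def training_memory_gb(params_billion: float) -> float:
    """Rough training-memory estimate under the assumptions above."""
    bytes_per_param = (
        2      # fp16 weights
        + 2    # fp16 gradients
        + 12   # Adam: fp32 master weights + two fp32 moment buffers
    )
    weights_gb = params_billion * 2   # fp16 weights alone, in GB
    activations_gb = weights_gb       # crude assumption: activations ~ weights
    return params_billion * bytes_per_param + activations_gb

def inference_memory_gb(params_billion: float, bits: int = 4) -> float:
    """Quantized weights only: no gradients, optimizer state, or activations cache."""
    return params_billion * bits / 8

print(training_memory_gb(70))   # ~1260 GB: a multi-node training cluster
print(inference_memory_gb(70))  # 35 GB: fits on a few consumer GPUs
```

Under these assumptions a 70B model needs on the order of a terabyte of memory to train but only tens of gigabytes to serve at 4-bit precision, which is the asymmetry the section describes.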
Inference is dramatically lighter: only a forward pass, with no gradient or optimizer storage. A quantized 70B model can run on 2-4 consumer GPUs. Inference can also be batched: serving 50 simultaneous requests takes roughly the same compute as serving 1 on the same hardware, which is why API providers can offer low per-token costs at scale.
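A toy cost model shows why batching is nearly free. At small batch sizes a decoding step is dominated by streaming the weights from GPU memory, which happens once per step regardless of how many requests are in the batch. The millisecond figures below are invented for illustration, not benchmarks:

```python
# Assumed (hypothetical) costs for one decoding step:
WEIGHT_LOAD_MS = 20.0   # streaming the weights from GPU memory, paid once per step
PER_REQUEST_MS = 0.1    # marginal arithmetic per request in the batch

def step_time_ms(batch_size: int) -> float:
    """Wall-clock time for one decoding step over a whole batch."""
    return WEIGHT_LOAD_MS + PER_REQUEST_MS * batch_size

def per_request_ms(batch_size: int) -> float:
    """Effective cost per request: the fixed weight-load cost is amortized."""
    return step_time_ms(batch_size) / batch_size

print(per_request_ms(1))   # 20.1 ms per request
print(per_request_ms(50))  # 0.5 ms per request: ~40x cheaper per token
```

The fixed cost is shared across the batch, so per-request cost falls almost linearly until compute, rather than memory bandwidth, becomes the bottleneck.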
This creates two distinct market segments: training clusters (massive, specialized, centralized) and inference infrastructure (more distributed, easier to deploy).