
Training vs. Inference Compute

The very different hardware demands of building vs. running a model

What it is

Training requires storing not just the model weights but also gradients, optimizer states (e.g., Adam's two moment estimates per parameter), and the intermediate activations needed for backpropagation; the activations alone can take roughly 3-4x the memory of the weights. Training also requires higher-precision floating-point arithmetic, whereas inference can run on quantized (lower-precision) weights.
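A back-of-the-envelope sketch of the gap. The byte counts assume Adam with mixed-precision training (a common setup: fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments, about 16 bytes per parameter) and the activation multiplier is an illustrative assumption, not a measured figure:

```python
# Rough memory estimates in GB for an N-billion-parameter model.
# Assumptions: Adam + mixed precision for training, fp16 or int4 for
# inference; activation overhead is a coarse multiplier, and the
# inference KV cache is ignored for simplicity.

def training_mem_gb(params_b: float, activation_factor: float = 3.5) -> float:
    """fp16 weights (2B) + fp16 grads (2B) + fp32 master weights (4B)
    + Adam moments (8B) = 16 bytes/param, plus activations."""
    states = params_b * 16                            # GB (params in billions)
    activations = params_b * 2 * activation_factor    # relative to fp16 weights
    return states + activations

def inference_mem_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """Forward pass only: just the (possibly quantized) weights."""
    return params_b * bytes_per_param

print(f"70B training        : ~{training_mem_gb(70):.0f} GB")
print(f"70B inference (fp16): ~{inference_mem_gb(70):.0f} GB")
print(f"70B inference (int4): ~{inference_mem_gb(70, 0.5):.0f} GB")
```

Under these assumptions a 70B model needs on the order of 1.5 TB of accelerator memory to train, ~140 GB to serve in fp16, and ~35 GB quantized to 4 bits, which is how it fits on a handful of consumer GPUs.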

Inference is dramatically lighter: a single forward pass, with no gradient or optimizer storage. A quantized 70B model can run on 2-4 consumer GPUs. Inference can also be batched: because decoding is typically bound by memory bandwidth rather than arithmetic, serving 50 simultaneous requests takes similar compute per step to serving 1 on the same hardware, which is why API providers can offer low per-token costs at scale.
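The batching claim can be made concrete with a toy cost model: each decode step must stream the full weights from GPU memory once, regardless of how many requests share that step (until the GPU becomes compute-bound). The bandwidth figure below is an illustrative assumption:

```python
# Toy model of why batching lowers per-token cost. Each decode step
# streams all weights through memory once; batched requests share that
# cost. Numbers are illustrative assumptions, not measurements.

WEIGHT_BYTES = 70e9 * 2   # 70B parameters at fp16
BANDWIDTH = 2e12          # assumed ~2 TB/s of accelerator memory bandwidth

step_seconds = WEIGHT_BYTES / BANDWIDTH   # time to stream weights once

for batch in (1, 10, 50):
    tokens_per_second = batch / step_seconds
    ms_per_token = step_seconds / batch * 1e3
    print(f"batch={batch:3d}: ~{tokens_per_second:,.0f} tokens/s "
          f"(~{ms_per_token:.2f} ms/token amortized)")
```

In this model throughput scales linearly with batch size while hardware cost stays fixed, so per-token cost at batch 50 is one fiftieth of the single-request cost.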

This creates two distinct market segments: training clusters (massive, specialized, centralized) and inference infrastructure (more distributed, easier to deploy).

Why it matters

These differences inform every deployment decision. "Can we self-host this model?" depends on inference requirements, not training requirements. "Why is the API cheaper than renting the compute to run it myself?" comes down to batching efficiency at scale. "Why is fine-tuning cheaper than pre-training?" Because the compute profile is completely different.
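The fine-tuning question yields to simple arithmetic. Training compute scales roughly as 6 x parameters x tokens for dense transformers, and fine-tuning sees vastly fewer tokens than pre-training; the token counts below are illustrative assumptions:

```python
# Why fine-tuning is orders of magnitude cheaper than pre-training:
# compute scales with tokens processed, and fine-tuning datasets are
# tiny by comparison. Token counts are illustrative assumptions.

def train_flops(params: float, tokens: float) -> float:
    # Standard approximation for dense transformer training compute.
    return 6 * params * tokens

pretrain = train_flops(70e9, 15e12)   # assume ~15T pre-training tokens
finetune = train_flops(70e9, 20e6)    # assume ~20M fine-tuning tokens

print(f"pre-training: {pretrain:.2e} FLOPs")
print(f"fine-tuning : {finetune:.2e} FLOPs")
print(f"ratio       : {pretrain / finetune:,.0f}x")
```

Under these assumptions fine-tuning costs less than a millionth of pre-training, before even counting parameter-efficient methods like LoRA that update only a small fraction of the weights.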
