
Training vs. Inference Compute

The very different hardware demands of building vs. running a model

What it is

Training requires storing not just the model weights but also gradients, optimizer states (e.g., Adam's two moment estimates per parameter), and the intermediate activations needed for backpropagation; the activations alone can take roughly 3-4x the memory of the weights. Training also requires higher-precision floating-point arithmetic, whereas inference can run on quantized (lower-precision) weights.
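A back-of-the-envelope sketch of the gap. The byte counts assume Adam with mixed-precision training (a common setup: fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments, about 16 bytes per parameter) and the activation multiplier is an illustrative assumption, not a measured figure:

```python
# Rough memory estimates in GB for an N-billion-parameter model.
# Assumptions: Adam + mixed precision for training, fp16 or int4 for
# inference; activation overhead is a coarse multiplier, and the
# inference KV cache is ignored for simplicity.

def training_mem_gb(params_b: float, activation_factor: float = 3.5) -> float:
    """fp16 weights (2B) + fp16 grads (2B) + fp32 master weights (4B)
    + Adam moments (8B) = 16 bytes/param, plus activations."""
    states = params_b * 16                            # GB (params in billions)
    activations = params_b * 2 * activation_factor    # relative to fp16 weights
    return states + activations

def inference_mem_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """Forward pass only: just the (possibly quantized) weights."""
    return params_b * bytes_per_param

print(f"70B training        : ~{training_mem_gb(70):.0f} GB")
print(f"70B inference (fp16): ~{inference_mem_gb(70):.0f} GB")
print(f"70B inference (int4): ~{inference_mem_gb(70, 0.5):.0f} GB")
```

Under these assumptions a 70B model needs on the order of 1.5 TB of accelerator memory to train, ~140 GB to serve in fp16, and ~35 GB quantized to 4 bits, which is how it fits on a handful of consumer GPUs.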

Inference is dramatically lighter: a single forward pass, with no gradient or optimizer storage. A quantized 70B model can run on 2-4 consumer GPUs. Inference can also be batched: because decoding is typically bound by memory bandwidth rather than arithmetic, serving 50 simultaneous requests takes similar compute per step to serving 1 on the same hardware, which is why API providers can offer low per-token costs at scale.
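The batching claim can be made concrete with a toy cost model: each decode step must stream the full weights from GPU memory once, regardless of how many requests share that step (until the GPU becomes compute-bound). The bandwidth figure below is an illustrative assumption:

```python
# Toy model of why batching lowers per-token cost. Each decode step
# streams all weights through memory once; batched requests share that
# cost. Numbers are illustrative assumptions, not measurements.

WEIGHT_BYTES = 70e9 * 2   # 70B parameters at fp16
BANDWIDTH = 2e12          # assumed ~2 TB/s of accelerator memory bandwidth

step_seconds = WEIGHT_BYTES / BANDWIDTH   # time to stream weights once

for batch in (1, 10, 50):
    tokens_per_second = batch / step_seconds
    ms_per_token = step_seconds / batch * 1e3
    print(f"batch={batch:3d}: ~{tokens_per_second:,.0f} tokens/s "
          f"(~{ms_per_token:.2f} ms/token amortized)")
```

In this model throughput scales linearly with batch size while hardware cost stays fixed, so per-token cost at batch 50 is one fiftieth of the single-request cost.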

This creates two distinct market segments: training clusters (massive, specialized, centralized) and inference infrastructure (more distributed, easier to deploy).

Why it matters

These differences inform every deployment decision. "Can we self-host this model?" depends on inference requirements, not training requirements. "Why is the API cheaper than renting the compute to run it myself?" comes down to batching efficiency at scale. "Why is fine-tuning cheaper than pre-training?" Because the compute profile is completely different.
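The fine-tuning question yields to simple arithmetic. Training compute scales roughly as 6 x parameters x tokens for dense transformers, and fine-tuning sees vastly fewer tokens than pre-training; the token counts below are illustrative assumptions:

```python
# Why fine-tuning is orders of magnitude cheaper than pre-training:
# compute scales with tokens processed, and fine-tuning datasets are
# tiny by comparison. Token counts are illustrative assumptions.

def train_flops(params: float, tokens: float) -> float:
    # Standard approximation for dense transformer training compute.
    return 6 * params * tokens

pretrain = train_flops(70e9, 15e12)   # assume ~15T pre-training tokens
finetune = train_flops(70e9, 20e6)    # assume ~20M fine-tuning tokens

print(f"pre-training: {pretrain:.2e} FLOPs")
print(f"fine-tuning : {finetune:.2e} FLOPs")
print(f"ratio       : {pretrain / finetune:,.0f}x")
```

Under these assumptions fine-tuning costs less than a millionth of pre-training, before even counting parameter-efficient methods like LoRA that update only a small fraction of the weights.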
