Key Concepts

Parameters

The learned weights that define a model, and why size matters

What it is

Parameters (also called weights) are the numerical values inside a neural network that are learned during training. For a transformer, these include the attention weight matrices, feed-forward layer weights, and the embedding table. The total count of these values is what people mean by "model size."
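As a back-of-envelope sketch, the total parameter count of a decoder-only transformer can be approximated from its architecture dimensions. The function below counts only the pieces named above (the embedding table, the four attention projection matrices, and the two feed-forward matrices per layer), ignoring biases, layer norms, and position embeddings; the example dimensions are GPT-2-small-like and used purely for illustration.

```python
def approx_param_count(vocab_size: int, d_model: int, n_layers: int,
                       d_ff: int = None) -> int:
    """Rough transformer parameter count (major weight matrices only)."""
    d_ff = d_ff if d_ff is not None else 4 * d_model  # common convention
    embedding = vocab_size * d_model      # token embedding table
    attention = 4 * d_model * d_model     # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff     # up- and down-projection
    return embedding + n_layers * (attention + feed_forward)

# GPT-2-small-like dimensions: vocab 50257, d_model 768, 12 layers
print(approx_param_count(50257, 768, 12))  # → 123532032, roughly 124M
```

The rough count lands close to GPT-2 small's actual ~124M parameters, which shows where the numbers in a model's size label come from: mostly the embedding table plus per-layer attention and feed-forward matrices.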

A model with 7 billion parameters has 7 billion floating-point numbers that collectively encode everything it learned. Larger parameter counts generally mean more capacity to store knowledge and learn complex patterns, but also more compute required for both training and inference.

Model sizes range from tiny (1B parameters, runs on a phone) to massive (estimated ~1.8 trillion for GPT-4's mixture-of-experts architecture).

Why it matters

Parameter count is shorthand for model capability and cost. A 7B parameter model you can run on a laptop has very different cost and capability characteristics than a 70B model that requires a server. When making deployment decisions (self-hosted vs. API, which model tier to use), understanding what parameter count implies about hardware requirements and capability is essential.
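To make the hardware implication concrete, here is a minimal sketch of the memory needed just to hold a model's weights at common numeric precisions. The byte sizes are standard (fp16 = 2 bytes per parameter, int8 = 1, 4-bit quantization ≈ 0.5), but the figures are rough: activations and the KV cache add more memory on top.

```python
def param_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Weight memory for a 7B and a 70B model at common precisions
for count, label in [(7e9, "7B"), (70e9, "70B")]:
    for nbytes, precision in [(2, "fp16"), (1, "int8"), (0.5, "4-bit")]:
        print(f"{label} @ {precision}: {param_memory_gb(count, nbytes):.1f} GB")
```

A 7B model at fp16 needs roughly 14 GB just for weights (or ~3.5 GB quantized to 4-bit), which is why it can fit on consumer hardware, while a 70B model at the same precision needs around 140 GB and typically a multi-GPU server.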

Resources

But what is a neural network?
youtube.com· THE classic visual explainer of neural networks. Beautiful animations build intuition for what weights and biases (parameters) actually are. Essential foundation before any LLM discussion.
19 min
Deep Dive into LLMs like ChatGPT (section: neural network I/O and internals, ~0:14:00)
youtube.com· Explains what billions of parameters represent: a "compressed internet" encoded as weights. Walks through GPT-2's 1.6B parameter architecture with concrete numbers.
17 min
What Are LLM Parameters?
ibm.com· Covers both model parameters (weights learned during training) and hyperparameters (temperature, top-p, context window size). Great for understanding both what "70B" means and how user-facing settings work.
10 min
LLM Settings
promptingguide.ai· Quick reference explaining temperature, top-p, frequency penalty, and other inference-time parameters. Practical companion to the IBM article.
5 min