Core Architecture

Attention Mechanisms

How models decide what to focus on, the core of what makes transformers powerful

What it is

The attention mechanism is an algorithm that determines how much weight to give each token in the input when processing any other token. During training, the model learns which relationships between tokens matter: for example, a pronoun attending strongly to the noun it refers to, or "not" attending strongly to the word that follows it.

Every token is compared to every other token, producing an attention score for each pair. These scores are normalized (typically with a softmax) and used to form a weighted sum of token representations. The result is that each token's representation is "enriched" with context from the rest of the sequence.
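The comparison-normalize-sum pipeline above can be sketched in a few lines of numpy. This is a minimal single-head illustration, not any particular model's implementation: the projection matrices here are random stand-ins for what a real model would learn during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X.

    X: (seq_len, d_model) token representations
    Wq, Wk, Wv: projection matrices (learned in a real model; random here)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token compared to every other token: a (seq_len, seq_len) score matrix
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row normalized to sum to 1
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # each of the 4 tokens gets a context-enriched vector
```

Note that the output has the same sequence length as the input; what changes is that each token's vector now mixes in information from every other token, in proportion to the attention weights.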

This all-to-all comparison is what makes attention powerful, and also what makes it computationally expensive. Longer inputs require quadratically more computation, which is a core reason why context windows are hard to extend.
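A quick back-of-the-envelope sketch makes the quadratic growth concrete (counting only pairwise score entries, ignoring constant factors and the projection costs):

```python
# Attention compares every token with every other, so the score matrix has
# seq_len * seq_len entries: doubling the context length quadruples the work.
def score_entries(seq_len):
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {score_entries(n):>12,} pairwise scores")
```

Going from 1,000 to 4,000 tokens is only 4x more input, but 16x more pairwise scores to compute and store, which is why long context windows are expensive.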

Why it matters

Attention is the reason LLMs understand context rather than just pattern-matching. It explains why context window size is such a hard engineering problem (quadratic compute cost), why models sometimes "lose track" of things in very long documents, and why the concept of "relevant context" matters when you're prompting a model. If you understand attention, you understand the biggest architectural constraint in modern AI.

Resources

Attention in Transformers, step-by-step
youtube.com· The definitive visual explanation of queries, keys, values, multi-head attention, and cross-attention. Also listed under Transformers but this is THE attention resource.
26 min
The Matrix Math Behind Transformer Neural Networks, One Step at a Time
youtube.com· Walks through the actual matrix multiplications behind attention, step by step. Great for recruits who want the math without it being overwhelming.
20 min
Transformer Neural Networks, ChatGPT's Foundation, Clearly Explained
youtube.com· Covers self-attention as part of the full encoder-decoder transformer. Builds intuition step-by-step before adding complexity.
36 min
The Illustrated Attention
jalammar.github.io· Precursor to the Illustrated Transformer. Explains attention from first principles using the translation analogy (encoder-decoder). A gentle on-ramp before the full transformer article, good if you want the "why" before the "how."
15 min
Attention Is All You Need: A Complete Guide to Transformers
medium.com· Thorough breakdown of the original "Attention Is All You Need" paper in accessible language. Uses the YouTube search analogy for Q/K/V which is very beginner-friendly. Good for recruits who want to understand the source material without reading the full paper.
20 min