Attention Mechanisms
How models decide what to focus on, the core of what makes transformers powerful
What it is
Attention is the mechanism that determines how much weight to give each token in the input when processing any other token. During training, the model learns which relationships between tokens matter: for example, a pronoun attending strongly to the noun it refers to, or "not" attending to the word that follows it.
Every token is compared to every other token, producing an attention score for each pair. These scores are normalized (typically with a softmax, so each token's weights sum to 1) and used to form a weighted sum of token representations. The result is that each token's representation is "enriched" with context from the rest of the sequence.
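A minimal sketch of this computation in NumPy, using the common scaled dot-product formulation (the division by the square root of the dimension is a standard detail not mentioned above; the random input is purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Pairwise scores: every query is compared against every key.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Normalize each row so one token's weights sum to 1.
    weights = softmax(scores, axis=-1)
    # Weighted sum of value vectors: each token's context-enriched representation.
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))
out, weights = attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)                   # (4, 8)
print(weights.sum(axis=-1))        # each row sums to 1
```

In a real transformer, Q, K, and V are separate learned linear projections of the input rather than the raw token vectors, but the comparison-normalize-sum pattern is the same.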
This all-to-all comparison is what makes attention powerful, and also what makes it computationally expensive. The number of pairwise comparisons grows with the square of the sequence length, so doubling the input roughly quadruples the work, which is a core reason why context windows are hard to extend.
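The quadratic growth is easy to see by counting the entries in the score matrix, which has one entry per token pair (the sequence lengths here are arbitrary examples):

```python
# One attention score per (query, key) pair: an n x n matrix for n tokens.
for n in [512, 1024, 2048]:
    print(f"{n:>5} tokens -> {n * n:>9} pairwise scores")

# Doubling the sequence length quadruples the score-matrix size.
assert 1024 * 1024 == 4 * (512 * 512)
```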