Core Architecture

Transformers

The dominant architecture behind modern AI: parallelizable, scalable, attention-powered

What it is

Transformers are the neural network architecture that powers virtually every modern large language model. Introduced by Google in 2017 in the paper "Attention Is All You Need," transformers replaced older sequential architectures like RNNs by processing all input tokens simultaneously rather than one at a time.

The key innovation is the self-attention mechanism, which allows the model to weigh how relevant each token is to every other token in the input. This gives transformers rich contextual understanding: the word "not" before "good" fundamentally changes the meaning, and the attention mechanism learns to capture this.
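The core of self-attention is the scaled dot-product formula from the original paper: each token is projected into a query, key, and value vector, and the softmax of the query-key dot products determines how much each token attends to every other. A minimal NumPy sketch (dimensions and names are illustrative, not from any real model):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # project tokens to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # relevance of every token to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: attention weights
    return weights @ V                        # each output token mixes all tokens' values

# Toy example: a sequence of 4 tokens with model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one contextualized vector per input token
```

Note that every token's output is computed from all tokens at once with matrix multiplies, which is exactly why the computation parallelizes so well on GPUs.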

Because transformers process input in parallel, they scale extremely well across thousands of GPUs. This parallelizability is what made it possible to train models on internet-scale data and reach the capabilities we see today.

Why it matters

Every major AI model you'll encounter (GPT-4, Claude, Gemini, Llama) is a transformer. When clients ask why AI improved so dramatically after 2017, the answer is transformers. When you're evaluating a new model architecture or reading a research paper, understanding transformers is your baseline. You can't have an intelligent conversation about modern AI without it.

Resources

But what is a GPT? Visual intro to Transformers
youtube.com· The gold standard visual introduction to transformer architecture. Beautiful animations walk through embeddings, softmax, and the predict-sample-repeat loop with 3B1B's signature style. Accessible to non-CS backgrounds.
27 min
Attention in Transformers, step-by-step
youtube.com· Deep dive into self-attention, multi-head attention, and cross-attention. Pairs perfectly with the 3B1B intro above as part 2. Same visual quality as the intro video.
26 min
Decoder-Only Transformers, ChatGPT's Specific Transformer, Clearly Explained
youtube.com· Step-by-step walkthrough of how decoder-only transformers (the kind behind ChatGPT) work. Beginner-friendly. Great complement to 3B1B since it focuses on the specific variant used in modern LLMs.
37 min
The Illustrated Transformer
jalammar.github.io· The gold standard written explainer for transformers. Featured in courses at Stanford, Harvard, MIT, Princeton, and CMU. Excellent diagrams that build intuition piece by piece. Recently expanded into a book (LLM-book.com).
20 min
Transformer Explainer
poloclub.github.io· Runs a live GPT-2 model in your browser. Type text and watch how each component of the transformer processes it in real time. Excellent for building intuition.
15 min
Transformer Explained
youtube.com· Accessible walkthrough of transformer architecture. Good complement to 3B1B's visual approach with a different teaching style.