Core Architecture

Tokens

The atomic units of language that AI models actually process

What it is

Tokens are the vocabulary of a transformer model, the discrete units that the model reads and generates. Before training, a tokenizer is built that maps text to token IDs (integers) the model can process.

Most LLMs use subword tokenization, typically BPE (Byte Pair Encoding), which starts from individual characters or bytes and repeatedly merges the most frequent adjacent pairs in the training data into larger tokens. Common words like "dog" become single tokens; rarer words like "artificial" might split into "art" and "ificial." This balances vocabulary size against coverage.
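The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal illustration on a made-up three-word corpus, not a production tokenizer (real implementations work on bytes, handle pre-tokenization, and tie-break differently):

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Start from characters: each word is a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

# Toy corpus: "dog" is frequent, so its pieces get merged first.
corpus = ["dog"] * 10 + ["dogs"] * 6 + ["art"] * 5
merges = learn_bpe_merges(corpus, 3)
# merges: [('d', 'o'), ('do', 'g'), ('dog', 's')]
```

Note how the frequent word "dog" becomes a single token after two merges, while "art" stays in characters — exactly the frequency-driven behavior described above.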

Tokens are not always whole words, and they don't map 1-to-1 with characters. "ChatGPT is great" might be 5 tokens. This has real implications: token counts determine cost, speed, and context window usage.
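To make the "5 tokens" point concrete, here is a toy greedy longest-match tokenizer with a hand-picked vocabulary. The vocabulary, the split, and the per-token price are all invented for illustration; real tokenizers use learned merges and real pricing varies by model:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible substring first, shrinking until a vocab hit.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Hand-picked toy vocabulary; note the leading spaces on " is" and " great".
vocab = {"Chat", "G", "PT", " is", " great"}
tokens = greedy_tokenize("ChatGPT is great", vocab)
# tokens: ['Chat', 'G', 'PT', ' is', ' great'] -- 5 tokens, not 3 words

# Hypothetical price of $0.01 per 1,000 tokens: cost scales with token count.
cost = len(tokens) * 0.01 / 1000
```

The same principle drives context-window budgeting: what fits is measured in tokens like these, not in words or characters.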

Why it matters

Token counting directly affects your API costs, latency, and what fits in a context window. When a client asks why their long document can't fit in one prompt, or why the API bill is higher than expected, tokens are the answer. The famous example of LLMs failing to count the r's in "strawberry" also comes down to tokenization: the model sees subword units, not individual characters.
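The "strawberry" failure can be sketched directly. The subword split and the token IDs below are made up for illustration (actual splits and IDs vary by tokenizer); the point is only that the model's input is integers, with the letters hidden inside:

```python
# Hypothetical subword split of "strawberry" -- real tokenizers split differently.
subwords = ["str", "awberry"]
token_ids = {"str": 496, "awberry": 675}  # made-up IDs for illustration

# What the model actually receives: a list of integers, not letters.
model_input = [token_ids[t] for t in subwords]
# model_input: [496, 675]

# Counting the r's needs character-level access the model never gets:
r_count = "strawberry".count("r")  # 3, but invisible inside the IDs above
```

From the model's perspective, asking how many r's are in [496, 675] is a question about the insides of opaque symbols — which is why character-counting, spelling, and letter-reversal tasks trip LLMs up.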

Resources

Let's Build the GPT Tokenizer
youtube.com· The comprehensive tokenization resource. First ~14 min covers what tokenization is and why it matters (with demos using the tiktokenizer web app); first 30 min covers why tokens cause LLM weirdness (bad spelling, arithmetic issues, non-English problems). Full video dives into BPE implementation. Recommend the intro segment for all recruits; full video for those who want depth.
133 min
Deep Dive into LLMs like ChatGPT (Tokenization section)
youtube.com· Accessible general-audience explanation of tokenization within the broader LLM training pipeline. Less technical than the dedicated tokenizer video above. Great quick intro before the deeper tokenizer video.
7 min
Tokenization in Large Language Models, Explained
seantrott.substack.com· Linguist/cognitive-scientist-friendly explanation of tokenization. Covers why tokens aren't words, different tokenization techniques, connections to morphology, and how tokenization impacts what LLMs learn. Great for non-CS recruits.
12 min
The Technical User's Introduction to LLM Tokenization
christophergs.com· Based on Karpathy's lecture but distilled into a readable article. Practical guide with live examples using the Tiktokenizer tool. Covers BPE, why tokenization causes weird LLM behavior, and shows how the same text tokenizes differently across models.
15 min