Tokens
The atomic units of language that AI models actually process
What it is
Tokens are the vocabulary of a transformer model, the discrete units that the model reads and generates. Before training, a tokenizer is built that maps text to token IDs (integers) the model can process.
Most LLMs use subword tokenization, typically Byte Pair Encoding (BPE), which finds the most frequent character sequences in the training data and promotes them to tokens. Common words like "dog" become single tokens; rarer words like "artificial" might split into "art" and "ificial." This balances vocabulary size against coverage.
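The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration, not any production tokenizer: it starts from characters, repeatedly finds the most frequent adjacent pair across a tiny made-up corpus, and merges it into a new symbol. The corpus and merge count are invented for the example.

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with a frequency
corpus = {tuple("dog"): 10, tuple("dogs"): 5, tuple("doting"): 2}
for _ in range(3):  # perform 3 merges
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
print(corpus)  # "dog" ends up as a single symbol; "doting" stays split
```

After three merges, the frequent word "dog" has collapsed into one token while the rarer "doting" remains in pieces, which is exactly the frequency-driven behavior described above.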
Tokens are not always whole words, and they don't map one-to-one to characters. "ChatGPT is great" might be 5 tokens. This has real implications: token counts determine cost, speed, and context window usage.
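Since API pricing is usually quoted per token (or per 1,000 tokens), the cost of a call falls straight out of the token counts. A minimal sketch, with illustrative prices and counts that are not any provider's real rates:

```python
def estimate_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Estimate API cost given token counts and per-1,000-token prices.

    Prices here are hypothetical placeholders, not real provider rates.
    """
    return (prompt_tokens / 1000 * in_price
            + completion_tokens / 1000 * out_price)

# A 1,200-token prompt and a 300-token completion at assumed rates
cost = estimate_cost(1200, 300, in_price=0.01, out_price=0.03)
print(f"${cost:.4f}")  # prints "$0.0210"
```

The same counts also bound context usage: prompt plus completion tokens must fit within the model's context window, which is why long prompts leave less room for output.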