Transformers
The dominant architecture behind modern AI: parallelizable, scalable, and attention-powered
What it is
Transformers are the neural network architecture that powers virtually every modern large language model. Introduced by Google in 2017 in the paper "Attention Is All You Need," transformers replaced older sequential architectures like RNNs by processing all input tokens simultaneously rather than one at a time.
The key innovation is the self-attention mechanism, which allows the model to weigh how relevant each token is to every other token in the input. This gives transformers rich contextual understanding: the word "not" before "good" fundamentally changes the meaning, and the attention mechanism learns to capture this.
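The core of self-attention can be sketched in a few lines. This is a minimal single-head illustration with randomly initialized projection matrices (in a real model these are learned parameters, and models use many heads plus other layers):

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of token vectors.

    x: (seq_len, d) array of token embeddings.
    Returns an array of the same shape, where each output vector is a
    weighted mix of all tokens' value vectors.
    """
    seq_len, d = x.shape
    rng = np.random.default_rng(0)
    # Hypothetical projections for illustration; learned in a trained model.
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Every token scores its relevance against every other token at once:
    # a (seq_len, seq_len) matrix computed with a single matmul.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax each row so the weights over tokens sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Note that the score matrix covers all token pairs and is produced by dense matrix multiplications rather than a step-by-step loop over positions, which is the property that makes the architecture so amenable to GPU parallelism.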
Because transformers process input in parallel, they scale extremely well across thousands of GPUs. This parallelizability is what made it possible to train models on internet-scale data and reach the capabilities we see today.