Interpretability
The research frontier of understanding what's happening inside AI models
What it is
Interpretability (also called mechanistic interpretability or, more broadly, explainability) is the research field focused on understanding the computations neural networks actually perform internally. Current LLMs are largely black boxes: we can observe inputs and outputs, but we have limited insight into the internal representations and algorithms that produce those outputs.
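The basic starting point is that a model's internal activations are fully accessible; the difficulty is interpreting them. A minimal sketch, using a toy two-layer network as a stand-in for a transformer block (illustrative only, not a real LLM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network standing in for one transformer block.
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 4))

def forward(x, record):
    """Run the network and stash the hidden activations for inspection."""
    h = np.maximum(x @ W1, 0.0)   # hidden layer (ReLU)
    record["hidden"] = h          # the internal representation we can probe
    return h @ W2

record = {}
x = rng.normal(size=(8, 16))
logits = forward(x, record)
print(record["hidden"].shape)  # (8, 32): one activation vector per input
```

Recording activations like this is easy; the open research problem is mapping those 32 (or, in a real model, tens of thousands of) dimensions onto human-interpretable concepts.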
Anthropic's interpretability research has identified "features" in model activations that correspond to human-interpretable concepts. Because of superposition, features often do not align with individual neurons; instead they are directions in activation space, extracted with techniques such as sparse autoencoders. Circuit analysis then traces how specific behaviors are implemented through attention heads and MLP layers.
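The idea of a feature as a direction in activation space can be sketched with synthetic data. Here the ground-truth feature directions are constructed to be orthonormal for simplicity (real learned features are not, which is why sparse autoencoders are needed); all names and thresholds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: activations are sparse mixes of known feature directions.
d_model, n_features, n_samples = 64, 10, 100
# Orthonormal feature directions (a simplification of real, non-orthogonal features).
features = np.linalg.qr(rng.normal(size=(d_model, n_features)))[0].T

# Each sample activates a random small subset of features with strength 1-2.
coeffs = rng.uniform(1.0, 2.0, size=(n_samples, n_features))
mask = rng.random((n_samples, n_features)) < 0.2
acts = (coeffs * mask) @ features  # simulated activation vectors

# "Interpreting" feature 0: project each activation onto its direction.
scores = acts @ features[0]
detected = scores > 0.5  # threshold chosen for this toy setup
```

Because the directions are orthonormal here, the projection exactly recovers which samples contain feature 0; in a real model, superposition makes this recovery an optimization problem rather than a single dot product.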
Interpretability matters for: debugging model failures, providing audit trails for high-stakes decisions, detecting deceptive alignment in future, more capable models, and understanding emergent capabilities.