Benchmarking LLMs
How we measure AI capability, and why benchmarks are tricky
What it is
Benchmarks are standardized test suites used to evaluate and compare LLM capabilities. Common benchmarks include MMLU (multiple-choice knowledge across 57 subjects), HumanEval (code generation), MATH (mathematical reasoning), and LMSYS Chatbot Arena (human preference-based Elo ratings).
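Most static benchmarks like MMLU reduce to one number: the fraction of questions the model answers correctly. A minimal sketch of that scoring loop, with a hypothetical question format and model_answer function standing in for a real harness:

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# The question dicts and the model_answer callable are hypothetical
# stand-ins, not the format of any real evaluation harness.

def score_multiple_choice(questions, model_answer):
    """Return accuracy: the fraction of questions answered correctly."""
    correct = sum(
        1
        for q in questions
        if model_answer(q["prompt"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

# Toy example: a "model" that always picks choice "A".
questions = [
    {"prompt": "2 + 2 = ?", "choices": ["4", "5"], "answer": "A"},
    {"prompt": "Capital of France?", "choices": ["Rome", "Paris"], "answer": "B"},
]
always_a = lambda prompt, choices: "A"
print(score_multiple_choice(questions, always_a))  # 0.5
```

Generation benchmarks like HumanEval replace the equality check with a functional test (does the generated code pass its unit tests?), but the aggregation is the same idea.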
Different benchmarks measure different things, and a model can score highly on one while underperforming on another. There is also significant concern about benchmark contamination: models trained on data that happens to include benchmark questions score artificially high without being correspondingly capable.
The field is increasingly moving toward human evaluation as a more reliable signal, since human raters are harder to game than fixed question sets.