AI Benchmarks
Confidence 0.85 · 3 sources · last confirmed 2026-04-30
Standardized tests used to compare AI models. The 2024–2025 benchmark landscape is shaped by rapid saturation of older tests, a wave of new, harder tests, and a methodological shift toward agent and reasoning evaluations. This page is an umbrella: individual benchmarks are listed in the roster below and get promoted to their own pages once another source discusses them standalone.
Working definition
An AI benchmark is a fixed dataset + scoring function used to compare model performance under controlled conditions. Benchmarks live on a saturation cycle: introduced → models improve → benchmark saturates → harder benchmark proposed.
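To make the working definition concrete, here is a minimal sketch of a benchmark as a fixed dataset plus a scoring function, with a toy exact-match scorer. The dataset, model stub, and names are illustrative assumptions, not any real benchmark's harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Benchmark:
    """A fixed dataset plus a scoring function (the working definition above)."""
    name: str
    dataset: list[tuple[str, str]]      # (prompt, reference answer) pairs
    score: Callable[[str, str], float]  # (prediction, reference) -> value in [0, 1]

def evaluate(benchmark: Benchmark, model: Callable[[str], str]) -> float:
    """Run the model over the fixed dataset and average the per-item scores."""
    scores = [benchmark.score(model(prompt), ref) for prompt, ref in benchmark.dataset]
    return sum(scores) / len(scores)

# Hypothetical usage: exact-match scoring over a toy two-item dataset.
toy = Benchmark(
    name="toy-qa",
    dataset=[("2+2?", "4"), ("Capital of France?", "Paris")],
    score=lambda pred, ref: float(pred.strip() == ref),
)
print(evaluate(toy, model=lambda prompt: "4"))  # 0.5: this stub model gets one item right
```

Saturation, in these terms, is a cohort of models all scoring near the ceiling of `evaluate`, at which point the fixed dataset stops discriminating between them.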
The 2024 wave is unusual in two ways:
- Speed of saturation. AI now masters new benchmarks within ~12 months of introduction (MMMU +18.8pp, GPQA +48.9pp, SWE-bench +67.3pp in one year).
- Methodological shift. Researchers are moving from static single-shot benchmarks toward agent benchmarks that evaluate trajectories under time budgets (RE-Bench), and toward reasoning benchmarks that resist pattern-matching (PlanBench, FrontierMath); a minimal time-budget loop is sketched below.
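To make the methodological shift concrete, here is a minimal sketch of what evaluating a trajectory under a time budget means structurally. It is in the spirit of agent benchmarks such as RE-Bench, not a reconstruction of any actual harness; the agent interface, feedback format, and scoring are all assumptions.

```python
import time
from typing import Callable, Protocol

class Agent(Protocol):
    def step(self, observation: str) -> str: ...  # propose the next attempt

def run_with_budget(agent: Agent, task: str,
                    score: Callable[[str], float],
                    budget_seconds: float) -> float:
    """Score the best attempt an agent produces within a wall-clock budget.

    Unlike a single-shot benchmark, the unit of evaluation is the whole
    trajectory: the agent may retry and use feedback until time runs out.
    """
    deadline = time.monotonic() + budget_seconds
    best = 0.0
    observation = task
    while time.monotonic() < deadline:
        attempt = agent.step(observation)
        best = max(best, score(attempt))
        observation = f"{task}\n(best score so far: {best:.2f})"
    return best
```

RE-Bench's headline numbers are comparisons of exactly this kind of best-score-within-budget quantity, at 2-hour and 32-hour budgets, between AI agents and human experts.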
Key claims
2024: rapid benchmark saturation
Major one-year benchmark gains in 2024 (AI Index 2025 §2):
- MMMU: +18.8 percentage points
- GPQA: +48.9 percentage points
- SWE-bench: +67.3 percentage points (4.4% → 71.7%)
2025: continued saturation, agent leaps, science benchmarks (AI Index 2026)
- SWE-bench Verified: 60% → near 100% in one year (the same benchmark that was at 71.7% in 2024 has effectively saturated within two years of being introduced).
- OSWorld (real computer tasks across operating systems): AI agents leapt from 12% → ~66% task success in one year. Agents still fail roughly 1 in 3 attempts on structured benchmarks.
- IMO (International Mathematical Olympiad): Gemini Deep Think earned a gold medal.
- Analog clock reading: top model only 50.1% correct, emblematic of the jagged frontier (strong capability alongside failure on tasks of similar perceived difficulty).
- ChemBench: frontier AI models outperform human chemists on average, yet models score below 20% on astrophysics replication and 33% on Earth-observation questions.
- RLBench (robotic manipulation in software-based simulations): 89.4% success in simulation, but robots succeed on only 12% of household tasks in the real world.
- Frontier-lab disclosure has dropped: independent testing does not always confirm what developers report.
New harder benchmarks (proposed because the old ones saturated)
- Humanity’s Last Exam — academic test; top score so far: 8.80%.
- FrontierMath — complex math; AI ~2% solve rate.
- BigCodeBench — coding; AI 35.5% vs. human standard 97%.
- PlanBench — logical planning; AI fails consistently even when provably correct solutions exist.
- RE-Bench — agent evaluation: in 2-hour budgets AI scores 4× human experts, but humans win 2:1 at 32 hours (a back-of-envelope crossover estimate follows this list).
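The two RE-Bench data points invite a back-of-envelope reading. Under a purely illustrative log-linear assumption about how the AI/human score ratio decays with budget (an assumption made here, not something the benchmark reports), parity would fall at roughly 13 hours:

```python
import math

# RE-Bench headline ratios (AI score / human expert score) at two budgets.
budgets_h = [2.0, 32.0]
ratios = [4.0, 0.5]  # 4x AI advantage at 2 h; humans win 2:1 at 32 h

# Illustrative assumption: log(ratio) is linear in log(budget).
x0, x1 = math.log(budgets_h[0]), math.log(budgets_h[1])
y0, y1 = math.log(ratios[0]), math.log(ratios[1])
slope = (y1 - y0) / (x1 - x0)

# Crossover where the ratio reaches 1.0, i.e. log(ratio) = 0.
crossover_h = math.exp(x0 - y0 / slope)
print(f"parity at roughly {crossover_h:.0f} h")  # ~13 h under this assumption
```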
Task horizons (METR / Anthropic Economic Index, 4th report)
A complement to single-shot benchmarks: task horizons measure the duration of tasks at which an AI achieves a given success rate. METR introduced this measure; the Anthropic Economic Index applies the same lens to its own data.
For Claude Sonnet 4.5, the duration at which 50% success is achieved varies sharply by source:
| Source | Task duration at 50% success |
|---|---|
| METR (fixed-task benchmark) | ~2 hours |
| Anthropic 1P API | ~3.5 hours |
| Anthropic Claude.ai | ~19 hours |
Per the Anthropic report, the Claude.ai number is much higher because of selection bias (users bring tasks they expect Claude to succeed on) and task decomposition with feedback loops. The methodology gap is itself diagnostic — fixed-benchmark horizons and platform-observed effective horizons measure different things.
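For concreteness, here is a minimal sketch of how a 50% task horizon can be extracted from per-task (duration, success) observations: fit success probability as a logistic function of log-duration, then solve for the 50% crossing. METR's published methodology is in this spirit, but the data and fitting details below are illustrative assumptions, not either organization's pipeline.

```python
import numpy as np

# Hypothetical observations: task duration in hours, and whether the model succeeded.
durations = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
successes = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0])

# Fit p(success) = sigmoid(a + b * log(duration)) by gradient descent.
x = np.log(durations)
a, b = 0.0, 0.0
for _ in range(20_000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a -= 0.1 * np.mean(p - successes)        # gradient of neg. log-likelihood w.r.t. a
    b -= 0.1 * np.mean((p - successes) * x)  # gradient of neg. log-likelihood w.r.t. b

# The 50% horizon is the duration where the fitted curve crosses 0.5,
# i.e. where a + b * log(t) = 0.
print(f"50% horizon: ~{np.exp(-a / b):.1f} h")
```

Under this lens, the table's divergence has a mechanical reading: the same fit applied to different task distributions (fixed benchmark tasks vs. self-selected platform tasks) yields very different horizons.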
Benchmark roster (mentioned in this wiki, awaiting standalone pages)
| Benchmark | Domain | Status as of AI Index 2025 |
|---|---|---|
| MMLU | Multitask language | Saturated; U.S./China gap closed to 0.3pp |
| MMMU | Multimodal understanding | New (2023); AI gained 18.8pp in 2024 |
| GPQA | Graduate-level reasoning | New (2023); AI gained 48.9pp in 2024 |
| SWE-bench | Real-world coding | New (2023); AI 4.4% → 71.7% in one year |
| HumanEval | Code completion | Saturated; near parity U.S./China |
| MATH | Competition math | Saturated; near parity U.S./China |
| GSM8K | Grade-school math | Saturated |
| MedQA | Clinical knowledge | OpenAI o1 = 96.0% (state of the art); approaching saturation |
| HELM Safety | RAI / safety | New 2024; see responsible-ai |
| AIR-Bench | RAI | New 2024 |
| FACTS | Factuality | New 2024 |
| SimpleQA | Factuality | New 2024 |
| Hughes Hallucination Evaluation Model | Factuality | Updated 2024 |
| HaluEval | Factuality | Failed to gain widespread adoption |
| TruthfulQA | Factuality | Failed to gain widespread adoption |
| Foundation Model Transparency Index | Disclosure / governance | Score rose 37% (Oct 2023) → 58% (May 2024) |
| Chatbot Arena Leaderboard | Pairwise human preference Elo (update rule sketched after this table) | Top-2 gap 0.7%, top-10 gap 5.4% |
| Humanity’s Last Exam | Academic generalist | New 2024; top score 8.80% |
| FrontierMath | Advanced math | New 2024; AI ~2% |
| BigCodeBench | Coding | New 2024; AI 35.5% vs. human 97% |
| PlanBench | Logical planning | AI consistently fails |
| RE-Bench | Agent / time-budget | New 2024; 4× humans @ 2hr, humans win 2:1 @ 32hr |
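The Chatbot Arena row above rests on an Elo-style rating computed from pairwise human votes. The sketch below shows the classic Elo update for a single vote, with illustrative ratings and K-factor; the live leaderboard has used Bradley-Terry-style estimation rather than this naive online update, so treat it as the idea, not the Arena's implementation.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """One pairwise-preference vote moves both ratings toward the observed outcome."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Hypothetical vote: model A (rated 1500) beats model B (rated 1520).
print(elo_update(1500.0, 1520.0, a_won=True))  # A gains ~16.9 points, B loses the same
```

Narrow top-2 and top-10 gaps on such a scale mean many votes are needed before rating differences between neighboring models become statistically meaningful.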
Debates / contradictions
- Are we measuring what matters or what's measurable? Benchmarks are getting harder faster than they are getting more representative of real-world tasks. Open question whether the saturation race is sustainable, or whether the field is overdue for a different evaluation paradigm.
- Agent benchmarks vs. static benchmarks. RE-Bench's two-budget result suggests static benchmarks can mislead about real workflows. Future evaluations will likely emphasize trajectories, not single-shot scores. But agent evaluations are themselves harder to standardize; open question whether the field can converge on shared protocols.
- Reasoning benchmarks. The IMO/PlanBench split suggests current models are strong at pattern-rich mathematics but weak at symbolic planning, even where provably correct solutions exist. Open question for the generative-ai roadmap, and load-bearing for safety in high-stakes deployments.
- Methodology stability. Many of the highest-impact 2024 benchmarks (MMMU, GPQA, SWE-bench) are <2 years old. Year-over-year claims about “AI improvement” depend on stable benchmark methodology — worth flagging when sources cite trend lines without examining benchmark drift.
Related concepts
- foundation-models — what’s being benchmarked
- generative-ai — sub-domain dominant in current benchmarks
- responsible-ai — RAI benchmarks intersect here