METR

Confidence 0.70 · 1 source · last confirmed 2026-04-28

A research organization that builds AI evaluations, known in particular for measuring task horizons: the length of tasks, in human working time, that AI models can complete successfully.

Why it appears in this wiki

METR’s task-horizon benchmark is referenced in Anthropic’s fourth Economic Index report as a complementary measure of AI capability:

  • Benchmark: a fixed task set spanning varied human-time durations.
  • Metric: the human-time task duration at which an AI model’s success rate reaches 50% (sketched in code below).
  • For Claude Sonnet 4.5, METR reports a horizon of ~2 hours.
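
Because METR’s methodology is not ingested here (see Open questions below), the following is only a minimal illustration of how such a horizon can be estimated: fit a logistic curve to per-task success outcomes against log task duration, then solve for the duration where predicted success crosses 50%. All data and names are invented, and the least-squares fit is a simplification of a proper binomial maximum-likelihood fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented per-task records: (task duration in human hours, did the model succeed?)
durations = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
successes = np.array([1, 1, 1, 1, 0, 1, 0, 0])

def success_prob(log_t, log_h50, slope):
    # Probability of success as a function of log2(duration);
    # log_h50 is the log2 duration at which the curve crosses 50%.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(success_prob, np.log2(durations), successes,
                      p0=[0.0, 1.0], maxfev=10_000)
log_h50, slope = params
print(f"Estimated 50%-success horizon: ~{2 ** log_h50:.1f} hours")
```

On real data this would be fit over many tasks per duration bucket with uncertainty estimates; the point here is only the shape of the metric.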

The Anthropic report’s own data, computed differently, finds Claude Sonnet 4.5 reaches 50% success at ~3.5 hours (1P API) and ~19 hours (Claude.ai). Methodology differences (selection bias on Claude.ai, task decomposition with feedback loops) account for the gap; see the source page for detail.

Open questions

  • METR’s underlying benchmark methodology has not been ingested directly; it is known here only through Anthropic’s reference. A primary METR source would clarify the comparison.