METR

Confidence 0.70 · 1 source · last confirmed 2026-04-28

A research organization that builds AI evaluations, known in particular for measuring task horizons: the length of tasks, in human working time, that AI models can complete successfully.

Why it appears in this wiki

METR’s task-horizon benchmark is referenced in Anthropic’s fourth Economic Index report as a complementary measure of AI capability:

  • Benchmark: a fixed task set spanning varied human-time durations.
  • Metric: the human-time task duration at which an AI model’s success rate reaches 50% (sketched in code below).
  • For Claude Sonnet 4.5, METR reports a horizon of ~2 hours.
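
Because METR’s methodology is not ingested here (see Open questions below), the following is only a minimal illustration of how such a horizon can be estimated: fit a logistic curve to per-task success outcomes against log task duration, then solve for the duration where predicted success crosses 50%. All data and names are invented, and the least-squares fit is a simplification of a proper binomial maximum-likelihood fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented per-task records: (task duration in human hours, did the model succeed?)
durations = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
successes = np.array([1, 1, 1, 1, 0, 1, 0, 0])

def success_prob(log_t, log_h50, slope):
    # Probability of success as a function of log2(duration);
    # log_h50 is the log2 duration at which the curve crosses 50%.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(success_prob, np.log2(durations), successes,
                      p0=[0.0, 1.0], maxfev=10_000)
log_h50, slope = params
print(f"Estimated 50%-success horizon: ~{2 ** log_h50:.1f} hours")
```

On real data this would be fit over many tasks per duration bucket with uncertainty estimates; the point here is only the shape of the metric.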

The Anthropic report’s own data, computed differently, finds Claude Sonnet 4.5 reaches 50% success at ~3.5 hours (1P API) and ~19 hours (Claude.ai). Methodology differences (selection bias on Claude.ai, task decomposition with feedback loops) account for the gap; see the source page for detail.

Open questions

  • METR’s underlying benchmark methodology has not been ingested directly; it is known here only through Anthropic’s reference. A primary METR source would clarify the comparison.