Artificial Intelligence high · independent

Agents' Last Exam

UC Berkeley graded frontier AI agents on real, finished work from 55 professional occupations — and on the hardest tier, the multi-day expert jobs, the best models pass essentially none of it.

paper UC Berkeley RDI · 2 min read · originally announced 3 Jun 2026

The agent that tops Terminal-Bench at 82% — Codex running GPT-5.5 — passes 0% of the hardest tasks in UC Berkeley's new benchmark. That gap is the whole point. Agents' Last Exam doesn't ask exam questions or canned coding puzzles; it grades agents on roughly 1,500 tasks lifted from real projects that human professionals actually completed, across 55 computer-based occupations, and checks the output by running it — comparing files, numbers, and simulated behavior, not asking a human or another model to judge.
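To make "checks the output by running it" concrete, here is a minimal sketch of an execution-style check in Python. It is not the benchmark's actual harness: the function names `check_numbers` and `grade_task`, the JSON deliverable, and the tolerance are all assumptions chosen for illustration. The idea it shows is only that a task passes or fails on what the agent's output file contains, with no human or model judge involved.

```python
import json
import math
import tempfile
from pathlib import Path


def check_numbers(produced: dict, expected: dict, rel_tol: float = 1e-6) -> bool:
    """Pass only if every expected key is present and numerically close."""
    for key, want in expected.items():
        got = produced.get(key)
        # Reject missing values, strings, and booleans posing as numbers.
        if isinstance(got, bool) or not isinstance(got, (int, float)):
            return False
        if not math.isclose(got, want, rel_tol=rel_tol):
            return False
    return True


def grade_task(output_path: Path, expected: dict) -> bool:
    """Grade one hypothetical task by loading the file the agent wrote."""
    if not output_path.exists():
        return False  # the agent never produced the deliverable
    try:
        produced = json.loads(output_path.read_text())
    except json.JSONDecodeError:
        return False  # malformed output fails outright
    if not isinstance(produced, dict):
        return False
    return check_numbers(produced, expected)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.json"
        out.write_text(json.dumps({"total": 1250.0, "rows": 42}))
        print(grade_task(out, {"total": 1250.0, "rows": 42}))
        print(grade_task(out, {"total": 1300.0, "rows": 42}))
```

A check like this is strict by design: a plausible-looking file with the wrong number fails exactly as a missing file does.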


On the everyday tier, the best agent clears about 38%. On the full-spectrum tier, around 24%. On the 'Last-Exam' tier — the multi-day work a specialist would bill for — the flagship configurations from OpenAI, Anthropic, and Google sit at 0.0%, and the frontier average is 2.6%. Best overall, across everything, is roughly 24% of real professional tasks.
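The tier figures are pass rates: tasks passed divided by tasks attempted, reported per tier. A small Python sketch of that arithmetic follows. The function name `pass_rates` and the sample results are invented for illustration and are not the benchmark's data; only the tier names come from the text.

```python
from collections import defaultdict


def pass_rates(results):
    """Map (tier, passed) pairs to {tier: percent of tasks passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    return {tier: 100.0 * passes[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Toy results, made up for this example only.
    toy = (
        [("everyday", True)] * 3
        + [("everyday", False)] * 5
        + [("Last-Exam", False)] * 4
    )
    for tier, rate in pass_rates(toy).items():
        print(f"{tier}: {rate:.1f}%")
```

Because each task is scored pass or fail, a tier made of long multi-day jobs gives no partial credit: an agent that gets most of the way through every task still scores 0%.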

The failure isn't carelessness. Berkeley's own breakdown pins about three-quarters of the misses on a single thing: the agents don't know how to do the job. They can't drive the specialist software a profession runs on, so they improvise plausible-looking workarounds that don't hold up when the answer is checked against reality.

The benchmark is built to keep being hard — private tasks rotate in as public ones leak, so a model can't quietly train on the test. That design makes it a candidate to become the eval cited when the next model claims it can do knowledge work. For now it draws a line under a year of agent hype: the systems that look like they've mastered the tools have not yet learned the jobs.

Want to try it?

Start with the live leaderboard and task tiers at agents-last-exam.org, or read the methods in arXiv 2606.05405 to see how a task is graded by execution.


The lenses

Novelty 4

Impact · breadth 3

Impact · depth 3

Actionable 4

Substance 5

Hype 3

The facts

What it grades: ~1,500 tasks from real completed expert projects across 55 occupations, scored by running the output

Best score: ~24% of all professional tasks; 0% on the hardest multi-day tier

Cost spread: ~$3.80 to ~$15.70 per completed task, roughly a fourfold gap between agents

Open: Public task set, dataset, and leaderboard released; private tasks rotate to resist contamination

Concepts

AI benchmarks · Agentic AI · Frontier models · AI economics
