Mentat · curated
Artificial Intelligence high · independent

Agents' Last Exam

UC Berkeley graded frontier AI agents on real, finished work from 55 professional occupations — and on the hardest tier, the multi-day expert jobs, the best models pass essentially none of it.

The agent that tops Terminal-Bench at 82% — Codex running GPT-5.5 — passes 0% of the hardest tasks in UC Berkeley's new benchmark. That gap is the whole point. Agents' Last Exam doesn't ask exam questions or canned coding puzzles; it grades agents on roughly 1,500 tasks lifted from real projects that human professionals actually completed, across 55 computer-based occupations, and checks the output by running it — comparing files, numbers, and simulated behavior, not asking a human or another model to judge.
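To make "checks the output by running it" concrete, here is a minimal sketch of execution-style grading in Python. It is not the benchmark's actual harness: the function names, the byte-for-byte file comparison, and the relative tolerance on numbers are all assumptions chosen for illustration. The only idea taken from the article is that a task passes when the produced files and numbers match a reference, with no human or model judge involved.

```python
import math
import tempfile
from pathlib import Path


def files_match(produced: Path, reference: Path) -> bool:
    """True when the produced file exists and is byte-identical to the reference."""
    return produced.is_file() and produced.read_bytes() == reference.read_bytes()


def numbers_match(produced: dict[str, float], reference: dict[str, float],
                  rel_tol: float = 1e-6) -> bool:
    """True when every reference metric is present and within tolerance (assumed 1e-6)."""
    return all(
        name in produced and math.isclose(produced[name], value, rel_tol=rel_tol)
        for name, value in reference.items()
    )


def grade(produced_dir: Path, reference_dir: Path,
          produced_metrics: dict[str, float],
          reference_metrics: dict[str, float]) -> bool:
    """Hypothetical grader: pass only if every reference file and number checks out."""
    file_checks = all(
        files_match(produced_dir / ref.name, ref)
        for ref in reference_dir.iterdir() if ref.is_file()
    )
    return file_checks and numbers_match(produced_metrics, reference_metrics)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir, ref_dir = Path(tmp) / "out", Path(tmp) / "ref"
        out_dir.mkdir()
        ref_dir.mkdir()
        (ref_dir / "report.csv").write_text("a,b\n1,2\n")
        (out_dir / "report.csv").write_text("a,b\n1,2\n")
        print(grade(out_dir, ref_dir, {"total": 3.0}, {"total": 3.0}))
```

A grader of this shape has no partial credit for a plausible-looking answer: a workaround that produces a different file or a number outside tolerance simply fails.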

On the everyday tier, the best agent clears about 38%. On the full-spectrum tier, around 24%. On the 'Last-Exam' tier — the multi-day work a specialist would bill for — the flagship configurations from OpenAI, Anthropic, and Google sit at 0.0%, and the frontier average is 2.6%. Best overall, across everything, is roughly 24% of real professional tasks.

The failure isn't carelessness. Berkeley's own breakdown pins about three-quarters of the misses on a single thing: the agents don't know how to do the job. They can't drive the specialist software a profession runs on, so they improvise plausible-looking workarounds that don't hold up when the answer is checked against reality.

The benchmark is built to keep being hard — private tasks rotate in as public ones leak, so a model can't quietly train on the test. That design makes it a candidate to become the eval cited when the next model claims it can do knowledge work. For now it draws a line under a year of agent hype: the systems that look like they've mastered the tools have not yet learned the jobs.

Want to try it?

Start with the live leaderboard and task tiers at agents-last-exam.org, or read the methods in arXiv 2606.05405 to see how a task is graded by execution.

The lenses

Novelty 4
Impact · breadth 3
Impact · depth 3
Actionable 4
Substance 5
Hype 3

The facts

What it grades: ~1,500 tasks from real completed expert projects across 55 occupations, scored by running the output
Best score: ~24% of all professional tasks; 0% on the hardest multi-day tier
Cost spread: ~$3.80 to ~$15.70 per completed task, roughly a fourfold gap between agents
Open: public task set, dataset, and leaderboard released; private tasks rotate to resist contamination