Agents' Last Exam
UC Berkeley graded frontier AI agents on real, finished work from 55 professional occupations — and on the hardest tier, the multi-day expert jobs, the best models pass essentially none of it.
The agent that tops Terminal-Bench at 82% — Codex running GPT-5.5 — passes 0% of the hardest tasks in UC Berkeley's new benchmark. That gap is the whole point. Agents' Last Exam doesn't ask exam questions or canned coding puzzles; it grades agents on roughly 1,500 tasks lifted from real projects that human professionals actually completed, across 55 computer-based occupations, and checks the output by running it — comparing files, numbers, and simulated behavior, not asking a human or another model to judge.
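To make "checks the output by running it" concrete, here is a minimal sketch of what an execution-based check could look like. It is not Berkeley's grader: the function names, the SHA-256 file comparison, and the tolerance value are all assumptions chosen for illustration, and the simulated-behavior checks the benchmark also uses are left out.

```python
import hashlib
import math
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_match(produced: Path, reference: Path) -> bool:
    """Hypothetical file check: the agent's file must exist and be byte-identical."""
    return produced.exists() and file_digest(produced) == file_digest(reference)


def numbers_match(produced: float, reference: float, rel_tol: float = 1e-6) -> bool:
    """Hypothetical numeric check: equal within a relative tolerance (assumed value)."""
    return math.isclose(produced, reference, rel_tol=rel_tol)


def grade(checks: list[bool]) -> bool:
    """A task passes only if there is at least one check and all of them pass."""
    return bool(checks) and all(checks)
```

The point of a design like this is that no human or model opinion enters the verdict: a task either reproduces the reference artifacts or it does not.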
On the everyday tier, the best agent clears about 38% of tasks; on the full-spectrum tier, around 24%. On the 'Last-Exam' tier (the multi-day work a specialist would bill for), the flagship configurations from OpenAI, Anthropic, and Google sit at 0.0%, and the frontier average is 2.6%. The best overall score, across every tier, is roughly 24% of real professional tasks.
The failure isn't carelessness. Berkeley's own breakdown pins about three-quarters of the misses on a single thing: the agents don't know how to do the job. They can't drive the specialist software a profession runs on, so they improvise plausible-looking workarounds that don't hold up when the answer is checked against reality.
The benchmark is built to keep being hard — private tasks rotate in as public ones leak, so a model can't quietly train on the test. That design makes it a candidate to become the eval cited when the next model claims it can do knowledge work. For now it draws a line under a year of agent hype: the systems that look like they've mastered the tools have not yet learned the jobs.
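The rotation idea can be sketched in a few lines. This is an illustrative model only: the `TaskPool` class, its fields, and the first-in-first-out promotion rule are assumptions, not the benchmark's actual mechanism, which the article does not describe in detail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskPool:
    """Hypothetical pool: public tasks are scored, private ones are held in reserve."""
    public: list[str]
    private: list[str] = field(default_factory=list)
    retired: list[str] = field(default_factory=list)

    def rotate_out(self, leaked_task: str) -> Optional[str]:
        """Retire a leaked public task and promote the oldest held-out task, if any."""
        if leaked_task not in self.public:
            raise ValueError(f"{leaked_task!r} is not a public task")
        self.public.remove(leaked_task)
        self.retired.append(leaked_task)
        if not self.private:
            return None  # nothing left in reserve to promote
        replacement = self.private.pop(0)
        self.public.append(replacement)
        return replacement


# Usage: one public task leaks, so a held-out task takes its place.
pool = TaskPool(public=["task-a", "task-b"], private=["held-out-1"])
promoted = pool.rotate_out("task-a")
```

Under these assumptions the scored set stays the same size for as long as the reserve lasts, which is what keeps a model from benefiting by memorizing leaked tasks.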
Start with the live leaderboard and task tiers at agents-last-exam.org, or read the methods in arXiv 2606.05405 to see how a task is graded by execution.