OSWorld 2.0
The lab whose desktop-agent benchmark the field just cleared at 83% built a harder one — and the best agent finishes one task in five.
Two years ago XLANG Lab released OSWorld, a test in which an AI agent has to drive a real computer (clicking, typing, switching apps) to finish a task. It became the field's standard, and agents now clear about 83% of it. So the same lab built a successor and watched the field's best model, Claude Opus 4.8, complete just 20.6% of it, and that under the most generous setting the lab allowed: a 500-step budget and maximum deliberation.
Past about 163 minutes of human-equivalent work, completion drops to zero — for every model.
The difference is length. Where the old tasks took roughly 30 actions, the 108 new workflows average about 318 actions and would each take a skilled human around 1.6 hours. These are real jobs spanning research, finance, engineering and admin, not toy clicks. Long horizons expose failures short ones never touch: tracking state across many items, inferring things the screen doesn't say, reconciling sources that disagree.
The sharp finding is that agents don't fade gracefully as work gets longer — they fall off a cliff. Past roughly 163 minutes of human-equivalent effort, completion is zero for every model tested. None finishes a single one of the longest jobs. The tasks themselves are kept behind a permissioned dataset, on the assumption that any benchmark left in the open gets trained on and stops measuring anything.
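To put the quoted figures side by side, here is a rough back-of-the-envelope sketch in Python. Every constant is copied from the article, not from the benchmark's data. The function name `scale_summary` is made up for illustration. The last entry assumes the 20.6% figure is a plain fraction of the 108 workflows, which the article does not state.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# All constants are the article's rounded numbers, not benchmark data.
OLD_ACTIONS_PER_TASK = 30        # "roughly 30 actions" in the original OSWorld
NEW_ACTIONS_PER_TASK = 318       # "about 318" in the successor
NUM_WORKFLOWS = 108              # number of new workflows
HUMAN_HOURS_PER_WORKFLOW = 1.6   # skilled-human time per workflow
BEST_COMPLETION_RATE = 0.206     # best reported completion rate
CLIFF_MINUTES = 163              # past this, completion is zero


def scale_summary():
    """Return derived ratios (hypothetical helper, for illustration only)."""
    cliff_hours = CLIFF_MINUTES / 60
    return {
        # How much longer the new workflows are, in actions.
        "length_ratio": NEW_ACTIONS_PER_TASK / OLD_ACTIONS_PER_TASK,
        # Total skilled-human time the whole suite represents.
        "total_human_hours": NUM_WORKFLOWS * HUMAN_HOURS_PER_WORKFLOW,
        # The cliff expressed in hours.
        "cliff_hours": cliff_hours,
        # The cliff relative to the average workflow length.
        "cliff_vs_average": cliff_hours / HUMAN_HOURS_PER_WORKFLOW,
        # ASSUMPTION: treats 20.6% as a simple share of the 108 workflows.
        "approx_workflows_completed": BEST_COMPLETION_RATE * NUM_WORKFLOWS,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:.1f}")
```

On these numbers the new workflows run roughly ten times longer than the old tasks, and the cliff sits at about 2.7 hours, a little under twice the average workflow length.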
It is a useful corrective to the claim that agents are about to do your knowledge work. On short, well-scoped tasks they are genuinely capable; on the multi-hour jobs that make up an actual workday, the current frontier finishes almost none of them.