Mentat · curated
Artificial Intelligence · high · first-party

OSWorld 2.0

The lab whose desktop-agent benchmark the field just cleared at 83% built a harder one — and the best agent finishes one task in five.

Two years ago XLANG Lab released OSWorld, a test where an AI agent has to drive a real computer (clicking, typing, switching apps) to finish a task. It became the field's standard, and agents now clear about 83% of it. So the same lab built a successor and watched the field's best model, Claude Opus 4.8, complete just 20.6% of it, and that only under the most generous setting the lab allowed: a 500-step budget and maximum deliberation.

The difference is length. Where the old tasks took roughly 30 actions, the new 108 workflows average about 318 and would take a skilled human around 1.6 hours each — real jobs spanning research, finance, engineering and admin, not toy clicks. Long horizons expose failures short ones never touch: tracking state across many items, inferring things the screen doesn't say, reconciling sources that disagree.

The sharp finding is that agents don't fade gracefully as work gets longer; they fall off a cliff. Past roughly 163 minutes of human-equivalent effort, completion is zero for every model tested: none finishes a single one of the longest jobs. The tasks themselves are kept behind a permissioned dataset, on the assumption that any benchmark left in the open gets trained on and stops measuring anything.

It is a useful corrective to the claim that agents are about to do your knowledge work. On short, well-scoped tasks they are genuinely capable; on the multi-hour jobs that make up an actual workday, the current frontier finishes almost none of them.
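The figures quoted above can be put side by side with a few lines of arithmetic. This is only a back-of-envelope sketch: the constants are the rounded numbers from the article, the names are illustrative, and treating 20.6% of 108 as a task count is an assumption, since the score may include partial credit or averaging over runs.

```python
# Rounded figures as quoted in the article; all names here are illustrative.
OLD_ACTIONS_PER_TASK = 30        # approximate, original OSWorld
NEW_ACTIONS_PER_TASK = 318       # approximate average, OSWorld 2.0
NUM_WORKFLOWS = 108
HUMAN_HOURS_PER_WORKFLOW = 1.6   # approximate average
BEST_COMPLETION_RATE = 0.206     # Claude Opus 4.8, best-case harness
CLIFF_MINUTES = 163              # beyond this, completion is zero for every model


def summarize() -> dict:
    """Derive a few comparisons from the quoted figures (plain arithmetic)."""
    average_minutes = HUMAN_HOURS_PER_WORKFLOW * 60
    return {
        # How much longer the new workflows are, measured in actions.
        "length_ratio": NEW_ACTIONS_PER_TASK / OLD_ACTIONS_PER_TASK,
        # Total human-equivalent effort represented by the benchmark.
        "total_human_hours": NUM_WORKFLOWS * HUMAN_HOURS_PER_WORKFLOW,
        # Assumption: reads the completion rate as a fraction of whole tasks.
        "approx_tasks_completed": BEST_COMPLETION_RATE * NUM_WORKFLOWS,
        # Where the cliff sits relative to the average workflow length.
        "cliff_vs_average": CLIFF_MINUTES / average_minutes,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

On these numbers the new workflows run roughly ten times longer than the old tasks, and the cliff sits at about 1.7 times the average workflow length, so it is the longest jobs, not the typical ones, that no model completes.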

The lenses

Novelty 3
Impact · breadth 3
Impact · depth 4
Actionable 3
Substance 4
Hype 2

The facts

Best agent: Claude Opus 4.8 finishes 20.6% of tasks (best-case harness); GPT-5.5 plateaus near 13%
Task length: 108 workflows, ~1.6 hours of human work each; ~318 actions vs ~30 in the original
Access: Open code (Apache 2.0), but the tasks are gated behind a permissioned dataset to resist training leakage