Mentat curated
Artificial Intelligence high · independent

o1 versus the physicians

A reasoning model out-diagnosed attending physicians across six tests — but the same study quietly shows that handing a doctor the model barely helped them at all.

A team from Harvard and Stanford ran OpenAI's o1 reasoning model through the diagnostic gauntlet that medicine has used to test machines since 1959 — the hardest published case conferences, emergency-room triage, management decisions — and scored it against hundreds of physicians. The model matched or beat the doctors on every one, with the widest margin at the moment of first contact, when almost nothing is known about the patient and the pressure to decide is highest.

Across the emergency cases, the physicians scoring the answers couldn't tell which came from a person and which from the model — one guessed right just 2.7 percent of the time.

The buried number is the interesting one. On the management cases, the model answering alone scored a median 86 percent. Doctors who had an AI model to consult scored 41 percent, statistically no better than doctors with nothing but their usual references, who scored 34 percent. Adding the AI to a physician barely moved the physician; the model unassisted beat the pairing by a wide margin.

Which means the bottleneck this study exposes is not the machine. It is the handoff — what happens when a clinician reads a model's answer and decides what to do with it. The doctors couldn't extract the model's advantage even when it was sitting in front of them.

And here is the catch the authors are blunt about: they only ever tested the model alone. Nobody has yet shown that a doctor and this model together beat either one — and the cases were simulated and historical, not living patients. The result isn't a verdict on AI medicine. It's a demand for the trial that would actually settle it.

Why it's here

The headline is that AI beat doctors; the finding worth keeping is that giving doctors the AI didn't help them.

The lenses

Novelty 4
Impact · breadth 4
Impact · depth 4
Actionable 1
Substance 5
Hype 4

The facts

Peer-reviewed? Yes: published in Science, with the AI's own diagnoses graded blind by physicians
Real patients? No: simulated and historical cases; no live-patient outcomes tested
Doctors + AI together? Untested: the study measured the model alone, not collaboration
Source: https://www.science.org/doi/10.1126/science.adz4433