Mentat · curated
Artificial Intelligence medium · independent

The o3 GeoGuessr prompt

The viral prompt that supposedly made OpenAI's o3 superhuman at locating photos had gone the better part of a year without being tested against a plain one. When someone finally ran the control, the one-liner won.

Sean Goedecke ran the control everyone had skipped. The prompt in question went viral in mid-2025 as the thing that made OpenAI's o3 superhuman at guessing where a photo was taken — a long, hand-tuned block that reporters built iteratively, asking the model how it could have avoided its own mistakes and folding the advice back in. Goedecke pitted it against a single line, "think carefully about where this picture was taken," on the same 200 images and the same models.
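The shape of that control can be sketched in a few lines: hold the images and models fixed and vary only the prompt. This is a minimal illustration, not Goedecke's actual harness. The `locate` callable, its signature, and the `run_control` helper are hypothetical names introduced here; the elaborate prompt text is not reproduced in this write-up, so the caller supplies it.

```python
from typing import Callable, Dict, List, Tuple

LatLon = Tuple[float, float]

# Hypothetical signature (an assumption of this sketch):
# (model name, prompt text, image path) -> guessed (latitude, longitude).
Locator = Callable[[str, str, str], LatLon]

# The one-line control prompt quoted in the write-up.
PLAIN_PROMPT = "think carefully about where this picture was taken"


def run_control(
    locate: Locator,
    models: List[str],
    prompts: Dict[str, str],
    images: List[str],
) -> Dict[Tuple[str, str], List[LatLon]]:
    """Collect guesses for every (model, prompt label) pair over the same images.

    Only the prompt changes between arms; the image list and its order
    are identical, so the guesses can be compared image by image.
    """
    results: Dict[Tuple[str, str], List[LatLon]] = {}
    for model in models:
        for label, prompt in prompts.items():
            results[(model, label)] = [locate(model, prompt, img) for img in images]
    return results
```

A caller would pass something like `{"plain": PLAIN_PROMPT, "elaborate": elaborate_text}` as `prompts`, with `locate` wrapping whatever model API is in use.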

"It'll still be pretty good, except this time it's good because of what you did."

The one-liner won on every metric. Its median error was 83.2 km against the elaborate prompt's 102.3; 109 of its guesses landed within 100 km, versus 99 for the long one. The celebrated prompt is roughly ten times longer and buys about one extra second of thinking. Nobody had measured it against a control — the viral write-ups asserted the improvement and moved on.
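For readers who want to see how figures like these are typically computed, here is a minimal sketch. It assumes great-circle (haversine) distance on a spherical Earth and treats "within 100 km" as inclusive; the write-up does not say which distance formula or cutoff rule Goedecke used, and the function names are illustrative.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def summarize(guesses, truths, threshold_km=100.0):
    """Median error, mean error, and count of guesses within threshold_km.

    guesses and truths are equal-length lists of (lat, lon) pairs in degrees.
    """
    errors = [haversine_km(g[0], g[1], t[0], t[1])
              for g, t in zip(guesses, truths)]
    return {
        "median_km": median(errors),
        "mean_km": mean(errors),
        "within_threshold": sum(e <= threshold_km for e in errors),
    }
```

Reporting the median alongside the mean matters for this kind of task: a few guesses on the wrong continent inflate the mean, which is consistent with the reported means being several times larger than the medians.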

o3 was already good at this. The elaborate prompt didn't improve the answers; it changed the story. It gave a skill the model already had a flattering origin story, something you could point to and feel you'd unlocked. For the better part of a year, confident write-ups passed the claim along, and the control that would have caught the error was a single one-line query.

Why it's here

A rare hard, numeric data point in a 2026 discourse that's otherwise opinion essays ("is prompt engineering dead?"). It clears the bar not on a capability but on a measurement nobody ran: the original viral story asserted the elaborate prompt "significantly improves performance" with no plain-prompt control. Goedecke built the control and the result reversed. And it points past one geolocation task — as base models get more capable, elaborate prompting can buy belief more than accuracy.

The lenses

Novelty 3
Impact · breadth 3
Impact · depth 3
Actionable 2
Substance 4
Hype 1

The facts

Median km error: 83.2 (plain) vs 102.3 (elaborate)
Mean km error: 440.7 (plain) vs 481.9 (elaborate)
Guesses within 100 km: 109 (plain) vs 99 (elaborate)
Test set: 200 images from Wikimedia Commons, Geograph, iNaturalist
Models tested: o3, gpt-5.4, gpt-5.5
Prompt length vs thinking: elaborate prompt ~10x longer, adds ~1 s on average
Source: https://www.seangoedecke.com/the-o3-geoguessr-prompt-did-not-work/