Artificial Intelligence medium · independent

The expert moat, blind-tested

Sixteen contracts professors graded answers to their own office-hours questions without knowing which were written by AI and which by a fellow professor; they picked the AI three times out of four.

paper Stanford Law (Salinas et al.) · 2 min read

Sixteen law professors from fourteen schools each wrote short answers to forty real questions of the kind students bring to office hours: doctrine, hypotheticals, the policy reasoning that has no single right answer. Then they made nearly three thousand blind comparisons, picking the better answer without knowing its source. They chose the AI 75% of the time, and it beat all but one of them.

the preference for LLM answers is not reducible to length, clarity, or other stylistic markers — Salinas et al.

The assumed human moat in law was never recall of cases; it was judgment — the part everyone said machines couldn't do. This study put that belief in a controlled setting and the belief lost, on the experts' own questions, refereed by the experts themselves.

The most telling number is buried in the spread. The AI answers were flagged as pedagogically harmful 3.53% of the time; the professors' answers, 12.06%. But the human rate ranged from 1% for the best instructor to nearly 40% for the worst. The machine's advantage is not that it is brilliant; it is that it almost never has a bad day. The authors also checked whether the AI simply wrote prettier prose and found that style did not explain the result: the preference held after controlling for length and clarity.
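
To put the quoted rates side by side, here is a minimal arithmetic sketch in Python. It uses only the figures reported above; the comparison count is the rounded "~2,900" taken at face value, so the derived count is an approximation, not a number from the paper. The constant and function names are illustrative.

```python
# Figures as quoted in this summary; nothing here comes from the paper's data files.
AI_HARMFUL_RATE = 0.0353      # share of AI answers flagged pedagogically harmful
PROF_HARMFUL_RATE = 0.1206    # share of professor answers flagged harmful
AI_PREFERENCE = 0.75          # share of blind comparisons won by the AI
APPROX_COMPARISONS = 2900     # assumption: the rounded "~2,900" used as-is


def relative_rate(a: float, b: float) -> float:
    """Return how many times larger rate a is than rate b."""
    return a / b


def percentage_point_gap(a: float, b: float) -> float:
    """Return the difference between two rates in percentage points."""
    return (a - b) * 100


ratio = relative_rate(PROF_HARMFUL_RATE, AI_HARMFUL_RATE)
gap = percentage_point_gap(PROF_HARMFUL_RATE, AI_HARMFUL_RATE)
approx_ai_wins = round(AI_PREFERENCE * APPROX_COMPARISONS)

print(f"Professors' answers were flagged about {ratio:.1f}x as often as the AI's")
print(f"Gap: {gap:.2f} percentage points")
print(f"Roughly {approx_ai_wins} of ~{APPROX_COMPARISONS} comparisons went to the AI")
```

These are averages. They say nothing about the 1% to nearly 40% spread across individual professors, which is the point the paragraph above turns on.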

Caveats are real and the authors press them: it is a preprint, one subject, one casebook, sixteen judges, and ninety-word written answers rather than a live tutor in a room. What it measures is the quality of a written answer, not whether you should hand a class to a chatbot. Even so, it relocates the burden of proof: the next person claiming AI can't handle expert judgment has to explain why a room full of contracts professors preferred its answers.

The lenses

Novelty: 4

Impact · breadth: 3

Impact · depth: 3

Actionable: 2

Substance: 5

Hype: 2

The facts

Who judged: 16 contracts professors, 14 US law schools, all teaching from the same casebook

The test: ~2,900 blind pairwise comparisons of professor vs. AI answers to real office-hours questions

Result: AI preferred 75% of the time; flagged harmful 3.53% vs. 12.06% for professors

Status: Preprint, not peer-reviewed; single subject and casebook

Concepts

AI benchmarks · Human–AI collaboration

Open papers.ssrn.com →
