Mentat curated
Concept (also: test set leakage, benchmark contamination, data leakage)

Training data contamination

When examples from an evaluation set appear in a model's training data, making benchmark scores higher than genuine capability warrants.

In a nutshell

A benchmark measures capability only if the model has not seen the test items and their answers during training. Contamination happens when crawled training corpora include pages that contain evaluation questions and their solutions: the model can then reproduce memorised answers instead of reasoning, and the score inflates. The hard part is detection. Training sets are vast, and they are rarely audited against every benchmark, including those released after the corpus was assembled. On a leaderboard a contaminated result looks identical to a clean one, so the field's main progress signal can quietly stop measuring what it claims to measure.
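To make the detection problem concrete, here is a minimal sketch of one common heuristic: flag a benchmark example if it shares any word n-gram with the training corpus. The function names (`ngrams`, `build_index`, `is_contaminated`), the toy documents, and the choice of n = 5 are illustrative assumptions, not a method taken from any particular paper. Real audits typically use longer n-grams and have to index corpora far too large for an in-memory set.

```python
import re
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in text (punctuation ignored)."""
    tokens = re.findall(r"\w+", text.lower())
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen anywhere in the (hypothetical) training corpus."""
    index: Set[NGram] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def is_contaminated(example: str, index: Set[NGram], n: int) -> bool:
    """Flag an example that shares at least one n-gram with the corpus."""
    return not ngrams(example, n).isdisjoint(index)


# Toy data, invented for illustration only.
training_docs = [
    "Forum post: what is the capital of France? The answer is Paris, obviously.",
    "A recipe for sourdough bread with a long overnight rise.",
]
benchmark = [
    "What is the capital of France? The answer is Paris.",
    "Which planet in the solar system has the most moons?",
]

N = 5  # assumed value, kept small so the toy example triggers a match
index = build_index(training_docs, N)
flags = [is_contaminated(example, index, N) for example in benchmark]
print(flags)  # [True, False]
```

A check like this only catches near-verbatim overlap. Paraphrased or translated copies of a benchmark item slip through, which is part of why contamination is hard to rule out.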

Where it came from

Year: 2020
Source: Brown et al., GPT-3 paper (OpenAI)
Why it mattered: A dedicated contamination analysis section flagged test-set overlap in pre-training data as a known confound, one of the first systematic treatments in a major model paper.
