▸ Concept

Model distillation

Training a small model to mimic a large one by learning from its output distributions rather than raw labels.

Learn first

In a nutshell

A large model (the teacher) produces soft probability distributions over answers: not just the correct label, but how much probability it assigns to every alternative. A smaller model (the student) trains on those distributions instead of hard labels, absorbing the teacher's generalizations. The student typically ends up more capable than a model of the same size trained from scratch on hard labels alone. The hard part is twofold: the student must be large enough to hold what the teacher knows, and the teacher's outputs must be more reliable than the noisy human labels they replace. Otherwise you distill the teacher's errors too.
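As a rough illustration of the soft-target idea, here is a minimal sketch of a per-example distillation loss in plain Python. It assumes the common formulation of a temperature-softened KL term between teacher and student, mixed with ordinary cross-entropy on the hard label. The logits, the `temperature` of 2.0, and the `alpha` weight of 0.5 are hypothetical values chosen for the example, not settings taken from this page.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (assumed weighting)."""
    # Soft-target term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps the soft term's gradient magnitude
    # comparable to the hard term as the temperature changes.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard-label term: standard cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits for a three-class problem.
teacher_logits = [4.0, 2.5, -1.0]
student_logits = [2.0, 1.0, 0.5]

soft_targets = softmax(teacher_logits, temperature=2.0)
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)
```

In a real training loop the same quantity would be computed over batches by an autodiff framework and minimized with respect to the student's parameters; the teacher's logits are treated as fixed targets.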

Where it came from

Year: 2015

Source: Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (NeurIPS workshop)

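Why it mattered: Formalised the soft-target approach and popularised the term "distillation"; the underlying idea of compressing a large model or ensemble into a smaller one predates this paper (Bucilă, Caruana and Niculescu-Mizil, "Model Compression", 2006).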

Finds citing this concept

The 29-million-answer letter

How this connects