
Model distillation

Training a small model to mimic a large one by learning from its output distributions rather than raw labels.

In a nutshell

A large model (the teacher) generates soft probability distributions over the possible answers: not just the correct label, but how much probability it assigns to every alternative. A smaller model (the student) trains on those distributions instead of hard labels, absorbing the teacher's generalizations. The student often ends up more capable than a model of the same size trained from scratch on hard labels alone. The hard part: the student must be large enough to hold what the teacher knows, and the teacher's outputs must be better calibrated than the noisy human labels they replace; otherwise you distill the teacher's errors too.
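To make the idea of training on distributions rather than labels concrete, here is a minimal sketch of a distillation loss in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names, the `temperature` and `alpha` parameters, and their default values are hypothetical choices for this example. The temperature-softened softmax and the T-squared scaling of the soft term follow the common formulation of the soft-target approach.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with ordinary hard-label cross-entropy.

    alpha and temperature are illustrative hyperparameters (assumed values).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) over the softened distributions.
    soft = sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)

    # Hard term: cross-entropy against the true label at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])

    # T^2 keeps the soft term's gradient scale comparable as temperature changes.
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard


# Example: three-class problem where class 0 is the true label.
teacher_logits = [4.0, 1.5, 0.2]
student_logits = [2.0, 1.0, 0.5]
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)
```

When the student's logits match the teacher's, the soft term is zero and only the hard-label term remains. In a real training loop this loss would be computed per batch with an autodiff framework; the sketch only shows what quantity is being minimized.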

Where it came from

Year: 2015
Source: Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (NeurIPS workshop)
Why it mattered: Formalised the soft-target approach and popularised the term "distillation"; the underlying idea of compressing a large model into a smaller one predates this paper.
