▸ Concept
Model distillation
Training a small model to mimic a large one by learning from its output distributions rather than raw labels.
In a nutshell
A large model (the teacher) produces soft probability distributions over answers: not just the correct label, but how much probability it assigns to every alternative. A smaller model (the student) trains on those distributions instead of hard labels, absorbing the teacher's generalizations. The student typically ends up more capable than a model of the same size trained from scratch on hard labels alone. The hard part: the student must be large enough to hold what the teacher knows, and the teacher's outputs must be better calibrated than noisy human labels; otherwise you distill the errors too.
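The soft-target idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `temperature` and `alpha` parameters, and the example logits are all assumptions made for the sketch. It softens both teacher and student logits with a temperature, measures how far the student's distribution is from the teacher's with a KL divergence, and blends that with an ordinary cross-entropy term on the hard label. The multiplication by the squared temperature follows the scaling suggested in the Hinton, Vinyals and Dean paper for combining soft and hard targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    temperature and alpha are illustrative hyperparameters, not recommended values.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Soft term: distance from the teacher's softened distribution,
    # scaled by temperature squared so its gradients stay comparable in size.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # Hard term: ordinary cross-entropy against the true label at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits for a three-class problem whose true class is 0.
teacher_logits = [4.0, 2.5, 0.5]
close_student = [3.5, 2.0, 0.0]   # same shape as the teacher, shifted by a constant
far_student = [0.5, 4.0, 2.5]     # prefers a different class

print(distillation_loss(close_student, teacher_logits, hard_label=0))
print(distillation_loss(far_student, teacher_logits, hard_label=0))
```

The student whose logits track the teacher's gets the lower loss. In a real training loop the same quantity would be computed on tensors and minimized by gradient descent over the student's weights.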
Where it came from
Year: 2015
Source: Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (NeurIPS workshop)
Why it mattered: Formalised the soft-target approach; the term itself predates this paper.
How this connects
Model distillation, The 29-million-answer letter, Scaling laws, Supervised fine-tuning, Artificial Intelligence, Andrej Karpathy, OpenAI, Anthropic, Alibaba, o1-preview, MiniMax, nanochat, DeepMind argues against the singularity, Hunter Alpha, The Jevons bill comes due, The terafab ceiling, A flagship on a gaming card, AI policy, Export controls for AI, Frontier models, AI chips
