Speculative decoding
A two-model technique that speeds up inference: a small draft model guesses several tokens ahead, and the larger target model verifies them all in one pass. The output distribution is unchanged, and generation usually takes less wall-clock time.
Learn first
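Language models generate one token at a time, and each step must wait for the previous one. At small batch sizes each step is limited by reading the model weights from memory, so most of the GPU's compute sits unused. Speculative decoding breaks this bottleneck by running a cheap draft model first, which quickly proposes a short sequence of tokens. The large target model then scores the whole draft in a single parallel forward pass. Each drafted token is accepted with a probability that depends on how closely the target's probability for it matches the draft's; the first rejection triggers a corrective sample derived from the target's distribution, the remaining drafted tokens are discarded, and the cycle restarts. With this acceptance rule the output distribution is mathematically identical to sampling from the target alone, so quality is preserved. The hard part is choosing a draft model that is fast enough and accurate enough for most proposals to be accepted.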
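The draft-then-verify cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `speculative_step`, `target_probs`, and `draft_probs` are hypothetical names, the "models" are plain functions that return a list of next-token probabilities, and the target is called once per position here, whereas a real system would score all drafted positions in one batched forward pass. The acceptance rule shown (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) is the standard speculative sampling rule.

```python
import random

def speculative_step(prefix, target_probs, draft_probs, k, rng):
    """Run one draft-then-verify cycle and return the tokens it emits.

    target_probs / draft_probs are hypothetical stand-ins for real models:
    each takes a list of token ids and returns next-token probabilities.
    The cycle emits between 1 and k + 1 tokens.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx = list(prefix)
    drafted, draft_rows = [], []
    for _ in range(k):
        q = draft_probs(ctx)
        tok = rng.choices(range(len(q)), weights=q)[0]
        drafted.append(tok)
        draft_rows.append(q)
        ctx.append(tok)

    # Verify phase: compare target and draft probabilities per position.
    ctx = list(prefix)
    emitted = []
    for tok, q in zip(drafted, draft_rows):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            # Accepted: keep the drafted token and move on.
            emitted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the positive part of (p - q),
            # then drop the rest of the draft and end the cycle.
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            emitted.append(rng.choices(range(len(p)), weights=residual)[0])
            return emitted

    # Every drafted token was accepted: sample one extra token from the target.
    p = target_probs(ctx)
    emitted.append(rng.choices(range(len(p)), weights=p)[0])
    return emitted

# Toy usage with context-independent distributions over a 3-token vocabulary.
toy_target = lambda ctx: [0.6, 0.3, 0.1]
toy_draft = lambda ctx: [0.3, 0.3, 0.4]
tokens = speculative_step([], toy_target, toy_draft, k=4, rng=random.Random(0))
```

Because a rejected token is replaced by a sample from the residual distribution, each emitted token follows the target's distribution exactly, whatever the draft proposes. A poor draft only lowers the acceptance rate, and with it the speedup.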
