Concept (also known as: spec decoding, speculative sampling)

Speculative decoding

A two-model trick that speeds up inference: a small draft model guesses several tokens ahead, and the larger target model verifies them all at once. The output distribution is identical to the target's; only wall-clock time drops.

In a nutshell

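Language models generate one token at a time, and each step waits for the last. Because every step must stream the full set of model weights from memory to produce a single token, decoding is typically memory-bandwidth bound and the GPU's compute units sit largely underused. Speculative decoding exploits that spare compute by running a cheap draft model first, which quickly proposes a short sequence of tokens. The large target model then scores the whole draft in a single parallel forward pass. Each draft token is accepted with probability min(1, p/q), where p and q are the probabilities the target and the draft assign to it; the first rejection triggers a corrective sample from the normalized residual max(0, p - q), and the cycle restarts. If every draft token is accepted, the same pass yields one extra token from the target. The output distribution is mathematically identical to sampling from the target alone, so quality is preserved. The hard part is choosing a draft model fast enough and accurate enough that most proposals land.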
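The accept/reject rule can be sketched in a few lines. This is a minimal illustration, not the papers' reference code: it assumes both models' next-token distributions are already available as plain Python lists, and the names `speculative_step` and `sample` are hypothetical helpers introduced here.

```python
import random


def sample(weights, rng):
    """Draw one token id in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def speculative_step(target_probs, draft_probs, draft_tokens, rng):
    """Verify one batch of draft tokens (illustrative sketch).

    target_probs: k+1 distributions from the target model, one per draft
                  position plus one for the position after the draft.
    draft_probs:  k distributions from the draft model.
    draft_tokens: the k token ids the draft model sampled.
    Returns the tokens to append to the output (between 1 and k+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]  # target's probability of the draft token
        q = draft_probs[i][tok]   # draft's probability (> 0, since it was sampled)
        if rng.random() < min(1.0, p / q):
            out.append(tok)       # accepted: keep the draft token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

For instance, if the draft proposes token 0 with certainty while the target gives it zero probability, the proposal is always rejected and replaced: `speculative_step([[0.0, 1.0], [0.5, 0.5]], [[1.0, 0.0]], [0], random.Random(0))` returns `[1]`. In a real system the two lists of distributions would come from one batched forward pass of each model; that plumbing is omitted here.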

Where it came from

Year: 2023
Source: Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (Google, ICML 2023)
Why it mattered: Published concurrently with Chen et al. (DeepMind, 2023), who called the same technique speculative sampling.
