Artificial Intelligence medium · first-party

DeepSpec

DeepSeek open-sourced DeepSpec, the machinery for training the draft models that make big LLMs answer faster. It bundles DSpark, the drafter that runs in DeepSeek's own V4 service, where the company clocks generation at 60-85% faster with identical output.

repo DeepSeek · 2 min read · originally announced 26 Jun 2026

Every fast LLM today cheats a little: a small "draft" model guesses the next several tokens, and the big model verifies them in one pass instead of generating one at a time. Get the guesses right and you skip most of the slow work; the answer is byte-for-byte what the big model would have written anyway, because it still checks every token. The field had split into two camps of guesser: accurate ones that get slower the further ahead they guess, and cheap ones whose accuracy decays the further out you go. DeepSeek's DSpark splits the difference: a cheap parallel guesser for the whole block, plus a tiny sequential head that nudges each guess using the token just before it, clawing back accuracy without paying the speed tax.
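The draft-then-verify loop is easier to see in code. What follows is a minimal sketch of the generic greedy version, not DSpark or anything from the DeepSpec repo: `target_next` and `draft_next` are toy stand-ins invented for illustration, and the "one pass" of verification is simulated by counting rather than by a real batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(context: List[int]) -> int:
    # Toy stand-in for the big model: a deterministic function of the context.
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context: List[int]) -> int:
    # Toy drafter: agrees with the target unless the context length is a multiple of 4.
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: the big model generates one token per pass.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, block: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. The drafter guesses a block of tokens ahead.
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        # 2. The target checks every guess. A real system scores all positions
        #    in a single forward pass; here we only count it as one pass.
        verify_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(out + accepted)
            accepted.append(correct)  # always keep the target's own token
            if guess != correct:
                break                 # first wrong guess: discard the rest
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_passes

prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
fast, passes = speculative_decode(target_next, draft_next, prompt, 20)
print(fast == baseline, passes)
```

Because every kept token is the one the target itself would have chosen, the output matches plain greedy decoding while needing fewer target passes than tokens generated. Production systems that sample rather than decode greedily use a probabilistic accept/reject rule instead of the exact-match test shown here.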

Target cache generation is storage-intensive (~38 TB for the default Qwen3-4B configuration). — DeepSeek

There's a second trick that matters more in production than the architecture. A scheduler watches the GPUs and verifies more guessed tokens when the hardware is idle, fewer when it's slammed — so the speedup bends to the actual load instead of a fixed setting. In DeepSeek's own V4 serving, the company reports per-user generation running 60-85% faster than its previous method.
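As a rough illustration of what "bends to the actual load" could mean, here is a minimal sketch that maps a GPU utilization reading to a number of draft tokens to verify. The function name, the linear mapping, and the 1-to-8 range are all assumptions made for this example; the article does not describe how DeepSeek's scheduler actually decides.

```python
def choose_verify_len(gpu_util: float, min_len: int = 1, max_len: int = 8) -> int:
    """Pick how many draft tokens to verify, given GPU utilization in [0, 1].

    Hypothetical policy: scale linearly with spare capacity, so an idle GPU
    verifies max_len tokens and a saturated one verifies min_len.
    """
    if not 0.0 <= gpu_util <= 1.0:
        raise ValueError("gpu_util must be between 0 and 1")
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    headroom = 1.0 - gpu_util
    return min_len + round(headroom * (max_len - min_len))

# Longer verification blocks when idle, shorter ones when busy.
for util in (0.0, 0.3, 0.6, 0.9, 1.0):
    print(f"util={util:.1f} -> verify {choose_verify_len(util)} tokens")
```

The point of a policy like this is that verifying extra guesses costs compute that is free when the hardware is idle and expensive when it is contended, so a fixed block length is either wasteful under load or too timid off-peak.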

The catch is that "open" has an asterisk. The released code (MIT-licensed, with the full data-prep, training and eval stack) trains drafters for Qwen and Gemma, not for V4, and regenerating the target-model cache the training needs runs to roughly 38 terabytes for the default small-model setup. So you can read exactly how DSpark works and train a drafter of your own, but reproducing DeepSeek's headline V4 result is a well-resourced-lab exercise, and the 60-85% figure is the company's self-report on its own stack, not yet anyone else's measurement. What ships in the open is the recipe and two-thirds of the kitchen; the remaining third, the headline V4 number, is on the honour system.

Want to try it?

Clone the MIT-licensed DeepSpec repo and read the DSpark drafter and eval scripts — the load-aware verification scheduler is the part the paper alone doesn't show you.

Open the repo at github.com →

The lenses

Novelty 3

Impact · breadth 3

Impact · depth 3

Actionable 4

Substance 4

Hype 2

The facts

License: MIT (open code, data-prep, training and eval scripts)

Reported speedup: 60-85% faster per-user generation vs DeepSeek's prior method, output unchanged

Reproducibility cost: ~38 TB of storage to regenerate the training cache for the default config

Independent check: Speed numbers are DeepSeek's self-report; no third party has reproduced them

Concepts

AI infrastructure · Frontier models · Speculative decoding

