DeepSpec
DeepSeek has open-sourced DeepSpec, its toolkit for training the draft models that make big LLMs answer faster. The release includes DSpark, the drafter DeepSeek runs in its own V4 service, where the company reports 60-85% faster generation with identical output.
Every fast LLM today cheats a little: a small "draft" model guesses the next several tokens, and the big model verifies them in one pass instead of generating one at a time. Get the guesses right and you skip most of the slow work; the answer is byte-for-byte what the big model would have written anyway, because it still checks every token. The field had split into two camps of guesser: accurate ones that get slower the further ahead they guess, and cheap ones whose accuracy decays the further out you go. DeepSeek's DSpark splits the difference: a cheap parallel guesser for the whole block, plus a tiny sequential head that nudges each guess using the token just before it, clawing back accuracy without paying the speed tax.
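The draft-then-verify loop described above can be sketched in a few lines. This is a generic greedy speculative-decoding toy, not DSpark's architecture or DeepSpec's code: `toy_target`, `toy_draft`, `speculative_decode` and the `block` parameter are hypothetical names made up for illustration, and the "models" are plain arithmetic functions. It is only meant to show why the output matches what the big model would have produced on its own.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the big model: a deterministic next-token rule."""
    return (3 * ctx[-1] + len(ctx)) % 11


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the drafter: agrees with the target except at every 4th position."""
    t = (3 * ctx[-1] + len(ctx)) % 11
    return t if len(ctx) % 4 else (t + 1) % 11


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the target generates one token per pass."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, block: int = 4
) -> Tuple[List[Token], int]:
    """Draft `block` tokens, then let the target check them.

    Returns the generated sequence and the number of verification passes.
    A real system scores all drafted positions in one batched forward pass;
    here that pass is simulated with per-position calls and counted once.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter guesses a block of tokens autoregressively.
        ctx = list(out)
        guesses = []
        for _ in range(block):
            g = draft(ctx)
            guesses.append(g)
            ctx.append(g)

        # 2. The target verifies the block; only its own tokens are ever kept.
        passes += 1
        ctx = list(out)
        for g in guesses:
            t = target(ctx)
            ctx.append(t)
            if t != g:
                break  # first wrong guess: keep the target's token, drop the rest
        else:
            ctx.append(target(ctx))  # every guess accepted: the pass yields one extra token
        out = ctx[:goal]
    return out, passes


if __name__ == "__main__":
    spec, passes = speculative_decode(toy_target, toy_draft, [1], n_new=20)
    base = greedy_decode(toy_target, [1], n_new=20)
    print("identical output:", spec == base)
    print("target passes:", passes, "vs", 20, "for plain decoding")
```

Because every token kept is one the target itself chose for the correct prefix, a bad drafter can only cost speed, never change the answer.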
Target cache generation is storage-intensive (~38 TB for the default Qwen3-4B configuration). — DeepSeek
There's a second trick that matters more in production than the architecture. A scheduler watches the GPUs and verifies more guessed tokens when the hardware is idle, fewer when it's slammed — so the speedup bends to the actual load instead of a fixed setting. In DeepSeek's own V4 serving, the company reports per-user generation running 60-85% faster than its previous method.
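To make the load-aware idea concrete, here is a minimal sketch of one possible policy. The article does not describe DeepSeek's actual scheduling rule, so everything here is an assumption: the function name `verify_budget`, the linear mapping from utilization to block length, and the default bounds of 1 and 8 tokens are all invented for illustration.

```python
def verify_budget(gpu_util: float, min_tokens: int = 1, max_tokens: int = 8) -> int:
    """Pick how many drafted tokens to verify in the next pass.

    Hypothetical linear rule: the more idle the GPU (low `gpu_util`),
    the more guessed tokens are sent for verification.
    `gpu_util` is a fraction in [0.0, 1.0].
    """
    if not 0.0 <= gpu_util <= 1.0:
        raise ValueError("gpu_util must be between 0.0 and 1.0")
    if min_tokens < 1 or max_tokens < min_tokens:
        raise ValueError("need 1 <= min_tokens <= max_tokens")
    headroom = 1.0 - gpu_util  # share of the GPU that is currently free
    return min_tokens + round(headroom * (max_tokens - min_tokens))


if __name__ == "__main__":
    for util in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"util={util:.2f} -> verify {verify_budget(util)} tokens")
```

A production scheduler would presumably also weigh batch size, queue depth and recent acceptance rates; the point of the sketch is only that the verification length becomes a function of load rather than a fixed setting.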
The catch is that "open" has an asterisk. The released code (MIT-licensed, with the full data-prep, training and eval stack) trains drafters for Qwen and Gemma, not for V4 — and regenerating the target-model cache the training needs runs to roughly 38 terabytes for the default small-model setup. So you can read exactly how DSpark works and train a drafter of your own, but reproducing DeepSeek's headline V4 result is a well-resourced-lab exercise, and the 60-85% figure is the company's self-report on its own stack, not yet anyone else's measurement. What ships in the open is the recipe and two-thirds of the kitchen; the third camp's bragging is on the honour system.
Clone the MIT-licensed DeepSpec repo and read the DSpark drafter and eval scripts — the load-aware verification scheduler is the part the paper alone doesn't show you.