Artificial Intelligence medium · first-party

llama.cpp self-speculation

The engine most local AI tools sit on now lets a model draft its own next tokens from prediction heads baked into the same file, with no second model to ship, and speeds up decode by about 1.7x on dense models.

repo ggml-org/llama.cpp · 2 min read

Speculative decoding is the trick that lets a language model run faster than it "should": a small, cheap model guesses the next few tokens and the big model checks them all in one pass, accepting the good guesses for free. The catch has always been the small model — you had to find one, ship it, and configure it alongside the real one. As of a May merge, llama.cpp skips that step: it reads extra prediction heads that recent models like Qwen3.6 already carry inside their own weights file and uses them as the draft. The model speculates on itself.
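The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the big model and its prediction heads, tokens are small integers, and the "one pass" verification is written as a plain loop where a real engine would batch it.

```python
from typing import List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'big model': a deterministic toy rule over the last token."""
    return (ctx[-1] * 3 + 1) % 7


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical cheap draft: matches the target except after token 5."""
    if ctx[-1] == 5:
        return 3  # deliberately wrong guess (the target would say 2)
    return (ctx[-1] * 3 + 1) % 7


def speculative_step(ctx: List[Token], k: int) -> Tuple[List[Token], int]:
    """Draft k tokens, then keep the longest prefix the target agrees with.

    Returns (tokens to append, number of drafted tokens accepted).
    """
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))

    out: List[Token] = []
    for tok in drafted:
        expected = target_next(ctx + out)
        if tok != expected:
            out.append(expected)  # first mismatch: take the target's token, stop
            return out, len(out) - 1
        out.append(tok)
    # Every guess matched, so the same pass also yields one extra target token.
    out.append(target_next(ctx + out))
    return out, k


def generate_plain(prompt: List[Token], n_new: int) -> List[Token]:
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx


def generate_speculative(prompt: List[Token], n_new: int, k: int = 3):
    ctx = list(prompt)
    accepted = drafted = 0
    while len(ctx) - len(prompt) < n_new:
        new_tokens, n_ok = speculative_step(ctx, k)
        ctx.extend(new_tokens)
        accepted += n_ok
        drafted += k
    return ctx[: len(prompt) + n_new], accepted, drafted


if __name__ == "__main__":
    tokens, accepted, drafted = generate_speculative([1], 12)
    print(tokens == generate_plain([1], 12), accepted, drafted)
```

The point of the sketch is the invariant: with greedy verification the speculative output is token-for-token what the target would have produced alone, so the only thing at stake is speed.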

On a single high-end consumer card, decode on a dense 27-billion-parameter model jumped from 38 to 65 tokens a second — about 1.7x, with the model's self-drafts accepted roughly three times in four. It is off by default and set with one flag; nothing new needs downloading beyond a model that ships the heads.
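A back-of-envelope check on those numbers, as a rough sketch rather than a model of the actual benchmark. It assumes each drafted token is accepted independently with probability 0.75 (the article's "three times in four") and that the run drafted 3 tokens per step (the value in the suggested flag below, which the benchmark may not have used). Under those assumptions the expected tokens emitted per verification pass is the geometric sum 1 + p + ... + p^n.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that the pass always yields one token from the target
    itself (the correction or the bonus token).
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Figures from the article: 38 -> 65 tokens per second.
observed_speedup = 65 / 38

# Assumed, not measured: per-token acceptance 0.75, draft length 3.
tokens_per_pass = expected_tokens_per_pass(0.75, 3)

# If each pass emits that many tokens but the speedup is smaller, the gap is
# the extra cost of a draft-plus-verify pass relative to one plain step.
implied_pass_cost = tokens_per_pass / observed_speedup

if __name__ == "__main__":
    print(round(observed_speedup, 2), round(tokens_per_pass, 2), round(implied_pass_cost, 2))
```

With these assumed inputs a pass emits about 2.7 tokens on average, while the measured speedup is about 1.7x, which would put a speculative pass at roughly 1.6 times the cost of a plain decode step. Drafting and verifying are cheap, not free.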

Where it stops working is instructive. On mixture-of-experts models, the kind that wake only a slice of themselves per token, the speedup thins out for a single user. An independent benchmark of the older separate-draft approach found the deeper reason: one run accepted every single drafted token, a perfect 270 for 270, and still decoded slower than plain generation. Acceptance was never the bottleneck. Each guessed-ahead token can route to different experts, so the verification pass has to load the union of all of them at once and pays for the lot. More drafted tokens, more experts to fetch, no net win.
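The union-of-experts effect can be put in rough numbers. The sketch below assumes uniform, independent routing, which real routers do not obey (nearby tokens tend to share experts), and the sizes used (128 experts, 8 active per token) are hypothetical rather than taken from any benchmarked model. It only shows the direction of the effect: more tokens in one verification pass means more distinct experts whose weights must be read.

```python
def expected_experts_loaded(num_experts: int, top_k: int, tokens_in_pass: int) -> float:
    """Expected number of distinct experts touched in one layer of one pass.

    Assumption: each token independently picks top_k distinct experts
    uniformly at random. A given expert is missed by one token with
    probability 1 - top_k / num_experts, and by all tokens with that
    probability raised to tokens_in_pass.
    """
    p_untouched = (1.0 - top_k / num_experts) ** tokens_in_pass
    return num_experts * (1.0 - p_untouched)


if __name__ == "__main__":
    # Hypothetical sizes: 128 experts, 8 active per token.
    for n_tokens in (1, 2, 4):  # 4 = 3 drafted tokens plus the current one
        print(n_tokens, round(expected_experts_loaded(128, 8, n_tokens), 1))
```

Under these assumptions a plain decode step touches 8 experts per layer, while verifying four tokens at once touches roughly 29. On a dense model the same pass re-reads one fixed set of weights regardless of how many tokens ride along, which is why the extra tokens are nearly free there and expensive here.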

So the honest headline is narrow and real: the dominant local-inference engine gained free self-speculation, worth about 1.7x on dense models and much less, sometimes nothing, on sparse ones. The difference is decided by memory traffic rather than by how good the guesses are.

Want to try it?

Grab a Qwen3.6 GGUF and add --spec-type draft-mtp --spec-draft-n-max 3 to your llama.cpp run; the PR thread has the full flag set and the benchmark numbers to compare against.

The lenses

Novelty 3

Impact · breadth 3

Impact · depth 3

Actionable 4

Substance 5

Hype 2

The facts

Cost: Free, open-source

Needs: A model that ships prediction heads (e.g. Qwen3.6)

Effort: One flag, off by default

Best case: ~1.7x decode on dense models, single user

Concepts

Speculative decoding, Local inference, Mixture of experts
