It is all about moving the bottleneck. During prompt processing everything can be calculated in parallel, while during token generation you create a single token at a time. For example, using an RTX 4000 Ada, I'm getting 2700 t/s for prompt processing, and 48 t/s for token generation using an 8B class model.
Their approach is essentially speculative decoding: multiple tokens are predicted at once and then verified, so tokens get generated at a speed closer to the prompt processing speed.
It seems to be special because their approach yields the exact same output distribution as the base model and it only takes a negligible amount of additional memory.
The main catch is that if your prompt processing speed is already bad, it will not help you all that much.
For example, the M-series Macs (up to M4) have a relatively high generation speed compared to their prompt processing speed. That means they will not benefit as much (if at all). With the M5 the prompt processing speed has increased 4x, so those can expect to see a good uplift.
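A back-of-envelope way to see this, as a rough sketch rather than a benchmark. The function name and the cost model are my own assumptions: one verification pass over k drafted tokens is taken to cost whichever is slower, one ordinary decode step or pushing k tokens through at prompt processing speed, and every drafted token is assumed to be accepted. The 2700 and 48 t/s figures are the RTX 4000 Ada numbers from above; the second machine is made up.

```python
def max_speedup(pp_tps: float, tg_tps: float, k: int) -> float:
    """Best-case generation speedup when k tokens are verified per pass.

    Assumed toy cost model (not taken from the paper): a pass over k tokens
    takes max(one decode step, k tokens at prompt-processing speed), and
    every drafted token is accepted.
    """
    decode_step = 1.0 / tg_tps                    # seconds per ordinary token
    parallel_pass = max(decode_step, k / pp_tps)  # seconds to verify k tokens
    return (k * decode_step) / parallel_pass


# RTX 4000 Ada numbers from above: prompt processing is ~56x faster than
# generation, so verifying 8 tokens costs about the same as generating one.
print(max_speedup(pp_tps=2700, tg_tps=48, k=8))

# Hypothetical machine where prompt processing is only 5x faster than
# generation: the gain is capped at 5x however many tokens are drafted.
print(max_speedup(pp_tps=200, tg_tps=40, k=8))
```

Under these assumptions the bound works out to min(k, pp_tps / tg_tps), which is why a small gap between the two speeds leaves little room for improvement.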
> Does this translate into a similar reduction in compute?
No, quite the opposite actually. As with speculative decoding, this model will compute more tokens and discard the invalid ones.
> What's the catch?
LLMs[1] are limited by memory bandwidth and not by compute[2]: because they process tokens one at a time, you spend more time loading the weights from VRAM into the GPU registers than waiting for compute to happen. Techniques like these make it possible to process multiple tokens in parallel instead of one by one, and so make better use of your graphics card's compute. They do so by predicting which tokens are likely to occur and then verifying that the guess was correct.
For instance if the previous token is “hello”.
A regular autoregressive LLM will compute:
“hello” => “! ”,
then “hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
and finally “hello! how are you” => “?<end>”
One at a time. Loading and unloading every weight 5 times from the GPU memory to its compute units.
With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can generate
“hello” => “! ”,
“hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
“hello! how are you” => “?<end>”
“hello! how are you today” => “?<end>”
In parallel. So each weight would have been loaded only once from the VRAM instead of 5 times.
The last token will be discarded though, as the prefix “how are you today” doesn't match what has actually been generated. So in that particular example, you'd have gotten your 5 tokens 5 times faster than with pure autoregressive inference, but at the expense of a 6th token being generated and discarded immediately. So 5 times more token throughput, but a 20% compute cost increase per token.
[1]: autoregressive LLMs, that is, which are the ones everybody uses because they are the most performant.
[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
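The “hello” example above can be sketched as a greedy verification loop. This is a minimal toy, not the paper's implementation: `toy_next` is a made-up stand-in for the base model, the tokens are whole words, and the loop calls the model once per position only for readability, whereas a real system would score all drafted positions in one batched forward pass.

```python
REPLY = ["!", "how", "are", "you", "?"]


def toy_next(tokens):
    """Hypothetical stand-in for the base model: continues "hello" with REPLY."""
    i = len(tokens) - 1
    if i < len(REPLY) and tokens[1:] == REPLY[:i]:
        return REPLY[i]
    return "<end>"


def verify_draft(target_next, prefix, draft):
    """Keep drafted tokens for as long as they match the base model (greedy case)."""
    accepted = []
    for guess in draft:
        actual = target_next(prefix + accepted)
        accepted.append(actual)  # the base model's own token is always kept
        if actual != guess:      # first mismatch: the rest of the draft is discarded
            return accepted
    # The whole draft matched, so the same pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


draft = ["!", "how", "are", "you", "today"]
print(verify_draft(toy_next, ["hello"], draft))
# ['!', 'how', 'are', 'you', '?']
```

Five tokens come out of one pass, and the position computed after the wrong guess “today” is the sixth one that gets thrown away. With greedy decoding the result is the same sequence the base model would have produced one token at a time.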
Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model.
Results:
- Up to 7.8x TPF, ~6x wall-clock on MATH-500.
- 16% of params trained, <1B tokens, 24h on 8xH200.
- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.
- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, zero TTFT penalty (no drafter to init/sync). KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).
- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.
Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
From a quick and shallow read of the paper, it looks very feasible (with a little tinkering) to adapt it to qwen3.6 27B. The process looks somewhat similar to training a LoRA, or in a way distilling your own model so that a mini model learns how to imitate it, and then you glue them together. I might bite the bullet and rent a GPU to do it for 3.6 27b, as this will solve a lot of my problems.
Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.