Hacker News | littlestymaar's comments

> Does this translate into a similar reduction in compute?

No, quite the opposite actually. Like with speculative decoding this model will compute more tokens and discard the invalid ones.

> What's the catch?

LLMs[1] are limited by memory bandwidth and not by compute[2]: because they process tokens one at a time, you spend more time moving the weights from VRAM to the GPU's registers than waiting for compute to happen. Techniques like these make it possible to process multiple tokens in parallel instead of one by one, and so make better use of your graphics card's compute. They do so by predicting which tokens are likely to occur and then verifying that the guess was correct.

For instance if the previous token is “hello”.

A regular autoregressive LLM will compute:

“hello” => “! ”,

then “hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

and finally “hello! how are you” => “?<end>”

One at a time, loading every weight from GPU memory into the compute units 5 times.

With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can generate

“hello” => “! ”,

“hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

“hello! how are you” => “?<end>”

“hello! how are you today” => “?<end>”

In parallel. So each weight would have been loaded only once from VRAM instead of 5 times.

The last token will be discarded though, as the prefix “how are you today” doesn't match what has actually been generated. So in that particular example, you'd have gotten your 5 tokens 5 times faster than with pure autoregressive inference, but at the expense of a 6th token being generated and discarded immediately. So 5 times more token throughput, but a 20% increase in compute cost per token.
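To make the accounting above concrete, here is a toy sketch of the "guess, then verify in one pass" idea. It is not a real LLM and not the method from the paper: `toy_model` is a hypothetical lookup table standing in for a forward pass, and the draft token list (including the split into “you” and “ today”) is an assumption made for illustration.

```python
# Toy stand-in for an LLM: maps a prefix to its next token (hypothetical table).
CONTINUATIONS = {
    ("hello",): "! ",
    ("hello", "! "): "how ",
    ("hello", "! ", "how "): "are ",
    ("hello", "! ", "how ", "are "): "you",
    ("hello", "! ", "how ", "are ", "you"): "?<end>",
}

def toy_model(prefix):
    """Next token for `prefix`; unknown prefixes just end the sentence."""
    return CONTINUATIONS.get(tuple(prefix), "?<end>")

def autoregressive(prompt):
    """One token per pass: the weights are streamed once per generated token."""
    tokens, passes = list(prompt), 0
    while not tokens[-1].endswith("<end>"):
        tokens.append(toy_model(tokens))
        passes += 1
    return tokens[len(prompt):], passes

def verify_draft(prompt, draft):
    """Score every draft prefix in a single pass, keep tokens up to the first mismatch."""
    # In a real model these positions are computed in parallel in one forward pass.
    predictions = [toy_model(list(prompt) + draft[:i]) for i in range(len(draft) + 1)]
    accepted = []
    for i, predicted in enumerate(predictions):
        accepted.append(predicted)  # the model's own token is always valid here
        if i == len(draft) or predicted != draft[i]:
            break  # later predictions were conditioned on a wrong guess
    return accepted, len(predictions)

if __name__ == "__main__":
    kept, passes = autoregressive(["hello"])
    print(f"autoregressive: {len(kept)} tokens in {passes} passes")
    draft = ["! ", "how ", "are ", "you", " today"]  # assumed guess
    kept, computed = verify_draft(["hello"], draft)
    print(f"with draft: {len(kept)} tokens kept, {computed} computed, 1 pass")
    print(f"compute per kept token: {computed / len(kept):.2f}x")
```

In this toy setup the draft path keeps 5 tokens out of 6 computed positions in a single pass, which is where the "5 times the throughput, 20% more compute per token" figures come from.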

[1]: autoregressive LLMs, that is. Which are the ones everybody uses because they are the most performant.

[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
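A back-of-the-envelope sketch of footnote [2], under loudly stated assumptions: the hardware numbers below are made-up round figures, not the specs of any real GPU, the model is assumed to be dense, and KV-cache traffic is ignored. It compares the time to stream the weights once per decoding step with the time to do the matching arithmetic (roughly 2 FLOPs per parameter per token).

```python
# All figures are illustrative assumptions, not measurements.
PARAMS = 7e9            # hypothetical dense model size (parameters)
BYTES_PER_PARAM = 2     # 16-bit weights
BANDWIDTH = 1e12        # assumed memory bandwidth, bytes/s
FLOPS = 1e14            # assumed usable compute, FLOP/s

def decode_step_times(params, bytes_per_param, bandwidth, flops, batch):
    """Return (memory_seconds, compute_seconds) for one decoding step."""
    memory_s = params * bytes_per_param / bandwidth  # weights are read once per step
    compute_s = 2 * params * batch / flops           # ~2 FLOPs per parameter per token
    return memory_s, compute_s

def crossover_batch(bytes_per_param, bandwidth, flops):
    """Batch size at which compute time catches up with weight-loading time."""
    return flops * bytes_per_param / (2 * bandwidth)

if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        mem, comp = decode_step_times(PARAMS, BYTES_PER_PARAM, BANDWIDTH, FLOPS, batch)
        bound = "memory-bound" if mem > comp else "compute-bound"
        print(f"batch={batch:4d}  memory={mem * 1e3:.2f} ms  compute={comp * 1e3:.2f} ms  {bound}")
    print("crossover batch:", crossover_batch(BYTES_PER_PARAM, BANDWIDTH, FLOPS))
```

With these assumed numbers a single user leaves most of the compute idle, while a large enough batch flips the balance. Verifying several guessed tokens in one pass behaves much like a larger batch.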


Minor nit re[2]: for agentic workloads that are actually worth money - i.e., claude code and similar, things are either prefill-bound - which this does not help - or more importantly tps/user bound (at 150k+ context windows) - you want your big magic model to emit 200 tps/user. This is why Nvidia bought Groq (now LPU) and what Cerebras is trying to do, etc, etc. So for the stuff that makes money in the field - GPUs are not really compute bound once context lengths are large - but still memory transfer bound (may be KV-cache transfer, may be HBM->SRAM-on-chip, etc..)

> i.e., claude code and similar, things are either prefill-bound

When accounting for prefix caching, this greatly accelerates each turn. Barring large file reads, prefill still isn't the bottleneck vs. decoding reasoning tokens. Script-writing too.

This is especially true during exploration phases, when traversing directory trees and grepping files: you're talking about a few hundred tokens/turn.


Fantastic results. Well done. ...So this is built into the way the model works.. if I'm understanding it correctly.

I was wondering what would be involved in getting it to work with GGUF files, rather than safetensor files...


Just to get it into a GGUF file would be fairly trivial. But using that GGUF file would need a bunch of additional things. One would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality.

At the moment not even MTP is merged into llama.cpp, so I wouldn't quite hold my breath for it.


I thought that might be the case. I naively wondered. I'll see if I can understand the paper :-)

Hope the paper gets lots of references and the technique gets a lot of use to save power and time.

There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called..?), Turbo Quant and this one.

Interesting times.


MTP merged today, a couple of hours after your post by the looks of things.

By the looks of it, it will take a couple more follow up PRs to clean things up a bit and get the most performance from MTP. I hope that by that point it will be easier to add more spec decoding types.

In the meantime I've benchmarked Orthrus some more and got some quite promising results. So I'd be glad if my prediction that it may take some time until it lands in llama.cpp turns out to be wrong.


So, it's D-Flash but at each transformer layer, and it shares the KV cache of the original model? Very smart!

Kind of, yeah - predictivity is a question though for larger layers - when trying to scale this up. But yeah, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
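One way to sanity-check a figure like "95% predictor, 7x" is the usual expected-acceptance estimate for draft verification. The comment does not say how the 7x was derived, so everything here is an assumption: each guessed token is taken to be accepted independently with probability `p`, the draft is `k` tokens long, and the expected number of tokens kept per pass is then (1 - p^(k+1)) / (1 - p).

```python
def expected_tokens_per_pass(p, k):
    """Expected tokens kept per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability p (a simplification), plus the one token the model always
    contributes itself.
    """
    if p == 1.0:
        return float(k + 1)
    return (1 - p ** (k + 1)) / (1 - p)

if __name__ == "__main__":
    for k in (0, 4, 8, 16):
        print(f"p=0.95, draft length {k:2d}: {expected_tokens_per_pass(0.95, k):.2f} tokens/pass")
```

Under these assumptions a draft of about 8 tokens at 95% acceptance keeps a bit over 7 tokens per pass. That is an upper bound on the speedup, reached only when the extra positions cost almost nothing, i.e. in the memory-bound regime.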

US tech is currently being weaponized against the ICC and its member judges in Europe[1], and the US is threatening to annex Greenland, as a result all (former) US allies are scrambling to get rid of their strategic dependency.

[1]: https://www.lemonde.fr/en/international/article/2025/11/19/n...


> as a result all (former) US allies are scrambling to get rid of their strategic dependency.

In a very limited way. Some countries are moving some government systems off US suppliers, but it's very limited.

Some things are going the other way. For example (there was a post about this on HN a few days ago) the new EU ID/age verification app depends on Google or Apple attested phones.

There is no real effort to push the private sector off American systems. Your government might technically function but it's going to be crippled if your private sector has ground to a standstill.


It's still limited, but the shift has been dramatic over the past few months alone. It's hard to tell what the situation will look like in a few years. Unless the midterms are a Democratic landslide (which the GOP has tried to prevent as much as possible with gerrymandering) I don't see the trend stopping anytime soon.

Stylometry has existed for decades, and there's no way an LLM is stronger at that job than a specialized piece of software (it's not more realistic than expecting Opus to beat Stockfish at chess).

In practice, you've never been anonymous while posting on the internet and AI isn't changing anything on that front. Or rather: if anything, AI can help you become more anonymous than before, since it can be used to hide your identity from stylometry by rewriting your prose before publishing.


What would be an example of such software?


> the efficiency changes that have created this perceived downtrend in claude quality

Why the euphemism? What Anthropic did was an aggressive degradation of their model to save compute, and it's not just “perceived downtrend”, Anthropic themselves have acknowledged the quality of service degradation.


x oil shock (due to Hormuz).


Yeah, there are arguments to be made about the benefits (fewer teenagers on social media) vs the drawbacks (having to hand your ID card to some untrustworthy provider), or the fact that it gets people used to circumventing the law, or about the law addressing the wrong issue (so-called “social media” being actively harmful by design in ways that ought to be banned), but claiming that the law increases social media consumption is ridiculous.


> But you probably won't see these outside datacenters for a while.

That's especially true now that data center spending is crazy high.


This would have worked a few years back, but now you can be detained at the US border for posting what you just did so it's a terrible example to pick.

By the way, even with the current administration, there's no question about which is the more authoritarian with their own citizens between China and the US. But if you aren't American, then the US government is much more of a threat than the Chinese.

China cannot make the life of an official in Europe miserable for investigating their atrocities towards the Uighurs, meanwhile ICC judges are now forcibly unbanked and cannot work with American software because they investigated atrocities committed by a US ally in Gaza.


> China cannot make the life of an official in Europe miserable for investigating their atrocities towards the Uighurs

Sure. China and America are the same. Go try the social media experiment.


I literally wrote the opposite, but ok…


How can a medium-sized model like Deepseek-V4-Flash be cheaper than much smaller models like Qwen3.5-35B-A3B?

It's five times bigger in both total and active parameters!


I don’t know for sure, but I believe those larger models must be run on nVidia hardware (CUDA), while Deepseek-V4-* can be run on Huawei chips. My assumption is that there is less demand pressure on non-nVidia chips.

