
I was hoping these were the stealth Horizon models on OpenRouter, impressive but not quite GPT-5 level.

My bet: GPT-5 leans into parallel reasoning via a model consortium, maybe mixing in OSS variants. Spin up multiple reasoning paths in parallel, then have an arbiter synthesize or adjudicate. The new Harmony prompt format feels like infrastructural prep: distinct channels for roles, diversity, and controlled aggregation.

I’ve been experimenting with this in llm-consortium: assign roles to each member (planner, critic, verifier, toolsmith, etc.) and run them in parallel. The hard part is eval cost :(
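To make the role-per-member idea concrete, here is a minimal sketch of one parallel round followed by an arbiter. It is not the llm-consortium API: `run_consortium`, `join_arbiter`, and the stub members are hypothetical names, and the lambdas stand in for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type: a member takes a prompt and returns a reply.
Member = Callable[[str], str]
Arbiter = Callable[[str, Dict[str, str]], str]


def run_consortium(prompt: str, members: Dict[str, Member], arbiter: Arbiter) -> str:
    """Query every member in parallel, then hand all replies to the arbiter."""
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        futures = {role: pool.submit(fn, prompt) for role, fn in members.items()}
        replies = {role: fut.result() for role, fut in futures.items()}
    return arbiter(prompt, replies)


# Stub members standing in for real model calls (assumption: one per role).
members: Dict[str, Member] = {
    "planner": lambda p: f"plan for: {p}",
    "critic": lambda p: f"risks in: {p}",
    "verifier": lambda p: f"checks for: {p}",
}


def join_arbiter(prompt: str, replies: Dict[str, str]) -> str:
    # Stand-in arbiter: a real one would be another model call that
    # synthesizes or adjudicates; this one just joins replies by role.
    return " | ".join(f"{role}: {replies[role]}" for role in sorted(replies))


result = run_consortium("ship feature X", members, join_arbiter)
print(result)
```

Threads are enough here because real members would spend their time waiting on network calls. The eval cost mentioned above shows up directly: every round costs one call per member plus one for the arbiter.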

Combining models smooths out the jagged frontier. Different architectures and prompts fail in different ways; you get less correlated error than a single model can give you. It also makes structured iteration natural: respond → arbitrate → refine. A lot of problems are “NP-ish”: verification is cheaper than generation, so parallel sampling plus a strong judge is a good trade.
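A toy illustration of the "verification is cheaper than generation" point, using subset-sum as a stand-in problem: the generator guesses random subsets and the judge only has to add them up. The names `propose`, `verify`, and `sample_and_verify` are made up for this sketch, and the samples are drawn sequentially for simplicity rather than in parallel.

```python
import random
from typing import List, Optional


def propose(numbers: List[int], rng: random.Random) -> List[int]:
    """Unreliable generator: a random subset (stand-in for one sampled answer)."""
    return [n for n in numbers if rng.random() < 0.5]


def verify(candidate: List[int], target: int) -> bool:
    """Judge: checking a candidate is a single sum, far cheaper than searching."""
    return sum(candidate) == target


def sample_and_verify(numbers: List[int], target: int,
                      samples: int, seed: int = 0) -> Optional[List[int]]:
    """Draw up to `samples` candidates and return the first one the judge accepts."""
    rng = random.Random(seed)
    for _ in range(samples):
        candidate = propose(numbers, rng)
        if verify(candidate, target):
            return candidate
    return None  # no sampled candidate passed verification


found = sample_and_verify([3, 34, 4, 12, 5, 2], target=9, samples=2000)
print(found)
```

Each individual guess is usually wrong, but because the check is exact and cheap, drawing many guesses and keeping the first verified one is still a reasonable trade. With an LLM judge the check is no longer exact, so the strength of the judge matters.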



Fascinating, thanks for sharing. Are there any specific kinds of problems you find this helps with?

I've found that LLMs can handle some tasks very well and some not at all. For the ones they can handle well, I optimize for the smallest, fastest, cheapest model that can handle it. (e.g. using Gemini Flash gave me a much better experience than Gemini Pro due to the iteration speed.)

This "pushing the frontier" stuff would seem to help mostly with tasks that are "doable but hard/inconsistent" for LLMs, and I'm wondering what those tasks are.


It shines on hard problems that have a definite answer. Google's IMO gold model used parallel reasoning. I don't know exactly what theirs looks like, but their Mind Evolution paper described an approach similar to my llm-consortium. The main difference is that theirs carries on isolated reasoning, while mine, in its default mode, shares the synthesized answer back to the models. I don't have pockets deep enough to run benchmarks on a consortium, but I did try the example problems from that paper, and my method also solved them using gemini-1.5. Those were path-finding problems, like finding the optimal schedule for a trip given multiple people's calendars, locations and transport options.
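A rough sketch of the difference between feeding the synthesis back and keeping rounds isolated. This is neither the paper's method nor llm-consortium's actual code: `iterate`, the `share` flag, and the stub members are hypothetical, and `share=False` is only a loose stand-in for isolated reasoning.

```python
from typing import Callable, List

# Hypothetical types: a member maps a context string to a reply,
# and a synthesizer merges all replies into one answer.
Member = Callable[[str], str]
Synthesizer = Callable[[List[str]], str]


def iterate(prompt: str, members: List[Member], synthesize: Synthesizer,
            rounds: int, share: bool = True) -> str:
    """Run several rounds; with share=True each round sees the last synthesis."""
    context = prompt
    synthesis = ""
    for _ in range(rounds):
        replies = [member(context) for member in members]
        synthesis = synthesize(replies)
        if share:
            # Feed the synthesized answer back so members can refine it.
            context = f"{prompt}\nPrevious synthesis: {synthesis}"
    return synthesis


def stub_member(context: str) -> str:
    # Stand-in for a model call: reports whether it was shown a synthesis.
    return "saw synthesis" if "Previous synthesis" in context else "fresh"


shared = iterate("plan the trip", [stub_member, stub_member], "+".join, rounds=2)
isolated = iterate("plan the trip", [stub_member, stub_member], "+".join,
                   rounds=2, share=False)
print(shared)
print(isolated)
```

In the shared run the second round is conditioned on the first synthesis; in the isolated run every round starts from the original prompt.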

And it obviously works for code and math problems. My first test was to give the llm-consortium code to a consortium to look for bugs. It identified a serious bug that only one of the three models had detected. So in that case it saved me time: using any of them on its own would have either missed the bug or required multiple attempts.



