
But are LLMs even the right kind of model to learn such long-horizon goals, and to learn not to cheat at them?

I feel like we need a new base model where the next-token prediction itself is dynamic and RL-based to handle this issue properly.



I was including RLHF in "training". And even the system prompt, really.

If it's true that models can be prevented from spiraling into dead ends with "proper prompting" as the comment above claimed, then it's also true that this can be addressed earlier in the process.

As it stands, this behavior isn't likely to be useful for any normal user, and it's certainly a blocker to "agentic" use.


The RLHF is happening too late, I think. The reinforcement learning needs to happen during the initial next-token prediction. On that note, we need something richer than just language to represent a complex world state.



