I was including RLHF in "training". And even the system prompt, really.
If it's true that models can be prevented from spiraling into dead ends with "proper prompting" as the comment above claimed, then it's also true that this can be addressed earlier in the process.
As it stands, this behavior isn't likely to be useful for any normal user, and it's certainly a blocker to "agentic" use.
The RLHF is happening too late, I think. The reinforcement learning needs to happen during the initial next-token prediction. On that note, we need something richer than just language to represent a complex world state.
I feel like we need a new base model where the next-token prediction itself is dynamic and RL-based to be able to handle this issue properly.