Can language models serve as text-based world simulators? (arxiv.org)
95 points by mpweiher on June 15, 2024 | 65 comments


I spent a few months trying to get ChatGPT4 to work with a MUD. In my experience it's not very good at that particular task.

One problem I ran into was its ability to logically connect rooms. In a MUD you navigate by going north, south, east, west, up, or down. Not every room lets you go any direction. And usually, if you go east, in your new room, you can go west. Rarely a level creator will make this not be true. ChatGPT4 was pretty bad at it though.
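The east/west rule described above can be checked mechanically on whatever map a model produces. A minimal sketch, assuming the map is stored as a dict of room id to exits (the room names and the `one_way_exits` helper are made up for illustration):

```python
# Hypothetical room graph: room id -> {direction: destination room id}.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
}

def one_way_exits(rooms):
    """Return (room, direction, destination) triples whose reverse exit
    is missing or leads somewhere other than the room you came from."""
    problems = []
    for room, exits in rooms.items():
        for direction, dest in exits.items():
            back = rooms.get(dest, {}).get(OPPOSITE[direction])
            if back != room:
                problems.append((room, direction, dest))
    return problems

rooms = {
    "gate": {"east": "courtyard"},
    "courtyard": {"west": "gate", "north": "tower"},
    "tower": {},  # no exit back south, so courtyard -> tower is flagged
}
print(one_way_exits(rooms))
```

A level creator who wants a deliberate one-way passage could simply whitelist the flagged triples.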

Another problem was descriptions. It might mention a mountain in the distance once. But then never again. So this giant landmark was described in a single room.

It was also difficult to get it to create a fair quantity of secrets in logical places. Lots of times it would just chain together multiple secrets in a single place. If you have more than one, you want to spread them around.

And finally, room layout. It tended to not be very good at this. Lots of linear layouts. It didn't have an eye towards when details should be complex rooms and when it can just be a line in a description.

So it could do it, but it created levels that weren't very fun or particularly creative, even when it came to room descriptions.


Somewhat related, Yann LeCun posted a few months back about how many concepts aren't understood through language and therefore can't be modeled through it (which is why LLMs are terrible with things like position and direction).

https://x.com/ylecun/status/1768353714794901530?s=46


    To people who claim that "thinking and reasoning require language", here is a problem:
    Imagine standing at the North Pole of the Earth.
    Walk in any direction, in a straight line, for 1 km.
    Now turn 90 degrees to the left.
    Walk for as long as it takes to pass your starting point.
    Have you walked:
    1. More than 2xPi km
    2. Exactly 2xPi km
    3. Less than 2xPi km
    4. I never came close to my starting point.

    Think about how you tried to answer this question and tell us whether it was based on language.
Just quoting this here in case anything happens to the tweet...

I agree with this, however I have one tiny nitpick, feel free to tell me if you think I'm wrong or being overly nitpicky, but the knowledge of the situation in which the phenomenon that he's describing occurs, I learned about entirely from language. So my ability to answer the question is based on that knowledge.

I'm aware that the reasoning problem itself doesn't utilise it, but a position and direction system in and of itself arguably also suffers from being insufficient.

I suppose I'm wondering if "setup" counts as needing a language model?


I don't think the example is a good rebuttal of "thinking and reasoning require language".

It may be a decent challenge, probably still not an actual rebuttal, of "language is sufficient for all thinking and reasoning", but "X is required for Y" and "X is sufficient for everything encompassed by Y" are very different claims.


That's fair, it's not supposed to be a rebuttal of that :)...

Though now that you said it, I'm really thinking about the original statement.

Honest question, can you give me an example of thinking or reasoning that happens fully independently of reasoning via symbols or their manipulation?

I feel like I'm missing something obvious, but nothing is coming to mind right now :)...

Just thinking about the underlying statement and simplifying language down to symbolic expression. (I was originally going to say manipulation, but it doesn't feel like it quite fits...)


> Honest question, can you give me an example of thinking or reasoning that happens fully independently of reasoning via symbols or their manipulation?

I don't think we have anything but subjective, indirect understanding of how thinking happens, but I think that, at a minimum, what we describe as "reasoning" specifically is tightly conceptually related to, if not a subset of, manipulation of abstract symbols to which concrete experiences may be approximately mapped.

I'm not sure I'd say this is the same thing as language, but there's at a minimum a shared common symbol-manipulation underlying both. Do the capacities always come together? I'm not sure how we would know that; I think our ability to recognize reasoning is tied to it being mapped to language, and our ability to distinguish something as language rather than nonlinguistic signaling or mimicry of something else that is using language is tied to independent expression of reasoning through it.


If the brain actually used language to represent ideas in use, we should have succeeded in finding its universal grammar (https://en.wikipedia.org/wiki/Universal_grammar). Instead it seems more like language is a kind of lossy compression we keep reinventing for ideas, to export them from a brain in some tangible way and get them into another (even in the future).


Hmm, I'm not sure if that holds?

Just because you use a mechanism for expressing your ideas and reasoning, doesn't mean that underlying reality has to confirm in any way to it?

We invent new symbols and terms all the time as we experience new phenomena. A universal grammar may still be possible, barring incompleteness anyway, we may just be lacking a whole bunch of ideas still...


@erik_seaberg concluding anything from the lack of success in finding a universal grammar is a logical leap. The failure to find something does not necessarily mean it doesn't exist. Language is powerful for expressing abstract ideas without explicitly saying them, which suggests language is more than "lossy" compression, because it's more similar to Shannon's lossless compression with prefix codes. I see where you're coming from though.

@Felcon If language is a mechanism for expressing ideas and reasoning, it should reflect our cognitive processes that generate those ideas, so " [...] Just because you use a mechanism for expressing your ideas and reasoning, doesn't mean that underlying reality has to confirm in any way to it" is a bit contradictory. Are our cognitive processes not included in reality?

The existence of a universal grammar is a specific hypothesis that requires empirical evidence. It's tiring to hear Chomsky's ideas parroted despite no empirical framework to stand on. What ideas could we be lacking? This argument is similar to String Theory proponents who kept pulling ideas out of the ether to support an unsubstantiated theory.


Firstly I meant conform btw, not confirm... Unfortunately it's too late to edit!

> Are our cognitive processes not included in reality?

Ah, this may not be productive, but I'm really just trying to tease apart different things.

Of course our cognitive processes exist in reality, however I would say there's nothing that requires that what they produce must materially map to reality.

We do not for example treat dreams as evidence, even though they run on our cognitive processes.

> It's tiring to hear Chomsky's ideas parroted despite no empirical framework to stand on.

To be honest, I had no idea I was doing that...

> What ideas could we be lacking?

I'm not claiming a lack of any specific ideas, I'm merely pointing out that considering that we do know that we invent terms for phenomena that we experience and I doubt that we have been exposed to even the majority of all phenomena, it seems unlikely that we can casually refute the existence of a universal grammar.

Absolutely, proving it does require evidence, which is in short supply, and if I were pressed, I would suspect that its existence is unlikely, but not impossible.

Now just to be clear, I don't mean that this kind of reasoning can be useful for much else, only with regards to attempts to find some complete unification such as a universal grammar. In those specific cases things become a little fuzzier and reasonable people can disagree.


I would agree with you, except that I would say the reasoning problem itself not only utilizes language, but in fact hinges entirely on the language.

There's a bit of spatial knowledge involved to understand that you're walking in a circle around the north pole. But the reasoning needed to get the answer to the question is based on the language.

Specifically, the language tells us that our starting point is at the north pole. Then the language of the 4th point states "to pass your starting point", which has two meanings - to cross over the starting point (as in passing the finish line in a race) or to pass by it off to the side (as in passing a store as you're traveling along a road). But since we're walking in a circle around it, we'll never pass it in either sense of the word.

Had it used different language like "How far did you have to walk to complete a circle around your starting point?" then the answer would be quite different, as would the reasoning.

But that wasn't the language used. So the language completely determines the answer and the reasoning involved, including whether you even need to think about distance.

One could also argue that all four of the answer options are wrong, partly since 1km is not a very far distance, and therefore you were always close to your starting point, depending on your ideal of 'close'. But more specifically simply because the language saying you "never came close" to it would be nonsense because you started right at your starting point, and of course can't get much closer than that. So again you don't even really need to account for distance. The language alone determines it.


> But since we're walking in a circle around it, we'll never pass it in either sense of the word.

Are we, though? Or did we start on a great circle around the Earth from the random point 1km from the north pole?

It depends on whether you assume someone has in mind a "straight line" following a map, or what they'd actually experience as a straight line given the scale of the Earth.
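The two readings give very different walks. A rough numeric sketch, assuming a perfectly spherical Earth with a mean radius of 6371 km (both are simplifying assumptions, not something stated in the tweet):

```python
import math

R = 6371.0  # assumed mean Earth radius in km, Earth treated as a sphere
d = 1.0     # distance walked away from the pole, in km

# Reading 1: stay 1 km from the pole, i.e. follow the circle of latitude.
# Its radius measured through the sphere is R*sin(d/R), slightly less than d.
latitude_circle = 2 * math.pi * R * math.sin(d / R)

# Reading 2: after the turn, walk a true straight line (a great circle).
# You only return to the turning point after a full circumference.
great_circle = 2 * math.pi * R

print(latitude_circle < 2 * math.pi * d)  # the map-style circle is shorter than 2*pi km
print(round(great_circle))                # the great circle is tens of thousands of km
```

Under the first reading the loop is just under 2×Pi km; under the second the walker is back at the turning point only after going all the way around the planet, and never gets closer to the pole than the 1 km they started from.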


I think the problem the author is pointing at is that there is no reasoning behind it. It is sheer coincidence that it eventually works for problems that require logic. It could mostly be that the dataset raises its chances of being right, not that it actually processed anything.


Language is useful to transmit the concepts but is not sufficient to actually solve problems with those concepts.


My bet is that this is wrong, and at the same time, language isn't required - just sufficient. I see concepts as defined only through associations with other concepts (which can be modeled as proximity in high-dimensional space, and that's precisely what LLMs are doing) and, sometimes, through memorized sensory data - the latter isn't the typical case, but it's needed to make the recursive definition (concepts defined in terms of concepts) stay anchored to reality.

From that follows that written language is enough to build that structure of concepts (latent space in ML terms). So is spoken language. So is vision in general, or hearing in general[0]. The brain will build one concept space out of all inputs available; it is necessary and sufficient to have at least one, but none of them alone is itself necessary.

--

[0] - Languages are higher-level regularities used for communication, growing on top of those senses, but not strictly necessary for understanding the real world. I'd use people who have no perceptible inner voice and high visualization skills as a counter to the idea that concepts need to be thought of symbolically in something resembling a written or spoken language.


This interaction down the chain is interesting:

Q: yann do you have an internal monologue?

A: Not that I know of.


It’s all about the training. LLMs can have a better understanding of space than humans, to the point where they can draw things better than us.

Don’t restrict LLMs to text. If you train one with images and text you’ll get one that understands position.


I think the issue might be that people like to throw the tool at the whole problem when it’s not the right tool for a lot of it.

Don’t use LLMs for logic. Use it for colour and flavour. Generate your own layout, populate the rooms in a manner befitting the context, difficulty, story arc, and then whenever a user makes an action, update the state then pass to the LLM the state and ask it to describe the room and make a few other smaller decisions.
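A minimal sketch of that split, assuming a tiny hand-written state object; `GameState`, `apply_action` and `describe_prompt` are hypothetical names, and the actual model call is left out on purpose:

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    location: str
    inventory: list = field(default_factory=list)
    room_items: dict = field(default_factory=dict)  # room -> list of item names

def apply_action(state, verb, noun):
    """Deterministic game logic: the only place where state changes."""
    items = state.room_items.get(state.location, [])
    if verb == "take" and noun in items:
        items.remove(noun)
        state.inventory.append(noun)
        return True
    return False

def describe_prompt(state):
    """Build the text handed to the model; it narrates, it never decides."""
    items = ", ".join(state.room_items.get(state.location, [])) or "nothing"
    carried = ", ".join(state.inventory) or "nothing"
    return (f"Describe the room '{state.location}' in two sentences. "
            f"Visible items: {items}. The player carries: {carried}. "
            "Do not invent exits or items.")

state = GameState("cellar", room_items={"cellar": ["lantern"]})
apply_action(state, "take", "lantern")
print(describe_prompt(state))
```

Because the prompt is rebuilt from the state on every turn, the model cannot "forget" that the lantern was already picked up.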


I think the best thing to do in a scenario like that is standard procgen where you randomly, logically generate the rooms with a bunch of descriptive tags, and then LLM-ify the room descriptions, with some context for what the world/area is supposed to be like.


It can even extend dynamically, using prompts to generate new rooms as the user enters them while keeping track of the generated world outside of the LLM.

Which is the same way nearly all procedurally generated games work.
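As a rough illustration of generating rooms on first entry while the world itself lives outside the model, here is a sketch that assumes a simple 2D grid; the `World` class and the tag list are invented for the example:

```python
import random

DELTAS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
TAGS = ["damp", "torchlit", "collapsed", "mossy", "echoing"]  # illustrative tags

class World:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # Grid position -> room record; this dict is the source of truth.
        self.rooms = {(0, 0): {"tags": ["entrance"]}}

    def move(self, pos, direction):
        """Return the new position, generating the room on first visit."""
        dx, dy = DELTAS[direction]
        new_pos = (pos[0] + dx, pos[1] + dy)
        if new_pos not in self.rooms:
            # A model call could go here to turn the tags into prose.
            self.rooms[new_pos] = {"tags": self.rng.sample(TAGS, 2)}
        return new_pos

world = World(seed=42)
here = world.move((0, 0), "east")
back = world.move(here, "west")   # lands on the entrance again
again = world.move(back, "east")  # same room as before, not a new one
print(len(world.rooms))
```

Using coordinates as keys means going east and then west always returns to the same room, and a revisited room keeps the tags it was given the first time.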


Did you try AI Dungeon 2, in 2020 when it was a thing?


It's still running now:

https://aidungeon.com/


Apparently still with a version of the NSFW filter that made everyone abandon it in 2020.


Really? I was under the impression that was an OpenAI limitation, which they eventually worked around by deploying their own models.


It's both.

The game started as a side project that went crazy viral as the first real way people could access an LLM (GPT-2 at the time). That of course immediately led to people using it for NSFW content, and they were really the first project to have to deal with the question of what to do about that content. Their initial reaction was to deploy a filter, which worked but was unpopular with many in their early audience.

I haven't followed the project for a while, but my understanding was that they rolled back the filter at least for paid accounts. But when they had the filter it wasn't just an OpenAI restriction because their filter actually predated GPT-3 and OpenAI's APIs.


"worked" as in "made OpenAI happy" but not as in "retained their customers". People were getting banned for using words like "melon" (racist implication, said the filter).


Looks more like a prompting and coding issue. ChatGPT4 is not strong AGI. It's an AGI, but you need to guide it.


Not by themselves... but with a good program built around it, managing the actual state, and very careful prompting, yes. I've been thinking about this for a while; I've always wanted to make a game.


Exactly. I've been working on something like this for a while using finite state machines to control the prompts. The biggest struggle I've had is creating memory. It's one thing to save a list of events and feed it into the prompts, but there are always issues interpreting it and making sure the memories are detailed enough but not an entire page.

For example, you "conquered the dungeon in level 1". If this gets saved as "conquered the dungeon" the next time you get to a new dungeon, it may think you already beat it and won't generate NPC monsters, that kind of thing.
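One way to avoid that particular mix-up is to store where an event happened alongside what happened. A small sketch under that assumption; the `Memory` fields and the scope labels are hypothetical, not how the commenter's system works:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Memory:
    event: str    # e.g. "conquered"
    subject: str  # e.g. "dungeon"
    scope: str    # where it happened, e.g. "level-1"

def relevant(memories, scope):
    """Only memories recorded in the current scope are fed into the prompt."""
    return [m for m in memories if m.scope == scope]

def to_prompt_lines(memories):
    """Keep each memory to one short, fully qualified sentence."""
    return [f"The player {m.event} the {m.subject} in {m.scope}." for m in memories]

log = [Memory("conquered", "dungeon", "level-1")]
print(to_prompt_lines(relevant(log, "level-1")))
print(to_prompt_lines(relevant(log, "level-2")))  # empty: the new dungeon is untouched
```

Filtering by scope also keeps the prompt short, since memories from other areas are never included.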


I am also working on something similar. I have written a grammar system for controlling the llm that is not context free.

One of the main challenges is to have a database of what is reality, id -> object, and then have the llm return ids for objects referenced by the user. I am doing a system where something that has no id has not been created yet.

I use story templates with labeled segments to create characters, items, locations as well as events. Example: "Once upon a time there was a <role>cowardly knight</role> named <name>Billy Bonkers</name>." The named qualities become the data structure associated with the id of the new character. It can be hilarious to prompt the llm to do nerdy space humor!
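A guess at how such labeled segments might be turned into an id -> object entry. This is only a sketch: the regular expression, the `registry` dict and `create_from_story` are assumptions made for illustration, not the commenter's actual grammar system:

```python
import re
from itertools import count

TAG = re.compile(r"<(\w+)>(.*?)</\1>")  # matches <label>value</label>
_next_id = count(1)
registry = {}  # id -> object; anything without an id has not been created yet

def create_from_story(text):
    """Collect the labeled segments into one record and register it under a new id."""
    record = {label: value for label, value in TAG.findall(text)}
    obj_id = next(_next_id)
    registry[obj_id] = record
    return obj_id

story = ("Once upon a time there was a <role>cowardly knight</role> "
         "named <name>Billy Bonkers</name>.")
knight_id = create_from_story(story)
print(knight_id, registry[knight_id])
```

Later prompts can then ask the model to answer with ids only, and any id missing from `registry` can be rejected.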

It seems to be very possible to create a "narrator driven" game happening in a "real" world.

I plan on using a multimodal llm and all the stuff that is created will have an image. That combined with id-less objects existing but not having a detailed description will make qualities and objects visible in the created images also part of the world.


I made a D&D game for my kids using llama. It went surprisingly well. Lots of potential here.

https://blog.katarismo.com/2023-05-26-i-m-a-dad-i-replaced-m...


FYI I'm getting ERR_CERT_DATE_INVALID


Recently I'm getting some strange behavior from ChatGPT (with GPT-4) where it outputs a code snippet in which it declares a variable, and then on the very next line it forgets the name of the variable and refers to it as something else. Or it refers to the variable but it makes a typo in the name.

If that's the behavior by one of the best models on a few lines of code, I'm not hopeful for a world simulator any time soon unless we see another big leap in model quality.


I feel like it’s an open secret at this point that chatGPT is not useful for anything requiring consistent, logical thought which is a requirement for most human jobs.


ChatGPT4 has been, in general, both better and worse than 3.5. I sometimes start a conversation with GPT4 and then have 3.5 fix it in another window (by pasting the code and pretending it's mine). It feels like things have degraded for coding tasks specifically, but the more recent knowledge of GPT4 is helpful if I'm asking how to use a newer library, as it won't just pretend the library exists. GPT4 is also much slower.

I wish there was a ChatGPT 3.5 updated with more recent knowledge.


Think about what the model is doing:

It’s sampling from a statistical distribution of likely tokens, one of which is the correctly spelled variable name.


Yes. The presence of the correctly spelled name earlier in the context should've dominated the distribution so much that it should've been extremely unlikely to select an incorrectly spelled variable name in the place where the correct name should go.
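To make the "dominated the distribution" point concrete, here is a toy softmax over three candidate tokens. The logits are invented for illustration and say nothing about what any real model computes:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn a dict of token -> logit into token -> probability."""
    scaled = [x / temperature for x in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

# Made-up logits for the token that follows a declaration of `total_count`.
logits = {"total_count": 9.0, "totalCount": 4.0, "count_total": 3.0}

probs = softmax(logits)
hot = softmax(logits, temperature=2.0)
print(probs["total_count"] > 0.99)                # correct name dominates
print(hot["total_count"] < probs["total_count"])  # higher temperature flattens it
```

With these numbers the correct name gets over 99% of the probability mass at temperature 1, but only roughly 88% at temperature 2, so sampling settings alone can make a misspelling noticeably more likely.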


What if the model got more training data? There’s no guarantee that the training data causes the model to converge on correctness. It’s not like this stuff is curated for specific problems.


My point is that in previous versions it was sampling more correctly more often than it has been recently.


People seem to have noticed that ChatGPT has good days and bad days. We don't really know what OpenAI is doing in the background, so it could be that some days you get assigned a really bad B in an A/B model deployment test, or maybe they're throttling the performance because they're at peak, or likely some other factor. But it's an observed phenomenon that it performs well and then doesn't.


From what I've heard, the reason it's doing that is to add an AI fingerprint. It has to pick less likely tokens to encode this information to the detriment of the output quality. Unfortunately it's only hearsay but it made sense to me so I thought I'd share.


Yes, I've noticed it this week as well. Simple errors like misspelling variable names.


Maybe they can, but they’re trained on conversation and stories (among other things) rather than simulations. These are more general than the log of a simulation run in how they use time. Stories can be chronological, but they can also fill in the details in any order using things like flashbacks. Or they can get really complicated with time travel stories.

So it seems like to understand a story well, an LLM would need a more sophisticated notion of time, where there is a history and events are slotted in as they’re learned from the narrative, and sometimes the whole history gets reinterpreted based on new information. (Plot twist!)

It would be fascinating if some mechanistic interpretability researcher figured out how they actually work with story time. Are there the rudiments of that kind of understanding yet?


Now I wonder what one would get from an LLM trained entirely on several bajillion text logs of people playing text games like Zork.


Yeah, might be interesting. I think it wouldn’t be too hard to generate training runs automatically, especially given a solution to start from.


I was able to get this to work with an LLM, but had to build some short term memory to keep awareness as you explored. The current site allows you to interact with an Oregon Trail or Zork like world, but you can specify any time in history and then it will create the stories and stay within that world. I also have it generate some images to go along with you for fun. https://grue.is/ (PS, I don't know how to code, so this is also proof that you can use an LLM to write all the software, here is my github if you are interested in learning more about that: https://github.com/lrspeiser/Grue.is)


The answer is yes.

https://aidungeon.com/

This actually came out before chatGPT and it floored me.


This is neat. In theory, you could hook llama.cpp into a GOAL-based planner (https://www.gamedevs.org/uploads/three-states-plan-ai-of-fea...) and have much better default bots navigating your nav mesh. Even better, if you record player actions as GOAL actions within the nav mesh, you can use that to fine tune the model. Or even feed it back in realtime so they learn the modus operandi of the player.


Try it!

All you need to do is type this into llama 3 8B or 70B 16fp instruct:

“You are a text adventure game.”

Done.


Inform7 is a beast, and the resulting game can be run on a 486. No AI required.


I think there is a MUD effort called LlamaTale which allows creation of this as telepresence. I’m trying to put a MUD together to try this out.


Curious, can this objective be reworded as "How big a universe an LLM can serve to simulate?"


Can headlines written in the interrogative case ever be answered in the affirmative?


There exists at least one headline written in the interrogative case that can be answered in the affirmative.



genius!


Technically “text-based world simulator” pretty much is an explanation of LLMs already. That’s why we all suddenly care about dumb chatbots: they accidentally cracked the frame problem.


TL;DR: unfortunately the conclusion is, no they can't

> the best recorded performance is 59.9% on accurately simulating state transitions [...] after 10 steps, average simulation accuracy would reduce to less than 1%. Our results indicate that LLMs are not yet able to reliably act as text world simulators
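The drop from 59.9% to under 1% is what you get if errors simply compound. A back-of-the-envelope check, assuming every step has to be right and that step errors are independent (my simplification, not necessarily how the paper computes it):

```python
per_step = 0.599  # best single-step accuracy quoted above
steps = 10

# Assumption: a rollout is only correct if every step is correct,
# and each step succeeds independently with the same probability.
rollout = per_step ** steps
print(f"{rollout:.4f}")  # prints 0.0059
```

That is about 0.6%, consistent with the "less than 1%" figure in the quote.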


Would be interesting to see a version of this but incorporating tool usage features of the latest models.


I've already built things like this. Cool paper but I build things, put them to the side, never write a paper. Sometimes tweet about them. Then I see a paper later about some similar thing and the crowd goes wild. Not a complaint. Idk what else to say.


Nethack FTW :)


Seriously, ha, time is a flat circle.


I still love slashem.


Too much data are locked up in non-public sources that still affect the world. You will not find complete engineering analysis reports for the Ford F-150 engine on the internet or full email exchanges for Trump's presidential campaign planning in Ohio. Yet these all influence us.


The answer is no



