At an AI meetup in San Francisco someone said this:
“Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world. Your earlier newspapers are the only source.”
To me, this is where LLMs would eventually end up: the same content being fed in again and again.
> "Beware of first-hand ideas!" exclaimed one [...] "First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation."
– E.M. Forster's 1909 short story "The Machine Stops"
^ The Machine Stops is really shockingly good at its predictions. When reading, remember that moving pictures were brand new, and color photography had only just become something you could do outside a lab or a highly specialized setup. Radio communication had just started to be used by governments. And yet it describes the life of a fully-online Influencer™.
The problem with this claim is that it’s objectively not how OpenAI works. First, they pay contractors to do RLHF, so that’s a limited new source of data. More importantly, they have a huge user base generating new content (conversations) and rating it too! I think one could be suspicious of including responses generated by the model, but the user-generated text from ChatGPT is not going to be AI-generated, so you grow your corpus that way.
If you just slurp all AI content, sure, you get the collapse this paper talks about. But if you only ingest the upvoted conversations (which could still be a lot of data, and is also a moat by the way), what then?
The other reason I find this line of argument overly pessimistic is that we haven’t seriously started to build products where this gen of LLMs converses with humans in speech; there are similar opportunities to curate large datasets there too.
Finally, there is no reason OpenAI cannot just hire domain experts to converse with the models, or otherwise build highly curated datasets that increase the average quality. They have billions of dollars to throw at GPT-5; they could hire hundreds of top tier engineers, mathematicians, economists, traders, or whatever, full time for years just debating and tutoring GPT-4 to build the next dataset. The idea that slurping the internet is the only option seems pretty unimaginative to me.
They wouldn't be able to finance the creation of new content that would constitute more than a rounding error compared to all the writing produced by humanity in all of history that they got for almost nothing. The opportunities for new training data are in non-public documents like internal corporate and government documents and communication and private text messages and chat transcripts. After that, you have non-text sources like video and audio. Imagine paying people a few bucks per week to use an app that records all audio all the time, anonymizes it, and incorporates it into a training corpus, or paying for access to home security cam footage and audio. McDonald's could create a new revenue stream by recording all human speech and activity in every one of its kitchens and dining rooms.
Do they have to start over from scratch, or can they use all of the data they currently have and then either add more scraped data that has been curated by humans or just outright buy data that isn’t publicly available?
Considering that RLHF took GPT-3 from a text completion model to an instruction-following chatbot, you could use expert feedback to fine-tune the model in whatever domains you wanted, or a mixture of domains, to produce an even more generally capable model.
I've seen this in reinforcement learning often, where the output of the model becomes its own training data. Once you hit the edge of the replay buffer things sometimes take a stark turn for the worse, as the model's initial failures are forgotten.
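To make the replay buffer point concrete, here is a minimal toy sketch, not taken from any real RL library. It assumes a plain FIFO buffer and an agent whose first few transitions are failures; the function name `fill_replay_buffer` and the parameters `capacity`, `num_steps`, and `failure_steps` are hypothetical names picked for illustration. It only tracks how many early failures are still available to train on at each step.

```python
from collections import deque


def fill_replay_buffer(capacity, num_steps, failure_steps):
    """Push one transition per step into a FIFO buffer and record how many
    early 'failure' transitions are still in it after each step.

    Assumption for this toy: the first `failure_steps` transitions are
    failures and everything after that is a success.
    """
    buffer = deque(maxlen=capacity)  # oldest entry is dropped once the buffer is full
    failures_remaining = []
    for step in range(num_steps):
        outcome = "failure" if step < failure_steps else "success"
        buffer.append((step, outcome))
        failures_remaining.append(sum(1 for _, o in buffer if o == "failure"))
    return buffer, failures_remaining


buffer, failures_remaining = fill_replay_buffer(
    capacity=100, num_steps=250, failure_steps=30
)

print(failures_remaining[99])   # buffer has just filled up: all 30 failures still present
print(failures_remaining[100])  # first eviction: 29 failures left
print(failures_remaining[129])  # 30 evictions later: 0 failures left
```

Once the step count passes `capacity`, every new transition pushes out the oldest one, so within `failure_steps` more steps the buffer contains only the model's recent, successful behavior. Nothing in the training data says what failure looked like any more, which is one plausible reading of the "stark turn for the worse" described above.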