Hacker News

At an AI meetup in San Francisco someone said this:

“Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world. Your earlier newspapers are the only source.”

To me, this is where LLMs will eventually end up: the same content being fed back in again and again.



> "Beware of first-hand ideas!" exclaimed one [...] "First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation." – E.M. Forster's 1909 short story "The Machine Stops"


^ The Machine Stops is really shockingly good at its predictions. When reading, remember that moving pictures were brand new, and color photography had only just become something you could do outside a lab or a highly specialized setup. Radio communication had just started to be used by governments. And yet it describes the life of a fully online Influencer™.


And I don't expect its predictions to suddenly falter, either.


Sounds like a Wikipedia editorial policy.


The problem with this claim is that it’s objectively not how OpenAI works. First, they pay contractors to do RLHF, so that’s a limited new source of data. More importantly, they have a huge user base generating new content (conversations) and rating it too! One could be suspicious of including responses generated by the model, but the user-written text in ChatGPT conversations is not going to be AI-generated, so you grow your corpus that way.

If you just slurp all AI content sure, you get the collapse this paper talks about. But if you only ingest the upvoted conversations (which could still be a lot of data, and is also a moat by the way) what then?

The other reason I find this line of argument overly pessimistic is we haven’t seriously started to build products where this gen of LLMs converse with humans in speech; similar opportunities to curate large datasets there too.

Finally, there is no reason OpenAI cannot just hire domain experts to converse with the models, or otherwise build highly curated datasets that increase the average quality. They have billions of dollars to throw at GPT-5; they could hire hundreds of top tier engineers, mathematicians, economists, traders, or whatever, full time for years just debating and tutoring GPT-4 to build the next dataset. The idea that slurping the internet is the only option seems pretty unimaginative to me.


They wouldn't be able to finance the creation of new content that would constitute more than a rounding error compared to all the writing produced by humanity in all of history that they got for almost nothing. The opportunities for new training data are in non-public documents like internal corporate and government documents and communication and private text messages and chat transcripts. After that, you have non-text sources like video and audio. Imagine paying people a few bucks per week to use an app that records all audio all the time, anonymizes it, and incorporates it into a training corpus, or paying for access to home security cam footage and audio. McDonald's could create a new revenue stream by recording all human speech and activity in every one of its kitchens and dining rooms.


Do they have to start over from scratch, or can they use all of the data they currently have and then either add more scraped data that has been curated by humans or just outright buy data that isn’t publicly available?

Considering that RLHF took GPT-3 from a text-completion model to an instruction-following chatbot, you could use expert feedback to fine-tune the model in whatever domains you wanted, or in a mixture of domains, to produce an even more generally capable model.


> there is no reason OpenAI cannot just hire domain experts to converse with the models

If it didn't work for cyc...


> But if you only ingest the upvoted conversations

Given how prolific bot farms, karma farms, etc. are, you might still end up in the same spot with this criterion.


> Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world.

https://mwichary.medium.com/one-hundred-and-thirty-seven-sec...


I've seen this in reinforcement learning often, where the output of the model becomes its own training data. Once you hit the edge of the replay buffer things sometimes take a stark turn for the worse, as the model's initial failures are forgotten.
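A rough sketch of the mechanism, assuming a plain FIFO replay buffer (the `ReplayBuffer` class and its method names are made up for illustration, not taken from any particular RL library). Once the buffer is full, every new transition silently evicts the oldest one, so the early failures stop appearing in training batches:

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity FIFO buffer of past experience."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is
        # appended to a full buffer.
        self.items = deque(maxlen=capacity)

    def add(self, transition):
        self.items.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling over whatever is still stored.
        return rng.sample(list(self.items), batch_size)


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=3)
    for step in range(5):
        buf.add(step)  # the step index stands in for a real transition
    # Steps 0 and 1 (the "initial failures") are no longer stored.
    print(list(buf.items))
```

Real setups often use prioritized or reservoir-style buffers instead, which changes what gets forgotten and when.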


Our source is still only other humans. I don't think this will be a long-term problem.


The trick will be figuring out what part of your training data was actually made by humans.


Run it through an LLM and see if it gets the disease?


But a single document doesn't break the LLM. The problem happens when lots of training documents were AI-generated.
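A toy illustration of that accumulation effect, under heavy assumptions: the "model" here is just a Gaussian fitted to ten samples, and each generation trains only on the previous generation's output. The function name and parameters are invented for this sketch, and it is not the setup from the paper. One resampling round barely changes anything, but repeating it many times tends to shrink the fitted spread toward zero:

```python
import random
import statistics


def simulate_collapse(generations=500, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit."""
    rng = random.Random(seed)
    # Generation 0: "human" data from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        spreads.append(sigma)
        # The next generation sees only the fitted model's output.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    print(f"first fit: {spreads[0]:.3f}, last fit: {spreads[-1]:.3e}")
```

Mixing a share of fresh generation-0 data into every round would be the analogue of keeping human-written text in the corpus.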


> The trick will be figuring out what part of your training data was actually made by humans.

My point was that this will not be necessary for the same reason it isn't necessary to filter out human-made content.


Sounds like human society right now; remember the past, preserve it, recite it, protect it.

We’re just reviewing our prior stats and ensuring they do not deviate too much such that the wrong people would be impacted.



