At an AI meetup in San Francisco someone said this: “Imagine a newsroom where yo...

hn_throwawa_100 · on June 14, 2023

> "Beware of first-hand ideas!" exclaimed one [...] "First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation." – E.M. Forster's 1909 short story "The Machine Stops"

Groxx · on June 14, 2023

^ The Machine Stops is really shockingly good at its predictions. When reading, remember that moving pictures were brand new, and color photography had just become a thing you could do outside a lab / highly specialized setups. Radio communication had just started to be used by governments. And yet it's describing the life of a fully-online Influencer™.

CatWChainsaw · on June 20, 2023

And I don't expect its predictions to suddenly falter, either.

bitwize · on June 14, 2023

Sounds like a Wikipedia editorial policy.

theptip · on June 14, 2023

The problem with this claim is it’s objectively not how OpenAI works. First, they pay contractors to do RLHF so that’s a limited new source of data. More importantly, they have a huge user base generating new content (conversations) and rating it too! I think one could be suspicious of including responses generated by the model, but the user generated text from ChatGPT is not going to be AI generated, so you grow your corpus that way.

If you just slurp all AI content sure, you get the collapse this paper talks about. But if you only ingest the upvoted conversations (which could still be a lot of data, and is also a moat by the way) what then?

The other reason I find this line of argument overly pessimistic is we haven’t seriously started to build products where this gen of LLMs converse with humans in speech; there are similar opportunities to curate large datasets there too.

Finally, there is no reason OpenAI cannot just hire domain experts to converse with the models, or otherwise build highly curated datasets that increase the average quality. They have billions of dollars to throw at GPT-5; they could hire hundreds of top tier engineers, mathematicians, economists, traders, or whatever, full time for years just debating and tutoring GPT-4 to build the next dataset. The idea that slurping the internet is the only option seems pretty unimaginative to me.

istjohn · on June 14, 2023

They wouldn't be able to finance the creation of new content that would constitute more than a rounding error compared to all the writing produced by humanity in all of history that they got for almost nothing. The opportunities for new training data are in non-public documents like internal corporate and government documents and communication and private text messages and chat transcripts. After that, you have non-text sources like video and audio. Imagine paying people a few bucks per week to use an app that records all audio all the time, anonymizes it, and incorporates it into a training corpus, or paying for access to home security cam footage and audio. McDonald's could create a new revenue stream by recording all human speech and activity in every one of its kitchens and dining rooms.

throwuwu · on June 14, 2023

Do they have to start over from scratch, or can they use all of the data they currently have and then either add more scraped data that has been curated by humans or just outright buy data that isn’t publicly available?

Considering that RLHF took GPT-3 from a text completion model to an instruction-following chatbot, you could use expert feedback to fine-tune the model in whatever domains you wanted or a mixture of domains to produce an even more generally capable model.

moonchild · on June 14, 2023

> there is no reason OpenAI cannot just hire domain experts to converse with the models

If it didn't work for Cyc...

MSFT_Edging · on June 14, 2023

> But if you only ingest the upvoted conversations

Given how prolific bot farms/karma farms/etc are, you might still end up in the same spot with this criterion.

Dylan16807 · on June 14, 2023

> Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world.

https://mwichary.medium.com/one-hundred-and-thirty-seven-sec...

Buttons840 · on June 14, 2023

I've seen this in reinforcement learning often, where the output of the model becomes its own training data. Once you hit the edge of the replay buffer things sometimes take a stark turn for the worse, as the model's initial failures are forgotten.
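To make the replay buffer point concrete, here is a minimal sketch, assuming a plain fixed-capacity FIFO buffer (real RL libraries differ in the details, and the `ReplayBuffer` class and the transition fields are made up for illustration). Once the buffer is full, each new transition evicts the oldest one, so the early failures are no longer available to train on.

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity FIFO buffer: the oldest transition is evicted first."""

    def __init__(self, capacity):
        # deque with maxlen silently drops items from the left once it is full
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3)

# Steps 0 and 1 are the "initial failures" (reward 0.0); later steps succeed.
for step in range(5):
    buf.add({"step": step, "reward": 0.0 if step < 2 else 1.0})

# Only the three most recent transitions remain; the failures are gone.
remaining_steps = [t["step"] for t in buf.buffer]
print(remaining_steps)
```

After the loop the buffer holds only successful transitions, so a model sampling from it has no remaining record of what failure looked like.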

flangola7 · on June 14, 2023

Our source is still only other humans. I don't think this will be a long-term problem.

DennisP · on June 14, 2023

The trick will be figuring out what part of your training data was actually made by humans.

castis · on June 14, 2023

Run it through an LLM and see if it gets the disease?

DennisP · on June 14, 2023

But a single document doesn't break the LLM. The problem happens when lots of training documents were AI-generated.
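A toy way to see why it takes many generated documents rather than one: repeatedly fit a Gaussian to samples drawn from the previous fit. This is only a sketch under simplifying assumptions (a one-dimensional Gaussian stands in for the model, and `simulate_generations` and its parameters are invented for illustration); it is not the paper's experiment. Each generation trains only on the previous generation's output, and with the maximum-likelihood estimate the fitted variance shrinks by a factor of (n-1)/n in expectation per generation.

```python
import random
import statistics


def simulate_generations(n_samples=20, n_generations=50, seed=0):
    """Refit a Gaussian to its own samples and record the fitted std each round."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution
    history = [sigma]
    for _ in range(n_generations):
        # Training data for this generation comes only from the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood (population) std
        history.append(sigma)
    return history


history = simulate_generations()
print(f"initial std: {history[0]:.3f}, final std: {history[-1]:.3f}")
```

Individual runs are noisy, so try a few seeds and sample sizes; the drift is slower with larger `n_samples`, and mixing fresh draws from the original distribution into each generation would counteract it.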

flangola7 · on June 14, 2023

> The trick will be figuring out what part of your training data was actually made by humans.

My point was that this will not be necessary for the same reason it isn't necessary to filter out human-made content.

issore · on June 14, 2023

Sounds like human society right now; remember the past, preserve it, recite it, protect it.

We’re just reviewing our prior stats and ensuring they do not deviate too much such that the wrong people would be impacted.