Hacker News

There’s a fallacy that gets used a whole lot to justify things like this (not just with LLMs), and I see it in many of the comments here: If it’s OK (or at least negligible on a small scale), then it must be OK on a large scale.

It usually goes something like: If I can make money by learning something from a web page, why does a computer making money by learning everything from everyone upset people so? It’s the same thing!

It’s like if I go to Golden Gate Park and pick one flower, I shouldn’t do that, but no one cares. But if I build a machine to automatically cut every flower in the park because I want to sell them, that’s different.

“You say I can pick one flower, but you get upset when I take a bunch. That’s inconsistent. Check and mate.”

But quantitative changes in an activity produce qualitative changes. Everyone knows this, but sometimes they seem to find it inconvenient to admit it. Not that effects of the qualitative change are always bad, but they are often different, and worth considering rather than dismissing.




We ran into a lot of stuff like this in the early days of the web. For example, there was a lot of information that was "public" in that anyone could go to the city courthouse and ask to see the documents. But it changed in nature when you could suddenly look up anyone in the country by typing their name in your browser.

I am not quite sure why my address history, known aliases, and sometimes phone number, are publicly available to anyone who Googles my name, and I'm not sure how to opt out of this.

> I'm not sure how to opt out of this.

If you’re an EU citizen, do a web search for “right to be forgotten”.

https://support.google.com/legal/answer/10769224


Alas, I'm in the US.

Then as a California resident you should presumably be googling instead about the CCPA and the right to delete :)

I've done that. They pop up like hydra heads. The point isn't right to delete. The point is right to not have my personal info plastered all over the internet without me having to contact each site and say "plz stop" and for them to say "OK we'll do it in 7-10 business days"

You'd be considered lucky if you can even find an email address. I had some of my personal info crop up on Google searches on Fedora mailing lists. I've emailed various people at Fedora to get them to delete my old messages or redact them, but never received a response. :(

We used to ship mass lists of addresses and phone numbers to people in each town and it was fine/appreciated.

You could also easily opt out with the single entity that shipped that information.

Yes, but getting an unlisted number was considered weird and against the norm even if possible. Even in the early 2000s when I dropped my landline, my parents were aghast - "if you do that, you won't be in the phone book! How will anyone get in contact with you?"

And you often had to pay for the privilege... A dollar a month for them to not put your name and number in the phonebook.

Yeah and back then it wasn't used as a sort of UUID to track every single thing you do in your life... Different times

You ever had a bump in the night my guy?

Or a stalker?


For a practical example of that, a lot of documents used to have things like social security numbers, and they started stripping that information off once it was visible online.

> It’s like if I go to Golden Gate Park and pick one flower, I shouldn’t do that, but no one cares. But if I build a machine to automatically cut every flower in the park because I want to sell them, that’s different.

The problem here is that, in your example, the small-scale case and the large-scale case are both unacceptable behavior.

Learning from others at a small scale is not only socially acceptable, but is the foundation of how advancement works.

So scale isn't at its core the problem; it's that something that is desired behavior in a human is treated as socially unacceptable because a machine is doing it.


What a total non-sequitur. You think you found a flaw in one of the examples and, instead of seeing if you can come up with better ones, you say it can't be this, therefore it's a completely different thing that makes zero sense. Machines aren't "doing" anything, they're being wielded by humans. And they're doing it at scale, to other humans, via the force multiplication of machines.

Why would I attempt to come up with a "better" example of a premise I reject?

Your vague response doesn't seem to have anything to do with the base subject this whole thing revolves around. Plagiarism, be it small scale or large, isn't acceptable. And the idea is that humans doing things that are wrong is ok, but AI doing the same thing at large scale is not ok?


> Your vague response doesn't seem to have anything to do with the base subject this whole thing revolves around.

No, I instead refuted your reply.

> but AI doing the same thing at large scale is not ok?

No, humans doing things can be okay or not so okay depending on the scale they do them at. "AI" isn't "doing" anything by itself, at all, so that doesn't enter into it at all. You cannot separate "scale" and "thing". Rubbing your hands to make them warmer is fine, igniting a nuke is not; the two aren't "basically the same thing, raising temperature, just at different scales". You didn't reject the premise, you didn't understand it in the first place, and knocked down your own straw man instead. Which I pointed out, that's all.


Ha, I'm sorry, do you think you've made a logical point by comparing rubbing your hands together and "igniting a nuke"?

Again, this isn't a "this at small scale is ok, but at large scale it isn't" argument. Small scale plagiarism isn't acceptable, neither is large scale.

You are refuting my reply seemingly without the context of the article, and larger issue at hand.

Don't be condescending when you aren't even accurately following the original premise or purpose.


The context of this subthread is explaining that

> If it’s OK (or at least negligible on a small scale), then it must be OK on a large scale.

is a fallacy. Which it is. You confirm this by apparently seeing a difference between generating a little bit of heat and a whole lot, to name one of infinite examples anyone can easily come up with.

> Again, this isn't a "this at small scale is ok, but at large scale it isn't" argument.

You just keep doing the thing I pointed out in my first reply, you claim "it's not this" on a technicality, and then say "so therefore it's this instead", and the other thing is a criticism nobody brings up, ever.

And it gains you nothing, because if plagiarism isn't even okay at small scale, surely you can see how it's even less okay at big scale.


There are no such examples (recommended for humans, but abhorrent for machines).

> (recommended for humans, but abhorrent for machines).

That's not the criticism, that's the straw man used to dodge the criticism. Of course the straw man makes no sense, that's why it gets put up.

Machines aren't doing anything, humans are doing things, with or without machines.

"It's fine to raise the temperature of your surroundings by 0.0001 degrees by exhaling. It's less fine to set a house on fire, and even less fine to ignite a nuke. But aren't they all the same thing? How hypocritical that raising temperature is okay for some but not others???"

That things can change quality with quantity/frequency is trivially obvious, and you can think of many examples. Bad ones, good ones, doesn't matter. The point of OP stands, all that was added was how absolutely brazen the nonsense is getting.


Sorry, but the point stands for you because of how you feel about this topic. It does not stand logically.

Ultimately we have to reckon with the fact that there's nothing which is recommended to do X of, but is abhorrent to do 10X of.


> there's nothing which is recommended to do X of, but is abhorrent to do 10X of.

No we don't, because that's nonsense. You can ask a stranger in the street for the time of day once, and they will react very, very differently if you ask them 10 times in a row. You can drive N miles per hour in a school zone, you cannot drive at 10x the speed, and so on.


Ok, I see your point. We live with tiny inconveniences that we would not tolerate at 10x.

But I don't see how that relates to copyright or LLMs at all. 'Learning', at scale, is not an inconvenience, at least in any forward-looking society.


> There are no such examples (recommended for humans, but abhorrent for machines)

claiming to be human


> Learning from others at a small scale is not only socially acceptable, but is the foundation of how advancement works.

Exactly, if anything, the logic (a bit bad -> really bad) shows that one person learning from one thing is far inferior to one person learning from every thing (a bit good -> really good).


> The problem here is, in your example the small scale example, and the large scale example are both unacceptable behavior.

no, not really, or at the very least they're not at all in the same category of "unacceptable behavior"


The argument isn't small crime vs large crime. It is no crime regardless of scale.

If it is acceptable for a person to learn, then it should be acceptable for a machine. And any derived works produced from that information aren't theft or copyright violation.

Though I do think there is a valid gripe with the LLMs being trained on pirated materials. I've also personally learned from a lot of PDFs of textbooks I didn't own.


> If it is acceptable for a person to learn, then it should be acceptable for a machine.

Is there a name for the fallacy when people act like models and algorithms should be granted the same rights as human beings?


AI psychosis would be my term. When you attribute to software the characteristics of a human you are in a psychosis.

Many things share characteristics with humans, and for decades we have created methods for systems to emulate and synthesize those characteristics. It is sort of delusional to think that the abilities of humans can't be produced by other systems; it is a severe delusion to think that proposing a machine can do it is psychosis.

That’s not the topic. Read the post I replied to. No matter what a piece of software will never be a human.

I don't think anyone in this thread has suggested software are people.

But in the same regard it is very likely at some point the ability to simulate a human mind and persona is a real possibility.


Is it a fallacy? Can you provide a legal or logical basis for this being treated differently?

Hammers aren't granted rights.

Tools aren't granted rights. Why do we need to make an exemption for AI?


Well because no one is attempting to claim that the structures or products produced by using Hammers are plagiarism?

In a sane world, things produced by tools are owned and credited as creations by the users of tools, there are many who seem to argue that isn't the case with AI.

And that somehow anything produced based on the knowledge it was trained on is some sort of plagiarism or copyright violation of the original source material, even when none of that material is present in the end result?

So if we can't just leave it at "it's a tool", then we have to look at existing frameworks of laws and ethics to make the case for how this should be treated.


I'll just take my tools (video camera) into a cinema to learn off the latest Hollywood flicks. It's not an accurate 1:1 representation to the original source material, so the output that I've produced from it belongs to me.

Are you trying to make a shoestring argument?

Sure you can do that, but because there are several laws against that specific action already, you will likely face prosecution, and the content (something poorly duplicated, not created) would be seized.

But let's assume that your camera has an LLM in it, and it trained in this fashion, and you performed this action on countless other films, and then the camera could produce wholly unique and original work that did not have any duplication of the original works it sampled. The work produced would not be a violation of copyright, nor would it be plagiarism.

Just as someone whose education was to watch a large number of movies, and then created their own based on that education.

But as previously mentioned you may face the ramifications of violating the agreement you had for accessing the original source material in an illegal way.


> Well because no one is attempting to claim that the structures or products produced by using Hammers are plagiarism?

Of course they are! Is a video recorder not a tool? No one is claiming rights for video recorders.

Once again, the status quo is that tools do not get rights, the burden is on you to prove why an exemption should be made, not on those who are asking "why should tools get rights?"


Actually the burden is to prove that what AI is producing is plagiarism or copyright violation, this isn't about some special right, but that there are many making the case that things produced by AI are duplication of their work.

I'm also not sure where the concept of "the tool" being given a right to anything comes from. That certainly isn't my argument; the right to the work should belong to the user/owner who used the tool to create it. There are several pieces in the SFMOMA that use automation to create art, and that art is credited to the creator of the machine, not the machine. I see AI through a similar lens.

You are intentionally selecting a device that makes duplicates of things as your comparator, so I can't tell if that is biased or some sort of flaw in your argument.

But an LLM being trained on works, and generating something based off of that training, is not a duplication of any specific copyrighted material; something wholly unique is not duplication.


> But an LLM being trained on works, and generating something based off of that training, is not a duplication of any specific copyrighted material; something wholly unique is not duplication.

Right[1], and humans can do that, no problem - ingesting existing material and recombining them to produce something new (not necessarily unique) is a right that humans are afforded. The question being asked is, since we don't allow that right to any other tools, why does this tool need an exemption?

-------------

[1] Not really (i.e. I don't necessarily agree with this point), but let's assume it for the sake of this discussion.


>Learning from others at a small scale is not only socially acceptable, but is the foundation of how advancement works.

This is true, and it shows how human thought differs from AI. AI needs massive datasets to be coherent.


How large do you think the sum total of all things you've learned would be as a dataset? We aren't that different in that regard, just in how we amass that dataset and how curated it is.

Wasn’t his point about plagiarism? That is also not ok on a small scale.

I was trying to stick to the example, but I agree, that getting away with something doesn't determine if it is right or wrong. And the whole concept of that makes for shaky ground for any form of legal or ethical argument.

I think the difference here is that you guys are talking ethics. And in fact what we're talking about is enforcement. While it's unethical to pick one flower (in its purest form, robbing the commons of the beauty of a flower), it won't be enforced.

Fair. AI might also not be the problem, but how it is utilized.

Suddenly everyone and their grandma are specialists at everything and the actual value of understanding is not appreciated anymore.


Ok, what is so special about understanding anyway? We understand far fewer things than we do not understand.

IMO, we're just giving special weight to understanding just because it gives people wages. Someone's specific brain structure should not privilege them over others. UBI or something equitable on those lines is the answer.


> quantitative changes in an activity produce qualitative changes

Well said!

It reminds me of a Stalin* quote: "Quantity has a quality all its own."

* Note that it may be misattributed to him


> It’s like if I go to Golden Gate Park and pick one flower, I shouldn’t do that, but no one cares. But if I build a machine to automatically cut every flower in the park because I want to sell them, that’s different.

It's not like that, because flowers are a physical object and moving them to one place deprives their original location of the flowers. When an LLM learns something from a webpage, the webpage is still there. Whatever 'theft' you perceive is entirely in your head; you were deprived of nothing by someone else making a copy of your thing.


This is not true, because the copy is a devaluation of the original, so even though the web page is still there its value has decreased.

"It's not like that"

That's not the point. The point is that scale matters, and that was the only point.


> Whatever 'theft' you perceive is entirely in your head

Rather, it appears to be in your head, since the person you’re replying to has not mentioned or even hinted at theft. The problem with taking all flowers from a public park for your own profit is multifaceted. Amongst others, you’re depriving everyone else from enjoying them, but also degrading the image of the park and harming all the insects which depend on those flowers and the birds who depend on those insects, which in turn degrades the park further, which stops people from enjoying it and going there and caring for it. It’s not about a single physical object, it’s about the ripple effect the selfish action produces.


Can you apply your philosophy to the U.S. dollar? I am sure producing copies is a "theft" that is entirely in your head. You were deprived of nothing by someone else making a copy of your dollar.

It's not like that, because flowers are a physical object and moving them to one place deprives their original location of the flowers. When an LLM learns something from a webpage, the webpage is still there. Whatever 'theft' I perceive is entirely in my head; I was deprived of nothing by someone else making a copy of my thing.

I get that the intention here is to plagiarize and thus cause the parent to feel the harm of it and realize the error in their ways, but I don't think it works. Plagiarism's harm to the plagiaree (?) is that it robs them of credit and payment, but nobody is viewing your reply in isolation of the parent's attribution and parent wasn't expecting to make money off of an HN comment. The harm to the rest of society where you gain false esteem for another's work is also not carried out in this instance. The harm to the plagiarizer where they fail to learn because they copied instead is likewise absent. If someone were to feel harm just from a copy of their words existing, they wouldn't need you to do it- google has hastily indexed this along with every other HN comment and we all know that this whole thread will make its way into LLM training sets eventually.

> google has hastily indexed this

Google doesn't claim authorship over that which they index.

Plagiarism doesn't need to be harmful for it to be bad, and my intent wasn't to harm anyone anyway. My intent was that I could use the author's exact words to pretend to make a unique take that I claimed to have authored.


I don't understand. In what way is plagiarism bad if it doesn't harm? If it were harmless to pretend you authored a unique take, how is the parent expected to react to you not harming them such that they realize it's bad?

Harmless doesn't imply ethical. Plagiarism that doesn't harm is still lying.

Fair enough, shame on me for assuming utilitarianism.

But you're still depriving the world of future flowers. Why spend years studying, sacrificing time with others, living frugally if others can take or monetize the result for free? Most people need compensation to justify their effort. Or the option to not have their years of work/sacrifice co-opted into an ai generated ad for toilet bowl cleaner.

No cost copying doesn't remove the need for compensation to sustain ongoing creation. Society has long treated knowledge, art, and thought as high-value outputs, and accepted the copyright tradeoff to support them. That is long settled and no 'get rid of copyright' proponents argue satisfactorily why the 300 year corpus of thought on that is invalid. Long copyright terms may justify reform but not rejection of the establishment that creative work needs economic value to sustain ongoing creation, and that ongoing creation is a net positive/desirable for society.

You are free to release copyright free today. In software that has unlocked immense value. In other areas those choosing copyright have unlocked more value. But software is different, I can get hired to build on the free. No one is hiring an author to expand their book to include fanfiction. And were that the model, it would arguably result in worse results as we are now back to the much worse patronage system where Bob hoards what he's paid for and only shares it with friends for status. For 300 years we've understood that, because of these dynamics, paywalled copyright with a throttled side of libraries unlocks the greatest access to knowledge. Eliminating duplication cost has not changed that.

'but I want every flower there is today and I don't care if there are any future flowers' doesn't change that, it's simply a new value judgement that my want/use case today outweighs the cost to society of lost future knowledge creation/return to a patronage based reward system. Again 300 years of thought say that results in a worse outcome for society. How does the typical OSS project that depends on patronage fare? Do we really want to return all knowledge output to that model?


When the LLM presents what it learned as its own thoughts without any attribution, that's the theft.

And you understand that. You're not stupid. This is the thing: AI is convenient for corporations, so you'll make dishonest arguments to justify your unethical behavior. Maybe you even believe what you say, but that's because people will hold on to any flimsy thing that lets them feel like they're good people, not because the reasoning actually makes any sense.

This is why people talking about AI get booed at speeches. There's no conversation to be had: you're not interested in the truth, or what's right, or what's good for anyone but yourself.


If one person is murdered, that's bad. If a million people are murdered, that's war.

If one word is stolen by AI, that's bad. If a million words are stolen by AI, that's business.


more like

If one word is stolen by Joe, that's bad. If a million words are stolen by Meta, that's business.

AI isn't the problem, it's corporations using AI that are the problem


this made me oof. well said.

>If one word is stolen by AI, that's bad. If a million words are stolen by AI, that's business.

Where are all the instances of "one word" being "stolen by AI", and people getting mad over it?


Yes absolutely, when automation increases the rate of something many orders of magnitude that often is a qualitative difference.

It's weird to me how often on HN of all places I see arguments that can be refuted with "scale matters". I commonly see arguments on all sorts of topics that make the same mistake you're calling out.


I would say the difference there is: yes…you built a machine that “could” pick all the flowers. It did not, however, actually pick any flowers as you suggest. If you take the machine back and use it to pick the flowers, that should be a problem.

I think the problem with these things is that if the same metric and methodology were reversed, it doesn’t look favorably on artists either with such inflammatory framing: “The way the artist learned was to effectively plagiarize every piece of art they viewed, extracting important details in the way light, color, shading, anatomy or otherwise look in order to steal from the other artists, then replicated and combined those things as part of every future work they created, stealing over and over again.”

Handwaving away the small scale seems like it would ignore who has responsibility in the small scale. Metaphorically speaking, who in the small scale is responsible for plagiarism: the person making the paints or the person with the brush who sells them to an unsuspecting public? Point is, in this case, the user is the one holding the brush and trying to pass things off.

To be clear, I don’t really disagree with the fact their copyrights were likely violated and they should likely be liable for damages, which is for a court to decide, not me. They should have sourced their data sets properly, certainly, and other companies have. I just think the arguments really need improvement without simply falling back on the tropes, and hopefully it helps make sense why some people will take issue with arguments that others want to simply dismiss as invalid.


> But quantitative changes in an activity produce qualitative changes.

Interesting take. I think a corollary is that the qualitative changes are in the economics of things. And more than the scale, it is the value of those economic effects that determines how "accepted" that activity becomes.

Take Uber as an example; it basically enabled mass avoidance of taxi regulations, and naturally existing taxi drivers and lawmakers cried foul. But enough people found value in the service and kept using it that gradually and inexorably society and laws adjusted to it.

On the other hand, copyright infringement is an interesting case. While pretty much everyone and their dog pirates content to some extent, the % of people who think it's acceptable to do so is surprisingly small (22% apparently, up from only 14% in 2019). Furthermore the media industry, especially including ads, is a significant % of US GDP. I think those reasons, more than any RIAA/MPAA lobbying, are why copyright laws have remained as stringent as they have.

As such at a social level, I don't think these effects were dismissed, rather they were considered and formally internalized.

I suspect the same thing is happening with AI companies. They get away with devouring and training on the sum of human knowledge largely because existing laws are insufficient to stop them. So stopping this would require new laws but... well, given the early economic impact LLM technology is having my hunch is new laws will be brought in to protect it rather than restrain it.


> gradually and inexorably society and laws adjusted to it.

But in many places, the ways that society and laws adjusted to it were to make extra clear in their local ordinances that Uber was required to operate as an actual taxi service, or get out.

It's very disingenuous to imply that the public broadly decided Uber was Right, Actually, when both in its case and in that of many of the other gig economy companies, what really happened is that gradually and inexorably, they had to adjust to society and laws.


I didn't mean to imply Uber was "right" or it unilaterally got laws adjusted in its favor. Both sides had to adapt, but very clearly regulations had to be significantly adjusted as well: https://www.blackcarnews.com/article/the-uber-taxi-partnersh...

I followed this evolution peripherally as it happened, because while I appreciated the convenience of Uber, I disliked that it was unfair towards existing taxi drivers who had very onerous requirements like taxi medallions, which, note, never became a requirement for rideshare drivers.

I remember at one point Uber drivers at the airport would ask me to pretend I'm a friend being picked up to avoid trouble with the cops, and then a couple of years later there was a dedicated, official "Uber pickup lane."

My underlying point was that the whole system -- including Uber, incumbents, society and laws -- adapted to a new economic reality.


In the era before the internet, gaps in information and knowledge could be turned into money and power.

In the era after the internet and before LLMs, the information and knowledge gaps have been largely leveled in theory, but the recognition wall stops most of us from understanding and making use of them.

In the era after LLMs, that wall is being destroyed, and people should think about how to use this information and knowledge differently to make money and power.


This is a great point. I think for coding, the wording of the MIT open source license makes it clear that copying and distributing the software is authorised on a small scale and it's very clear that the act of copying must involve a person.

It provides distribution and modification rights to "any person obtaining a copy of the software" and explicitly requires attribution for any significant parts.

Mass-ingesting the code with a script without any human even reading the licence is a very different kind of copying mechanism and there is no person involved... The contract was bypassed completely. A contract requires consent from both parties to be binding. When ingesting code into the AI training set, nobody even read the license. There was no agreement; neither explicit nor implicit... Because the consumer, a script, never read the contract for that specific project.

There was nobody present when the copying occurred; on neither side! It cannot possibly constitute an agreement between two parties.


This would be an extremely novel mechanism of copyright litigation and I doubt it would fly in an American court with its emphasis on highly individualized legal rights and obligations. And, if it did get accepted by the courts, that's halfway to an even crazier argument: that the MIT license only allows individual distribution to known parties; i.e. no hosting the code on a website or seeding it on BitTorrent, because that's not "small scale" and doesn't "involve a person".

You can only seed it on BitTorrent if it comes with the license which identifies the original author and acknowledges their copyrights over the code. Also there is definitely an assumption that a human will read the license or at least implicitly consent to the terms before using or modifying the software. When ingested by AI, the author gets zero credit and no consent has taken place between any sentient being on either side of the contract... Or at least none that are legally acknowledged as sentient or having legal rights.

And the thing is, you point out the easy out on this for similarly licensed code... a giant list of authors and contributors that may have code included in the generated output. It's a win/win for everyone. The original authors get their acknowledgement, and the AI company gets to bill the users of AI for all the tokens for that multi-gigabyte copyright disclosure file.

> I think for coding, the wording of the MIT open source license makes it clear that copying and distributing the software is authorised on a small scale and it's very clear that the act of copying must involve a person.

I agree with “must involve a person”. https://opensource.org/license/mit starts with (emphasis added) “Permission is hereby granted, free of charge, to any PERSON obtaining a copy of this software and associated documentation files (the “Software”)”.

That means it doesn’t give an LLM any rights. The way I see it, LLMs run (directly or indirectly) by a person can do stuff on their behalf, though, just as your CI pipeline can download and compile MIT-licensed software.

I definitely disagree with the “on a small scale” as the license continues (again, emphasis added) “to deal in the Software WITHOUT RESTRICTION, including WITHOUT LIMITATION the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software”.


The CI pipeline is different because for a module to end up as a dependency in the CI pipeline, it had to be explicitly selected by a person first to be included in the package file or manifest. There was intentionality and awareness that the software was included.

A person already pre-consented to the licenses of all the software which the pipeline downloaded. Big companies go through those dependency lists carefully already and remove those which do not meet their policies. This is a very intentional process.
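To make that review step concrete, here is a minimal sketch of an automated policy check. Everything in it is an assumption for illustration: the allowlist, the package names, and the `find_policy_violations` helper are hypothetical, and a real pipeline would read the declared licenses from its package manager's metadata rather than from a hand-written dict.

```python
# Hypothetical policy: licenses a company has decided it will accept.
ALLOWED_LICENSES = {"MIT", "BSD-3-Clause", "Apache-2.0"}


def find_policy_violations(dependencies, allowed=ALLOWED_LICENSES):
    """Return the sorted names of dependencies whose license is not allowed.

    `dependencies` maps a package name to its declared license identifier,
    or to None when the package declares no license at all.
    """
    return sorted(
        name
        for name, license_id in dependencies.items()
        if license_id not in allowed
    )


if __name__ == "__main__":
    # Made-up dependency list, as a CI step might assemble it.
    deps = {"left-pad": "MIT", "some-lib": "GPL-3.0-only", "mystery": None}
    violations = find_policy_violations(deps)
    if violations:
        print("Needs human review:", ", ".join(violations))
```

Treating a missing license as a violation is a deliberate choice here: an undeclared license means nobody can have consented to its terms.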


> for a module to end up as a dependency in the CI pipeline, it had to be explicitly selected by a person first

I disagree. I think it’s entirely within the license to have your pipeline automatically pull in the latest version of a library, even if the new one happens to pull in a new MIT-licensed library (whether that’s a good idea and whether CI pipelines should, somehow, verify that code pulled in has an acceptable license are different discussions)

I also think it’s completely within the MIT license to tell an LLM that it can search for MIT-licensed libraries and use them without asking you.


That's like saying you're not allowed to load the source code into an editor, because it's not a person. Or that you're not allowed to run a global search-replace on the entire code base, because it's a script and not a person.

But in this case, a human has awareness of what software they are copying or modifying and that's how the original software author receives credit. The contract requires some degree of human awareness to be valid. This is the critical difference.

Sorry, that's nonsense. There's human awareness when ingesting MIT code into an LLM too. In both cases it's a human that says $ execute-global-replace or $ ingest-into-llm

Both operations require some degree of human awareness. What you appear to be saying is, a human can only use a limited algorithm to access this source code, not a sophisticated one. And where do you draw that line? Who should get to say what is too sophisticated?

Error: your algorithm is too sophisticated to proceed, please provide more human awareness, it's a critical difference.


If your LLM were to hack into Microsoft and steal the source code from an important project and inject it into your project without you being aware of it; wouldn't that make you liable if you then published it?

Unfortunately there is no way to agree to the license of software you're using if you didn't read the license or if you're not even aware that you're using the software. This is what's happening at the training stage.

If you say that awareness doesn't matter then it means you cannot stop AI from stealing any IP open source or not.

I think the main issue with LLMs is that there is no mechanism to stop them from stealing. Thus they are guaranteed to infringe on copyright to some extent.

Also, beyond copying and copyright, there is another problem that LLMs are also infecting the logic and expertise built into the project. This is a completely novel mechanism and needs to be treated as separate under the law. Else it would be the end of all IP.


> I think the main issue with LLMs is that there is no mechanism to stop them from stealing.

Well, sure there is—for the people running them.

If you're building training data for an LLM, you only use data that a) is firmly in the public domain, or b) you have a clear and documented legal right to use.
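As a rough sketch of what that rule could look like in a data pipeline (the `Document` record, its field names, and the `"public-domain"` label are all hypothetical, not taken from any real training stack):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    source: str
    license: Optional[str]           # e.g. "public-domain", "MIT", or None if unknown
    rights_documented: bool = False  # True only if a usage right is on file


def eligible_for_training(doc):
    """Apply the two conditions: (a) public domain, or (b) a documented right."""
    if doc.license == "public-domain":
        return True
    return doc.rights_documented


def build_training_set(docs):
    """Keep only the documents that pass the eligibility rule."""
    return [doc for doc in docs if eligible_for_training(doc)]


if __name__ == "__main__":
    corpus = [
        Document("old-novel.txt", "public-domain"),
        Document("licensed-news.txt", "proprietary", rights_documented=True),
        Document("scraped-blog.txt", None),
    ]
    for doc in build_training_set(corpus):
        print(doc.source)
```

Note that the default is exclusion: a document with an unknown license and no documented right never makes it into the set.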



Honestly that's what's wrong with capitalism and property rights. We can understand what it means to own a thing like a piece of furniture, or a house, and "a person's home is their castle" rings true. But scale that up to individuals controlling resources that affect a neighborhood, a city, a country, or the world -- at each step their army of voters supports their right to own 800 billion dollars or whatever, same as they own their own houses -- it's only fair! And if they want to build a starbase and launch some rockets near your house and sensitive ecology they're just exercising the same rights you or I have, and an attack on their ability to inflict damage on the community is an attack on all.

[edit] and the same goes for corporations owning "means of production". It's not the same as owning an iPhone.


> There’s a fallacy that gets used a whole lot to justify things like this ...

FWIW, this is the Fallacy of Composition

https://en.wikipedia.org/wiki/Fallacy_of_composition


Of course it's robbery. I don't think anyone is truly arguing it's not. The issue is that, if we don't do it, China will. Game over.

I'm surprised I haven't seen more economists exploring this topic; it's a fascinating phenomenon. I've seen folks try to revisit history and compare what's happening with AI to some historic event, but we've never seen anything quite like it. As much as history repeats itself, at the forefront of innovation it doesn't.

I suspect that there will one day be an AI tax as society tries to reclaim the value of the theft; maybe even UBI of some form. Until then, buy the stocks and ride the theft wave. The economists are certainly exploring the K-shaped economy, and this is why.


This argument of "if we don't do it, someone else will" to justify theft is so tiring. The companies doing the stealing are collectively the same ones that have power to prevent it, if they were incentivised to do so.

> The issue is that, if we don't do it, China will.

These AI companies aren’t state enterprises. How is geopolitics a justification?

If it were just the military training them, probably no one would care about the copyright infringement angle, it makes sense that the government could ignore those rules for national security.

But Mark Zuckerberg isn’t training his models to protect us from China. He’s doing it to make himself even more ridiculously wealthy.


I think that the problem arises because people equate things that are not the same.

One person learning something is good. At scale, that becomes everyone learning something. That's even better.

Machine learning is not scaling up people learning. It's completely different even if it's called "learning".

As the article argues, it's plagiarism at scale. In that sense, one person plagiarizing content is bad. Everyone plagiarizing at scale by using LLMs is even worse.


My complaint with your argument is that the word learn means one thing when we are talking about a person learning something from a webpage or book and something completely different when a webpage or book is used to adjust some weights in a matrix. Calling that learning is a distraction from the real copyright violations going on.

>when we are talking about a person learning something from a webpage or book and something completely different when a webpage or book is used to adjust some weights in a matrix

What material differences exist between the two besides "humans good, computers bad"?

>Calling that learning is a distraction from the real copyright violations going on.

Most courts so far have ruled that it counts as fair use.


I thought fair use was dead after Napster

No, only the type of "fair use" that people slap on their YouTube uploads, thinking it gives them a "get out of jail free" card for copyright infringement. Fair use was repeatedly affirmed in the 2010s, e.g. https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

There is already a term for this, and ironically enough, it's often thrown around in discussions of machine learning; it's called emergence. As scales change, new properties appear, which is why we can describe a Chimpanzee as "swinging between the trees" even though at the level of quantum field theory there is no such thing as trees, Chimpanzees, or swinging.

Likewise, people shouldn't be surprised that as AI compute scales up, new forms of harm can be created, thereby introducing new moral quandaries. It's like comparing GPT-1 against today's frontier models. One is a fun albeit useless toy. The other is effecting categorical changes in the way knowledge work is done. In both cases the underlying technology is the same, but their impacts are totally different.


But also a person is a person, not a commercial product, and usually we learn from sources within their licensing agreements.

data brokers lean into this too... you can go to the city hall and get someone's public information pretty easily, that does not mean you should make all of that information available to everyone else all the time from anywhere

But it’s not the same though. If I look at a webpage, it’s still there for other people to enjoy. That’s not the same as a flower being picked.

Reasoning by analogy doesn’t work if your analogy isn’t well matched.


Let me double down on my downvote.

The analogy proposed here is correctly rewritten as:

If one person uses an AI trained across copyrighted data, then that’s ok.

But if everyone uses that AI, it’s not ok.

Which is a bit of an irrelevant point.


In other words, size matters.

ugh. yeah. the tragedy of the commons

It's funny, the way that term gets used now is actually a wild distortion of the true history.

"The commons" was an incredibly successful system, and medieval (and prior) villages used it to great success, for the entire village's benefit! "Commons" are a great thing for everyone to have!

The real history is that as advances in technology (like the Industrial Revolution) changed things, certain rich villagers were suddenly able to manage more animals than they could before. Those (specific/rich) people over-used the commons, creating the "tragedy" we all know of.

The real lesson of history is not that commons fail: to the contrary, they worked great and helped everyone for centuries! The real lesson is "watch the fuck out for the new rich (especially when they just became rich because of recent technology advancements): those bastards will steal from everyone for their own benefit!"


if it's a fallacy that if it’s OK (or at least negligible on a small scale), then it must be OK on a large scale, then the alternative "it's ok at a small scale but not at a large scale" starts to slide us down a slippery slope... fallacy.

Of course quantity gives rise to its own quality. If you kill a single person, you are a murderer; if you genocide "others" and distribute the spoils to those left unscathed, you are a national hero. If you steal something small you are a thief and go to prison; if you hog some billions you can enact laws to grab even more.

>If you kill a single person, you are a murderer, if you genocide "others" and distribute the spoliation wealth to those unscathed you are a national hero.

This is a fundamental misunderstanding of how laws work. It's not the scale that makes it okay, it's that it's done through some official process. Trump's raid to grab Maduro killed less than 100 people. Pretty modest by "genocide" standards, and is easily eclipsed by gang/cartel violence. Yet nobody is going after Trump because he didn't meet some kill quota to get special protection, nor are people condoning cartel violence because they killed far more than Trump.


That's exactly how laws work then.

International law is for those who don't have all the nukes and lobotomized cannon fodder ready to invade on a whim. On the other side are those who commit all the crimes and atrocities, flatly transgress every legal process ever invented, and expect no possible punishment in return.

The number of people directly killed is not something that can be eclipsed by a bigger number of people killed. Not in a mind that places a high value on empathy.


Who is Sir Francis Drake?

In general tech has sat in the opposite paradigm: identify when doing something at a small scale is bad, but at a large scale is not

unauthorized plagiarism on the individual level is bad, at the medium scale is ick, but at the ultragigantic scale is meh.

laundering through an llm takes away the real moral ick from the plagiarism - the lying and building of ego by the person reboxing somebody else's ideas and work.


>> the lying and building of ego by the person reboxing somebody else's ideas and work.

Instead the bot lies to people who use its output to boost their ego. Not sure it's really changing the moral calculus here.


> why does a computer making money by learning everything from everyone upset people so? It’s the same thing!

The majority of the population, sitting outside the VC bubble, views AI unfavorably. That's not my hot take, that's a fact from the NYT survey published today.

It's going to be hilarious when VCs, having expropriated the IP of the entire internet, build The Layoff Machine That Does Everything Without Workers, and then the voters decide to just...enthusiastically expropriate that, and we end up with Fully Automated Luxury Communism.


>The majority of the population, sitting outside the VC bubble, views AI unfavorably.

Sure, where AI means "threatens my job or my skills", people view it unfavourably.

But then they use it. They're all using it. People's rhetoric seldom matches their actions.

>enthusiastically expropriate that, and we end up with Fully Automated Luxury Communism

Maybe in other countries, initially, but the US is very firmly a plutocracy, and has a populace that will very happily vote against their own interests because the plutocrat-owned media told them to. And yeah, it is very rapidly approaching the point where there is going to be zero chance of a revolution even if people opened their eyes.

Which is precisely why the US is now threatening other countries as well, because plutocracy is threatened by rational, educated, better managed countries. Canada, for instance, is an example that a country doesn't have to revert to being an idiocracy, so it's first in the crosshairs.


> They're all using it.

[Citation needed]

I know many more people who do not use AI than who use it, and many more who refuse to use AI than people who are enthusiastic about it.

Given your username, you are almost certainly in a bubble—an echo chamber—that makes it seem to you as though "everyone is using it." I recommend getting outside that bubble and talking to non-technical people outside your usual circles, especially people in the arts and humanities.


My blue collar buddy in water treatment uses ai to summarize reports and fix up emails. My retired neighbor who "doesn't do technology" was having an ai conversation on a product he was thinking of buying. I ordered through a voice kiosk ai at the drive-through last week. I am surprised how fast it is propagating.

But, see, this is part of the problem:

Most of the people I hear from who use AI say everyone they know uses AI.

Most of the people I hear from who don't use AI say no one they know uses AI.

It seems to me that we've got competing bubbles here. But the statistics certainly show that, leaving aside whether they use it, most people don't like it or want it.

...I think it's also worth noting that AI usage is likely to be "louder" than AI avoidance in many cases—that is, whichever side of this one falls on, it's easier to detect someone pasting from ChatGPT directly into emails, or complaining that Gemini told them you would sell them XYZ, than it is to detect someone who's just keeping on the way they've always been.


> But then they use it. They're all using it. People's rhetoric seldom matches their actions.

I don't see any contradiction. I criticize the hell out of guns and want them strictly controlled, and yet I own one. `¯\_(ツ)_/¯`

People can use AI and still demand that all of society receive the benefits, instead of a small group of oppressors.


The problem is, there's an intermediate step required there: the voters will need to get rid of the Republican Party lock, stock, and barrel if they ever want to make a genuinely-socialist move like that. And that's going to be made much, much more difficult by all the measures Trump and his cronies are putting in place to disenfranchise everyone who refuses to bow to him.

Can you map this more directly to claims made about AI? It's impossible to agree or disagree with you. You've just given us an analogy - but to what?

Not sure what you're missing here. The other comment replies don't seem to be missing it. See the article, etc.

What is the claim? A little plagiarism is OK therefore a lot is also OK?

No. It's more like,

"You say I can take a photo of one flower in your flowerbed you put next to the public street, but you get upset when I take a bunch of photos of many public flowerbeds. That's both an over-reach and inconsistent."



