I hate this take. "Open source" is not "public domain" or "free rein to do whatever the hell you want with no acknowledgement of the original creator." Even the most permissive MIT license has terms that every single AI company shamelessly violates. All code derived from open source code needs to at the very least reference the original author, so unless the AI can reliably and accurately cite where the code it generates came from, all AI generated code that gets incorporated into any publicly distributed software violates the license of every single open source project it has ever scraped.

That's saying nothing about projects with copyleft licenses that place conditions on how the code can then be distributed. Can AI reliably avoid using information from those codebases when generating proprietary code? No? And that's not a problem because?

I absolutely hate the hypocrisy that permeates the discourse around AI and copyright. Knocking off Studio Ghibli's art style is apparently the worst atrocity you can commit but god forbid open source developers, most of whom are working for free, have similar complaints about how their work is used.

Just because you "can't" obey the license terms due to some technical limitation doesn't mean you deserve a free pass from them. It means the technology is either too immature to be used or shouldn't be used at all. Also, why aren't they using LLMs when scraping to read the licenses and exclude anything other than pure public domain? Or better yet, use literally last century's technology to read the robots.txt and actually respect it. It's not even a technical limitation, it's a case of "doing the right thing is too restrictive and won't let us accomplish what we want, so we demand that the right thing be expanded to cover what we're trying to do."
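To show how low the bar is, here's a minimal sketch of a robots.txt check using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the example.org URLs, and the `may_fetch` helper are all made up for illustration, and a real crawler would download the file from the site rather than use a hardcoded string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, hardcoded so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is involved here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeBrowser"):
        for url in ("https://example.org/post/1", "https://example.org/private/x"):
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

With these rules, the hypothetical `ExampleAIBot` is refused everywhere, while any other user agent falls back to the `*` entry and is only kept out of `/private/`. A scraper that wanted to respect the file would just skip every URL where the check returns False.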

Open source only has anywhere between one and two core demands: Credit me for my work and potentially distribute derivatives in a way I can still take advantage of. And even that's not good enough for these AI chuds, they think we're the unreasonable ones for having these demands and not letting them use our code with no strings attached.

This is where many creators find themselves today, particularly in response to AI training. But the solutions they're reaching for — more restrictive licenses, paywalls, or not publishing at all — risk destroying the very commons they originally set out to build.

Yeah blame the people getting exploited and not the people doing the exploiting why don't you.

Particularly with AI, there’s also no indication that tightening the license even works. We already know that major AI companies have been training their models on all rights reserved works in their ongoing efforts to ingest as much data as possible. Such training may prove to have been permissible in US courts under fair use, and it’s probably best that it does.

No. Fuck that. There's nothing fair about scraping an independent creator's website (costing them real money) and then making massive profits from it. The creator literally fucking paid to have their work stolen.

If a kid learns that carbon dioxide traps heat in Earth's atmosphere or how to calculate compound interest thanks to an editor’s work on a Wikipedia article, does it really matter if they learned it via ChatGPT or by asking Siri or from opening a browser and visiting Wikipedia.org?

Yes. And the fact that it's stolen isn't even the biggest problem by a long shot. In fact, even Wikipedia is a pretty shitty source, do what your high school teacher said you should do and search Wikipedia for citations, not the articles themselves.

Don't let AI teach you anything you can't instantly verify with an authoritative source. It doesn't know anything and therefore can't teach anything by definition.

Instead of worrying about “wait, not like that”, I think we need to reframe the conversation to [...] “wait, not in ways that threaten open access itself”.

Okay, let's do that then. All AI training threatens open access itself. If not by ensuring the creator can never make money to sustain their work, then by LITERALLY COSTING THE CREATORS MONEY WHEN THEIR CONTENT IS SCRAPED! So the conclusion hasn't changed.

The true threat from AI models training on open access material is not that more people may access knowledge thanks to new modalities. It’s that those models may stifle Wikipedia and other free knowledge repositories, benefiting from the labor, money, and care that goes into supporting them while also bleeding them dry. It’s that trillion dollar companies become the sole arbiters of access to knowledge after subsuming the painstaking work of those who made knowledge free to all, killing those projects in the process.

And how does shaming the victims of that knowledge theft for having the audacity to try and do something about it help exactly?

Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons.

[...]

And yet many AI companies seem to give very little thought to this,

"Anyone at a Southern slave plantation who stops to think for half a second should be able to recognize they have a vampiric relationship with their black slaves." Yeah, they know. That's the point.

[–] kibiz0r@midwest.social 5 points 3 months ago (1 children)

I don’t think there’s any disagreement (among you, me, and Molly White) about who the bad guys are.

The question is: What is an effective legal framework that focuses on the precise harms, doesn’t allow AI vendors to easily evade accountability, and doesn’t inflict widespread collateral damage?

Cory Doctorow has a pretty good stab at that: https://pluralistic.net/2023/09/17/how-to-think-about-scraping/

[–] HiddenLayer555@lemmy.ml 2 points 3 months ago* (last edited 3 months ago)

The question is: What is an effective legal framework that focuses on the precise harms, doesn’t allow AI vendors to easily evade accountability, and doesn’t inflict widespread collateral damage?

This is entirely my opinion and I'm likely wrong about many things, but at minimum:

1. The model has to be open source and freely downloadable, runnable, and copyleft, satisfying the distribution license requirements of copyleft source material (I'm willing to give a free pass to making it copyleft in general, as different copyleft licenses can have different and contradictory distribution license requirements, but IMO the leap from permissive to copyleft is the more important part). I suspect this alone will kill the AI bubble, because as soon as they can't exclusively profit off it they won't see AI as "the future" anymore.
2. All training data needs to be freely downloadable and independently hosted by the AI creator. Goes without saying that only material you can legally copy and host on your own server can be used as training data. This solves the IP theft issue, as IMO if your work is licensed such that it can be redistributed in its entirety, it should logically also be okay to use it as training data. And if you can't even legally host it on your own server, using it to train AI is off the table. And the independently hosted dataset (complete with metadata about where it came from) also serves as attribution, as you can then search the training data for creators.
3. Pay server owners for use of their resources. If you're scraping for AI you at the very least need to have a way for server owners to send you bills. And no content can be scraped from the original source more than once, see point 2.
4. Either have a mechanism of tracking acknowledgement and accurately generating references along with the code, or if that's too challenging, I'm personally also okay with a blanket policy where anything AI generated is public domain. The idea that you can use AI generated code derived from open source in your proprietary app, and can then sue anyone who has the audacity to copy your AI generated code, is ridiculous and unacceptable.