You heard it folks, invalidate all US patents and copyright because written treaties are irrelevant!
The question is: What is an effective legal framework that focuses on the precise harms, doesn’t allow AI vendors to easily evade accountability, and doesn’t inflict widespread collateral damage?
This is entirely my opinion and I'm likely wrong about many things, but at minimum:
- The model has to be open source and freely downloadable, runnable, and copyleft, satisfying the distribution license requirements of copyleft source material (I'm willing to give a free pass to making it copyleft in general, as different copyleft licenses can have different and contradictory distribution license requirements, but IMO the leap from permissive to copyleft is the more important part). I suspect this alone will kill the AI bubble, because as soon as they can't exclusively profit off it they won't see AI as "the future" anymore.
- All training data needs to be freely downloadable and independently hosted by the AI creator. Goes without saying that only material you can legally copy and host on your own server can be used as training data. This solves the IP theft issue, as IMO if your work is licensed such that it can be redistributed in its entirety, it should logically also be okay to use it as training data. And if you can't even legally host it on your own server, using it to train AI is off the table. And the independently hosted dataset (complete with metadata about where it came from) also serves as attribution, as you can then search the training data for creators.
- Pay server owners for use of their resources. If you're scraping for AI you at the very least need to have a way for server owners to send you bills. And no content can be scraped from the original source more than once, see point 2.
- Either have a mechanism of tracking acknowledgement and accurately generating references along with the code, or if that's too challenging, I'm personally also okay with a blanket policy where anything AI generated is public domain. The idea that you can use AI generated code derived from open source in your proprietary app, and can then sue anyone who has the audacity to copy your AI generated code, is ridiculous and unacceptable.
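To make the second point a bit more concrete, here is a rough sketch of what a searchable training-data manifest could look like. Everything in it is an assumption for illustration: the `TrainingItem` fields, the `REDISTRIBUTABLE` allowlist, the `All-Rights-Reserved` label, and the example.org URLs are all made up, and which licenses actually permit rehosting is a legal question this code does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingItem:
    source_url: str   # where the work was originally published
    creator: str      # who to credit
    license_id: str   # SPDX-style identifier (label scheme is an assumption)


# Hypothetical allowlist of licenses the dataset host treats as redistributable.
REDISTRIBUTABLE = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "GPL-3.0-only"}

# Hypothetical manifest entries.
MANIFEST = [
    TrainingItem("https://example.org/a", "Alice", "CC-BY-4.0"),
    TrainingItem("https://example.org/b", "Bob", "All-Rights-Reserved"),
    TrainingItem("https://example.org/c", "alice", "MIT"),
]


def usable_for_training(items):
    """Keep only items whose license allows the dataset host to rehost them."""
    return [item for item in items if item.license_id in REDISTRIBUTABLE]


def items_by_creator(items, creator):
    """Let a creator look up which of their works ended up in the dataset."""
    return [item for item in items if item.creator.lower() == creator.lower()]


usable = usable_for_training(MANIFEST)
print([item.source_url for item in usable])
print([item.source_url for item in items_by_creator(usable, "Alice")])
```

The point is only that filtering by license and looking up creators are both trivial once the dataset ships with its own metadata.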
“Wait, not like that”: Free and open access in the age of generative AI
I hate this take. "Open source" is not "public domain" or "free rein to do whatever the hell you want with no acknowledgement to the original creator." Even the most permissive MIT license has terms that every single AI company shamelessly violates. All code derived from open source code needs to at the very least reference the original author, so unless the AI can reliably and accurately cite where the code it generates came from, all AI generated code that gets incorporated into any publicly distributed software violates the license of every single open source project it has ever scraped.
That's saying nothing about projects with copyleft licenses that place conditions on how the code can then be distributed. Can AI reliably avoid using information from those codebases when generating proprietary code? No? And that's not a problem because?
I absolutely hate the hypocrisy that permeates the discourse around AI and copyright. Knocking off Studio Ghibli's art style is apparently the worst atrocity you can commit but god forbid open source developers, most of whom are working for free, have similar complaints about how their work is used.
Just because you "can't" obey the license terms due to some technical limitation doesn't mean you deserve a free pass from them. It means the technology is either too immature to be used or shouldn't be used at all. Also, why aren't they using LLMs when scraping to read the licenses and exclude anything other than pure public domain? Or better yet, use literally last century's technology to read the robots.txt and actually respect it. It's not even a technical limitation; it's a case of "doing the right thing is too restrictive and won't allow us to accomplish what we want to do, so we demand that the right thing be expanded to cover what we're trying to do."
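For what it's worth, checking robots.txt before fetching is a few lines of code. A minimal sketch using Python's standard-library `urllib.robotparser`; the bot name `ExampleAIBot`, the rules in `ROBOTS_TXT`, and the example.org URLs are made up for illustration, and a real crawler would download the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish:
# one AI crawler is banned entirely, everyone else is kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "ExampleAIBot", "https://example.org/blog/post"))
print(may_fetch(ROBOTS_TXT, "OtherCrawler", "https://example.org/blog/post"))
print(may_fetch(ROBOTS_TXT, "OtherCrawler", "https://example.org/private/x"))
```

A scraper that calls something like `may_fetch` before every request and skips the URL on `False` is already respecting the file.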
Open source only has anywhere between one and two core demands: Credit me for my work and potentially distribute derivatives in a way I can still take advantage of. And even that's not good enough for these AI chuds, they think we're the unreasonable ones for having these demands and not letting them use our code with no strings attached.
This is where many creators find themselves today, particularly in response to AI training. But the solutions they're reaching for — more restrictive licenses, paywalls, or not publishing at all — risk destroying the very commons they originally set out to build.
Yeah, blame the people getting exploited and not the people doing the exploiting, why don't you.
Particularly with AI, there’s also no indication that tightening the license even works. We already know that major AI companies have been training their models on all rights reserved works in their ongoing efforts to ingest as much data as possible. Such training may prove to have been permissible in US courts under fair use, and it’s probably best that it does.
No. Fuck that. There's nothing fair about scraping an independent creator's website (costing them real money) and then making massive profits from it. The creator literally fucking paid to have their work stolen.
If a kid learns that carbon dioxide traps heat in Earth's atmosphere or how to calculate compound interest thanks to an editor’s work on a Wikipedia article, does it really matter if they learned it via ChatGPT or by asking Siri or from opening a browser and visiting Wikipedia.org?
Yes. And the fact that it's stolen isn't even the biggest problem by a long shot. In fact, even Wikipedia is a pretty shitty source; do what your high school teacher said you should do and search Wikipedia for citations, not the articles themselves.
Don't let AI teach you anything you can't instantly verify with an authoritative source. It doesn't know anything and therefore can't teach anything by definition.
Instead of worrying about “wait, not like that”, I think we need to reframe the conversation to [...] “wait, not in ways that threaten open access itself”.
Okay, let's do that then. All AI training threatens open access itself. If not by ensuring the creator can never make money to sustain their work, then by LITERALLY COSTING THE CREATORS MONEY WHEN THEIR CONTENT IS SCRAPED! So the conclusion hasn't changed.
The true threat from AI models training on open access material is not that more people may access knowledge thanks to new modalities. It’s that those models may stifle Wikipedia and other free knowledge repositories, benefiting from the labor, money, and care that goes into supporting them while also bleeding them dry. It’s that trillion dollar companies become the sole arbiters of access to knowledge after subsuming the painstaking work of those who made knowledge free to all, killing those projects in the process.
And how does shaming the victims of that knowledge theft for having the audacity to try and do something about it help exactly?
Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons.
[...]
And yet many AI companies seem to give very little thought to this,
"Anyone at a Southern slave plantation who stops to think for half a second should be able to recognize they have a vampiric relationship with their black slaves." Yeah, they know. That's the point.
Speak for your own instance.
Thousands of sick people killed each year by health insurance companies trying every trick and scheme to pay out as little as possible: I sleep.
Someone kills the guy masterminding it: REAL SHIT?!
Small models have gotten remarkably good. 1 to 8 billion parameters, tuned for specific tasks — and they run on hardware that organizations already own
Hard disagree as someone who does host their own AI. Go on Ollama and run some models, you'll immediately realize that the smaller ones are basically useless. IMO 70B models are barely at the level of being usable for the simplest tasks, and with the current RAM landscape those are no longer accessible to most people unless you already bought the RAM before the Altman deal.
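To put rough numbers on the RAM problem, here is some back-of-the-envelope arithmetic. It counts model weights only and assumes a fixed number of bits per parameter; the 16, 8, and 4 bit widths are common precision choices picked for illustration, and real usage is higher once context cache and runtime overhead are added.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for size in (8, 70):
    for bits in (16, 8, 4):
        print(f"{size}B at {bits}-bit: about {weight_memory_gb(size, bits):.0f} GB")
```

By this estimate a 70B model needs about 140 GB at 16-bit and about 35 GB even at 4-bit, while an 8B model at 4-bit fits in about 4 GB.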
I suspect this is why he made that deal despite not having an immediate need for that much RAM. To artificially limit the public's ability to self host their own AI and therefore mitigate the threat open source models present to his business.
"Precrime"
"predictive policing"
AKA what they accuse aUtHoRiTaRiAn countries like China of.
Proletariat say relentless exploitation by CEOs is hurting society and has "done a lot of damage."
America: Destroys all people who try to not be cancerous
"People are the cancer, actually."
If we're "exploring" the galaxy like the guys in that ship "explored" the Earth, we should stay home and avoid defiling any more celestial bodies than we already have.
I think a big part is emphasis. English tends to put the emphasis on the last word of the sentence so not saying it sounds weird.
Kind of off topic, but this reminded me about something I really don't like about the current paradigm of "intelligence" and "knowledge" being parts of a single monolithic model.
Why aren't we training models on how to search any generic dataset for information, find patterns, draw conclusions, etc, rather than baking the knowledge itself into the model? 8 or so GB of pure abstract reasoning strategies would probably be way more intelligent and efficient than even a much larger model we have now. Imagine if you could just give it an arbitrarily sized database whose content you control, which you can then fill with the highest quality, ethically obtained, human expert moderated data complete with attributions to original creators, and have it base all its decisions on that. It would even be able to cite what it used with identifiers in the database, which can then be manually verified. You get a concrete foundation of where it's getting its information from, and you only need to load what it currently needs into memory, whereas right now you have to load all the AI's "knowledge," relevant or not, into your precious and limited RAM. You would also be able to update the individual data separately from the model itself, and have it produce updated results from the new data. That would actually be what I consider an artificial "intelligence" and not a fancy statistical prediction mechanism.
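A toy sketch of the "knowledge lives in a database, not in the model" idea, just to show the shape of it. The `Record` type, the entries in `KNOWLEDGE_BASE`, and the word-overlap scoring are all hypothetical stand-ins; there is no reasoning model here, only lookup plus citation, and a real system would need a far better way to match questions to records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str  # identifier a human can look up and verify
    author: str     # attribution to the original creator
    text: str


# Hypothetical curated knowledge base, stored and updated separately from any model.
KNOWLEDGE_BASE = [
    Record("doc-001", "A. Author", "Carbon dioxide traps heat in the atmosphere."),
    Record("doc-002", "B. Writer", "Compound interest is earned on principal plus accumulated interest."),
]


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, records: list[Record], top_k: int = 1) -> list[Record]:
    """Rank records by shared words with the query; ignore records with no overlap."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(r.text)), r) for r in records]
    matches = [(score, r) for score, r in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in matches[:top_k]]


def answer_with_citations(query: str, records: list[Record]) -> str:
    """Return the supporting text tagged with its record id and author."""
    hits = retrieve(query, records)
    if not hits:
        return "No supporting record found."
    return " ".join(f"{r.text} [{r.record_id}, {r.author}]" for r in hits)


print(answer_with_citations("How does compound interest work?", KNOWLEDGE_BASE))
print(answer_with_citations("purple elephants", KNOWLEDGE_BASE))
```

Adding or correcting a `Record` changes the answers immediately with no retraining, and every answer carries an identifier that can be checked by hand.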