this post was submitted on 08 Aug 2025

230 points (99.6% liked)

Leaked list shows Facebook training their AI on multiple Lemmy instances (lemmy.world)

submitted 3 months ago by cm0002@lemmy.world to c/privacy@programming.dev

Dropsitenews published a list of websites that Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

[–] CameronDev@programming.dev 6 points 3 months ago (2 children)

So, duplicating their data? That seems counter-productive.

[–] qaz@lemmy.world 10 points 3 months ago (1 children)

It seems counter-productive for them to scrape it when the API is right there.

[–] TachyonTele@piefed.social 3 points 3 months ago

Addicts don't care how they get it.

[–] AntiBullyRanger@ani.social -3 points 3 months ago* (last edited 3 months ago) (1 children)

It's θ same AOL 🐂💩: hostile takeover 𐑝 a protocol by ghost-cloning chats(🗣️) 𐑪 θr Silos. 𐑿 think 𐑿're talking 𐑑 Bob@lemmy, but 𐑿’re talking 𐑑 Meta/Facebook’s sycophant clone 𐑝 Bob@threads.
Embrace, Extend, Extinguish.

[–] sukhmel@programming.dev 3 points 3 months ago (2 children)

Why are you mixing Shavian with the International Phonetic Alphabet, and using θ in places where ðæt should be?

[–] Bo7a@lemmy.ca 10 points 3 months ago (1 children)

These guys don't get that the scrapers are just going to dump their piddly little text into /dev/null, and that all they are accomplishing is making other humans hate their posts while doing absolutely nothing to poison the LLMs.

You can't poison a data set of this size with a few hundred stupid comments.

All they are really going to accomplish is getting blocked by people who agree with their main point.

[–] AntiBullyRanger@ani.social 0 points 3 months ago (1 children)

Or, we can look for mitigations, instead of dismissing concerns. Common enemies of Freedom 🤝?

[–] FaceDeer@fedia.io 4 points 3 months ago (1 children)

Sure, you can look for mitigations. In the course of looking for mitigations, wouldn't it be nice if someone let you know that the idea you'd come up with as a mitigation was not going to work?

[–] AntiBullyRanger@ani.social 0 points 3 months ago (2 children)

Then let's look for another! What do you propose?

[–] FaceDeer@fedia.io 4 points 3 months ago (1 children)

I've given my suggestion in other comments in this thread. In short: if you don't want your comments to be seen by all, then don't post them on a public forum that uses an open protocol specifically designed to broadcast your comments to everyone who cares to listen. Perhaps use some closed-off forum instead, preferably run by a large and litigious company that guards its possessions jealously.

[–] AntiBullyRanger@ani.social 0 points 3 months ago (1 children)

Ok, so get illegally scraped and copyright violated, got it boss.

[–] FaceDeer@fedia.io 1 points 3 months ago (1 children)

Got any citations about this being illegal? If it is then the whole ActivityPub protocol is in trouble.

[–] AntiBullyRanger@ani.social -2 points 3 months ago (1 children)

[Insert any copyright law you're domiciled under.] Consult your lawyer about copyright violations of federated content. I am not yours to violate.

[–] FaceDeer@fedia.io 3 points 3 months ago (1 children)

Have there been any relevant lawsuits you could point me to? Vaguely waving in the air and declaring "copyright" is not helpful.

[–] AntiBullyRanger@ani.social -3 points 3 months ago (1 children)

I hope you're just acting foolish out of ignorance of the law.

You have yet to ask me for a license to read my posts.

Get your lawyer b4 you're sued.

[–] FaceDeer@fedia.io 4 points 3 months ago (1 children)

That article's proposal is incompatible with how the Fediverse works. It proposes licensing models for viewing, printing, and copying, but all of this hinges on the content being delivered in a protected format that enforces those restrictions. It describes using encrypted “software envelopes” that check with a central server for authorization before allowing access. If content is freely accessible without technical restrictions, then legally, it’s considered published and available to the public.

I am never going to ask you for a license to read your posts. Go ahead, sue me.

[–] AntiBullyRanger@ani.social -3 points 3 months ago (1 children)

> If content is freely accessible without technical restrictions, then legally, it’s considered published and available to the public.

That's not how copyrighted content works. Consult your lawyer.

> I am never going to ask you for a license to read your posts. Go ahead, sue me.

Thank you for your permission to send you to court.

[–] FaceDeer@fedia.io 4 points 3 months ago (1 children)

> That's not how copyrighted content works. Consult your lawyer.

Yeah, you really do need to brush up on the law here.

Copyright has nothing to do with reading works visible in public. If I put up a billboard or a poster that's visible in a public space, I can't demand a license fee from any passer-by who glances over and reads it. That's what you're doing when you post comments on the Fediverse: you're publishing them for the world to see.

> Thank you for your permission to send you to court.

Did you think you needed permission to sue someone?

[–] AntiBullyRanger@ani.social -2 points 3 months ago (1 children)

You're not on a physical billboard or a poster. You are on the internet still reading my copyrighted content on my instances without direct approval from me to fetch my copyrighted content. You already expressed you will not comply with my authorization to read my content on my instance without my direct licensing agreement.
Brush up on consulting a lawyer b4 you faux pas further.

[–] FaceDeer@fedia.io 4 points 3 months ago (1 children)

Actually, I'm reading it on my instance, at fedia.io.

You're on ani.social. When you post a comment, your ani.social instance sends a copy of that comment to the instance at programming.dev, which is where the community privacy@programming.dev is hosted. The programming.dev instance has a list of other instances that have users with subscriptions to privacy@programming.dev, and it automatically forwards a copy of that comment to all the subscribed instances. That includes fedia.io, since I am on that instance and I'm subscribed to privacy@programming.dev. So when I log in to fedia.io, it already has a copy of your comment stored locally for me to see. This all happens automatically when you post your comment; you initiated that chain of actions yourself.
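
For anyone who wants that chain of copies spelled out, here is a rough toy model of it in Python. It is only a sketch of the fan-out described above, not the real ActivityPub wire format or Lemmy's code: the `Instance` and `Network` classes, their fields, and the `post_comment` method are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Toy stand-in for a Fediverse server (hypothetical, not a real API)."""
    domain: str
    # Local copies of comments this instance holds, keyed by community name.
    local_store: dict = field(default_factory=dict)
    # For communities hosted here: domains of instances with subscribed users.
    subscribers: dict = field(default_factory=dict)

    def receive(self, community: str, comment: str) -> None:
        """Store a local copy of a comment."""
        self.local_store.setdefault(community, []).append(comment)


class Network:
    """Toy registry so instances can deliver to each other by domain."""

    def __init__(self) -> None:
        self.instances = {}

    def add(self, domain: str) -> Instance:
        self.instances[domain] = Instance(domain)
        return self.instances[domain]

    def post_comment(self, author_domain: str, host_domain: str,
                     community: str, comment: str) -> None:
        author = self.instances[author_domain]
        host = self.instances[host_domain]
        # Step 1: the author's own instance keeps the original.
        author.receive(community, comment)
        # Step 2: the author's instance sends a copy to the hosting instance.
        if host is not author:
            host.receive(community, comment)
        # Step 3: the host forwards a copy to every subscribed instance.
        for domain in sorted(host.subscribers.get(community, set())):
            if domain not in (author_domain, host_domain):
                self.instances[domain].receive(community, comment)


net = Network()
net.add("ani.social")
hosting = net.add("programming.dev")
net.add("fedia.io")
hosting.subscribers["privacy"] = {"ani.social", "fedia.io"}

net.post_comment("ani.social", "programming.dev", "privacy", "hello")
print(net.instances["fedia.io"].local_store)
```

The point of the model is that the single `post_comment` call made by the author is what puts a copy on every subscribed instance; nobody on fedia.io has to fetch anything from ani.social.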

Maybe before you brush up on the law you should brush up on how the Fediverse operates.

[–] AntiBullyRanger@ani.social -3 points 3 months ago (1 children)

Now get a lawyer to brush you up on when programming.dev/c/privacy asked for my direct permission to acquire my copyrighted content on ani.social, and why you're allowed to copy programming.dev’s illicit copy on fedia.io, without transitive permission.

[–] FaceDeer@fedia.io 5 points 3 months ago

Just sue me already, okay? It'll be less hassle than continuing to try to explain this to you.

[–] qaz@lemmy.world 2 points 3 months ago* (last edited 3 months ago)

They're just using very simple scrapers that don't have any knowledge about how the site operates. The simplest counter would probably be using Anubis on the web interface.

I wouldn't mind waiting 2-3 seconds when first loading the site and mobile apps would remain unaffected since they use the API.

[–] AntiBullyRanger@ani.social -1 points 3 months ago

👍

spoiler

2 Confuse ð scrapers.

I’ll go full Olde 𐑓 my nerdy content if 𐑿're 𐑳 𐑓 it. 𐑾?