this post was submitted on 08 Aug 2025

230 points (99.6% liked)

Leaked list shows Facebook training their AI on multiple Lemmy instances (lemmy.world)

submitted 1 month ago by cm0002@lemmy.world to c/privacy@programming.dev

88 comments

Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

[–] CMDR_Horn@lemmy.world 40 points 1 month ago* (last edited 1 month ago)

Meta AI's gonna go dong out tankie

[–] Bebopalouie@lemmy.ca 26 points 1 month ago (1 children)

Future headline maybe.

Facebook becomes more left and they can’t figure out why.

[–] cupcakezealot@piefed.blahaj.zone 25 points 1 month ago (1 children)

remember when mastodon.social met and welcomed threads to the fediverse?

[–] noodlejetski@piefed.social 5 points 1 month ago

and so many people were praising them for that decision, because it was totally going to make everyone ditch Threads and move to Mastodon.

[–] Blackmist@feddit.uk 21 points 1 month ago (1 children)

Sure, this is open data viewable by everyone.

Stands to reason that AI is being trained on it.

[–] zipzoopaboop@lemmynsfw.com 3 points 1 month ago

I don't know how anyone could think otherwise

[–] TheBat@lemmy.world 21 points 1 month ago (1 children)

Can't wait for meta chatbot to tell zuckerberg to kill himself

[–] ShadowRam@fedia.io 3 points 1 month ago

I mean, think about it.

The majority of written communication is, at its core, focused on drama.

People communicate online about subjects --> Drama
News --> Drama
Books/Stories --> Drama

So is anyone really surprised that when you train an entity on only written text, it ends up being dramatic?

[–] resipsaloquitur@lemmy.world 21 points 1 month ago (1 children)

Fuck a Zuck.

[–] pticrix@lemmy.ca 4 points 1 month ago

Why yes, I can help you with your coding problem. Here is the solution:

public void Solution()
{
    // The fragments below concatenate into a single sentence.
    string bill = "d Bill";
    string goo = "A Goo";
    string ion = "ionaire";
    while(true)
    {
        string isabel = " Is A Dea";

        Console.WriteLine(goo + bill + ion + isabel + bill + ion);
    }
}

hope this helps!

[–] horse@feddit.org 20 points 1 month ago* (last edited 1 month ago) (4 children)

horseanimalsex.pro

lmao wtf is that list. Literally training their AI on bestiality.

Edit in case it's not obvious: That domain is very much NSFW and it's exactly what you'd expect (I checked and wish I hadn't).

[–] FaceDeer@fedia.io 17 points 1 month ago (2 children)

I think a lot of people in this thread are overlooking that when you train an LLM it's good to have negative examples too. As long as the data is properly tagged and contextualized when being used as training material, you want to be able to show the LLM what bad writing or offensive topics are so that it understands those things.

For example, you could be using an LLM as an automated moderator for a forum, having it look for objectionable content to filter. How would it know what objectionable content was if it had never seen anything like that in its training data?

Even those people attempting to "poison" AI by posting gibberish comments or replacing "th" with þ characters are probably just helping the AI understand how text can be obfuscated in various ways.

[–] AwesomeLowlander@sh.itjust.works 6 points 1 month ago

Especially since we've marked it by downvoting them to hell

[–] LiveLM@lemmy.zip 4 points 1 month ago* (last edited 1 month ago)

So there's a guy at Facebook whose job is exclusively looking at horse porn and tagging it? Amazing.

Also, I think the guy doing the "th" thing isn't doing it to poison AI; he just wants to revive the letter or whatever.

[–] tatterdemalion@programming.dev 9 points 1 month ago

horse@feddit.org

hmmm....

[–] ConstantPain@lemmy.world 4 points 1 month ago (1 children)

Shit I clicked expecting some furry porn. Oh boy...

[–] wesker@lemmy.sdf.org 18 points 1 month ago (2 children)

Ah great, my home instance is on the list.

[–] Maeve@kbin.earth 11 points 1 month ago

Straight to Palantir

[–] AntiBullyRanger@ani.social 5 points 1 month ago

Send some T-viruses down to Meta’s HQ🤣

[–] rbos@lemmy.ca 9 points 1 month ago

It bothers me, but not so much as exclusivity does. It does not give Facebook a competitive advantage over its competitors.

[–] TachyonTele@piefed.social 8 points 1 month ago (6 children)

What are ways to stop them?

[–] Carnelian@lemmy.world 15 points 1 month ago

I mean, everything we do on here is totally public, so, I would guess there is nothing to be done?

[–] usernameusername@sh.itjust.works 9 points 1 month ago (1 children)

Maybe Anubis, although idk if it works for Lemmy instances

[–] LodeMike@lemmy.today 6 points 1 month ago

Anubis is simply a reverse proxy, so it should work with pretty much anything.

[–] FaceDeer@fedia.io 7 points 1 month ago (3 children)

Switch to a non-open protocol or walled garden, preferably controlled by a large and litigious organization that guards its content jealously. They'll probably still sell access to their data to LLM trainers but not necessarily Facebook.

Reddit, for example, may fit the bill. IIRC they sell their data to OpenAI for training, so there might be exclusivity deals intended to keep Facebook out.

[–] s@piefed.world 7 points 1 month ago

Post and repeatedly endorse generally inoffensive content that for some reason violates Facebook’s ToS, such as the comic book cover of Captain America punching Hitler or the Led Zeppelin album “Houses of the Holy”

[–] throws_lemy@lemmy.nz 8 points 1 month ago (2 children)

I remember back then, some people defended not blocking Threads instances.

[–] FaceDeer@fedia.io 14 points 1 month ago

And one of those defenses was "it doesn't matter if you block Threads, the underlying ActivityPub protocol is open and anyone who wants the data can still receive it."

Turns out to be the case. It didn't matter if you blocked Threads.

[–] pedz@lemmy.ca 7 points 1 month ago

Is there an easy way to poison the input? Is there something we can slip in our comments that could make the data useless?

[–] CameronDev@programming.dev 6 points 1 month ago (22 children)

So, duplicating their data? That seems counter-productive.

[–] qaz@lemmy.world 10 points 1 month ago (1 children)

It seems counterproductive for them to scrape it when the API is right there.

[–] TachyonTele@piefed.social 3 points 1 month ago

Addicts don't care how they get it.

[–] Taleya@aussie.zone 5 points 1 month ago (1 children)

An AI trained on Hexbear would be hilarious

[–] cm0002@lemmy.world 7 points 1 month ago (2 children)

lmao what did you just say about Hexbear, lib? 💀 I’ll have you know I’m a tier-5 giga-brained poster with a PhD in Leninist praxis from the University of Posters, and I have 300+ confirmed dunks on Lemmy.ml sockpuppets. I was radicalized in the trenches of r/ChapoTrapHouse, forged in the fires of permabans, and tempered in the meme wars of 2019. You are literally nothing to me but another bootlicker running on 80% State Dept. talking points and 20% soy. I will ratio you so hard your precious little upvote count will never recover. You think you can just roll up in here, talk shit about Hexbear, and not get absolutely obliterated by dialectical praxis in 4K? Think again, bucko. As we speak, my cadre of Discord tankies are screen-capping your posts, cross-referencing them with your cringe comment history, and drafting a 12-point rebuttal with citations from Stalin, Mao, and that one screenshot of Bernie saying ‘chill with the anti-communism.’ The storm that’s coming for you is called material conditions, and guess what? They’re not in your favor. I’ve got Lenin’s collected works and a folder full of spicy memes, and I’m not afraid to deploy both. You’re already owned, kid. You just don’t know it yet. Now go touch grass, comrade, before I drop another 3k-word comment that makes you cry and log off.

[–] Tollana1234567@lemmy.today 4 points 1 month ago

Makes sense to target the most political instances.

[–] IceFoxX@lemmy.world 4 points 1 month ago

Seriously? Meta uses many methods to snoop on your phone, and its apps also look for devices on the network you are connected to and for devices that are simply nearby. It goes without saying that Meta makes use of open data... I would even go so far as to say that other AI models are not trained any differently. Well, they may be trained on the output of an AI that was itself trained on this data, so that they don't have to access the original sources themselves.
