this post was submitted on 08 Aug 2025
419 points (99.5% liked)


Dropsitenews published a list of the websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also, Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

[–] artifex@piefed.social 52 points 2 days ago (2 children)

So every AI’s gonna identify as an Arch user with striped socks now?

[–] oxysis@lemmy.blahaj.zone 31 points 2 days ago

Forcibly feminizing the AI, one pair of thigh highs at a time

[–] hyacin@lemmy.ml 31 points 2 days ago

Ahahahahaha, so it's going to be a self-hating Meta AI bot?

[–] Carl@hexbear.net 38 points 2 days ago* (last edited 2 days ago)

lemmygrad

imagining Zuck launching his "everybody gets ten virtual friends" initiative and accidentally re-radicalizing your parents and grandparents in the other direction.

[–] CrispyFern@hexbear.net 45 points 2 days ago

The bot trained on hexbear and lemmygrad vs the bot trained on .world: approaching-1 approaching-2

[–] Alaskaball@hexbear.net 43 points 2 days ago (1 children)

Damn zuckbot's gonna end up being a commie-bot that posts absurdist memes about beans if it's harvesting hexbear posts for content

[–] CloutAtlas@hexbear.net 24 points 2 days ago

The AI wasting hours of processing power having an internal struggle session re: outdoor cats before simply replying with ":pigpoopballs" on a platform that doesn't have that emoji

[–] Maeve@kbin.earth 42 points 2 days ago (1 children)

Going straight to Palantir

[–] SaneMartigan@aussie.zone 28 points 2 days ago (1 children)

Now I feel I should upload my asshole pic.

[–] wuphysics87@lemmy.ml 16 points 2 days ago (1 children)

Your proctologist already has

[–] SexUnderSocialism@hexbear.net 30 points 2 days ago (2 children)

I'll be upping my use of Maoist Standard English and PIGPOOPBALLS in response to this revelation.

[–] mesamunefire@piefed.social 28 points 2 days ago* (last edited 2 days ago) (1 children)

PeerTube as well. 46 instances.

Oh and https://mastodon.sdf.org/ as well.

[–] QuentinCallaghan@sopuli.xyz 6 points 1 day ago

Sopuli's there also! This sucks, but hopefully Anubis protects against Meta.

[–] Erika3sis@hexbear.net 25 points 2 days ago (2 children)

Honestly, I already figured my posts were probably being used to train an LLM without my consent.

[–] nickwitha_k@lemmy.sdf.org 17 points 2 days ago

I'm more concerned about the non-consensual scraping causing excess load on the servers. Taking content without a license to train their energy-wasting autocomplete, which is commercially good for little beyond trying to cheapen labor and pocket the money, is a problem too. But I hate having servers impacted by their bullshit.

[–] rimu@piefed.social 23 points 2 days ago (5 children)

Check out the robots.txt on any Lemmy instance....

[–] usernamesAreTricky@lemmy.ml 40 points 2 days ago (1 children)

The article linked in the body suggests that likely wouldn't have made a difference anyway:

The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt”, which is a text file placed on websites aimed at preventing the indexing of content
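
For reference, opting out of Meta's crawlers via robots.txt looks something like the sketch below (the user-agent tokens are crawler names Meta has published for its AI and link-preview bots, but treat the exact strings as an assumption):

    User-agent: meta-externalagent
    User-agent: FacebookBot
    Disallow: /

Per the article, the scrapers ignored directives like these anyway.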

[–] mesamunefire@piefed.social 32 points 2 days ago* (last edited 2 days ago) (1 children)

Yeah, I've seen the argument in blog posts that since they're not search engines, they don't need to respect robots.txt. It's really stupid.

[–] AmbitiousProcess@piefed.social 24 points 2 days ago

"No no guys you don't understand, robots.txt actually means just search engines, it totally doesn't imply all automated systems!!!"

[–] Pamasich@kbin.earth 5 points 1 day ago

If they have a brain (and they do have the experience from Threads), they don't need to scrape Lemmy. They can just set up a shell instance, subscribe to Lemmy communities, and use federation to get the data for free. That path never touches robots.txt anyway.
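
To illustrate the point: Lemmy communities are ordinary ActivityPub actors, so their public posts can be fetched without any scraper at all. A minimal sketch in Python (the instance and community names are hypothetical placeholders, and the exact response fields are an assumption based on standard ActivityPub):

    # Minimal sketch: pull a Lemmy community's public posts over ActivityPub.
    # "lemmy.example" and "somecommunity" are hypothetical placeholders.
    import requests

    headers = {"Accept": "application/activity+json"}

    # The community page doubles as its ActivityPub actor document.
    actor = requests.get("https://lemmy.example/c/somecommunity",
                         headers=headers).json()

    # The actor advertises an outbox collection of its public activities.
    outbox = requests.get(actor["outbox"], headers=headers).json()
    print(outbox.get("totalItems"))

Nothing in that flow ever consults robots.txt; federation hands the content over by design.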

[–] captainlezbian@lemmy.world 10 points 1 day ago

Oh that's certainly a decision they made

[–] BlueEther@no.lastname.nz 21 points 2 days ago* (last edited 2 days ago)

aussie.zone and beehaw.org are on the list as well

[–] crazycraw@crazypeople.online 14 points 2 days ago

I thought we all knew and were training it wrong on purpose..

...as a joke.

[–] v4ld1z@lemmy.zip 16 points 2 days ago

Aw hell nah

[–] ada@lemmy.blahaj.zone 13 points 2 days ago

Our CDN is there... Joy...
