this post was submitted on 14 Sep 2025
199 points (100.0% liked)

chapotraphouse

top 50 comments
[–] FlakesBongler@hexbear.net 91 points 3 days ago (2 children)

reddit

This explains why it's so confidently wrong so often

[–] SchillMenaker@hexbear.net 33 points 3 days ago (1 children)

Even at sub-5% Quora is still doing some work here

[–] FlakesBongler@hexbear.net 25 points 3 days ago (1 children)

Quora explains why it's so horny

Especially since half of Quora is just weird erotica

[–] AOCapitulator@hexbear.net 3 points 1 day ago (1 children)

Wait really? I thought it was a questions and answers site why are people posting fanfic smut there lol

[–] FlakesBongler@hexbear.net 2 points 1 day ago

You should listen to the Quorators podcast

They go over all this sort of shit

[–] axont@hexbear.net 78 points 3 days ago* (last edited 3 days ago) (2 children)

AI acolytes tell me their preferred AI has the advantage of access to all the world's data, the full knowledge of mankind and yet 9.3% of its knowledge comes from walmart.com

if 9.3% of a hypothetical human's knowledge came from walmart.com that person would be rightfully put in the pillory in the town square for the crime of demonic possession

[–] Mindfury@hexbear.net 33 points 3 days ago (1 children)

Walmart hosts the Codex Astartes in the backend, hard to access using the website manually but you can crawl it

[–] axont@hexbear.net 27 points 3 days ago

the omnissiah manifesting physically in our universe through the machinations of retail backends. hail the motive force

[–] LocalMaxima@hexbear.net 25 points 3 days ago

The chart title is a bit misleading. This isn’t the source of training data, but the sites that are linked to in responses. Google AI overview was included in the results, which kind of explains why this list is just the sites you would expect to be at the top of a Google search

[–] The_hypnic_jerk@hexbear.net 49 points 3 days ago (1 children)

They automated putting "reddit" at the end of a Google search and called it agi

The llm itself admitted this!

[–] FnordPrefect@hexbear.net 40 points 3 days ago (2 children)

State Department like, "Yeah, look at all of those distinct and independent sources of information side-eye-1 side-eye-2"

but at least with yahoo on there we can be confident that grok will have lots of quality details about pregnartcy

[–] InevitableSwing@hexbear.net 23 points 3 days ago (1 children)

I love that vid.

"Dangerops prangent sex? will it hurt baby top of its head?" still the best one

I don't know if it's best but def in the top three.

[–] NephewAlphaBravo@hexbear.net 9 points 3 days ago* (last edited 3 days ago)

"gregnant" and "pregnart" live in my brain rent free forever

[–] take_five_moments@hexbear.net 37 points 3 days ago (1 children)
[–] FlakesBongler@hexbear.net 17 points 3 days ago

Home of some of the worst wannabe police-cop LP guys ever

[–] BeanisBrain@hexbear.net 32 points 3 days ago (2 children)

Allow me to propose an alternative input set:

  • 60% marxists.org (for historical theory)
  • 30% redsails.org (for contemporary criticism)
  • 5% youtube.com (only transcripts of Hakim and Luna Oi videos)
  • 5% hexbear.net (for flavor)
[–] alexei_1917@hexbear.net 21 points 3 days ago* (last edited 3 days ago) (1 children)

I think a chatbot trained only on ML theory would certainly be fun to play with. Ask a political or economic question, get something that sounds just like Lenin and makes about as much sense as some particularly dense parts of Capital.

(And even though it's a robot, I do feel a weird perverse thrill at the idea of taking a completely politically unconscious and blank slate mind and providing it only the Marxist-Leninist perspective, and never exposing it to any other political viewpoint until a strong ideological foundation is built. That's kinda neat.)

[–] BountifulEggnog@hexbear.net 13 points 3 days ago (1 children)

You need a big dataset to train a model, unfortunately Marxist-Leninists are too short spoken.

[–] alexei_1917@hexbear.net 9 points 3 days ago (1 children)

Short spoken? Some of our theory seems pretty damn long.

[–] BountifulEggnog@hexbear.net 8 points 3 days ago* (last edited 3 days ago) (4 children)

That bit was a joke, although I would expect all theory to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

Actually, here's some math. SmolLM was trained on 600b tokens. Das Kapital is roughly 288k words, about 218k tokens. We'll round to 250,000 tokens. Divided into 600,000,000,000 and we would need 2.4 million Das Kapitals worth of text to train SmolLM. V2 uses 2t tokens, 8 million Das Kapitals. There's obviously a lot more theory than that, and you could probably throw forums like ours in, prolewiki, maybe some youtube subtitles. Synthetic data from theory. LLMs just need to eat a lot of text unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
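
For anyone who wants to check the division, here is a minimal sketch in Python. It only uses the figures quoted in this comment (the rounded 250,000 tokens per Kapital and the three training-set sizes), none of which are independently verified here. The names `TOKENS_PER_KAPITAL`, `TRAINING_TOKENS`, and `kapitals_needed` are made up for illustration.

```python
# Figures as quoted in the comment above; treat them as assumptions.
TOKENS_PER_KAPITAL = 250_000  # the comment's rounded token count for Das Kapital

TRAINING_TOKENS = {
    "SmolLM": 600_000_000_000,        # 600b tokens
    "SmolLM v2": 2_000_000_000_000,   # 2t tokens
    "Qwen3": 36_000_000_000_000,      # 36t tokens
}


def kapitals_needed(training_tokens, tokens_per_book=TOKENS_PER_KAPITAL):
    """Return how many copies of the book match the given token budget."""
    return training_tokens / tokens_per_book


if __name__ == "__main__":
    for model, tokens in TRAINING_TOKENS.items():
        millions = kapitals_needed(tokens) / 1_000_000
        print(f"{model}: {millions:.1f} million Kapitals")
```

Swapping in a different tokens-per-book estimate only rescales the results, so the orders of magnitude stay the same.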

[–] emdash@hexbear.net 30 points 3 days ago (1 children)

Why did they need to pirate every book on Anna's Archive if they were just going to cite social media and product advertisements?

[–] woodenghost@hexbear.net 29 points 3 days ago* (last edited 3 days ago)

Fucking Amazon? Why? For badly translated product descriptions and fake reviews? Those already were the closest thing to AI texts, before AI even existed.

Walmart? Really? Can it get any worse?

LinkedIn

Noooo! ooooooooooooooh

[–] The_Filthy_Commie@lemmygrad.ml 20 points 3 days ago

A scatological Ouroboros.

[–] varmint@hexbear.net 27 points 3 days ago (3 children)

Why does this add up to way more than 100%?

[–] roux@hexbear.net 29 points 3 days ago* (last edited 3 days ago)

They used AI to generate the chart.

presumably bc the same prompt can generate citations from multiple sites
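
A rough sketch of that explanation in Python, assuming each percentage means "share of responses that cite this site at least once". The responses below are invented for illustration and are not the chart's data; `citation_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical responses: each set holds the domains one answer cited.
responses = [
    {"reddit.com", "youtube.com"},
    {"reddit.com", "walmart.com"},
    {"reddit.com"},
    {"youtube.com", "quora.com"},
]


def citation_shares(responses):
    """Percentage of responses citing each domain at least once."""
    counts = Counter(domain for cited in responses for domain in cited)
    return {domain: 100 * n / len(responses) for domain, n in counts.items()}


shares = citation_shares(responses)
total = sum(shares.values())

if __name__ == "__main__":
    for domain, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{domain}: {share:.0f}%")
    print(f"sum of shares: {total:.0f}%")
```

Because one response can count toward several domains, the shares are not slices of a single pie and their sum can go past 100%.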

[–] dannoffs@hexbear.net 28 points 3 days ago (3 children)

How on earth do they get almost 5% from home depot?

[–] InevitableSwing@hexbear.net 16 points 3 days ago

The trombone garden hose was invented in 1782 on sale now!

[–] GrouchyGrouse@hexbear.net 25 points 3 days ago

Time to edit all 400,000 of my Reddit comments to be about the 1997 point-and-click videogame Star Wars: Yoda Stories

[–] BoxedFenders@hexbear.net 21 points 3 days ago (2 children)

And I'd like to add that one of the reasons why Reddit is so high on this list is that they have positioned themselves as a source for easily scrapable data for the big AI players. It is now the highest priority of the company to appease the tech giants starved for cheap user content to feed its AI monsters. Reddit's stock has also gone up 300% in the last year just for these partnerships alone.

[–] Des@hexbear.net 10 points 3 days ago

i really wish I had taken that IPO offer fucking reddit

[–] ShimmeringKoi@hexbear.net 18 points 3 days ago

I guess that explains the writing style

[–] MF_COOM@hexbear.net 21 points 3 days ago (1 children)

How does it scrape YouTube? Like the comments? Or the videos that have transcripts? Or the output from closed captioning?

[–] BountifulEggnog@hexbear.net 7 points 3 days ago

Yes, everything.

[–] Beaver@hexbear.net 13 points 3 days ago* (last edited 3 days ago) (1 children)

I just despair when there's so much digitized information that was written by actual academics and experts, but the LLMs and search engines clearly seem to give the most reddit-ass answers to questions.

[–] Alisu@hexbear.net 8 points 3 days ago (1 children)

I've managed to get linked to university websites and academic sources, but you gotta ask the right questions in the right way.

[–] SexUnderSocialism@hexbear.net 8 points 3 days ago* (last edited 3 days ago)
[–] SteamedHamberder@hexbear.net 17 points 3 days ago

SMH Hexbear.net not on the top 10

[–] Parzivus@hexbear.net 17 points 3 days ago

I guess AI has also learned to staple "reddit" to the end of every web search now that google is so ass

[–] Rom@hexbear.net 12 points 3 days ago

So mostly from places where any asshole can post whatever the hell they pull out of their ass with zero verification, in other words.

[–] Evil_Shrubbery@thelemmy.club 5 points 3 days ago

I'm sure there is plenty of non-official (ie illegal) content & their own users' data (for the training too, not just searching).

[–] Saymaz@lemmygrad.ml 5 points 3 days ago

The next generation is gonna be somehow more rightwing than the previous two.

Instagram is the most worrying one

[–] Seasonal_Peace@hexbear.net 2 points 2 days ago

Why so many maps, do people ask LLMs for spatial info?

[–] NuraShiny@hexbear.net 8 points 3 days ago (1 children)

Would it be worse, or better, if Hexbear made that list?
