[–] alexei_1917@hexbear.net 9 points 4 days ago (1 children)

Short spoken? Some of our theory seems pretty damn long.

[–] BountifulEggnog@hexbear.net 8 points 4 days ago* (last edited 4 days ago) (2 children)

That bit was a joke, although I would expect all theory combined to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

Actually, here's some math. SmolLM was trained on 600b tokens. Das Kapital is roughly 288k words, about 218k tokens; we'll round to 250,000 tokens. Divide that into 600,000,000,000 and we would need 2.4 million Das Kapitals worth of text to train SmolLM. V2 uses 2t tokens, so 8 million Das Kapitals. There's obviously a lot more theory than that, and you could probably throw in forums like ours, ProleWiki, maybe some youtube subtitles, synthetic data from theory. LLMs just need to eat a lot of text, unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
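
For anyone who wants to check the arithmetic, here's the same math as a quick Python script (250k tokens per Kapital is the rounded figure from above, and "SmolLM2" is just my label for what I called V2):

```python
# Back-of-the-envelope check of the token math above.
# Assumes ~250k tokens per copy of Das Kapital (the rounded figure used above).
KAPITAL_TOKENS = 250_000

# Pretraining budgets cited above, in tokens.
budgets = {
    "SmolLM": 600_000_000_000,     # 600B
    "SmolLM2": 2_000_000_000_000,  # 2T
    "Qwen3": 36_000_000_000_000,   # 36T
}

for model, tokens in budgets.items():
    copies = tokens / KAPITAL_TOKENS
    print(f"{model}: {copies / 1e6:.1f} million Das Kapitals")

# SmolLM: 2.4 million Das Kapitals
# SmolLM2: 8.0 million Das Kapitals
# Qwen3: 144.0 million Das Kapitals
```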

[–] hotcouchguy@hexbear.net 6 points 4 days ago (1 children)

I believe there are methods to train on a large, general dataset, and then re-train on a small, focused dataset, but I'm not sure of any specifics
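
That's usually called fine-tuning, or continued pretraining. A minimal sketch with Hugging Face Transformers, assuming a small pretrained base model like SmolLM and a plain-text corpus (theory.txt is a hypothetical placeholder, not a real dataset):

```python
# Minimal continued-pretraining sketch: take a small model already
# pretrained on a large general dataset and keep training it on a
# small, focused text corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "HuggingFaceTB/SmolLM-135M"  # any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load and tokenize the focused corpus (theory.txt: hypothetical,
# one passage per line).
dataset = load_dataset("text", data_files={"train": "theory.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Causal-LM collator: labels are the input tokens (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="smollm-theory",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```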

[–] BountifulEggnog@hexbear.net 6 points 4 days ago (1 children)

Yes, lots of ways, and that's definitely the approach for something like this. You would still have to be picky about the data, though; pretraining still affects its biases a lot, especially if the hope is a blank slate that has only seen ML thinking.

[–] alexei_1917@hexbear.net 3 points 4 days ago

Yeah, absolutely. Creating a thing capable of at least appearing to think, that is literally unable to understand Western liberal nonsense because it's been fed only ML-aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

[–] alexei_1917@hexbear.net 2 points 4 days ago

Yeah, when you put it that way, one can see the issue. I was kind of joking myself. We have a lot of theory, and while it might be a drop in the bucket for a machine that basically needs to eat boatloads of text, for humans, even just what a lot of orgs agree on as the core texts is a lot of reading to do. And the theory itself is often... not short spoken or concise in any sense. Some of it can really feel like it's long and complicated on purpose.