[–] alexei_1917@hexbear.net 21 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

I think a chatbot trained only on ML theory would certainly be fun to play with. Ask a political or economic question, get something that sounds just like Lenin and makes about as much sense as some particularly dense parts of Capital.

(And even though it's a robot, I do feel a weird perverse thrill at the idea of taking a completely politically unconscious and blank slate mind and providing it only the Marxist-Leninist perspective, and never exposing it to any other political viewpoint until a strong ideological foundation is built. That's kinda neat.)

[–] BountifulEggnog@hexbear.net 13 points 2 weeks ago (1 children)

You need a big dataset to train a model; unfortunately, Marxist-Leninists are too short-spoken.

[–] alexei_1917@hexbear.net 9 points 2 weeks ago (1 children)

Short-spoken? Some of our theory seems pretty damn long.

[–] BountifulEggnog@hexbear.net 8 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

That bit was a joke, although I would expect all the theory there is to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

Actually, here's some math. SmolLM was trained on 600B tokens. Das Kapital is roughly 288k words, about 218k tokens; call it 250,000. Divide that into 600,000,000,000 and we would need 2.4 million Das Kapitals' worth of text to train SmolLM. V2 uses 2T tokens: 8 million Das Kapitals. There's obviously a lot more theory than that, and you could probably throw in forums like ours, ProleWiki, maybe some YouTube subtitles, synthetic data generated from theory. LLMs just need to eat a lot of text, unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
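
Just to make that arithmetic easy to check, a quick sketch of the ratios (the 250,000 tokens-per-Kapital figure is the rounded estimate above):

```python
# back-of-the-envelope: how many Das Kapitals of text each training run would need
kapital_tokens = 250_000  # rounded estimate from above

for name, train_tokens in [("SmolLM", 600e9), ("SmolLM v2", 2e12), ("Qwen3", 36e12)]:
    print(f"{name}: {train_tokens / kapital_tokens:,.0f} Das Kapitals")

# SmolLM: 2,400,000 | SmolLM v2: 8,000,000 | Qwen3: 144,000,000
```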

[–] hotcouchguy@hexbear.net 6 points 2 weeks ago (1 children)

I believe there are methods to train on a large, general dataset and then re-train on a small, focused dataset, but I'm not sure of the specifics.
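
A minimal sketch of what that two-stage idea could look like, assuming the Hugging Face transformers/datasets libraries, the SmolLM checkpoint mentioned above, and a hypothetical folder of plain-text theory files (not anyone's actual setup, just an illustration):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "HuggingFaceTB/SmolLM-135M"  # already pretrained on a large, general corpus
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# hypothetical corpus path; any plain-text dump of theory would work here
raw = load_dataset("text", data_files={"train": "theory/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# continue training with the usual causal language-modeling objective,
# now only on the small, focused dataset
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir="smollm-theory",
                         num_train_epochs=1,
                         per_device_train_batch_size=4,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```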

[–] BountifulEggnog@hexbear.net 6 points 2 weeks ago (1 children)

Yes, lots of ways, and definitely the approach for something like this. You would still have to be picky about data though; pretraining still affects its biases a lot. Especially if the hope is a blank slate that's only seen ML thinking.

[–] alexei_1917@hexbear.net 3 points 2 weeks ago

Yeah, absolutely. Creating a thing capable of at least appearing to think, that is literally unable to understand Western liberal nonsense because it's been fed only ML-aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

[–] alexei_1917@hexbear.net 2 points 2 weeks ago

Yeah, when you put it that way, one can see the issue. I was kind of joking myself: we have a lot of theory, and while it might be a drop in the bucket for a machine that needs to eat boatloads of text, for humans reading it, even just what a lot of orgs agree on as the core texts is a lot of reading to do. And the theory itself is often... not short-spoken or concise in any sense. Some of it can really feel like it's long and complicated on purpose.