Unpopular Opinion
Welcome to the Unpopular Opinion community!
How voting works:
Vote the opposite of the norm.
If you agree that the opinion is unpopular, give it an arrow up. If it's something that's widely accepted, give it an arrow down.
Guidelines:
Tag your post, if possible (not required)
- If your post is a "General" unpopular opinion, start the subject with [GENERAL].
- If it is a Lemmy-specific unpopular opinion, start it with [LEMMY].
Rules:
1. NO POLITICS
Politics is everywhere. Let's make this about [GENERAL]- and [LEMMY]-specific topics, and keep politics out of it.
2. Be civil.
Disagreements happen, but that doesn't give anyone the right to personally attack others. No racism/sexism/bigotry. Please also refrain from gatekeeping others' opinions.
3. No bots, spam or self-promotion.
Only approved bots, which follow the guidelines for bots set by the instance, are allowed.
4. Shitposts and memes are allowed but...
Only until they prove to be a problem. They can and will be removed at moderator discretion.
5. No trolling.
This shouldn't need an explanation. If your post or comment is made just to get a rise out of people, with no real value, it will be removed. Do this too often and you will get a vacation to touch grass, away from this community, for one or more days. Repeat offenses will result in a perma-ban.
6. Defend your opinion
This is a bit of a mix of rules 4 and 5 to help foster higher quality posts. You are expected to defend your unpopular opinion in the post body. We don't expect a whole manifesto (please, no manifestos), but you should at least provide some details as to why you hold the position you do.
Instance-wide rules always apply. https://legal.lemmy.world/tos/
I think it's a bit more complicated than that; I'm not sure I'd call it no big deal... You're certainly right, it's impressive what they can do with synthetic data these days. But as far as I'm aware, that's mostly used to train substantially smaller models on the output of bigger models. I think it's called distillation? I haven't read any paper revising the older findings on synthetic data.

And to be honest, I think we want the big models to improve, and not just by a few percent each year, like what OpenAI is able to do these days... We'd need to make them something like 10x more intelligent and less likely to confabulate answers, so they start becoming reliable and usable for tasks like proper coding. And with the exponential need for more training data, we'd probably need many times the internet plus all human-written books to go in, just to make a model two or five times better than it is today. So it would need to work with mostly synthetic data, and then I'm not sure that even works.

Can we even make more intelligent newer models learn from the output of their stupider predecessors? With humans, we mostly learn from people who are more intelligent than us; it's rarely the other way round. And I don't see how language is like chess, where an AI can just play a billion games and learn from that; that's not really how LLMs work.
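(For clarity, this is roughly what I mean by distillation: a minimal sketch of the generic soft-label recipe, with a made-up temperature and toy tensors, not any particular lab's pipeline.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then push the student toward the teacher
    # with KL divergence; scaling by T^2 keeps gradient magnitudes comparable.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

# Toy example: a batch of 4 positions over a 10-token vocabulary
teacher_logits = torch.randn(4, 10)                      # frozen big model
student_logits = torch.randn(4, 10, requires_grad=True)  # small model being trained
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```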
You deserve a proper answer; I'll post some papers later.
Short answer: distillation is a separate thing, really.
Data quality is at least as important as quantity; we don't need 10 internets or whatever. This is mostly what DeepSeek showed: they did intelligent internet scraping for the initial bootstrap pretraining model. Big corporate models use synthetic data because it offers more control and raises fewer cost concerns.
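(To make "intelligent scraping" concrete: before anything clever, pretraining pipelines usually gate scraped pages with cheap heuristics. A toy sketch of that idea; the thresholds and checks are invented for illustration and are not DeepSeek's actual filters.)

```python
import re

def passes_quality_filters(text: str) -> bool:
    # Cheap heuristic gate for a scraped document; thresholds are illustrative.
    words = text.split()
    if len(words) < 50:                                  # too short to teach anything
        return False
    if len(set(words)) / len(words) < 0.3:               # repetitive boilerplate
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    if alpha < 0.8:                                      # mostly markup, tables, junk
        return False
    if re.search(r"click here|buy now|lorem ipsum", text, re.IGNORECASE):
        return False                                     # spam or filler pages
    return True

docs = ["some scraped page text ..."]                    # stand-in for a crawl
kept = [d for d in docs if passes_quality_filters(d)]
```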
" Textbooks Are All You Need " - quality synthetic data improves the model
" Escaping Model Collapse via Synthetic Data Verification " - using a verifier to get that quality data
Lastly, Yannic Kilcher's breakdown of DeepSeek's math-reasoning paper is fantastic: https://www.youtube.com/watch?v=bAWV_yrqx4w
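And the verifier idea from that second paper is, roughly, a generate-then-filter loop. A heavily simplified sketch; the generator and verifier below are toy placeholders, not the paper's actual method or code:

```python
def filter_synthetic_data(prompts, generator, verifier, threshold=0.8):
    # Keep only generated samples the verifier scores above a threshold.
    # The verifier could be a separate model or a programmatic check
    # (e.g. running unit tests on generated code).
    kept = []
    for prompt in prompts:
        for candidate in generator(prompt, n_samples=4):
            score = verifier(prompt, candidate)  # 0.0 .. 1.0 quality estimate
            if score >= threshold:
                kept.append((prompt, candidate))
    return kept

# Toy stand-ins so the sketch runs end to end
def toy_generator(prompt, n_samples=4):
    return [f"{prompt} -> answer {i}" for i in range(n_samples)]

def toy_verifier(prompt, candidate):
    return 1.0 if candidate.endswith("0") else 0.2  # arbitrary scoring rule

good = filter_synthetic_data(["2+2?"], toy_generator, toy_verifier)
print(good)  # [('2+2?', '2+2? -> answer 0')]
```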