Unpopular Opinion
Welcome to the Unpopular Opinion community!
How voting works:
Vote the opposite of the norm.
If you agree that the opinion is unpopular give it an arrow up. If it's something that's widely accepted, give it an arrow down.
Guidelines:
Tag your post, if possible (not required)
- If your post is a "General" unpopular opinion, start the subject with [GENERAL].
- If it is a Lemmy-specific unpopular opinion, start it with [LEMMY].
Rules:
1. NO POLITICS
Politics is everywhere. Let's make this about [general] and [lemmy] - specific topics, and keep politics out of it.
2. Be civil.
Disagreements happen, but that doesn’t provide the right to personally attack others. No racism/sexism/bigotry. Please also refrain from gatekeeping others' opinions.
3. No bots, spam or self-promotion.
Only approved bots, which follow the guidelines for bots set by the instance, are allowed.
4. Shitposts and memes are allowed but...
Only until they prove to be a problem. They can and will be removed at moderator discretion.
5. No trolling.
This shouldn't need an explanation. If your post or comment is made just to get a rise with no real value, it will be removed. You do this too often, you will get a vacation to touch grass, away from this community for 1 or more days. Repeat offenses will result in a perma-ban.
6. Defend your opinion
This is a bit of a mix of rules 4 and 5 to help foster higher quality posts. You are expected to defend your unpopular opinion in the post body. We don't expect a whole manifesto (please, no manifestos), but you should at least provide some details as to why you hold the position you do.
Instance-wide rules always apply. https://legal.lemmy.world/tos/
view the rest of the comments
You deserve a proper answer I'll post some papers later
Short answer distillation is s separate thing really
Data quality is at least as important as quantity. We don't need 10 internets or whatever. This is mostly what Deepseek showed. They did intelligent internet scraping for the pretrain initial bootstrap model ; big corporate models use synthetic data because more control and less cost concern.
" Textbooks Are All You Need " - quality synthetic data improves the model
" Escaping Model Collapse via Synthetic Data Verification " - using a verifier to get that quality data
Lastly Yannic Kilcher's breakdown of DeepSeek's math reasoning paper is fantastic https://www.youtube.com/watch?v=bAWV_yrqx4w