this post was submitted on 23 Nov 2025
This is problematic because MMLU is very easy, and I think SWE-bench is too. Even 32B models can ‘saturate’ MMLU, so Qwen 30B would obliterate the other models in these charts if you added it.
They're also gamed/contaminated like crazy.
I'd lean more towards “private” benchmarks that aren’t contaminated or gamed the way LM Arena is. For instance:
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
https://eqbench.com/
Long-context benchmarks like RULER are good too, as they remain extremely difficult for many models.
Okay, here's the new bar chart with the more spread-out weighting, and honestly it looks a lot more reasonable.
Thank you, I'll factor that into the index. If you have any other recommendations for how to make the index more robust, let me know. The goal is to tie this to real-world API costs. I don't care if the newest, smartest model is released if it costs $100 per use.
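One way a weighted index with a cost penalty could be sketched (all benchmark names, scores, prices, and the `alpha` penalty below are made-up illustrations, not the actual index):

```python
# Hypothetical sketch: a benchmark index where weighted quality scores
# are discounted by API price, so expensive models need higher scores to rank.
# Every number here is invented for illustration.

def cost_adjusted_index(scores, weights, price_per_mtok, alpha=0.5):
    """Combine weighted benchmark scores with an API-cost penalty.

    scores: dict of benchmark name -> score in [0, 100]
    weights: dict of benchmark name -> weight (normalized internally)
    price_per_mtok: blended API price in $ per million tokens
    alpha: how strongly price discounts the quality score (assumed knob)
    """
    total_w = sum(weights.values())
    quality = sum(scores[b] * w for b, w in weights.items()) / total_w
    # Divide by a cost factor: a $100 model needs a far higher raw score
    # than a $2 model to land in the same place on the chart.
    return quality / (1 + alpha * price_per_mtok)

# Example with invented numbers:
idx = cost_adjusted_index(
    scores={"eqbench": 70.0, "ruler_128k": 60.0},
    weights={"eqbench": 1.0, "ruler_128k": 1.0},
    price_per_mtok=2.0,
)
print(idx)  # 32.5: quality 65.0 divided by cost factor 2.0
```

A divisor penalty like this is just one design choice; a subtractive penalty or a hard price cap would spread the rankings differently.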