This is problematic because MMLU is very easy, and I think SWE-bench is too. Even 32B models can ‘saturate’ MMLU, so Qwen 30B would obliterate the other models in these charts if you added it.
They're also gamed/contaminated like crazy.
I'd lean more towards “private” benchmarks that aren’t contaminated or gamed the way LM Arena is. For instance:
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
Long-context benchmarks like RULER are good too, since long context is still extremely difficult for many models.
