this post was submitted on 23 Nov 2025
This is problematic because MMLU is very easy, and I think SWE-bench is too. Even 32B models can ‘saturate’ MMLU, so Qwen 30B would obliterate the other models in these charts if you added it.
They're also gamed/contaminated like crazy.
I'd lean more towards “private” benchmarks that aren’t contaminated or gamed the way LM Arena is. For instance:
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
https://eqbench.com/
Long-context benchmarks like RULER are good too, as they remain extremely difficult for many models.
Okay, here's the new bar chart with the more spread-out weighting, and honestly it looks a lot more reasonable.
Thank you, I'll factor that into the index. If you have any other recommendations for how to make the index more robust, let me know. The goal is to tie this to real-world API costs. I don't care if the newest, smartest model is released if it costs $100 per use.
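One way a weighted index with a cost penalty could be sketched (all benchmark names, scores, prices, and the `alpha` penalty below are made-up illustrations, not the actual index):

```python
# Hypothetical sketch: a benchmark index where weighted quality scores
# are discounted by API price, so expensive models need higher scores to rank.
# Every number here is invented for illustration.

def cost_adjusted_index(scores, weights, price_per_mtok, alpha=0.5):
    """Combine weighted benchmark scores with an API-cost penalty.

    scores: dict of benchmark name -> score in [0, 100]
    weights: dict of benchmark name -> weight (normalized internally)
    price_per_mtok: blended API price in $ per million tokens
    alpha: how strongly price discounts the quality score (assumed knob)
    """
    total_w = sum(weights.values())
    quality = sum(scores[b] * w for b, w in weights.items()) / total_w
    # Divide by a cost factor: a $100 model needs a far higher raw score
    # than a $2 model to land in the same place on the chart.
    return quality / (1 + alpha * price_per_mtok)

# Example with invented numbers:
idx = cost_adjusted_index(
    scores={"eqbench": 70.0, "ruler_128k": 60.0},
    weights={"eqbench": 1.0, "ruler_128k": 1.0},
    price_per_mtok=2.0,
)
print(idx)  # 32.5: quality 65.0 divided by cost factor 2.0
```

A divisor penalty like this is just one design choice; a subtractive penalty or a hard price cap would spread the rankings differently.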