this post was submitted on 23 Feb 2026
-5 points (43.2% liked)

Technology


I tested 9 flagships (Claude 4.6, GPT-5.2, Gemini 3.1 Pro, Kimi K2.5, etc.) in my own mini-benchmark with novel tasks, web search disabled, zero training contamination, and no cheating possible.

TL;DR: Claude 4.6 is currently the best reasoning model, GPT-5.2 is overrated, and open-source is catching up fast; Moonshot.ai's Kimi K2.5 in particular seems very capable.

[–] T3CHT@sh.itjust.works 1 points 2 days ago

Thanks for sharing, interesting read and questions. Surely you'll be downvoted here for anything with AI... But c'est la vie.

I've been doing coding projects in VS Code, which uses GPT, Claude, and Gemini. Woe are the days when my credits are used up and only GPT 4.1 is available. Claude's ability to research and architect multi-step software solutions is very, very good, and it rarely makes messes or spins its tires compared to older models from just a few months ago. This is precisely what converted me to 'whoa - AI', which is adjacent to 'pro AI'.

Lately I've been experimenting with customizing Gemini via instructions that include a link to a Drive folder of md files with specific instructions for different agent tasks, such as performing a specific market analysis, or doing a news roundup with a specific list of topics while omitting previously reviewed items. The files allow for both complex instructions and lists, as well as some chance to construct memory via logging. Results are a mixed bag: lots of additional functionality created, but inconsistent output quality.
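For anyone curious what that looks like in practice, here is a minimal local sketch of the idea, not my actual Gemini setup. It assumes one md file per task in a folder and a JSON log of already-reviewed items; the names (`load_instructions`, `filter_new`, `build_prompt`, `seen.json`) are all made up for illustration, and the model call itself is left out.

```python
import json
from pathlib import Path


def load_instructions(folder: Path, task: str) -> str:
    """Return the markdown instructions for one task (assumes one .md file per task)."""
    return (folder / f"{task}.md").read_text(encoding="utf-8")


def load_seen(log_path: Path) -> set[str]:
    """Read the items already reviewed in earlier runs (empty if no log yet)."""
    if not log_path.exists():
        return set()
    return set(json.loads(log_path.read_text(encoding="utf-8")))


def filter_new(items: list[str], log_path: Path) -> list[str]:
    """Drop previously reviewed items, then record the new ones in the log."""
    seen = load_seen(log_path)
    fresh = [item for item in items if item not in seen]
    log_path.write_text(json.dumps(sorted(seen | set(fresh))), encoding="utf-8")
    return fresh


def build_prompt(folder: Path, task: str, items: list[str], log_path: Path) -> str:
    """Combine the task's instructions with only the not-yet-reviewed items."""
    fresh = filter_new(items, log_path)
    bullet_list = "\n".join(f"- {item}" for item in fresh)
    return f"{load_instructions(folder, task)}\n\nNew items:\n{bullet_list}"
```

The log is the "memory" part: each run only hands the model items it has not seen before, so a news roundup does not keep repeating itself.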

Have you considered any more complex tests? Something like 'write a program that...' I think what will differentiate these models going forward is that some have architect capabilities (strategy, insight, decision making), whereas others are agents - they do specific tasks well but have limits. With that model, the AI architect and its AI agents need to work as a team to complete a multi-step task.
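To make the architect/agent split concrete, here is a toy sketch under heavy assumptions: the "architect" is a hard-coded planner and the "agents" are stub functions standing in for model calls. `Step`, `architect`, `run_plan`, and the agent names are hypothetical, not any real framework's API.

```python
from dataclasses import dataclass
from typing import Callable

# An agent takes (task, context from the previous step) and returns a result.
Agent = Callable[[str, str], str]


@dataclass
class Step:
    agent: str  # which agent should handle this step
    task: str   # what that agent is asked to do


def architect(goal: str) -> list[Step]:
    """Hypothetical planner; a real one would ask a strong model to produce the plan."""
    return [
        Step("researcher", f"research {goal}"),
        Step("coder", f"implement {goal}"),
    ]


def run_plan(plan: list[Step], agents: dict[str, Agent]) -> list[str]:
    """Run the steps in order, passing each result forward as context."""
    results: list[str] = []
    context = ""
    for step in plan:
        context = agents[step.agent](step.task, context)
        results.append(context)
    return results


# Stub agents standing in for narrower, task-specific models.
agents: dict[str, Agent] = {
    "researcher": lambda task, context: f"notes({task})",
    "coder": lambda task, context: f"code({task}; using {context})",
}

results = run_plan(architect("cli tool"), agents)
```

A benchmark along these lines could score the plan and the individual steps separately, which would show whether a model is strong as the architect, as an agent, or both.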