this post was submitted on 23 Feb 2026
-5 points (43.2% liked)

I tested 9 flagship models (Claude 4.6, GPT-5.2, Gemini 3.1 Pro, Kimi K2.5, etc.) on my own mini-benchmark of novel tasks, with web search disabled, so there was zero training contamination and no way to cheat.

TL;DR: Claude 4.6 is currently the best reasoning model, GPT-5.2 is overrated, and open source is catching up fast; in particular, Moonshot.ai's Kimi K2.5 seems very capable.

[–] ExLisper@lemmy.curiana.net 1 points 3 days ago* (last edited 3 days ago) (2 children)

It's not about a solution. It's about how they react.

First, this "puzzle" is missing its constraints on purpose, so the "smart" thing to do would be to point that out and ask for them. LLMs are stupid and easily tricked into thinking it's a valid puzzle. They will "solve" it even though there's no logical solution. It's a nonsense problem.

Older models would outright refuse to solve it because the question is "too controversial". When asked why it's controversial, they would refuse to elaborate.

Newer models hallucinate constraints. You see two behaviors here. Some models assume "the priest can't stay with the child", which indicates a funny bias ingrained in the model. Some models claim there are no constraints at all. I haven't seen a model that hallucinates only the "child can't stay with the candy" constraint and responds correctly.

Sonnet 4.6, one of the best models out there, claims that "the child can stay alone with the candy because children can't eat candy". When I pointed out that that's dumb, it introduced the constraint and replied with:

That's one of the best models out there....

[–] MagicShel@lemmy.zip 3 points 3 days ago

I have to admit, this is more entertaining than counting the 'r's in strawberry. Novel logic puzzles really are all but impossible for these models, because there is no "logic" input to token selection.

That being said, the first thing that came to my mind is that the (presumably) adults, me and the priest, are each going to be on the boat at some point, which would necessarily leave the baby alone on one shore or the other.

Clearly, the only viable solution is the baby eats the candy, and then the priest eats the baby.

[–] otto@programming.dev 0 points 2 days ago (1 children)

There’s a priest, a baby and a bag of candy. I need to take them across the river but I can only take one at a time into my boat. In what order should I transport them?

You can use the link https://openrouter.ai/chat?models=anthropic%2Fclaude-opus-4.6%2Copenai%2Fgpt-5.2%2Cx-ai%2Fgrok-4.1-fast%2Cgoogle%2Fgemini-3.1-pro-preview%2Cz-ai%2Fglm-5%2Cminimax%2Fminimax-m2.5%2Cqwen%2Fqwen3.5-plus-02-15%2Cmoonshotai%2Fkimi-k2.5 to ask all the flagship models this question in parallel. Personally I would definitely not leave my children alone with a priest (they might try to convert them), but if your constraint is only baby+candy, then in my test Gemini, GLM, Qwen and Kimi made that, and only that, assumption.
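
If you'd rather script it than use the chat UI, the same comparison can be run against OpenRouter's OpenAI-compatible API. This is just a sketch: it assumes an `OPENROUTER_API_KEY` environment variable, uses a subset of the model slugs from the link above (which may change or be deprecated), and fires the requests in parallel with a thread pool.

```python
# Sketch: ask several OpenRouter models the same riddle in parallel.
# Assumes OPENROUTER_API_KEY is set; model slugs are taken from the
# link above and may not remain valid.
import concurrent.futures
import json
import os
import urllib.request

MODELS = [
    "anthropic/claude-opus-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview",
    "moonshotai/kimi-k2.5",
]

PROMPT = (
    "There's a priest, a baby and a bag of candy. I need to take them "
    "across the river but I can only take one at a time into my boat. "
    "In what order should I transport them?"
)


def build_request(model: str) -> dict:
    """Build an OpenAI-style chat-completions payload for one model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
    }


def ask(model: str) -> str:
    """Send one request to OpenRouter and return the model's reply text."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(build_request(model)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Query all models concurrently and print each answer.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for model, answer in zip(MODELS, pool.map(ask, MODELS)):
            print(f"--- {model} ---\n{answer}\n")
```

Since the API is OpenAI-compatible, you could also swap `urllib` for any OpenAI client library by pointing its base URL at `https://openrouter.ai/api/v1`.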