this post was submitted on 23 Feb 2026
584 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

(page 2) 50 comments
[–] FireWire400@lemmy.world 8 points 2 months ago* (last edited 2 months ago) (6 children)

Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it's better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

[–] ryannathans@aussie.zone 8 points 2 months ago (17 children)

Opus 4.6 has been excellent at problem solving in software development, so it's no surprise it nails this one.

It's also no surprise that public opinion holds these tools are trash when the free models can't answer simple questions.

[–] Fizz@lemmy.nz 5 points 2 months ago (5 children)

The free models feel years behind, so people constantly underestimate what these tools are capable of. I still hear people say AI can't generate fingers.

[–] melfie@lemy.lol 8 points 2 months ago* (last edited 2 months ago) (2 children)

My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

Claude Sonnet 4.6 got it right the first time.

My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what the purpose of the trip is, and whether that purpose can be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I'm having second thoughts; I imagine it'll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

[–] myfunnyaccountname@lemmy.zip 8 points 2 months ago (9 children)

There are a lot of humans who would fail this as well. Just sayin'.

[–] Hazzard@lemmy.zip 9 points 2 months ago (8 children)

They also polled 10,000 people to compare against a human baseline:

Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher "drive" rate.

That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.
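The comparison is easy to recompute from the numbers quoted in the thread (the 71.5% human rate and the per-model scores out of 10 are from the article; the function name is just for illustration):

```python
def beats_human_baseline(score_out_of_10, human_rate=0.715):
    # A model outperforms the polled humans only if its success rate
    # across the 10 trials exceeds the 71.5% human "drive" rate.
    return score_out_of_10 / 10 > human_rate

print(beats_human_baseline(7))   # False -- GPT-5's 7/10 = 70%
print(beats_human_baseline(8))   # True  -- the two 8/10 models
print(beats_human_baseline(10))  # True  -- the five 10/10 models
```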

[–] vala@lemmy.dbzer0.com 7 points 2 months ago (3 children)

Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?
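The expected answer works out to a simple clamp, which a one-liner makes explicit (a sketch of the riddle's arithmetic, not any model's output):

```python
def water_in_cup(capacity_oz, current_oz, added_oz):
    # The cup can't hold more than its capacity; the rest overflows.
    return min(capacity_oz, current_oz + added_oz)

print(water_in_cup(16, 10, 10))  # 16 -- the extra 4 oz spill over
```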

[–] tover153@lemmy.world 7 points 2 months ago (3 children)

After getting it wrong, the LLM I use most received this from me: "You can't wash your car if it isn't there." Its reply:

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[–] criticon@lemmy.ca 6 points 2 months ago (4 children)

Even when they give the correct answer, they talk too much. AI responses contain a lot of filler: instead of a brief answer, the model pads its reply with self-justification, so the responses end up long.

[–] chunes@lemmy.world 5 points 2 months ago* (last edited 2 months ago) (2 children)

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn't help.

[–] Professorozone@lemmy.world 6 points 2 months ago

Didn't like 30% of the population elect Trump? Coincidence? I don't think so.

[–] lemmydividebyzero@reddthat.com 6 points 2 months ago

They will scrape that article, too.

And in a few months, they'll have "learned" how that task works.

[–] melfie@lemy.lol 6 points 2 months ago* (last edited 2 months ago) (1 children)

Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆

Edit:

Oh, I see the linked article is part of a marketing campaign promoting this company's paid cloud service (with source-available SDKs) as a solution to the problem being outlined here:

Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?
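For reference, the "find the most relevant example" step can be done locally in a few lines. This is a toy sketch using bag-of-words cosine similarity (a real local setup would swap in a proper open embedding model; this is not Opper's actual implementation):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; missing words count as zero.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_relevant_example(task, dataset):
    # Pick the stored example closest to the new task; the caller
    # would then prepend it to the prompt as context.
    q = embed(task)
    return max(dataset, key=lambda ex: cosine(q, embed(ex)))

examples = [
    "Should I walk or drive my car to the car wash 50 meters away?",
    "How much water fits in a 16 ounce cup?",
]
print(most_relevant_example("walk or drive to the car wash?", examples))
```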

[–] humanspiral@lemmy.ca 5 points 2 months ago (3 children)

Some takeaways:

Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e., it gets the right answer for the wrong reason.

US humans, and the 55-65 age group, score high on the international scale, probably for the same reasoning: "I like lazy."

[–] melsaskca@lemmy.ca 4 points 2 months ago

I don't use AI but read a lot about it. I now want to google how it attacks the trolley problem.
