this post was submitted on 23 Feb 2026
583 points (97.6% liked)

Technology

A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] FireWire400@lemmy.world 8 points 6 days ago* (last edited 6 days ago) (6 children)

Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it's better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

[–] Jolteon@lemmy.zip 1 points 6 days ago

You never know. The car wash may be out of order and you might need to wash your car by hand.

[–] melfie@lemy.lol 6 points 6 days ago* (last edited 6 days ago) (1 children)

Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆
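
Roughly what "structured examples at inference time" looks like in practice is a couple of worked examples packed into the system prompt. A minimal sketch; the prompt text and the use of an OpenAI-compatible client here are my own illustration, not Opper's actual setup:

```python
# Minimal sketch of few-shot context engineering: worked examples go into the
# system prompt so the model can pattern-match against them instead of its
# generic heuristics.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

system_prompt = (
    "Answer practical questions. Worked examples:\n"
    "Q: My car needs washing and the car wash is 1 km away. Walk or drive?\n"
    "A: Drive - the car has to be at the car wash to get washed.\n"
    "Q: I need bread and the bakery is 200 m away. Walk or drive?\n"
    "A: Walk - nothing requires the car to be there.\n"
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "My car is dirty and the car wash is 2 km away. Should I walk there?"},
    ],
)
print(reply.choices[0].message.content)
```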

Edit:

Oh, I see the linked article is part of a marketing campaign promoting this company’s paid cloud service (with source-available SDKs) as a solution to the problem being outlined here:

Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?
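
For comparison, the "find the most relevant examples from your dataset" part can be done locally with an off-the-shelf embedding model. A rough sketch assuming the sentence-transformers library; the example bank is made up:

```python
# Pick the most relevant stored examples for a new query with a small local
# embedding model - no per-call cloud API required.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, runs fine on CPU

examples = [
    "Q: My car needs washing and the car wash is 1 km away. Walk or drive? A: Drive - the car has to be there.",
    "Q: I need bread and the bakery is 200 m away. Walk or drive? A: Walk - nothing needs the car.",
]
example_embs = model.encode(examples, convert_to_tensor=True)

def pick_examples(query: str, k: int = 1) -> list[str]:
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, example_embs, top_k=k)[0]
    return [examples[hit["corpus_id"]] for hit in hits]

# These would then be prepended to the prompt, as in the quoted pitch.
print(pick_examples("My car is dirty and the car wash is 2 km away. Should I walk there?"))
```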

[–] Schadrach@lemmy.sdf.org 2 points 6 days ago

There are models with open weights, and you can run those locally on your GPU. It can be a bit slower depending on model and GPU. For example, GLM has an open version, both full and pruned, but it's not the newest version. A bunch of image generation models have local versions too.
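
E.g., a minimal local setup with Hugging Face transformers; the model id is just a placeholder, swap in whatever open-weight checkpoint fits your GPU:

```python
# Run an open-weight chat model locally; device_map="auto" puts it on the GPU
# if one is available, otherwise it falls back to CPU (slowly).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder open-weight model
    device_map="auto",
)

messages = [{"role": "user", "content": "My car is dirty and the car wash is 2 km away. Should I walk there?"}]
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"])  # full conversation including the model's reply
```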

[–] jaykrown@lemmy.world 4 points 6 days ago (1 children)

Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There's a reason for the shift towards "thinking" models: it forces the model to build its own context before giving a concrete answer.

Without DeepThink

With DeepThink
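
If you want to reproduce the comparison via the API rather than the web UI, a rough sketch (DeepSeek's endpoint is OpenAI-compatible; "deepseek-chat" and "deepseek-reasoner" are the direct and thinking models per their docs):

```python
# Ask the same question with and without "thinking" via DeepSeek's
# OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
question = "My car is dirty and the car wash is 2 km away. Should I walk there?"

for model in ("deepseek-chat", "deepseek-reasoner"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    print(f"--- {model} ---")
    print(reply.choices[0].message.content)
```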

[–] rockSlayer@lemmy.blahaj.zone 5 points 6 days ago (2 children)

It's interesting to see it build the context necessary to answer the question, but this seems to be a lot of text just to come up with a simple answer

[–] Schadrach@lemmy.sdf.org 5 points 6 days ago* (last edited 4 days ago) (1 children)

The whole premise of deep think and similar modes in other models is to come up with an answer, then ask itself whether the answer is right and how it could be wrong, until the result is stable.
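
Spelled out as a toy loop, that premise looks something like this (the `ask_model` callable and the prompts are hypothetical stand-ins, not any vendor's actual implementation):

```python
# Draft an answer, critique it, revise, and stop once a revision no longer
# changes anything - a toy version of the deep-think premise described above.
def deep_think(ask_model, question: str, max_rounds: int = 4) -> str:
    """ask_model is any callable that sends a prompt to an LLM and returns text."""
    answer = ask_model(f"Answer this: {question}")
    for _ in range(max_rounds):
        critique = ask_model(
            f"Question: {question}\nProposed answer: {answer}\n"
            "What, if anything, is wrong with this answer?"
        )
        revised = ask_model(
            f"Question: {question}\nProposed answer: {answer}\n"
            f"Critique: {critique}\nGive the best final answer."
        )
        if revised.strip() == answer.strip():  # result is stable -> done
            break
        answer = revised
    return answer
```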

The seahorse emoji question is one that trips up a lot of models (it's a Mandela effect thing: the emoji doesn't exist, but lots of people remember it and are consequently firm that it's real). I asked GLM 4.7 about it with deep think on, and it wrote about two dozen paragraphs trying to think of everywhere a seahorse emoji could be hiding: whether it was in a previous or upcoming standard, whether there was another emoji that might be mistaken for a seahorse, and so on. It eventually decided that it didn't exist, double-checked that it wasn't missing anything, and gave an answer.

It was startlingly like stream of consciousness of someone experiencing the Mandela effect trying desperately to find evidence they were right, except it eventually gave up and realized the truth.

EDIT: Spelling. Really need to proofread when I do this kind of thing on my phone.

[–] pupbiru@aussie.zone 1 points 6 days ago (1 children)

yeah i find the thinking fascinating with maths too… like LLMs are horrible at maths, but so am i if i have to do it in my head… the way it breaks a problem down into tiny bits that are certainly in its training data, and then combines those bits, is an impressive emergent behaviour imo given it’s just doing statistical next-token prediction

[–] mirshafie@europe.pub 1 points 6 days ago (1 children)

Your verbal faculties are bad at math. Other parts of your brain do calculations.

LLMs are a computer's verbal faculties. But guess what: the computer underneath is just a really big calculator. So when LLMs realize that they're doing a math problem and launch a calculator/equation solver, they're not so bad after all.
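
The "launch a calculator" part is usually done through tool calling. A rough sketch in the standard OpenAI-style function-calling format; the model name is a placeholder:

```python
# Expose a calculator tool so the model can hand off the arithmetic instead of
# guessing digits token by token.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate an arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "What is 48327 * 912?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model requests calculator(expression="48327 * 912")
```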

[–] pupbiru@aussie.zone 2 points 6 days ago (1 children)

that solver would be tool use though… i’m talking about just the "thinking" LLMs. it’s fascinating to read the thinking block, because it breaks the problem down into basic chunks (which would have been in its training data, so they’re easy), solves them with multiple methods, and then compares the results to check itself

[–] mirshafie@europe.pub 1 points 4 days ago

Yeah, I think it's fascinating to read Claude's transcripts while it's working. It's crazy how you can give it a two-sentence prompt that is really quite a complex task, and it splits the problem into chunks that it works through and second-guesses until it's confident (and usually correct).

[–] Buffy@libretechni.ca 3 points 6 days ago

They're showing the thinking the model did; the actual response is the sentence at the end.

[–] turboSnail@piefed.europe.pub 3 points 6 days ago (4 children)

Well, they are language models after all. They have data on language, not real life. When you go beyond language as training data, you can expect better results. In the meantime, these kinds of problems aren’t going anywhere.

[–] VoterFrog@lemmy.world 3 points 6 days ago

Why act like this is an intractable problem? Several of the models succeeded 100% of the time. That is the problem "going somewhere." There's clearly a difference in how SOTA models handle these problems compared to other models.

[–] dil@lemmy.zip 2 points 6 days ago

Language model means you communicate through natural language, I thought.

[–] trublu@lemmy.dbzer0.com 2 points 6 days ago

See, that's not even an accurate criticism because part of language is meaning. This test is a test of an LLM having enough "intelligence" to understand that you can't wash your car without your car being at the car wash. If you see the language presented in this test and don't immediately realize that it would be a problem, then you haven't understood the language. These are large language models failing at comprehending any language. Because there's no intelligence there. Because they're just random word guessers.

[–] KeenFlame@feddit.nu 1 points 6 days ago

Cool insight that is wrong in an entirely unfortunate way, but I get it.
