It's just overtrained on the puzzle, such that it mostly ignores your prompt. Swapping a few words out doesn't change the fact that it recognises the puzzle. Try writing it out in ASCII, or uploading an image with it written out, or some other weird way that it hasn't been specifically trained on, and I bet it actually performs better.
My dude, what do you think ASCII is? Assuming we're using standard internet interfaces here and the request is coming in as UTF-8 encoded English text, it is already being written out in ASCII.
Sneers aside, given that the supposed capability here is examining a text prompt and reasoning through the relevant information to provide a solution in the form of a text response, this kind of test is, if anything, rigged in favor of the AI compared to similar versions that add more steps to the task, like OCR or other forms of image parsing.
It also speaks to a difference between how AI pattern recognition works and the human version. For a sufficiently well-known pattern like the form of this river-crossing puzzle, it's the changes and exceptions that jump out to a person. This feels almost like giving someone a picture of the Mona Lisa with aviators on; the model recognizes that it's 99% the Mona Lisa and goes from there, rather than treating the changes from that base case as significant and intentional variation instead of as a totally new thing or a 'corrupted' version of the original.
Exactly. It's overtrained on the test, so it ignores the differences. If you instead use something it can read but doesn't recognise as the test pattern (i.e. something that doesn't map onto the same tokens/embeddings), it will perform better. I'm not joking, it's a common tactic for getting around censoring. You're just working around the issue, though. What I'm saying is they've trained the model so much on benchmarks that it is indeed dumber.
I don't think that the actual performance here is as important as the fact that it's clearly not meaningfully "reasoning" at all. This isn't a failure mode that happens if it's actually thinking through the problem in front of it and understanding the request. It's a failure mode that comes from pattern matching without actual reasoning.
Exactly. Also, looking at its chain-of-wordvomit (which apparently I can't share other than by cutting and pasting it somewhere), I don't think this is the same as GPT-4 overfitting to the original river crossing and always bringing items back needlessly.
Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling it impossible) has nothing to do with the original.
In the other one it does bring items back; it tries different orders and even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives the wrong answer.
Here's the full logs:
https://pastebin.com/HQUExXkX
Content warning: AI wordvomit which is so bad it's folded away, hidden in a Google tool.
That's fascinating, actually. Like, it seems like it shouldn't be possible to create this level of grammatically correct text without understanding the words you're using, and yet even right after defining "unsupervised" correctly the system still (supposedly) immediately sets about applying a baffling number of alternative constraints that it seems to pull out of nowhere.
OR, alternatively, despite letting it "cook" for longer and pregenerate a significant volume of its own additional context before the final answer, the system is still, at the end of the day, an assembly of stochastic parrots that don't actually understand anything.
Yeah, it really is fascinating. It follows some sort of recipe to try to solve the problem, as if it's been trained to work a bit like an automatic algebra system.
I think they employed a lot of people to write generators of variants of select common logical puzzles, e.g. river crossings with varying boat capacities and constraints, generating both the puzzle and the corresponding step-by-step solution with "reasoning" and a re-printing of the state of the items on every step, and all that.
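For what it's worth, here's a rough Python sketch of the kind of generator script I'm imagining. This is purely a guess at what they might have done: the item list, constraint wording, and function name are all made up, and it only emits the puzzle text; the paired step-by-step "reasoning" would come from running a small solver over each variant.

```python
import random

# Hypothetical item pool; whatever they actually used is anyone's guess.
ITEMS = ["wolf", "goat", "cabbage", "duck", "fox", "grain", "chicken"]

def make_variant(rng: random.Random) -> str:
    """Emit one river-crossing variant as plain text: random items, a random
    boat capacity, and one random 'can't be left alone together' pair."""
    items = rng.sample(ITEMS, k=rng.randint(3, 4))
    capacity = rng.randint(1, 2)        # items the farmer may carry per trip
    a, b = rng.sample(items, k=2)       # the pair that needs supervision
    return (
        f"A farmer needs to move a {', a '.join(items)} across a river. "
        f"The boat carries the farmer and up to {capacity} item(s). "
        f"The {a} cannot be left alone with the {b}. "
        "How does the farmer get everything across?"
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_variant(rng), end="\n\n")
```

Point being, churning out thousands of these (plus worked solutions) is cheap, which is also exactly why they stop telling you anything once they're in the training data.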
It seems to me that their thinking is that successive parroting can amount to reasoning, if it's parroting well enough. I don't think it can. They have this one-path approach, where it just tries doing steps and representing state, always trying the same thing.
What they need for this problem is to take a different kind of step: reduction (the duck cannot be left unsupervised -> the duck must be taken with me on every trip -> rewrite the problem without the duck and with the boat capacity reduced by 1 -> solve -> rewrite the solution with "take the duck with you" added to every trip).
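To make that concrete, here's a minimal sketch of that reduction. The puzzle representation and function names here are mine, and for illustration I'm assuming the unsupervised duck is the only constraint, as in my variant:

```python
from typing import List, Tuple

# A puzzle: items to ferry, boat capacity (items carried besides the farmer),
# and the one item that can never be left unsupervised.
Puzzle = Tuple[List[str], int, str]

def reduce_puzzle(puzzle: Puzzle) -> Tuple[List[str], int]:
    """If the duck has to ride on every trip, drop it from the item list and
    shrink the boat capacity by one: a strictly smaller, constraint-free puzzle."""
    items, capacity, chaperoned = puzzle
    return [i for i in items if i != chaperoned], capacity - 1

def lift_solution(trips: List[List[str]], chaperoned: str) -> List[List[str]]:
    """Turn a solution of the reduced puzzle into one for the original by
    putting the chaperoned item on board for every trip (returns included)."""
    return [[chaperoned] + trip for trip in trips]

# Duck variant: boat holds the farmer plus up to 2 items, duck can't be left alone.
reduced_items, reduced_cap = reduce_puzzle((["duck", "grain", "wolf", "goat"], 2, "duck"))
print(reduced_items, reduced_cap)   # ['grain', 'wolf', 'goat'] 1

# Solve the reduced puzzle however you like; with no other constraints it's just
# one item per trip, coming back with an empty boat in between:
reduced_trips = [["grain"], [], ["wolf"], [], ["goat"]]
print(lift_solution(reduced_trips, "duck"))
# [['duck', 'grain'], ['duck'], ['duck', 'wolf'], ['duck'], ['duck', 'goat']]
```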
But if they add this, then there are two possible paths it can take on every step, and this thing is far too slow to brute force the right one. They may get it to solve my duck variant, but at the expense of making it fail a lot of other variants.
The other problem is that even the most elementary-seeming reasoning involves a great many applications of basic axioms. This is what doomed symbol manipulation "AI" in the past, and it's what is dooming it now.