It's just overtrained on the puzzle, such that it mostly ignores your prompt. Swapping a few words out doesn't change the fact that it recognises the puzzle. Try writing it out in ASCII, or uploading an image with it written out, or some other weird way that it hasn't been specifically trained on, and I bet it actually performs better.
My dude, what do you think ASCII is? Assuming we're using standard internet interfaces here and the request is coming in as UTF-8 encoded English text, it is already being written out in ASCII.
Sneers aside, given that the supposed capability here is examining a text prompt and reasoning through the relevant information to provide a solution in the form of a text response, this kind of test is, if anything, rigged in favor of the AI compared to similar versions that add more steps to the task, like OCR or other forms of image parsing.
It also speaks to a difference between how AI pattern recognition works and the human version. For a sufficiently well-known pattern like the form of this river-crossing puzzle, it's the changes and exceptions that jump out to a person. This feels almost like giving someone a picture of the Mona Lisa with aviators on; the model recognizes that it's 99% the Mona Lisa and goes from there, rather than treating the changes from that base case as significant and intentional variation instead of as a totally new thing or a 'corrupted' version of the original.
Exactly. It's overtrained on the test, so it ignores the differences. If you instead use something it can read but doesn't recognise as the test pattern (i.e. something that doesn't map onto the same tokens/embeddings), it will perform better. I'm not joking, it's a common tactic for getting around censoring. You're just working around the issue, though. What I'm saying is they've trained the model so much on benchmarks that it is indeed dumber.
I don't think that the actual performance here is as important as the fact that it's clearly not meaningfully "reasoning" at all. This isn't a failure mode that happens if it's actually thinking through the problem in front of it and understanding the request. It's a failure mode that comes from pattern matching without actual reasoning.
Exactly. Also, looking at its chain-of-wordvomit (which apparently I can't share other than by cutting and pasting it somewhere), I don't think this is the same as GPT-4 overfitting to the original river crossing and always bringing items back needlessly.
Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling it impossible) has nothing to do with the original.
In the other one it does bring items back; it tries different orders and even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives the wrong answer.
Here's the full logs:
https://pastebin.com/HQUExXkX
Content warning: AI wordvomit which is so bad it's folded away, hidden in a Google tool.
That's fascinating, actually. Like, it seems like it shouldn't be possible to create this level of grammatically correct text without understanding the words you're using, and yet even right after defining "unsupervised" correctly the system still (supposedly) immediately sets about applying a baffling number of alternative constraints that it seems to pull out of nowhere.
OR, alternatively, despite letting it "cook" for longer and pregenerate a significant volume of its own additional context before the final answer, the system is still, at the end of the day, an assembly of stochastic parrots that don't actually understand anything.
Yeah, it really is fascinating. It follows some sort of recipe to try to solve the problem, as if it's been trained to work a bit like an automatic algebra system.
I think they employed a lot of people to write generators of variants of select common logical puzzles, e.g. river crossings with varying boat capacities and constraints, generating both the puzzle and the corresponding step-by-step solution with "reasoning" and a re-printing of the state of the items on every step, and all that.
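For what it's worth, here's a rough Python sketch of the kind of generator script I'm imagining. This is purely a guess at what they might have done: the item list, constraint wording, and function name are all made up, and it only emits the puzzle text; the paired step-by-step "reasoning" would come from running a small solver over each variant.

```python
import random

# Hypothetical item pool; whatever they actually used is anyone's guess.
ITEMS = ["wolf", "goat", "cabbage", "duck", "fox", "grain", "chicken"]

def make_variant(rng: random.Random) -> str:
    """Emit one river-crossing variant as plain text: random items, a random
    boat capacity, and one random 'can't be left alone together' pair."""
    items = rng.sample(ITEMS, k=rng.randint(3, 4))
    capacity = rng.randint(1, 2)        # items the farmer may carry per trip
    a, b = rng.sample(items, k=2)       # the pair that needs supervision
    return (
        f"A farmer needs to move a {', a '.join(items)} across a river. "
        f"The boat carries the farmer and up to {capacity} item(s). "
        f"The {a} cannot be left alone with the {b}. "
        "How does the farmer get everything across?"
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_variant(rng), end="\n\n")
```

Point being, churning out thousands of these (plus worked solutions) is cheap, which is also exactly why they stop telling you anything once they're in the training data.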
It seems to me that their thinking is that successive parroting can amount to reasoning, if it's parroting well enough. I don't think it can. They have this one-path approach, where it just tries doing steps and representing state, always trying the same thing.
What they need for this problem is to take a different kind of step: reduction (the duck cannot be left unsupervised -> the duck must be taken with me on every trip -> rewrite the problem without the duck and with the boat capacity reduced by 1 -> solve -> rewrite the solution with "take the duck with you" added to every trip).
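To make that concrete, here's a minimal sketch of that reduction. The puzzle representation and function names here are mine, and for illustration I'm assuming the unsupervised duck is the only constraint, as in my variant:

```python
from typing import List, Tuple

# A puzzle: items to ferry, boat capacity (items carried besides the farmer),
# and the one item that can never be left unsupervised.
Puzzle = Tuple[List[str], int, str]

def reduce_puzzle(puzzle: Puzzle) -> Tuple[List[str], int]:
    """If the duck has to ride on every trip, drop it from the item list and
    shrink the boat capacity by one: a strictly smaller, constraint-free puzzle."""
    items, capacity, chaperoned = puzzle
    return [i for i in items if i != chaperoned], capacity - 1

def lift_solution(trips: List[List[str]], chaperoned: str) -> List[List[str]]:
    """Turn a solution of the reduced puzzle into one for the original by
    putting the chaperoned item on board for every trip (returns included)."""
    return [[chaperoned] + trip for trip in trips]

# Duck variant: boat holds the farmer plus up to 2 items, duck can't be left alone.
reduced_items, reduced_cap = reduce_puzzle((["duck", "grain", "wolf", "goat"], 2, "duck"))
print(reduced_items, reduced_cap)   # ['grain', 'wolf', 'goat'] 1

# Solve the reduced puzzle however you like; with no other constraints it's just
# one item per trip, coming back with an empty boat in between:
reduced_trips = [["grain"], [], ["wolf"], [], ["goat"]]
print(lift_solution(reduced_trips, "duck"))
# [['duck', 'grain'], ['duck'], ['duck', 'wolf'], ['duck'], ['duck', 'goat']]
```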
But if they add this, then there are two possible paths it can take on every step, and this thing is far too slow to brute force the right one. They may get it to solve my duck variant, but at the expense of making it fail a lot of other variants.
The other problem is that even the most elementary-seeming reasoning involves a great many applications of basic axioms. This is what doomed symbol manipulation "AI" in the past, and it's what is dooming it now.