AI alchemists discovered that the statistics machine lands in a better ballpark if you give it worked examples, with the intermediate reasoning spelled out, as part of your ask. This is called Chain of Thought prompting. Example:
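Roughly the kind of thing people mean (a made-up illustration, not from any particular paper): you show the machine a worked example with the reasoning written out, then your real question.

```
Q: A cafe sells coffee for $3 and muffins for $2. I buy 2 coffees and a muffin. How much do I pay?
A: 2 coffees cost 2 x $3 = $6. One muffin costs $2. $6 + $2 = $8. The answer is $8.

Q: A shop sells pens for $4 and notebooks for $6. I buy 3 pens and 2 notebooks. How much do I pay?
A:
```

The model then (hopefully) imitates the step-by-step pattern instead of blurting out a number.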
Then the AI alchemists said, hey, we can automate this by having the model eat more of its own shit. So a reasoning model will ask itself "What does the user want when they say <your prompt>?" This generates text that gets added to your query before the final answer is generated. All models with "chat memory" effectively eat their own shit: the tech works by reprocessing the whole chat history (sometimes there's a cache) every time you reply. Reasoning models, because they emulate chain of thought, eat more of their own shit than non-reasoning models do.
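A minimal sketch of those mechanics in Python (call_llm is a stand-in stub so this runs on its own; real products differ in the details, this is just the shape of the loop):

```python
# Toy sketch of chat memory + automated "reasoning": every turn, the model's
# own generated text is appended to the history and re-processed next time.
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a marker so the script runs.
    return f"<generated text for a {len(prompt)}-char prompt>"

history = ""  # everything the model has seen or said so far

def reasoning_turn(user_msg: str) -> str:
    global history
    # 1. Automated chain of thought: the model first "asks itself" about the request.
    reasoning = call_llm(history + f'\nWhat does the user want when they say "{user_msg}"?')
    # 2. That generated reasoning is appended to the input used for the final answer.
    answer = call_llm(history + "\n" + reasoning + "\nUser: " + user_msg + "\nAssistant:")
    # 3. Chat memory: the whole exchange, including the generated reasoning,
    #    is kept and re-processed on every later turn.
    history += f"\nUser: {user_msg}\n[reasoning] {reasoning}\nAssistant: {answer}"
    return answer

print(reasoning_turn("Summarise this article for me"))
print(reasoning_turn("Now make it shorter"))
```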
Some reasoning models are worse than others because some refeed the entire history of the reasoning, and others only refeed the current prompt's reasoning.
Essentially it's a form of compound error.
Another point of anecdata: I've read vibe coders saying that non-reasoning models give better results on coding tasks; they're faster, and they tend to hallucinate less because they don't pollute the context with automated CoT. I've seen people recommend the DeepSeek V3 03/2025 release (with DeepThink turned off) over R1 for that reason.
But that is effectively what's happening with RLMs and the refeed. LLMs weight the interplay between the model and its inputs statistically; RAG setups, for example, put higher weight on the text retrieved from your source documents. RLM reasoning is a fully automated CoT prompting technique: you don't provide the chain, you don't ask the LLM to create the chain, it just does it all the time for everything. Meaning the input becomes more and more polluted with generated text, which reinforces the existing biases in the model.
For example, take the em dash issue: the idea is that LLMs already generate more em dashes than exist in human-written text. Say on turn 1 you get an output with em dashes. On turn 2 this is fed back into the machine, which reinforces that over-indexing on em dashes in your prompt. So turn 2's output will potentially have even more em dashes, because turn 2's input contained turn 1's output, which already had more em dashes than normal. Over time your input ends up accumulating the model's biases through the history. The shorter your messages each turn and the longer the conversation, the faster the conversation input converges on being mostly LLM-generated text.
When you do this with an RLM, even more output is being added to the input automatically via the CoT prompt, meaning that any model biases accumulate in the input even faster.
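To put toy numbers on it (nothing here is a real model, it's just arithmetic showing which way the feedback loop points; every constant is an assumption):

```python
# Toy arithmetic: as model-generated text piles up in the history, the input's
# statistics drift toward the model's own biases, and automated CoT adds more
# generated tokens per turn, so it drifts faster. All numbers are made up.
HUMAN_RATE = 0.010   # assumed em-dash rate in human-written text
MODEL_RATE = 0.050   # assumed em-dash rate in model-generated text

def simulate(cot_tokens: int, label: str, turns: int = 6) -> None:
    human = 500                      # a longish initial human-written prompt
    machine = 0                      # model-generated tokens in the history
    for turn in range(1, turns + 1):
        human += 20                  # short follow-up messages each turn
        machine += 300 + cot_tokens  # answer tokens (+ reasoning, if any)
        rate = (human * HUMAN_RATE + machine * MODEL_RATE) / (human + machine)
        print(f"{label} turn {turn}: input em-dash rate ≈ {rate:.4f}")

simulate(cot_tokens=0,   label="plain LLM")
simulate(cot_tokens=700, label="RLM w/ CoT")
```

The exact numbers are meaningless; the point is that both curves climb toward the model's own rate, and the CoT one gets there sooner.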
Another reason I suspect the CoT refeed rather than training data pollution: GPT-4.5, the latest (Feb 2025) non-reasoning model, seems to have a lower hallucination rate on SimpleQA than o1. If the training data were the issue, we'd see rates closer to o3/o4.
The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user-accessible, and it's purposefully trained not to have safeguards on its reasoning, whereas o3 and o4 have visible reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding, and output. So on a non-o1 model that safeguard process happens twice per turn, once for reasoning and once for output, and the result accumulates into the next turn's input. On o1 it happens once per turn, only for output, before accumulating.
Ah, so ChatGPT works slightly differently from what I'm used to. I've really only explored AI locally via ollama, because I'm an OSS zealot and I need my tooling to be deterministic, so I have to be able to eliminate service changes from the equation.
My experience is that with ollama and DeepSeek R1 it reprocesses the think tags; they get referenced directly.
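Roughly how I poke at it locally (assumes the ollama Python package and a pulled deepseek-r1 model; the strip_think toggle is exactly the knob in question):

```python
# Keep a message list, append the assistant's reply verbatim (including its
# <think>...</think> block), and it all gets re-processed on the next turn.
import re
import ollama

def chat(messages: list, strip_think: bool = False) -> str:
    resp = ollama.chat(model="deepseek-r1", messages=messages)
    content = resp["message"]["content"]
    if strip_think:
        # Drop the reasoning block before it goes back into the history.
        content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
    messages.append({"role": "assistant", "content": content})
    return content

messages = [{"role": "user", "content": "What's a monad, in one sentence?"}]
print(chat(messages))                 # reply arrives with its <think> block
messages.append({"role": "user", "content": "Now explain it to a child."})
print(chat(messages))                 # turn 1's reasoning is now part of the input
```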
At this point my debug instincts have been convinced that my idea is unlikely.
Have we tried asking ChatGPT what's wrong before chain of thought prompting the benchmark data for each model to itself?
I was trying to make a joke about this and couldn't remember the disease that tech priests get in Rogue Trader, and Google's search AI hallucinated WH40K lore based on someone's homebrew... Not only that, it hallucinated a backstory for the character that isn't even in the post, giving them a genetic developmental disorder... to answer my question. I feel gross. Fucking man-made horrors.
From my other comment about o1 and o3/o4 potential issues:
Interesting. If that's right, it makes a lot of sense that models with this kind of recursive style would generate errors at a much higher rate. If you're taking everything in the session so far as an input and there's some chance for every input that the model produces an error, the errors will rapidly stack up with this kind of functionality. I've seen those time-lapses of how far generative AI can drift over 100 (or whatever) iterations of "reproduce this photo without making any changes" type prompts, with the output of each generation fed back in as input. This strikes me as the same kind of problem, but with text.
It happens faster with images because of the way LLMs work. LLMs work on "tokens": for text, a token is typically a character, a fragment of a word, a word, or a fragment of a sentence. With language it's much easier to encode meaning and be precise, because that's what language already does. The reason NLP is/was difficult is that language isn't algorithmically consistent; it evolves and its rules are constantly broken. For example, Kai Cenat is credited with more contributions to the English language than the vast majority of people because children decided to talk like him. Point being, language does the heavy lifting in terms of encoding a string of characters into something meaningful.
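A quick way to see what text tokens actually look like, using the tiktoken library as an example tokenizer (other models slice text differently):

```python
# Text tokenization: a string becomes integer ids that map back to
# characters, word fragments, and whole words. Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Language does the heavy lifting."
ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]
print(ids)     # a short list of integers
print(pieces)  # the fragments each id maps back to
```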
With images, it's a whole different ball game. Image tokenizers work in several different ways, but broadly there are two types of token: hard and soft. Hard tokens, for example, could be regions of the image, or the colors, or the alpha channel.
Hard tokens are also visual encoders of meaning, so a chair, a table, or a car could each be a hard token based on its bounding box. These tokenization techniques build on a lot of other types of machine learning.
Note that these tokens often overlap in practice and consume regions of other tokens; however, as "hard" tokens they are treated as distinct entities, and this is where the trouble starts, especially for image generation (that's roughly why a lot of AI did, and still does, things like draw extra fingers).
The next type is soft tokens, and they're a bit harder to explain, but basically the idea is that soft tokens are encoded by detecting continuous statistical distributions within the image. It's a very abstract way of reading an image. Here's where the trouble compounds.
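For a concrete sense of how abstract that is, one common flavour of the "soft" side (as I understand it) is patch embeddings, roughly what ViT-style image encoders do: chop the image into fixed-size patches and project each one into a vector, so the "tokens" are just points in a continuous space rather than nameable things. A numpy sketch with made-up sizes and a random projection standing in for a trained encoder:

```python
# Minimal sketch of patch-style "soft" image tokens. numpy only; a real model
# would learn the projection instead of using random numbers.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # a fake RGB image
P, D = 16, 64                              # patch size, embedding dimension

# Split the image into 16x16 patches and flatten each one.
patches = image.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)   # (196, 768)

# A trained encoder would learn this projection; here it's random.
projection = rng.normal(size=(P * P * 3, D))
soft_tokens = patches @ projection         # (196, 64) continuous "tokens"
print(soft_tokens.shape)
```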
So now, when we're writing an image, what do we write it with? You guessed it: tokens. The reason those AI drift time-lapses exist is that LLMs are statistical, not "functional". They don't have the mathematical concept of "identity". Otherwise they'd recreate the same image by copying the data in the exact tokens (or just copying the image itself); instead they try to regenerate the image by generating new tokens with the same attributes they read from the original.
To illustrate, let's say an image contains a blue car and the AI can only tokenize it as "blue car". Asking an LLM to run an identity function on that image will produce a different car, because the token's resolution is only two dimensions, "blue" and "car", which roughly means it will output the average "blue" "car" from its training data. With human-made things it's actually a lot easier: there's a finite variation of cars. However, there's an infinite variation of things that can happen to a car. So an AI can theoretically run an identity function off a particular make/model/year of vehicle, but if the paint is scratched or the paint job is unique it will start to introduce drift; there are also other sources of drift like camera angle, etc. With natural objects this becomes a whole different ball game because of the level of variation, and the complexity compounds with whole scenes.
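A silly toy version of that lossy round trip (a dictionary standing in for the tokenizer, obviously not how a real model stores anything):

```python
# Toy "identity function" through a lossy tokenizer: everything about this
# specific car that doesn't fit the vocabulary is gone on the way back out.
ENCODER_VOCAB = {"colour": ["blue", "red"], "kind": ["car", "truck"]}

def encode(description: str) -> tuple:
    colour = next(c for c in ENCODER_VOCAB["colour"] if c in description)
    kind = next(k for k in ENCODER_VOCAB["kind"] if k in description)
    return (colour, kind)              # scratches, angle, paint job: all dropped

def decode(tokens: tuple) -> str:
    # A generator can only give back "the average" of what the tokens name.
    return f"a generic {tokens[0]} {tokens[1]}"

original = "a blue car with a scratched door, photographed from the left"
print(decode(encode(original)))        # -> "a generic blue car"
```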
So identity functions on text are extremely easy in comparison. For example:
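(made-up transcript, but basically any model handles this fine)

```
User: Repeat the following exactly, changing nothing: "The quick brown fox jumps over the lazy dog."
LLM:  The quick brown fox jumps over the lazy dog.
```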
This works because the tokens are simpler and there's less loss of "resolution" going from the text to its tokenized form, e.g. the word "Poopy" is the token "poopy". But once you get into interpreting an image and re-encoding those interpretations onto a new canvas, it becomes much more difficult, e.g. an image of "Dwayne the Rock Johnson" is most likely a series of tokens like "buff man", "bad actor", etc.
This is a rough explanation because there's a lot of voodoo, and I'm more of a Software Engineer than I am a statistics/data guy so I approach the alchemy a little bit from an alchemical standpoint.