But that is effectively what's happening with RLMs and refeed. LLMs apply statistical weights between the model and its inputs; for example, RAG models will weight the text retrieved from your source documents more heavily. RLM reasoning is a fully automated CoT prompting technique: you don't provide the chain, you don't ask the LLM to create the chain, it just does it all the time for everything. That means the input becomes more and more polluted with generated text, which reinforces the existing biases in the model.
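Roughly what I mean by refeed, as a purely illustrative sketch (`generate()` is just a stand-in for the model call, not any vendor's actual implementation):

```python
# Purely illustrative: how a chat loop refeeds the accumulated transcript each
# turn. `generate()` is a placeholder, not any real API.

history: list[str] = []

def generate(context: str) -> str:
    """Placeholder for the model: returns a completion (and, for an RLM, its
    auto-generated chain of thought) conditioned on the whole context."""
    raise NotImplementedError

def turn(user_message: str) -> str:
    history.append(f"User: {user_message}")
    # The model never sees just the new message; it sees the whole transcript,
    # which already contains its own earlier (bias-carrying) outputs.
    output = generate("\n".join(history))
    history.append(f"Assistant: {output}")
    return output
```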
For example, take the em dash issue: the idea is that LLMs already generate more em dashes than appear in human-written text. Say on turn 1 you get an output with em dashes. On turn 2 that output is fed back into the machine, which reinforces the over-indexing on em dashes in your prompt. So turn 2's output will potentially have even more em dashes, because the input on turn 2 contained output from turn 1 that already had more em dashes than normal. Over time your input ends up accumulating the model's biases through the history. The shorter your inputs on each turn and the longer the conversation, the faster the conversation input converges on being mostly LLM-generated text.
When you do this with an RLM, you have even more output being added to the input automatically via the CoT prompt, meaning any model biases accumulate in the input even faster. The toy numbers below try to make that concrete.
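A back-of-the-envelope simulation of the accumulation, under my assumption that the chain of thought gets refed along with the reply. Every token count here is invented; only the trend matters.

```python
# Toy simulation of how much of an accumulated context is model-generated.
# Every token count is invented for illustration; only the trend matters.

def generated_fraction(turns: int, first_user_tokens: int, user_tokens: int,
                       reply_tokens: int, cot_tokens: int = 0) -> float:
    """Fraction of the accumulated transcript that came from the model after
    `turns` turns: one long opening prompt, short follow-ups, and a reply
    (plus an optional refed chain of thought) from the model each turn."""
    human = first_user_tokens + (turns - 1) * user_tokens
    model = turns * (reply_tokens + cot_tokens)
    return model / (human + model)

for t in (1, 5, 20):
    plain = generated_fraction(t, first_user_tokens=2000, user_tokens=30,
                               reply_tokens=300)
    rlm = generated_fraction(t, first_user_tokens=2000, user_tokens=30,
                             reply_tokens=300, cot_tokens=900)
    print(f"turn {t:>2}: plain chat {plain:.0%} model text, "
          f"CoT refeed {rlm:.0%} model text")
```

With these made-up numbers the plain chat's context is roughly 70% model text by turn 20, while the CoT-refeed context crosses that point within about five turns.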
Another reason I suspect the CoT refeed rather than training data pollution is that GPT-4.5, the latest (Feb 2025) non-reasoning model, seems to have a lower hallucination rate on SimpleQA than o1. If the training data were the issue, we'd see rates closer to o3/o4.
The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user-accessible, and it's purposely trained not to have safeguards on its reasoning, whereas o3 and o4 have user-visible reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding, and output. So on o3/o4 that safeguard process happens twice per turn, once for the reasoning and once for the output, and then gets accumulated into the next turn's input. On o1 it happens once per turn, only for the output, before being accumulated.
Ah, so ChatGPT works slightly differently from what I'm used to. I've really only explored AI locally via Ollama, because I'm an OSS zealot and I need my tooling to be deterministic, so I have to be able to eliminate service-side changes from the equation (something like the sketch below).
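For anyone curious, this is roughly the kind of pinned-down local setup I mean. It assumes a stock Ollama server on the default port, and even with a fixed seed and temperature 0 you're only reproducible for a given model file, build, and hardware:

```python
# Sketch of pinning down a local Ollama call for reproducibility.
# Assumes an Ollama server on the default localhost:11434.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1",       # pin a specific local model tag
        "prompt": "Why do LLMs overuse em dashes?",
        "stream": False,
        "options": {
            "seed": 42,               # fixed seed for repeatable sampling
            "temperature": 0,         # greedy decoding
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```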
My experience is that with Ollama and DeepSeek R1 it reprocesses the think tags; they get referenced directly.
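If you're driving the loop yourself, one way to avoid that is to strip the reasoning block from each reply before it goes back into the history. A sketch, assuming DeepSeek R1's `<think>...</think>` format and Ollama's /api/chat endpoint:

```python
# Sketch: drop <think>...</think> blocks from prior replies before refeeding,
# so the accumulated history only contains final answers, not the chain of
# thought. Assumes a hand-rolled chat loop against a local Ollama server.
import re
import requests

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(text: str) -> str:
    return THINK_RE.sub("", text).strip()

messages = []  # [{"role": ..., "content": ...}]

def chat_turn(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "deepseek-r1", "messages": messages, "stream": False},
        timeout=300,
    )
    reply = resp.json()["message"]["content"]
    # Keep only the final answer in the history that gets refed next turn.
    messages.append({"role": "assistant", "content": strip_think(reply)})
    return reply
```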
At this point my debugging instincts are convinced that my idea is unlikely.
Have we tried asking ChatGPT what's wrong, before chain-of-thought prompting the benchmark data for each model back to itself?
I was trying to make a joke about this and was trying to remember the disease that tech priests get in Rogue Trader, and Google's search AI hallucinated WH40K lore based on someone's homebrew... Not only that, it hallucinated the character's backstory, which isn't even in the post, to give them a genetic developmental disorder... to answer my question. I feel gross. Fucking man-made horrors.