this post was submitted on 08 Dec 2025
That study doesn't seem to support the point you're using it to make. First, it's talking about machines with error-correcting (ECC) RAM, which most consumer devices don't have. The whole point of ECC RAM is that it corrects a single bit flip in a memory cell and can detect a second one, so the machine can, e.g., trigger a shutdown rather than just doing whatever the now-incorrect value tells it to (which might be crashing, might be emitting an incorrect result, or might be something benign). Consumer devices didn't have this protection until DDR5, which can fix a single bit flip but won't detect a second, so it can still trigger misbehaviour. Also, the data in the tables gives figures around 10% for the chance of an individual device experiencing an unrecoverable error per year, which isn't really that often, especially given that most software is buggy enough that you'd be lucky to use it for a year with only a 10% chance of it doing something wrong.
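For anyone who wants to see the "fix one flip, detect a second" behaviour concretely, here is a minimal sketch using an extended Hamming (8,4) code. This is a toy: real ECC DIMMs typically protect 64-bit words with 8 check bits, and the function names `encode` and `decode` are made up for this illustration, not taken from the paper or any hardware spec.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                # one bit flip: silently repaired
print(decode(word))         # ('corrected', 11)
word[2] ^= 1                # second bit flip: detected, not repaired
print(decode(word))         # ('uncorrectable', None)
```

Without the overall parity bit at index 0, the second flip would be "corrected" to the wrong value instead of being flagged, which is the single-correct-only failure mode described above.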
It's a paper from 2009 talking about "commodity servers" with ECC protection. Even back then it was fairly common and relatively cheap to implement, though it was more often integrated into the CPU and/or memory controller. Since 2020, with DDR5, it has been mandatory to integrate it into the memory as well.
Yes, that's my point. Your claim of "computers have nearly no redundancy" is complete bullshit.
It wasn't originally my claim - I replied to your comment as I was scrolling past because it had a pair of sentences that seemed dodgy, so I clicked the link it cited as a source, and replied when the link didn't support the claim.
Specifically, I'm referring to
This just isn't correct:
Sorry, I wasn't paying attention and missed that. I apologize.
Integrated memory ECC isn't the only check, it's an extra redundancy. The point of that paper was to show how often single bit errors occur within one part of a computer system.
Right, because of redundancies. It takes 2 simultaneous bit flips in different regions of the memory in order to cause a memory error and it's still ~10% chance annually according to the paper I cited.
ECC genuinely is the only check against memory bitflips in a typical system. Obviously, there's other stuff that gets used in safety-critical or radiation-hardened systems, but those aren't typical. Most software is written assuming that memory errors never happen, and checksumming is only used when there's a network transfer or, less commonly, when data's at rest on a hard drive or SSD for a long time (but most people are still running a filesystem with no redundancy beyond journaling, which is really meant for things like unexpected power loss).
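As a rough illustration of the kind of checksumming used on network transfers, here is a small sketch with CRC-32 from Python's standard `zlib` module. The helper names `flip_bit` and `matches` and the sample payload are invented for this example. Note that the check only happens at the point where someone explicitly verifies the checksum; data sitting in RAM between those points gets no such protection.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a memory bit flip."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def matches(data: bytes, expected_crc: int) -> bool:
    """True if data still hashes to the checksum recorded earlier."""
    return zlib.crc32(data) == expected_crc


payload = b"some data sent over a network"
crc = zlib.crc32(payload)           # computed by the sender

corrupted = flip_bit(payload, 13)   # a single bit flips somewhere along the way

print(matches(payload, crc))        # True
print(matches(corrupted, crc))      # False: CRC-32 catches every single-bit error
```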
There are things that mitigate the impact of memory errors on devices that can't detect and correct them, but they're not redundancies. They don't keep everything working when a failure happens; instead they just isolate the problem to a single process so you don't lose unsaved work in other applications, etc. The main things they're designed to protect against are software bugs and malicious actors, not memory errors. It just happens to be the case that they work on other things, too.
Also, it looks like some of the confusion is because of a typo in my original comment where I said unrecoverable instead of recoverable. The figures that are around 10% per year are in the CE column, which is the correctable errors, i.e. a single bit that ECC puts right. The figures for unrecoverable/uncorrectable errors are in the UE column, and they're around 1%. It's therefore the 10% figure that's relevant to consumer devices without ECC, with no need to extrapolate how many single bit flips would need to happen to cause 10% of machines to experience double bit flips.
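To put the two columns side by side over a device's lifetime, here is a back-of-the-envelope sketch. It assumes each year is independent with a fixed annual probability, which is a simplification of mine and not something the paper claims, and it uses the rounded ~10% (CE) and ~1% (UE) figures quoted above rather than exact table values. `chance_over_years` is a hypothetical helper name.

```python
def chance_over_years(annual_probability: float, years: int) -> float:
    """Chance of at least one event in `years` years, assuming independent years."""
    return 1.0 - (1.0 - annual_probability) ** years


# Rounded annual figures from the comment above (assumed, not exact table values).
for label, p in (("CE, ~10% per year", 0.10), ("UE, ~1% per year", 0.01)):
    for years in (1, 3, 5):
        print(f"{label}: over {years} yr -> {chance_over_years(p, years):.1%}")
```

Under those assumptions, the ~10% annual figure compounds to roughly 41% over five years, versus roughly 5% for the ~1% figure.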