Cloudflare blames massive internet outage on 'latent bug' : technology

[–] Kolanaki@pawb.social 141 points 5 months ago* (last edited 5 months ago) (1 children)

[–] eager_eagle@lemmy.world 67 points 5 months ago* (last edited 5 months ago) (1 children)

thanks for illustrating the corpo speak

I hope the bug is fine

[–] moseschrute@lemmy.world 50 points 5 months ago* (last edited 5 months ago) (2 children)

Nobody ever asks if the bug is ok

[–] aeronmelon@lemmy.world 14 points 5 months ago (1 children)

Fun fact time:

That’s why they’re called computer bugs.

In 1947, the Harvard Mark II computer was malfunctioning. Engineers eventually found a dead moth wedged between two relay points, causing a short. Removing it fixed the problem. They saved the moth and it’s on display at a museum to this day.

The moth was not okay.

And to be fair, the word bug had been used to describe little problems and glitches before that incident, but this was the first case of a computer bug.

[–] FauxLiving@lemmy.world 6 points 5 months ago

The moth was not okay.

They didn't tell us this part when they taught it in school #RIP Bug, the OG bug who died to the OG pull request.

[–] tdawg@lemmy.world 7 points 5 months ago (1 children)

Poor guy :(

load more comments (1 replies)

[+] PiraHxCx@lemmy.ml 53 points 5 months ago* (last edited 1 month ago) (4 children)

[deleted]

[–] MagicShel@lemmy.zip 109 points 5 months ago (5 children)

Shitty code has been around far longer than AI. I should know, I wrote plenty of it.

[–] floofloof@lemmy.ca 31 points 5 months ago (1 children)

They trained it on the work of people like you.

[–] MagicShel@lemmy.zip 29 points 5 months ago

Shame on them. I mark my career by how long it takes me to regret the code I write. When I was a junior, it was often just a month or two. As I seasoned it became maybe as long as two years. Until finally i don't regret my code, only the exigencies that prevented me from writing better.

[–] wreckedcarzz@lemmy.world 15 points 5 months ago

The AI was the shitty code we wrote along the way

[–] tdawg@lemmy.world 7 points 5 months ago (1 children)

I too have looked at my earliest repos in dispair

[–] FauxLiving@lemmy.world 2 points 5 months ago (1 children)

It's always depressing when you ask the AI to explain your code and then you get banned from OpenAI

[–] 123@programming.dev 2 points 5 months ago (1 children)

Who didn't get hit by the fork bug the professor explicitly asked you to watch out for since it would (back then with windows systems being required to use the campus resources) require an admin with Linux access to eliminate.

It was kind of fun walking in to the tech support area and them asking your login name with no context knowing what the issue was. Must have been a common occurrence that week of the course.

[–] FauxLiving@lemmy.world 2 points 5 months ago

It was kind of fun walking in to the tech support area and them asking your login name with no context knowing what the issue was.

I see this zip bomb was owned by user icpenis, someone track that guy down.

[–] mycodesucks@lemmy.world 7 points 5 months ago

Now... I don't like to brag...

[–] foo@feddit.uk 3 points 5 months ago

But, AI can do the work of 10 of you humans, so it can write 10 times the bugs and deploy them to production 10 times faster. Especially if pesky testers stay out the way instead of finding some of the bugs.

[–] AbidanYre@lemmy.world 22 points 5 months ago (2 children)

Humans are plenty capable of writing crappy code without needing to blame AI.

[–] KazuyaDarklight@lemmy.world 11 points 5 months ago

Absolutely, but it does feel like things have spiked a bit recently.

[–] tonytins@pawb.social 7 points 5 months ago

Train on shitty code, get shitty code. Garbage in. Garbage out.

[–] ThePantser@sh.itjust.works 15 points 5 months ago

AI coding, AI compiling, AI bug testing, AI users, etc.

[–] renegadespork@lemmy.jelliefrontier.net 13 points 5 months ago

Indirectly, this was. He said this was a bug in their recent tool that allows sites to block AI crawlers that caused the outages. It’s a relatively new tool released in the last few months, so it makes sense it might be buggy as the rush to stop the AI DoS attacks has been pertinent.

[–] FauxLiving@lemmy.world 42 points 5 months ago (6 children)

If you want a technical breakdown that isn't "lol AI bad":

https://blog.cloudflare.com/18-november-2025-outage/

Basically, a permission change cause an automated query to return more data than was planned for. The query resulted in a configuration file with a large amount of duplicate entries which was pushed to production. The size of the file went over the prealloctaed memory limit for a downstream system which died due to an unhandled error state resulting from the large configuration file. This caused a thread panic leading to the 5xx errors.

It seems that Crowdstrike isn't alone this year in the 'A bad config file nearly kills the Internet' club.

[–] phutatorius@lemmy.zip 2 points 5 months ago (1 children)

‘A bad config file nearly kills the Internet’ club

There's no such thing as bad data, only shitty code to create it or ingest it, and bad testing that failed to detect the shitty code. The overflow of the magic config-file size threw an exception, and there was no handler for that? Jeez Louise.

And as for unhandled exceptions, you'd think static analysis would have detected that.

[–] FauxLiving@lemmy.world 1 points 5 months ago

Someone should make a programming language like Rust, but that doesn't crash.

/s

load more comments (5 replies)

[–] PattyMcB@lemmy.world 16 points 5 months ago (1 children)

Blame it on the massive tech sector layoffs

[–] sunbeam60@feddit.uk 3 points 5 months ago (1 children)

Evidence or speculation?

load more comments (1 replies)

[+] halfapage@lemmy.world 11 points 5 months ago (1 children)

[deleted]

[–] foo@feddit.uk 3 points 5 months ago

They're laying off testers because they think AI can do it all now.

[–] herseycokguzelolacak@lemmy.ml 11 points 5 months ago* (last edited 5 months ago) (2 children)

If your website or service was disrupted by the Cloudflare outage, then you only have yourself to blame.

The Crowdstrike incident from last year was a wakeup call to abandon centralized services like Cloudflare. If you missed that lesson, then it is on you.

[–] groet@feddit.org 6 points 5 months ago

Yes but no. If you use a different service for the same purpose as you would use cloudflare you will be just as offline if they make a mistake. The difference is just that with a centralized player, everyone is offline at the same time. For the individual websites that does not matter.

[–] MonkRome@lemmy.world 3 points 5 months ago

Anyone using any technology can miss something and end up in the same spot. I think the real takeaway is that there is way too much consolidation of our technology.

[–] DaMummy@lemmy.world 5 points 5 months ago (2 children)

Why's he saying it's not an attack? Sounds like he's protesting too much.

[–] tonytins@pawb.social 17 points 5 months ago* (last edited 5 months ago)

It's not the first time Cloudflare has shot themselves in the foot.

[–] grumpasaurusrex@lemmy.world 11 points 5 months ago (1 children)

There's nothing to be gained from Cloudflare lying about this. It honestly makes them look worse if the outage was caused internally vs if it had been due to an attack

load more comments (1 replies)

[–] A_norny_mousse@feddit.org 4 points 5 months ago (3 children)

a routine configuration change

Honest question (I don't work in IT): this sounds like a contradiction or at the very least deliberately placating choice of words. Isn't a config change the opposite of routine?

[–] monkeyslikebananas2@lemmy.world 7 points 5 months ago (2 children)

Not really. Sometimes there are processes designed where engineers will make a change as a reaction or in preparation for something. They could have easily made a mistake when making a change like that.

[–] 123@programming.dev 3 points 5 months ago* (last edited 5 months ago)

E.g.: companies that advertise on a large sporting event might preemptively scale up (maybe warm up depending on language) their servers in preparation for a large load increase following some ad or mention of a coupon or promo code. Failure to capture the market it could generate would be seen as wasted $$$

Edit: auto-scale does not count on non essential products, people would not come back if the website failed to load on the first attempt.

[–] NotMyOldRedditName@lemmy.world 1 points 5 months ago* (last edited 5 months ago) (1 children)

I don't think it was a bug making the configuration change, I think there was a bug as a result of that change.

That specific combination of changes may not have been tested, or applied in production for months, and it just happened to happen today when they were needed for the first time since an update some time ago, hence the latent part.

But they do changes like that routinely.

[–] monkeyslikebananas2@lemmy.world 1 points 5 months ago

Yeah, I just read the postmortem. My response was more about the confusion that any configuration change is inherently non-routine.

[–] floquant@lemmy.dbzer0.com 2 points 5 months ago

No, in "DevOps" environments "configuration changes" is most of what you do every day

[–] fushuan@lemmy.blahaj.zone 2 points 5 months ago (1 children)

They probably mean that they did a change in a config file that is uploaded in their weekly or bi-weekly change window, and that that file was malformed for whichever reason that made the process that reads it crash. The main process depends on said process, and all the chain failed.

Things to improve:

make the pipeline more resilient, if you have a "bot detection module" that expects a file,and that file is malformed, it shouldn't crash the whole thing: if the bot detection module crahses, control it, fire an alert but accept the request until fixed.
Have a control of updated files to ensure that nothing outside of expected values and form is uploaded: this file does not comply with the expected format, upload fails and prod environment doesn't crash.
Have proper validation of updated config files to ensure that if something is amiss, nothing crashes and the program makes a controlled decision: if file is wrong, instead of crashing the module return an informed value and let the main program decide if keep going or not.

I'm sure they have several of these and sometimes shit happens, but for something as critical as CloudFlare to not have automated integration tests in a testing environment before anything touches prod is pretty bad.

[–] groet@feddit.org 5 points 5 months ago (1 children)

it shouldn't crash the whole thing: if the bot detection module crahses, control it, fire an alert but accept the request until fixed.

Fail open vs fail closed. Bot detection is a security feature. If the security feature fails, do you disable it and allow unchecked access to the client data? Or do you value Integrity over Availability

Imagine the opposite: they disable the feature and during that timeframe some customers get hacked. The hacks could have been prevented by the Bot detection (that the customer is paying for).

Yes, bot detection is not the most critical security feature and probably not the reason someone gets hacked but having "fail closed" as the default for all security features is absolutely a valid policy. Changing this policy should not be the lesson from this disasters.

[–] fushuan@lemmy.blahaj.zone 1 points 5 months ago (1 children)

You don't get hacking protection from bots, you get protection from DDoS attacks. Yeah some customers would have gone down, instead everyone went down... I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what's happening. That decision might have been to clo

You can keep the policy and inform everyone much better about what's happening. Half a day is a wild amount of downtime if it were properly managed.

Yes, bot detection is not the most critical....

So you agree that if this were controlled instead of open crahsing everything them being able to make an informed decision and opening or closing things, with the suggestion of opening in the case of not detection is the correct approach. What's the point of your complaint if you do agree? C'mon.

load more comments (1 replies)

[–] floquant@lemmy.dbzer0.com 4 points 5 months ago

Did they though? Aside from the "every outage is a latent bug" angle, from their postmortem it doesn't seem to me like they tried to blame it on anything but their failure to contain the spread of (and timely diagnose) the issue

[+] falseWhite@lemmy.world 1 points 5 months ago

[deleted]

Technology

Our Rules

Approved Bots