
This global Microsoft Outage

496 replies

SSpratt · 19/07/2024 09:10

https://www.bbc.co.uk/news/live/cnk4jdwp49et

Any worries? It is chaos out there by the look of the news.

My experience is that I’m not able to work today and had trouble using my debit card this morning. The transaction eventually went through but it’s not showing on my account.

Planes grounded as mass worldwide IT outage hits airlines, media and banks

The cause of the outage is unclear - but Microsoft says it's taking "mitigation actions".


HowardTJMoon · 23/07/2024 12:14

Ok, so let's run with this amazing idea.

Right now CrowdStrike Inc has a market capitalisation of about $70b. Last year the CEO got paid around $50m. Just to give a flavour of the kind of money at stake here.

Putin paid some large sum to get a bunch of otherwise diligent people to throw the game by pushing out a fatally flawed update. That caused a bunch of sysadmins around the world to have a really shitty weekend, a bigger bunch of people to be delayed in airports etc, and some relatively small inconvenience to a bunch more. Most people didn't even notice.

The only long-term effect is that a few people may lose their jobs, CrowdStrike's going to have a difficult few years, and the average rate of alcohol consumption among sysadmins has increased by a measurable extent.

Do you think Putin got value for money?

taxguru · 23/07/2024 12:52

ntmdino · 23/07/2024 11:49

Think it through, please.

For it to be worth it for a hostile government to approach a company to do this, the company would have to have hundreds of thousands of customers, or at the very least many high-profile large customers. You don't do that by being a small company in the AV world, so - by definition - that company is going to have hundreds (if not thousands) of employees across the world. Crowdstrike, for example, has nearly 8000 employees.

For every single one of those employees, you'd have to bring them into the conspiracy, or ensure that they didn't notice what you were doing with the company's primary product.

All it would take is one developer out of hundreds/thousands - all extremely well-versed in security practices and threat detection - to spot something awry, and everybody's going to jail.

It's just not realistic.

But according to the reports, this cock-up was a simple one, caused by one person - there's nothing to suggest lots of people were involved and all made the same mistake. If that's really true, and one mistake by one person really can cause the disastrous outages of the last few days, then presumably one person could be bribed to do something similar? At the very least, it should be a warning that these things need to be checked a lot more thoroughly, by a lot more people, to prevent it happening again - and, more importantly, to ensure there aren't any "weak links" where someone could be bribed to do something catastrophic.

IncessantNameChanger · 23/07/2024 14:11

You can't pre-empt every mistake. Even if you wanted to double up on staff to double-check everything, you'd still miss stuff. The biggest outage we had took out everything that ran on Unix. It was a global patch that worked globally, but on our platform, turning off the things needed to do the patch took us out for 24 hours. No other company had our unique set-up (running versions with versions). IT is messy. We had insanely clever people working on it.

Someone asked me why we can't have unique operating systems just for critical systems. I can't even start on how this isn't feasible. Plus, what's a critical system? I worked on critical systems that most of MN wouldn't class as critical.

There are industry-standard, robust, tried-and-tested procedures in place, and companies still cock up even when they follow them.

roses321 · 23/07/2024 14:26

IncessantNameChanger · 23/07/2024 14:11

You can't pre-empt every mistake. Even if you wanted to double up on staff to double-check everything, you'd still miss stuff. The biggest outage we had took out everything that ran on Unix. It was a global patch that worked globally, but on our platform, turning off the things needed to do the patch took us out for 24 hours. No other company had our unique set-up (running versions with versions). IT is messy. We had insanely clever people working on it.

Someone asked me why we can't have unique operating systems just for critical systems. I can't even start on how this isn't feasible. Plus, what's a critical system? I worked on critical systems that most of MN wouldn't class as critical.

There are industry-standard, robust, tried-and-tested procedures in place, and companies still cock up even when they follow them.

Someone asking you that hasn't thought through their stupid question.

There are a million reasons why and actually you're multiplying the issues with compatibility, patching and maintenance and that's just off the top of my head.

I think this issue occurred due to a font change or something? Or a logo change? And that's why it didn't get put through testing. Madness isn't it. How one change can cause such a huge problem.

InfoSecInTheCity · 23/07/2024 14:52

I think this issue was probably caused by something really stupid. If I had to guess, I would guess that they did QA it, but that their testing plan didn't include rebooting the PC after install. So they pushed it to the test environment, made sure it was feeding back information correctly, made sure it was doing everything it was supposed to do, but didn't reboot - so they didn't see that it caused a problem during start-up.

It will be something that silly. Something that possibly would have been spotted if multiple eyes reviewed the test plan, or if multiple testers were conducting QA, but even with multiple eyes, it could still have been missed.

I do not for a minute think this was a co-ordinated plan to bring about chaos.

roses321 · 23/07/2024 15:39

InfoSecInTheCity · 23/07/2024 14:52

I think this issue was probably caused by something really stupid. If I had to guess, I would guess that they did QA it, but that their testing plan didn't include rebooting the PC after install. So they pushed it to the test environment, made sure it was feeding back information correctly, made sure it was doing everything it was supposed to do, but didn't reboot - so they didn't see that it caused a problem during start-up.

It will be something that silly. Something that possibly would have been spotted if multiple eyes reviewed the test plan, or if multiple testers were conducting QA, but even with multiple eyes, it could still have been missed.

I do not for a minute think this was a co-ordinated plan to bring about chaos.

I just said what the issue was caused by.

ntmdino · 23/07/2024 16:25

taxguru · 23/07/2024 12:52

But according to the reports, this cock-up was a simple one, caused by one person - there's nothing to suggest lots of people were involved and all made the same mistake. If that's really true, and one mistake by one person really can cause the disastrous outages of the last few days, then presumably one person could be bribed to do something similar? At the very least, it should be a warning that these things need to be checked a lot more thoroughly, by a lot more people, to prevent it happening again - and, more importantly, to ensure there aren't any "weak links" where someone could be bribed to do something catastrophic.

Yes, the cock-up was a simple one - but it relied on a number of people to collaborate in that mistake.

Typically, a release in a company like that follows a process like this:

  • Research into the problem being solved (eg new virus)
  • Development (creating the signature for defeating it)
  • Code review (common sense check from other devs)
  • QA (verify that the new signature solves the problem)
  • Regression testing (verify that the new signature doesn't break any existing functionality)
  • Merge into the next update release
  • More regression testing of the final release
  • Release

There are at least three teams involved there, possibly five, and the failure has to get past all of them before it escapes. It's possible for it to happen by chance, but to ensure that it happens deliberately without a single person catching it, and managing to pass it off as a mistake if they do? Practically impossible.
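
To put it another way - and this is just a rough sketch with made-up stage names, not Crowdstrike's actual pipeline - the release only ships if every gate passes, so a deliberate saboteur has to get past all of them without anyone noticing:

```python
# Rough illustration only: invented stage names, not any vendor's real pipeline.
# The point is that a change has to clear every gate before it reaches customers.
from typing import Callable, List, Tuple

def run_release(change: str, gates: List[Tuple[str, Callable[[str], bool]]]) -> bool:
    """Push a change through each gate in order; it only ships if all of them pass."""
    for name, gate in gates:
        if not gate(change):
            print(f"{change!r} blocked at: {name}")
            return False
    print(f"{change!r} shipped to customers")
    return True

if __name__ == "__main__":
    # Toy gates - each one independently inspects the change.
    gates = [
        ("code review", lambda c: "obvious sabotage" not in c),
        ("QA", lambda c: "signature" in c),
        ("regression tests", lambda c: "breaks boot" not in c),
        ("final release regression", lambda c: "breaks boot" not in c),
    ]
    run_release("new malware signature", gates)       # clears every gate
    run_release("signature that breaks boot", gates)  # caught at regression
```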

roses321 · 23/07/2024 16:36

ntmdino · 23/07/2024 16:25

Yes, the cock-up was a simple one - but it relied on a number of people to collaborate in that mistake.

Typically, a release in a company like that follows a process like this:

  • Research into the problem being solved (eg new virus)
  • Development (creating the signature for defeating it)
  • Code review (common sense check from other devs)
  • QA (verify that the new signature solves the problem)
  • Regression testing (verify that the new signature doesn't break any existing functionality)
  • Merge into the next update release
  • More regression testing of the final release
  • Release

There are at least three teams involved there, possibly five, and the failure has to get past all of them before it escapes. It's possible for it to happen by chance, but to ensure that it happens deliberately without a single person catching it, and managing to pass it off as a mistake if they do? Practically impossible.

It was a font change, ffs. That is all. It happens.

ntmdino · 23/07/2024 17:15

roses321 · 23/07/2024 16:36

It was a font change, ffs. That is all. It happens.

And yet, it still requires rigorous testing. Obviously.

roses321 · 23/07/2024 17:30

ntmdino · 23/07/2024 17:15

And yet, it still requires rigorous testing. Obviously.

Well obviously not! Otherwise it wouldn't have happened.

ntmdino · 23/07/2024 17:32

roses321 · 23/07/2024 17:30

Well obviously not! Otherwise it wouldn't have happened.

I said it requires rigorous testing, not that it was done.

roses321 · 23/07/2024 17:40

ntmdino · 23/07/2024 17:32

I said it requires rigorous testing, not that it was done.

Do you work in IT? If you don't, I would suggest not commenting on what should or shouldn't be done.

ntmdino · 23/07/2024 17:44

roses321 · 23/07/2024 17:40

Do you work in IT? If you don't, I would suggest not commenting on what should or shouldn't be done.

I've worked in IT for about 25 years, been a developer for 20 of those, and worked in security for 6 of those.

InfoSecInTheCity · 23/07/2024 17:54

@roses321 no, actually, you didn't say what the issue was; what you said is copied and pasted below:

"I think this issue occurred due to a font change or something? Or a logo change? And that's why it didn't get put through testing. Madness isn't it. How one change can cause such a huge problem."

That wasn't a statement; it was a theory, with question marks showing you didn't know if it was correct at all.

Crowdstrike's CEO has described it as a "content update", and various media sites have put forward the supposition that this "could have been something as small as a font or logo change", but that hasn't been confirmed.

aNewYorkerInLondon · 23/07/2024 17:58

roses321 · 23/07/2024 14:26

Someone asking you that hasn't thought through their stupid question.

There are a million reasons why and actually you're multiplying the issues with compatibility, patching and maintenance and that's just off the top of my head.

I think this issue occurred due to a font change or something? Or a logo change? And that's why it didn't get put through testing. Madness isn't it. How one change can cause such a huge problem.

I tell everyone I teach (yes, I work in tech) that there are no stupid questions. Questions are necessary to learn. If you don't question as you learn, you are memorizing, not understanding.

Anyway.

You mentioned that “compatibility, patching and maintenance” are multiplying the problem. I took that part of your message to mean that more uniqueness in systems would reduce the severity of bad code roll-outs (or cyber attacks). While that's absolutely true - more uniqueness does help - it is also prohibitively expensive. Even as it is now, tech is expensive. It's one of the few industries where the “do-ers” (the programmers) typically make a lot more than the managers do.

Adding uniqueness would amplify that by many factors, as well as create massive key-person risk and less mobility between jobs. It would also make the tech impossible to afford.

If there were simple or cost-effective ways to make mistakes impossible, engineers would have done it already. They analyze technical problems for fun (at least the ones I work with do).

Anyway, I hope you’re having a good day!

KnickerlessParsons · 23/07/2024 19:37

"I think this issue occurred due to a font change or something? Or a logo change? And tha'ts why it didn't get put through testing. Madness isn't it. How one change can cause such a huge problem."

I've heard the same. From our tech people.

roses321 · 24/07/2024 12:27

ntmdino · 23/07/2024 17:44

I've worked in IT for about 25 years, been a developer for 20 of those, and worked in security for 6 of those.

Then I'm unsure as to why you're standing outside the greenhouse throwing stones at it.

Have you read their PIR? If you haven't, I suggest you do.

ntmdino · 24/07/2024 14:38

roses321 · 24/07/2024 12:27

Then i'm unsure as to why you're standing outside the greenhouse throwing stones at it.

Have you read their PIR? If you haven't, i suggest you do.

Yes, I've read it. It shows incredibly lax testing procedures - they relied on the results of their content validator instead of properly testing the update.

At a minimum any software that loads at boot time should always be tested through a reboot cycle after deployment to the test instance. This is basic, basic stuff that any junior developer would be able to tell you.
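
Even a crude harness along these lines would have caught it. This is purely a sketch - the host name and the deploy/health commands are invented placeholders, and the real thing would obviously be a Windows box running the actual sensor - but the principle is just "deploy, reboot, check it comes back":

```python
# Sketch of a "deploy, reboot, verify" smoke test for anything that loads at boot.
# TEST_HOST, deploy-update and example-agent.service are hypothetical placeholders.
import subprocess
import sys
import time

TEST_HOST = "test-vm-01"  # disposable test instance

def run(cmd) -> bool:
    """Run a command; True only if it exits cleanly."""
    try:
        return subprocess.run(cmd, timeout=300).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def smoke_test(update_path: str) -> bool:
    # 1. Deploy the update to the test instance (placeholder command).
    if not run(["ssh", TEST_HOST, f"deploy-update {update_path}"]):
        return False
    # 2. Reboot it - the step that was skipped.
    run(["ssh", TEST_HOST, "sudo reboot"])
    time.sleep(120)  # give it time to come back up (or to crash-loop)
    # 3. Verify the machine actually boots and the agent comes up healthy.
    return run(["ssh", TEST_HOST, "systemctl is-active example-agent.service"])

if __name__ == "__main__":
    sys.exit(0 if smoke_test(sys.argv[1]) else 1)
```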

I'm being critical because it's stupid shit like this that gives us all a bad name, and I'm constantly running up against idiots who say, "Oh, this is a minor change, it doesn't need testing, we need to get it out there as soon as possible...". Usually managers who won't be the ones who have to spend their evenings and weekends cleaning up the resulting mess, or slack developers who have no respect for others' time.

Don't forget: this wasn't a one-off mistake. Their incident report states it quite clearly: their defined process for handling these template updates was not to test them prior to deployment beyond basic validation, and their solution to the problem is "We'll test the IPC updates in future".

This is the result of the company - as a whole - not understanding its own software and the catastrophic impact its failure would have on its customers, or understanding it perfectly well but deciding that the savings on testing time were greater than the insurance liability.

That's something that everybody should be pissed off about - especially other people who work in IT.

HowardTJMoon · 24/07/2024 15:24

Absolutely, particularly given that this update provided parameters for drivers that were running in the damn kernel. And doubly-particularly given that those kernel-level drivers clearly weren't sanitising their inputs properly.
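
By "sanitising inputs" I mean something like the sketch below. It's Python for readability (the real driver is C running in the kernel) and the file layout is completely made up, since the actual channel-file format is Crowdstrike's own - the point is just that you validate every count and offset against the data you actually have before you dereference anything:

```python
# Toy illustration of defensive parsing of an untrusted content file.
# The "CNT1" format here is invented purely for the example.
import struct

HEADER = struct.Struct("<4sI")  # magic, entry_count
ENTRY = struct.Struct("<II")    # offset, length

def parse_content(blob: bytes) -> list:
    """Return the entries in the file, refusing anything that points out of bounds."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != b"CNT1":
        raise ValueError("bad magic")
    table_end = HEADER.size + count * ENTRY.size
    if table_end > len(blob):
        raise ValueError("entry table runs past end of file")
    entries = []
    for i in range(count):
        off, length = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        if off + length > len(blob):  # the check a crashing parser skips
            raise ValueError(f"entry {i} points outside the file")
        entries.append(blob[off:off + length])
    return entries
```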

roses321 · 24/07/2024 16:38

ntmdino · 24/07/2024 14:38

Yes, I've read it. It shows incredibly lax testing procedures - they relied on the results of their content validator instead of properly testing the update.

At a minimum any software that loads at boot time should always be tested through a reboot cycle after deployment to the test instance. This is basic, basic stuff that any junior developer would be able to tell you.

I'm being critical because it's stupid shit like this that gives us all a bad name, and I'm constantly running up against idiots who say, "Oh, this is a minor change, it doesn't need testing, we need to get it out there as soon as possible...". Usually managers who won't be the ones who have to spend their evenings and weekends cleaning up the resulting mess, or slack developers who have no respect for others' time.

Don't forget: this wasn't a one-off mistake. Their incident report states it quite clearly: their defined process for handling these template updates was not to test them prior to deployment beyond basic validation, and their solution to the problem is "We'll test the IPC updates in future".

This is the result of the company - as a whole - not understanding its own software and the catastrophic impact its failure would have on its customers, or understanding it perfectly well but deciding that the savings on testing time were greater than the insurance liability.

That's something that everybody should be pissed off about - especially other people who work in IT.


I always love reading replies like this from people who clearly haven't ever made a mistake in their life.

I am questioning my hiring decisions as an IT Manager now, because I didn't realise how many devs, engineers and IT support staff there were out there who just knew everything and had all the answers.

Hindsight is a wonderful thing; it allows people to get on their high horse, make a ton of assumptions and act as though they are the higher power in a situation... that THEY would never have done this.

You cannot know the ins and outs unless you work for Crowdstrike, so sitting here, coming up to a week post-incident, moaning and bitching that you would never have done something so stupid, and reading the PIR and making assumptions, really does just make me yawn.

Human error happens, I suggest you deal with that at some point in your life, because it will be your error one day...

ntmdino · 24/07/2024 17:12

roses321 · 24/07/2024 16:38

I always love reading replies like this from people who clearly haven't ever made a mistake in their life.

I am questioning my hiring decisions as an IT Manager now, because I didn't realise how many devs, engineers and IT support staff there were out there who just knew everything and had all the answers.

Hindsight is a wonderful thing; it allows people to get on their high horse, make a ton of assumptions and act as though they are the higher power in a situation... that THEY would never have done this.

You cannot know the ins and outs unless you work for Crowdstrike, so sitting here, coming up to a week post-incident, moaning and bitching that you would never have done something so stupid, and reading the PIR and making assumptions, really does just make me yawn.

Human error happens, I suggest you deal with that at some point in your life, because it will be your error one day...

I didn't actually make any assumptions based on that PIR - it says it all quite clearly. This isn't complicated stuff, and skipping the testing on an update like this is something that even the most inexperienced developers would know is absolutely wrong and insufficient.

Of course I've made mistakes, but the act of deploying this update to customer machines without testing wasn't a mistake. They state quite clearly that the IPC template update went through the normal process for content updates, which is to rely on their validator and not put it through proper testing. That's a decision, made as a matter of company policy, to skip a process that should have been required - not a mistake.

I would totally expect whistleblowers to come out of the woodwork with evidence that they told managers these updates should be tested properly, probably during the many lawsuits that are going to be filed as a result of this.

Yawn all you like. I'd imagine the managers responsible for that decision did the same while they were being told what they should be doing.
