The CrowdStrike Outage and Market-Driven Brittleness

Friday’s massive internet outage, caused by a mid-sized tech company called CrowdStrike, disrupted major airlines, hospitals, and banks. Nearly 7,000 flights were canceled. It took down 911 systems and factories, courthouses, and television stations. Tallying the total cost will take time. The outage affected more than 8.5 million Windows computers, and the cost will surely be in the billions of dollars, easily matching the most costly previous cyberattacks, such as NotPetya.

The catastrophe is yet another reminder of how brittle global internet infrastructure is. It’s complex, deeply interconnected, and filled with single points of failure. As we experienced last week, a single problem in a small piece of software can take large swaths of the internet and global economy offline.

The brittleness of modern society isn’t confined to tech. We can see it in many parts of our infrastructure, from food to electricity, from finance to transportation. This is often a result of globalization and consolidation, but not always. In information technology, brittleness also results from the fact that hundreds of companies, none of which you’ve heard of, each perform a small but essential role in keeping the internet running. CrowdStrike is one of those companies.

This brittleness is a result of market incentives. In enterprise computing—as opposed to personal computing—a company that provides computing infrastructure to enterprise networks is incentivized to be as integral as possible, to have as deep access into their customers’ networks as possible, and to run as leanly as possible.

Redundancies are unprofitable. Being slow and careful is unprofitable. Being less embedded in and less essential and having less access to the customers’ networks and machines is unprofitable—at least in the short term, by which these companies are measured. This is true for companies like CrowdStrike. It’s also true for CrowdStrike’s customers, who also didn’t have resilience, redundancy, or backup systems in place for failures such as this because they are also an expense that affects short-term profitability.

But brittleness is profitable only when everything is working. When a brittle system fails, it fails badly. The cost of failure to a company like CrowdStrike is a fraction of the cost to the global economy. And there will be a next CrowdStrike, and one after that. The market rewards short-term profit-maximizing systems, and doesn’t sufficiently penalize such companies for the impact their mistakes can have. (Stock prices depress only temporarily. Regulatory penalties are minor. Class-action lawsuits settle. Insurance blunts financial losses.) It’s not even clear that the information technology industry could exist in its current form if it had to take into account all the risks such brittleness causes.

The asymmetry of costs is largely due to our complex interdependency on so many systems and technologies, any one of which can cause major failures. Each piece of software depends on dozens of others, typically written by other engineering teams sometimes years earlier on the other side of the planet. Some software systems have not been properly designed to contain the damage caused by a bug or a hack of some key software dependency.

These failures can take many forms. The CrowdStrike failure was the result of a buggy software update. The bug didn’t get caught in testing and was rolled out to CrowdStrike’s customers worldwide. Sometimes, failures are deliberate results of a cyberattack. Other failures are just random, the result of some unforeseen dependency between different pieces of critical software systems.

Imagine a house where the drywall, flooring, fireplace, and light fixtures are all made by companies that need continuous access and whose failures would cause the house to collapse. You’d never set foot in such a structure, yet that’s how software systems are built. It’s not that 100 percent of the system relies on each company all the time, but 100 percent of the system can fail if any one of them fails. But doing better is expensive and doesn’t immediately contribute to a company’s bottom line.

Economist Ronald Coase famously described the nature of the firm, any business, as a collection of contracts. Each contract has a cost. Performing the same function in-house also has a cost. When the costs of maintaining the contract are lower than the cost of doing the thing in-house, then it makes sense to outsource: to another firm down the street or, in an era of cheap communication and coordination, to another firm on the other side of the planet. The problem is that both the financial and risk costs of outsourcing can be hidden—delayed in time and masked by complexity—and can lead to a false sense of security when companies are actually entangled by these invisible dependencies. The ability to outsource software services became easy a little over a decade ago, due to ubiquitous global network connectivity, cloud and software-as-a-service business models, and an increase in industry- and government-led certifications and box-checking exercises.

This market force has led to the current global interdependence of systems, far and wide beyond their industry and original scope. It’s why flying planes depends on software that has nothing to do with the avionics. It’s why, in our connected internet-of-things world, we can imagine a similar bad software update resulting in our cars not starting one morning or our refrigerators failing.

This is not something we can dismantle overnight. We have built a society based on complex technology that we’re utterly dependent on, with no reliable way to manage that technology. Compare the internet with ecological systems. Both are complex, but ecological systems have deep complexity rather than just surface complexity. In ecological systems, there are fewer single points of failure: If any one thing fails in a healthy natural ecosystem, there are other things that will take over. That gives them a resilience that our tech systems lack.

We need deep complexity in our technological systems, and that will require changes in the market. Right now, the market incentives in tech are to focus on how things succeed: A company like CrowdStrike provides a key service that checks off required functionality on a compliance checklist, which makes it all about the features that they will deliver when everything is working. That’s exactly backward. We want our technological infrastructure to mimic nature in the way things fail. That will give us deep complexity rather than just surface complexity, and resilience rather than brittleness.

How do we accomplish this? There are examples in the technology world, but they are piecemeal. Netflix is famous for its Chaos Monkey tool, which intentionally causes failures to force the systems (and, really, the engineers) to be more resilient. The incentives don’t line up in the short term: It makes it harder for Netflix engineers to do their jobs and more expensive for them to run their systems. Over years, this kind of testing generates more stable systems. But it requires corporate leadership with foresight and a willingness to spend in the short term for possible long-term benefits.
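The Chaos Monkey idea can be sketched in a few lines of Python. This is a toy simulation of the general technique, not Netflix's actual tool: kill random replicas of a service and check that the remaining redundant ones keep it up.

```python
import random

class Service:
    """A service with N redundant replicas; it stays up while any replica lives."""
    def __init__(self, replicas):
        self.alive = [True] * replicas

    def is_up(self):
        return any(self.alive)

    def restore(self):
        self.alive = [True] * len(self.alive)

def chaos_test(service, rounds, kills_per_round, seed=0):
    """Repeatedly kill random replicas (with replacement) and measure how
    often the service survives; a low rate means too little redundancy."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(rounds):
        service.restore()
        for _ in range(kills_per_round):
            victim = rng.randrange(len(service.alive))
            service.alive[victim] = False   # simulated instance failure
        survived += service.is_up()
    return survived / rounds

# Three replicas, two random kills per round: at most two replicas can die,
# so the service always survives.
print(chaos_test(Service(replicas=3), rounds=1000, kills_per_round=2))   # 1.0
# One replica is a single point of failure: it never survives a kill.
print(chaos_test(Service(replicas=1), rounds=1000, kills_per_round=1))   # 0.0
```

The point of the real tools is sociological as much as technical: because failures are injected constantly, engineers are forced to build the redundancy that keeps the first survival rate at 1.0.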

Last week’s update wouldn’t have been a major failure if CrowdStrike had rolled out this change incrementally: first 1 percent of their users, then 10 percent, then everyone. But that’s much more expensive, because it requires a commitment of engineer time for monitoring, debugging, and iterating, and it can take months to do correctly for complex and mission-critical software. An executive today will look at the market incentives and correctly conclude that it’s better for them to take the chance than to “waste” the time and money.
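A staged rollout of this kind is simple to express in code. The Python sketch below uses illustrative numbers only (the stage sizes, the 2 percent failure threshold, and the 8.5-million-machine fleet); it is not CrowdStrike's actual deployment process:

```python
def deploy_in_stages(fleet_size, is_update_buggy, stages=(0.01, 0.10, 1.0),
                     failure_threshold=0.02):
    """Roll an update out to growing fractions of a fleet, halting
    (and rolling back) if the observed failure rate at any stage
    exceeds the threshold."""
    updated = 0
    for fraction in stages:
        target = int(fleet_size * fraction)
        batch = target - updated          # machines newly updated this stage
        # In a real system this would be telemetry from the batch;
        # here we simulate: a buggy update crashes every machine it reaches.
        failures = batch if is_update_buggy else 0
        updated = target
        if batch and failures / batch > failure_threshold:
            return ("rolled back", updated)   # damage limited to this stage
    return ("fully deployed", updated)

# A buggy update is caught at the 1 percent canary stage:
print(deploy_in_stages(8_500_000, is_update_buggy=True))    # ('rolled back', 85000)
# A good update proceeds to the whole fleet:
print(deploy_in_stages(8_500_000, is_update_buggy=False))   # ('fully deployed', 8500000)
```

The expense the essay describes lives in the monitoring step: in real systems, promoting each stage means telemetry, dashboards, and engineers watching before the next fraction of the fleet is touched.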

The usual tools of regulation and certification may be inadequate, because failure of complex systems is inherently also complex. We can’t describe the unknown unknowns involved in advance. Rather, what we need to codify are the processes by which failure testing must take place.

We know, for example, how to test whether cars fail well. The National Highway Traffic Safety Administration crashes cars to learn what happens to the people inside. But cars are relatively simple, and keeping people safe is straightforward. Software is different. It is diverse, is constantly changing, and has to continually adapt to novel circumstances. We can’t expect that a regulation that mandates a specific list of software crash tests would suffice. Again, security and resilience are achieved through the process by which we fail and fix, not through any specific checklist. Regulation has to codify that process.

Today’s internet systems are too complex to hope that if we are smart and build each piece correctly the sum total will work right. We have to deliberately break things and keep breaking them. This repeated process of breaking and fixing will make these systems reliable. And then a willingness to embrace inefficiencies will make these systems resilient. But the economic incentives point companies in the other direction, to build their systems as brittle as they can possibly get away with.

This essay was written with Barath Raghavan, and previously appeared on Lawfare.com.

Posted on July 25, 2024 at 2:37 PM • 28 Comments

Comments

Bruce Schneier July 25, 2024 2:38 PM

We originally wrote this for The New York Times, but it was killed after the Biden/Harris news sucked up all the available op ed space.

Anonymous July 25, 2024 3:50 PM

There are plenty of companies that are subject to market incentives but use modern processes. What makes CrowdStrike different from, say, Cloudflare, is that their business is guaranteed by government fiat.

As long as the compliance checklist requires airlines and hospitals to install shitty third-party drivers, it does not matter whether corporate leadership has the foresight and willingness to spend money short-term for long-term benefits. Netflix’s QA practices did not come about because they were forced by a senator, but a senator may have contributed to one or three critical COBOL systems still running in 2024.

Earle N July 25, 2024 5:47 PM

Last week’s update wouldn’t have been a major failure if CrowdStrike had rolled out this change incrementally

Or, as in any sane system design, if they’d let the admins choose how to roll it out. In other words, if admins had told them “hell, no!” when presented with the idea of letting some external party push updates to private infrastructure.

I’m not convinced it’s entirely market-driven. Windows defaults to auto-updating, but I understand that network administrators can change that, precisely so they can test on some subset of systems first. Also, they’ll know enough not to do something stupid, like putting both redundant domain controllers in the same “random 1%” of systems. Some keep real or virtual machines around for the explicit purpose of pre-testing stuff like this.

Steven Andrés July 25, 2024 5:56 PM

Hi Bruce – great commentary. Wanted to alert you to a typo that’s likely tied to a copy/paste from a document with comments that changed “the firm” to “any business” in this sentence:

Economist Ronald Coase famously described the nature of the firm­any business­as a collection of contracts.

Roosevelt Guck July 25, 2024 7:13 PM

I heard that MSFT had published an API to allow software to access kernel-mode functionality without having to run code in kernel space, as CrowdStrike’s software does. The EU blocked MSFT’s API on the grounds that it would give them an unfair advantage over the competition. So the CrowdStrike software runs in kernel mode, and it took down the entire server when the error occurred, instead of using the API. The problem was that the software ran in kernel space.

nobody July 25, 2024 7:24 PM

It was said over 40 years ago, and it still needs to be said today: If builders built buildings the way programmers write programs, the first woodpecker that came along would destroy civilization.

That said, throwing all end-point or point-of-sale computers into a boot loop, requiring hands-on hardware ON SITE, well, that will take time to fix. It’s not like they can roll out a patch and fix all the computers with the push of a button.

I think we’re overlooking something here. Workers are paid poorly, overworked, treated horribly, etc. On the one hand, a worker “accidentally” rolls out something like this, and the stock markets shift dramatically. Easy money for investors who know. Just hard for the worker not to get caught cashing out. On the other hand, if they owe somebody money. Somebody unsavory. Perhaps gambling debts? Well, it could be a way to pay off their debts. Food for thought…

Anonymous July 25, 2024 10:41 PM

Humans are just a bunch of sloppy, messy and imperfect creatures.

It’s a complex system. It does not require much thought to figure out that something will go wrong. Your buildings are indeed subject to defects. Many are old, standards have holes, tradesmen sometimes do shoddy work, owners neglect maintenance…

List of devices using leaked platform keys.

‘https://github.com/binarly-io/SupplyChainAttacks/blob/main/PKfail/ImpactedDevices.md

“This compromises the entire security chain, from firmware to the operating system.”

Spanning over 12 years, almost 900 devices and 22 keys.
https://www.binarly.io/blog/pkfail-untrusted-platform-keys-undermine-secure-boot-on-uefi-ecosystem

ResearcherZero July 25, 2024 11:28 PM

The general idea of security standards is to prevent someone disabling your critical infrastructure. A senator sure didn’t recommend the idea. Standards generally take decades to be implemented by governments after multiple failures and problems within sectors.

There are a bunch of inquiries into why events took place. Then a series of recommendations. Few of the recommendations are implemented by government, after lobbying from industry. Following repeated examples of the same problem and repeated inquiries, further recommendations are eventually implemented. This only takes place after a long deliberation process where interested parties provide input and arguments for and against, or changes and additions to the recommendations. Later governments might repeal it.

Finally after a lot more lobbying from industry and other interested parties, maybe the recommendations are passed. First though, something usually must fail or collapse.

FrostyGoop “is the first ICS-specific malware that uses Modbus communications to achieve an impact on operational technology (OT). Given the ubiquity of the Modbus protocol in industrial environments, this malware can potentially cause disruptions across all industrial sectors by interacting with legacy and modern systems.”

‘https://hub.dragos.com/hubfs/Reports/Dragos-FrostyGoop-ICS-Malware-Intel-Brief-0724_.pdf

‘https://www.wired.com/story/russia-ukraine-frostygoop-malware-heating-utility/

ResearcherZero July 26, 2024 2:35 AM

The general idea of government-approved security companies is resources and experience.
Critical infrastructure now requires a more sophisticated and varied skill set to manage.
Your local computer repair man may not have the manpower necessary for large networks.

In the old days infrastructure was much simpler. Much of it was built by convict labour. Many of those who built the roads, courthouses and police stations died alone.
Often the cause of death was from accidents, excessive alcoholism and suicide.

For example, my old fake email address was registered as a domain by CyberTec.

If you click on the name for ‘https://www.schneier.com/blog/archives/2024/07/the-crowdstrike-outage-and-market-driven-brittleness.html/#comment-439587

… it will now take you to CyberTec WebMail – Login
‘https://mail.mysmartermail.com/interface/root#/login

In the old days all you could register and click on was a pint of lager. It certainly would not connect you to your family back in the old country. The meals were terrible and often fell far short of proper nutrition. You could only keep drinking your sorrows away.

Michael Humphrey July 26, 2024 3:54 AM

One issue with “rolling out incrementally” or “allowing admins to decide” – this was an anti-malware definition update. The software bug had been present for a long time, but was only triggered by the new definition file.
So there’s a dilemma here – do you react quickly to new malware and update definitions as soon as they’re available, risking triggering bugs, or do you update more cautiously, and risk malware infecting your systems undetected?

Clive Robinson July 26, 2024 4:02 AM

@ Bruce, ALL,

Re : All in one roll outs

You say,

“Last week’s update wouldn’t have been a major failure if CrowdStrike had rolled out this change incrementally: first 1 percent of their users, then 10 percent, then everyone. But that’s much more expensive”

Actually there is a flip-side cost that is potentially much higher.

It’s been noted that some malware developers can have new malware not just ready to run but actively being pushed out at people within as little as 20 minutes of “patches” starting to be pushed out.

Such a short response time from patch to new malware rollout means that incrementally rolled-out patches will have a high probability of increasing the success of malware attacks…

So for at least the past half decade patching vulnerabilities has become an unwinnable “Red Queen’s Race”.

Only the software industry will not admit it, because then they would have to accept “legal liability” for not supplying zero-vulnerability / zero-defect software etc…

So it’s a “Pays your money makes your choice” option.

Over the years people have thought me odd or even paranoid because I won’t do the trendy things etc that others do…

Even before the birth of the “Internet of Things” (IoT) I’d decided that “home automation” involving “software” was a very bad idea, which is why my washing machine uses a “cam switch timer” rather than a “microcontroller”; similarly, my gas boiler and gas cooker don’t have microcontrollers.

All of these things I can fault-find with simple test equipment and fix with multiply sourced parts when they go wrong. Microcontroller systems you cannot, because they have to be replaced by “the manufacturer”.

Which even last century was a problem if the manufacturer either no longer made the microcontroller boards or had gone out of business etc.

Now with IoT and “has to be always connected” the notion of a supply chain becomes warped to say the least.

We’ve seen a major new downside with Amazon and Chinese-manufactured electronics products. People buy high-end electronics that has to work with an Amazon or Chinese online server. When the vendor decides it is too expensive to keep running the server, everybody’s high-end electronics suddenly becomes a pile of worthless scrapped parts…

This CrowdStrike issue shows why we really should not be buying such “junk” in the first place.

Will people learn from it?

No of course not, the problem will really have to get a great deal worse before it gets even remotely close to getting better.

ResearcherZero July 26, 2024 4:51 AM

Signal boxes set on fire and cables cut.

‘https://abcnews.go.com/International/france-train-lines-hit-arson-attacks-hours-2024/story?id=112296820

cybershow July 26, 2024 5:48 AM

Blaming “the market” is a cop-out to me. Markets are indicators, not
causes.

People are ambivalent toward technology. Last Friday there were as
many cheering and celebrating a “day off” as those running around
worrying.

I think the social and psychological explanations are more important
to security than the “economic” ones as to why we have crappy, fragile
technology that’s unfit for purpose.

Despite being sold on convenience, many people – including those in
high technical positions – care little about technology if it can be
made “someone else’s problem”. The complexity burden is too painful.
That’s why we allow incompetents to remotely update our kernels, we
use software from convicted criminal monopolists, and sell our most
treasured data to see cat pictures. I think most of us still, in a
deep sense, don’t believe any of it’s “real”. We have no idea how much
others have built total dependency of day-to-day life on a precarious
house of cards.

Like under Communism when you never cared to decorate an apartment or
tend a garden owned and run by the state, people do not see technology
as “theirs”. No matter how dysfunctional, degenerate and ruined it
gets, until people have a positive stake in technology – as opposed to
the fear and protection rackets we have now – I don’t think we’ll see
much improvement in cybersecurity or long-term resilience.

You can neither force security upon people by taking away their
agency, nor can people systematically shrug responsibility for their
own security. Security is built between equal stakeholders, through
social contract, and it’s a hard won and easily ruined relation.

This essay deals with the issue of sovereignty and
why people who neglect their ownership of endpoints to external MSPs
are playing a dangerous game.

Andy

Don't Tell You July 26, 2024 7:25 AM

It all starts with 3 questions.

  1. Why do you need to connect critical business systems to the internet?
  2. Why do you need to constantly update your systems?
  3. Why does anyone think that newer is better?

In the old days, when you installed systems you might have had a buggy system, but it was in a known state; you knew the bugs and could circumvent them by other means. Now when you install systems, they are a moving target, constantly changing, and nobody knows their state. In five minutes their state can already have been changed by an update.

Also, has any business manager ever weighed the cost/benefit of using AV versus not using it at all? The AV industry runs on FUD and has long been proven not to deliver what it promises. AV causes more problems than it fixes. It can silently delete your most important files. It has enormous power; it’s like a “god mode” that can do anything.

““Antivirus is the ultimate back door,” Blake Darché, a former N.S.A. operator and co-founder of Area 1 Security. “It provides consistent, reliable and remote access that can be used for any purpose, from launching a destructive attack to conducting espionage on thousands or even millions of users.””

https://www.nytimes.com/2017/10/10/technology/kaspersky-lab-israel-russia-hacking.html

The biggest problem is: we allow this. We allow the business model that demands critical systems be constantly connected to the internet. Try installing and operating a Windows server/domain controller without an internet connection. We allow ourselves to be blackmailed by the fear that AV is a necessity. We don’t vote with our feet. We deliberately handed over control of our systems. The biggest botnet is actually our own systems with remote updates.

Wayne July 26, 2024 9:40 AM

But hey! They gave a $10 gift card for Uber Eats to each company, so we’re all good, right? Which Uber promptly cancelled because so many were given out that they thought it was fraudulent….

I wonder when Delta Airlines will get their whole act back up and running, I understand they’re still suffering. But Southwest isn’t, because they still have a lot of infrastructure on Win 95 and 3.1. I hate to ponder whether or not that’s a good thing. Now that it’s broadly known, probably not.

Mexaly July 26, 2024 11:46 AM

All systems have some level of brittleness.

The ones who recovered quickly
(or had no trouble at all)
were the ones who did their Business Continuity Planning.

It’s in your certification.

Retired now, that update would have stopped at my firewall,
been tested in the integration lab,
and skipped when it crashed in the lab.

Meanwhile, the alternate plans for downed services kick into gear.

If you read this blog, it’s what you did / would have done.

It looks like Delta was far less prepared than an organization their size should be. Maybe they’re scrimping on staff. Compared even with other unprepared organizations, they really choked hard.

The sheep and the goats are in different herds now.

anonymous July 26, 2024 1:04 PM

Can you rewrite that article swapping “operating system” for “application software”? Then we can get some economists to tell us what the prices of Microsoft Windows 12 and Server 2025, and Red Hat Enterprise Linux 10, will be.

JerryK July 26, 2024 4:45 PM

With more than four decades as a software developer, I can testify that few companies make a useful investment in testing their software. Most testers are comparatively junior staff who are usually ignored by management even if they are any good at it. In more than one project a better than usual tester told me he had been advised by management that if he wanted a better career he should try to transfer to development. Companies keep their testing cheap by encouraging any talent there to move elsewhere.

Some places I worked had no test organization at all. There it was up to us developers. We all produce bugs sometimes, and finding one’s own can be very difficult. There’s a strong bias to see what you meant to do rather than what you did.

lurker July 28, 2024 2:42 AM

Given the number of recent incidents proven to be enemy action, or to be cases of ‘https://xkcd.com/2347/, I have to confess suspicion at the haste with which this CrowdStrike fail is being put down to “accidental” failure to adequately test the update.

Clive Robinson July 28, 2024 4:54 PM

@ Bruce, ALL,

Re : Not the birthday paradox.

The birthday paradox is reasonably well known in computer security, specifically in connection with hash algorithms and their usage, where “halve the bits” or “square root of N” are given as approximations.

Patching software instinctively feels like it should be the opposite or even inverse problem.

Only it’s not…

To see why keep in the back of your mind what patching aims to do whilst reading,

https://liorsinai.github.io/mathematics/2024/07/09/birthday-covering.html
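For readers who want to check the “square root of N” rule of thumb mentioned above, the exact birthday-collision probability is easy to compute. This is the standard calculation, not anything specific to the linked post; the 32-bit hash space is just an example:

```python
import math

def collision_probability(n_outputs, k_samples):
    """Probability that k uniform samples from n possible hash outputs
    contain at least one collision (exact product form, computed in
    log space to avoid underflow)."""
    log_no_collision = sum(math.log1p(-i / n_outputs) for i in range(k_samples))
    return 1.0 - math.exp(log_no_collision)

# For an n-bit hash, a collision becomes likely after about 2**(n/2) samples:
# the "halve the bits" / "square root of N" approximation.
n = 2 ** 32                # a toy 32-bit hash space
k = int(math.sqrt(n))      # 2**16 = 65,536 samples
print(round(collision_probability(n, k), 3))   # about 0.39: already likely
```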

Clive Robinson July 29, 2024 4:30 AM

@ lurker, ALL,

Re : Action not Accident.

“I have to confess suspicion at the haste this Crowdstrike fail is being put down to “accidental” failure to adequately test the update.”

I’m noted for saying,

“There is no such thing as an accident, only insufficient information in time to respond.”

Which means yes I’m suspicious as well. Also because of the incompleteness of Hanlon’s Razor,

“Never attribute to malice that which is adequately explained by stupidity.”

I’m cautious of the so called “human condition” because “adequately” is a very large “fig leaf”, which HG Wells recognised long before when he acknowledged potential criminality in his longer version,

“There is very little deliberate wickedness in the world. The stupidity of our selfishness gives much the same results indeed, but in the ethical laboratory it shows a different nature.”

With Douglas W. Hubbard noting of Hanlon’s razor a few years back a rider that is appropriate to modern systems such as development,

“Never attribute to malice or stupidity that which can be explained by moderately rational individuals following incentives in a complex system.”

Which in a way is a reworking of Upton Sinclair’s simpler,

“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

As I have noted on this blog in the past, things can be made to look like accidents to cover up darker intent, which is why investigators look for financial or other oddities in “suspects lives”.

But few go on to consider the notion of parabolic paths in social conditions even though it’s oft said as,

“… the gravity of the situation.”

As, say, your boss or your boss’s boss, I can place certain strictures on your actions and the time you have to do them in, and quote,

“Time and tide wait for no man…”

as the reason for doing so. But in reality I know that, like Machiavelli, I’ve set a course for you to unwittingly navigate, one that will cause you to end up on the rocks of my choosing.

Thus we have the “Unwitting agent” effect which is similar to the “Useful idiot” notion that is oft incorrectly attributed to Joseph Stalin or Vladimir Lenin (or more recently Vladimir Putin).

If you deny people “Information or the time to process it”, then they respond in a reactive, instinctual way, without thinking, as evolution has turned it into a survival instinct.

But how do you as an observer see it… “Accident or design”?

The software industry especially suffers from the notion of “rapid” with the emphasis on,

“80% of the functionality in 20% of the time”

or similar is always the way to go…

Thus if I was “looking for villains” with “deliberate intent” experience tells me to look a couple of layers up from those who apparently erred.

JonKnowsNothing July 29, 2024 12:05 PM

@Clive, @lurker

re: “accidental” failure to adequately test the update.”

One of the many Murphy’s Laws appears when SHYTE happens:

  • Always blame QA

Whenever you see fingers pointing at QA, you can pretty well bet that it is not QA at all. The other usual suspects are Engineering and Management. However, both of these groups have enormous amounts of CYA to cover up their Emperor has No Clothes situation. There is also a metric ton of stock options and bonuses and various other perks that might get interrupted if either of these 2 groups gets nailed for shoddy practices.

There are giant rabbit warrens about “testing” and nearly none of it applies to Engineering although the problems begin and end with Engineering.

Both Crowdstrike and the AT&T mega outage have nothing to do with QA. These are Management and Engineering errors. Errors they get away with for most of the time.

  • Fix it in the next release
  • Push to Production

In these current cases, the odds rolled against them. You can be assured they are not going to change one iota of internal procedures. They may fire a few workers on the QT, not because they made any error, but because they are the easiest to silence.

Clive Robinson July 30, 2024 3:47 AM

@ JonKnowsNothing, lurker,

Re : The idiocy of power.

“They may fire a few workers on the QT, not because they made any error, but because they are the easiest to silence.”

Yup…

There are a couple of old sayings from a time when most men could build a shelf or bookcase, or do other repairs or improvements around the home, with a few tools in a box kept under the stairs.

“Hard as nails straight and true”

“A straight nail is easiest to hammer down.”

Together kind of illustrates your observation.

I’ve worked in many areas of computers and communications, from a time when building a computer needed a wire-wrap gun or the ability to hand-tape PCB layouts, and I have done research, design, engineering, QA & test, third-line support and a few other jobs, many in life/safety-critical arenas.

But I started young, building and designing canoes, boats, etc. out of GRP (fiberglass, as it was once called) until I became allergic. As part of my “Higher Education” I was taught how to be a toolmaker, which still surprises people when they get stumped by “tamper-proof screws” etc. and it takes me a few minutes to make them a tool (who expects the boss’s boss in an expensive suit to grab a real file? 😉)

One of the biggest problems we have in the ICT industry is that “everyone is a specialist”, and they form cliques that would once have been called “Guilds”, with their bullying hierarchical structures and petty trade secrets.

Worse, even amongst those specialists there is what you might call a belief in the magic of other specialisms. That is, they have fallen into the trap of the inverse of Arthur C. Clarke’s “third law”, which states that “any sufficiently advanced technology is indistinguishable from magic”.

Whilst true, it was never supposed to be used as an excuse not to think, reason, or learn, which is what all too many use it for. Or worse, take us back into a form of “Dark Ages”.

Because, if a manager does not understand what those they manage do, how can the manager manage?

It’s why we appear to be reverting to the religion of “Whip them harder”, which never worked even in dystopian tyrannies.

ResearcherZero July 30, 2024 11:48 PM

At least all the certificates are up to date with underscores in the right places.

(System update removed automatic underscore addition from August 2019 to June 2024)

‘https://www.digicert.com/support/certificate-revocation-incident

Clive Robinson July 31, 2024 6:53 AM

@ Bruce, ALL,

OK CrowdStrike happened and lots of people went “OMG OMG OMG”.

Realistically though if history teaches us anything it’s that the set of circumstances behind it were far from unique and will happen “again and again and again”.

Each will be new and different but essentially they are all “instances in a class of failings” and the lesson to take away is “fix the class, not weep over instances”.

So keep your eyes open for articles like,

https://jvehent.org/2024/07/30/Are-security-and-reliability-fundamentally-incompatible.html

The message is that, for various reasons, we are doing things wrong; whilst that is correct, it is also currently a result of legacy thinking.

We are all aware –or should be– of the old seesaw argument of,

“Security v. Usability”

And I pointed out years ago it’s actually more a case of,

“Security v. Efficiency”

That is, being more efficient, or as some think of it faster, arguably means doing less work, and that “can” give rise to security failings such as “time-based side channels” leaking confidential information.

The point is, if you are aware of confidential information leaking by time-based side channels –and you should be– you can do things in different ways.

Sometimes, though, those different ways can be both “secure AND efficient”; yes, it’s rare, but it does happen.
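The classic illustration of this trade-off is string comparison. A minimal Python sketch (the function names here are my own, for illustration only): the “efficient” version bails out at the first mismatching byte, and that very optimisation is what leaks timing information; the constant-time version deliberately does the “wasted” work.

```python
import hmac

def naive_equal(a: bytes, b: bytes) -> bool:
    # "Efficient": returns at the first mismatching byte, so the
    # comparison time leaks how many leading bytes of a secret matched.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    # "Secure": hmac.compare_digest examines the whole input
    # regardless of where a mismatch occurs, closing the channel.
    return hmac.compare_digest(a, b)
```

Both return the same answers; the difference is only visible to an attacker with a stopwatch, which is exactly why the seesaw is “Security v. Efficiency” rather than correctness.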

The secret is of course “knowledge” which is obtained via,

“Learning and Experience”

Both of which appear increasingly rare in the “run with scissors” culture the software industry has become.

Acros the pond July 31, 2024 9:36 AM

This reminds me of the book Normal Accidents: Living with High-Risk Technologies, written in 1984 by Yale sociologist Charles Perrow. It talks about complex systems and why it is “normal” to expect them to create (big) accidents.

It comes down to two parameters: complexity and the level of coupling. The level of coupling is the slack in the system, i.e. how fast one event will trigger the next event.

His advice for reducing the chance of accidents is to have looser coupling or reduce complexity. I think this would be something for the affected companies to look into, because this time it was a CrowdStrike update; tomorrow it may be something completely different.

There are already legislative initiatives that require (certain) companies to make preparations for these kinds of events: the Cyber Resilience Act, DORA and NIS 2, to name a few.

Clive Robinson July 31, 2024 10:08 AM

@ Bruce, ALL,

Whilst this CrowdStrike issue has made mainstream news around the world, it has in effect covered up,

https://www.reuters.com/technology/cybersecurity/hackers-leak-documents-pentagon-it-services-provider-leidos-bloomberg-news-2024-07-23/

https://www.theregister.com/2024/07/24/leidos_data_leak/

Both indicate that the story originated from Bloomberg, which, shall we say, has a reputation; that might account for the low take-up by other news agencies.

The story appears to be that recently internal documents from Leidos have been leaked. Leidos is one of the largest ICT security companies to the US DoD and other agencies and organisations.

What the real story is remains a little mystifying, and it apparently goes back at least a year or more.

Greg Hunt August 19, 2024 2:13 AM

There are parallels between the CrowdStrike problem and security. Defence in depth is a motherhood statement in security, but the idea was, in the past, applied to quality management as well.

The stack of bugs that the CrowdStrike RCA lists fell through holes in the test strategy. Test approaches need to be layered in the same way that security mechanisms are, to minimise the chances of the gaps lining up and to maximise cumulative effectiveness. Different test approaches should be applied so that they act as quality controls on each other, not just on the bugs in the software under test. CrowdStrike didn’t do that; hardly anyone does these days. The idea that we can write a single layer, or maybe two layers, of test code that finds all defects is a delusion that is a side effect of IT’s recurring loss of institutional knowledge.
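The layering idea can be sketched in a few lines of Python. This is a hypothetical parser invented for illustration (not CrowdStrike’s code): example-based tests pin down cases someone thought of, while a randomised layer checks an invariant across inputs nobody wrote down, so each layer covers the other’s blind spots.

```python
import random

def parse_count_field(data: bytes) -> int:
    # Hypothetical wire format: the first byte declares how many
    # payload bytes follow; malformed input must be rejected.
    if not data:
        raise ValueError("empty input")
    count = data[0]
    if len(data) - 1 != count:
        raise ValueError("declared count does not match payload length")
    return count

# Layer 1: example-based unit tests pin down known-good and known-bad cases.
def test_examples():
    assert parse_count_field(bytes([2, 10, 20])) == 2
    for bad in (b"", bytes([3, 1])):
        try:
            parse_count_field(bad)
            assert False, "malformed input should have raised"
        except ValueError:
            pass

# Layer 2: randomised testing probes inputs no one thought to enumerate,
# checking an invariant rather than specific examples.
def test_random(trials: int = 1000):
    rng = random.Random(0)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            n = parse_count_field(data)
            assert n == len(data) - 1  # invariant on every accepted input
        except ValueError:
            pass  # rejection is also a valid outcome
```

Neither layer alone is sufficient, and that is the point: the gaps in one approach are unlikely to line up with the gaps in the other.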
