How Complex Systems Fail

Good summary list. It’s not directly about security, but it’s all fundamentally about security. Any real-world security system is inherently complex. I wrote about this long ago in Beyond Fear.

Tags: complexity

Posted on February 27, 2013 at 7:09 AM • 12 Comments

Comments

Rob • February 27, 2013 7:59 AM

I thought I would google simple system being clever. Google returns a link to a horse feed company. The Internet is literally unbelievable sometimes.

Paul Johnson • February 27, 2013 8:18 AM

A related take on this subject is “Engineering a Safer World” by Nancy Leveson. http://mitpress.mit.edu/books/engineering-safer-world

She considers safety to be a control system problem: the controller (human or automated) must keep the process within predefined safe parameters. The control system has a model of the process, and accidents happen when this model is wrong. Crucially, she also identifies a hierarchy of control systems including human operators, their managers, organisational management, regulation and legislature. At each level there is a control system trying to manage the process at the level below by using a model of the process to determine the control actions necessary to keep it in a safe state.
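To make the "wrong model" idea concrete, here is a minimal sketch. It is not taken from Leveson's book. The tank, the `Controller` class, and all the numbers are hypothetical, and the controller is deliberately open-loop (it never re-reads the real level) so that the effect of a bad model is easy to see.

```python
from dataclasses import dataclass


@dataclass
class Tank:
    """The real process: the level changes by inflow minus outflow each step."""
    level: float
    inflow: float  # actual inflow per step

    def step(self, outflow: float) -> None:
        self.level += self.inflow - outflow


@dataclass
class Controller:
    """Acts on its own model of the tank, never on the tank itself (assumption)."""
    believed_level: float
    believed_inflow: float
    safe_max: float

    def act(self) -> float:
        # Predict the next level from the model and drain only the excess.
        predicted = self.believed_level + self.believed_inflow
        outflow = max(0.0, predicted - self.safe_max)
        self.believed_level = predicted - outflow
        return outflow


def run(actual_inflow: float, believed_inflow: float, steps: int = 10):
    """Return (real level, level the controller believes) after `steps` steps."""
    tank = Tank(level=0.0, inflow=actual_inflow)
    ctrl = Controller(believed_level=0.0,
                      believed_inflow=believed_inflow,
                      safe_max=5.0)
    for _ in range(steps):
        tank.step(ctrl.act())
    return tank.level, ctrl.believed_level


print(run(actual_inflow=1.0, believed_inflow=1.0))  # model matches reality
print(run(actual_inflow=2.0, believed_inflow=1.0))  # model is wrong
```

With a correct model the real level stays at the safe maximum. With the wrong inflow figure the controller still believes the tank is within limits while the real level climbs well past them.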

Dave Walker • February 27, 2013 8:23 AM

This is a really useful summary paper, and I’ve referenced it in a couple of places, now.

Significantly, it has the unusual joint properties of being reasonably comprehensive as an overview, and being comfortably short. I’m hoping that its being highlighted in places such as this may serve either to show its utility as it stands or to suggest extensions (although I would hope that extensions are “not by much” 😉).

Erica • February 27, 2013 8:41 AM

As Tony Hoare once wrote:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

Or look at the Babbage language – that chose a third alternative as a sort of TSA foreshadowing:

http://www.tlc-systems.com/babbage.htm

jjjdavidson • February 27, 2013 1:27 PM

Reading the list, I was reminded of the Gimli Glider incident, where an Air Canada 767 ran slap out of fuel halfway to its destination. The chain of events, including equipment malfunction, change of air crews, manual measurements and miscalculations, and the final amazing landing, reads like the plot of a Michael Crichton novel, but is absolutely real.

Practically every point on this list can be applied to the Gimli flight. The aircraft and flight procedures both had redundancies, several failures added together caused the incident, the airplane was actually flying with part of its systems disabled, the crew that originally found a system failure was replaced before the failure was completely repaired, the initial catastrophe (out of fuel) was exacerbated by the new style of computer-display “glass cockpit” (no instruments!), the after-accident analysis blamed both the flight crew and the Air Canada management, the pilot partly blamed for the error in fact was the final defender against ultimate catastrophe, hindsight had everybody reviewing the previously “unthinkable” idea that a jumbo jet could find itself with complete engine failure at altitude, and so on.

Mindbuilder • February 27, 2013 2:22 PM

It is very important to realize that hindsight often makes the failure seem obvious when it wasn’t at the time. But it is also the case that often the failure WAS obvious, or at least quite predictable by those with sufficient understanding. Often the failure actually was predicted beforehand. Sometimes when those predictions are ignored it was because a reasonable risk was entered into by someone well informed. Other times the warnings were just ignored by those who didn’t know enough to make an informed decision. There are even times when people do things they know they shouldn’t do. Not because of a rational risk balancing, but just purely irrationally or because of too much regard for short term consequences.

moo • February 27, 2013 6:00 PM

This paper looks familiar. I think someone posted it in the comments section a few weeks ago.

John Campbell • February 27, 2013 7:05 PM

All systems– from the simple to the complex– regardless of checks and balances to ensure the system is honest with itself, are STILL operated by human beings.

Humans are always in the loop somewhere– input or output or in the middle somewhere– and are the inescapable root cause of failure.

RobertT • February 28, 2013 11:10 AM

@Erica
“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”

While I agree with this idea, it is interesting to consider the security cracks that are inherent in any minimalist system. These cracks are often due to the physical implementation rather than the functional specification. Obvious examples are timing attacks and power analysis attacks (DPA).

The fix for these physical attacks is greater complexity, which interestingly opens up more cracks especially for side channel attacks. Power Analysis was widely used in the 1980’s but only really openly discussed in the last 10 years. Which raises the question, what new physical attacks will be revealed in the next 20 years?
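A software analogue of the timing case can be sketched as follows. This is an illustration only, not the hardware scenario described above: `naive_equal` and the token values are hypothetical, and the step counter is a deterministic stand-in for measured running time. The "simple" comparison leaks how many leading bytes matched, and the fix is the slightly less obvious `hmac.compare_digest` from the Python standard library.

```python
import hmac
from typing import Tuple


def naive_equal(a: bytes, b: bytes) -> Tuple[bool, int]:
    """Early-exit comparison.

    Returns (result, bytes_examined). bytes_examined stands in for
    running time: the later the first mismatch, the longer the call takes.
    """
    if len(a) != len(b):
        return False, 0
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps
    return True, steps


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison designed not to reveal where the inputs differ."""
    return hmac.compare_digest(a, b)


secret = b"s3cr3t-token"  # hypothetical value for illustration

# A wrong first byte is rejected after 1 step, a wrong last byte after 12,
# so an attacker who can measure time can recover the secret byte by byte.
print(naive_equal(secret, b"x3cr3t-token"))
print(naive_equal(secret, b"s3cr3t-tokex"))
print(constant_time_equal(secret, b"s3cr3t-tokex"))
```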

Unfortunately even the best intentioned system design team cannot combat an unknown attack method. It is a little like trying to stop an unknown “zero-day”, all you can ever do is to work within the known space.

Recently I had a failure in a product I designed a few years ago, where “shunt regulators” were used to smooth the DPA signature. I thought it was a great solution until a junior engineer showed me how he could predict the current through the shunt regulator. Information leaks about the shunt regulator state meant that my “perfect” solution became a built-in, highly accurate DPA system. In other words, the fix was worse than the complaint… back to the drawing board.

name.withheld.for.obvious.reasons • February 28, 2013 12:53 PM

Let me add to the discussion and to the community’s awareness regarding future challenges.

The latest in surveillance technology is the use of Wifi (802.11xx) as sources for passive bi-static radar detectors. Depending on the physical and electronic geometries it is possible to discern objects as three dimensional spatial representations in time. I leave it to the reader to determine how this might be used or deployed. Then I suggest doing a little research regarding First Net which is a first responders network. As a spoiler, the government plans to deploy a “Nationally Aware Zoomed Internet” system.

Harold • February 28, 2013 9:13 PM

I arrived at this paper a few years ago, via The IT Skeptic’s blog.

Just as true today as it was then. I’ve gone from reading the blog posts of one pragmatist to the blog posts of another. When you take a step back from the discipline, it’s amazing how nuggets of wisdom like this can apply almost universally.

John Allspaw • March 1, 2013 7:10 PM

I’m lucky to know Richard and was happy he agreed to allow me to include the paper (and extra bits he wrote regarding the specific domain) in the book I edited for O’Reilly called “Web Operations”. He also spoke on the topic at last year’s Velocity conference: http://youtu.be/2S0k12uZR14

John
