Excellent Analysis of the Boeing 737 Max Software Problems

This is the best analysis of the software causes of the Boeing 737 MAX disasters that I have read.

Technically this is safety and not security; there was no attacker. But the fields are closely related and there are a lot of lessons for IoT security -- and the security of complex socio-technical systems in general -- in here.

EDITED TO ADD (4/30): A rebuttal of sorts.

EDITED TO ADD (5/13): The comments to this blog post are of particularly high quality, and I recommend them to anyone interested in the topic.

Posted on April 22, 2019 at 8:45 AM • 110 Comments

Comments

David Rudling • April 22, 2019 10:00 AM

The author rightly references "Normal Accidents - Living with High-Risk Technologies" by Charles Perrow (1984/1999) which I have oddly just re-read and which is recommended reading for anyone interested in the intersection of security/safety, technology and people.
Also to be recommended is "Safeware - System Safety and Computers" by Nancy Leveson (1995) which I have also just finished reading.
Although neither is recent their continued relevance is because they both present the unchanging fundamentals and do it very well.
Incidentally I read them because I have also recently finished "Click here to kill everybody" by some chap whose name I can't quite recall but who is a very prominent and distinguished writer in that field !

Seth • April 22, 2019 10:02 AM

Interesting article. The title should more appropriately be "How the Boeing 737 Max Disaster Looks to a Software Developer who is also a pilot". Frankly, I saw the article earlier on another site and skipped it since software development practices seem to be a significant contributor to the flaws in the 737 Max.

The author focuses entirely on the fact that the existence of the MCAS system (which malfunctioned, causing the crashes) was kept hidden from pilots. No mention of the fact that the abilities of that system were repeatedly expanded, apparently to the point where even some of the documentation originally given to regulatory agencies suggests the MCAS has far more limited capabilities than it has in practice.

0805 • April 22, 2019 12:12 PM

As a software developer I doubt, though, that it never occurred to any of the software developers that the thing they were designing was security relevant and needed redundancy. To me it feels more like when the new head of security of my firm instructed us to make all documents for the new product we were planning look as if we never thought of the possibility that the voltage our device runs with could ever harm anyone: the idea was that if we make a mistake that harms anyone we could weasel out of the situation by acting as if we never thought of this.

In our case we complained to the boss of the head of security and within a week we were allowed to document things like we wanted (which means it was easier to implement all the more-than-state-of-the-art things we do in order to keep our products safe).

Alan • April 22, 2019 1:15 PM

I worked writing flight control software for 5 yrs. No other software development I've seen in almost 30 years compares to the rigour of that development environment.

Software developers who have never worked in flight control software, FSW, don't have a clue. They just don't.

I have only 1 question for the software development team.

Q: Does the software perform 100% as specified?

That's my only question. I'd bet it does.

These systems aren't written by 1 or 5 people. Hundreds were involved. I'd be surprised if fewer than 30 people looked at the code between the developers, testers, and simulator people. At least 10 experts formally reviewed it for correctness.

Unintended design, most probably.
Software bug, almost no chance of that.
Less than great test scenarios? Definitely.

Steven • April 22, 2019 1:30 PM

The Spectrum article made a point I haven't seen elsewhere.
The problem with the new engines isn't just that increased thrust causes pitch-up.
It is that the engines themselves generate lift--like a wing, due to their size and shape--and since the engines are forward of the wing, the lift due to the engines causes pitch-up.
And as with any wing, increasing the angle of attack of the engines (short of a stall) increases the lift due to the engines, so pitch-up causes more pitch-up.
So the plane is--in fact--unstable in certain regions of its envelope, and relies on MCAS to stabilize it.
This is very bad.

Gregory Travis • April 22, 2019 1:51 PM

A friend forwarded a link to this.

First of all, thank you very much for the kind words.

Second, this is the first time I've seen the safety issues and security issues related. I spent twelve-plus years doing cybersecurity (I have a CISSP, a security rating, was a fellow at the Center for Advanced Cybersecurity, and did cybersecurity for (some of) the Marine Corps' robotic war fighting efforts) and the overlap never occurred to me. So that, too, is very much appreciated as well.

The true story here, I believe, is not in the technical details. This is a story of tragedy and just how thin the veneer of technical civilization is. For in the end, all of the engineering standards, all of the technical best practices, all of the triage and technical quality assurance were swept away in an instant of bureaucratic collapse.

We will have to endure months, probably years, of engineers' hand-wringing about technical debt, root cause analysis, specification review, FMEA, etc. as if any of those things were germane to what happened.

Because they are not. Like the Ford Pinto considerations, like the Target breach, etc. ad nauseam, this happened because it was cheaper to pick up the pieces than to prevent the explosion in the first place. And, like the shuttle disasters, once people had gone down a certain road too far, turning around was beyond anyone's capacity for bravery. So they stumbled into manslaughter.

Gregory Travis


Gregory Travis • April 22, 2019 2:04 PM

"Technically this is safety and not security; there was no attacker."

A semantic nit. I believe there was an attacker. And that attacker is what blew away the veneer of technical civilization. That civilization is built on all of the various forms and spirits that guide the production process. The rules, the regulations, the best practices, the professional fraternities, etc.

All of those things exist because of the implied social contract that, if encouraged and supported, they will reciprocate by protecting society against harm.

The attacker destroyed those rules, regulations, practices, etc. such that the MCAS system was allowed to develop full term. It should have been lost to an engineering miscarriage almost immediately after germination. But that it was not is evidence that the body didn't reject it. The attacker rendered the body unable to recognize something that should not be viable in any way, shape, or form.

Who is that attacker? I think you can start to get a good idea by looking at the composition of the Board of Directors of Boeing, and the company's history post-merger with McDonnell Douglas.

JJ • April 22, 2019 2:11 PM

@Alan

This.

The number of software people commenting on this issue as a "software" issue, on HN and elsewhere, is astonishing. The comments from people whose software does nothing more than make pixels glow on a screen have been arrogant and presumptuous and prescriptive and defamatory.

It is in fact a system engineering issue. Aerospace system engineering: not the Facebook kind. The issue arises from a confluence of user interface for the pilot, training, control engineering, aero engineering, engineering management, and, at the lowest levels, software.

All of which are directed by a group of system engineers responsible for the entire aircraft design, the production process, the certification, and in-service support.

In my mind the tell-tale evidence is the oscillation in altitude by the aircraft. It's easy to diagnose: the "gain" in the control loop, consisting of the pilot and MCAS interacting with the aircraft wings is too high. High gain, such as more than 0.9 degrees of change to the rear stabilizer, results in oscillation. Low gain does not.

The existence of MCAS was not a software decision. The interface to external sensors was not software but mechanical and electrical engineering. The functionality specified for the MCAS would be controls engineering. And so on.

The absence of a visible indicator is... probably... the root cause. And that was a product specification, a.k.a "marketing", decision.


Steve Friedl • April 22, 2019 2:16 PM

What the article didn't mention was that Boeing sold an extra-cost option, an indicator on the display called "AoA Disagree", which lit up when the two angle-of-attack sensors disagreed by more than a certain amount. Neither of the two doomed airlines had purchased that option.

I do not believe there is any chance that the flight-control people didn't understand that an AoA sensor failure could crash an airplane, and I believe it got somehow overruled by higher in the food chain.

There were surely spirited arguments about this, and I suspect that five minutes after the first crash (Lion Air), the MCAS engineers knew exactly what happened, which is why Boeing was able to get a "fix" in the works so quickly. I'm sure they were hoping this could get out the door without having to tell everybody about it, but the second crash mooted that.

I believe the fix coming out will disable MCAS if both AoA sensors don't agree, the AoA Disagree indicator will be standard, and MCAS will have much less authority than it does now. Pilots will have much more specific training about it as well.
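The cross-check described in this comment can be sketched in a few lines. This is only an illustration: the function names and the 5.0-degree threshold are assumptions made up for the example, not values taken from Boeing or from the article.

```python
# Illustrative sketch only. The names and the 5.0-degree threshold are
# assumptions for this example, not values from any real flight-control system.
DISAGREE_THRESHOLD_DEG = 5.0


def aoa_disagree(left_deg: float, right_deg: float,
                 threshold_deg: float = DISAGREE_THRESHOLD_DEG) -> bool:
    """Return True when the two angle-of-attack readings differ by more than the threshold."""
    return abs(left_deg - right_deg) > threshold_deg


def automation_allowed(left_deg: float, right_deg: float) -> bool:
    """Inhibit automatic trim whenever the two sensors disagree."""
    return not aoa_disagree(left_deg, right_deg)


if __name__ == "__main__":
    print(automation_allowed(4.0, 4.8))    # readings agree
    print(automation_allowed(4.0, 22.0))   # one reading is far off
```

The point of the sketch is that the same comparison that drives an indicator light could also gate the automation, which is the change this comment expects the fix to make.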

What I do not understand is how "AoA Disagree" is not properly called "AoA Sensor Failure" and is seen as optional.

In any case, I'm pretty confident there are flight-control engineers telling their bosses: "I told you so, and I kept all the emails".

Gregory Travis • April 22, 2019 2:22 PM

One final comment, spurred by JJ's comment, above.

I originally wrote my article in a hotel room in Montreal in March, shortly after having flown my Cessna there and just after the Ethiopian crash.

At that time, I could not construct a narrative in my head that explained how anyone, at any level from most junior programmer to program manager, could have designed a system that took only a single unreliable sensor as input and used that input to configure a commercial airliner in a way that rendered it uncontrollable by its pilots. That is, except for one simple explanation, and that was systemic incompetence. Also known as stupidity run head-on into a lack of understanding of history, of any kind.

Since then my view has changed. As more information comes to light, it is now becoming clear to me that the decision to use a single input to MCAS was a deliberate design decision driven by the non-negotiable requirement that the 737MAX not require pilot training.

But my question now is: does that absolve the software developers, who implemented this corporate requirement? Were they just "following orders," i.e. complying with the design/specification?

And, if so, at what point is the separation between responsibility because one is only "following orders," and the responsibility to disobey an order that would clearly cause so much harm, suffering, and loss of life if faithfully adhered to?

Gregory Travis • April 22, 2019 2:34 PM

"What the article didn't mention was that Boeing sold an extra-cost option, an indicator on the display called "AoA Disagree", which lit up when the two angle-of-attack sensors disagreed by more than a certain amount. Neither of the two doomed airlines had purchased that option."

Wouldn't have made any difference as the MCAS system does not take an AOA disagree indication into its calculations.

Furthermore, prior to the Lion Air crash, pilots were not even aware of MCAS's existence. So even if they did have the in-cockpit indication of AOA disagreement, it is extremely unlikely they would have made the leap between that and the fact that the airplane had a full nose-down trim. Instead, it probably would have just contributed further to analysis paralysis.

"I do not believe there is any chance that the flight-control people didn't understand that an AoA sensor failure could crash an airplane, and I believe it got somehow overruled by higher in the food chain."

At which point someone needs to be charged with willful manslaughter. I agree that there had to be someone who understood enough about the system that they could see that it would kill a lot of people very quickly. Which is exactly what happened.

"What I do not understand is how "AoA Disagree" is not properly called "AoA Sensor Failure" and is seen as optional."

See above. Prior to 737 MAX, the AOA sensors were little used pieces of cockpit fluffery -- AOA indication is not an important part of commercial airline flying and I know of few pilots who had AOA indication as any serious part of their scan. It's about as useful to a pilot as knowing how many sheets of toilet paper are left in the lavatories.

It was not until AOA became a significant (only) input to an automated system that could configure the airplane into uncontrollability that it became important. And that didn't happen with Boeing airplanes until the 737 MAX.

Steve Friedl • April 22, 2019 2:49 PM

@Gregory Travis
> Wouldn't have made any difference as the MCAS system does not take an AOA disagree indication into its calculations.

Yes, I understand that. My point was that Boeing knew that AoA disagree was a thing, but thought it was something they could get money for rather than use it to make a safer airplane.

I cannot imagine that engineers didn't argue "Are you nuts?" regarding MCAS non-use of AoA disagree, but got shouted down. This boggles me.

It would surprise me if this ended up being a technical issue; it will instead land squarely on Boeing management for all the reasons you mention.

Time to stock up on popcorn for the inevitable Congressional hearings, and nothing would please me more than to see some engineers showing up under subpoena, bringing stacks of emails.

Frank Wilhoit • April 22, 2019 4:06 PM

@Alan:

Most bad software (and most software, of any kind, is utterly unfit for purpose) performs 100% as specified. Specification is the problem. It is quite possible that we will eventually discover that we are trying to transcend an essential limitation of the human mind.

supersaurus • April 22, 2019 4:37 PM

@Steve Friedl

I was a software developer for over thirty years, during which I wrote a lot of code that wrote programs to drive high speed placement equipment. disclaimer: this is *not* real time control software in the same sense as avionics code and I certainly claim no expertise in that realm, however I do have a lot of experience being managed.

in my experience it is common for management at some level going up the chain to develop selective deafness or brain fog. this is not limited to schedule issues. here is a nutshell example: I once wrote some code that (simplified) approximated a solution to the Traveling Salesman Problem in the pursuit of increasing assembly line speed. the sales guys complained that getting that approximation took too long (an hour maybe, I forget exactly). I explained the reasons to deaf ears. the sales solution? "this problem is simple, just generate all possible paths and take the shortest one". I explained that finding a path through 100 points would involve generating and comparing approximately 9.3 X 10^157 paths (yes, 9 followed by 157 zeroes). the response? "well, the customer just needs to buy a faster computer". so I ignored them and did the best I could, which turned out to be plenty good enough. I could do this in part because, unlike a lot of code, we were not governed by a rigid spec, beyond "make it run faster and do it yesterday".
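The arithmetic in that anecdote is easy to check, and the contrast with a heuristic is easy to show. The sketch below counts the orderings exhaustive search would enumerate (100 factorial) and then builds a tour with a simple nearest-neighbour heuristic. The heuristic is an assumption chosen for illustration; the comment does not say which approximation the original code used.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def brute_force_orderings(n: int) -> int:
    """Number of orderings of n points that exhaustive search would enumerate."""
    return math.factorial(n)


def nearest_neighbour_tour(points: List[Point]) -> List[int]:
    """Greedy heuristic (illustrative choice): always visit the closest unvisited point next."""
    if not points:
        return []
    unvisited = set(range(1, len(points)))
    tour = [0]  # start at the first point
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


if __name__ == "__main__":
    print(f"{brute_force_orderings(100):.2e}")  # 9.33e+157
    print(nearest_neighbour_tour([(0, 0), (5, 0), (1, 0), (2, 0)]))  # [0, 2, 3, 1]
```

The heuristic does work proportional to the square of the number of points, so it finishes quickly where exhaustive search never could, at the cost of a tour that is only approximately shortest.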

I think part of the problem is software has no weight and hence schedule issues, testing, and so forth are nearly perfectly elastic as seen by management (again, at a certain level, not necessarily down at first level managers). the software I wrote didn't have the potential to kill people; for that kind of error read about management failure in the challenger disaster or about the Therac-25.

Clive Robinson • April 22, 2019 4:37 PM

@ Bruce,

Technically this is safety and not security; there was no attacker. But the fields are closely related and there are a lot of lessons for IoT security

As I've noted before, some languages that originate from Europe have only a single word to cover both the meanings of Security and Safety.

Even in the use of English you will find people saying something "is secure" or "make it secure" when they actually mean "it is safe" or "make sure it is safe". When you get down to the nitty gritty of many aspects of ICTsec, finding a meaning for security that is not the same as safety can feel like salami slicing.

Which is why I increasingly use the word "Privacy" instead of "secrecy" or "security" for the issues ICTsec covers for most users these days. Mostly because it provides a clearer and more understandable meaning for most people. Especially when the likes of the FBI and DoJ are trying ever harder to turn the word "secrecy" to some kind of mental mapping in the public's heads to "Terrorist Behaviour", so as to get easier convictions.

supersaurus • April 22, 2019 4:50 PM

@Frank Wilhoit

"Most bad software...performs 100% as specified."

au contraire, I think you would find that most large pieces of software cannot be said to "perform 100% as specified" for two reasons: 1) because it is impossible to specify the behavior of large systems exactly; and 2) it is impossible to test all possible paths through the code on large systems, *especially* for large multithreaded systems.

of course if there is no spec I guess it all performs as specified, but I don't think that's what you meant.

if I interpret you correctly, I do agree that most large systems are beyond the comprehension of any individual. and let's not talk about automated proof of correctness, which seems to me to have a recursion problem.

MarkH • April 22, 2019 4:58 PM

As I wrote on the Friday squid thread, I disagree (meaning no personal disrespect) with much of the analysis offered by Mr Travis, though I am sympathetic to his perspective.

To focus on some main points:

• For about 60 years, jet transports have had automatic cockpit systems to ameliorate various types of handling instabilities inherent in their aerodynamic and engine characteristics.

These have not only worked well, but I am aware of no tragedies caused by their function or malfunction.

So if anyone concludes that it's intrinsically unsafe to fly an airplane with onboard systems to counteract built-in instability, well ... the facts don't support that.

Was MCAS done wrong? Absolutely.

Was the instability caused by the mounting of the MAX engines so severe that it should not have been tolerated even with a correctly implemented compensation system? Perhaps, but this has yet to be established. I think it more likely than not that the answer will be no.

• As I explained before, jet airliners never have "natural feel" on the controls because it's not feasible. Jet transport operations have been working with artificial feel for about 60 years, and I'm aware of no disasters caused by such systems.

Airbus FBW jets have no feel whatsoever, which I dislike "on principle" ... but those sidestick-controlled jets have achieved the same spectacular safety levels as their Boeing counterparts with old-fashioned control yokes.

So if anyone concludes that artificial feel harms flight safety ... the facts don't support that.

• Were the "people who wrote the code for the original MCAS system" ... "obviously terribly far out of their league?"

If by "people who wrote the code," Mr Travis means the people who actually wrote the code, I think that characterization is Dead Wrong.

I've been programming for almost half a century, and I'm accustomed to either small organizations, or obscure corners of very large organizations (which can be pretty much the same thing).

In my little corner, I wear many hats: system design, hardware design, specification of requirements, quality control, etc. I'm also a long time student of aviation safety. If somebody had come to me and described a system to control engine-related pitch-up by jacking the stabilizer, I would probably have said "whoa, that's risky and we can do it much safer." If they'd told me it will be governed by one AOA sensor, I would probably have said "this is a mistake and we need to rethink this."

Perhaps Mr Travis is used to the same kind of small-scale software development, which would make his characterization of the software team understandable.

However, in real life, no airframe manufacturer develops avionics software that way -- nor should they. See the comments by Alan and JJ above for a dose of reality.

In simplified form (I don't know the specifics at Boeing) there was some System Engineering team which designed and analyzed the MCAS concept. This team included people with the relevant expertise concerning airframe and engine characteristics, flight dynamics, and safety of flight operations.

A work product of that team was a software functional requirements specification which was transmitted to an Avionics Software Team. Most of the "people who wrote the code" did NOT possess the expertise of the System Engineering team ... nor would it be feasible (or even sensible) to expect or require them to have it.

If a member of the Software team had responded to the specification by asking, "isn't it a problem that this depends on one sensor?" then the System Engineering team would have answered, "we already analyzed that question, and determined that in case of sensor failure, MCAS behavior will be readily contained without significant impairment to flight safety."

Now, it's possible that Software team responded correctly to the specification, with a work product that had characteristics the System Design team didn't expect (namely, the crazy restart after a few seconds). But if so, that was probably:

a) an inadequacy of the specification
b) a failure of the quality control process
c) a mutual failure of communication between groups

Very likely, as Alan wrote above, the Software Team provided precisely what was asked of them, and the appropriateness of the requirements was the responsibility of the System Design group.

I can't rule out that the software people screwed up. On the basis of publicly available information, they probably made a correct realization of a faulty design.
_________________________________

I wasn't motivated to write this critique by some contrarian impulse.

It's as common as dirt, that when something goes badly wrong -- and the present situation might be the worst scandal in Boeing history -- people incorrectly diagnose the failure(s), and thereby miss opportunities to learn how to do better going forward.

At present, the delegation to brainless robots of decisions formerly made by people is expanding at a reckless (and ever-accelerating) pace.

As originally realized, MCAS was a Killer Robot which actively fought against cockpit crews who were in a desperate struggle to save themselves and their passengers. It's as ugly as a sensational sci-fi movie.

Automation gone wrong is going to be a growing danger for a long time to come. MCAS was such a horrible example of automation gone wrong, that it's imperative to arrive at a correct explanation for how this nightmare resulted from the work of many people who (almost certainly) all wanted everything to work very safely.

For people like me who are involved in the design of automation, it's critically important to understand how it goes wrong and how to get it right.

Clive Robinson • April 22, 2019 5:09 PM

@ JJ,

High gain, such as more than 0.9 degrees of change to the rear stabilizer, results in oscillation. Low gain does not.

Gain alone does not cause oscillation; you also need a feedback path that gives positive feedback (a phase shift of 2pi around the feedback loop) at one or more frequencies.

I won't go into stability criteria as it's fairly long and tedious.

But I will say that if you have a feedback loop that starts to oscillate as the gain increases, then the system is not unconditionally stable. The consequence of this is that the design cannot ever be considered "fail safe" and thus should, where possible, be designed out and replaced with a system that is unconditionally stable.
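A toy simulation makes the interplay of gain and loop delay concrete. The model below is an assumption made up for illustration, not a model of any aircraft: a controller corrects an error using a measurement that is `delay_steps` samples old, so the error follows e[k+1] = e[k] - gain * e[k - delay].

```python
from typing import List


def simulate(gain: float, delay_steps: int = 1, steps: int = 60,
             initial_error: float = 1.0) -> List[float]:
    """Error history of a loop that corrects using a delayed measurement.

    Toy model (an assumption for illustration):
        e[k+1] = e[k] - gain * e[k - delay_steps]
    """
    history = [initial_error] * (delay_steps + 1)
    for _ in range(steps):
        measured = history[-1 - delay_steps]  # stale reading seen by the controller
        history.append(history[-1] - gain * measured)
    return history


def peak_tail(history: List[float], tail: int = 10) -> float:
    """Largest error magnitude over the last `tail` samples."""
    return max(abs(e) for e in history[-tail:])


if __name__ == "__main__":
    for gain in (0.2, 0.8, 1.2):
        run = simulate(gain)
        overshoots = any(e < 0 for e in run)  # did the error ever change sign?
        print(gain, overshoots, peak_tail(run))
```

In this toy loop a gain of 0.8 with no delay settles without ever changing sign, the same gain with a one-sample delay rings before settling, and a gain of 1.2 with that delay grows without bound. That is the sense in which a loop that starts oscillating as gain rises is only conditionally stable.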

Whilst I doubt there are many software engineers that can tell you this, most hardware engineers that have done even rudimentary "control theory" can tell you that virtually straight off.

The same problems exist with systems that have cusps in their control plots. As gain increases the cusp becomes more susceptible to "chaotic behaviour". Again few software engineers are aware of this, and likewise only some hardware engineers are.

For those that don't quite get why a system might be unstable or chaotic, have a think about a system such as a boat where for some reason you put the rudder in front of the propulsion system. Such an arrangement is unstable and requires a very fast control loop to get stability. However, as with fighter aircraft, such inherent instability gives high agility which can be very desirable.

All joking aside it's the real reason for the expression "Putting the cart before the horse". Unless the horse moves very slowly the feedback control loop through the human is way too slow and the equivalent of a jack knife becomes not just possible but probable. It only takes a little thought to realise that such a push system is chaotic, whilst converting it to a pull system makes it not just stable but prevents jack knifes happening. Similar issues exist with rear wheel drive cars, they become "lively" very easily when the roads are slippery or a corner is taken too fast.

MarkH • April 22, 2019 5:51 PM

re Oscillation:

If this discussion refers to the recorded periodicity of altitude fluctuations (around 15 to 20 seconds, if I recall correctly) ...

The explanation I have seen is that MCAS made its incorrect trim input (in response to a "stuck" sensor), held for a while, and then restarted its computation as though it were seeing the high AOA for the first time.

If that explanation is correct, it is not a feedback instability.

This restart phenomenon completely invalidated the restriction -- crucial for safety! -- of a limitation in the total trim change commanded by MCAS.
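Why a restart defeats a total limit can be shown with a toy model. Everything here is hypothetical: the function, the parameter names, and the 2.0-degree figures are invented for illustration and are not taken from MCAS or from any report.

```python
def total_trim(activations: int, per_activation_deg: float,
               authority_limit_deg: float, remember_total: bool) -> float:
    """Total nose-down trim commanded over repeated activations (toy model).

    All names and numbers are hypothetical. `remember_total=False` models a
    controller that restarts and forgets what it has already commanded.
    """
    total = 0.0
    commanded_so_far = 0.0  # what the controller believes it has used so far
    for _ in range(activations):
        if not remember_total:
            commanded_so_far = 0.0  # restart: earlier inputs are forgotten
        remaining = max(authority_limit_deg - commanded_so_far, 0.0)
        step = min(per_activation_deg, remaining)
        commanded_so_far += step
        total += step
    return total


if __name__ == "__main__":
    print(total_trim(5, 2.0, 2.0, remember_total=True))   # 2.0
    print(total_trim(5, 2.0, 2.0, remember_total=False))  # 10.0
```

With memory of past commands, the limit caps the total. Without it, the same limit applies only per activation, so the total grows with every restart.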

Although many factors will be understood to have contributed to this scandal -- chiefly, the way the "no retraining" tail ended up wagging the dog of system design -- I expect that the MCAS restart characteristic will be recognized as the core technical issue which made its fail-mode operation so acutely dangerous.

former bus driver • April 22, 2019 6:14 PM

@Clive --

I wish I'd had that elegant explanation at the ready when the transit system I used to work for began acquiring low-floor articulated coaches with the drive wheels on the trailer. These are now very common, but I'm sure many posters here will instantly recognize the inherent instability in such configurations.

While this design does make it possible to eliminate a drive shaft across the hinged section, and thus retain the low floor height for more of the vehicle's length, "lively" would be an understated way to describe the behavior of such vehicles on icy roads.

Faustus • April 22, 2019 7:29 PM

@ Mark

I appreciate your realistic viewpoint. And I guess I am doubling down:

Does anybody doubt that there are 20 to 50 similar tradeoffs between safety and cost in every airliner? That airliners would be twice as expensive if twice as much effort and testing and equipment were put into them? And airline tickets twice as expensive! And that the planes would still fail on occasion because it is impossible to fully test complex systems?

Am I the only one who has worked in the real world? I have never been told to take whatever time is needed to do something perfectly. I have never been told to let the schedule slide to look for problems in something that has already been tested. But I have been explicitly ordered to ignore vulnerabilities.

Nobody can afford perfect safety or perfect security. Of course the security business will insist they try.

(Not that this is particularly venal. We all think people definitely need the service we sell, whether it be policing, prisons, moral rectitude or yoga instruction. I was convinced that the Y2K bug needed to be preemptively addressed at great cost, and I had the software to do it. In retrospect this seems like a convenient belief on my part.)

The MAX MCAS certainly appears to be a badly evaluated tradeoff with the full benefit of hindsight. But the flawless complex device, the perfect complex program, the error free programmer, and the budget to create these, are simply fictions that don't exist, never did and never will as long as humans are the actors.

JG4 • April 22, 2019 9:43 PM

As usual, I appreciate the helpful discussion. Way too busy, or you'd hear from me more often.

@MarkH - I like your diligent inquiry. I particularly like this excerpt:

"Automation gone wrong is going to be a growing danger for a long time to come."

That captures a lot of the IoT problem, the projected intent problem, the Boeing problem and so much more. I'd go a step further and say that unintended consequences are going to plague humans for a long time to come.

Did I mention the rental car that I had in January? The one with automatic brakes that hampered my ability to pass safely? I think that they would stop interfering when I stepped on the gas. I did like the audible alert from the radar when closing speed was high or the distance was close. One of the cars that I rented last year had a radar feature in the cruise control. It would follow the vehicle ahead at a safe distance if it were traveling slower than setpoint. It was a bit squirrelly when changing lanes if a slower vehicle were in the way.

Forensic Flyer • April 22, 2019 11:05 PM

They'll be arguing questions of liability for years, lots of angles to investigate here, but to me the saddest failure seems to be that lowest-on-the-totem-pole afterthought of modern technology - Documentation.

If there is any technical field where documentation might possibly be given the attention it deserves, it is in the aerospace industry, since the pilots need to know their airplanes, especially when something goes wrong - and yet the documentation failed.

We may all be barking up the wrong tree here, but if the internet speculation is at all correct, it seems that the pilots, in both cases, could have saved the day had they known exactly how the MCAS system operated - and how to turn the darn thing off.

The lack of that page in the documentation may be what caused those crashes and killed all those people.

Alyer Babtu • April 22, 2019 11:45 PM

Ignorant question from a total outsider -

All these software plus hardware machines seem to fall under the heading of nonlinear, recurrent dynamical systems. These have general mathematics related to qualitative behavior, such as fixed points and stable orbits, exponential runaway, chaos, etc. Also neural network researchers study them for emergent behavior, which is a similar exercise.

E.g., just to reference a few obvious names, writings of Stephen Smale, Vladimir Arnold, Stephen Grossberg, etc.

Is this kind of “wholistic” mathematics used in the design, testing, and evaluation of real world products? It seems it might help to provide qualitative insight into the behavior of the product and reduce surprises.

MarkH • April 22, 2019 11:46 PM

@Flyer:

The failure to document MCAS in flight manuals was a grievous fault, and there seems to be a strong consensus among the world's air carrier pilots condemning this omission.

However, if some reports about the Ethiopian crash are accurate, the cockpit crew was aware of the dangers of MCAS malfunction (in consequence of the first disaster), and tried to take the recommended corrective action ... and even so, were unable to survive :(
______________________

As is usual with airline disasters, there were many opportunities to save the situation:

• if all parties in the Boeing design process had a shared understanding of the "auto reset" characteristic of the delivered MCAS, they would surely have changed it, and we would never have heard of MCAS

• if the System Design staff had correctly classified MCAS as highly critical to flight safety (reports of an internal review say that they classified it incorrectly), it would have faced higher standards and much greater scrutiny ... and we would never have heard of MCAS

• if Boeing and/or regulators had properly understood the Lion Air (first) disaster, and responded with a fitting degree of proactive alarm, then the second disaster at least could have been prevented

This list is by no means exhaustive: it's easy to add several more.

How a first-class safety culture fell into such a catastrophic brittle failure will be a focus of study by psychologists and sociologists.

Tatütata April 23, 2019 1:25 AM

Boeing filed a patent application on 31 August 2015 which matured into patent US9703293 for an "Aircraft stall protection system".

Although it makes no mention of engines or unstable designs, it appears to me likely that it is related to the MCAS that is currently the chief suspect in the two tragedies.

The role of evolving regulatory requirements is already clear from col. 1, line 35:

Generally, it is undesirable to operate an aircraft in the stalled flight region. To preclude operation in this region, many regulatory authorities (such as the Federal Aviation Administration (FAA) in the United States) require that the subject aircraft demonstrate sufficient stall warning margin and effectiveness. To satisfy the regulatory stall warning requirements, many aircraft manufacturers employ stall warning systems. Stall warning systems provide visual, audible, and/or tactile indications to the pilot that the aircraft is approaching the stall angle of attack. Stall warning systems do not affect the pilot control of the aircraft, and as such, the pilot may elect to ignore the stall warning system and command the aircraft into the stall (or uncontrolled) flight region.


Stall protection systems, on the other hand, prevent the aircraft from entering the stalled flight region by taking control of at least some of the flight control surfaces from the pilot and actuating the flight control surfaces to maintain the aircraft in the region below the stall angle of attack. Generally, stall protection systems prevent the aircraft angle of attack from exceeding the stall angle of attack so that the wing retains predictable lift characteristics and pilot manipulation of the control surfaces remains effective, with the exception that manipulation of the control surfaces that would cause the airplane to exceed the stall angle of attack is prevented.

Aircraft that employ stall protection systems are typically certified through a Special Condition Issue Paper process (in the U.S.), since the traditional stall requirements cannot be assessed. Some regulatory agencies (such as the FAA) may give aircraft manufacturers performance relief credits for installing stall protection systems, which can result in competitive advantages during the aircraft certification process. For example, traditional operating speed margins based on stall speed in icing conditions are not required, which results in improved takeoff and landing performance. However, while existing stall protection systems prevent aircraft excursions into the uncontrolled flight region, they do not necessarily maximize aircraft performance and pilot input can actually lead to a more rapid depletion of aircraft energy than is desired. Aircraft that have implemented stall protection systems have generally removed traditional stall warning systems and replaced the stall warning demonstrations with stall robustness demonstrations.

At col. 6, line 60, it is clear that this is an active stall avoidance system, and not merely a stall warning indicator:

When executing the activation logic 200, the processor 154 commands the [Flight Control Computer] 112 to limit the aircraft angle of attack to first maximum angle of attack (α1) for a predetermined maximum time period, which prevents the aircraft from entering the stalled region of flight. This advantageously enhances the safety of flight by preventing a stall while allowing the aircraft manufacturer to realize the performance advantages granted by regulatory authorities.

What is a "performance credit"? And should a "competitive advantage" be a consideration in passenger safety? Is this another demonstration of the fundamental problem of the FAA as both a regulator and a promoter of the US aircraft industry?

Much more than an Angle of Attack sensor would be involved if one believes the passage beginning at col. 4, line 58:

The stall protection system 100 determines an aircraft flight state 120, which includes one or more of an angle of attack 122 (which may be received from an angle of attack sensor or an estimate from the [Flight Control Computer]), a secondary flight control surface position, such as a flap position 124 or speedbrake position 125 (which may be received from a flap or speedbrake sensor), airspeed or mach number 126 (which may be received from an airspeed or mach indicator), icing conditions 128 (which may include one or more ice detectors as well as static air temperatures and/or total air temperatures received from temperature sensors), thrust 130 (which may include throttle position received from a throttle position sensor), landing gear position 132 (which may be received from a landing gear sensor), a load factor 134 (which may be received from inertial sensors), aircraft gross weight 136 (which may be received from the FCC, or estimated), aircraft center of gravity 138 (which may also be received from the FCC or estimated), aircraft pitch rate 140 (which may be received from internal sensors), an angle of attack rate of change 142 (which may be received from an angle of attack sensor), and altitude 143 (which may be received from a barometric altimeter, a radio altimeter, or a global positioning system altimeter).

There is no mention of reliability or redundancy considerations in the patent, but this doesn't necessarily mean that the implementation wouldn't be concerned with this aspect. On the other hand, the patent drafter wanted either to cover all bases, or the specification of the control law of the black box might be highly complex indeed, so why are only single inputs mentioned? Not only the AoA is involved, but its rate of change too. Taking the derivative of a noisy (?) input doesn't sound generally like a sound idea.
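The worry about differentiating a noisy input can be illustrated numerically. What follows is only a minimal sketch under stated assumptions: the sample period, the noise level, and the steady rate of change are hypothetical values chosen for illustration, and nothing here is taken from the patent or from any real flight control law. It compares the error in a noisy reading with the error in the rate of change computed from that reading by a simple finite difference.

```python
import math
import random

def finite_difference(samples, dt):
    """Rate of change estimated from each pair of consecutive samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]

def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

random.seed(0)
dt = 0.02          # hypothetical sample period in seconds (50 Hz)
true_rate = 1.0    # hypothetical steady increase, degrees per second
noise_sigma = 0.1  # hypothetical sensor noise, degrees

# A clean ramp, and the same ramp with Gaussian noise added to every sample.
clean = [true_rate * i * dt for i in range(500)]
noisy = [x + random.gauss(0.0, noise_sigma) for x in clean]

value_error = [n - c for n, c in zip(noisy, clean)]
rate_error = [r - true_rate for r in finite_difference(noisy, dt)]

print(f"RMS error of the value: {rms(value_error):.3f} deg")
print(f"RMS error of the rate:  {rms(rate_error):.3f} deg/s")
```

Dividing the difference of two noisy samples by a small time step scales the noise by roughly 1/dt, so the rate estimate comes out far noisier than the reading itself. Real systems would filter before differentiating; how the patented system handles this is not stated in the passages quoted here.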

I note that the priority date was rather close to the 737 MAX maiden flight on 29 January 2016 (source: Wikipedia). If this patent is indeed related to MCAS, the late filing would suggest that this system may have been insufficiently tested, as any earlier data supplied to the FAA during certification might be discoverable and could be held as prior art.

The patent allows a variety of implementations, but it does appear to be mainly foreseen as a software modification to an existing subsystem. A new box with connectors in the instrument bay might attract more scrutiny from the regulator.

Tatütata April 23, 2019 1:43 AM

The reference to "Special Condition Issue Papers" in the patent does suggest a connection to MCAS and the 737 MAX.

These are defined in the FAA's National Policy Order 8110.112A of 3 October 2014.

g. Proposed Special Conditions. For a new TC, the basis for issuing and amending special conditions is found in § 21.16; for changes to a TC, it is found in § 21.101(d). Under the provisions of either § 21.16 or § 21.101(d), a special condition is issued only if the existing applicable airworthiness standards do not contain adequate or appropriate safety standards for an aircraft, aircraft engine, or propeller because of novel or unusual design features of the product to be type certificated.

(1) The phrase “novel or unusual” applies to design features of the product to be certificated when compared to the applicable airworthiness standards. The FAA uses IPs to address novel design features for which there are no regulations or the regulations are inadequate. The FAA uses IPs to document the basis, need, and wording of special conditions.

[...]

h. New Information. It is conceivable that a better understanding of environmental or other hazards not understood in the past, or that did not exist previously, would require a new method of compliance. Such items could include potential circumstances where the existing applicable regulations were developed unaware of the threats.

Clive Robinson April 23, 2019 2:29 AM

Whilst people argue about the level at which things went wrong, i.e. from management down to the software writers and the spec they were writing to, the elephant is still in the room messing the carpet.

The MCAS was thought up as a band-aid for a problem, an obviously known safety-critical problem that was a very real broken bone. As most know, it's rare for a band-aid to be able to fix a broken bone, because a band-aid cannot provide stability at the break; all it can do, if large enough, is bind the bone to a structure that will provide the stability.

When somebody decided to add the engines, that was when the bone was broken in the 737Max. The software, regardless of how it was arrived at, was a band-aid that was way too small and, importantly, had nothing to bind to. That is, at no point was a stability structure added for the software to bind to.

The real cause of the accident started with the decision to break the airframe by adding the engines. From that point on the fate of nearly four hundred people and the thousands of family members was set in motion.

It must have been clear to many that it was the wrong decision, because the result was not unconditionally stable, thus in no way "fail safe".

The question at management level should not have been,

    Can this be done?

It should have been,

    Can this be done safely?

What followed was an abdication of responsibility at all levels down. But it was the decision to add the engines that gave rise to the deaths.

Software fixes don't solve critical hardware problems without the right stable structure to bind to. Using one sensor known to be unreliable is in no way building a stability structure; I think that much is obvious to everyone reading this. It's the equivalent of building on shifting sands. But also anyone who has "worked to drawings" will have noticed the "center line" from which all measurements are taken, or the "point of origin". These are the base reference points from which all measurements are made and all work tested for compliance. Where was this in this design, where was the reference point by which the system was supposed to work? From what has been said it was just a component known to be unreliable...

Yes, we have the advantage of hindsight, but these are well-known engineering problems, and stability issues were known to the Romans and earlier. They might not have had the maths, but they sure knew that whilst a man could push a handcart reliably, a horse could only pull reliably. Likewise shipwrights knew you had to have the driving force in front of the steering mechanism, pulling the boat along, not pushing it. They knew how to make things stable, and they knew how to test that a ship was stable. All known, solved problems long before scientists, and the engineers who use the scientists' work, existed.

Two things were known, one explicitly and one implicitly,

1, The engine change made the airframe unstable, thus not fail safe.

2, To achieve stable conditions with an unstable system you need a stable, reliable reference point to control it by.

The first is the explicit point; it should have come out of the very initial "can it be done?" question.

The second is the implicit point every scientist and engineer should know as part of their core training. Even trainee technicians have it impressed upon them from the very beginning. To make tests, you need measurements, and measurements require valid, reliable, unchanging reference points within the system. Thus the very first thing you do is establish that reference point, otherwise anything that follows is usually meaningless.

Can anyone point to where a reliable reference point was established in this system?

MarkH April 23, 2019 4:55 AM

@Clive:

The installation of LEAP engines on the B737 made important changes to the handling characteristics. That these changes were too severe -- a "broken bone" as you suggest -- is a thesis I have seen asserted by a number of writers, including Greg Travis.

I have yet to see this thesis supported by anyone with the expertise to render a knowledgeable judgment on the matter. I think it likely that this thesis is, in fact, wrong.

To my understanding, the MAX tendency to pitch-up in certain circumstances is not in the first instance a safety problem, but rather an economic problem.

I suggest that if the MAX had no MCAS, it would fly safely. It would be easier for inexperienced pilots to reach excessive angles of attack, but that's not necessarily as bad as it sounds, for at least three reasons:

• by reputation, handling of the 737 on approach and entry to stall is immaculate; it's supposed to be a pussycat (though rather less so, with the LEAP engines)

• like any other type of jet airliner, the 737 has an excellent stall warning system which alerts the pilots (rather violently!) to excessive angle of attack in plenty of time to take corrective action

• when a 737 is getting close to stall, the "feel" system adjusts the centering of control column to automatically help push the nose down, thereby aiding stall avoidance

So if the cockpit crew gets too close to the AOA for maximum lift, they can easily fly right out of it with a good margin of safety, simply by doing what they have always been trained to do.

But ... if Boeing hadn't put MCAS in, this would have required changes to flight manuals and special type training for pilots. For the airlines, that would be a significant expense, and Boeing wanted to make the sales pitch "you can switch to the MAX without spending all that training money."

The terrible irony here (if my understanding is correct) is that a system not needed for safety but only for economics proved to be a severe hazard to safety.
___________________________

Certainly, the analysis I've just offered could be invalid.

As time passes, we'll get three kinds of feedback:

1. The FAA (and other certification authorities outside the US) will certainly be obligated to take a good long look at the "new" MAX with the MCAS patch. If the LEAP installation is really a "broken bone," they'll know that they must either withdraw type certification, or else face a storm of blame in the event of future disasters of this type.

2. The US NTSB will no doubt investigate and issue one or more reports on these MCAS-related crashes. The NTSB has a strong record of both technical know-how and independence; they've had many conflicts, sometimes rather bitter, with the FAA, with airframe manufacturers, and with individual airlines. Surely the NTSB will minutely analyze the changes to handling characteristics in the MAX, and weigh in on whether these render the design inherently unsafe.

3. In the next few years, the recurrence or absence of accidents and incidents which may be related to the pitch-up tendency will answer the question with real-world data.

I propose that it is most likely that the MAX with improved MCAS will meet or exceed one million departures per fatal accident.

Denton Scratch April 23, 2019 5:34 AM

@Gregory Travis

"does that absolve the software developers, who implemented this corporate requirement?"

IMO, it does not.

However software developers nowadays are not high on the food-chain. I mean coders; not their line managers. Pay for software developers has fallen over the last 20 years; and respect for the opinions of developers has fallen in sync. Coders are nowadays treated as fungible 'resources'. If they refuse to do as they're told, they can be replaced. In my last place of work, graphic designers were accorded much greater seniority and pay than coders.

There has been talk recently - I'm not sure where, possibly here - about the use of the term 'software engineering'. The discussion was about the increasing (in some cases universal) use of the term 'software engineer' in job ads. An engineer is a person who practises his discipline under the guidance of sound, up-to-date academic research; someone with a ten-year-old B.Sc in computer science doesn't meet that standard, I'm afraid.

Me, I'm no engineer; I'm a journeyman tradesman (retired, and I'm sure I never made the grade as a 'master'). Over the course of my career, I've (very) occasionally encountered real software engineers. These people, despite their scarcity, were also underpaid, and under-respected. To get good pay and respect, you have always had to make the jump to management. As a consequence, I have often been managed by quite good technicians, who had no clue about managing either projects or staff.

Nowadays there is just a whole lot more software being written; it stands to reason that most of it is being written by people who are no more than journeyman tradesmen, managed by people who can't manage. It's not that surprising that planes crash and websites get hacked.

We need to write less software, and hire more engineers, and fewer journeymen.

Clive Robinson April 23, 2019 6:15 AM

@ Bruce, ALL,

It needs to be said that it appears that the 737Max is not the only safety SNAFU Boeing has currently...

The 787 Dreamliner also has manufacturing issues that some consider safety critical production system defects,

https://www.nytimes.com/2019/04/20/business/boeing-dreamliner-production-problems.html

This is the second airframe-related safety story with regard to Boeing. Importantly, as it affects areas unrelated to the 737Max, it raises certain questions...

Why? Because if they were related then it would indicate a failure at one potentially isolated point in the organisation, however unrelated...

This suggests that it is a much more widespread and deeply rooted issue of management, culture or both that affects the company in a much more general way, thus more of their products.

I suspect the finger will get pointed at nebulous "procedures" or some such to lift the corner of the rug before the issue, but not the problem, gets swept out of sight.

Some Guy April 23, 2019 6:48 AM

As devices and designs become more complex, not only does systems engineering become more complex, the systems people work in become more complex.

As with any increasingly complex system, it is necessary to decompose the system into manageable parts. But that requires strong communications between teams. However, there is a limit to the amount of communications and coordination that can be managed. Inefficiency of scale comes into play.

With the complexity in communications come points where individual design teams make locally optimal decisions that in the aggregate are suboptimal and potentially unsafe. Now add separate teams doing validation and testing, and the potential failure point never gets caught, because failure requires interaction between systems. Since it is rare, it takes multiple real world failures before the problem is understood.

tfb April 23, 2019 7:34 AM

The description of what has happened to the relationship between the FAA and the manufacturers has a name: regulatory capture. It happens in other industries, notably finance, and it's never a good thing.

I think a solution to regulatory capture like this involves two things: regulators need to be paid well enough that you get good ones, and people who want to become regulators need to be forced to divest themselves of any interests they have in the organisations being regulated and never be allowed to work for them again, with a very substantial penalty (they need to go to jail) if they are found to have violated the rules. Of course people will only do this if they are paid well enough in the first place which is another reason they need to be paid well.

That will never happen, of course.

Kaban April 23, 2019 7:38 AM

Thank you for sharing this.

The line

"...It doesn’t go old-school. It’s modern. It’s software..."

made me cringe. Poor requirements, poor design and poor engineering are older than software.

I believe the article ignores the issue of MCAS design that has nothing to do with software. The MCAS was, from Boeing's point of view, performing as intended in fail-safe mode.

The pilot facing "airspeed unreliable" condition is expected to disable autopilot and maintain the pitch. Maintaining the pitch requires, among other things, precisely trimming the stab by using trim inputs (and trim inputs, among other things, disable MCAS's intervention for a few (five) seconds). Should nosedowning continue, the pilot is expected to respond via Runaway Stabilizer checklist.

The problem is that whoever designed the crutch failed their task of assessing the load placed on pilot, injecting a straw that brought camel's back into flight law. Some Boeing test pilot felt at ease with system; two tired, underslept commercial airline crews did not manage.

It is, most certainly, not Horrible Software Problem. There was, as far as we can deduce for now, literally nothing wrong with software. It performed as intended. Blaming coders for
a) poor ergonomic decision, and
b) criminal lack of user documentation
is not "best analysis". It is display of emotion at best.

Petre Peter April 23, 2019 9:07 AM

Tragic what happened. While it doesn't seem to be the case here, attacks can be disguised as negligence. I am glad that forums like this exist where the word glitch doesn't suffice but I am very sad that this lesson costs lives.

wiredog April 23, 2019 9:08 AM

@clive
"Software fixes, don't solve critical hardware problems"
That takes me back about 20 years to my days writing industrial automation software. We had a system that was failing, with switches triggering incorrectly, and the boss kept trying to fix the brokenness in software. Check the switch multiple times to see if it's actually closed, etc. Turned out someone had ignored our instructions and put the 500VDC rectifier right next to the control cabinet.
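The "check the switch multiple times" workaround described above is essentially software debouncing. A minimal sketch of the idea follows; the function names and the sample count are hypothetical and are not taken from the system in the anecdote. It accepts a switch as closed only if several consecutive reads all agree.

```python
def debounced_closed(read_switch, samples=5):
    """Report 'closed' only if `samples` consecutive reads all say closed."""
    return all(read_switch() for _ in range(samples))

def make_reader(readings):
    """Build a stand-in for a hardware read that replays canned readings."""
    it = iter(readings)
    return lambda: next(it)

# A line with one spurious 'open' reading is rejected; a steady one is accepted.
glitchy = make_reader([True, False, True, True, True])
steady = make_reader([True, True, True, True, True])

print(debounced_closed(glitchy))
print(debounced_closed(steady))
```

As the anecdote itself shows, this kind of filtering only masks electrical interference. It does nothing about the source of the noise, which in that case was a rectifier placed next to the control cabinet.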

RealFakeNewsApril 23, 2019 9:48 AM

Someone should be looking at jail time.

Forget why MCAS was installed.

1) It uses a single input

2) It had no fault monitoring

3) It had an unrelated system fault message that was optional, but fundamentally did not affect system operation

4) Other changes were made to safety-critical systems

You are not telling me this was just some huge mis-understanding by a lone software engineer (looking at you, Volkswagen).
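Points 1 and 2 in the list above (a single input, no fault monitoring) can be contrasted with a simple cross-check of two redundant inputs. This is only a sketch under stated assumptions: the disagreement limit, the trigger angle, and every name below are hypothetical, and it does not describe Boeing's design or any subsequent fix.

```python
from dataclasses import dataclass
from typing import Optional

DISAGREE_LIMIT_DEG = 5.0  # hypothetical limit, not a real certification value

@dataclass
class Vote:
    valid: bool
    value_deg: Optional[float]
    reason: str

def cross_check(left_deg, right_deg, limit_deg=DISAGREE_LIMIT_DEG):
    """Compare two angle-of-attack readings and flag a disagreement."""
    if abs(left_deg - right_deg) > limit_deg:
        # With only two sensors we can tell that one is wrong, not which one.
        return Vote(False, None, "sensors disagree")
    return Vote(True, (left_deg + right_deg) / 2.0, "ok")

def automation_may_act(vote, trigger_deg=12.0):
    """Allow automatic intervention only on a valid, agreed reading."""
    return vote.valid and vote.value_deg > trigger_deg

print(cross_check(4.0, 5.0))
print(cross_check(4.0, 74.5))
print(automation_may_act(cross_check(4.0, 74.5)))
```

The design choice in the sketch is to fail passive: when the inputs disagree, the automation does nothing and the fault is reported, rather than acting on whichever reading happens to be wired in.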

AlanS April 23, 2019 10:33 AM

@ Gregory Travis

I want to comment on your linking of the 737 Max failure to the Challenger disaster:

I cannot get the parallels between the 737 Max and the space shuttle Challenger out of my head. The Challenger accident, another textbook case study in normal failure, came about not because people didn’t follow the rules but because they did. In the Challenger case, the rules said that they had to have prelaunch conferences to ascertain flight readiness. It didn’t say that a significant input to those conferences couldn’t be the political considerations of delaying a launch. The inputs were weighed, the process was followed, and a majority consensus was to launch. And seven people died.

In the 737 Max case, the rules were also followed. The rules said you couldn’t have a large pitch-up on power change and that an employee of the manufacturer, a DER, could sign off on whatever you came up with to prevent a pitch change on power change. The rules didn’t say that the DER couldn’t take the business considerations into the decision-making process. And 346 people are dead.

But is the 737 Max failure the 'same' as the Challenger failure? In your article and here you appear to be suggesting that this is a disaster brought about by moral failing. Maybe that's the case in this instance, that corporate interests overruled the engineers, designers, and software developers in the face of their opposition. But a couple of points: 1. Organization/institutional culture and relations can never be separated from technical decisions. You are never going to be able to separate out the purely technical from the social/cultural. and 2. The argument you appear to be making in this case is not the argument that is made in the most cited analysis of the Challenger disaster, Diane Vaughan's, which to quote someone I quote elsewhere (see last link below) "bite[s] the bullet of technical uncertainty".

See earlier discussion on this blog of the "normalization of deviance" which Bruce ties to an article by Ron Rapp on the 2014 Gulfstream crash: IT Security and the Normalization of Deviance. See my posts below the main post: here, here, here, and here. I argue that Rapp and Bruce misunderstand the subtlety of Diane Vaughan's analysis of the Challenger disaster and her term "normalization of deviance".

Faustus April 23, 2019 12:24 PM

The analysis in this forum of the technical issues around the MAX accident is helpful.

But I think the search for a scapegoat is not. Our tendency to search for scapegoats is exactly what makes accidents such as this more common.

Until we can dispassionately say that an error was made and trace its genesis without scapegoating we are simply encouraging people to hide any evidence of mistakes they have made. Or that they have simply been in the vicinity of, making these witnesses unwilling to speak out for fear of tarring by some mob, quite likely consisting of other potential scapegoats who want to divert attention from themselves.

Any error in such a complex aircraft is likely to have stemmed from many people's actions, and is therefore systemic in nature. You get to go on a witch hunt, or you get to get to the actual systemic source of the error, but not both. A witch hunt ensures that the most critical information is harder to find as everyone is covering their butts and denying as much knowledge as they can. People should be encouraged to bring forth information that may implicate them, not scared into hiding it.

We force people into lying when we don't credit the real pressures companies are under trying to make complex systems. They have to pretend that everything is as perfect as possible, when in reality some tradeoff calculation must go on, determining how far certain vulnerabilities are worth pursuing, the cost being in money, time, market share or public image.

This happens in NASA, in private companies, and in government. For example, there is a limit to how much effort we exert in making sure that we don't imprison innocent people. The US allows prosecutors to blackmail people into pleading guilty rather than going to trial as an efficiency measure, a tradeoff. It forces innocent people to plead guilty to avoid jacked up sentences. This justice tradeoff injures and kills more innocent people than faulty aircraft ever will. But we accept it, while simultaneously denying its existence and even fighting to deny convicted people the opportunity to seek justice when police or prosecutorial misbehavior is discovered, or new evidence is found, or we realize that old forensic processes were deeply flawed.

As long as we force people to maintain the fiction that any lack of safety or security or justice is unacceptable and punishable, we ensure that the tradeoff process remains a secret. When the tradeoffs are secret it is much harder to reach a consensus on what is acceptable and what is not. We can't bring in outside experts to evaluate how appropriate the tradeoffs are, and what REALISTIC standards should be used to establish these tradeoffs and what information the public has a right to know about short cuts taken for practical reasons.

So we play the game of moral indignation and false surprise again and again, when a dispassionate process of truth seeking and improvement would gain us much better results than a call for punishment.

Rj Brown April 23, 2019 12:34 PM

"It costs too much and takes too much time to fix the hardware. Fix it with software."

How many times have I heard this since I started my career as a realtime embedded software engineer in 1973? Many times, that is the correct way to go, but some things just cannot be fixed with software.

I remember working on a robot submarine in the 1980's where we were being very bold to use floating point in the control loop software. The analysis had been done, and it promised to be quite effective. My manager (a pure software guy with zero hardware experience) came into my office very frustrated. He had been working for several days trying to fix a problem with a positioning servo. Out of his frustration, he gave the problem to me. After about an hour of fiddling with it in the debugger, I had determined that the anti-aliasing filter in the hardware must certainly have too long a time constant, so I dug out the prints for the board. Looking at the circuit, there was nothing unusual, so I looked at the actual board we were using, which was a hand wired prototype. It had the wrong capacitor in the anti-aliasing filter. The technician who assembled the board must have misread either the number on the print, or the number on the part he actually used. I had the tech install the correct part, and the servo was working as desired within a few hours. When my manager saw that I had fixed it so quickly, he asked me what was wrong, and why couldn't he find it. I told him the problem, and how we fixed it. His frustration was increased. It seemed unfair to pull a hardware bug on a pure software guy. He thought everything had a software fix.

I have done a number of safety critical projects, including working on software for the 787, among other aviation projects. I much prefer to work on those projects -- especially aviation -- because for the most part, being required to do it right takes a lot of compromise out of the equation. The FAA DO-178B and now C standards, and the IEC 62304 and IEC 61508 standards for medical and industrial equipment, all require a methodological approach that mandates layers of design refinement and multiple reviews by multiple people. Then it is all capped off by extensive, and reviewed, testing. I like being made to do it right, because then the managers have a much harder time trying to make me do it wrong.

Alyer Babtu April 23, 2019 12:45 PM

@AlanS @Gregory Travis

the rules said that they had to have prelaunch conferences to ascertain flight readiness

E. Tufte (Internet search for Tufte plus space shuttle) reviewed the actual space shuttle flight discussion as it was presented in slides, concluding that in more than one case sufficient data was made available but proper evaluation and understanding of risks was obliterated by bad presentation. Perhaps the quality of communication has to be looked at here also ?

MarkH April 23, 2019 1:17 PM

@Clive et al:

The "big picture" here will surely be in the realms of social sciences: psychology and sociology.

While complaints from airframe factory workers about safety are chronic (and if you think about it, a reassuring sign of a healthy safety culture), the situation depicted in the article linked by Clive looks really grim.

If Boeing's once world-leading safety culture is indeed melting away, it will be important to understand why.
________________________________

One likely contributing factor is what psychologists call the availability heuristic: people naturally tend to assess probabilities by reference to those experiences which come most readily to mind. Typically, this gives a lot of bias to recent experience.

Airline operations have reached levels of safety unimaginable to those in the industry when I was young.

Newer types of aircraft are experiencing about one fatal accident per 3,000,000 departures. It would now be necessary to fly continuously (getting on the next flight as soon as you land) for centuries, in order to reach a 50% likelihood of being on a flight with an accident fatality ... and even then, you would probably be one of the survivors.
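
The arithmetic behind that estimate can be checked with a short sketch. The one-in-3,000,000 rate is the figure quoted above; the number of flights per day is purely an assumption for illustration, and each flight is treated as an independent event.

```python
import math

# Rate quoted above: one fatal accident per 3,000,000 departures.
P_FATAL_PER_FLIGHT = 1 / 3_000_000

# Assumption (not from the comment): a passenger who boards the next
# flight as soon as they land manages about 8 flights per day.
FLIGHTS_PER_DAY = 8


def flights_for_probability(target: float, p: float = P_FATAL_PER_FLIGHT) -> int:
    """Smallest n with 1 - (1 - p)**n >= target, assuming independent flights."""
    return math.ceil(math.log(1 - target) / math.log1p(-p))


def years_of_continuous_flying(flights: int, per_day: float = FLIGHTS_PER_DAY) -> float:
    """Convert a flight count into years at a constant number of flights per day."""
    return flights / per_day / 365.25


n = flights_for_probability(0.5)
print(f"{n:,} flights, about {years_of_continuous_flying(n):.0f} years")
```

Under these assumptions the answer is roughly 2.1 million flights, which works out to several centuries of non-stop travel. Changing the assumed flights per day shifts the number of years but not the order of magnitude.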
________________________________

How did safety get to such a high level?

Bruce has taught me to focus on incentive structures, and how they influence responsible/irresponsible conduct.

The airline business is a near-perfect case. So many of the stakeholders, including airframe manufacturers, operators, flight personnel, regulators, labor unions, engine and component manufacturers, travel and booking companies, insurance underwriters, etc. etc. etc. ... all understand that air disasters impose large costs on them. Most of the situations we usually look at (for example, security of online business operations) have inherent conflict in which stakeholders have opposed interests. In passenger aviation, the alignment is virtually complete. Surely, this has helped safety statistics to reach their present astronomical levels.
________________________________

Spectacular success sets the table for creeping complacency.

If my recall of aviation history is correct, the last time a flaw in a Boeing plane was held to be the likely cause of an air disaster was in 1996 (TWA 800, B747).

That means that if you're on the Boeing engineering staff, and you're younger than age 44, you have never gotten the news of a fatal crash for which your team might have been responsible ...

... until now.

AJWMApril 23, 2019 2:17 PM

@Rj Brown His frustration was increased. It seemed unfair to pull a hardware bug on a pure software guy. He thought everything had a software fix.

To be fair, it's drummed into student programmers (or was in my day) that it's essentially never a hardware problem. If your code doesn't work, look for the fault in your code, don't blame the hardware.

And to back this up, design flaws aside (like Intel's Pentium floating point and F00F bugs), I have only ever seen one instance where wrong calculations (vs complete failure) were the result of a hardware error. (On one particular DEC MicroVAX, under obscure circumstances -- DEC replaced the unit.) Of course, with today's incredibly complex processors design flaws are more likely, as are issues with embedded hardware systems where a bad component (like your capacitor) might change the behaviour without a complete system shutdown.

Clive RobinsonApril 23, 2019 3:00 PM

@ RealFakeNews,

1) It uses a single input

You forgot to add "known to be unreliable" after single...

Whilst all inputs will eventually fail or misread, the particular input used is not just unreliable in reading, it is unreliable in operation and prone to various failures. The more I find out about it, the less inclined I would be to use it at all. You'd probably do better with a cup of mercury with a couple of damping vanes and a couple of carbon rods...

TheoApril 23, 2019 3:33 PM

@Rj Brown
@AJWM

The problem is not pure software people who assume every bug is a software bug. It's hardware people (particularly managers) who assume every bug is a software problem. (Software people who assume every bug is a hardware bug are a similar problem.)

It takes teamwork. People should start looking for problems where they are best equipped to find them, with enough overlap that there isn't a gap. When the hardware guys blame the software instead of checking the hardware, hardware bugs will never be found (and vice versa).

I've never found it frustrating when I spent a month looking for a bug before reporting I couldn't find one, if this caused management to assign a different engineer with different skills who quickly found a bug somewhere I had explicitly stated I was not equipped to look. It is, however, very frustrating to point out that this here capacitor is wrong and causing all our problems and have that dismissed.

Sancho_PApril 23, 2019 4:21 PM

(@MarkH)
Yes, the very first question would be: Is the MCAS necessary or not?


But (certainly I didn’t read everything that was reported):

Isn’t the AoA failure a very theoretical point in this particular case?
Only to understand the very truth,
I mean, is there any evidence of malfunction or bird sitting on the sensor causing the MCAS action in both flights?
- Or is it possible the pilot pushed up at “max” to get an early coffee break?

- Was it an automagical, planned restart or did the pilot switch back to auto?
- The MCAS did not realize that the sensor did not react, was dead, stuck? (*)
- The output (actor) function was unlimited? Adding up? WTF?

Of course I can’t understand using a single sensor, just as I could not understand using three sensors that all share the same physical measuring principle when that principle is known to be weak. Were the sensors never improved?

For certified safety valve functions we had to use three independent, certified sensors or physically different signals. It was a mandated technical standard, best practice and therefore law in the EU.
No babble about coding: this is missing engineering (after faulty design).
In the EU there would be harsh personal consequences, regardless of emails or memos.

I think the airframe design is dangerous itself and can’t run under existing 737 certification, HW/SW adaption of the MCAS or not.

Also interesting details with the wheels (@Gregory Travis didn’t mention them, not sure if true):
https://www.schneier.com/blog/archives/2019/03/friday_squid_bl_669.html#c6790575

(*) With all critical analog inputs a stuck sensor signal must be detected by the signal conditioner, regardless of being used in control loops or for signaling.
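
A stuck-signal check of the kind described in the footnote might look like the following minimal sketch. The class name, window length, and minimum-change threshold are all hypothetical; real values depend on the sensor's noise floor and on how long the measured quantity can legitimately stay constant.

```python
from collections import deque


class StuckSignalDetector:
    """Flags an analog input whose reading has barely moved recently.

    The signal is treated as stuck when the spread (max - min) of the last
    `window` samples is smaller than `min_delta`. Both parameters are
    hypothetical placeholders, not values from any real signal conditioner.
    """

    def __init__(self, window: int = 50, min_delta: float = 0.05):
        self.samples = deque(maxlen=window)
        self.min_delta = min_delta

    def update(self, value: float) -> bool:
        """Add one sample; return True if the signal currently looks stuck."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to judge yet
        return max(self.samples) - min(self.samples) < self.min_delta
```

A live sensor always carries some noise, so a perfectly flat reading over a full window is itself suspicious. The same detector can feed both a control loop and a warning indicator.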

RealFakeNewsApril 23, 2019 10:04 PM

@Clive Robinson:

AoA vanes have been problematic almost since their introduction. I'm still speechless at the fact the MCAS didn't do a compare.

I know a lot about aircraft systems and their operation due to several family members being airline pilots. I myself fly privately, too.

Safety-critical electronic systems, without any exception I'm aware of, compare with other like sensors and "vote" on a solution, whether they be FMS, ADC, or IRS. If any one (or more) systems generate a result outside of tolerance, that unit is deemed failed or faulty and isolated, and a fault is generated.
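
As a rough illustration of that voting scheme, here is a minimal sketch assuming three or more like sensors. The function name, channel names, and tolerance are hypothetical, and the mid-value selection shown is one common approach rather than the method used by any particular avionics unit.

```python
from statistics import median


def vote(readings: dict[str, float], tolerance: float) -> tuple[float, list[str]]:
    """Return (voted value, names of channels flagged as faulty).

    A channel is flagged when it differs from the median of all channels
    by more than `tolerance`. The voted value is the median of the
    remaining healthy channels.
    """
    if len(readings) < 3:
        raise ValueError("voting needs at least three channels")
    mid = median(readings.values())
    faulty = [name for name, v in readings.items() if abs(v - mid) > tolerance]
    healthy = [v for name, v in readings.items() if name not in faulty]
    return median(healthy), faulty


value, faulty = vote({"left": 5.0, "center": 5.2, "right": 25.0}, tolerance=2.0)
print(value, faulty)
```

The flagged channel is isolated and a fault is raised, while the output keeps tracking the channels that still agree.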

With MCAS, a system with direct and unlimited flight control authority has exactly zero cross-checking/voting/fault isolation.

How did this system pass validation?

It's why I originally wrote "forget why it's there" because it breaks the most fundamentally hard-won principle: SAFETY.

Even stick-pushers are split so a fault in one system does not prevent control of the aircraft.

@Sancho_P:

As I understood the original problem, the accident chain is thus:

Aircraft is climbing in a slow flight regime above the stall

In both accidents, slat retraction is the triggering condition. This is because as the slats come up, AoA increases. This strongly suggests the aircraft is accelerating at high power.

What exactly MCAS is seeing at this point we will likely never know, BUT it is either an errantly high rate of change in the AoA signal, causing MCAS to initially respond correctly by pushing slightly and thus reducing AoA, or the AoA vane itself is doing something weird (aerodynamically), going to full deflection, and getting stuck.

There are two types of vane:

An arrow-type that is slightly weighted, and is lifted up aerodynamically by the airflow, and;

A point-type that protrudes perpendicular to the airflow, and is shaped such that Bernoulli is invoked and causes the vane to rotate due to pressure differentials around its circumference until they are in equilibrium.

Which type is on the Max, I don't know, but it may or may not be relevant to the issue of "faulty" AoA probes.

The Max would seem to have a particular problem with them, and this itself needs examining.

Denton ScratchApril 23, 2019 10:09 PM

@AJWM

"it's drummed into student programmers (or was in my day) that it's essentially never a hardware problem."

It really isn't. Usually.

We once had a problem with 8086-based machines crashing, reliably, some three months after customer delivery. These were not IBM-style PCs; they were non-standard hardware, running a non-standard OS (CTOS). Our major competitor (one of the Big Six) was not having these problems; they were shipping essentially the same circuitry. We were a tiny company - about a dozen staff.

Turns out that our motherboard had a substandard capacitor, in a location where our competitor did not. They were made in the same factory, but not to the same standard. This faulty component passed all soak testing (about three months); it failed only in production units, i.e. on customer sites. We had to travel to San Jose, descend into this weird basement, and interrogate this troglodyte hardware engineer, to learn the truth about why our customers were having these problems.

There is no way we could have fixed this in software. I can't see how the manufacturer could have anticipated the problem; "use top-notch components" is always a good plan, but it's not really a hardware design strategy (cooks always say the secret is to use really good ingredients, but in fact doing that doesn't make you a good cook).

Anyway, knowing about this hardware problem enabled us to get the jump on our Big Six competitor for several months, because they were beginning to ship the same motherboards, with the same hardware defect. Several months mattered back then. My boss said he got a hard-on when I told him what was going on.

I really enjoyed my one and only ever trip to San Francisco (and incidentally to San Jose).

RealFakeNewsApril 23, 2019 10:16 PM

MCAS is initially responding correctly to this faulty signal, but instead of only applying a limited input and stopping, it is, again questionably, continuing to add nose-down input, forever.

Again, a simple timer, delta correction check, or simple detection of opposite trim input that PERMANENTLY cuts out the system would have prevented both crashes, but for some reason, MCAS was allowed to add nose-down trim to the physical trim limit without something saying "ya know what? That's absurd - we've got a fault".
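
The kind of guard being asked for can be sketched in a few lines. This is a hypothetical illustration only: the class name, the limits, and the trim units are invented, and it combines a cumulative-authority limit, an activation count, and a latch on opposing pilot input rather than reproducing any real design.

```python
class TrimAuthorityGuard:
    """Bounds how much nose-down trim an automatic function may command.

    All limits are hypothetical placeholders, not values from any real
    aircraft. Once tripped, the guard stays latched off.
    """

    def __init__(self, max_total_units: float = 2.5, max_activations: int = 1):
        self.max_total_units = max_total_units
        self.max_activations = max_activations
        self.total_commanded = 0.0
        self.activations = 0
        self.latched_off = False

    def pilot_opposed(self) -> None:
        """Pilot trimmed against the automatic input: cut out permanently."""
        self.latched_off = True

    def request(self, units: float) -> float:
        """Return the nose-down trim actually granted for a request."""
        if self.latched_off:
            return 0.0
        if self.activations >= self.max_activations:
            self.latched_off = True  # repeated activation is treated as a fault
            return 0.0
        remaining = self.max_total_units - self.total_commanded
        granted = max(min(units, remaining), 0.0)
        self.total_commanded += granted
        self.activations += 1
        return granted
```

With a guard like this in the path, a faulty input can cost at most a bounded amount of trim before the function removes itself and leaves the aircraft to the pilots.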

Just how on earth did this system pass any form of sanity check???!!!

The Challenger disaster is an apt comparison. I can only imagine an engineer somewhere looking at the lack of checks, and raising an issue, only to be "shut up" by management.

Even I, as a mere mortal, immediately asked this question when I originally heard about the crash.

A design engineer is paid to ask this question in advance, and I can't imagine any system being installed without these fundamental design points being incorporated.

Either I'm a genius (unlikely), or someone was cutting some serious corners.

It almost looks to me as if MCAS is far from a carefully-designed function, and more a last-minute "we've got the hardware - let's add this feature".

Rushed; ill-thought-out; disastrous.

Despite people saying systems are very carefully designed over months, I do think this system bypassed the normal conventions. Something just seems very wrong with it.

Erdem MemisyaziciApril 24, 2019 1:23 AM

Same software engineers who thought a single angle-of-attack sensor was good enough were later hired by Darth Vader to work on the Death Star.

GabrielApril 24, 2019 1:27 AM

I wonder how this ties in to the argument about the necessity for government regulation. Opponents of government regulation argue that no company wants to kill or hurt its customers. They argue that business is a multi-round game: if in one round you hurt your customers, then in the next round they will leave you.

And it seems that Boeing is indeed losing a lot of money from this fiasco, even before government has done anything: Cancelled orders, compensation to airlines for plane groundings.

MarkHApril 24, 2019 5:22 AM

Re: AOA Sensors

The 737-8 "MAX" is equipped with two angle-of-attack sensors, one on each side of the nose. These are of a weather-vane design. Here's a small but fairly clear photo:

https://i.stack.imgur.com/k0NTBm.jpg

I don't know whether the sensor in the photo is on a 737; the label looks too weathered to be from a new -8. However, when I looked at very high-res photos of MAX planes, the sensors seemed to me to have the same visual configuration.

In modern jets, "moving parts" have been ruthlessly eliminated from most of the gadgetry needed to keep track of the plane's motion, position and attitude. However, the AOA sensor is a moving part, with a shaft rotating on some kind of bearing ... and out exposed to the elements, at that. It's no surprise, that it is a comparatively failure-prone device.

As to Sancho_P's question, yes there is evidence, at least in the case of the first (Lion Air) MCAS-related disaster.

Here's a graph from Forbes magazine of data extracted from the Lion Air Flight Data Recorder, which was recovered from the wreckage.

It appears that after takeoff, the two AOA sensors tracked quite faithfully in the sense of their variation with time. However, the two sensors started with a discrepancy of about 10 degrees (left signal greater than right signal), and that discrepancy rose to about 20 degrees in the minutes before takeoff.

In sum, the left AOA sensor was reading substantially higher than the right. Unluckily for those aboard that flight, MCAS was apparently governed by the left sensor.

I would guess that when functioning properly the sensors should track within 1 degree of each other (or perhaps better). Though it's difficult for me to visualize what would cause the sensor to accurately track changes of AOA while presenting such a large offset, there's no doubt that the signals presented from the left AOA sensor to the avionics were grossly inaccurate.
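
With only two sensors there is no way to tell which one is wrong, but a disagreement can still be detected and used to inhibit anything that depends on the reading. A minimal sketch, with a hypothetical function name and threshold:

```python
def aoa_cross_check(left_deg: float, right_deg: float,
                    max_disagreement_deg: float = 5.0) -> tuple[bool, float]:
    """Return (inputs_valid, disagreement_deg) for a pair of AOA readings.

    The 5 degree default threshold is a hypothetical value chosen for
    illustration; it is not taken from any certified system.
    """
    disagreement = abs(left_deg - right_deg)
    return disagreement <= max_disagreement_deg, disagreement


# An offset of about 20 degrees, like the one described above, fails the check.
print(aoa_cross_check(25.0, 5.0))
```

A consumer of the AOA signal would treat an invalid result as a reason to stand down and raise a fault, rather than act on either value.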
________________________________

In the pioneering days of aviation, a pilot could tie a thread, a string, or even a lock of some beloved's hair to a strut connecting the upper and lower wings of a biplane, and a quick glance would give an adequate picture of angle-of-attack.

That old-fashioned sensor was probably more reliable than its high-tech successors.
_________________________________

I share the perplexity expressed by so many, that the MCAS was designed to use only one sensor as its input.

There's some chance that an NTSB report, or perhaps a congressional hearing, will bring to light how that decision was made. Whatever rationale was offered at the time, must seem horribly foolish now.

Doug SelsamApril 24, 2019 3:08 PM

Sounds like the new GE engine actually required a new airplane to be designed, rather than modifying the existing 737. The "software problem" was trying to use software to "fix" an inherently unstable aircraft. Seems like a bean-counter issue. I don't think there's any substitute for designing an inherently stable aircraft. This sounds like what happens when the bean-counters start to practice engineering.

Jim WApril 24, 2019 6:09 PM

@Clive Robinson:

Small point but I believe you mean to say that a rudder "well forward" of a propeller is unstable. The rudder immediately forward of the propeller on a submarine is not unstable.

Clive RobinsonApril 24, 2019 8:06 PM

@ Jim W,

I believe you mean to say that a rudder "well forward" of a propeller is unstable.

It's the reason I said I did not want to get into stability criteria.

It's to do with the angle of attack and the effective point around which the vessel turns with regards to the assumed parallel stream of water.

If the rudder is in front of the effective pivot point then when the rudder turns the water flow in effect acts as positive feedback. So one degree of turn of the rudder with respect to the vessel generates a force that effectively pushes the rudder further across the stream, thus putting it at an ever-increasing angle of attack. The opposite occurs when the rudder is behind the effective pivot point: you get a negative feedback effect. If the rudder is turned one degree with respect to the vessel, as the vessel turns the rudder comes back into line with the stream, thus the force on the rudder decreases.
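
The difference between the two cases can be illustrated with a deliberately crude toy model, in which the rate of change of the deflection angle is simply proportional to the angle itself. The linear form, the gain values, and the function name are assumptions for illustration; real hull and rudder dynamics are far more complicated.

```python
def simulate_deflection(gain: float, initial: float = 1.0,
                        dt: float = 0.1, steps: int = 50) -> list[float]:
    """Euler integration of d(angle)/dt = gain * angle.

    gain > 0 stands in for the rudder ahead of the pivot point (positive
    feedback); gain < 0 stands in for the rudder behind it (negative feedback).
    """
    angle = initial
    history = [angle]
    for _ in range(steps):
        angle += gain * angle * dt
        history.append(angle)
    return history


ahead = simulate_deflection(gain=+1.0)   # one degree grows without bound
behind = simulate_deflection(gain=-1.0)  # one degree decays back toward zero
print(f"ahead of pivot: {ahead[-1]:.1f}, behind pivot: {behind[-1]:.4f}")
```

In the positive-gain case a one degree disturbance keeps growing, while in the negative-gain case it dies away, which is the distinction between the unstable and the inherently stable arrangement.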

This raises the question of where the effective pivot point is, and the answer is it moves as the vessel turns. And can quite quickly move sternwards towards if not beyond the point of propulsion (think sail boats where the point of propulsion is about one third back from the bows and slightly forward of the center/dagger board etc). As long as the rudder remains behind the pivot point you will not have positive feedback. However having the pivot point move onto the point of propulsion can become chaotic, depending on a number of factors, which could easily fill a book each.

Thus the general rule of thumb is, keep the rudder behind the point of propulsion as that is inherently stable, and the point of propulsion well away from the pivot point range as that prevents cusp points causing chaotic behaviour.

You can build a simple physical model with a disk that rotates freely around its center. On the edge of the disk you add a vertical post that can be used to easily turn the disk. Put two rubber bands on the post and stretch them out at roughly 60 degrees, pulling the disk post towards you. This forms a nice stable arrangement. If you pull the rubber bands upwards whilst keeping them tensioned, things remain stable until you reach the line of the post; after that you get chaotic behaviour until you have passed the line where the post is on the other side of the now rotated disk. If you are lucky the transition of the post from one stable position to the other will not be too "lively"; if you are unlucky the post can remain balanced on the cusp until some very very tiny imbalance occurs and it snaps round rather violently.

When you think about it a little you will see why I wanted to avoid talking about the stability criteria, as it's very difficult to generalize in the non-inherently stable case.

But as I noted, such "lively" behaviour can be an advantage, as is the case with a submarine and other naval vessels: it enables faster response to the helm, especially if you can deal with the nonlinear dynamic effects of broaching.

Back many years ago I was a guest on the largest oil tanker in the UK registry, and the captain and myself were enjoying a quiet drink and chat as we both enjoyed small boat sailing in our spare time. The subject got around to stopping distance, and I remarked that it was often quoted as several miles for large vessels. To which the reply came, "Yes in straight lines..." at which point the look of realisation on my face raised a smile on his face and he said "Yes, the sides of this ship do make quite a good brake".

WeatherApril 24, 2019 10:24 PM

Unlike fighter jets, which are designed to be unstable to give high maneuverability, airliners are stable.
What it sounds like happened is that the aircraft thought the airspeed was low, so it applied flaps, which give high lift from slow airflow but make more drag; instead of relying on pressure differences to keep the aircraft flying, flaps direct airflow down.
Trim is a setting that you adjust to keep level flight with no hands on the stick.
At full flaps, plus the drag force at high velocity (^3), it braked the aircraft, which: one, tried to increase flaps but couldn't; two, stalled; three, as above, the engine weight plus down flaps forced the nose down.

marksyApril 25, 2019 4:10 AM

The New York Times is taking an interest and has been speaking with more than a dozen current and former Boeing employees. They've also spoken to attorneys representing multiple whistle-blowers working on the factory floor who have been trying to raise their concerns for a long time.

Podcast put out by NYT this week:

The Whistle-Blowers at Boeing: After two crashes of Boeing 737 Max jets, regulators and lawmakers began asking whether competitive pressure may have led the company to miss safety risks, like an anti-stall system that played a role in both crashes. In reporting that story, our colleagues began to look into whether the problems extended beyond the 737 Max.

Link: The Whistle-Blowers at Boeing (New York Times)

This doesn't directly address software issues on 737, but raises issues about quality control on the factory floor for 787 Dreamliner. Includes an interview with a Quality Control Manager from the Charleston plant.

Make sure you are sitting down before you listen to this...

Etienne MathieuApril 25, 2019 11:41 AM

First of all, any software system that is in the flight control loop must have redundancy in its sensors. Second, other sensors must correlate.

That is, if the plane is approaching Mach 1.0 (as in the Ethiopian disaster, where the pilots set take-off power manually and never again touched the throttles), in level flight, with the "overspeed Alarm sounding" then the chances of the aircraft approaching a stall are zero. Regardless of the AOA sensor data.

The only way a plane can be approaching a stall at Mach 1.0 is if it is going straight up.

In that case the rate of climb sensor and the rate of velocity sensors would be approaching a tangent, or shall we say shows signs of a ballistic curve.

The Ethiopians killed themselves with the throttle. The MCAS system just made sure they were reduced to size pieces that the squirrels were able to carry away.

Security SamApril 25, 2019 12:09 PM

Per Seattle Times failure analysis
The system went into total paralysis
With a single input to the controller
The software response became bipolar.

1&1~=UmmApril 25, 2019 12:57 PM

@Etienne Mathieu: "First of all, any software system that is in the flight control loop must have redundancy in its sensors. Second, other sensors must correlate."

Which apparently the MCAS software did not have at all...

AlanSApril 25, 2019 1:23 PM

@Alyer Babtu (and Gregory Travis, Clive)

'Better' presentation doesn't make the issue of "technical uncertainty" disappear. You still have to decide the question of whether an event is really the same as something that went before, that is: is it normal or is it something new, an anomaly? That is whether one can assimilate it into the "normal practice" of the community or if it is an event that requires something other than "normal practice". Communities of practice, like most communities, tend to be conservative so the inclination is to try to assimilate new events to existing practice.

Styles of presentation might have persuaded people one way or the other as part of the overall interpretive work associated with on-going events but it is only with hindsight that we have certainty that an event was an anomaly or not and we know whether the interpretive work arrived at the "proper evaluation and understanding of risks".

Here's the full quote from John Downer's 2010 paper: Anatomy of a Disaster: Why Some Accidents Are Unavoidable:

Disaster Studies, as Pinch (1991: 155) puts it, needs to ‘bite the bullet of technical uncertainty’. Turner recognized that engineering knowledge is based in simplifications and interpretations, (Turner 1976: 379; Weick 1998: 73), but assumed that these simplifications ‘masked’ warning signals. The truth is more subtle, however. It is not that simplifications ‘mask’ warning signals, as Turner suggests, but that -- on a deep epistemological level -- there need be nothing that makes ‘warning signals’ distinguishable from the messy reality of normal technological practice. In other words: that there might be nothing to mask. If it is impossible to completely and objectively ‘know’ complex machines (or their behaviors), then the idea of ‘failing’ technologies as ontologically deviant or distinct from ‘functioning’ technologies is necessarily an illusion of hindsight. There is no inherent pathology to failure, and no perfect method of separating ‘flawed’ from ‘functional’ technologies.

Note that there are lots of technologies that persons in a practice community would agree were "flawed". The argument is that there are some aspects at the edge where there is uncertainty. This is basically a Kuhnian argument, which takes many of its major insights from Wittgensteinian Ordinary Language philosophy.

So looking at the 737 Max from this perspective some thoughts:

So one possibility is that there was a consensus in the various technical communities involved in the design of the aircraft that the design clearly violated the norms of their existing practice (that it "violated ..aviation canons"). Then I think you have a case for culpability of some type: e.g. lots of people were saying this is technically risky but were over-ruled.

Another possibility is the one above that the design community didn't see it. The new 737 Max was seen as a valid elaboration of the 737 problem solution (like the earlier generations) and consistent with the 737 philosophy. That is to say the 737 Max wasn't something new it was a 737. If you accept the culpability argument then you'll see this as "a myth" that people went out of their way to "preserve". But there's a possibility here that people at Boeing really did see it as "just a 737".

Another possible reading is that the design of the 737 involves lots of technical communities with their own practices. There is some of this in the Gregory Travis account. He comments of the software design: "what toxic combination of inexperience, hubris, or lack of cultural understanding". Were the software developers inexperienced with the requirements of aero-engineering? Were there failures of communication between other professions on the engineering and design team? Outsiders that aren't socialized in the practices of a technical community are often a problem. This also comes out in Clive's link to the NYT article on the 787 where there is a discussion of problems arising in Boeing's new plant in SC because they don't want the non-union staff 'corrupted' by the union staff in Seattle. But while this prevents socialization into the culture of unionization it also prevents socialization in the skills, knowledge and technical practices of the established Boeing community.

Outsiders can also be useful. One of the things Kuhn points out is that outsiders see things differently and apply different skills to develop a solution. If you look at a lot of technical and scientific innovations they often come from outsiders. Think 20th C. molecular biology. It was biology done by mathematicians, physicists and chemists. Competitive pressures probably prevent this but if someone socialized in the design philosophy of Airbus (very different according to Travis) had been a fly on the wall at Boeing they might well have seen it very differently. Where a Boeing employee might see normal practice; someone steeped in the culture of Airbus might see something very different. Sometimes the fish needs the perspective of the bird to understand water.

lurkerApril 25, 2019 4:55 PM

The hardware guy in me asks, if the new engines are mounted differently, such that the -empty- CoG has moved, how much is it allowed to move before recertification is required? And ditto for the new engines thrust angle profiles...

MarkHApril 25, 2019 6:09 PM

@Lurker:

I don't know what the thresholds are for various levels of scrutiny in certification -- it's a good question.

However, every new variant -- including the 737-8 "MAX" -- requires its own type certification.

The -8 was certified by the US FAA on 8 March 2017.

Transport jets have an enormous longitudinal range of CG. To put it simply, airliners need a big CG range (plus other trim capacity) because of their technical characteristics; and the tiltable horizontal stabilizer satisfies that need, making the jet airliner practical.

Probably the CG shift from the LEAP engines falls easily inside the airframe's capacity.

JamesSamFlameApril 25, 2019 6:22 PM

@jj @travis
“At that time, I could not construct a narrative in my head that explained how anyone, at any level from most junior programmer to program manager, could have designed a system that took only a single unreliable sensor as input and used that input to configure a commercial airliner in a way that rendered it uncontrollable by its pilots. That is for one simple explanation and that was systemic incompetence. Also known as stupidity run head-on into a lack of understanding of history, of any kind.
Since then my view has changed. As more information comes to light, it is now becoming clear to me that the decision to use a single input to MCAS was a deliberate design decision driven by the non-negotiable requirement that the 737MAX not require pilot training. “

I never discussed my flight software design and testing experience for security reasons. However here I’m compelled to speak up:
Flight software critical components are tested extensively from unit up through several levels of system integration. Then, after many years of development, finally on heavily instrumented aircraft test flights.
I can’t believe that the automated unit/daily/development/integration/system flight certified MCAS Test Case Suite would NOT have caught a simple sensor malfunction.

There is no logical way in hell that a SINGLE critical sensor should be allowed in ANY aircraft design as flight control systems are both redundant and independent. This includes both software and hardware. The summed outputs are then compared in real-time @15 up to approx 240Hz.

Surely there are MCAS program test cases to feed the FCS off-nominal/bad sensor data. Repeat for two sensors both going bad. If two independent, safety critical sensors are BOTH declared bad then the aircraft is put into a highly restricted safe flight mode to fly home. The automated warning and control system would also alert the pilots.
Even with Boeing’s incompetent design the single critical failure would DISABLE the MCAS and let the pilots fly. But the MCAS system was a secret, deliberately kept from the pilots. You can’t send them an alert because they were kept ignorant of the MCAS safety system!
So the design was crippled from the beginning. Then not fixed even after the first crash and loss of hundreds of innocent lives.
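
The off-nominal cases described above lend themselves to a small table-driven test. The sketch below is hypothetical throughout: the mode names, the `select_mode` function, and the two-sensor setup are invented to show the shape of such a test, not taken from any flight control system.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"           # one sensor isolated, pilots alerted
    SAFE = "restricted safe mode"   # both sensors bad, pilots alerted


def select_mode(sensor_ok: list[bool]) -> tuple[Mode, bool]:
    """Return (mode, alert_pilots) from per-sensor health flags."""
    failed = sensor_ok.count(False)
    if failed == 0:
        return Mode.NORMAL, False
    if failed == 1:
        return Mode.DEGRADED, True
    return Mode.SAFE, True


# Off-nominal test cases: no failure, one sensor bad, both sensors bad.
CASES = [
    ([True, True], (Mode.NORMAL, False)),
    ([False, True], (Mode.DEGRADED, True)),
    ([True, False], (Mode.DEGRADED, True)),
    ([False, False], (Mode.SAFE, True)),
]

for flags, expected in CASES:
    assert select_mode(flags) == expected, (flags, expected)
print("all off-nominal cases pass")
```

Every failure case in the table produces a pilot alert, which presupposes that the crew has been told the system exists.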

An off-duty pilot prevented the first aircraft from crashing on its previous flight. Yet no one was notified or cared.[1]

Both Boeing and FAA heads SHOULD have rolled after the first horrific crash. But corporate goals rule over the government.

[1] Human life is cheapened by technology. Maybe the engineers were distracted by their alluringly sweet 2,600 daily ‘smart’ phone notifications?

Clive RobinsonApril 25, 2019 7:33 PM

@ lurker,

If you think about it, it was not just the Center of Gravity (CoG) or angle of thrust that was affected by the new engines.

The increased size of the engine would have affected drag, and also the lift from the wings.

As we've also been told at certain angles the engine profile was such that it actually created lift as well. Thus creating a rather nonlinear response, which is not something either pilots or software developers want to deal with.

To be honest, the more we find out the more questions come to mind and the greater the disbelief at the chain of events that is being portrayed...

@ AlanS,

Whatever the culture really is at Boeing, I think it would be fair to say that two things have become increasingly clear:

1, Management have become at best shortsighted and apparently focused incorrectly for the company to survive.

2, The result of 1 is that design and production methods are adversely effected to the point of becoming a toxic liability.

As for the FAA, god alone knows what is going on there, you get the strong sense that the lights are on but everyone has left the building.

I guess it's now down to the NTSB to dig into both organisations, and hopefully come up with a report that brings out clearly the chain of failings in both Boeing and the FAA such that this type of tragedy does not happen again.

MarkHApril 26, 2019 3:25 AM

In my judgment, Clive's observations about the problems at Boeing, and the role of the NTSB are spot-on.

In addition, Congress is likely to conduct hearings, which are not likely to add new information about what went wrong, but may well be helpful in shaping policies to do better in the future.
______________________________

The question of the FAA is a little obscure, and not easy to resolve.

I found it very heavy going to even look up statistics about the FAA, and what I have derived might be wrong by a large factor. With that proviso, by my reading the FAA's Certification Service had about 1350 employees at the time the -8 MAX was certified.

Because the Certification Service has many responsibilities, I guess that the number of staff involved in the certification of new aircraft types may be as few as a couple of hundred.

If the process is anything like examples I've seen in other domains, when a manufacturer applies to certify a new type, it spews an avalanche (blizzard, tsunami, insert your favorite metaphor here) of technical documentation.

Review of that application might be the responsibility of only a few dozen FAA engineers. Simply, the regulators are at a vast numbers disadvantage relative to the industry, as is often the case.

Within that sea of paperwork, the addition of a small amount of code to the Flight Control Computer (possibly a very small fraction of a percent of the existing code) might not have stood out as a matter calling for particular regulatory scrutiny.

In the days before everything was done in software, a measure like MCAS would have taken the form of a new avionics box, or a significant hardware change to some existing box. I think that would have been much more likely to catch regulatory attention.

Another thing working against the odds of regulators catching such a defect, is that Boeing and Airbus get safety "right" almost all of the time, which is likely to lull people into placing excessive trust in their representations. This is a form of the availability heuristic I mentioned above ...
______________________________

The essence of the way I'm understanding the situation, is that it's going to be quite difficult for the FAA (or other certifying authorities) to reliably detect such failures of safety.

I'm sure that they will do their best, and plenty of thought will be given to what kind of reforms are appropriate.

As I mentioned earlier, safety has been so excellent because practically everyone has a stake in it. Perhaps the best guarantee against a recurrence of this kind of clusterf#ck, will be the response inside Boeing.

The economic cost to Boeing of this scandal will be very heavy, and I think the psychological burden as well.

Where I most see an analogy to the space shuttle Challenger situation, is that almost certainly, nobody thought "we're going to up the risk of a disaster in order to score some points."

Rather, what probably happened in both cases, is that in the quest to attain economic targets, people made decisions with safety consequences they failed to understand.

What will be effective means to guard against that?

william a. lynnApril 26, 2019 4:49 PM

@JamesSamFlame @jj @travis

"Yet no one was notified or cared[1]" This is not correct.

To the contrary, the entire 737 community was informed by Emergency Airworthiness Directive 2018-23-51, from the FAA on November 7, 2018, (minor correction December 6, 2018) which described the indications caused by the AOA failure, and what corrective action to take. This Emergency AD would have arrived in the operations office of every 737 operator on the planet within 24 hrs.

Dispensing with all the bureaucratese, here's the meat of the directive:
-------------------------------------
RUNAWAY STABILIZER

Disengage autopilot and control airplane pitch attitude with control column and main electric trim as required. If relaxing the column causes the trim to move, set stabilizer trim switches to CUTOUT. If runaway continues, hold the stabilizer trim wheel against rotation and trim the airplane manually.


Note: The 737-8/-9 uses a Flight Control Computer command of pitch trim to improve longitudinal handling characteristics. In the event of erroneous Angle of Attack (AOA) input, the pitch trim system can trim the stabilizer nose down in increments lasting up to 10 seconds.
In the event an un-commanded nose down stabilizer trim is experienced on the 737-8/-9, in conjunction with one or more of the indications or effects listed below, do the existing AFM Runaway Stabilizer procedure above, ensuring that the STAB TRIM CUTOUT switches are set to CUTOUT and stay in the CUTOUT position for the remainder of the flight.
An erroneous AOA input can cause some or all of the following indications and effects:
• Continuous or intermittent stick shaker on the affected side only.
• Minimum speed bar (red and black) on the affected side only.
• Increasing nose down control forces.
• IAS DISAGREE alert.
• ALT DISAGREE alert.
• AOA DISAGREE alert (if the option is installed).
• FEEL DIFF PRESS light.
• Autopilot may disengage.
• Inability to engage autopilot.
Initially, higher control forces may be needed to overcome any stabilizer nose down trim already applied. Electric stabilizer trim can be used to neutralize control column pitch forces before moving the STAB TRIM CUTOUT switches to CUTOUT. Manual stabilizer trim can be used before and after the STAB TRIM CUTOUT switches are moved to CUTOUT.
----------------------------------------

Crew awareness of the symptoms of and corrective action for an AOA failure/runaway stabilizer would have saved the Ethiopian crew and passengers. A runaway stabilizer can be caused by several failures, not only the MCAS, but the symptoms are the same, as is the corrective action: STAB TRIM CUTOUT switches to CUTOUT, followed by manual trim.

AlanSApril 26, 2019 6:35 PM

@Clive

Yes, I have no real idea what's going on at Boeing. I am more interested in the general sociological and cultural processes within organizations that lead to these types of failures and how they are accounted for after the fact. The epistemic processes by which knowledge of complex systems evolves is such that not all accidents are foreseeable and preventable. This calls into question the assumption that someone is always objectively culpable when a technical system fails. It's worth reading the Downer paper which I link to above, which is an analysis of another 737 crash and references a lot of the sociological literature on disaster.

Lawrence D’OliveiroApril 26, 2019 7:44 PM

By the way, the article seems to have fallen behind their paywall. Luckily, archive.org has a copy here.

CarpetCatApril 26, 2019 11:13 PM

I seem to have a vague memory that the 2nd crash did indeed have CUTOUT properly toggled. And the MCAS reactivated anyway. Maybe cutout doesn't mean what it used to. Maybe it's just another input to the code.

MarkHApril 27, 2019 4:26 AM

Ye Gods

@CarpetCat:

Because MCAS is pure software -- essentially, a process added to the Flight Control Computer software -- the cut-out switch is necessarily an input to the software.

From a brief CNN article, I just learned that on 5 April (one day after the Ethiopian government released a preliminary report on the second MAX crash), four Boeing employees called an FAA whistle-blower hotline to report problems of which they were aware.

These reports included "concerns about the MCAS control cut-out switches" ...

The scandal only gets worse and worse. Unless there is some physical problem with the switches or wiring, a failure of the cut-out switch to work would be a software defect. Horrible!
__________________________

Among the whistle-blower reports is "a previously unreported issue involving damage to the wiring of the angle of attack sensor by a foreign object."
__________________________

Here are official responses, to questions and criticism of the bizarre decision to use only one AOA input for MCAS:

Boeing: "The single angle of attack sensor was considered in relation to a variety of other factors, specifically well-known pilot procedures that would mitigate the effects of a failure. MCAS design, certification tests, and cockpit procedures were evaluated using a standard industry approach to failure analysis."

FAA: "Safety is FAA's top priority, and we have a longstanding well-established aircraft certification process that has consistently produced safe aircraft. When certifying an aircraft, we do not consider a single factor in isolation. Rather, we look at the interaction of all elements and systems, in addition to human and other external factors."

What a mess ... and so many people dead.

Clive RobinsonApril 27, 2019 8:09 AM

@ AlanS,

The epistemic processes by which knowledge of complex systems evolves is such that not all accidents are foreseeable and preventable. This calls into question the assumption that someone is always objectively culpable when a technical system fails.

Unfortunately that's not how the law sees things and there is good historic reason for that.

When I was a much younger engineer than I am these days, I had a boss who regularly pointed out,

    Where there is a claim, there is blame

It used to irritate, but as I got a little more worldly wise (Piper Alpha) I saw the wisdom of not just the expression but the implications both legally and morally.

And for a time that I'm sure you can remember people genuinely tried to and in some cases succeeded in claiming against fast food outlets for supplying coffee that was too hot etc. They were ridiculed by late night comedians; we were effectively encouraged to laugh at careless old ladies who couldn't hold their drink etc by the Corporate Interests, who now lobby the drive to reduce or remove tort laws.

The problem is that the coffee from at least one fast food outlet was too hot, way too hot, and they knew it[1]. At around 90 degrees C, it was at best only just off the boil when it came out of the machine into well insulated disposable cups, thus your hand gave you no warning as to how hot the contents were. It was too hot by at least 50 C [2]; it could and did create second and third degree burns faster than most people could react. Thus injury was essentially guaranteed and the fast food chain knew it, and knew it well, because it had hundreds if not thousands of complaints and many people had been injured. Their official corporate policy was to deny, threaten and shame.

I once stated on this blog that really there is no such thing as an accident or act of god, because they are all predictable under the laws of nature.

It's really a question of knowledge and time. You can with knowledge and sufficient time dodge the bullet or falling bomb. We have actually designed machines that can detect incoming fire, not just its direction and elevation but velocity and likely target point, and feed this to rapid autolaying systems to fire back at either the gun that fired or the bomb that's falling, or in the case of prototype designs for fighter aircraft to get out of the way.

The problem can thus be phrased in terms of "knowledge and time" and what would be considered "reasonable" to address the issue. Legislation also recognizes that you might not have time to respond, thus the requirement for "liability insurance". That is, whilst you might be regarded as behaving reasonably and in a timely fashion you are still legally liable and thus have to pay.

People might not think it fair that they can be found effectively not responsible but liable when there is an injury. But whilst they might not have had knowledge (1960's asbestos) the person who was injured had less knowledge. Thus the old idea about an acorn falling close to the tree is effectively how the tort law works. There is well over a thousand years of tort law which had settled on this as a working system. At times it can appear quite bizarre (multiple car "shunt" in adverse weather) but at the end of the day it actually works out as about the best solution we can get overall, by no means perfect but the least costly to society in general, or that's how it used to be.

The problem that many see is the associated costs, and the worst offenders are corporates in this regard; they will spend millions defending a case and when they lose they whinge about the unfairness of the legal system to those legislators daft / bribed enough to listen. What you don't hear about is the other hundreds if not thousands of people injured who are deterred from seeking compensation because corporate lawyers make it very clear they will bankrupt them rather than pay what the law says they should.

The result was, as the UK NHS found, that whilst actual legal compensation was relatively small the legal costs were ruinously expensive[3].

The real problem after that of "knowledge and time" is the ruthless gaming of the legal system which has developed over the last century. Where the actual victim is effectively now just a pawn to legal ambitions and wealth.

For some reason we are encouraged to think that those working for victims are no good "ambulance chasers" whilst their corporate and government brethren who actually cause most of the problems are somehow heroes "Fighting for justice" rather than their own greed and naked ambition.

Thus behind those gaming the system is "reputation and money". Boeing are looking at what some might say is an "existential threat" over what the MCAS --allegedly-- has done. Just in compensation for lives lost they are looking at a billion or so, especially "if fault is found". But that is probably the least of their problems. As others have noted Boeing has a reputation not just within their --manufacturing-- industry but their customers' --airline-- industry and their customers' customers, the fare paying passengers who are ultimately paying not just for the airlines and Boeing but all the associated costs of employees and services. If Boeing go out of business (possible) or reduce or stop manufacturing (probable) then the knock on effect is going to be immense. You are potentially going to see more places like "Downtown Detroit" and worse springing up.

The easy way to damage businesses in the airline industry is with a catchy phrase line, as British Airways discovered with "Fly the flag and lose a bag". Their "good name" and the premium it allowed them got trashed and other "no frills" airlines went in for the carrion to later backdoor the prices up (EasyJet, RyanAir etc). If someone comes up with a phrase like "Boeing bones you daily" or similar that catches on then passengers will vote with their pockets and airlines will not just vote with their pockets, but with multiple law suits etc.

Thus the MCAS, which increasingly sounds like a "three bob bodge"[4], is going to cost an almost unguessable multiple of its cost. Not just to Boeing but the US and world economy, as airline costs will rise one way or other because of it.

Now Boeing has "strategic significance" to US National Security as well as several other countries, which raises the spectre of "Too big to fail" that we saw with the inflation creating banking industry, just a few years back.

Potentially Boeing going under will turn several chunks of the US into an industrial wasteland, in turn bringing down whole regional economies. The final result could well be not unlike that of a warzone in a defeated country...

As was once remarked of mining towns where the mine collapsed the town became "Fit only for tumbleweed and gunlaw", or in more modern parlance "weed growth" and "high street crime", with selling drugs or guard labour being the only way to survive.

So for very many the stakes on this particular game are very high, and you can be sure that the sharks are already circling, not just in the US but all over the world.

As my son has a habit of saying, "That's just how it be bro".

Which leaves the question of,

    What's to be done and by whom?

I suspect the USG is going to have to "step in" somehow, without making it too obvious, to lessen both the impact and cost of securing National Security.

[1] https://www.treehugger.com/corporate-responsibility/truth-behind-mcdonalds-hot-coffee-lawsuit.html

[2] Protein is effectively rendered useless at around 40 degrees C; it's why hospitals go into fairly major responses when you have a fever that high. One biological sign is your body goes into uncontrollable muscle spasms, you lose control of your bodily functions and you twitch like a 1990's Break Dancer. I know because only a little while ago I got sepsis and it happened to me...

[3] The English (UK) legal system, from which the US legal system derived in many respects, differs in the way the tort process is settled. In the UK you can't claim for "hurt feelings", only actual harm, and legal costs are awarded at the conclusion of the court case. In the US legal costs are generally borne by the parties. This has a "chilling effect" that the US Government and Corporates ruthlessly exploit as a method of "rights stripping".

[4] In the UK a "bob" is a nickname for a shilling, with an equivalent value these days of maybe 4 cents; thus a "three bob bodge" is a job done not just very cheaply but very badly as well.

MarkHApril 27, 2019 10:05 AM

@AlanS, Clive:

"...not all accidents are foreseeable and preventable"

That's surely true, and I always worry in my own (very modest) engineering work about the problems which I fail to anticipate, and therefore are unknown to me.

That being said, the unforeseen should have little role in super-intensively engineered systems like modern airliners.

It was eye-opening for me, that both of the US space shuttle disasters resulted from causes people had identified and worried about. I expected that considering the experimental nature of the system, its extreme stresses, and very small number of flights, that it was at great risk for failing in ways that hadn't been foreseen ... but that's not what killed the astronauts.
_____________________________

In the present Boeing scandal, I'm sure that no one foresaw the fatal evolutions in which MCAS kept iterating pitch-downs numerous times against pilots who either (a) were baffled by the airplane's behavior, or (b) identified and responded correctly, but were unable to save the situation.

Nobody WANTED these crashes to occur. If Boeing had foreseen such scenarios, it would surely have done something different.

That they failed to foresee the danger in the first instance, would seem to be a lack of diligence in characterization and analysis of the system's behavior.

But then Boeing compounded the failure (I would say mathematically squared rather than doubled) by inadequate response to the first MAX crash.

By the time the Lion Air flight recorder data was available, Boeing management should have been in a state of Maximum Red Alert. Somehow, they still failed to understand what a toxic brew their engineering process had inadvertently created. This kind of failure of cognition and judgment is regrettably not rare, but in this instance dreadful and tragic.
_____________________________

@Clive:

Your comment on the word "accident" reminded me of a crash investigation specialist I saw on a television program, who said something along the lines of "we don't use the word 'accident,' because what happens in these situations is the result of decisions and actions people took."

Clive RobinsonApril 27, 2019 7:08 PM

@ Mark H,

we don't use the word 'accident,'

I don't know if you have seen the film "Hot Fuzz" or not.

But there is a thirty second explanation of why the Met Police say "Traffic Incident" not "Traffic Accident" as accident implies no fault. On speaking to someone I know who worked for both the Met Police and The Police Federation apparently it's "factually correct".

With regards "decisions" and "actions" at root they are based on "knowledge" and "time", insufficient of either will in effect increase the probability of an "incident" happening.

As you probably know from aviation, sometimes the correct action is counter intuitive. Things like steering into a swerve/skid to try to regain control in a car, and I'm told the actions required to get out of a tailspin are likewise non-intuitive.

MarkHApril 28, 2019 2:03 AM

@Clive:

I remember my oldest friend, who used to fly for pleasure, recounting to me her flight instructor giving her practice in spin recovery.

Her overwhelming sensory impression, was of the windscreen "filled with green" -- her view was all earth, and no sky.

The reflexive* impulses in that moment are to pull back on the yoke, and to turn it opposite the plane's roll angle: both wrong actions which could aggravate the spin into a crash.

As you suggest, to do the right thing in that high-stress situation it's necessary to counter those adrenaline powered impulses.
____________________

I'm always impressed by displays of presence of mind in crisis situations; it's a capacity that doesn't come easily for me.

Most fans of single-seater open wheel racing know that holding the steering wheel in a crash can cause serious hand injuries: with no bodywork to shield the front wheels, part of the crash force may be transmitted via the steering gear to the wheel, with great violence.

I admire the young drivers I've seen on in-car video footage, who took their hands away from the steering wheel just before impact. It's hard to imagine myself remembering to do that.

One driver grabbed the sides of his helmet, which seemed a good technique, in the sense of being more like an instinctive response to "OMG I'm going into the wall!"
____________________

* More precisely, the learned responses which would be appropriate in normal flight regimes.

Clive RobinsonApril 28, 2019 4:36 AM

@ MarkH,

As you suggest, to do the right thing in that high-stress situation it's necessary to counter those adrenaline powered impulses.

Yes, but importantly to have not just the knowledge that the impulses are wrong, but more importantly the right knowledge about what to do.

Tail spins in the early years of flying were always fatal or plane wrecking crashes. There was a myth that there was nothing that could be done.

Then one day an RN pilot who had sufficient height when he got into a tail spin, out of desperation, did what would appear to be the wrong thing. He never really said why but it got both him and the airplane out in one piece. Strangely the news was slow to spread at first, I guess because the myth was so strong. But eventually a year or so later it started to become part of standard training.

One of the things that strangely came out of it is "Parachutes for light aircraft"...

But the important thing was pilots got given the "knowledge" of how to get out of a tailspin if the plane could do so, and by training the extra little bit of "time" to put it in practice, so that even if the plane can't get out of a spin at least some semblance of control to mitigate the crash was possible.

But the training which gave the knowledge, gave other knowledge about the real "how and why" of the stalls that give rise to spins such that in the main pilots can avoid them almost always.

It's why "knowledge" and "time" are the key factors in preventing the mistakes that become tragedies.

Sancho_PApril 28, 2019 10:42 AM

I thought there are two separate units, left and right, to control the aircraft?
Why did the FAA directive re AOA failure deal with difficult countermeasures but not suggest to switch control over to the co side?
Is the trim control singular?

MarkHApril 28, 2019 12:54 PM

@Sancho:

According to what I've read, the software selects which sensor at power-up. There's no control by which pilots can govern or change the selection.

If there were, the Ethiopian flight would have survived.

Sancho_PApril 28, 2019 5:41 PM

@MarkH

”According to what I've read, the software selects which sensor at power-up.”

Um, that would be very strange - having access to both sensors and probably select the misaligned one without checking any deviation between them?
But what I meant:
I’ve read somewhere about pilot and copilot having a completely independent set of instruments (e.g. AOA) and actuators, one is selected to be in charge of control, the other is back up?
So I thought there would be two MCAS systems on board, too - my bad.
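
The deviation check suggested above could look something like the sketch below. It is not how the 737 flight control software works: the function name `aoa_for_control` and the 5-degree threshold are assumptions chosen only to illustrate comparing the two vanes before acting on either.

```python
from typing import Optional


def aoa_for_control(left_deg: float, right_deg: float,
                    max_disagreement_deg: float = 5.0) -> Optional[float]:
    """Return an AOA value to act on, or None if the two sensors disagree.

    The threshold is a made-up illustrative number, not a certified limit.
    """
    if abs(left_deg - right_deg) > max_disagreement_deg:
        # Disagreement: inhibit automatic trim and leave it to the crew.
        return None
    # Sensors agree closely enough: use their average.
    return (left_deg + right_deg) / 2.0


print(aoa_for_control(2.0, 3.0))   # sensors agree, a value is returned
print(aoa_for_control(22.0, 2.0))  # large offset on one side, None is returned
```

A check of this kind needs both readings on every cycle, which is exactly what a one-time selection of a single sensor at power-up cannot provide.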

Thanks for the link to the graph. The huge initial AOA value (left) should have prevented them from taking off (checklist?) and should have automatically disabled the MCAS.

”… what would cause the sensor to accurately track changes of AOA while presenting such a large offset …”
When e.g. the coupling between flap and electrical sensor (potentiometer) is loose (esp. without airspeed)? Might be a very simple construction from the early days of the 737, likely with a mechanical zero setting.

MarkHApril 28, 2019 5:56 PM

A Last Observation On Instability

Many commentators, including the author of the article, have made assertions equivalent to this:

"Building a plane with inherent instability [in the case of the MAX, the engine-related pitch-up tendency] was in itself unsafe."

In other words, although the compensatory system for the instability was grossly flawed, the decision to tolerate the instability in the first instance was the "original sin" of the MAX disasters.

I've pushed back on this a few times, but now want to explain more specifically.

Almost all jet airliners ever flown have, inherent in their aerodynamic characteristics, the tendency to pitch down near the upper limit of their operating speed range. (Note: most of their flight time is within the speed range in which this pitch-down tendency manifests.)

Further, this nose-down moment increases with increasing airspeed.

It's easy to see that if not attended to by the pilots, this pitch-down tendency could start an unintended descent, leading to increased airspeed and even greater pitch-down. Left unchecked, such an evolution would quickly lead to disaster.

So let's compare this situation, to the MAX engine installation.

1. Is it a deviation from ideal handling characteristics?

Yes for both

2. Does it tend to shift the plane away from the desired flight regime?

Yes for both

3. Is it divergent (does its effect dangerously magnify as it proceeds)?

Yes for both

4. Could alert pilots readily control the instability?

Yes for both

5. For the purposes of decreasing pilot workload and reducing the risk of flight approaching safe operating limits, is there an automatic compensation system built into the flight control system?

Yes for both

6. Does the automatic compensation system move the extremely powerful -- and therefore potentially dangerous -- horizontal stabilizer?

Yes for both
___________________________

For about 60 years, almost every jet airliner has flown with (a) this high-speed pitch instability, and (b) a "Mach trim" system to counteract it.
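
The idea of a divergent tendency held in check by automatic trim can be illustrated with a toy feedback model. This is not an aerodynamic simulation: the function `simulate`, the single "speed excess" state and the gain values are all invented for illustration. Each step, the excess grows in proportion to itself (the divergence) and the automatic trim removes a proportion of it.

```python
def simulate(initial_excess: float, steps: int,
             divergence: float = 0.2, trim_gain: float = 0.0) -> list:
    """Toy positive-feedback loop with an optional compensating trim term.

    All numbers are hypothetical; only the qualitative behavior matters.
    """
    excess = initial_excess
    history = [excess]
    for _ in range(steps):
        # Growth from the instability minus correction from automatic trim.
        excess += (divergence - trim_gain) * excess
        history.append(excess)
    return history


unattended = simulate(1.0, 10)                  # no compensation: excess grows
compensated = simulate(1.0, 10, trim_gain=0.5)  # trim outweighs divergence: excess decays
print(unattended[-1] > 1.0, compensated[-1] < 1.0)
```

In this toy the outcome depends only on whether the compensation outweighs the divergence, which mirrors the argument that the instability itself is not the problem so long as the compensating system is sound.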

Whoever would defend the thesis that inherent instability is incompatible with flight safety, must confront the astonishing safety record of airliners with the pitch-down/auto-trim combination.

It's vital to learn the right lessons from the MAX scandal.

kjApril 28, 2019 5:56 PM

1. Aerodynamically unstable plane due to new engine installation

2. Very Poorly written software to fix it - the original MCAS

3. Faulty angle of attack sensor reading due to bad wiring practices

4. FAA's ignorance

What's next?

MarkHApril 29, 2019 5:21 PM

A propos of the MAX being "too big to fail" ...

Based on the information I have seen, the MAX is a fundamentally sound design.

The odd effects of the engine installation probably could have been handled safely by pilots without MCAS -- but this might have required a change in the training, which Boeing didn't want.

The handling quirk could also have been managed by a correctly designed auto-compensation system.

Moving forward, the MAX will fly with an improved version of MCAS. With all of the attention and scrutiny, the new MCAS will probably be a robust safe design.

After that, the MAX will probably have an exemplary safety record.

So:

• the MAX will be repaired quickly

• Boeing's reputation will be soiled for years to come

• the losses to life and peace of mind caused by Boeing's failure will endure

Clive RobinsonApril 29, 2019 6:44 PM

@ MarkH,

Moving forward, the MAX will fly with an improved version of MCAS. With all of the attention and scrutiny, the new MCAS will probably be a robust safe design.

That's a technical view point, that is likely to count for little in the non technical world of Jo(e) Public.

Look at it this way, first there are going to be court cases, going on years not months, dragging out the pain for Boeing.

After which I suspect the MCAS and 737 MAX are both going to disappear, because their reputations are now shot. The 737 has gone from safe to suicide in the public eye, and that won't change until long after the news has stopped talking about the two crashes and the court cases.

As for the MCAS software, it will certainly not just be reworked extensively; I think it will be completely changed and renamed something else. Because MCAS is now a "tainted brand name".

Likewise I suspect the MAX will die as well. There will have to be some changes, because the FAA in order to save face will probably insist on full analysis followed by a recertification, at which point Boeing faces either recalling existing 737 MAX aircraft or killing the MAX off altogether.

The point is the airlines know the 737 MAX is now a "White Elephant" and they will almost certainly claim in court that passengers will not get on 737 MAX etc due to the adverse publicity...

Boeing will effectively be out of options. They might save the airframe, modify it and rename it, but word will get out and create more bad press...

But realistically at the end of the day the 737 has reached the end of its shelf life anyway. It was never designed for the size of engines airlines want to use for efficiency today. And yes they are likely to get larger still. So Boeing were going to have to ditch the 737 at some point in the near future anyway.

So probably Boeing's best option is to ditch not just the MAX name but the 737 and build a new airframe with a new name and allow for "future proofing" which just can not realistically be done with the 737.

MarkHApril 29, 2019 7:32 PM

@Clive:

We'll soon find out.

Certification authorities will be confronting MCAS v.2 in the next few weeks, and will have to decide -- pretty soon, I expect -- how to resolve questions around certifying the fix.

What the flying public will do, is an open question. I think that the great majority of travelers pay little or no attention to aircraft type.

Perhaps, when the MAX has 6 or 9 months without incident, passenger resistance will diminish to an economically unimportant level.

In the past, the flying public seemed to take disasters caused by technical faults in stride, with type-specific anxiety soon fading.

The DC-10, and even the de Havilland Comet, were economically damaged by their notorious failures, but nonetheless continued to fly.

However, fatal accidents are now so rare, that the psychological fallout may be much more severe.

A few airlines already have fleets of MAX aircraft, and others are already wedded to operational plans and purchase contracts. Those many billions of dollars in investment can't be "disappeared" by magic; either the money is lost, or a way is found to press on.

The simplifying assumption that consumers have the attention span of a gnat, is not a bad first-order approximation.

My guess, is that whether passenger resistance is a serious problem, will be clear before summer is out.
_________________________

So probably Boeing's best option is to ditch not just the MAX name but the 737 and build a new airframe with a new name and allow for "future proofing" which just cannot realistically be done with the 737.

Right on point, though it's not really an either-or proposition. Reportedly, in 2011 Boeing gave intensive analysis to the question: build a new narrow-body, or make another round of upgrades?

Though they opted for the latter, I'm sure they've had an ongoing R&D program for the successor line.

Now they'll have an incentive to accelerate that effort.

Traditionally, the critical item in designing a new airliner series is the wing, taking a LOT of engineering time and costing a couple of billions in today's dollars ... but my knowledge of the industry is getting out of date, and perhaps the process is all different now.

If they already have their wing design, they might be ready for first flight in 2 or 3 years.

Meanwhile, forms of the venerable 737 are likely to remain their bread and butter.

MarkHApril 30, 2019 2:32 AM

Before it can get better, it must stop getting worse

Two very discouraging signs about Boeing's management rot:

1. The CEO just told shareholders that MCAS was properly designed, and blamed the crashes on the pilots.

I don't have words for how bad that is ...

I hope that the stockholders will insist on his dismissal, for the good of everyone.

2. Previous news accounts reported that many MAX planes lack an "AOA disagree" warning -- hitherto a standard 737 feature -- because it's an extra-cost option. That's bad.

New reports say that Boeing intended this warning to function on all MAX planes, but the software was erroneously configured to disable this warning if a certain option package was not enabled. That's worse.
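As a purely illustrative sketch of the kind of defect described above, and not a representation of Boeing's actual software, the following Python shows how a safety warning can end up silently tied to an unrelated option flag. The names `Config` and `option_package_enabled`, and the 10-degree threshold, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical flag standing in for the extra-cost option package.
    option_package_enabled: bool


# Arbitrary placeholder threshold in degrees; not a real certification value.
DISAGREE_THRESHOLD_DEG = 10.0


def sensors_disagree(left_deg: float, right_deg: float) -> bool:
    """True when the two angle-of-attack readings differ by more than the threshold."""
    return abs(left_deg - right_deg) > DISAGREE_THRESHOLD_DEG


def warning_as_misconfigured(cfg: Config, left_deg: float, right_deg: float) -> bool:
    """The reported kind of defect: the warning is silently gated on the option flag."""
    return cfg.option_package_enabled and sensors_disagree(left_deg, right_deg)


def warning_as_intended(cfg: Config, left_deg: float, right_deg: float) -> bool:
    """The stated intent: the warning fires on every aircraft, option or not."""
    return sensors_disagree(left_deg, right_deg)
```

With the option flag off and readings of 5 and 25 degrees, the misconfigured version stays silent while the intended version raises the warning. Nothing fails loudly, which is what makes this class of configuration error hard to spot.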

How are the mighty fallen! Perhaps they need to retain the services of an exorcist.

Note: the second link might soon fall behind a paywall.

Clive RobinsonApril 30, 2019 5:10 AM

@ MarkH,

Before it can get better, it must stop getting worse

But do Boeing senior management know how to get out of this tailspin?

They might not have the knowledge or the time...

Back in the 1990s I was studying for a Masters degree and a chunk of it was about "disaster communications" for both natural and human caused disasters. Studies prior to that had shown that organisations that had been more open and apparently honest, even involving the public in the process of dealing with a disaster, usually had a much higher public confidence rating in return. Thus they were a lot more likely to survive and thrive after the disaster event.

Boeing are giving out signs that some of the senior management have decided on a course that does not bode at all well for anyone.

The fact that FAA midlevel staff had been "discussing" the software issues of the MAX during the period up to the second incident does not bode well for the FAA either.

I suspect that geo-politics will start to kick in soon. From the Wiki page on the MAX it appears that around five thousand have been ordered but only around ten percent of those have been delivered, and of those around 10% to Chinese airlines (not sure about Russia), with a significant number of yet to be fulfilled orders from emerging markets around both the two Super Powers.

Due to the way politics has been happening between the US and other Super Powers, Boeing might get rather more than a haircut as collateral damage from the current foreign policy of the US and other Western nations such as Australia.

Consider what would happen if both China and Russia closed their airspace to not just the MAX but all new Boeing aircraft pending their own Air Administration Authorities reviewing and testing all the flight software and airframe upgrades etc, which could easily add five or more years to the aircraft being cleared. No matter how good new engines would be the Boeing aircraft would be uneconomical in many emerging markets, which is really the only growth area...

I suspect neither China nor Russia will actually do this in the long term because politically they tend to play the long game not the short game of the West, and thus tend to act in the economic long term by investment and other methods to maintain economic stability. But that does not stop them rattling the saber a little, thus I suspect it's going to be on a lot of people's minds in the West to well beyond the point of insomnia...

Stephen MasonApril 30, 2019 9:31 AM

I write as a follow up to the observations by Clive Robinson from my perspective as a lawyer writing about electronic evidence. I have found the posts to be of great interest and educational, although I am aware of most of the issues that have been discussed.

What might astound many of those who have written the interesting posts above, is that there is a legal presumption in common law systems that computers are ‘reliable’. This presumption is sometimes explicit and at other times implicit. No legislation has defined what ‘reliable’ means, and no judge has either.

(Common law legal systems include the United States of America – the United Kingdom comprises three separate jurisdictions: England & Wales, Scotland [part common law, part civil law] and Northern Ireland – Canada, India, Australia, New Zealand, Singapore, Ireland, etc).

If you were before a judge in a court of law and asserted a fact, the judge would require you to substantiate the fact before it was accepted as part of the evidence. Regarding this presumption, it was merely asserted by the UK Law Commission in 1997, without providing any evidence, that computers are ‘reliable’ – this is one of the words that has been used regularly, even if the Law Commission did not use this precise word. So, we have the Law Commission asserting a fact that has affected thousands of people without providing any evidence to substantiate the claim. No judge would accept this if a party in legal proceedings asserted a fact, but the unsubstantiated presumption that computers are ‘reliable’ is accepted as a legal truth.

If you have read this far, you will wonder why I have offered these comments, given that the hardware and software in the Boeing crashes will be considered in great detail by the investigators. This is unusual. I raise this issue because this presumption affects all of us every day. If, for instance, you have a claim against a bank that you did not withdraw cash from an ATM, and the bank refuses to return the funds because it asserts that you are responsible for the withdrawal, and you take legal action in an attempt to recover the money, the lawyer for the bank will often pray in aid this presumption, and the judge will agree with it, regardless of the complexity of the systems (hardware and software) that sit behind the ATM. This occurred in the English case of Job v Halifax PLC (not reported) Case number 7BQ00307, 6 Digital Evidence and Electronic Signature Law Review (2009) 235 – 245 http://journals.sas.ac.uk/deeslr/article/view/1905

A number of people have commented that the life we are now living is based on increasingly complex systems. This is correct. It is disturbing if the attitude towards the presumption of computers being ‘reliable’ continues in legal circles. I have written about this extensively in Chapter 6 of the open source book: Stephen Mason and Daniel Seng, editors, Electronic Evidence (4th edition, Institute of Advanced Legal Studies for the SAS Humanities Digital Library, School of Advanced Study, University of London, 2017) http://ials.sas.ac.uk/digital/humanities-digital-library/observing-law-ials-open-book-service-law/electronic-evidence

The hypocrisy of the law is based on a lie: judges accept the presumption that computers are ‘reliable’, yet they also accept that software cannot be relied upon, as I pointed out in my essay ‘Artificial intelligence: Oh really? And why judges and lawyers are central to the way we live now – but they don’t know it’, Computer and Telecommunications Law Review, 2017, Volume 23, Issue 8, 213 – 225. Lawyers write clauses for contracts relating to the use of software code that require the user to accept that the software is not free of errors (read any software licence online). Such contract terms are considered so normal that nobody appears to understand this fundamental contradiction between the presumption that computers are 'reliable' and the acceptance of flawed software code as being normal.

With everyday life becoming ever more complex, legal systems need to abandon this ridiculous fantasy – by so doing, perhaps purveyors of systems might be more careful about design – because the law will (or ought to) be more willing to interrogate the system for evidence of causation.

MarkHApril 30, 2019 10:07 AM

@Stephen Mason:

As someone with no legal education, my intuition about "reliable" in a legal context is that it would typically apply to forms or sources of evidence.

For example, a reliable human witness would be a person who is thought likely to report what s/he experienced with a minimum of distortion, embellishment or partiality.

It's easy for me to imagine courts considering computer-generated evidence (for example, telephone bills) as reliable in the above sense. The computer isn't expected to lie to protect a friend, or to give a less accurate portrayal because a few days have elapsed. Unless there is particular evidence that the phone billing computer is making errors in the records, I would suppose there to be a presumption that the computer evidence is accurate.

This limited sense of "reliable" -- objective, and with accurate "recall" -- makes sense to me, in application to the ways people expect computers to operate, or experience their operation in many practical contexts.
________________________________

Where it gets a bit Orwellian, is that it again makes sense to me to presume that a computer is a reliable source of evidence, while at the same time accepting that its analytic outputs, decisions, and actions (when the computer controls something) may be far from the intention of the computer application.

Of course, when you look under the hood, this distinction gets very murky. A computer that, because of an algorithmic defect (or much more rarely, a hardware fault) generates undesired results, can in precisely the same ways generate inaccurate data records.

Even so, this very imperfect distinction isn't crazy. It happens every day in the analysis and diagnosis of computer misbehavior, that engineers look at computer-generated records (log files and the like) to gain insight into what's going wrong.

In such a process, the record generated by the computation is expected to be quite likely correct, even though the result of the computation is known to be faulty. And most of the time, that expectation is justified, for the simple reason that the computation to generate the record of data and computational results at any given step is usually far less complex -- and therefore, much less prone to defect -- than the total computation needed to produce the application results.

Thus, in a nutshell (or nut's hell?) the computer can be a (fairly) reliable witness and unreliable actor all at once.
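A minimal sketch of that witness-versus-actor distinction, using a made-up function with a deliberately planted bug: the final answer is wrong, yet the simple log it writes is an accurate record that lets an engineer find the fault. The function name and the defect are assumptions chosen for illustration only.

```python
def mean_with_log(values, log):
    """Compute a mean, appending a plain record of each step to `log`."""
    log.append(("inputs", list(values)))
    total = sum(values)
    log.append(("total", total))
    # Deliberate defect for illustration: divides by n - 1 instead of n.
    result = total / (len(values) - 1)
    log.append(("result", result))
    return result


log = []
answer = mean_with_log([2, 4, 6], log)

# The "actor" is unreliable: the true mean of 2, 4, 6 is 4.0, not 6.0.
# The "witness" is reliable: the log faithfully shows inputs, total and result,
# which is enough to recompute the right answer and locate the bad step.
recorded_inputs = log[0][1]
recorded_total = log[1][1]
expected = recorded_total / len(recorded_inputs)
```

The logging steps are trivial copies of values, while the computation carries the defect, which is the asymmetry in complexity described above.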

Clive RobinsonApril 30, 2019 2:44 PM

@ Stephen Mason,

What might astound many of those who have written the interesting posts above, is that there is a legal presumption in common law systems that computers are ‘reliable’. This presumption is sometimes explicit and at other times implicit. No legislation has defined what ‘reliable’ means, and no judge has either.

As an engineer who has designed Safety Critical systems, and has been called upon to be an expert witness, I'm aware of the problem from both sides.

What people have to realise is that a "presumption" is effectively an assumed ground state from which you have to argue your way to a known state by fulfilling a burden of proof.

As a Safety Critical engineer it is a requirement of the job that I presume any invention of man or system to be not just unreliable but by cascade effects unreliable in ways that would to most imbue a system with not just free will but malicious free will. Or to put it another way, all the probabilities to line up adversely, as dictated by the conjunction of the semi-serious "Murphy's" and "Finagle's" laws,

    The perfidy of inanimate objects is such that the worst will happen at the most inconvenient time.

Thus you could say my job is to assume the worst, and the system designers have to prove the best of their system to me.

Which is actually what you would want me to do as an engineer, if you think about it for a moment.

However most are aware of the these days trite sounding "Presumption of innocence"; this works the other way around. The best is assumed of a person and the accuser has to prove the worst of them.

For over a thousand years --until Tony Blair's stupidity-- that is the way the legal system has worked in England. The accused has to be proved guilty by their accusers, that is the "burden of proof" is that of the prosecution not the defendant. The measures being, supposedly, beyond reasonable doubt and on balance of probability.

Thus when you claim against a bank, the computer system gets the same presumption of innocence as an old fashioned accounting clerk, and it is up to you to meet the burden of proof that it has caused you quantifiable harm.

The advantage the bank has is that it will refuse to hand over anything that will help you prove your case, and will argue at --your-- great expense that for --their-- good and proper reasons that is the way it should be even though there is no logic or reason to their argument[1].

The FAA and all other airline regulatory bodies are supposed to work on the idea that the system is "unsafe" until the designers fulfil their burden of proof to show that the system is to be considered safe/secure. However as in science you cannot prove "absolutes", therefore you have to make argument to some quantifiable measure as your proof of safety. Often these measurands are called "axiomatic" which is a posh word for "reasoned assumption". But as we know from the notion of "entropy" everything eventually moves from a nonchaotic "ordered state" to a chaotic "disordered state" which is the fundamental building block of science.

Thus there is absolutely no reason to believe that a computer has not suffered from "bit rot" or to coin a phrase "digital dementia", assuming of course that it was properly designed and constructed in the first place, which is highly unlikely.

Without doubt computers are designed and built by people. People are known to be unreliable at the best of times and in the main cannot deal with complexity. Less well known is that a programmable computer can never be proved to be correct in function by interrogation as it will tell you what it has been instructed to tell you.

Thus it matters not a jot how much you monitor a computer, previous behaviour is not nor can it ever be an accurate predictor of future behaviour. In exactly the same way "a loyal subject turns traitor".

Another subtle issue causing trouble is the definition of "trust". As has been observed "ICTsec trust" as in "trusted systems" is effectively the opposite of "Human trust". Explaining this to people not steeped in the knowledge of the field of endeavour can be difficult to put it mildly.

The law however is steeped in a thousand years of "human history" which came about due to informal logic thus reasoning. It was not until the mid Victorian 1800's that logic became formal as did reasoning and proof systems that relied on it. However under a century later in the 1930's logic was shown, surprisingly and to the dismay of many, to have distinct limitations by Kurt Gödel, Alan Turing and Alonzo Church amongst others. Computers as we know them did not appear for another twenty years or so.

Thus engineers know that logic thus computers have limitations but the law apparently assumes otherwise.

As I've mentioned before computers do not have "directing minds"; like all current technology they are oblivious to use. The notion of "Good-v-Bad" is one of human ethics and morals which computers inherently lack like any other machine or tool.

The fact that for no reason other than tradition the legislature and judiciary chose without thought to imbue computers with some semblance of humanity is a mistake. The legislature, judiciary and people in general recognize a gun to be a tool. That is, it is most often operated by a hand under the influence of a directing mind. So if someone is killed or injured the assumption is that it was "the finger on the trigger" of the hand holding the gun, both under the control of the directing mind.

So why the legislators and judiciary should choose to think of a computer as anything other than a tool is of great puzzlement to many engineers and any with knowledge of computers.

So from my point of view yes it is the presumption that needs to change, complexity does not "reliability, infallibility or humanity make" as far as we are currently able to determine, thus there should be no presumption of innocence, correct functioning or reliability and the burden of proof should thus fall on the designer and operators who are the "directing minds" of the tool that is the computer when it comes to safety/security.

I may well be in a minority at this point in time, but I expect over time the pendulum will swing, albeit glacially, in this direction.

[1] However this might have loosened up a bit with British Gas getting found guilty of harassment and their defence which was in effect "the computer did it". Lisa Ferguson was upset at being threatened by British Gas so she took them to court at quite some peril to herself, which British Gas attempted to exploit. It was eventually ruled in effect that the computer was the invention of man and that man dictated its actions, therefore man was responsible for the computer's failings by action, inaction or both,

http://www.g7uk.com/photo-video-blog/20090304-british-gas-guilty-of-harassment.shtml

https://www.brownejacobson.com/insurance/training-and-resources/legal-updates/2009/03/ferguson-v-british-gas-trading-ltdcourt-of-appeal-10th-february-2009

Which, albeit a small step, is a step in the right direction.

MarkHApril 30, 2019 3:26 PM

@Clive:

How interesting, that you studied crisis communication. Probably all software developers would do well to learn this ;)

Your tailspin metaphor is exact; as you wrote above, the recovery procedure is counterintuitive.

I recently learned that there's a slogan in the field of crisis communication with several variants: the one I heard was "Tell the truth; Tell it all; Tell it first."

To play devil's advocate for the Boeing CEO (I don't defend, but rather seek to understand), I can imagine that some $2000 per hour lawyers advised him that he MUST blame the dead pilots. It would be psychologically comforting, at least, to think that while he stood spouting that garbage, he felt like an ass.
________________________

What I expect to save Boeing, the 737 product line, and the MAX models themselves, is Newton's second law of motion. To stop the whole massive enterprise would need more resources than can be brought to bear.
________________________

What's not visible to us, because (for now at least) it's hidden within the walls of Boeing, is a tsunami of emotion: anxiety, a sense of betrayal, and red-hot fury.

The possible malfeasance and undoubted incompetence which brought this scandal about in the first place, was likely on the part of sufficiently few people that they could all sit around a conference room table.

There's an army of many thousands who labored diligently to make the MAX as safe as any plane in the world. I'll leave to your imagination, their attitudes toward those who undermined the integrity of their work.

According to the Seattle Times, even before the second crash Boeing had conducted an internal review identifying devastating flaws in the original safety analysis of MCAS.

I'm sure there's at least one crisis team roaming the halls at Boeing. They know there's been an infection, and they will apply metaphorical chlorine gas and blowtorches until they're confident they've cleaned it out.

MarkHApril 30, 2019 4:09 PM

P.S.

Just today, Bruce edited his original post, linking a well-written rebuttal by Peter Ladkin, to the original article by Gregory Travis.

I highly recommend it to anyone interested in understanding what really went wrong.

I think that a couple of Ladkin's points may be off the mark, but in general he offers a fact-based "cold shower" for Travis' understandable but inaccurately directed outrage.
________________________

I've known a fair number of pilots, and have formed some impressions of their typical perspectives and attitudes.

Travis' frame of reference is his experience at the controls of a Cessna 172, a type with which I have some familiarity.

There's a strong case that the 172 is the most successful airplane design ever. It's been produced in greater numbers than any other type, and does its job with elegantly robust simplicity.

Its construction reminds me of a bicycle: simple, economical, light and strong. The strut-braced high wing is very sturdy and can take a lot of stress.

Where handling qualities are concerned -- the plane's response to pilot control inputs -- the 172 is as friendly and docile as a newborn puppy. It behaves immaculately on stall entry, and can be recovered from a stall or even a spin with an amazingly small loss of altitude.

Even with rather ham-fisted piloting (a not infrequent occurrence among light-plane pilots who often have low time, or fly too infrequently to keep their skills honed), the 172 does its best to keep you out of trouble -- or when you get into trouble, makes it as easy as possible to fly back out of it.

It has its safety vulnerabilities like any plane does, but its whole personality is comfortable and reassuring.

When you're accustomed to such a plane, the dependence on complex technical systems of a 90 ton Mach 0.79 people-mover doesn't feel reassuring at all. The designers of such jets do their best to make them handle like a single-engine Cessna ... but they must use a lot of tech, to do so.

Stephen MasonMay 1, 2019 5:31 AM

The remarks by Clive Robinson are apposite. I cover them in Chapter 6 of Electronic Evidence, in which the British Gas case is discussed, naturally: Ferguson v British Gas Trading Limited [2009] EWCA Civ 46 http://www.bailii.org/ew/cases/EWCA/Civ/2009/46.html. For a senior barrister to argue before the Court of Appeal that letters sent out automatically by a computer were not the fault of British Gas was unimpressive. The position has changed little.

Interesting is the notion of trust, which Timothy S. Reiniger and I wrote about in 2015: ‘“Trust” Between Machines? Establishing Identity Between Humans and Software Code, or whether You Know it is a Dog, and if so, which Dog?’, Computer and Telecommunications Law Review, 2015, Volume 21, Issue 5, 135 – 148.

It is my simple, non-technical understanding (I was an Ammunition Technician in the British Army 1973-1982, which included bomb disposal, but that experience is long past) that we have to trust software, because the concept of trust is the absence of knowledge, as noted by, among others, William S. Harbison for his PhD, “Trusting in Computer Systems” (University of Cambridge Computer Laboratory Technical Report No.437, December 1997), p 17 (p 15 in the printed version I have), in which he said:

“the concept of trust is better associated with the idea of what we don’t know rather than what we do know. It can therefore be considered as a substitute for knowledge instead of a representation of it”.

[This dissertation was previously not available online. Bill gave me a copy some years ago, and I cogitated for a number of years on the topic, reading his work twice, before beginning the paper on trust. His work is now available at https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-437.pdf].

We – the users of technology – have to trust the systems we use. I fully appreciate the professionalism of those involved in developing systems, and also realize that it is the people that are given the authority to make decisions who can make life unbearable for the end user. As others have noted above, always remember the economics.

Clive RobinsonMay 1, 2019 6:36 AM

@ Stephen Mason,

We – the users of technology – have to trust the systems we use.

That would be extremely unwise. It's also in part the reason we are in the mess we are currently in ICTsec wise.

As you are probably aware people refer at a very basic level to two ways of development "top down" and "bottom up" with regards to the computing stack. Mainly everything is built "top down" which unfortunately will not be secure against "bottom up" attacks. Especially those below the security layers of the CPU and MMU that can "bubble up" directly through them with little or no difficulty.

The big assumption in top down is that the layers below can be trusted. Whereas all the evidence on bottom up and reach around attacks shows it is a dangerous fallacy.

Basically the lowest layer most programmers ever get to influence is the CPU ISA at or below ring three, and any software security they try to implement is above that.

Anyone who can either get around the CPU/MMU layer or come from underneath by DMA or other I/O to control the memory can freely change anything that the above CPU ISA software has done.

That is a fact of life I've been discussing on this blog for several years. What it boils down to is that you can not stop attacks below the CPU ISA, you can't even see them in a single CPU architecture from above the CPU ISA. Because the CPU will tell you what the lower level manipulation will tell it to tell you...

Are there ways to mitigate this, yes, but currently we have no architectures designed to do so in production or even close to production.

That's the reality of the computer stack that most top down methodology adherents don't talk about.

Whilst some who have thought about it suggest "encrypted memory" that has two assumptions,

1, An attacker from below the encryption layer can not find or determine the secret key.

2, An attacker can only modify memory from below the encryption layer.

Even though the methods to mount attacks may appear insurmountable to many and it can actually be extremely difficult in practice, both assumptions are actually false and have been shown to be so at one time or another.

GabrielMay 1, 2019 8:45 AM

Wendover Productions has many excellent YouTube videos on all aspects of the aviation industry. He's just released a new video about the Boeing 737 Max.

https://youtu.be/BfNEOfEGe3I

He gives interesting background on the importance of fuel efficiency to airlines, in relation to other factors. And regarding the details of the blunder, he pretty much seems to follow Travis.

Zaphod May 1, 2019 8:59 AM


I’m not flying in a Max until Clive says it’s safe to do so.

Zaphod

Sancho_PMay 1, 2019 11:53 AM

@Clive Robinson

”We – the users of technology – have to trust the systems we use.” (@Stephen Mason)

I think the quote is - even when mentioning technology and systems - fundamental to our society, from family to world population.
This principle of trust, based on experience and probability, finally, over decades, led to the “… legal presumption in common law systems that computers are ‘reliable’.”

That may be “extremely unwise” but is the only way to go, because
the other option would be not to go.

Upside down or UPC/CPU, it doesn’t matter if thousands are robbed, drowned, overrun by autonomous vehicles or whatever,
- until there is growth.

MarkHMay 2, 2019 12:19 AM

An Attempt at Perspective

By some very rough math, I've estimated that the MAX fleet flew about one billion km before its recent grounding. If you had made every departure (impossible of course because they are concurrent), you would have died twice.

Looking at fatality data for countries with roughly "median" traffic safety (for example, the US and Belgium), if you traveled the same billion kilometers on average roads in average vehicles, you would have died ... about twice.

To be sure, the MAX's accident record is quite poor compared to other new aircraft types (more than 5 times as many fatal accidents per departure, and because both crashes were total, a much worse ratio of deaths per departure). For an individual passenger, the probability of dying in a MAX crash has proved empirically (so far) to be roughly one in 400,000 per departure.

A man who cycles an hour per day in the UK faces about the same probability of death on the road, for each week in which he does so.
_________________________________

Disclaimer: I might well have screwed up the math in more than one place ...

Clive RobinsonMay 2, 2019 3:47 AM

@ Zaphod,

Nice to hear from you again, it's been a while, I hope you are well?

As for,

I’m not flying in a Max until Clive says it’s safe to do so.

I don't fly these days as the Drs think it unwise.

At the moment the cause for the crashes is still officially not decided. So you can not officially say it's safe to fly again as nothing would have changed.

But we can sum up the things we currently know that were to put it politely deficient.

1, Regulatory oversight.
2, Senior Management.
3, Airframe evaluation.
4, Marketing of features choices.
5, MCAS software.
6, AoA detectors.
7, Change documentation.
8, Pilot training.


Potentially there was Pilot error as well. But when you consider the above incomplete list, would it have been their fault? If knowledge that should have been given to them in an appropriate way was not, then the error was not theirs.

But we also now have to consider other factors. There is the rivalry between Airbus and Boeing that at times looks chillier than the cold war. Back over a decade ago Airbus saw a potential hole in the market as well as a hole in Boeing's product line up, and they filled it. It would appear that American Airlines and share holder interests then called the shots on Boeing management back nearly a decade ago. Thus there is the question of just how much of a rush job the 737 MAX was as well as how it would be sold to the market place in a way that would delay orders to Airbus until a Boeing product was available.

Arguably Boeing did the equivalent of "lipsticking the pig" or "slap on go faster stripes on a junker" for potentially unsound reasons. As others have noted, whilst the software might have been deficient in many ways, it has to be seen in a larger context, including that of geo-Politics. At any point in the chain leading to these two disasters the tiniest of changes would have set history in a different direction.

Over all though the whole situation does not look at all good not just for Boeing but the FAA and political decisions that went behind the cost cutting. Let's be clear, there are US politicians with blood on their hands over these crashes and the deaths.

Also what is not helping is lack of communications from Boeing management, they have created a "news vacuum" into which anything and everything is getting dragged, not least of which is their reputation.

Whilst they may not be able to say much technically for legal reasons, there is still a lot they could be saying to at least partially fill the vacuum. In effect they have gone MIA or into hiding... In many people's eyes, albeit unfairly, that's a sign of guilt.

Personally at the moment I'd like to know more about how the FAA ended up in the mess it has now become. I know a number of the pieces but there are a lot more "roaches out of sight" that need to be flushed out and stamped out. Not least is this myth that "A small change here and a small change there does not a different aircraft make" for certification avoidance. Let's put it this way, if you are marketing it as a new product with new flight characteristics, which is what Boeing was doing, then it has sufficient changes to be a new aircraft...

MarkHMay 2, 2019 8:43 AM

As I feared, I got myself into trouble by attempting mental arithmetic in the middle of the night -- some errors came to me while I was brushing my teeth.

The demonstrated MAX fatal accident rate is about 1 per 200,000 departures, so my British bloke must cycle each day for a fortnight in order to equal the death risk of a single flight on a MAX with the original MCAS.

And its up-to-the-time-of-grounding fatal accident rate is about FIFTEEN times that of the modern airline fleet average.

If

• the MAX gets back up into the air
• its deliveries continue anything like planned
• son-of-MCAS avoids the insanity of the first version

then within about 2 years, the MAX fleet fatality rate is likely to improve to about 1:1,000,000.
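
The arithmetic behind these figures can be checked with a rough sketch. It assumes two fatal accidents to date and no further accidents after the return to service. The departure totals are inferred from the rates quoted above rather than taken from official statistics, and all names are illustrative.

```python
# Rough check of the rates quoted above. Inputs are the comment's own round
# numbers; the departure totals are inferred from them, not official data.

FATAL_ACCIDENTS = 2                         # Lion Air and Ethiopian Airlines
MAX_DEPARTURES_PER_ACCIDENT = 200_000       # quoted: about 1 per 200,000 departures
FLEET_RATIO = 15                            # quoted: about fifteen times the fleet average
TARGET_DEPARTURES_PER_ACCIDENT = 1_000_000  # quoted: about 1:1,000,000


def cumulative_departures(accidents: int, departures_per_accident: int) -> int:
    """Total departures implied by an accident count and a 1-per-N rate."""
    return accidents * departures_per_accident


departures_at_grounding = cumulative_departures(
    FATAL_ACCIDENTS, MAX_DEPARTURES_PER_ACCIDENT)
departures_for_target = cumulative_departures(
    FATAL_ACCIDENTS, TARGET_DEPARTURES_PER_ACCIDENT)
further_departures = departures_for_target - departures_at_grounding
fleet_departures_per_accident = MAX_DEPARTURES_PER_ACCIDENT * FLEET_RATIO

print(f"Implied departures at grounding:    {departures_at_grounding:,}")
print(f"Departures needed for 1:1,000,000:  {departures_for_target:,}")
print(f"Further accident-free departures:   {further_departures:,}")
print(f"Implied fleet average: 1 per {fleet_departures_per_accident:,} departures")
```

Whether that many further departures fit into about two years depends on delivery and utilization rates, which are not given here.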
________________________________

None of what I have written is offered to excuse the cascade of failures -- cogently enumerated in Clive's post above -- which led to two completely avoidable disasters.

When we ride on a jet airliner, we hurtle at about 9/10 the speed of sound through air near -60 F, so thin that without pressurization the average person would lose the ability to solve even a simple challenge in less than half a minute.

We are propelled on this journey by a vast assemblage of components, including a small forest of turbine blades about the size of a credit card, each one of which generates the horsepower of a Formula 1 engine at full throttle while bathed in gas at a temperature which would quickly disintegrate ordinary engine materials.

At the end of each day we take such a ride, we're more likely to still be alive than if we had spent that same day in our usual activities at home.
________________________________

The MAX has disappointed this Herculean expectation, and is a very useful reminder that though many thousands of people do their work as near perfectly as they are able, the mistakes of a handful can undo it all.

As with security systems, it's not enough to do thousands of things right, because that one thing wrong is enough to open the door to failure.

Clive RobinsonMay 3, 2019 1:18 AM

@ MarkH,

Firstly, on "late night math": you are not the only person to get things wrong; it happens to the best of us when tired (hence my occasional comments about productive and unproductive work hours).

"As with security systems, it's not enough to do thousands of things right, because that one thing wrong is enough to open the door to failure."

Thankfully for most who follow this blog, trying to find which door was opened is just an intellectual exercise; for others it's a rather more serious issue.

It can also be an exercise in futility. If a workplace's ethics have become toxic then a door will open at some point; which one does not really matter, as it's the toxicity that is to blame.

Look at it this way: you fall sideways with your foot trapped, you know something is going to give somewhere and that it is going to hurt a lot. What gives way is a symptom of the fall, not the cause of the fall. If you want to avoid similar injury, strengthening the boots you wear is not going to stop you falling again, or hurting yourself. In effect, strengthening the boot just moves the location of what gives. Stopping future falls is the real solution, but could prove to be difficult to do.

Unfortunately, that's not the way the legal profession tends to work. Thus the law will look for the door opener and rightly or wrongly hold them to account, for being the weakest link in the chain.

And as the old saying has it, "There but for the grace of god go you and I".

MarkHMay 3, 2019 3:37 AM

@Clive:

"There but for the grace of God ..."

In my work life, I have a long history of obsessive-compulsive striving to make things bulletproof (yes, I have German ancestry), with mixed results.

Trying to imagine myself on the MCAS team, I visualize myself jumping up and down, because studying aviation safety has been a hobby of mine for decades, and I'm very alert to the deadly potential of a runaway horizontal stabilizer.

But supposing I had done so, the response would likely have been either "you just write the code like a good boy," or a patient explanation from one of the Wise Men that they had carefully analyzed the matter and it's safe as houses.

Might I have done any better than those who were there? I've no confidence about that.

As to legal matters, apart from astronomical payouts to family members, I don't think any individuals will face legal jeopardy ... though some might spend a grim part of their lives testifying as trial witnesses.

Fortunately, the policy of the NTSB is to search for probable causes, not to assign responsibility. In crash investigations, they have no judicial or quasi-judicial role.
__________________

More about the statistics ... my estimate concerning the UK cyclist is only for his risk of death while cycling.

Sad to say, for gentlemen of my age -- or Clive's -- two weeks of ordinary life carries the same death risk.

Maybe it's time for me to take up some dangerous hobby ...

Clive RobinsonMay 3, 2019 9:17 AM

@ MarkH,

"Maybe it's time for me to take up some dangerous hobby ..."

I used to be a very keen cyclist. I started getting serious about it due to the high prices of public transport, and because for many places within 50 miles of central London it was actually faster than driving a car or using public transport. Oh, and public transport has the disease downsides for half the year and is full of sweaty bodies at other times of year... I often cycled 50-100 miles on a work day, and at weekends I'd just cycle off to somewhere with a train service back to London. So most places around the South East coast got a visit, as well as further afield like Birmingham and bits of the West Country and Black Country. I used to carry an army poncho and liner for the occasions I did have to stay overnight. Oh, and a 2M ham transceiver just to chat on the way.

As for taking more risk as you get older I think I've mentioned "Micromorts" and "Microlives"[1] before. For a mildly fun way of looking at Micromorts,

https://charlesfudgemuffin.blogspot.com/2017/01/micromorts-risk-of-dying.html

[1] Micromorts measure the risk of a non-accruing activity such as driving 200 miles: once you stop the activity, the risk ceases. Microlives measure accruing risk: smoke 200 cigarettes, and you've lost 100 microlives of life expectancy.

RachpoilMay 4, 2019 1:01 PM

I used to have a lot of respect for Boeing, but the way that they rushed the 737 Max design is unforgivable. They cut big corners at the expense of passenger and crew safety, and now 346 children, women and men are dead. All Boeing had to do was use the available second AoA sensor to cross-check the data, and disable the MCAS system if there was an inconsistency between the two sources of data it receives. It would also have been easy to have the plane's computers report the AoA data inconsistency to the pilots so that the defective sensor could be fixed. Cutting corners on something like that tells a lot about Boeing. I no longer trust that company and hope that Boeing will pay a very big price for their greed.
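
The cross-check described above might look something like the following minimal sketch. It is an illustration only, not Boeing's logic: the disagreement threshold, the function name, and the returned fields are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: the largest difference (in degrees) between the two
# angle-of-attack readings that is still treated as agreement.
AOA_DISAGREE_LIMIT_DEG = 5.0


@dataclass
class McasDecision:
    mcas_enabled: bool     # may MCAS command stabilizer trim?
    disagree_alert: bool   # should the crew be told the sensors disagree?


def cross_check_aoa(left_deg: float, right_deg: float,
                    limit_deg: float = AOA_DISAGREE_LIMIT_DEG) -> McasDecision:
    """Inhibit MCAS and raise a crew alert when the two sensors disagree."""
    disagree = abs(left_deg - right_deg) > limit_deg
    return McasDecision(mcas_enabled=not disagree, disagree_alert=disagree)


# Sensors agree: MCAS stays available and no alert is raised.
print(cross_check_aoa(4.0, 4.8))
# One sensor reads wildly high: MCAS is inhibited and the crew is alerted.
print(cross_check_aoa(4.0, 24.0))
```

A comparison like this cannot tell which of the two sensors is wrong; it only detects that they disagree, which is enough to stop acting on either reading and to flag the fault for maintenance.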

MarkHMay 14, 2019 10:00 AM

Yesterday, an article in Aviation Week & Space Technology described the results from a 737 crew who set up their flight simulator to reproduce a segment of the Ethiopian Airlines flight which ended in the second MAX crash.

The simulator is for the preceding model (called the NG), but the difference between that and the MAX is too slight to matter to their experiment.

Without going far into the details ... they found that the only survival option for the doomed Ethiopian crew was quick recognition of the MCAS problem followed by immediate corrective action.

Because the pilots on the crash flight took some time to realize what was going wrong, they already had the stabilizer in a strong nose-down position (put there by MCAS) and excessive airspeed before they started their recovery procedure.

It was from that point that the simulator pilots attempted recovery. Regaining controlled flight required three things:

1. cutting back engine thrust to reduce airspeed;

2. a series of somewhat complex maneuvers on which current 737 pilots are not usually trained; and

3. loss of about 8000 feet of altitude.

Because the maximum altitude attained by the Ethiopian flight was 8000 feet, even knowing all of the tricks could apparently not have saved them.

What might have saved them was early diagnosis that MCAS was pushing their nose dangerously down; alas, it took them rather too long.
__________________________________

When the Ethiopian pilots understood that MCAS was running away, they switched off the automatic trim system (as they were supposed to do), effectively disabling MCAS.

Unfortunately, due to a mechanical limitation of the 737 flight controls, when the horizontal stabilizer is far out of trim, the pilots can't both (a) keep the nose at a reasonable angle, and (b) use the handwheels to correct the stabilizer position. [The simulator pilots overcame this by cycling in a repeated "roller coaster" maneuver between letting the plane dive while adjusting the handwheel, and then bringing the nose back up causing the handwheel to freeze.]

The Ethiopian pilots tried to get the stabilizer back into trim using trim control buttons which activate electric motors ... but in order to enable those buttons, they had to switch the auto trim system back on. As a consequence of Boeing's baleful system design, that also reengaged the malfunctioning MCAS, which quickly moved the stabilizer the wrong way ...

The article described the simulator pilots as "stunned" by how difficult it was to recover the flight.

Clive RobinsonMay 15, 2019 3:37 PM

@ JG4, and others,

If these are all correct, Boeing should burn to the ground.

There is much that I could say, but at the end of the day, the very least this can be is gross mismanagement from way up the totem pole.

Whilst I feel for the friends and family of the victims of the two flights, I can also feel compassion for the ordinary workers at Boeing, whose lives are likely to take an even steeper turn for the worse as Boeing pays the price. Likewise the workers, and their families, at associated businesses, some being little more than mom-and-pop stores.

Much though I often deride short-term shareholders, there are many others, and a lot of people's pensions are riding on Boeing shares that just a short while ago would have been considered a fairly safe investment.

The harm this is likely to do to many parts of America could well be a blight for years to come.

Not that any of the above would have figured in the thinking of Boeing's senior management team...

MarkHMay 17, 2019 9:02 AM

Another Consequential Design Decision

Thanks to JG4, for the link posted above. The article referenced this piece in the Seattle Times* about yet another design choice which played some role in the heavy loss of life.

In essence, 737 cockpits have two switches between the pilot seats to disable powered control of the horizontal stabilizer.

[As a reminder, "runaway" of the stabilizer is an extremely dangerous situation which can make the plane impossible to recover in not many seconds.]

Pilots have long been trained to flip both of those switches immediately as soon as they recognize an undesired movement of the stabilizer.

In previous 737s, the two switches had distinct functions: one disabled the electric stabilizer trim motors altogether; the other disabled automatic input to the trim motors, but left the pilots' electric trim buttons (up/down) functioning. The toggle switches have distinct markings to reflect their distinct functions.

The MAX retains the two switches, but now they both work like the first switch used to: they completely disable powered control of the stabilizer, including the pilots' trim buttons.
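
The difference between the two switch arrangements, as described above, can be sketched as a small truth-table model. It captures only the behaviour stated in this comment, not the actual wiring. "Automatic trim" stands for any automatic input to the trim motors (on the MAX that would include MCAS), and all names are hypothetical.

```python
from enum import Enum


class Variant(Enum):
    PRE_MAX = "pre-MAX 737"
    MAX = "737 MAX"


def trim_inputs_available(variant: Variant,
                          first_switch_cutout: bool,
                          second_switch_cutout: bool) -> dict:
    """Return which powered trim inputs remain active for a switch setting."""
    if variant is Variant.PRE_MAX:
        # First switch removes power from the trim motors altogether;
        # second switch removes only the automatic inputs.
        motors_powered = not first_switch_cutout
        automatic = motors_powered and not second_switch_cutout
        pilot_buttons = motors_powered
    else:
        # On the MAX, either switch removes all powered control.
        motors_powered = not (first_switch_cutout or second_switch_cutout)
        automatic = motors_powered
        pilot_buttons = motors_powered
    return {"automatic_trim": automatic, "pilot_trim_buttons": pilot_buttons}


# Flipping only the second switch:
print(trim_inputs_available(Variant.PRE_MAX, False, True))
print(trim_inputs_available(Variant.MAX, False, True))
```

In this model, the pre-MAX arrangement is the only one with a switch setting that blocks automatic trim while leaving the pilots' trim buttons live.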
__________________________________

Reportedly, the Ethiopian pilots were well aware of how MCAS brought down the Lion Air flight (in general, airline pilots pay very serious attention to information on systemic causes of accidents).

And those Ethiopian pilots did their best to follow the recommended procedure for recovery from MCAS runaway. But as explained previously, when the stabilizer has gotten far out of trim, mechanical forces make it practically impossible both to keep the nose at a reasonable angle and to correct the stabilizer position using the manual control wheels at the same time. That's why the simulator pilots (whom I mentioned above) needed their "roller coaster" maneuvers in order to recover control of the aircraft (with considerable difficulty).

The Ethiopian pilots, finding their trim control wheels to be unusable, returned those toggle switches to their normal position. This enabled them to use their trim buttons to correct the stabilizer ... but it also activated MCAS again, which after its demonic pause of a half minute or so, resumed pushing the nose down.
__________________________________

The Seattle Times article presents two cases:

1. If the two switches had retained their distinct functions, the Ethiopian pilots could have operated only the second switch, disconnecting MCAS while leaving their electric trim buttons effective.

In that case, they would easily have regained control of the stabilizer, returned the plane to its desired flight path, and completed their flight safely.

2. Pilots have always been trained to operate BOTH switches in the event of a runaway, so retaining the pre-MAX configuration of the switches wouldn't likely have helped.
__________________________________

Of course, there's a third way: if Boeing had kept one of the switches as disabling only automatic inputs to the stabilizer, then when they published recommendations for what to do about MCAS malfunction after the Lion Air crash, the procedure could have called for operating only the second switch, and using the powered trim buttons as needed.

Now, it's not so simple as all that, because my hypothetical procedure would require the pilots to distinguish MCAS failure from other causes of stabilizer runaway. That might sound easy to those reading this comment, but when the plane wants to hurtle groundward and you're using much of your muscle capacity pulling back on the control column in an effort to arrest that tendency, diagnosing failure modes may require a kind of cool inference not readily accessible.

If the cause of the runaway weren't MCAS, flipping only the second switch could lose precious seconds and possibly doom the plane anyway.

So it's not such a simple win/lose proposition.

The change in the function of those two toggle switches is yet another design decision made at Boeing, which probably had a perfectly reasonable rationale behind it ...
__________________________________

* For anyone wondering about the crucial coverage from what is not a very famous newspaper nationally, the Boeing company started, and long had its headquarters, in the Seattle area. A lot of the production still takes place there, so Seattle has been a kind of "company town" for Boeing. Seattle Times coverage of Boeing matters is first rate.

Roderick ReesMay 18, 2019 6:02 PM

Dr Schneier - There are some excellent critiques of the 737 MAX disaster available online. I would like to offer a slightly different point: the precise causes of these two crashes are not as important as the design environment that allowed them. This is a management problem, and even if the relevant deficiencies in the MCAS were to be fixed, the management attitudes that led to them would still be operating, and therefore we should expect more problems with the MAX.

It is illustrated by the reported meeting, or clash, between some pilots and a Boeing manager (I am surprised that Mike Sinnett let himself in for that - I had a long talk with him about a different problem and was impressed with his capability and his attitudes): in effect Boeing management is telling the pilots to stop complaining and to trust Boeing management. But the story, going back many years, is that the management cannot be trusted and should not be trusted (their first concern is not flight safety). The DER system was excellent, with one major deficiency, and all the DERs were very conscientious people; occasionally one would hear a rumor of a manager trying to press a DER to change his assessment, but the process provided for that to lead at once to a report to the FAA, and for the offending manager to be told to back off.

That one deficiency was software, which was explicitly not to be included numerically in the certification analyses. The reasons are well known, but unfortunately the explosive increase in computing power has meant that it was just too easy to say "And let's add this capability, and this, and this..." so that it expanded beyond anyone's ability to understand how it could cause failures. Software has changed into a kind of magic, in many people's minds (and AI is orders of magnitude worse as magic). I see no way to get it under control unless design managers are compelled to keep it simple enough for human understanding and analysis.

When that system was changed because the FAA could not recruit enough capable engineers, the Boeing management was given a free rein to tell the ARs to let anything pass if it would save time or money, or even to impose absurd design requirements such as "We'll make it so reliable that it will never fail" (!) by having triple computers all with the same software and the same inputs. (Even the Space Shuttle with five computers had a problem with that.)
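
Why triple computers with the same software and the same inputs add little protection can be shown with a toy sketch. The control law, threshold, and sensor value below are invented for illustration and do not represent any real flight control system.

```python
from collections import Counter


def buggy_controller(sensor_value: float) -> str:
    """Hypothetical control law with a latent flaw: it trusts any input."""
    return "NOSE_DOWN" if sensor_value > 15.0 else "HOLD"


def majority_vote(outputs):
    """Return the most common output and how many channels produced it."""
    winner, count = Counter(outputs).most_common(1)[0]
    return winner, count


# The same (hypothetical) faulty sensor value is fed to all three channels,
# and all three channels run identical software.
bad_reading = 70.0
outputs = [buggy_controller(bad_reading) for _ in range(3)]
command, votes = majority_vote(outputs)
print(command, votes)  # NOSE_DOWN 3
```

The vote is unanimous and wrong. Voting across channels only catches faults that differ between channels, such as a hardware failure in one computer; a shared software flaw or a shared bad input passes straight through.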

I see no solution that does not include clearing out the FAA management that accepted that loose delegation, and also making Congress see that the FAA needs more money and resources to recruit capable supervisory engineers - and good luck with that.



Sidebar photo of Bruce Schneier by Joe MacInnis.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.