Two thousand nineteen was one of the most turbulent years in Boeing’s history. Its 737 MAX troubles went from bad to worse to staggering as aviation regulators around the world grounded the aircraft and a steady trickle of disclosures exposed software problems and corner-cutting.
The aircraft’s flaw, an anti-stall mechanism that relied on data from a single sensor, offers a particularly instructive case study of the notion of a single point of failure.
One Fault Could Cause an Entire System to Stop Operating
A single point of failure of a system is an element whose failure can result in the failure of the entire system. (A system may have multiple single points of failure.)
Single points of failure are eliminated by adding redundancy—duplicating the critical components, or at least backing them up—so that the failure of any one element does not bring down the entire system.
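To make the idea concrete, here is a minimal sketch in Python; the sensor functions and values are hypothetical and not drawn from any real avionics code. A reading backed by a redundant source keeps working even when its primary source fails.

```python
from typing import Callable, Optional

def read_with_backup(primary: Callable[[], Optional[float]],
                     backup: Callable[[], Optional[float]]) -> Optional[float]:
    """Return the primary reading, or fall back to the backup when the
    primary has failed (signalled here by returning None)."""
    value = primary()
    if value is not None:
        return value
    return backup()  # the redundant twin keeps the function alive

# The primary sensor has failed, yet the system still produces a reading.
print(read_with_backup(lambda: None, lambda: 4.2))  # -> 4.2
```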
Boeing Mischaracterized Its Anti-Stall System as Less-than-Catastrophic in Its Safety Analysis
The two 737 MAX crashes (Lion Air and Ethiopian Airlines) originated in a late change that Boeing made to a trim system called the Maneuvering Characteristics Augmentation System (MCAS).
Without pilot input, MCAS could automatically nudge the aircraft’s nose downwards if it detected that the aircraft was pitched up at a dangerous angle, for instance, at high thrust during take-off.
Reliance on One Sensor Is Anathema in Aviation
The MCAS had previously been “approved” by the Federal Aviation Administration (FAA). However, Boeing made some design changes after that approval without checking with the FAA again. The late changes were intended to improve MCAS’s response during low-speed aerodynamic stalls.
The MCAS relied on data from just one angle-of-attack (AoA) sensor. With no backup, if that single sensor malfunctioned, its erroneous input could trigger an unwarranted corrective nosedive just after take-off. That is precisely what happened in the two crashes.
The AoA sensor thus became a single point of failure. Although the aircraft has two angle-of-attack sensors on its nose, MCAS used data from only one of them at a time and did not require agreement between the two before inferring that the aircraft was stalling. Further, Lion Air had not paid extra to equip its aircraft with an optional warning light that could have alerted the crew to a disagreement between the AoA sensors.
Boeing Missed Safety Risks in the Design of the MAX’s Flight-Control System
Reliance on one sensor’s data is an egregious violation of a long-standing engineering principle: eliminate single points of failure. Some aircraft use three redundant systems for flight control: if two systems agree and the third does not, the flight-control software ignores the odd one out.
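Here is a minimal sketch of that two-out-of-three voting idea in Python; the readings, tolerance, and function name are assumptions for illustration, not actual flight-control code.

```python
def vote_two_of_three(a: float, b: float, c: float,
                      tolerance: float = 1.0) -> float:
    """Return a reading that at least two of the three sensors agree on;
    the odd one out is ignored."""
    readings = [a, b, c]
    for i, candidate in enumerate(readings):
        others = readings[:i] + readings[i + 1:]
        # Count how many of the other sensors agree with this candidate.
        agreeing = sum(1 for r in others if abs(r - candidate) <= tolerance)
        if agreeing >= 1:  # candidate plus at least one other = two votes
            return candidate
    raise ValueError("No two sensors agree -- treat the data as unreliable")

# Sensor b has drifted; the voter sides with the two sensors that agree.
print(vote_two_of_three(5.1, 22.0, 5.3))  # -> 5.1
```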
As if the dependence on one sensor were not enough, Boeing, blinded by time and cost pressure to stay competitive with its European rival Airbus, deliberately omitted any reference to MCAS from the pilot manuals to spare its airline customers additional pilot training. Indeed, Boeing did not even disclose the existence of MCAS on the aircraft.
Boeing allows pilots to switch the trim system off to override the automated anti-stall system, but the pilots of the ill-fated Lion Air and Ethiopian Airlines flights failed to do so.
Idea for Impact: Redundancy Is the Sine Qua Non of Reliable Systems
In preparation for the 737 MAX’s airworthiness recertification, Boeing has corrected the MCAS blunder by having its trim software compare inputs from both AoA sensors, alert the pilots if the readings disagree, and limit MCAS’s authority.
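The corrected behavior described above might look something like the following Python sketch; the disagreement threshold, stall angle, and authority limit are assumed values for illustration only, not Boeing’s actual parameters.

```python
DISAGREE_THRESHOLD_DEG = 5.5   # assumed threshold, for illustration only
MAX_NOSE_DOWN_COMMAND = 1.0    # assumed cap on automated trim authority

def mcas_style_command(aoa_left: float, aoa_right: float,
                       stall_aoa: float = 15.0) -> tuple:
    """Return (trim_command, disagree_alert).

    A nose-down command is issued only when BOTH sensors indicate a stall,
    and it is never larger than the fixed authority limit."""
    if abs(aoa_left - aoa_right) > DISAGREE_THRESHOLD_DEG:
        return 0.0, True                      # alert the crew, take no action
    if aoa_left > stall_aoa and aoa_right > stall_aoa:
        return MAX_NOSE_DOWN_COMMAND, False   # limited authority
    return 0.0, False

# One faulty sensor (left reads 40 deg, right reads 3 deg) now raises an
# alert instead of commanding a nosedive.
print(mcas_style_command(40.0, 3.0))  # -> (0.0, True)
```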
One key takeaway from the MCAS disaster is this: when you design a highly reliable system, identify every single point of failure and investigate how those risks and failure modes can be mitigated. Examine whether each component of a product or service you work on is a single point of failure by asking, “If this component fails, does the rest of the system still work, and, more importantly, does it still perform the function it is supposed to?”