In my case study of the Boeing 737 MAX aircraft’s anti-stall mechanism, I examined how relying on data from only one Angle-of-Attack (AoA) sensor caused two accidents and the aircraft’s consequent grounding.
A single point of failure is a system component whose failure renders the entire system unavailable, dysfunctional, or unreliable. In other words, if many things rely on one component within your system and that component breaks, you are counting down to a catastrophe.
Case Study: How Airbus Builds Multiple Redundancies to Minimize Single Points of Failure
As the Boeing 737 MAX disaster has emphasized, single points of failure in products, services, and processes may spell disaster for organizations that have not adequately identified and mitigated these critical risks. Reducing single points of failure requires a thorough knowledge of the vital systems and processes that an organization relies on to be successful.
Since the dawn of flying, reliance on one sensor has been anathema.
The Airbus A380 aircraft, for example, features 100,000 different wires, amounting to 470 km of cables weighing some 5,700 kg. Airbus’s wiring includes double or triple redundancy to mitigate the risk of single points of failure caused by defective wiring (e.g., corrosion, chafed insulation, or loose contacts) or cut wires (e.g., from debris penetrating the aircraft structure, as in the case of an engine burst).
The Airbus fly-by-wire flight control system has quadruplex redundancy: it has five flight control computers, of which only one is needed to fly the aircraft. Consequently, an Airbus aircraft can afford to lose four of these computers and still be flyable. Of the five flight control computers, three are primary computers and two are secondary (backup) computers. The primary and the secondary flight control computers use different processors, are designed and supplied by different vendors, feature different chips from different manufacturers, and have different software systems developed by different teams using different programming languages. All this redundancy, built from deliberately dissimilar designs, reduces the probability of common hardware and software errors that could lead to system failure.
Redundancy is Expensive but Indispensable
The multiple redundant flight control computers continuously keep track of each other’s output. If one computer produces deviant results for some reason, the flight control system as a whole excludes the results from that aberrant computer in determining the appropriate actions for the flight controls.
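To make the idea of excluding an aberrant computer concrete, here is a minimal sketch of a median-based voter. It is an illustration only, not Airbus's actual voting logic: the function name `vote`, the channel names, and the `tolerance` threshold are all hypothetical.

```python
from statistics import median

def vote(outputs, tolerance):
    """Pick an agreed value from redundant channels (hypothetical sketch).

    outputs:   dict mapping a channel name to its numeric output
    tolerance: maximum allowed distance from the median before a
               channel is treated as aberrant (an assumed threshold)
    Returns (agreed_value, list_of_excluded_channel_names).
    """
    mid = median(outputs.values())
    # Keep only the channels that agree with the majority view.
    healthy = {name: value for name, value in outputs.items()
               if abs(value - mid) <= tolerance}
    excluded = sorted(set(outputs) - set(healthy))
    # Recompute the agreed value from the healthy channels only.
    agreed = median(healthy.values())
    return agreed, excluded

# Five made-up channels; computer_3 has drifted far from the others.
readings = {
    "computer_1": 2.0,
    "computer_2": 2.1,
    "computer_3": 9.5,
    "computer_4": 1.9,
    "computer_5": 2.0,
}
agreed, excluded = vote(readings, tolerance=0.5)
print(agreed, excluded)
```

The median is used here because a single wild value cannot drag it far, so one faulty channel is outvoted rather than averaged in. A voter like this only helps when there are enough independent channels to form a majority, which is one reason a single sensor is so risky.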
By replicating critical sensors, computers, and actuators, Airbus provides for a “graceful degradation” state, where essential facilities remain available, allowing the pilot to fly and land the plane. If an Airbus loses all engine power, a ram air turbine can power the aircraft’s most critical systems, allowing the pilot to glide and land the plane (as happened with Air Transat Flight 236).
Idea for Impact: Build redundancy to prevent system failure from the breakdown of a single component
When you devise a highly reliable system, identify potential single points of failure and investigate how these risks and failure modes can be mitigated.
For every component of a product or a service you work on, identify single points of failure by asking, “If this component fails, does the rest of the system still work, and, more importantly, does it still perform the function it is supposed to perform?”
Add redundancy to the system so that failure of any component does not mean failure of the entire system.
If you can’t build redundancy into a system due to some physical or operational complexity, establish frequent inspections and maintenance to keep the system reliable.
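The question “If this component fails, does the rest of the system still work?” can be asked mechanically once a system's dependencies are written down. The sketch below is one simple way to do that, under assumptions of my own: each function of the system lists its alternative paths, a path works only if all its components work, and a function works if at least one path does. The component and function names are invented for illustration and do not describe any real aircraft.

```python
def system_works(functions, failed):
    """True if every function still has at least one path with no failed part."""
    return all(
        any(not (set(path) & failed) for path in paths)
        for paths in functions.values()
    )

def single_points_of_failure(functions):
    """Fail each component alone and report those that break the system."""
    components = {c for paths in functions.values()
                  for path in paths for c in path}
    return sorted(c for c in components
                  if not system_works(functions, {c}))

# Hypothetical system: function name -> list of alternative paths.
functions = {
    "airspeed": [["pitot_1"], ["pitot_2"], ["pitot_3"]],
    "stall_protection": [["aoa_sensor_1", "computer_a"]],
    "hydraulics": [["pump_1"], ["pump_2"]],
}
print(single_points_of_failure(functions))

# Adding an independent second path removes the single points of failure.
functions["stall_protection"].append(["aoa_sensor_2", "computer_b"])
print(single_points_of_failure(functions))
```

In this made-up example, the stall-protection function depends on one sensor and one computer, so both show up as single points of failure until a second, independent path is added. The check covers only one failure at a time. It says nothing about common-cause failures that take out several "redundant" paths at once.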
Postscript: In people management, make sure that no one person has sole custody of some critical institutional knowledge, creativity, reputation, or experience that makes that person indispensable to the organization’s business continuity and its future performance. If that person should leave, the organization suffers the loss of that valued standing and expertise. See my article about this notion of key-person dependency risk, the threat posed by an organization’s or a team’s over-reliance on one or a few individuals.