Fragile Redundancy

How backup systems quietly come to share their primaries' dependencies, how contingency plans decay between exercises, and why failovers fail at the moments they are most needed.

Overview

Redundancy is often treated as a near-synonym for resilience. In practice the relationship is much weaker. Most critical systems carry redundancy on paper that, when examined closely, depends on the same underlying components, the same suppliers, the same staff, or the same assumptions as the primary path it is meant to protect.

Fragile redundancy is not a defect of any one operator. It is the predictable result of treating redundancy as an asset to be procured rather than a property to be continuously maintained.

§01

Why backup systems share dependencies

Backup systems share dependencies because the same forces that shape a primary system shape its alternatives. They are bought from the same suppliers, certified under the same standards, run by the same staff, and routed through the same upstream services. The conditions that made the primary attractive make the same choices attractive for the backup.

Common-mode dependencies are often invisible until they are tested. Two nominally independent data centres may draw their power, several steps upstream, from the same substation. Two nominally independent providers may rely on the same submarine cable. Independence at the contract level rarely survives close examination of the physical or institutional layer beneath.
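One way to make a common-mode dependency visible before an incident does is to follow each path's dependencies all the way down and intersect the results. The sketch below is illustrative only: the component names and the `DEPENDS_ON` map are hypothetical, and it assumes an operator can write down who relies on what, which is often the hard part.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it relies on directly.
DEPENDS_ON = {
    "datacentre_a": ["utility_feed_a", "carrier_x"],
    "datacentre_b": ["utility_feed_b", "carrier_y"],
    "utility_feed_a": ["substation_1"],
    "utility_feed_b": ["substation_1"],
    "carrier_x": ["cable_north"],
    "carrier_y": ["cable_north"],
}


def transitive_dependencies(component, depends_on):
    """Return everything the component relies on, directly or indirectly."""
    seen = set()
    queue = deque(depends_on.get(component, []))
    while queue:
        dependency = queue.popleft()
        if dependency in seen:
            continue
        seen.add(dependency)
        queue.extend(depends_on.get(dependency, []))
    return seen


def shared_dependencies(primary, backup, depends_on):
    """Return the dependencies that the primary and the backup have in common."""
    return transitive_dependencies(primary, depends_on) & transitive_dependencies(
        backup, depends_on
    )


# The direct dependencies are disjoint, so the two sites look independent.
direct_overlap = set(DEPENDS_ON["datacentre_a"]) & set(DEPENDS_ON["datacentre_b"])
print(sorted(direct_overlap))  # []

# One layer further down, they are not.
shared = shared_dependencies("datacentre_a", "datacentre_b", DEPENDS_ON)
print(sorted(shared))  # ['cable_north', 'substation_1']
```

The comparison of direct dependencies corresponds to the contract-level view; the transitive comparison corresponds to the physical layer beneath it.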

§02

Why contingency plans decay

Contingency plans decay because the world around them does not stand still. Staff move on, vendors are replaced, configurations drift, and the assumptions baked into the plan slowly diverge from operating reality. Without continuous use, the gap between what the plan describes and what the system actually does grows quietly each year.

Exercises slow this decay but do not stop it. The plans that survive longest are those embedded in routine operations — used often enough that their assumptions are continually checked — rather than those preserved in documents intended for emergencies.
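The gap between what a plan describes and what the system actually does can be checked routinely rather than discovered in an emergency. The following is a minimal sketch, assuming the plan's assumptions can be written as key-value pairs and compared against an observed inventory; every key and value shown is hypothetical.

```python
# Hypothetical assumptions recorded when the contingency plan was written.
PLAN_ASSUMPTIONS = {
    "on_call_lead": "j.ortega",
    "backup_vendor": "vendor_a",
    "failover_dns_ttl_seconds": 60,
    "standby_region": "region_2",
}

# Hypothetical state observed today (staff roster, contracts, live configuration).
OBSERVED_STATE = {
    "on_call_lead": "m.chen",
    "backup_vendor": "vendor_a",
    "failover_dns_ttl_seconds": 3600,
}


def plan_drift(assumed, observed):
    """Return {key: (assumed, observed)} for every assumption that no longer holds.

    An assumption that cannot be verified at all is reported with None as
    its observed value.
    """
    drift = {}
    for key, expected in assumed.items():
        if key not in observed:
            drift[key] = (expected, None)
        elif observed[key] != expected:
            drift[key] = (expected, observed[key])
    return drift


drift = plan_drift(PLAN_ASSUMPTIONS, OBSERVED_STATE)
for key, (expected, actual) in sorted(drift.items()):
    print(f"{key}: plan says {expected!r}, observed {actual!r}")
```

Run on a schedule, a check like this moves part of the plan out of the document and into routine operations, where its assumptions are tested continually.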

§03

Why failovers fail

Failovers fail because they are exercised under conditions that differ in important ways from the conditions in which they are eventually needed. They are tested in isolation, on a schedule, and with the cooperation of the primary system being failed away from. Real incidents arrive in combinations, at inconvenient times, and with only partial information.

The systems that recover well from disruption are usually those that exercise their alternative paths as part of normal operation, rather than treating them as reserves to be activated only in extremis. Redundancy that is used is redundancy that works.
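Exercising an alternative path as part of normal operation can be as simple as sending a fixed share of ordinary work through it. The sketch below is one possible arrangement, not a recommended design: the `ExercisedRouter` class, its `backup_every` parameter, and the two handler functions are all hypothetical names introduced for illustration.

```python
class ExercisedRouter:
    """Sends every Nth request through the backup path during normal operation."""

    def __init__(self, primary, backup, backup_every=10):
        if backup_every < 1:
            raise ValueError("backup_every must be at least 1")
        self.paths = {"primary": primary, "backup": backup}
        self.backup_every = backup_every
        self.request_count = 0
        self.served_by = {"primary": 0, "backup": 0}

    def handle(self, request):
        """Return (path_name, result), falling back to the other path on error."""
        self.request_count += 1
        if self.request_count % self.backup_every == 0:
            order = ("backup", "primary")  # routine exercise of the backup
        else:
            order = ("primary", "backup")
        last_error = None
        for name in order:
            try:
                result = self.paths[name](request)
            except Exception as error:
                last_error = error
                continue
            self.served_by[name] += 1
            return name, result
        raise RuntimeError("both paths failed") from last_error


def primary_path(request):
    return f"primary handled {request}"


def backup_path(request):
    return f"backup handled {request}"


router = ExercisedRouter(primary_path, backup_path, backup_every=10)
for request_id in range(20):
    router.handle(request_id)
print(router.served_by)  # {'primary': 18, 'backup': 2}
```

Because the backup carries real requests every day, a fault in it shows up as a routine error rather than as a surprise during an incident.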

We study systems, not actors.

Related Patterns

PATTERN 001

Dependency Concentration

Observed In

Systems in which this pattern is one of the recurring structures we study.

Air Traffic Control (forthcoming study)
Electrical Grids (forthcoming study)
Telecommunications Networks (forthcoming study)