Black start: How a single VM brought a highly available infrastructure to its knees

Do you have one of those memories? One of those stories from everyday IT life that you tell over a coffee (or a beer after work) and that draws a knowing nod from your colleagues? A story that stays with you years later because the lessons it taught were so fundamental?

I have one of those. It's about a highly redundant system distributed across two data centers, a system designed to withstand the failure of an entire site. And it's about a 24-hour total outage that should have been impossible.

This is the anatomy of that failure. It is a reminder that Murphy's Law is omnipresent in IT: anything that can go wrong will go wrong. Especially when a "temporary solution" is involved.

VM vs. highly available infrastructure

The setup: A bastion of redundancy (theoretically)

About a decade ago, I was part of a team supporting a then state-of-the-art VMware infrastructure for one of our customers. On paper, the setup was a fortress of availability:

  • Two data centers: Physically separate sites, connected by dark fiber.
  • Computing power: Several fully equipped IBM BladeCenters at both sites.
  • Virtualization: A large VMware cluster stretched across both sites, running hundreds of critical Linux VMs.
  • Storage backend: The centerpiece. A high-performance Fibre Channel SAN whose controllers were distributed across both data centers and whose data was mirrored synchronously.

The declared goal was crystal clear: absolute redundancy. A blade chassis fails? No problem, vMotion moves the VMs elsewhere. An entire data center goes down due to a power outage or natural disaster? No problem, the second data center takes over seamlessly. We felt safe. Too safe.

The trigger: the inconspicuous single point of failure

To understand what went wrong, we need to talk about a core concept of high-availability storage clusters: the Quorum Witness (QW).

Imagine a storage cluster with two heads (controllers), one in each data center. If the connection between them fails, each faces an existential question: "Am I the one that is still alive, or am I the one isolated from the rest of the network?" If both controllers concluded that they were the "master", both would accept writes. That would lead to a catastrophic state known as "split-brain": two inconsistent versions of the truth that make data recovery almost impossible.

This is where the Quorum Witness comes into play. It acts as an independent referee. If the connection is lost, each controller tries to reach the QW. Only the controller that can reach the QW is allowed to continue. The other controller assumes that it is isolated and shuts down access to the data (the LUNs) to prevent data corruption. A simple but effective logic.
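For illustration, that referee logic can be boiled down to a few lines. The following Python sketch is a deliberate simplification with made-up names; real SAN controller firmware adds heartbeats, timeouts and tie-breaker priorities, but the core decision looks like this:

```python
def survival_decision(partner_reachable: bool, witness_reachable: bool) -> str:
    """Simplified survival logic for one controller of a two-site storage cluster.

    Purely illustrative: real SAN firmware adds heartbeats, timeouts and
    tie-breaker priorities, but the core decision is this one.
    """
    if partner_reachable:
        # Both controllers still see each other: no arbitration needed.
        return "keep serving I/O"
    if witness_reachable:
        # The partner is gone, but the referee confirms that we are not the
        # isolated side: we may continue on our own.
        return "keep serving I/O"
    # Neither the partner nor the witness answers. We might be the isolated half
    # of a split brain, so the only safe move is to stop serving the LUNs.
    return "fence self: disable access to the LUNs"
```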

Our critical vulnerability? The Quorum Witness was not hosted on dedicated, independent hardware (e.g. a small physical server at a third site), as best practice would dictate. It was a tiny virtual machine, provisioned on the very storage cluster it was supposed to monitor.

It was a temporary solution, set up during the initial commissioning with the firm intention: "We'll get it right later." That sentence still rings in our ears.

The chain reaction: from crash to total failure

On a perfectly normal Tuesday, the inevitable happened.

  • Step 1: The SAN cluster in the first data center (let's call it Cluster A) suffered a serious, unrecoverable hardware failure. Several controller components failed simultaneously. This was exactly the scenario our redundant setup was built for. In itself, no reason to panic.
  • Step 2: Because the Quorum Witness VM had its virtual disk on Cluster A's storage, the QW crashed along with the cluster. It was gone from one second to the next.
  • Step 3: Now the second, completely intact SAN cluster (Cluster B) in the other data center assessed the situation. Its status checks revealed two things at once:
    1. The connection to the partner cluster (Cluster A) was dead.
    2. The connection to the referee (the Quorum Witness) was also dead.
  • Step 4: According to the iron rules of cluster logic, Cluster B now had to assume the worst. From its point of view, it was possible that Cluster A and the Quorum Witness were still running and that only it, Cluster B, was isolated. To prevent a split brain at all costs, it did the only safe thing: it declared itself the split-brain minority and disabled access to all storage volumes (see the sketch after this list).
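To connect the dots, here is Cluster B's situation on that Tuesday expressed in the terms of the simplified logic sketched earlier. Again, this is a minimal illustration, not the vendor's actual implementation, and the variable names are mine:

```python
# Cluster B's view on that Tuesday, in the terms of the sketch above.
partner_reachable = False   # Cluster A: dead after the hardware failure
witness_reachable = False   # Quorum Witness VM: died together with Cluster A

if partner_reachable or witness_reachable:
    decision = "keep serving I/O"
else:
    # Possibly the isolated half of a split brain: the only safe move.
    decision = "fence self: disable access to all LUNs"

print(decision)  # -> fence self: disable access to all LUNs

# With an independently hosted witness, witness_reachable would have been True,
# and Cluster B could have kept its roughly 200 VMs running on its own.
```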

The result was a digital cardiac arrest. From one second to the next, around 200 virtual machines lost their hard disks. The entire system came to a standstill. Total failure.

The "War Room": 24 hours of technology and humanity

What followed was one of the longest crisis calls of my career. The "war room" was virtual: a WebEx conference that would run almost uninterrupted for 24 hours. A little anecdote in passing: back then, WebEx meetings had a hard time limit of 10 hours. In the middle of the night, at the height of the crisis, we had to frantically set up a new conference and migrate all participants to it.

But what I remember most from those 24 hours was not just the feverish technical troubleshooting. It was the human cohesion. The core team of storage and VMware specialists worked through the night, trying to revive the cluster manually, without a quorum.

The special thing was that many other team members, whose areas of expertise (applications, databases, etc.) were not directly needed at that moment, voluntarily stayed on the line. For hours on end. They were simply there to watch the active engineers' backs, shield them from outside questions and provide moral support. It was a tacit promise: "We are here. You are not alone."

The lessons learned: What we took away from the abyss

After more than 24 hours, the system was up and running again. The "black start" had succeeded. But the relief was mixed with the bitter realization of how fragile our "bulwark" had actually been. The lessons we learned from this have shaped my work as an architect to this day.

  • Lesson 1: The pitfall of the temporary solution. This is the most important lesson. The phrase "We'll get it right later" is one of the most dangerous in IT. A makeshift solution that goes into production is no longer a makeshift solution. It is a ticking time bomb. Murphy's Law guarantees that it will go off exactly when the conditions are most unfavorable.
  • Lesson 2: Understanding true redundancy. Redundancy is not a feature that you buy. It's a concept that you have to understand and implement end-to-end. We had redundant servers, redundant storage controllers and redundant sites. But we had overlooked a fatal logical dependency that undermined the whole concept. You have to analyze the entire chain of dependencies, not just the individual links.
  • Lesson 3: Access to documentation. The irony of it all? Our detailed contingency plan, the documentation for restoring the SAN cluster, was... on a wiki. The wiki's VM was on the failed cluster. We didn't have access to our own recovery instructions when we needed them most. Since that day, I've been preaching: critical emergency documentation belongs in an independent, off-site location (cloud storage, a printed manual in the vault - anything not affected by the outage itself).
  • Lesson 4: The human factor. In the deepest crisis, when technology fails, it's the team that counts. Technical ability is one half of the equation. Team spirit, trust and mutual support are the other. A team that sticks together can solve almost any technical problem. This incident was impressive proof of that.

Conclusion: More than just a memory

This story is more than just an anecdote. It is a plea for care, for a deep understanding of the systems we build, and for a healthy dose of paranoia. It is a reminder that the most complex systems can fail in the simplest of places.

I invite you: take a moment and think about your own systems. Where are your "temporary solutions" hiding? Which dependencies have you perhaps overlooked? Is your emergency plan really accessible when the emergency occurs?

Such incidents are painful, expensive and nerve-wracking. But the most valuable lessons about engineering, design and teamwork are often only learned in the face of a real crisis. They make us better engineers, better architects and better colleagues.

This is exactly the kind of experience we share in our blog. If you want to find out more about how we use technology, overcome challenges and develop solutions together, take a look at our other articles, e.g. on our Approach to versioning. Perhaps you will find exactly the inspiration you need there.
