What Lessons Can Be Learned from BA's Systems Outage?

Posted by Andrew Ogilvie in IT News

At 9.30am on Saturday 27th May 2017 British Airways (BA) suffered a datacentre outage which led to the cancellation of over 800 flights, affected over 75,000 customers and caused days of disruption, with compensation costs potentially of more than £100 million.

Remarkably, the actual power outage was only some 15 minutes in length.

In a datacentre it is typical for computer and network equipment to be connected to an Uninterruptible Power Supply (UPS), a string of batteries which smooths out fluctuations in the flow of power and should enable an uneventful fail-over from the mains to diesel generators, and vice versa, if required.

It appears that in the BA incident something went wrong with the switchover from mains to generator power, or possibly the UPS was manually overridden, with a power surge perhaps physically damaging equipment.

What is interesting about this incident is that there were two different types of failure - one at a physical level and one at a data level. Protecting against physical failure such as power outages or equipment damage is relatively straightforward - have two or more sets of the physical infrastructure in two different locations. This protection does appear to have been in place - it is understood BA do operate two physically independent datacentres.

The data failure was however what caused the impact on customers to last dramatically longer than the 15 minutes of physical power loss. Clearly something went wrong in how the software should fail over between the datacentres, and most likely data, or the flow of data between databases and/or various applications, became corrupted, inconsistent or back-logged. This kind of situation is an IT manager's nightmare. Even if the physical equipment is restored and recent backups are available, complex systems may require data to be restored in a particular order if data is not to become corrupted or lost. This takes time, and in the meantime BA's passengers and staff could not access the airline's applications and the airline effectively went "on pause" for many hours.
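To illustrate what "restored in a particular order" can mean in practice, here is a minimal Python sketch that derives a safe restore sequence from a map of dependencies. The system names and their dependencies are purely hypothetical and say nothing about BA's actual architecture; the only assumption is that each system lists the systems that must be back before it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical example: each system maps to the systems that must be
# restored BEFORE it. These names are illustrative only.
RESTORE_DEPENDENCIES = {
    "customer_db": set(),
    "flight_schedule_db": set(),
    "message_queue": set(),
    "booking_service": {"customer_db", "flight_schedule_db", "message_queue"},
    "check_in_service": {"booking_service"},
    "baggage_service": {"check_in_service", "message_queue"},
}


def restore_order(dependencies):
    """Return a list in which every system appears after its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular, which is
    itself worth discovering before a real outage.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, system in enumerate(restore_order(RESTORE_DEPENDENCIES), start=1):
        print(f"{step}. restore {system}")
```

Writing the dependencies down as data, rather than relying on memory during an incident, means the order can be reviewed and rehearsed in advance.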

One of the lessons to be learned here is to test your failover strategies on a regular basis. That might mean failing over at a physical level between different locations, or between databases, virtual machines or applications.
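As a rough idea of what a scripted failover drill might look like, the Python sketch below simulates losing the primary site and checks that traffic would move to the secondary and then return. The `Site` class and both functions are hypothetical stand-ins; a real drill would call whatever health checks and switching mechanisms your own environment provides.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical stand-in for a datacentre or cloud site."""
    name: str
    healthy: bool = True


def choose_active(primary, secondary):
    """Prefer the primary; fall back to the secondary if the primary is down."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    raise RuntimeError("no healthy site available")


def failover_drill(primary, secondary):
    """Simulate a primary outage, then recovery, and record what happened."""
    results = {}
    primary.healthy = False  # simulated outage
    results["failed_over"] = choose_active(primary, secondary) is secondary
    primary.healthy = True   # simulated recovery
    results["failed_back"] = choose_active(primary, secondary) is primary
    return results


if __name__ == "__main__":
    print(failover_drill(Site("primary"), Site("secondary")))
```

Running a check like this on a schedule, and treating any failed step as an incident in its own right, is one way to find failover problems before customers do.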

Veeam DRaaS - Disaster Recovery as a Service - allows companies to securely and reliably replicate their key data and applications in near real time from their primary site to a secondary cloud-based site. Backups can be easily tested on a regular basis to check their integrity. Systems can be failed over or failed back rapidly as required. Veeam DRaaS is delivered using Veeam Backup & Replication software in conjunction with Veeam Cloud Connect services run by a managed cloud provider.