We have never had anything like this before
- Created on 06 April 2018
- Written by Steve Burrows
So said Eurocontrol, the European Organisation for the Safety of Air Navigation, last week, following an IT outage in the Enhanced Tactical Flow Management System (ETFMS) that affected around fifteen thousand flights across Europe last Tuesday afternoon and evening - around half of all the flights scheduled. Media commentators estimated that half a million passengers were affected by flight delays or cancellations. Following restoration of the ETFMS, around eight hours after the failure, airlines were asked to reload the flight plan data they had loaded before the outage.
In truth, the ETFMS has been very reliable: the previous outage was in 2001, so that’s around 6,000 days between major IT failures - but that will have been of little consolation to the airlines and passengers affected last week. We have high expectations of IT systems, and from a customer perspective being told that a service failure is due to “IT issues” does not impress. IT failure conveys to our customers an impression, however unjust, of organisational incompetence.
I have no idea how to quantify the cost of the disruption to those flights or their half-million passengers, but I’m sure the bill would be quite big enough to justify the cost of a “hot standby” system which could have been brought into service instantly when the primary system failed. Even the Isle of Man Government has instant automatic failover between replica datacentres for some of its systems. However, auto-failover, hot standby and other approaches to resilient distributed systems with redundancy to mitigate failure are generally still uncommon in business IT. Most companies’ Business Continuity approaches to IT failures are based on “Disaster Recovery” rather than “Disaster Avoidance” - very often born out of the sentiment expressed in Eurocontrol’s reaction to their system failure: “We have never had anything like this before”.
This, I believe, is a legacy of old-school IT Management thinking - having a standby system in the old days of hugely expensive mainframe computers was simply an unaffordable luxury - “CFO says No”. If the live system went down the recovery strategy was to fix it and bring it back up, accepting that it would be down for several hours or days.
Times have changed - organisations and their customers depend upon the availability of IT systems to operate, every working hour and in many cases 24 hours a day every day of the year. Live failover / hot standby / active-active backup IT systems are commonly a business necessity. If one accepts the principle that every IT system will fail, sooner or later, then the only real consideration is the financial cost of providing hot standby systems vs. the cost to the business of lost custom.
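To make that comparison concrete, here is a rough sketch of the arithmetic in Python. Every figure in it (failure frequency, hours of downtime, loss per hour, annual standby cost) is a made-up assumption for illustration, not data from Eurocontrol or any real business, and the function names are my own.

```python
def expected_annual_outage_cost(failures_per_year, hours_down_per_failure, loss_per_hour):
    """Expected yearly loss from downtime: frequency x duration x hourly loss."""
    return failures_per_year * hours_down_per_failure * loss_per_hour


def standby_pays_for_itself(cost_without, cost_with, standby_cost_per_year):
    """True if the downtime loss avoided exceeds the yearly cost of the standby."""
    return (cost_without - cost_with) > standby_cost_per_year


# Hypothetical figures: one serious failure every five years,
# a full day to rebuild and restore, and 5,000 lost per hour of downtime.
without_standby = expected_annual_outage_cost(0.2, 24, 5_000)

# Same failure rate, but failover cuts the outage to well under a minute.
with_standby = expected_annual_outage_cost(0.2, 0.01, 5_000)

# Assumed yearly cost of running the duplicate system.
justified = standby_pays_for_itself(without_standby, with_standby, 12_000)
print(round(without_standby), round(with_standby), justified)
```

The point of the sketch is only that the decision reduces to three or four numbers that any business can estimate for itself; the answer will differ depending on what you plug in.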
If Eurocontrol had had a true hot standby for the ETFMS, the downtime could have been reduced to a few seconds (or less) - it is entirely possible that Europe’s airports and airlines would not even have been aware of the failure. As it is, they can be congratulated for restoring a complex system in a few hours. They should still expect to be criticised for requiring airlines to re-enter their flight data, because their failure to have an up-to-the-minute snapshot of this data prolonged the disruption to European air travel.
The reality of IT is that all computer systems fail, sooner or later; it is inevitable. As businesses / organisations we should have contingency capabilities which mitigate the effects of failure on our customers, staff and organisations - not having those contingency capabilities is akin to driving without insurance, fundamentally irresponsible. The only real variable is how long we can afford for our computer systems to be out of action - an hour? a week? Most IT disaster recovery plans centre around rebuilding IT systems or restoring data and IT capability to a set of spare equipment - which can take many days. Whilst this is better than nothing, is it acceptable for your organisation to be unable to service customers for hours or days?
Having hot standby systems or other forms of IT system duplication with auto-failover in the event that one system or datacentre or data-communications link fails is one of the strongest arguments for outsourcing IT in smaller businesses. Most reputable IT outsourcing / cloud systems suppliers provide this capability by default, because their businesses cannot afford the cost of failing your businesses. Unfortunately many medium-sized and larger organisations with in-house IT systems seem to have unrealistic expectations about the improbability of a critical IT failure, or the likely duration of business disruption which will flow from it.
IT equipment is generally becoming more reliable, but software is increasingly complex, and malware / viruses / ransomware etc. are increasingly common. One way or another, something is going to cause your systems to fail, and when it does the likelihood is that you too will be telling your bosses, customers, shareholders etc. “We have never had anything like this before”, or “It’s a new type of malware, our IT security systems didn’t detect it” - or whatever. In the case of the Eurocontrol ETFMS, flights across Europe were crippled because they were testing a new software release and had incorrectly linked their test and live systems together - a simple human error by IT folk.
The lesson from the Eurocontrol failure last week is very simple: we, almost all of us, depend these days on our IT systems being up and working. Whether we are a local travel agent, vehicle servicing garage, restaurant, retailer or global e-gaming operator, when our IT is down we are out of business. Too many of us don’t really understand the risk - how likely it is that our IT will fail us (almost certain), how long our systems will be down for if the problem is serious (probably days) - or the impact on our businesses of being unable to serve our customers for a prolonged period. Eurocontrol have a monopoly, so their customers can’t go elsewhere, but most of our businesses do not - if we fail them our customers will walk.
Most of the old excuses in respect of cost and complexity of providing hot-standby / auto-failover systems have fallen by the wayside as IT equipment has become massively cheaper over the years. Duplicate / redundant IT systems may sound inherently expensive, but the reality is that most IT hardware is inexpensive. Our businesses have become more dependent on the constant availability of our IT systems, and our customers increasingly expect to interact with our IT systems via the Internet including out of office hours.
As businesses we need to change our mindsets from the old model of recovering from IT failures, which is typically based upon the slow process of rebuilding our IT systems and restoring data from backups, to ensuring that we have hot standby systems with real-time replication of data and a mechanism to switch between production and standby systems either automatically or at the drop of a hat.
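As a rough illustration of what "a mechanism to switch between production and standby systems" can look like, here is a minimal Python sketch of a failover controller driven by periodic health checks. The class names, the `max_missed` threshold and the health-check callables are all hypothetical, and the sketch deliberately ignores the harder part, real-time replication of data, which a real hot standby also needs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe, e.g. a ping or status query


class FailoverController:
    """Switches traffic to the standby after several consecutive failed checks."""

    def __init__(self, primary: Node, standby: Node, max_missed: int = 3):
        self.active = primary
        self.standby = standby
        self.max_missed = max_missed
        self.missed = 0

    def tick(self) -> str:
        """Run one health check and return the name of the node that should serve."""
        if self.active.is_healthy():
            self.missed = 0
        else:
            self.missed += 1
            # Only switch if the standby is itself healthy.
            if self.missed >= self.max_missed and self.standby.is_healthy():
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active.name


# Simulated health state standing in for real probes.
health = {"production": True, "standby": True}
controller = FailoverController(
    Node("production", lambda: health["production"]),
    Node("standby", lambda: health["standby"]),
    max_missed=2,
)

health["production"] = False                 # production fails
served_by = [controller.tick() for _ in range(3)]
print(served_by)  # ['production', 'standby', 'standby']
```

Requiring more than one missed check before switching is a design choice to avoid flipping over on a single dropped probe; how many checks, and how often they run, sets how many seconds of outage customers might notice.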
That advice might sound costly and impractical, but with the increasing acceptability of Cloud computing and hosted services from smaller IT suppliers and resellers, our standard email, file storage, office productivity and common systems such as Accounting, CRM, ERP and Document Management / Workflow can all be hosted off-site by Cloud IT providers who offer hot standby and auto-failover as a standard feature of their service. Even here on the Isle of Man we have a choice of local IT suppliers who will host our everyday IT across multiple datacentres for resilience, meaning that if the primary IT system fails the interruption to our business whilst we are switched over to the standby system will be imperceptible to our customers. IT availability management concepts which were, a few years ago, only really practical for large companies are now within the reach of pretty much all businesses - even the one-person business can easily buy resilient Cloud-based IT.
With the everyday IT systems taken care of, that really only leaves our bespoke systems as vulnerable to failure - which is of course what happened to Eurocontrol. Just as our everyday IT systems can be replicated to a hot standby for resilience, the same is true of almost all bespoke systems in our businesses - and again our local IT suppliers can help. The IT supplier capability on the island may not compete with the scale and sophistication of some of the global leaders, but it’s certainly good enough that we can generally avoid the issue of IT system failure impacting upon our customers. Local IT suppliers can set up replica systems and mirror data across multiple Isle of Man datacentres in real-time or near real-time, meaning that no business on the Isle of Man should need to be in the situation of letting down their customers due to IT failure - unlike Eurocontrol.
Customer service is a key differentiator. Those of us in Manx business who care about it should probably be taking the Eurocontrol mishap as a timely reminder to revisit our IT disaster strategies, talk to our IT suppliers, and consider the move from Disaster Recovery to Disaster Avoidance.