Expensive datacentre outages: Untangling messy collaborations, contributing costs and complexity

Even if outages in the datacentre are trending down, Uptime data this year traces rising costs per outage. So what can be done?

Fleur Doidge

Published: 18 Nov 2022

As businesses increasingly rely on their digital infrastructure, downtime has a proportionally greater impact. However, while attention to reducing outages in the datacentre appears to be paying off, the costs per outage are not falling.

Andy Lawrence, executive director of research at Uptime, says a trend for “small improvements” in annual outage rates may be pointing up the high cost of the remaining outages. Some 78% of respondents to Uptime’s global datacentre survey in 2020 reported experiencing outages – yet in 2022 the proportion fell to 60%.

“The most expensive outages can be catastrophically so, with lost business and reputational risk sometimes affecting company valuations,” says Lawrence. “But even more routine and far less impactful outages are getting more expensive, as the costs of even relatively simple mitigation increase.”

Despite a push to build datacentres more cheaply, more operators are investing in on-site resiliency alongside distributed backups and recovery services in a bid to avoid outage-related revenue loss or financial impacts. Outages, of course, are also subject to inflation, with parts, labour, service-level infringements and the like all seeing an impact, says Lawrence.

About 40% of those polled by Uptime were from professional IT or datacentre services providers. Some 57% of a total of 830 respondents hailed from organisations boasting annual revenues under $10m – mainly consultants, design engineers and senior executives – and 28% of all those surveyed were based in Europe or the UK. Just 7% were from the $1bn-plus club.

Among 2022 respondents who had experienced an outage in the previous three years, just 14% classified one outage as “serious/severe”, versus 18% in the 2019 survey. Many were partial rather than total failures of systems or equipment.

Lawrence points out that avoiding outages – or at least enabling a quick, smooth recovery – means making investments (including in training) in advance. But while power problems continue to be concerning, these are well understood.

“Most of the costs associated with power failures now relate to the restarting of systems and the recovery and synchronisation of data,” he says.

“The complex interconnectedness of modern digital infrastructure can help alleviate big, single-site failures, but newer distributed architectures are subject to failures of their own. Software and configuration errors often reverberate across different sites and services.”

Nitha Puthran, senior vice-president for cloud and infrastructure at Persistent Systems, says higher costs can flow on from increased reliance on digitised systems and applications, including large artificial intelligence and machine learning-enabled data warehouses.

While better backup power systems and effective disaster recovery software and operational plans are handling more outages as they arise, disaster recovery strategies may still not be optimal unless procedures are thoroughly tested to ensure “muscle memory”.

“Many organisations don’t make this a part of their IT strategy – it is still an afterthought,” says Puthran. “They like to spend a lot on the infrastructure that will run the day-to-day business, but less on building that redundancy that they can think of as more of a luxury – especially right in that transformation stage.”

Smaller organisations partnering with large providers should read the fine print, ensuring that they understand availability levels and are not caught short. Planning and budgeting for outages and their recovery are often skimped, says Puthran, especially when resources are tight.
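To make “availability levels” concrete, a quoted percentage can be turned into the downtime it permits over a year. The sketch below is a minimal illustration only: the function name and the percentages shown are assumptions chosen for the example, not figures from any provider or from the people quoted here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime that a given availability percentage permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative levels only; a real contract defines its own figure
    # and its own measurement window.
    for level in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level}% availability allows about {minutes:.1f} minutes of downtime a year")
```

Read this way, the gap between two similar-looking percentages is a matter of hours rather than decimal places, which is the kind of detail that tends to sit in the fine print.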

“And the drills can’t be a check-box any more – they have to be real and conducted in a timely manner, as part of compliance and so on,” she says, adding that effective plans must take in people, process and technology.

“Even if they have designed the solution, or if doing it collaboratively, make sure it goes through a well-documented, well architected process,” she says. “Should something happen, how do we return quicker, with less damage?”

Is the answer education? Perhaps – but that might depend on the tact of an approach when coming from a services provider, Puthran adds.

You are what you eat

Neil Thurston, chief technologist at cloud solutions provider Logicalis, pinpoints “digital spaghetti on top” as a source not only of complexity but of consequent costs, especially as organisations transformed to deal with events like Covid.

Inevitably, aspects of this phenomenon are likely to represent a new normal and can be compounded by ongoing skills shortages. “Within our own customer base, who run their own datacentres, and third parties, and because of the pandemic and the global supply chain, standardisation has gone out of the window,” says Thurston.

People have sometimes bought whatever kit they could to cope with demand in short order. Operators may have physical underlay networks overlaid with virtual software-defined networks, ramping up complexity on the networking side. There can be more working parts to go wrong, too. And networking issues are not always obvious or easily diagnosed, says Thurston.

“We are in a period where datacentre engineers are going to be coming up against equipment they’re not quite used to and things will be different – resulting in lengthier troubleshooting,” he says.

“If it happens in that virtual world, the problem you’ve got is it’s not as easy as losing power to the datacentre. Who is impacted – it’s everyone. You’ve got to get the power back, but it’s a virtual problem, and you’ve just got to keep going until you find it. On the networking side, this is where it gets tricky, because everyone designs a network differently.”

Part of the strategy might include problem and knowledge management investigations that can be applied towards a re-standardisation that favours additional automation. “Engineering efficiency” might shorten outage lifecycles, says Thurston, while artificial intelligence and AIops may help detect and remediate patterns.

John Graham-Cumming, chief technology officer at web security company and Google tech partner Cloudflare, points out that higher-tier datacentres especially can be “incredibly stable” in their power and cooling. Instead, outages often stem from internal change rather than something external: software is constantly evolving, for example, and operators must deal with the inevitable desire to keep making changes.

“What works has been a combination of things,” he says. “You want to find where your system is not resilient. We do a kind of chaos engineering, deliberately breaking things to see what happens.

“For example, take services or machines or networking equipment offline. With large, interconnected complex systems, introducing chaos to shake out the issues can be valuable.”
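The idea of deliberately taking things offline can be sketched against a model of a system before anyone tries it on live equipment. The example below is a simplified simulation, not a description of Cloudflare's tooling: the topology, the machine names and both function names are hypothetical, and it removes one machine at a time in a fixed order rather than at random.

```python
# Hypothetical topology: each service maps to the machines able to serve it.
TOPOLOGY = {
    "dns": {"m1", "m2", "m3"},
    "api": {"m2", "m3"},
    "billing": {"m4"},  # deliberately left as a single point of failure
}


def services_down(topology, offline):
    """Return the services left with no online machine."""
    return sorted(name for name, hosts in topology.items() if not hosts - offline)


def chaos_sweep(topology):
    """Take each machine offline in turn and record which services fail."""
    machines = set().union(*topology.values())
    return {machine: services_down(topology, {machine}) for machine in sorted(machines)}


if __name__ == "__main__":
    for machine, failed in chaos_sweep(TOPOLOGY).items():
        status = ", ".join(failed) if failed else "no service lost"
        print(f"{machine} offline: {status}")
```

In this made-up topology only the loss of `m4` takes a service down, which is exactly the sort of non-resilient spot such an exercise is meant to expose.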

Progressive roll-outs, in which elements in the chain such as software are observed as a change spreads, can help catch impacts as they emerge at a certain scale or number of locations and users, especially in a heterogeneous environment, he says. This can help operators quickly trace the history and track which changes are affecting X or Y as they happen.
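A progressive roll-out of this kind can be sketched as a loop that widens a release stage by stage and stops as soon as an observed signal looks unhealthy. This is a minimal sketch under stated assumptions: the stage fractions, the error-rate threshold and the `error_rate_at` callback are all hypothetical stand-ins for whatever monitoring an operator actually has.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[bool, float]:
    """Widen a release stage by stage, stopping at the first unhealthy stage.

    Returns (completed, last_healthy_fraction).
    """
    last_healthy = 0.0
    for fraction in stages:
        if error_rate_at(fraction) > max_error_rate:
            # Halt here; the caller can roll back to the last healthy fraction.
            return False, last_healthy
        last_healthy = fraction
    return True, last_healthy


if __name__ == "__main__":
    # Simulated problem that only appears once half the estate has the change.
    def simulated_error_rate(fraction: float) -> float:
        return 0.001 if fraction < 0.5 else 0.05

    print(progressive_rollout((0.01, 0.1, 0.5, 1.0), simulated_error_rate))
    # prints (False, 0.1)
```

Because the simulated fault only shows up at scale, the roll-out halts at the 50% stage and reports 10% as the last healthy point, so the change that caused the problem is easy to identify.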

For Graham-Cumming, the usual suspects include redundancy, resiliency, disaster recovery, load balancing and more, yet culture can play a big role in impacts per outage.

He recommends taking a “blame-free” approach that doesn’t waste energy on assigning responsibility for the cause or causes. Instead, focus on everyone pulling together to solve the problem, including the inevitable unknowns, as quickly as possible.

“Anyone should be able to say, ‘hey, I’m observing a problem or a potential problem’ and be able to call an incident right now to get the right people to go there and do that, and have that totally blame-free,” says Graham-Cumming, “especially if the person ‘responsible’ is an individual contributor simply doing their job and trying to achieve something.”

Jake Madders, director of Hyve Managed Hosting, suggests that diversifying suppliers might sometimes help by avoiding total reliance on one player. After all, anyone can have unforeseen problems.

“We’ve seen a trend of that increasing – we reckon related to Covid, because everyone’s remote,” he says, adding that this can make supplier communications trickier at times. Also, a lack of exposure to talk and happenings “across the desks” can reduce an organisation’s ability to keep on top of unexpected events.

“If we have one client, we would put half their stuff in the primary, and then use a separate supply for their disaster recovery, and the same with our ISPs or network providers,” says Madders.

Innovation both solves and adds complexity

One customer, says Madders, is installing its own battery system to sit between its system and Hyve’s racks – a move once unheard of for a tier-three or tier-four datacentre.

With rising costs, cyber security threats and compliance demands, and despite multiple policies and procedures focused on resilience, including on-site fuel, generators and kit, power outages and hardware failures still happen – so why make communications more difficult?

“For some, a 10-minute outage can be disastrous for their business,” says Madders. “A lot can be predicted – but a lot can’t. You can build a strategy in, but again, it can be cost-prohibitive – and everything has weak spots.”

IDC analyst Phil Goodwin, in his firm’s Q1 report The state of ransomware and disaster preparedness 2022 (released in May and sponsored by security vendor Zerto), suggests risks to data integrity and availability may actually have never been higher. Malware, data loss from exfiltration and ransomware are now pervasive, highlighting the need for effective disaster recovery.
