TEST & MEASUREMENT
By Simon Bearne, Commercial Director, Next Generation Data
www.ngd.co.uk
Data centres are quick to lay out their credentials but can be slow to prove them. Black testing demonstrates that critical infrastructure will work when it's needed.
For cloud providers and their many customers, a robust
and continuously available power supply is amongst the
most important reasons for placing IT equipment in a data
centre. It's puzzling, therefore, that so many data centres repeatedly fail to measure up to such a mission-critical requirement.
Only recently, for example, cloud service providers
and communications companies were hit by yet another
protracted power outage affecting a data centre in London.
It took time for engineers from the National Grid to restore
power, and meanwhile thousands of end users were impacted.
Let’s face it. From time to time there will be Grid
interruptions, but they shouldn’t be allowed to escalate into
noticeable service interruptions for customers. Inevitably, such
incidents create shockwaves among users and cloud service
providers, their shareholders, suppliers, and anyone else
touched by the inconvenience.
The buck stops here
While it’s clear something, someone, or both are at fault, the
buck eventually has to stop at the door of the data centre
provider. Outages are generally caused by a loss of supply somewhere in the power distribution network. This can be triggered by a range of factors, from construction workers accidentally cutting through cables (a common occurrence in metro areas) to equipment failure, adverse weather and, not least, human error.
Mitigating such risks should be easy when choosing a data
centre. Locate your data away from flood plains and ideally
choose a site where power delivery from the utilities won’t
be impaired; this is a critical point. Cloud providers and their customers need to fully appreciate how power is routed to their chosen data centre through the electricity distribution network; in some cases the route is pretty tortuous.
Finding the ideal data centre location that ticks all the
right boxes is often easier said than done, especially in the
traditional data centre heartlands. Certainly, having an N+1
redundancy infrastructure in place is critical to mitigating
outages due to equipment failure.
Simply put, N+1 means more equipment is deployed than is strictly needed, so the system can tolerate the failure of a single component.
The ‘N’ stands for the number of components necessary
to run your system, and the ‘+1’ means there’s additional
capacity should a single component fail. A handful of facilities
go further. NGD, for example, has more than double the
equipment needed to supply contracted power to customers,
split into two power trains on either side of the building, each of which is itself N+1. The two trains are completely separate, with no common points of failure.
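As a rough illustration, the arithmetic behind N+1 and 2N can be sketched in a few lines of code. The module rating, contracted load and helper functions below are invented for the example and are not NGD's actual figures.

```python
import math

def modules_required(load_kva: float, module_kva: float) -> int:
    """N: the minimum number of modules needed to carry the load."""
    return math.ceil(load_kva / module_kva)

def surviving_capacity(deployed: int, module_kva: float, failures: int = 1) -> float:
    """Capacity left after a given number of module failures."""
    return max(deployed - failures, 0) * module_kva

load_kva = 1800    # contracted IT load in kVA (illustrative figure)
module_kva = 500   # rating of a single UPS module in kVA (illustrative figure)

n = modules_required(load_kva, module_kva)   # N = 4 modules to carry the load
deployed = n + 1                             # N+1 = 5 modules installed

# With any single module failed, an N+1 train still carries the full load.
assert surviving_capacity(deployed, module_kva) >= load_kva

# A 2N design duplicates the entire train on an independent power path,
# so either train alone can carry the load with no shared point of failure.
```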
But even with all these precautions, a data centre still isn’t
necessarily 100% ‘outage proof’. All data centre equipment
has an inherent possibility of failure, and while N+1 massively
reduces the risks you can never become complacent. After
all, studies show that a significant proportion of failures are caused by human mismanagement of otherwise functioning equipment. This
puts a huge emphasis on engineers being well trained, and
critically, having the confidence and experience in knowing
when to intervene and when to allow the automated systems
to do their job. They must also be skilled in performing
concurrent maintenance and minimising the time during
which systems are running with limited resilience.
Rigorous testing
Prevention is always better than cure. Far greater emphasis
should be placed on engineers reacting quickly when a
component failure occurs rather than assuming that inbuilt
resilience will solve all problems. This demands high-quality
training for engineering staff, predictive diagnostics,
watertight support contracts and sufficient on-site spares.
However, to be totally confident with data centre critical
infrastructure come hell or high water, it should be rigorously
tested. Not all data centres do this regularly. Some will have
procedures to test their installations but rely on simulating
the total loss of incoming power. But this isn’t completely
foolproof, as the generators remain on standby and the equipment upstream of the UPS systems stays energised. This means
that the cooling system and the lighting remain functioning
during testing.
Absolute proof comes with black testing. It's not for the faint-hearted, and many data centres simply don't do it. Every six months, incoming mains grid power is isolated and, for up to sixteen seconds, the UPS takes the full load while the emergency backup generators kick in. Power is only cut to one side of the infrastructure at a time, and the whole exercise is carried out under strictly controlled conditions.
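The pass criterion for such a test is simple to state: the UPS autonomy must comfortably cover the time the standby generators need to start and pick up the load. A minimal sketch of that check, using invented timings and a hypothetical function name rather than any real facility's figures:

```python
def black_test_ok(ups_autonomy_s: float, gen_start_s: float,
                  transfer_s: float, margin: float = 2.0) -> bool:
    """True if UPS autonomy covers generator start-up with a safety margin.

    ups_autonomy_s -- how long the UPS batteries can hold the full load
    gen_start_s    -- time for the standby generators to start and stabilise
    transfer_s     -- time to transfer the load onto the generators
    margin         -- headroom multiplier applied to the measured gap
    """
    return ups_autonomy_s >= margin * (gen_start_s + transfer_s)

# Example reading from a drill: generators on load 12 s after the grid is
# isolated, transfer completed within a further 4 s (16 s in total), and a
# UPS rated for several minutes at full load.
print(black_test_ok(ups_autonomy_s=300, gen_start_s=12, transfer_s=4))  # True
```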
When it comes to data centre critical power infrastructure,
regular full-scale black testing is the only way to be sure the
systems will function correctly in the event of a real problem.
Hoping for the best in the event of real-life loss of mains
power simply isn’t an option.
Uptime checklist
• Ensure N+1 redundancy at a minimum, but ideally 2N+x redundancy of critical systems, to support separacy, testing and concurrent maintenance
• Improve MTTF (mean time to failure) of backup systems; this delivers significant returns in availability, reliability and overall facility uptime (see the sketch after this list)
• Utilise predictive diagnostics, ensure support contracts are fit for purpose, and hold appropriate spares stock on-site
• Regularly black test UPS and generator backup systems
• Drive a culture of continuous training and practise regularly, so staff are clear on spotting incipient problems and responding to live issues: what to do, and when (and when not) to intervene
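The MTTF point above can be made concrete with the standard steady-state availability formula, which combines MTTF with MTTR (mean time to repair). The figures and helper function below are purely illustrative, not drawn from any real facility.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative figures only: a component that fails on average every
# 50,000 hours and takes 8 hours to repair is up about 99.98% of the time.
print(f"{availability(50_000, 8):.4%}")
```

On this model, halving repair time improves availability by roughly as much as doubling the time between failures, which is why support contracts and on-site spares sit alongside redundancy in the list above.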