Reducing computer system downtime should be a top priority for every company, since an outage can carry a heavy financial cost - in lost business or SLA penalties - and can also damage your reputation. Understanding the difference between redundancy and resilience, and working with a colocation data centre, can give you a highly cost-effective safety net to protect your system from unexpected downtime.
Resilience and redundancy are often used interchangeably, but without understanding the difference it is difficult to make important decisions about how to protect your system. Your IT is the lifeblood of your company, and failing to protect it will disrupt every element of your business, so it's important to understand exactly what is being put in place.
Redundancy and resilience defined
Redundancy – refers to the level of backup equipment a data centre has to take over when the primary equipment or infrastructure fails.
Resilience – refers to a data centre’s ability to continue operating when there has been equipment failure or anything else disrupting normal operation.
Redundancy and resilience are closely related, but the key difference between them is that redundancy is about how much spare capacity specific equipment has, while resilience is about the data centre as a whole being able to continue operating.
To further complicate the discussion, the more redundancy a data centre has in place, the more resilient it will be. There are, however, other factors that contribute to resiliency, like having staff on site 24 hours a day and preventing equipment failure in the first place.
When talking to a data centre operator, make sure they explain the reasons behind their resiliency claims. If they claim to be highly resilient without explaining their redundancies, you should be suspicious. And if you want to cut to the chase when considering data centres, look straight at the level of redundancy they have for each type of equipment, which will be expressed in terms of N.
Redundancy expressed as N
N – the amount of equipment needed to keep a data centre operating, used as the unit in which redundancy is measured. For example, if a data centre could run on the power of one generator, then for generators N is one.
The thing to remember about redundancies and N is that N will be a different value for each data centre, as it always depends on the data centre’s requirements.
Some common levels of redundancy include:
N = The minimum equipment needed to keep the data centre running
N+1 = The equipment needed to keep the data centre running and one additional piece of equipment
2N = Double the minimum equipment needed to keep the data centre running
The higher the level of redundancy above N, the more resilient a data centre will be, because it increases the amount of equipment that can fail before the data centre has to begin limiting its operations.
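The three levels listed above can be put into numbers with a short sketch. The function names and the example site needing three generators are assumptions made purely for illustration, not figures from any real data centre.

```python
def units_installed(n: int, scheme: str) -> int:
    """Return how many units are installed under a redundancy scheme.

    n is the minimum number of units needed to keep the data centre running.
    scheme is one of "N", "N+1" or "2N".
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


def tolerable_failures(n: int, scheme: str) -> int:
    """Units that can fail before fewer than N units remain."""
    return units_installed(n, scheme) - n


# Hypothetical site that needs 3 generators to run (N = 3).
for scheme in ("N", "N+1", "2N"):
    print(scheme, units_installed(3, scheme), tolerable_failures(3, scheme))
```

For this made-up site, N leaves no room for failure, N+1 tolerates one failed generator, and 2N tolerates three.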
Make sure any data centres you speak to give you the N value of their redundancies for when they are at full capacity, and not just at their current load. Otherwise you might sign a contract having been told they’ll provide 2N redundancy, but a year later – when there are a lot more servers in the data centre – they actually provide less redundancy than that.
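To see why the full-capacity figure matters, here is a rough sketch of how the same installed equipment can rate differently as the load grows. The generator size, the loads and the function names are all hypothetical, and real sizing involves more than a simple division.

```python
import math


def required_units(load_kw: float, unit_capacity_kw: float) -> int:
    """N: the minimum number of units needed to carry the given load."""
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    return math.ceil(load_kw / unit_capacity_kw)


def describe_redundancy(installed: int, load_kw: float, unit_capacity_kw: float) -> str:
    """Express the installed equipment in terms of N at a given load."""
    n = required_units(load_kw, unit_capacity_kw)
    spare = installed - n
    if spare < 0:
        return "below N"
    if spare == 0:
        return "N"
    if installed >= 2 * n:
        return "2N or better"
    return f"N+{spare}"


# Hypothetical site: 4 generators of 500 kW each.
print(describe_redundancy(4, 900, 500))   # at today's assumed load of 900 kW
print(describe_redundancy(4, 1400, 500))  # at an assumed full load of 1,400 kW
```

With these assumed numbers the site is 2N today, but only N+1 once the hall fills up, even though no equipment has been removed.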
Examples of redundancy and resilience within a data centre
Different elements of a data centre require different redundancy equipment, and all of them need to be in place for the data centre to reach an acceptable level of resilience. Each element should meet N at the very least.
A mains power failure is one of the most common reasons behind a computer system failing, so proper power redundancy is absolutely essential to a data centre. Power redundancy has two elements to it: UPSs (Uninterruptible Power Supplies) to keep all of the servers powered when the mains fails, and generators to provide power, for as long as there is fuel, until the mains comes back on.
UPSs are effectively large batteries that sit between the mains supply and the data centre. When there is a power failure, the batteries take over, keeping everything powered until the generators can be turned on. As well as N, it’s important to learn how long a data centre’s UPSs are able to keep the servers powered, and how much generator fuel the data centre keeps on site.
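Both of those questions come down to simple arithmetic, sketched below. Every figure here is invented for illustration, and the calculation is deliberately simplified: it ignores battery ageing, conversion losses and the way generator fuel consumption changes with load, so treat any result as a rough guide only.

```python
def ups_runtime_minutes(usable_battery_kwh: float, load_kw: float) -> float:
    """Rough time the UPS batteries can carry the load, in minutes."""
    if usable_battery_kwh <= 0 or load_kw <= 0:
        raise ValueError("battery capacity and load must be positive")
    return usable_battery_kwh / load_kw * 60


def generator_runtime_hours(fuel_litres: float, burn_litres_per_hour: float) -> float:
    """Rough time the on-site fuel will last, in hours."""
    if fuel_litres < 0 or burn_litres_per_hour <= 0:
        raise ValueError("fuel must be non-negative and burn rate positive")
    return fuel_litres / burn_litres_per_hour


def ups_bridges_generator_start(ups_minutes: float, start_minutes: float) -> bool:
    """True if the batteries outlast the time the generators need to start."""
    return ups_minutes > start_minutes


# Hypothetical figures: 200 kWh of usable battery, 800 kW of load,
# 20,000 litres of fuel burned at 200 litres per hour.
ups_minutes = ups_runtime_minutes(200, 800)
print(ups_minutes)                                  # 15.0
print(generator_runtime_hours(20_000, 200))         # 100.0
print(ups_bridges_generator_start(ups_minutes, 2))  # True
```

The useful comparison is between the UPS runtime and the time the generators take to start, and between the fuel runtime and how quickly the data centre can get fuel delivered.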
Regardless of the cooling method a data centre deploys – whether it’s aircon units, cooling towers, or even immersion cooling – its cooling redundancy can still be expressed in N, which will give you an idea of how resilient its cooling is. Another important factor for resilience is how often something goes wrong with the data centre’s cooling system. If they’re relying on their redundancy regularly, that is a red flag.
A data centre should have multiple lines connecting it to the internet, giving it a level of redundancy greater than N. To truly achieve a good level of resiliency these lines should be geographically diverse, so that if something damages one of them the others won’t be affected because they are in a different location.
Weighing up redundancy and resilience of different data centres
The ultimate thing you need to weigh up is the resilience of the whole data centre. Redundancy is the largest contributor to resiliency, but remember that redundancy in one area won't provide overall resilience. For example, backup generators won't protect your servers from overheating if the cooling system fails.
You will need to work out exactly how you compare the merits of different data centres, but remember not to be blinded by impressive-sounding equipment, since maximum uptime is the only thing that matters.
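One simple way to apply this when comparing sites is to look for the weakest element rather than the most impressive one. The sketch below assumes you have been given, for each system, the number of units installed and the number needed at full capacity. The figures and names are hypothetical, and it only covers redundancy, not the other contributors to resilience such as staffing.

```python
def weakest_system(systems: dict[str, tuple[int, int]]) -> tuple[str, int]:
    """Return (name, spare units) for the system with the least spare capacity.

    systems maps a system name to (units installed, units required for N).
    """
    if not systems:
        raise ValueError("at least one system is required")
    name = min(systems, key=lambda s: systems[s][0] - systems[s][1])
    installed, required = systems[name]
    return name, installed - required


# Hypothetical data centre: strong on power, weak on cooling.
site = {
    "generators": (4, 2),     # 2N
    "ups": (3, 2),            # N+1
    "cooling": (3, 3),        # N, no spare unit
    "network lines": (2, 1),  # 2N
}
print(weakest_system(site))  # ('cooling', 0)
```

In this invented example the 2N generators look impressive, but a single cooling failure would still force the site to limit its operations.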