Fruitful Considerations for High Availability Network Designs
It is generally understood that high availability network designs require longer periods of availability or shorter periods of downtime, depending on whether you are a 'glass is half full' or 'glass is half empty' person.
But unfortunately for network engineers and network architects that is as far as the clarity goes. Exactly how long is too long for a highly available network to be down is a matter for individual businesses to decide.
An availability target of 99.9% allows for approximately 8.76 hours per year of downtime. An availability target of 99.99% allows for approximately 53 minutes of unscheduled downtime per year.
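The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function name and the 365-day year are assumptions made for the example, and the 99.999% target is included purely for comparison.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for a few common availability targets.
for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For 99.9% this gives 8.76 hours, and for 99.99% it gives roughly 53 minutes, matching the figures quoted above.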
Even though both designs may be labelled 'High Availability', 8.76 hours of downtime (99.9%) may be acceptable for some businesses and disastrous for others. This is why business requirements play such a large part in defining the term, and why acceptable downtime figures vary between businesses and corporations.
But once an engineer is able to determine exactly how long is too long in terms of downtime for the business, there are other considerations that will affect the robustness of the design.
Redundant telecommunications links and redundant hardware are the core components of a highly available network but what other factors can also influence network restoration time in the case of a failure?
Carrier or ISP Service Level Agreements (SLAs) influence the response times from the telecommunications provider. Any communications established over a third-party carriage supplier should be subject to such an agreement.
The network vendor's hardware warranty replacement agreement determines whether, and how quickly, the vendor will attend the site to replace faulty hardware. This agreement should also feature in high availability calculations.
The reliability of the hardware in the critical path, measured by Mean Time Between Failures (MTBF), affects how often support teams will be called to battle stations. The more reliable the components, the higher the availability will be over time.
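As a rough sketch of how MTBF feeds into an availability figure, the commonly used steady-state relation is availability = MTBF / (MTBF + MTTR), where MTTR (Mean Time To Repair) is a term introduced here for the example and would in practice depend on the SLA and replacement agreements discussed above. The sketch also shows how components in the critical path multiply together and how a redundant path helps, assuming failures are independent. All MTBF and MTTR figures below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts: float) -> float:
    """Availability of a path where every component must be up."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def parallel(*paths: float) -> float:
    """Availability of redundant paths, assuming they fail independently."""
    all_down = 1.0
    for a in paths:
        all_down *= 1 - a
    return 1 - all_down


# Hypothetical figures, for illustration only.
router = availability(mtbf_hours=100_000, mttr_hours=4)
link = availability(mtbf_hours=5_000, mttr_hours=8)

single_path = series(router, link)              # one router and one link
dual_path = parallel(single_path, single_path)  # two identical, independent paths

print(f"single path: {single_path:.6f}")
print(f"dual path:   {dual_path:.9f}")
```

The independence assumption is optimistic: two links that share a duct, a power feed, or a carrier can fail together, so real-world gains from redundancy are usually smaller than this calculation suggests.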
Even routing protocol convergence time can eat significantly into very strict service restoration targets. BGP (Border Gateway Protocol), for example, may take precious minutes to reconverge after a failure, depending on the size of the network.
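One way to bring these factors together is to compare the expected yearly downtime against the budget implied by the availability target. The following is a minimal sketch under stated assumptions: the scenario names, failure rates, and restoration times are all hypothetical, and only failures that actually interrupt service should be counted.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a non-leap year


@dataclass
class FailureScenario:
    name: str
    failures_per_year: float  # expected service-affecting failures per year
    restore_minutes: float    # time to restore service after each failure


def expected_downtime_minutes(scenarios: list[FailureScenario]) -> float:
    """Sum the expected yearly downtime across all scenarios."""
    return sum(s.failures_per_year * s.restore_minutes for s in scenarios)


def budget_minutes(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Hypothetical inputs, for illustration only.
scenarios = [
    FailureScenario("carrier outage (SLA response time)", 0.5, 240),
    FailureScenario("hardware fault (vendor replacement)", 0.1, 480),
    FailureScenario("routing reconvergence after failover", 4, 3),
]

expected = expected_downtime_minutes(scenarios)
for target in (99.9, 99.99):
    verdict = "within" if expected <= budget_minutes(target) else "over"
    print(f"{target}%: expected {expected:.0f} min, "
          f"budget {budget_minutes(target):.1f} min ({verdict} budget)")
```

With these made-up inputs the design fits a 99.9% target but not a 99.99% target, which illustrates why the stricter target tends to demand faster carrier response, faster hardware replacement, and faster convergence all at once.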