2.8. Hardware Redundancy

Business continuity planning (BCP) lies at the heart of redundancy planning. BCP is a methodology for managing the risk of a partial or complete interruption of critical services. For web applications, it covers keeping the business running through software and hardware malfunctions, attacks, and disasters. At a small scale, most of the associated jargon can be ignored; in essence, BCP means having a solid plan for disaster recovery.

The various levels of BCP correspond to the various grades of catastrophe that could occur. Being prepared for a single hard disk failure is very basic, while redundant networking equipment falls into a middle tier. At the highest level of BCP compliance, a business will host critical applications in multiple data centers (DCs) on multiple continents. This reduces latency for international users, but more importantly, the service can keep operating even if an entire DC is lost, and such losses do happen from time to time.

For applications where dual-DC failover is out of the question, a reasonable level of redundancy is to keep at least one spare of everything, or more where necessary (one spare disk for a platform with more than one hundred disks in use, for instance, is woefully inadequate). It's also very important to bear in mind that absolutely anything can fail, and eventually, everything will fail. This includes the usual suspects, such as hard disks, all the way through to components usually thought of as immutable: power cables, network cables, network switches, power supplies, processors, RAM, routers, and even rack posts. Anything at all.
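To see why a single spare stops being enough as a fleet grows, it helps to do the arithmetic. The following is a minimal Python sketch, not a standard sizing method: it assumes failures are independent and Poisson-distributed, and the annual failure rate, restock time, and confidence level used below are illustrative assumptions rather than measured figures. The function names `poisson_cdf` and `spares_needed` are hypothetical.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """Return P(X <= k) for a Poisson random variable with mean lam."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        total += term
    return total


def spares_needed(units: int, annual_failure_rate: float,
                  restock_days: float, confidence: float = 0.99) -> int:
    """Smallest spare count covering every failure expected during one
    restock window with the given probability (assumed Poisson model).

    Never returns fewer than one, matching the "at least one spare of
    everything" baseline.
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Expected number of failures across the fleet while replacements ship.
    lam = units * annual_failure_rate * restock_days / 365.0
    spares = 0
    while poisson_cdf(spares, lam) < confidence:
        spares += 1
    return max(1, spares)
```

Under these assumptions (a 5% annual failure rate and a 30-day restock window), `spares_needed(100, 0.05, 30)` returns 2, and the required count keeps climbing as the number of units or the restock delay grows. Real failure rates vary by component and batch, so the inputs should come from your own hardware history.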

We'll be talking more about redundancy from a design point of view, rather than just in terms of raw hardware, when we cover scaling in Chapter 9.