A neglected area of the hosting equation for many of the customers I have seen is the need for a Disaster Recovery (DR) or Business Continuity Plan (BCP) in the event the service at the provider goes down. This is usually because, under normal hosting arrangements, the cost of a ‘warm’ standby (a recent version of the software, able to be ‘turned on’ within a short timeframe, and normally needing a full hardware backup) is prohibitive, often equal to the cost of the original service.  ‘Hot’ standby (a fully redundant, real-time replicated service in another location) is extremely expensive, even if used to house development resources, a common scenario to help the ROI.

This leaves the customer quite vulnerable to outages of any sort affecting the service.  And in the event of a true disaster, relying on the Service Provider can leave you at the end of what might be a very long queue for restoration.  Let’s go through a few examples of outages, and what many service providers will do:

  1. Network Outage: Service Provider reaction?  Nothing, really.  If they can’t keep the network going, they have bigger problems than DR.  This happens a lot with single-DC providers, and as a lot of services naturally sit in a single DC, it can easily affect a customer.  Anything from a single dodgy router (affecting others down the line, or forcing ‘flapping’, continual switching between two routers), to Denial of Service, to DNS attacks (more common nowadays), to a full ‘backhoe incident’ could be at play here. In most cases you can only sit and wait, as without a DR environment, bringing the network back up is invariably the quickest way to get things online again.
  2. SAN Outage: These are pretty dire.  Most Service Providers now rely heavily on SANs, which opens them up to wide-scale outages if the SAN has big problems.  And they do; it’s frightening how often SANs have to be patched for software bugs.  SANs are expensive, typically not over-provisioned (as this is even more expensive again), and impossible to replace quickly.  This means most service providers, faced with a large outage, will just have to wait it out.  Unless you’re lucky enough to have DR, or a server with locally attached storage, so will you.
  3. Internal Network Outage: Normally not a problem.  Network equipment is simpler, easier to troubleshoot and fix, and cheaper to replace, so most environments have redundant equipment.  That said, things like Spanning Tree issues can really mess up your day, leaving you with a multi-hour outage.
  4. Server Outage: Once a fairly big issue, but with the prevalence of virtualisation and good backups this usually means an outage of only a few minutes.  The biggest issue is if the service provider doesn’t have adequate monitoring and can’t see there is a problem. Even for servers with locally attached storage, most Service Providers have spare equipment to bring a server back up from backup within a few hours, unless you use custom hardware.
  5. Software/Service Issue: This happens quite a bit, with problems in software upgrades, incompatible versions, inadequate testing, or platforms that are just a bit ‘flaky’. In this case troubleshooting is required, and a ‘warm’ or ‘hot’ backup that you can revert to a previous version on is a really useful tool, and something that in most cases can at least get you working again.

Thankfully, there are a few things you can do to make things easier.

  1. Check that you have easy access to backups should you need them.  If you don’t, seriously consider instituting your own system-image-based and filesystem-based backups (they might be the same, but a system image may be useless on incompatible hardware) so that, if you need to, you can walk to another provider with the backup in hand (see the first sketch after this list).
  2. If you don’t have application-level monitoring in place, invest in one of the many web-based providers who can monitor your site/service 24×7.  Many service providers are incapable of performing application-level monitoring, and it’s not a good idea to have them do it anyway, as you want an external view of the service to make sure things like DNS are working appropriately (see the second sketch after this list).
  3. Most customers cannot afford a ‘warm’ or ‘hot’ backup solution, and access to backups can be problematic in a real outage scenario.  But moving development environments to a cloud provider can bring benefits such as continuous integration, as well as provide a ready-made environment for DR. Typically, customers will have a fully replicated test environment in the cloud. While this may not offer the same level of performance in the event of a true disaster, using these environments provides great test data and a true DR solution, independent of your service provider.  If your provider is itself a cloud provider, consider a separate provider offering similar services, as we have seen cloud providers like AWS and Azure suffer multi-region outages lasting many hours in the past.
  4. Monitor your SLAs.  Most make no provision for ‘Force Majeure’ events, which means that in a true DR scenario for the Service Provider, you are out of luck.  Look at real backup options to cover this situation.
  5. Investigate insurance arrangements.  Whilst not typically useful for covering losses due to outages (that cover is normally pretty expensive), insurance can help with the cost of equipment and of migration to another service provider if needed.  This can make the agonizing decision about when to trigger DR a lot easier to handle, and even if cashflow is a problem, short-term loans can be obtained, or money fast-tracked, if you’ve made prior arrangements.
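
To make the first point concrete, here is a minimal sketch of a filesystem-level backup you could run yourself: it archives an application directory and copies the tarball to a second, provider-independent location. The paths and naming are placeholder assumptions you would replace with your own, and a real setup would also need to cover databases and (where hardware allows) system images.

```python
"""Minimal filesystem-backup sketch: archive an application directory and copy
the archive to a location the hosting provider cannot take down with it.
All paths below are placeholders for your own environment."""
import shutil
import tarfile
from datetime import datetime, timezone
from pathlib import Path

SOURCE_DIR = Path("/var/www/myapp")         # hypothetical application data to protect
LOCAL_STAGING = Path("/backups/staging")    # hypothetical local staging area
OFFSITE_DIR = Path("/mnt/offsite-backups")  # hypothetical mount living outside the provider


def run_backup() -> Path:
    LOCAL_STAGING.mkdir(parents=True, exist_ok=True)
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = LOCAL_STAGING / f"myapp-{stamp}.tar.gz"

    # A plain tarball is portable to any provider, unlike a system image
    # that may be tied to specific hardware or a specific hypervisor.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(SOURCE_DIR, arcname=SOURCE_DIR.name)

    # Copy the archive off the provider's infrastructure.
    offsite_copy = OFFSITE_DIR / archive.name
    shutil.copy2(archive, offsite_copy)
    return offsite_copy


if __name__ == "__main__":
    print(f"Backup written to {run_backup()}")
```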

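For the second point, this is roughly the kind of application-level check an external monitoring service runs on your behalf: resolve the DNS name from outside, then fetch a known page and verify the response. The hostname, URL, and expected text below are placeholders for your own service; in practice you would have a hosted monitoring provider run checks like this from multiple locations, around the clock.

```python
"""Sketch of an external application-level check: DNS resolution plus an HTTP
fetch of a status page. Hostname, URL, and expected text are placeholders."""
import socket
import urllib.request

HOSTNAME = "www.example.com"              # placeholder: your public hostname
CHECK_URL = f"https://{HOSTNAME}/health"  # placeholder: a lightweight status page
EXPECTED_TEXT = "OK"                      # placeholder: text a healthy page returns


def check_service() -> bool:
    try:
        # DNS check first: an external vantage point catches DNS problems
        # that a provider's internal monitoring will never see.
        address = socket.gethostbyname(HOSTNAME)

        # Application check: confirm the service actually answers with the
        # expected content, not just that the host responds to pings.
        with urllib.request.urlopen(CHECK_URL, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            healthy = response.status == 200 and EXPECTED_TEXT in body
    except OSError as exc:  # covers DNS failures, timeouts, and HTTP errors
        print(f"Check failed: {exc}")
        return False

    print(f"{HOSTNAME} resolved to {address}; healthy={healthy}")
    return healthy


if __name__ == "__main__":
    check_service()
```
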
I’ll go into more detail on specific techniques to aid in DR, and on planning for DR and Business Continuity events, in later posts, along with more detail on how to calculate all the elements of DR from the SLAs and availability metrics of your components.