One of the areas that is, to my mind, still a barrier to companies moving their equipment to external hosting is the issue of SLAs and contracts, and how well protected customers are in the event of an issue or extended outage. This is particularly relevant to Disaster Recovery Planning, something I plan to write more blog posts about in the future.
Unfortunately, due to a number of ‘bad actors’ on both the customer and service provider sides, a customer who hasn’t ‘lawyered up’ is unlikely to receive the service they think they are paying for, and this normally comes to light when people are screaming about why the server isn’t up.
There are a few different causes for this:

  1. Customer Expectations
    Customers rarely check the fine print in their contract, and many customers (with not completely unjustified expectations) think if they provide money for a service, it should be up 100% of the time.  Unfortunately, this is wrong.
    The reality of most hosting companies, particularly in the ‘Shared Hosting’ space, is that they have razor-thin margins and normally operate with what many organisations would consider a skeleton crew. The amazing thing is the actual service many of these companies end up delivering! However, advanced troubleshooting and proactive maintenance are normally hard for the smaller players, and can be hard for the bigger ones too, so ‘you get what you pay for’ is apt in Web Hosting. You also need to be careful who you spend your money with: some expensive plans reside on the same equipment as the smaller ones, just with bigger quotas.
  2. Service Providers can outright lie about SLAs
    A number of players in the hosting industry offer SLAs that have no engineering basis, and are really calculated gambles on their part, e.g. ‘the benefit of the marketing dollars vs the payout or upset customers when we have an outage.’ This is crazy, but looking closely at the SLAs and contracts you will normally find they limit liability in a number of ways, such as…
  3. Limiting Liability for Payouts and Damages
    On the service provider side of the fence, limiting liability is important, as otherwise a single customer’s claim could seriously damage the company. However, by carefully wording the timeframe over which availability is measured and the definition of ‘service’ (such as just responding to pings), and by limiting the payout to a credit, to a single month’s service fees, or to something that isn’t cash at all, the service provider makes it very difficult to obtain any sort of penalty in the event of an outage. This makes the SLA virtually worthless.
  4. Customer Expects Disaster Recovery, but doesn’t pay for it
    This is typical of large corporates, some of whom are aggressive in enforcing contracts but, when the time comes to actually pay for a proper, redundant environment in another Data Centre, decide not to for financial reasons. Unfortunately, that decision normally hasn’t filtered up to the CEO by the time an outage occurs, which can lead to uncomfortable conversations in the event of a severe outage.
    I know virtually all providers will work extremely hard to recover in the event of an outage, but razor-thin margins mean that many providers simply don’t have the resources to bring up large portions of the customer base on redundant systems, if they even have them. So customers are left waiting until the original servers are brought back up again, if they come back up at all.
  5. Excluding Force Majeure
    Note that most hosting providers exclude ‘Force Majeure’ from their SLAs. Depending on how this is defined, it can be limited to floods, riots etc, but it can also include any weather event, even seemingly minor ones. I’ve seen contracts worded such that a power outage could be considered Force Majeure, something completely unacceptable in the modern age of backup generators, redundant feeds, and modern Data Centres.
  6. Customer expects reliability from low-cost or no-cost services
    Unfortunately, a customer’s expectation of reliability is normally set by the historical performance of their services, not by the price they actually pay. This can be a problem for low-cost or no-cost services that small businesses in particular rely on far more than they should. I have experienced, and heard stories of, customers with no-cost services like email threatening legal action when the service has gone down. Whilst the service provider is unlikely to be liable in these situations (but IANAL, so they certainly could be in some circumstances if they have made representations to the customer), defending themselves against lawsuits is a costly affair.
  7. Extra Credit: If your service provider suffers a complete outage and fails, or becomes bankrupt, what would you do?
    In Australia, we have the somewhat unique example of Distribute.IT, a service provider that went completely out of business after their servers were hacked and their drives wiped. Their backup was a live replication of those drives, leaving them with no way to recover.
    In my experience, the ability of a customer to replicate their environment quickly is extremely rare, with many customers, including major corporates, left waiting until the service provider comes back online. However, the reality is that in the event of a major (such as multi-week) outage, many companies would not survive financially due to the cash-flow impact, reputation damage, loss of customer base, and potential financial penalties involved. Without a proper backup, companies can be, and have been, left with no recourse but to start again with their web presence. When the web presence can include customer data, financial data, and mission-critical information, this isn’t acceptable.
  8. Extra Extra Credit: Are your cloud services more, or less reliable than physical hosting?
    This one is a direct side-effect of the marketing (some would call it good marketing) by the cloud providers about the resiliency of their systems. It is true that most cloud providers’ systems have enviable reliability, but AWS in particular had a terrible SLA (which has recently changed for the better) and uses all the tricks in the book to avoid giving customers their money back. One of the core things about AWS in general is that the ‘server’ you bought has no guarantee to exist at any point in time, and while this has lessened now, those who built a cloud service on a single server were often surprised when the server was torn down with no notice. It’s great that with AWS, DR environments and scaling can happen almost for free, but in essence you need to build around failure from Day 1 to exist in the cloud. To date, I haven’t seen many companies build this into their architectures. It’s also important to note that SLAs for cloud typically cover the service, or even the cloud as a whole (not your particular environment), making it even harder to claim for an outage. Defining what you actually mean by service reliability is very important here, and is one of the many reasons full production environments are only gradually moving into the cloud, while DR and development environments are stampeding.
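
To put rough numbers on the liability limits described in item 3, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the 30-day measurement period, the availability targets, the $50 monthly fee, the 10% credit, and the function names `allowed_downtime_minutes` and `service_credit` are hypothetical and not taken from any real provider's contract.

```python
# Illustrative only: all figures and names here are assumptions,
# not terms from any real hosting contract.

def allowed_downtime_minutes(availability_pct: float, days_in_period: int = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def service_credit(monthly_fee: float, credit_pct: float, cap_pct: float = 100.0) -> float:
    """Credit for an SLA breach, capped at a percentage of one month's fee."""
    uncapped = monthly_fee * credit_pct / 100
    cap = monthly_fee * cap_pct / 100
    return min(uncapped, cap)


if __name__ == "__main__":
    # Hypothetical availability targets measured over a 30-day month.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime")

    # Hypothetical plan: $50/month with a 10% credit for a breach.
    print(f"Credit owed: ${service_credit(50.0, 10.0):.2f}")
```

Under these assumptions, a 99.9% target still permits roughly 43 minutes of downtime in a 30-day month, and the credit for a breach on a $50 plan is $5. However large the customer's actual losses, the payout in this sketch can never exceed one month's fee, which is the point made above about the SLA being virtually worthless.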

In a later post I’ll describe in more detail the things to consider in purchasing a hosting solution, including cloud considerations, and DR/backup selection.