Recently I’ve been involved in a lot of DR work, including reviewing best practice and watching how businesses actually go about implementing DR plans.
I’ve come to the realisation that DR, at least in the IT realm, is out of date in much the same way development was before the move to Agile, and that the emphasis required by standards such as ISO 27001 et al. is misplaced.
In practice, many IT systems are built with a range of different fallback mechanisms, and in many cases IT have documented, or at least generally know how to implement, failover to a separate system.
So Disaster Recovery Planning or Business Continuity Planning will pretty much always pass a general muster – unless your IT area is just too busy to do it, or very immature.
But things like flooding taking out most of the CBD can make DR plans in an environment like Manila pretty useless. How do you do DR if staff can’t make it to the DR site, or the DR site is flooded too, or the flooding is so bad you couldn’t run things from home – or their homes are flooded as well? One of the truisms of DR planning is that it’s the unplanned things that get you – yet how many people were asked to write Avian Flu DR plans for a scenario that was very, very unlikely?
In the end, it’s your ‘Black Swans’, or cascading failures, that will have the biggest impact – and most organisations simply don’t have the resources to do any real prevention of these items, because they are too rare, and generally too costly, to remediate, if you can at all!
In addition, DR planning is often done by external consultants or centralised personnel without much deep knowledge of the systems involved. As such, DR planning, like most planning, should be treated as a ‘best guess’ of what people think is most important at the time, together with the remediation steps. In many cases it isn’t even that – very few plans I’ve seen quantify the money required, whether the business will have appropriate cash flow, the resources available, or the backup facilities/space to enact a proper DRP. In particular, issues with SANs, reconstructing from backups, and data centre space are big problems that are not normally considered.
So what are the things that work to help you in DR?
1. Full backup environments, in a fully separate location – like development environments that can be repurposed.
2. Testing. A lot of plans are developed with very little real rigour, and only by working through scenarios can you legitimately tease out how and why things will work.
3. Awareness amongst employees that DR needs to be considered, so new systems have it baked in, and manual workarounds are available.
4. Consideration of any unique hardware or software that may be difficult to obtain quickly – can you run without it?
5. How long does an outage have to run before enacting a full-scale DR solution becomes financially viable? This is a number you should have in your head, in conjunction with customer and reputational impacts, so that decision makers can act quickly (a rough sketch of the calculation follows this list).
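To make point 5 concrete, here’s a minimal sketch of how you might frame that break-even number. It’s my own illustration rather than anything from a real plan: the dollar figures, the cutover time and the function names are all invented placeholders, and real numbers would need to fold in the customer and reputation impacts mentioned above.

# Break-even sketch for invoking full-scale DR (all figures are hypothetical placeholders).
HOURLY_DOWNTIME_COST = 20_000   # assumed: lost revenue + lost productivity per hour
DR_ENACTMENT_COST = 150_000     # assumed: vendor hardware, DC space, overtime to stand up DR
DR_CUTOVER_HOURS = 6            # assumed: downtime you still wear while cutting over to DR

def cost_of_waiting(outage_hours: float) -> float:
    """Cost of simply riding the incident out on the primary site."""
    return outage_hours * HOURLY_DOWNTIME_COST

def cost_of_invoking_dr(outage_hours: float) -> float:
    """Fixed cost of enacting DR, plus the downtime suffered during cutover."""
    downtime = min(outage_hours, DR_CUTOVER_HOURS)
    return DR_ENACTMENT_COST + downtime * HOURLY_DOWNTIME_COST

def break_even_hours() -> float:
    """Outage length beyond which invoking DR becomes the cheaper option."""
    return DR_CUTOVER_HOURS + DR_ENACTMENT_COST / HOURLY_DOWNTIME_COST

if __name__ == "__main__":
    print(f"Invoke DR if the outage looks likely to exceed ~{break_even_hours():.1f} hours")
    for hours in (4, 12, 24, 72):
        print(f"{hours}h outage: wait = {cost_of_waiting(hours):,.0f}, "
              f"invoke DR = {cost_of_invoking_dr(hours):,.0f}")

With these made-up numbers the break-even point is about 13.5 hours; the real value of doing the arithmetic in advance is that decision makers aren’t trying to derive it in the middle of an incident.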
What things break in practice?
1. Access to locations isn’t available out of hours.
2. Key systems or physical items (laptops, keys, keycards etc.) are lost, their whereabouts unknown, or made unavailable as part of the incident.
3. Failover systems don’t. Or even harder to detect, can’t fail back.
4. Firewalls and access controls aren’t configured for the backup site/environment.
5. Agreements for space in DCs (or space that is ‘reserved’) and backup hardware from vendors turn out to be much more expensive than planned when needed in a hurry. In some cases, things like connecting fibre runs or adding power can be extremely difficult to get done quickly if you don’t have vendors willing to work with you.
So why test?
IT infrastructure has incredibly short lifecycles in general terms. On top of that, things like evacuation plans and access to external resources (office space, data centre space) change quite often, and in my experience people are very rarely notified.
You may be lucky enough to have a full-time DR support person who can keep track of everything, but very few organisations do. Typically the role is split amongst incident managers (if you have them) and line management, or perhaps delegated to an area like security. In any case, it isn’t that person’s primary role, which increases the chance that the reality of the environment differs significantly from the ideal of the plan.
But testing finds the flaws. In Agile terms, it delivers the concrete benefits of DRP sooner, whereas planning is analogous to the ‘requirements’ stage in traditional development: it takes a long time, you don’t have certainty about the result, and it doesn’t provide timely feedback (most of the time). Testing lets you identify problems and immediately verify the solutions, as well as capture point-in-time costs. It also helps weed out what the business says it does versus what it actually does, and where configuration changes have been implemented but not documented. In some cases it may provide feedback significant enough that you change the core system to be better prepared for a disaster, such as a new environment or additional manual workaround processes.
So start with a basic plan, but make it better with each test iteration. You’ll find you save time in planning and gain more benefit from the results. Much like a real incident, it also gives you concrete evidence of why you’re doing DRP in the first place!