Disaster recovery is becoming top of mind for many CIOs. Understanding the success criteria to make the disaster recovery journey of your own organization smooth and successful is critical, but the path to getting there can be difficult.
Follow the ten key steps below to guide you on the right path to success.
- Understand why disaster recovery is important to your business, and what your specific disaster recovery requirements are.
The first key step is understanding why you are looking for a disaster recovery solution for your business and what your requirements are, both from a disaster recovery perspective and for the solution itself. Running a Business Impact Analysis (BIA) will help you assess the impact of a disruption to your business and expose the effect of such a disruption on your reputation, including the effect of any loss of data or loss of staff. The BIA is the foundation of your disaster recovery planning, and knowing the business impact of outages is probably the most important aspect in answering the “why” question. Knowing the business impact will not only drive the Service Level Agreements (SLAs) for your business processes; it will also help shape the disaster recovery plan to minimize prolonged outages that could result from human error during the recovery process. If these aspects are missing or haven’t been thought through yet, then running a Business Impact Analysis should be the first thing that you do, and it will put you in good stead as you move forward.
An additional aspect of the disaster recovery process is to understand your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). From an SLA perspective, think about the amount of downtime and data loss your business can tolerate. Zero data loss is obviously ideal, but this can sharply drive up the cost of the solution. Setting a limit on the data loss each business service can incur is more realistic. The downtime and data loss windows translate to your RTO and RPO respectively.
Additionally, does your business require adherence to any regulatory compliance or operating rules? For example, do you need to provide proof of a quarterly or yearly disaster recovery test? Disaster recovery testing is important, and there are a lot of factors to take into consideration here. What kind of replication technology would you choose: expensive hardware-based replication, host-based replication, or even replication to the cloud? What you choose depends on various factors, including cost, business policies, SLA requirements and, importantly, environmental factors. For instance, if your data center is located in an area prone to flooding, then your disaster recovery location needs to be in a separate geographic area or even in the cloud.
- Should you build your own or buy off the shelf?
The next step is driven by how much you want to invest, either operationally or in capital expenditure. You have probably already invested quite heavily in infrastructure at your primary data center location: things such as server hardware, virtualization technologies and storage. You could take a simple approach and invest in another physical data center for disaster recovery, but this would mean paying not only double the software and hardware infrastructure costs but also additional physical location costs. A more savvy approach would be to use a vendor to supply disaster recovery services at a fraction of the cost of running dual locations. Keep in mind that choosing the right vendor is important too. You will want to look for a leader in the managed disaster recovery services space that has years of credible experience.
- Understand the difference between disaster recovery as-a-service and backup and recovery as-a-service.
Understand that disaster recovery and backup are different ball games. While backup is a necessary part of a business continuity strategy, it lends itself to SLAs of hours to days. Disaster recovery, on the other hand, is better suited to SLA requirements of minutes to hours. Based on the uptime and data loss requirements specific to each business service, you would deploy a disaster recovery solution for your business-critical applications, while backup would be sufficient for non-critical business services that can tolerate some downtime. Choose a disaster recovery as-a-service solution that can protect your entire estate, or at least the critical elements of it that drive your business. This includes physical and virtual systems, as well as the mix of different operating systems that enterprise businesses typically run today. The disaster recovery as-a-service solution that you choose should also let you run your systems within the provider's cloud location for a period of time, until you can get your infrastructure back up and running and transfer services back to your primary site.
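The split between disaster recovery and backup can be sketched as a simple classification over each service's RTO and RPO. This is only an illustration: the `BusinessService` type, the service names and the 4-hour cut-off are assumptions made for the example, since the guidance above gives ranges ("minutes to hours" versus "hours to days") rather than a fixed boundary.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed cut-off for illustration only; pick a boundary that matches
# the SLAs coming out of your own Business Impact Analysis.
DR_THRESHOLD = timedelta(hours=4)


@dataclass
class BusinessService:
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss window


def protection_tier(service: BusinessService,
                    threshold: timedelta = DR_THRESHOLD) -> str:
    """Return "disaster-recovery" if either objective is at or below the threshold."""
    if service.rto <= threshold or service.rpo <= threshold:
        return "disaster-recovery"
    return "backup"


# Hypothetical services used purely as sample input.
services = [
    BusinessService("order-processing", rto=timedelta(hours=1), rpo=timedelta(minutes=15)),
    BusinessService("internal-wiki", rto=timedelta(days=2), rpo=timedelta(hours=24)),
]
tiers = {s.name: protection_tier(s) for s in services}
```

Here the order-processing service lands in the disaster recovery tier and the wiki in the backup tier; in practice the threshold would likely differ per business and may need more than two tiers.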
- Choose the right Cloud Hypervisor.
It may seem like an easy decision to make: you would seek a vendor that runs the same hypervisor on the back end as you run on your primary site. Keep in mind, though, that this is not a necessity. If you are using VMware vSphere or Microsoft Hyper-V, then running these types of hypervisors in the cloud is going to incur some additional licensing costs in a DR solution. Another thing to think about is whether you really need all the bells and whistles when you’ve invoked disaster recovery. Most of your time is going to be taken up with getting services up and running back at your own location as quickly as possible, so maybe not. What you basically need is a hypervisor to host your systems that provides the basic performance, scale and resilience you require. A more cost-efficient stance would be to use a KVM-based hypervisor running within OpenStack. This ticks the boxes in terms of enterprise readiness and, best of all, the service costs should yield a better ROI than services running proprietary hypervisor technologies, which can save your business considerable money.
- Plan for all business services that need to be protected, including multi-tier services
Now we’re getting down to the nitty-gritty details. The business services that need to be protected will be primarily driven by the SLAs that brought you down this path. Make sure that you capture all operating system types that these business services run on, and think about how you will handle any physical systems that have not yet been virtualized. Moving virtualized applications to the cloud is an easy process, as these are already encapsulated by the hypervisor in use. Purely physical business applications are another matter altogether. It is not impossible to move physical application data to the cloud, but if the service you select cannot fail physical systems back to your own site, then you are a sitting duck. This is especially important to keep in mind in the case where a complete outage has occurred and a rebuild is needed. Another thing to think about is what happens when your business services or applications are started in the cloud. If a business service is made up of different processes, such as a multi-tier application, can you start or stop these systems in a particular order, and can you inject manual steps into your failover plan if required? Controlling multi-tier business applications that span multiple systems is going to be a high priority, not only while invoking disaster recovery but also when you’re performing a disaster recovery test.
- Plan for your RTOs, RPOs, bandwidth, latency and IOPS
Understanding your Recovery Point Objective (RPO) and Recovery Time Objective (RTO), the IO load of your virtual machines, and the peaky nature of writes through the business day will help you determine your required WAN bandwidth. Find out whether your disaster recovery service vendor can guarantee these RTOs and RPOs, because every additional minute or hour that your business is down, as defined by the Business Impact Analysis, is going to cost you. If you aim for an RPO of 15 minutes or less, then your bandwidth to the cloud needs to be big enough to cope with extended periods of heavy IO within your systems. If your RTO is something like 4 hours, then you need to know whether your systems can recover within that time period, keeping in mind that other operations also need to be managed, such as DNS and AD/LDAP updates and any additional infrastructure services that your business needs.
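As an illustration of ordered start-up for a multi-tier service, the sketch below derives a start order from declared dependencies and interleaves manual checkpoints. The tier names, the `manual_steps` mapping and the helper functions are hypothetical; a real disaster recovery service would expose its own way of defining such a plan. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical three-tier service: each key lists the systems it depends on.
dependencies = {
    "database": set(),
    "app-server": {"database"},
    "web-frontend": {"app-server"},
}


def start_order(deps):
    """Systems ordered so that every dependency starts before its dependents."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps):
    """Shut down in the reverse of the start order."""
    return list(reversed(start_order(deps)))


def failover_plan(deps, manual_steps):
    """Build a step list, adding a manual checkpoint after selected systems."""
    plan = []
    for system in start_order(deps):
        plan.append(f"start {system}")
        if system in manual_steps:
            plan.append(f"MANUAL: {manual_steps[system]}")
    return plan


plan = failover_plan(dependencies, {"database": "verify data consistency"})
```

The same plan can be reused for a rehearsal, which is one reason to keep the ordering in data rather than in a runbook that is followed by hand.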
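A rough way to relate RPO to WAN bandwidth is to ask how much data changes during the busiest RPO-length window and how fast the link must be to ship it within one window. The sketch below makes that arithmetic explicit. The function names and the 25% headroom factor are assumptions for illustration, and the estimate ignores compression, deduplication and latency effects, all of which a vendor's sizing tool would normally account for.

```python
def required_bandwidth_mbps(peak_change_gb, rpo_minutes, headroom=1.25):
    """Mbit/s needed to replicate `peak_change_gb` within one RPO window.

    peak_change_gb: data written during the busiest RPO-length window (decimal GB)
    headroom: assumed safety factor for protocol overhead and bursts
    """
    megabits = peak_change_gb * 8 * 1000  # 1 GB = 8,000 megabits
    return megabits / (rpo_minutes * 60) * headroom


def link_meets_rpo(link_mbps, peak_change_gb, rpo_minutes):
    """True if the link is at least as fast as the estimated requirement."""
    return link_mbps >= required_bandwidth_mbps(peak_change_gb, rpo_minutes)


needed = required_bandwidth_mbps(peak_change_gb=9, rpo_minutes=15)
```

With 9 GB of changes in the busiest 15 minutes, the raw requirement is 80 Mbit/s, or 100 Mbit/s once the assumed headroom is applied. Measuring the real peak change rate over several business days is what makes the estimate meaningful.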
- Avoid vendor lock-in while moving data to the cloud
Understanding how your data will be sent to the cloud provider site is important. A solution that employs VMware vSphere on-premises and in the cloud limits you to a replication solution that works only for virtualized systems, with no option for protecting physical systems. This may seem acceptable at the time, but you will be locked into this solution, and switching DR providers in the future may be difficult. Seeking a solution that can protect all major virtualization platforms as well as physical systems gives you the flexibility of choice for the future.
- Run successful disaster recovery rehearsals without unexpected costs
Rehearsals or exercises are probably the most important aspect of any disaster recovery solution. Not having an automated disaster recovery rehearsal process that you test on a regular basis can leave your business vulnerable. Your recovery rehearsals should not affect your running production environment. Any rehearsal system should run in parallel, albeit within a separate VLAN, but still have some type of access to infrastructure services such as AD, LDAP and DNS so that full disaster recovery testing can be carried out. Once testing is complete, it is essential that the solution provide an easy way to remove and clean up the rehearsal systems.
- How long can you stay in the cloud?
For a moment let’s imagine that the unthinkable has happened, and you have invoked disaster recovery to your cloud service provider. The nature of the outage at your primary location will dictate the length of time you will need to keep your business applications running on your service provider’s infrastructure. It is imperative that you are aware of any clauses within your contract that pertain to the length of time you can keep your business running on the cloud provider’s site. There is also a big pull to get enterprises to think about running in the cloud and staying there, but this is a big decision to make. Performance of the systems is going to be one metric to evaluate, as is performance of storage, or more precisely the quality of service of the storage that the cloud vendor will provide. On the whole, it makes sense to get back into your own infrastructure as quickly as possible, since it is custom built to support your business.
- How easy is it to failback business services to your own site?
Getting your data back, or reversing the replication data path, is going to be important, especially as you don’t want to affect your running systems within the cloud by injecting more downtime. Rebuilding your infrastructure is one aspect that needs to be meticulously planned. Any assistance that the solution itself can provide to make this process smoother is a bonus. Your on-premises location is going to need a full re-sync of data from the cloud location, which may take some time, so the solution should be able to handle a two-step approach to failback: the re-sync should happen in one operation and, once it is complete, the switch back to your own systems can be done at a time that suits your business.
Success! You’re now armed to create a robust business continuity plan.
Follow the steps above to gain an understanding of what’s needed to be successful on your disaster recovery as-a-service journey, and use them as checkpoints while developing your own robust business continuity plan.
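To get a feel for how long the full re-sync step might take, and therefore how far ahead the final switch-back can be scheduled, a back-of-the-envelope estimate is often enough. The sketch below is one such estimate. The `resync_hours` function and the 80% link efficiency are assumptions for illustration, and it does not account for data that keeps changing in the cloud while the re-sync is running.

```python
def resync_hours(data_tb, link_mbps, efficiency=0.8):
    """Estimated hours to copy `data_tb` (decimal TB) over a link of `link_mbps`.

    efficiency: assumed fraction of the nominal link speed that is usable
    """
    megabits = data_tb * 8 * 1_000_000  # 1 TB = 8,000,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


estimate = resync_hours(data_tb=10, link_mbps=1000)
```

For 10 TB over a 1 Gbit/s link at the assumed efficiency, this comes to roughly 28 hours, which is why separating the re-sync from the actual switch-back matters: the long copy can run in the background while services stay up in the cloud.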