If you ask IT infrastructure and application owners what keeps them up at night, there’s a good chance the threat of disasters is high on the list. The reason is simple: organizations have become so dependent on information systems that many cease to function without them.
Highly visible disasters, both natural and man-made, have served as a reminder that risk planning is a vital part of information technology operations. In this article I examine a number of approaches to disaster prevention and recovery planning, ranging from “hope nothing bad happens” to cloud-based DR.
In a previous life (as in a couple of jobs ago) I ran server operations for a medium-sized business. Like many of my peers, I was occasionally pulled into disaster recovery (DR) planning exercises. Although our company recognized the importance of staying in business, we weren’t keen to spend a lot of money on DR preparedness.
As a compromise, we purchased an application that enabled our DR analyst to log the results of numerous business impact analyses (BIAs) from various process owners. As nearly as I can articulate it, the resulting DR plan was this:
- If the building starts shaking, I should grab the big honkin’ DR binder (BHDRB) off my bookshelf and run for the exit.
- Assuming I survive, get myself and the binder to my car and drive the 50 or so miles to the third-party warehouse where we keep our backup tapes. (Do not go home to check on the welfare of loved ones or to grab guns and ammo for the coming anarchy.)
- Somehow convince the person at the warehouse to give me as many boxes of backup tapes as my car can carry.
- Drive with the tapes to an undefined location, where I would reconstitute the business using the magic spells contained in BHDRB.
For companies that cannot live with the half-hearted DR approach just described, there are a number of alternatives. You might opt to work with a traditional availability services partner who can host your DR environment and hold your hand throughout the process.
Another option is to architect your applications to be fault tolerant. Or you might design a solution that incorporates cloud services to enable capacity-on-demand. Let’s look at each of these approaches in turn.
Hosted DR Options
One well-established methodology is to work with a partner that specializes in disaster mitigation. These companies typically offer services with names like business continuity planning, managed backup, managed recovery, and hot-site management.
The partner generally sets up systems at a remote site to be employed if the primary site is brought down. They may also provide personnel, office space, and communications systems for use during a disaster. The two most common tiers of hosted DR are shared and dedicated; a Department of Justice filing describes the shared hotsite model as follows:
“Shared hotsites are fully operational alternate facilities that have vendor-installed computer systems, communications structures, and other resources necessary for a client to recover designated business applications in case its own data center becomes inoperative or inaccessible. Hotsites maintain specific equipment configurations to match the client’s computer operations so that the client can replicate its data center operations with respect to the recovered applications. Clients generally select a hotsite some distance from the data center to be recovered, in order to be able to recover effectively in the event of a regional disaster or blackout. If a client’s primary hotsite is occupied, clients can use an alternate hotsite owned by the vendor. In case of a disaster, a traditional hotsite client will transport its backup tapes from a secure location, generally remote from its data center, to a hotsite and will load the software and transfer the tape data onto the hotsite’s storage systems. The client, with the assistance of the skilled hotsite personnel, can resume the disrupted data applications for a period of generally up to six weeks while the client’s data center is restored or replaced. Hotsites generally are used for critical applications with RTOs of 16 to 96 hours.” (1)
The second approach, architecting applications for multi-site fault tolerance, has some obvious benefits, but it also presents a number of challenges, including:
- The need for a multi-site applications architecture, which likely will involve distributed databases
- A meshed network design involving substantial data transfer and replication
- A traffic management and load balancing front end that directs users to available nodes
- The need to run multiple data centers in different locations, with corresponding latency and service-radius issues
Cloud-Based DR Options
The emergence of public cloud computing has enabled new models for disaster recovery. And while many prospective users still have technical and/or security questions, the potential is compelling.
To illustrate the point, let’s look at the 800-pound gorilla of public cloud, Amazon Web Services. I’ll briefly summarize their offerings; anyone wanting more information should check out their case studies and white papers.
The four DR scenarios Amazon defines are:
- Backup and Restore
Basically, AWS Backup and Restore is an alternative to traditional tape-based backup. [For people like me who have a long-standing loathing of tape, this is a very good thing.] Rather than backing up to disk or tape, Backup and Restore uses an AWS Storage Gateway to move data to Amazon’s S3 storage cloud. All the data storage and catalog management takes place at Amazon, eliminating the need for tape jockeys and offsite tape vaulting. [FTW!]
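The essentials of any backup-and-restore service, cataloging each object and verifying its integrity at restore time, can be sketched in a few lines of Python. This is a toy stand-in for illustration only; the class and key names are invented, and it does not use the actual Storage Gateway or S3 APIs:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest recorded at backup time and checked at restore time."""
    return hashlib.sha256(data).hexdigest()

class BackupCatalog:
    """Hypothetical in-memory stand-in for a cloud backup catalog."""

    def __init__(self):
        self._store = {}  # object key -> (data, checksum), simulating object storage

    def backup(self, key: str, data: bytes) -> None:
        self._store[key] = (data, checksum(data))

    def restore(self, key: str) -> bytes:
        data, expected = self._store[key]
        if checksum(data) != expected:  # detect silent corruption
            raise ValueError(f"corrupt object: {key}")
        return data

catalog = BackupCatalog()
catalog.backup("db/2024-01-01.dump", b"orders,customers,invoices")
restored = catalog.restore("db/2024-01-01.dump")
```

The point of the catalog is that it, rather than an operator with a binder, is the source of truth for what was backed up and where, which is exactly what makes offsite tape vaulting unnecessary.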
- Pilot Light for Simple Recovery into AWS
While Backup and Restore focuses primarily on data, the Pilot Light model adds applications. Like the hosted DR models discussed earlier, Pilot Light involves provisioning of a cloud-based application environment ready to take over when disaster strikes. But unlike a full hosted solution, the customer only provisions the core elements of the application environment. Using a library of pre-configured Amazon Machine Images (AMIs), the user can quickly provision the rest of the application environment when needed.
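The Pilot Light idea, keep only the core tier running and launch everything else from pre-built images at failover time, can be sketched as follows. The tier names and AMI IDs are hypothetical, and a real implementation would launch instances through the AWS APIs rather than return a dictionary:

```python
# Hypothetical image library: only the "pilot light" tier (the database)
# runs continuously; web and app tiers exist only as pre-configured AMIs.
IMAGE_LIBRARY = {"web": "ami-web-v7", "app": "ami-app-v7", "db": "ami-db-v7"}

def plan_failover(running: set) -> dict:
    """Return the tiers (and their images) that must be launched to complete the stack."""
    return {tier: ami for tier, ami in IMAGE_LIBRARY.items() if tier not in running}

# Normal operations: only the database replica runs in the cloud.
# At failover, the plan tells us to launch the web and app tiers.
to_launch = plan_failover(running={"db"})
```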
- Warm Standby Solution
Taking the Pilot Light model one step further, the Warm Standby Solution forms part of an active/passive cluster. Most of the time a minimal amount of capacity will be provisioned with AWS. When needed, though, the system is rapidly scaled up to meet full production demands. As an added bonus, the AWS environment is available for test/dev, QA, and internal use. This pay-as-you-go consumption model potentially offers substantial savings compared with traditional hosting models.
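A back-of-the-envelope comparison shows where the savings come from: most of the time you pay for a small warm fleet instead of a full duplicate site. The instance counts and hourly rate below are invented for illustration and are not AWS pricing:

```python
def monthly_cost(instances: int, hourly_rate: float, hours: int = 730) -> float:
    """Steady-state monthly cost for a fleet billed by the instance-hour."""
    return instances * hourly_rate * hours

# Hypothetical numbers: a 2-instance warm standby vs. a 20-instance duplicate site.
standby = monthly_cost(2, 0.10)
full_duplicate = monthly_cost(20, 0.10)
savings = full_duplicate - standby  # what pay-as-you-go avoids in steady state
```

The standby fleet scales up to full production size only during a disaster (or a test), so the higher cost is incurred only for the hours it is actually needed.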
- Multi-site Solution
Amazon calls their final DR scenario Multi-Site. This is essentially an active/active application cluster with both on-site and cloud-based components. Using weighted DNS load balancing, the user chooses how much application traffic to process in-house and how much to direct to AWS. If disaster strikes, or load spikes, some or all of the load can be pushed to Amazon. The process can be automated using the auto-scaling feature of AWS. The only tricky part is configuring the database and application logic to be cloud-aware.
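Weighted DNS load balancing is, in effect, a weighted random choice over endpoints. A minimal simulation of the idea (this is not the Route 53 API, and the 70/30 split is just an example):

```python
import random

def route(endpoints, weights, rng=random):
    """Return one endpoint, chosen in proportion to its DNS weight."""
    return rng.choices(endpoints, weights=weights, k=1)[0]

endpoints = ["onsite", "cloud"]
normal_weights = [70, 30]    # hypothetical split: 70% in-house, 30% to AWS
failover_weights = [0, 100]  # disaster: push the entire load to the cloud

# During normal operations roughly 70% of lookups resolve on-site;
# with the failover weights, every lookup resolves to the cloud endpoint.
```

In practice the weight change is a DNS record update, and auto scaling absorbs the shifted load on the cloud side.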
While this may sound great, there are a number of questions you should ask any cloud-based DR vendor, including:
- What kind of data center resiliency and security do they provide?
- Do they offer SLAs, and if so are they compatible with your requirements?
- How quickly can they scale up capacity?
- How are customers prioritized if a regional disaster causes demand to outstrip capacity?
Recovery as a Service (RaaS)
The final DR model I’ll cover also extends cloud computing concepts to traditional DR. Unlike the Amazon approach, however, it offers service-based consumption with simplified administration.
To illustrate, I’ll use an offering from nScaled. As illustrated in Figure 2 (below), nScaled provides both an on-site appliance, which acts as a gateway for cloud-based compute and storage, and shared cloud data centers. Data and application synchronization is handled programmatically. The solution provides:
- Local recovery from the appliance when failover isn’t necessary
- Remote recovery from the nScaled data centers when the entire data center needs to be recovered
- Rapid application recovery, supporting Recovery Time Objectives (RTOs) of 15 minutes per server and under 2 hours for a full data center
- Minimal data loss, with Recovery Point Objectives (RPOs) of 15 minutes
- Data replication and de-duplication
Figure 2: nScaled Architecture
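A quick sanity check on those objectives: at 15 minutes per server, recovering servers strictly one after another exhausts the 2-hour site-level RTO at eight servers, so larger fleets imply some degree of parallel restore. The arithmetic below uses the stated objectives; the fleet size is a hypothetical example:

```python
RTO_PER_SERVER_MIN = 15   # stated per-server recovery time objective
FULL_SITE_RTO_MIN = 120   # "under 2 hours" for a full data center

def sequential_recovery_minutes(servers: int) -> int:
    """Worst case: servers recovered one at a time, with no parallelism."""
    return servers * RTO_PER_SERVER_MIN

# Eight sequential recoveries exactly fill the site-level objective;
# a larger fleet must be recovered at least partly in parallel.
assert sequential_recovery_minutes(8) == FULL_SITE_RTO_MIN
```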
Savings with this kind of service-based solution can be significant, both in terms of capital as well as administration and support costs. Of course, it’s still important to do your homework to be sure the solution meets your needs and to test regularly.
Although many companies continue to rely on the ignorance-is-bliss approach to DR, for most businesses the risk from disasters is too great to ignore. The good news is that several models exist to avoid or mitigate the impact of disasters on the business. The ones covered in this article include:
- Traditional hosted DR services and solutions, following either a shared or dedicated hosting model
- Eliminating the need for DR by re-architecting and distributing applications so they are not fully dependent on any one site
- Developing a cloud-based solution that takes advantage of virtualization and the elastic nature of the cloud
- A hybrid cloud solution that makes use of appliances and cloud computing while simplifying administration
With a little work, just about any organization can find a DR option that strikes the right balance between business requirements and cost constraints.
All of the solutions listed in this article, other than the first one, are available to Equinix customers.
For more information or to start the ball rolling, please contact your Equinix representative.
(1) “US Department of Justice vs. SunGard and Comdisco,” October 2001. http://www.justice.gov/atr/cases/f9400/9438.htm