Disaster Recovery and High Availability in the Cloud While Controlling Costs

Bill Young


High availability and disaster recovery in the cloud have always been a core focus of infrastructure design. Ensuring uninterrupted access to your application during site outages plays an essential role in selling business-critical products to customers. It also comes with significant roadblocks. Among the most common are application design and how well the solution suits web-scale deployments. But the biggest roadblock to implementing a robust availability solution is cost.

Traditional designs usually address availability by ensuring that there are at least two of everything, to protect against the failure of any one resource. However, if you add a second site that must meet the same business continuity planning requirements as the primary, then two of everything turns into four of everything. And if your application stack is complex enough to require many tiers of dedicated servers or shared storage, then four can quickly turn into racks upon racks of costly hardware and software.

While no one in the space would question the need for redundancy, expensive solutions just drive up the cost of services delivered and erode the profitability of an organization.

For most SaaS applications with a stateful database backend – meaning one that tracks what occurred when the application ran previously – a traditional high availability/disaster recovery (HA/DR) solution is built around the following pillars:

  1. A remote DR site, separate from the primary hosting site and outside of the “blast radius” of the original site, in case there’s an environmental issue. The site needs to be populated with sufficient network, compute and storage hardware to handle the required workload.
  2. A stable, secure network link with sufficient speed and reliability to support the application’s recovery point objective and recovery time objective (RPO/RTO). The RPO is the maximum age of the data that must be recovered for normal operations to resume; the RTO is the maximum length of time the application can be down before the business is unacceptably impacted.
  3. A database running in a local cluster, so patching and maintenance can be performed without downtime.
  4. Either a platform that handles server failover, or identically configured servers at each site running in an active/active arrangement.
  5. If the application depends on file or object data, a storage platform that supports replication.
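The RPO from the second pillar is the kind of target that replication monitoring is built around: if the standby site's copy of the data is older than the RPO allows, the solution is out of compliance. A minimal sketch of that check (the `check_rpo_compliance` helper and the timestamps are illustrative assumptions, not part of any particular product) might look like:

```python
from datetime import datetime, timedelta

def check_rpo_compliance(last_replicated_at: datetime,
                         rpo: timedelta,
                         now: datetime) -> bool:
    """Return True if replication lag to the DR site is within the RPO target.

    In a real deployment, last_replicated_at would be read from the standby
    database (e.g., the timestamp of the last applied transaction).
    """
    lag = now - last_replicated_at
    return lag <= rpo

now = datetime(2024, 1, 1, 12, 0, 0)
# Standby is 4 minutes behind; a 5-minute RPO is still met.
print(check_rpo_compliance(now - timedelta(minutes=4),
                           timedelta(minutes=5), now))   # True
# A 15-minute lag breaches that same 5-minute RPO.
print(check_rpo_compliance(now - timedelta(minutes=15),
                           timedelta(minutes=5), now))   # False
```

Wiring a check like this into alerting is what turns an RPO from a number in a planning document into something the operations team can actually hold the environment to.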

Anyone who has priced out any of the items on the above list understands that a seven-figure budget would be needed to implement and support that environment for a 3-year deployment cycle. Secure colocation space, enterprise network connectivity, hardware, and enterprise software licensing are all expensive. And because HA/DR is ultimately just risk mitigation, it is often hard for organizations to make a business case for the added expense because the return on investment (ROI) just isn’t there.
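As a rough illustration of where that seven-figure estimate comes from, the arithmetic can be sketched as one-time hardware spend plus three years of recurring costs. Every figure below is a hypothetical placeholder for illustration only, not a quote from any vendor:

```python
# Back-of-envelope cost model for a fully duplicated HA/DR environment
# over a 3-year deployment cycle. All figures are hypothetical.
ANNUAL_COSTS = {
    "colocation_space": 120_000,     # secure colo space at both sites
    "network_connectivity": 60_000,  # enterprise inter-site links
    "software_licensing": 150_000,   # databases, replication, OS licenses
}
HARDWARE_CAPEX = 400_000             # servers, storage, network gear

def three_year_total(annual_costs: dict, capex: int, years: int = 3) -> int:
    """Total cost of ownership: one-time hardware plus recurring costs."""
    return capex + years * sum(annual_costs.values())

print(f"${three_year_total(ANNUAL_COSTS, HARDWARE_CAPEX):,}")  # $1,390,000
```

Even with conservative placeholder numbers like these, the total lands comfortably in seven figures, which is exactly why the ROI conversation is so difficult.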

So how do you stack the ROI deck in favor of preventing your nights and weekends from being consumed by HA/DR issues? Check out this case study on a company that created an HA/DR solution and hosts an advanced document management system.