Cloud Computing: Plan for Failure

By Vince DiMemmo (Part 2 in a 2-part series)

Cloud service providers and end-users need to map out failure areas and plan accordingly. Best business practices such as disaster recovery, contingencies, and business continuity processes don’t go away because you’re using the cloud.

When you think about some of the architectural fundamentals of cloud services, you may realize that SLA control is equally in the hands of the end-user. Although many cloud service providers offer features and functionality to move workloads or data around, it’s largely an end-user responsibility to take advantage of these service features and integrate them into their operational scenarios.

This means the end-user needs to be involved in failure detection and response, implement or leverage the network functionality needed to minimize the impact of failures, and ultimately narrow the recovery window with contingencies.
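As a rough illustration of end-user failure detection and response, here is a minimal Python sketch that walks a priority-ordered list of service delivery zones and selects the first one whose health probe passes. The `Zone` class, the `pick_active_zone` function, the zone names, and the probes are all hypothetical and stand in for whatever checks and failover mechanisms a given provider actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Zone:
    """Hypothetical service delivery zone with an end-user supplied probe."""
    name: str
    is_healthy: Callable[[], bool]


def pick_active_zone(zones: Sequence[Zone]) -> Optional[Zone]:
    """Return the first zone, in priority order, whose probe passes."""
    for zone in zones:
        try:
            if zone.is_healthy():
                return zone
        except Exception:
            # A probe that raises is treated the same as an unhealthy zone.
            continue
    return None


def failing_probe() -> bool:
    # Simulates an outage in the primary zone (assumption for the example).
    raise ConnectionError("primary zone unreachable")


zones = [
    Zone("primary", failing_probe),
    Zone("secondary", lambda: True),
]

active = pick_active_zone(zones)
print(active.name if active else "no healthy zone: invoke the recovery plan")
```

In practice the probe would be a real check against the workload, and a `None` result would be the trigger for the contingency plan rather than a printed message.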

Cloud needs to follow standard IT best practices, and end-users need to implement appropriate plans for their business. Service providers are showing great signs that they understand this as well, as we are seeing more innovation and options for the end-user to better use the cloud.

This will translate into more distributed service provider platforms, which will yield more service delivery zones and more networking capabilities for end-users to access and integrate them. That, in turn, will allow businesses to better plan for failure, recover faster, and ultimately have better control over their SLAs.

When we are not prepared for failure, there’s frequently some level of chaos. That may mean figuring out how to contain and correct the failure, or figuring out how to move your workloads and data from one service delivery zone to another, or perhaps to another cloud service provider, all at the same time everyone else is trying to do the same thing. Planning for failure means avoiding that chaos and setting clear expectations based on planning and the implementation of best practices.

A Learning Experience

Each outage is a learning experience for both service providers and end-users. One thing we’ve learned repeatedly over the last 20+ years in the service delivery industry is that single points of failure frequently have cascading effects. So, when things go wrong, they can get much worse before they get corrected.

In the wake of a failure, when post-mortem activities yield visibility into the details, it’s easy for customers to sit back and say, “Had I known, I never would have done this. Or had I known, I would have done that.”

The same thing applies for service providers. Had they known, things would have been planned or implemented differently.

None of this fixes the past, but it can have a significant impact on the future if we analyze the results, continue to innovate, plan accordingly, and implement best practices in support of the plans.

Key Takeaways

  1. Service providers need to provide clarity on service capabilities, SLAs, service delivery metrics, and end-user responsibilities
  2. End-users need to clearly understand their areas of responsibility and the service provider’s capabilities, and take control of their SLAs
  3. Service providers need to be more distributed and networked, innovate, and ultimately create more options for end-users to plan and implement best practices
  4. End-users ultimately need to plan for failure, leverage service provider innovation and options, and implement feasible best practices

At the end of the day, failures are going to occur. It’s up to both the service provider and the end-user to leverage planning and implement best practices to address the unpredictable.