Amazon’s Cloud Outage: Lessons Learned

By Vince DiMemmo

As many of you know, Amazon’s cloud service went down on April 21. Many of its customers were back up and running in a matter of hours. Others weren’t so lucky and were down for several days.

With so many companies relying on cloud services today, it got me thinking about the high-level takeaways from this recent incident.

What the market saw was less about Amazon, per se. It was more about end users coming to terms with some of the implications of using cloud services.

At the root of it, it’s an awakening for cloud service providers to better recognize the expectations of their end users. But it’s also a wake-up call for end users to have a more realistic perspective on what cloud services are, and on where the lines of demarcation fall between what a cloud service provider is promising and delivering and what the enterprise or end user is still responsible for.

Sometimes, through overmarketing of cloud services, there’s a perception that day-to-day operations magically go away for the end user. This recent event highlights that they don’t go away just because you’re using the cloud. People need to better understand where these lines of demarcation are.

Responsibility for services such as backups, redundancy, data replication, and application and OS patching needs to be clearly understood so that when an outage does happen (and an outage will happen) people are well prepared. The service provider needs to be prepared to handle its customers, and the customers need to be prepared to handle the outage.

Cloud is a Business Model. It’s Not a Technology.

What people need to remember is that cloud is really not a new technology. It’s a procurement model and a delivery model for IT infrastructure. It uses many technologies that have been around for years.

There are always new variants, but fundamentally what makes up a cloud service is a bunch of technologies that have been around for a while.

It’s less about the technology and it’s more about the way the IT service is delivered to the customer, and how it’s consumed. It’s more about a business model.

Cloud is Infrastructure

It’s made up of a myriad of infrastructure components. It can be very complex, as we’ve learned. That’s why people rely on cloud service providers. Underneath that cloud service provider is a bunch of technology, from software through hardware, cabling, and data center environments.

There are people and processes around it to manage it. Because cloud is a business model for delivering IT more cost-effectively, you tend to have a multi-tenant environment with lots of virtualization, which means you have a lot of people, a lot of end users, and a lot of business running in a very small footprint.

So when things break, it can go bad in a big way. But again, that’s no different from the mainframe breaking 30 years ago. You had all your end users and all your business partners on it; everything was running on the mainframe. This is the mainframe in the cloud. And with the mainframe, we didn’t ignore the risk. We had best practices, we had disaster recovery, we had business continuity. We planned for failure.

So picture the mainframe moving to the cloud, and abandoning all that stuff that we did for the last 20 or 30 years. Well, that doesn’t make sense.

Like any infrastructure, the cloud is going to fail. Whether it’s the servers, whether it’s the cabling, whether it’s a problem in the code, it’s going to fail. From an IT perspective, from an end user perspective, that needs to be understood, that needs to be planned for.

Getting back to those lines of demarcation: what is your responsibility? I think enterprise end users learned that this wasn’t well understood, and that they need to be more proactive, understand where those lines are, and put the service level agreement (SLA) in their control. That starts with understanding the infrastructure and recognizing that this is all about infrastructure.

Plan for Failure

End users need to map out those points of failure and plan accordingly. Traditional IT practices such as disaster recovery, planning for failure, and contingency planning don’t go away because you’re using the cloud. Cloud needs to follow standard IT best practices, and end users need to plan around that. Service providers need to understand that as well, and I think we’re going to start seeing them create more options for end users to make better use of the cloud.

This means more availability zones, more networking capabilities, allowing people to plan for failure and recover faster.  When failures happen, when people aren’t prepared for them, both on the service provider’s side and the end user’s side, there’s some level of chaos.

Whether it’s figuring out which customers to recover first, or congestion on the Internet because everybody is trying to move their big data off one recovery zone or platform and onto another, it’s all about avoiding the chaos by having this well planned and well thought out.
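To put rough numbers on why more availability zones help with planning for failure, here is a minimal back-of-the-envelope sketch in Python. The function names and the 99.9% figure are purely illustrative assumptions, not any provider’s published numbers. The calculation also assumes zones fail independently of one another, which a cascading outage can violate, so treat the result as an optimistic upper bound rather than a guarantee.

```python
# Back-of-the-envelope availability math for redundant zones.
# ASSUMPTIONS (not from any provider's SLA): each zone has the same
# availability, and zones fail independently of one another.

HOURS_PER_YEAR = 24 * 365


def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only if every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is a hypothetical per-zone availability chosen for illustration.
    for zones in (1, 2, 3):
        a = combined_availability(0.999, zones)
        print(f"{zones} zone(s): {a:.9f} available, "
              f"about {expected_downtime_hours(a):.4f} hours down per year")
```

The point of the exercise is less the exact figures than the shape of the curve: each additional independent zone cuts expected downtime sharply, but only if the application is actually built to run in more than one zone and the failure modes really are independent.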

A Learning Experience

I think the Amazon outage was a learning experience for both sides of the industry, the service provider side and the customer side. The thing we’ve learned over the last 20 to 30 years in IT and service delivery is that single points of failure can have cascading effects.

A day or two later, when more details become available and post-mortem activities begin, it’s easy for customers to sit back and say, “Had I known, I never would have done this,” or “Had I known, I would have done that.” The same applies to service providers. Had they known, they would have planned things differently.

In Amazon’s case, there were customers who did not have a problem, and they either got lucky or they had a level of detailed planning that other customers didn’t.  Some recovered faster than others.

That’s sometimes a result of cherry-picking what can be recovered fastest. Then there are others who were simply last in line, or who were hit by cascading events that were unpredictable and couldn’t be solved at the time. Unpredictable, in the sense that nobody had planned for them.

Now, in the post-mortem process, I’m sure service providers will look at all the data and incorporate what they find into their plans. From the various buckets of customers that were affected, every one of them is probably going to walk away with more input for their planning exercise.

Some companies were down for days. Should it have taken that long to recover? It’s hard to say. Typically things fall into the buckets that I’ve just described, and when things fail, you have to stop it, contain the failure and then recover what you can as fast as possible.

Key Takeaways

  1. Service providers being clear about what they promise and deliver
  2. End users understanding the lines of demarcation
  3. Service providers creating more options for end users to put SLAs in their control
  4. End users ultimately leveraging those options and implementing best practices