Amazon Web Services (AWS) Outage Thoughts Roundup


If you’re the least bit involved with the business end of internet services (as I certainly am), you’ll have already heard that a few weeks ago, Amazon Web Services (AWS)[aws.amazon.com] suffered a major outage in an east-coast region of their platform. This outage caused serious issues and downtime for many other internet systems and services (including, but not limited to: reddit, foursquare, heroku, quora, and my employer Linden Lab / Second Life) that have come to rely on AWS over the past few years as a reliable provider of what have become known as “cloud computing services.” [wikipedia] This is their official post-mortem of the incident.

What is particularly interesting and notable about this outage, in my opinion, is the set of lessons we in the industry can learn about putting our eggs in such a basket, what “high-availability” really means, the dangers of “sorcerer’s apprentice syndrome” and “auto-immune” vulnerabilities in redundancy engineering, and how to maintain a high level of service in this age of “cloud computing.” People are still arguing and wanking about where to place the blame for all of the havoc that this incident wreaked upon the internet, but the plain truth is that there’s more than enough blame to go around for everyone: the web sites and service providers, as well as Amazon itself. On one side of things, it’s true that engineers and administrators should have spread deployments across multiple AWS regions (not just availability zones). On the other side of things, AWS has made it difficult to use multiple AWS regions, and had indeed maintained that spreading deployments across availability zones would provide adequate insurance against an outage. In this case they turned out to be very, very wrong. They also pushed a new cloud storage service (EBS) that proved to be even more unreliable and in many cases incompatible with the possibility of using multiple AWS sites.

Here’s a quick rundown:

No, in the end it turns out that it wasn’t Skynet’s fault after all. Just some overexuberance about the new hotness, and a distinct deficit in reliability engineering and availability paranoia.
