If you’re the least bit involved with the business end of internet services (as I certainly am), you’ll have already heard that a few weeks ago, Amazon Web Services (AWS)[aws.amazon.com] suffered a major outage in the east-coast region of their platform. The outage caused serious problems and downtime for many other internet systems and services (including, but not limited to: reddit, foursquare, heroku, quora, and my employer, Linden Lab / Second Life) that have come to rely on AWS over the past few years as a reliable provider of what have become known as “cloud computing services.” [wikipedia] This is their official post-mortem of the incident.
What is particularly interesting and notable about this outage, in my opinion, is the set of lessons we in the industry can learn from it: about putting all our eggs in one basket, about what “high-availability” really means, about the dangers of “sorcerer’s apprentice syndrome” and “auto-immune” vulnerabilities in redundancy engineering, and about how to maintain a high level of service in this age of “cloud computing.”

People are still arguing and wanking about where to place the blame for all of the havoc that this incident wreaked upon the internet, but the plain truth is that there’s more than enough blame to go around for everyone: the web sites and service providers, as well as Amazon itself. On one side of things, it’s true that engineers and administrators should have spread their deployments across multiple AWS regions (not just availability zones). On the other side of things, AWS has made it difficult to use multiple regions, and had indeed maintained that spreading a deployment across availability zones would provide adequate insurance against an outage. In this case they turned out to be very, very wrong, and they had also pushed a new cloud storage service (EBS) that proved to be even more unreliable, and in many cases incompatible with the possibility of using multiple AWS sites.
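To make the regions-versus-availability-zones point concrete, here is a minimal sketch of the kind of whole-region failover logic a multi-region deployment needs. This is illustrative only: the region names are real AWS region identifiers, but the health-check function and the outage scenario are hypothetical stand-ins, not any AWS API.

```python
# Hypothetical sketch of multi-region failover. Availability zones share a
# region's control plane, so a region-wide failure (like this outage) takes
# all of its zones down together; only a second *region* routes around it.

REGIONS = ["us-east-1", "us-west-1"]  # primary first, then fallbacks


def pick_healthy_region(regions, is_healthy):
    """Return the first region whose health check passes, or None if all fail."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Simulate the incident: the east-coast region is down, the west coast is up.
outage = {"us-east-1": False, "us-west-1": True}
print(pick_healthy_region(REGIONS, lambda r: outage[r]))  # -> us-west-1
```

The catch, as noted above, is that EBS volumes live inside a single region, so data has to be replicated across regions by some other means before failover like this is even possible.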
Here’s a quick rundown:
- Amazon outage and the auto-immune vulnerabilities of resiliency
- Amazon EC2 outage: summary and lessons learned
- Heroku status and post-mortem from the AWS outage — an illustration of how NOT to do things if you want reliability, but an admirable case of owning up to and being honest about your faults.
- How SmugMug survived the Amazonpocalypse — an illustration of how you SHOULD do things if you want reliability from a company that used AWS but was able to stay up.
No, in the end it turns out that it wasn’t Skynet’s fault after all. Just some over-exuberance about the new hotness, and a distinct deficit of reliability engineering and availability paranoia.