Lessons and Observations from the Amazon AWS Outage
“It turns out the cloud is actually just some place in Virginia” — Tweet by jckhewitt
We were impacted by the long EBS outage in Amazon’s US-East-1 region. BrandVerity uses a geographically distributed collection network to find and store ads, and the sites they send traffic to, for our customers. While less than a third of our servers are hosted by Amazon, we were significantly affected by the downtime because we use Amazon’s Elastic Block Store (EBS) to store some of our data. In total, it was 28 hours before Amazon restored our EBS volumes. However, we were able to keep a number of our systems running during the outage by employing traditional disaster recovery techniques (restoring from backups, etc.).
The outage left us with a few observations and lessons learned, some expected and some not.
Unexpected Dependencies and Disappearing Redundancies
Many of the companies launched over the last few years run on AWS; over a third of Y Combinator startups have their principal domain hosted there (source). Supporting technologies that are especially useful to SaaS companies are also disproportionately run on AWS.
We found unexpected dependencies in technology vendors that we had assumed would be independent of our own issues. An outside monitoring vendor had an AWS dependency that took down their monitoring, so not only was our in-house monitoring impacted, our backup monitoring went down completely.
Of course, the bigger issue is that Amazon’s Availability Zones weren’t as independent as Amazon had advertised. There is much discussion of this on the web at the moment, and Amazon’s post-mortem provides an excellent view into the issue.
The outage has certainly raised the visibility of true multi-datacenter redundancy, which had been replaced in many organizations by multi-availability zone redundancy.
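The distinction is easy to see in deployment code. The sketch below uses Python and boto3; the AMI IDs and instance type are placeholders rather than anything from our own setup. Instances spread across Availability Zones still live in a single region and share region-level services, while instances spread across regions sit in genuinely separate datacenters.

```python
# Contrast: redundancy across Availability Zones (one region, shared regional
# services) versus redundancy across regions (independent datacenters).
# AMI IDs and the instance type are placeholders.
import boto3

AMI_BY_REGION = {
    "us-east-1": "ami-00000000000000000",   # placeholder AMI for us-east-1
    "us-west-2": "ami-11111111111111111",   # placeholder AMI for us-west-2
}

def launch_multi_az(region="us-east-1", zones=("us-east-1a", "us-east-1b")):
    """Multi-AZ: separate zones, but one region and one set of regional services."""
    ec2 = boto3.client("ec2", region_name=region)
    for zone in zones:
        ec2.run_instances(
            ImageId=AMI_BY_REGION[region],
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )

def launch_multi_region(regions=("us-east-1", "us-west-2")):
    """Multi-region: independent control planes and independent failure domains."""
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(
            ImageId=AMI_BY_REGION[region],
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
        )
```

The multi-AZ path is simpler and cheaper, which is exactly why it had displaced true multi-datacenter redundancy in so many organizations; the trade-off is that anything shared at the region level remains a single point of failure.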
Cloud-based Outages Allow Teams to Focus on Software-based Recovery Options
Unplanned outages happen. In our prior companies, hardware outages resulted in scrambles to the data center. Since the hardware issues were the most immediate problem, nearly all technical personnel were on hand helping the systems administrators.
In this outage, we were a little uncomfortable being a step removed from the core issues, but that distance allowed us to get back up and running more quickly.
Our team was able to focus on software-based recovery. Rather than racking and unracking hardware or installing disks and operating systems, we pursued two independent recovery paths in parallel, leaving us in a great position should our primary recovery option have run into unexpected problems.
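As an illustration of what software-based recovery can look like in practice, here is a minimal sketch of one such path: bringing a backed-up EBS snapshot online as a fresh volume in a healthy Availability Zone. It uses boto3, and the snapshot, instance, and device identifiers are hypothetical placeholders, not our actual recovery procedure.

```python
# Sketch of a software-based recovery step: create a new EBS volume from a
# backup snapshot in an unaffected Availability Zone and attach it to a
# replacement instance. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def restore_volume_from_snapshot(snapshot_id, target_az, instance_id, device="/dev/sdf"):
    # Create a fresh volume from the most recent backup snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=target_az)
    volume_id = volume["VolumeId"]

    # Wait until the new volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the restored volume to a replacement instance in the same AZ.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id

# Example with placeholder identifiers:
# restore_volume_from_snapshot("snap-0123456789abcdef0", "us-east-1b", "i-0123456789abcdef0")
```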
Using Amazon AWS Is (Was?) Today’s Equivalent of “Buying IBM”
When we selected Amazon’s cloud platform several years ago, it hadn’t yet become the standard-bearer it is today.
We were pleasantly surprised by our customer conversations the morning of the outage. Customers readily understood the source of the problems and very few of them attributed the issues directly to us. While we are fully responsible to our customers, they generally afforded us a wider margin and assigned more blame to Amazon.
I’ve had to make a few customer calls in the past where no third party was involved in the outage. Even when the issue was an unlikely and unfortunate collision of events, customers seemed generally unhappier with us.
Customers are certainly more likely to understand that Amazon was having a massive outage than the low-probability nature of a near-simultaneous failure of two disks in a RAID 5 volume.
We didn’t expect this response, but certainly didn’t mind it. It will be interesting to see if this perception will hold given this latest outage.
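To put the RAID comparison in perspective, here is a back-of-envelope sketch of why a second disk failure during a rebuild is such an unlikely event. The annual failure rate, rebuild window, and array size are illustrative assumptions, not measured values.

```python
# Back-of-envelope estimate: chance that a second disk in a RAID 5 array fails
# during the rebuild window that follows a first failure. All numbers are
# illustrative assumptions.
annual_failure_rate = 0.03   # assumed per-disk annual failure rate
rebuild_hours = 24.0         # assumed rebuild window after the first failure
disks_remaining = 4          # e.g. a 5-disk array with one disk already failed

# Approximate probability that one given disk fails within the rebuild window.
p_one_disk = annual_failure_rate * rebuild_hours / (365 * 24)

# Probability that at least one of the remaining disks fails during the rebuild.
p_second_failure = 1 - (1 - p_one_disk) ** disks_remaining

print(f"Roughly {p_second_failure:.4%} chance of losing the array during rebuild")
```

However small that number is, it reads as bad luck on our part when it happens; a widely reported AWS outage reads as something that happened to everyone.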