One site that wasn’t outwardly impacted by the February 28, 2017 AWS us-east-1 S3 outage was Netflix.

The reason? While Netflix relies on AWS, it also plans for failure (i.e. it expects failure, unlike most organizations), architects for it and tests for it.

Netflix practices what it refers to as multi-region, active-active replication: replicating data between different AWS regions for a resilient architecture. Netflix recognizes that a complete Region outage is unlikely (or at least seemed so until February 28th), but still possible.

A complete discussion of Netflix’s architectural approach can be found in their December 2, 2013 blog post, Active-Active for Multi-Regional Resiliency.

But more importantly, Netflix tests failure – on their live, production environments!

Would you do that?

Netflix does!

In fact, they created a suite of testing tools, the Simian Army, that they routinely use to invoke and test failure and failover. They started with what they call Chaos Monkey, a service running in AWS that randomly terminates EC2 instances within Auto Scaling Groups to test resiliency. Their suite also includes Chaos Gorilla, which takes out an entire Availability Zone, and Chaos Kong, which simulates the outage of a Region. Many of these tools were released into the wild on GitHub so you can test your own architecture and its ability to deal with failure.
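To make the Chaos Monkey idea concrete, here is a minimal sketch of random instance termination. It is not Netflix's implementation: the function names, the `groups` mapping and the injected `terminate` callable are all hypothetical, chosen so the selection logic can be tried without touching a real AWS account.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Pick at most one instance per group; each group is hit with the given probability."""
    rng = rng or random.Random()
    victims: Dict[str, str] = {}
    for group_name, instance_ids in sorted(groups.items()):
        # Skip groups with fewer than two instances so a group is never emptied.
        if len(instance_ids) < 2:
            continue
        if rng.random() < probability:
            victims[group_name] = rng.choice(instance_ids)
    return victims


def unleash(
    groups: Dict[str, List[str]],
    probability: float,
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Terminate the selected instances through the injected callable and return their IDs."""
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims.values():
        terminate(instance_id)
    return list(victims.values())
```

In a real account, `terminate` might wrap a call such as boto3's `ec2.terminate_instances(InstanceIds=[instance_id])`, and `groups` would be built from your Auto Scaling Groups. Try it with a logging-only callable first.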

As we know, redundancy and resiliency come at a cost, not only in the architecture but also in the additional services required. Adrian Cockcroft, former chief architect for high performance technical computing at Netflix, estimated that Netflix’s active-active architecture added about 25% more in costs, with most of that extra cost being in the storage replication.

The bottom line is that it is technically possible to plan for, architect for and test the possibility of an AWS region failure. The Simian Army is more than willing to join in your fight.

Yesterday, AWS suffered a rare service outage. Unfortunately, the outage affected S3 storage in the largest AWS region – us-east-1. And, since S3 (Simple Storage Service) stores data objects such as files and is the primary repository for numerous other AWS services, this outage had a ripple effect throughout AWS and the numerous websites that use S3 (there are reports that 150,000 sites were affected).

As expected, there was shock and outrage from the Twitter world, customers and the web. Yet everyone needs to keep in mind that AWS does not promise 100% S3 availability; S3 is designed for 99.99%. While yesterday’s outage, in the 4 hour range, significantly exceeded the downtime that 99.99% allows (less than 1 hour over the course of a year), an outage certainly shouldn’t be treated as a complete and total surprise. Anyone who has been in the IT field knows that outages and service disruptions happen. Many people in the IT world also realize that there are ways to mitigate the remaining risk, but that in many cases the cost and effort required outweigh the risk of simply taking the chance that the inevitable happens. The big problem is that over time, with increasingly reliable hardware, software and infrastructure, the expectation has become that 99.99% availability equals 100% availability.
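The arithmetic behind those figures is easy to check. The sketch below uses hypothetical helper names and assumes a 365-day year and an outage of exactly four hours.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(
    availability_percent: float, period_minutes: int = MINUTES_PER_YEAR
) -> float:
    """Downtime budget implied by an availability target over the given period."""
    return period_minutes * (1 - availability_percent / 100)


def achieved_availability_percent(
    downtime_minutes: float, period_minutes: int = MINUTES_PER_YEAR
) -> float:
    """Availability actually achieved after the given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.99)
    achieved = achieved_availability_percent(4 * 60)
    print(f"99.99% allows about {budget:.1f} minutes of downtime per year")
    print(f"A 4 hour outage leaves about {achieved:.3f}% availability for the year")
```

A 99.99% target allows roughly 52.6 minutes of downtime a year. A single four hour outage uses more than four times that budget and leaves the year at about 99.954%, assuming no other downtime.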

While the outage was bad, the good news is that it affected only one of the four US regions (not including GovCloud). This outage should prompt some organizations to consider implementing S3 Cross-Region Replication, which AWS introduced on March 24, 2015. Cross-Region Replication makes it easier for users to keep copies of their S3 objects in a second AWS region. Built on top of S3’s existing versioning functionality, it is easy to set up and automatically replicates objects to a designated bucket in a different region; the user pays only for the additional data storage and any data transfer charges.
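As a rough illustration, setting up Cross-Region Replication with boto3 might look like the sketch below. The function names, bucket names and role ARN are hypothetical. `put_bucket_versioning` and `put_bucket_replication` are boto3 S3 client methods, but check the exact rule schema against the current boto3 documentation before relying on it. The two clients are passed in so that each can be created for its own region.

```python
from typing import Any, Dict


def replication_configuration(role_arn: str, destination_bucket: str) -> Dict[str, Any]:
    """Build a single rule that replicates every object to the destination bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: match all keys
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(
    source_client: Any,
    destination_client: Any,
    source_bucket: str,
    destination_bucket: str,
    role_arn: str,
) -> None:
    """Turn on versioning for both buckets, then attach the replication rule."""
    versioning = {"Status": "Enabled"}
    # Replication is built on versioning, so both buckets need it enabled first.
    source_client.put_bucket_versioning(
        Bucket=source_bucket, VersioningConfiguration=versioning
    )
    destination_client.put_bucket_versioning(
        Bucket=destination_bucket, VersioningConfiguration=versioning
    )
    source_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_configuration(role_arn, destination_bucket),
    )
```

Replication only applies to objects written after the rule is in place, so existing objects would need to be copied separately.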

Unfortunately, in instances similar to yesterday’s, Cross-Region Replication is not a complete solution. Since every bucket has a unique name, using your S3 replica means manually updating or reconfiguring your applications to refer to the replica bucket. Still, if you want to increase the availability of your own application, this is something that can be done.
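One way to avoid a manual switch is to let the application try the replica bucket itself when the primary read fails. This is a minimal sketch with hypothetical names. The `fetch` callable stands in for whatever S3 read call your application already uses.

```python
from typing import Callable, Optional, Sequence


def read_with_fallback(
    key: str,
    buckets: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Try each bucket in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for bucket in buckets:
        try:
            return fetch(bucket, key)
        except Exception as error:  # real code should catch the client's specific errors
            last_error = error
    raise RuntimeError(f"could not read {key!r} from any of {list(buckets)}") from last_error
```

Because replication is asynchronous, the replica may lag behind the primary, so recently written objects may be missing or stale when read from the fallback.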

Organizations may also want to consider implementing cross-region failover using Route 53 and health checks with their EC2 instances. Each AWS Region was designed to be completely isolated from the other AWS Regions. Launching instances in different Availability Zones within a region, to provide fault tolerance for outages within a single Availability Zone, is relatively simple thanks to VPC. Implementing fault tolerance for a complete region is not much more difficult, and it should definitely be looked at if you need a high-availability site or application.
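A Route 53 failover setup comes down to two record sets for the same name: a PRIMARY tied to a health check and a SECONDARY in another region. The sketch below only builds the change batch. The helper names, domain, IP addresses and health check ID are hypothetical, and the health check is assumed to exist already.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    ip_address: str,
    role: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one failover A record; role must be PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unsupported failover role: {role}")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the failover sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(
    name: str, primary_ip: str, secondary_ip: str, primary_health_check_id: str
) -> Dict[str, Any]:
    """Build a change batch that upserts the primary and secondary records."""
    primary = failover_record(name, primary_ip, "PRIMARY", "primary", primary_health_check_id)
    secondary = failover_record(name, secondary_ip, "SECONDARY", "secondary")
    return {
        "Comment": "Cross-region failover records",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": primary},
            {"Action": "UPSERT", "ResourceRecordSet": secondary},
        ],
    }
```

With boto3, the result could be passed to `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. Route 53 then answers with the secondary address while the primary's health check is failing.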

The key takeaway from yesterday’s AWS outage is that we shouldn’t forget to plan for outages. Even when your provider is the world’s largest cloud provider and promises 99.99% availability, there is still the possibility of a service disruption. Fortunately, there are tools available to help you mitigate that risk if you are willing to incur the extra expense.