One site that wasn’t outwardly impacted by the February 28, 2017 AWS us-east-1 S3 outage was Netflix.

The reason? While Netflix relies on AWS, it also plans for (that is, expects, unlike most organizations), architects for, and tests failure.

Netflix practices what it refers to as multi-region, active-active replication – replicating data between different AWS regions for a resilient architecture. Netflix recognizes that a complete region outage is unlikely, but, as February 28th demonstrated, still possible.

A complete discussion of Netflix’s architectural approach can be found in their December 2, 2013 blog post – Active-Active for Multi-Regional Resiliency.

But more importantly, Netflix tests failure – on their live, production environments!

Would you do that?

Netflix does!

In fact, they created a suite of testing tools, the Simian Army, that they routinely use to invoke and test failure and failover. They started with Chaos Monkey, a service running in AWS that randomly terminates EC2 instances within Auto Scaling groups to test resiliency. The suite also includes Chaos Gorilla, which takes out an entire Availability Zone, and Chaos Kong, which simulates the outage of an entire region. Many of these tools were released into the wild on GitHub, so you can test your own architecture and its ability to deal with failure.
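The core idea behind Chaos Monkey is small enough to sketch in a few lines of Python with boto3 (the AWS SDK). This is an illustration of the concept, not Netflix’s actual implementation; the Auto Scaling group name and region are placeholders, and `dry_run` defaults to `True` so nothing is terminated by accident.

```python
import random

def pick_victim(instance_ids, rng=random):
    """Choose one instance at random from an Auto Scaling group's members."""
    if not instance_ids:
        return None
    return rng.choice(instance_ids)

def unleash_chaos(asg_name, region="us-east-1", dry_run=True):
    """Terminate one random instance in the given Auto Scaling group.

    Requires boto3 and AWS credentials; asg_name and region are placeholders.
    """
    import boto3  # imported here so pick_victim() above runs without AWS installed
    asg = boto3.client("autoscaling", region_name=region)
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    ids = [inst["InstanceId"]
           for g in groups["AutoScalingGroups"]
           for inst in g["Instances"]]
    victim = pick_victim(ids)
    if victim and not dry_run:
        boto3.client("ec2", region_name=region).terminate_instances(
            InstanceIds=[victim])
    return victim
```

The point of the exercise is not the termination call itself but what happens afterward: the Auto Scaling group should replace the lost instance with no visible impact on users.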

As we know, redundancy and resiliency come at a cost – not only in the architecture, but in the price of the additional services. Adrian Cockcroft, former chief architect for high performance technical computing at Netflix, estimated that Netflix’s active-active architecture added about 25% more in costs, with most of that extra cost coming from storage replication.

The bottom line is that it is technically possible to plan for, architect for, and test for the possibility of an AWS region failure. The Simian Army is more than willing to join in your fight.

Yesterday, AWS suffered a rare service outage. Unfortunately, the outage affected S3 storage in the largest AWS region – us-east-1. And, since S3 (Simple Storage Service) stores data objects such as files and is the primary repository for numerous other AWS services, this outage had a ripple effect throughout AWS and the numerous websites that use S3 (there are reports that 150,000 sites were affected).

As expected, there was shock and outrage from the Twitter world, customers and the web. Yet everyone needs to keep in mind that AWS does not guarantee 100% S3 availability, only 99.99%. While yesterday’s outage, at roughly four hours, far exceeded the downtime that 99.99% availability allows (less than an hour over the course of a year), an outage certainly shouldn’t be treated as a complete and total surprise. Anyone who has been in the IT field knows that outages and service disruptions happen. Many people in the IT world also realize that there are ways to mitigate the remaining risk, but that in many cases the cost and effort required outweigh taking a chance on the inevitable. The big problem is that over time, with increasingly reliable hardware, software and infrastructure, the expectation has become that 99.99% availability equals 100% availability.
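The arithmetic behind that 99.99% figure is simple enough to check for yourself:

```python
def allowed_downtime_minutes(availability, days=365):
    """Minutes of downtime per year permitted by a given availability target."""
    return days * 24 * 60 * (1 - availability)

# 99.99% availability permits roughly 52.6 minutes of downtime per year --
# well under the roughly four hours S3 was degraded on February 28th.
print(round(allowed_downtime_minutes(0.9999), 1))  # 52.6
```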

While the outage was bad, the good news is that it affected only 1 of the 4 US regions (not including GovCloud). This outage should prompt some organizations to consider implementing AWS’s S3 Cross-Region Replication. Introduced by AWS on March 24, 2015, Cross-Region Replication lets users keep copies of their S3 objects in a second AWS region. Built on top of S3’s existing versioning functionality, it is easy to set up, automatically replicates objects to a designated bucket in a different region, and costs only the additional data storage and any data transfer charges.
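Setting up Cross-Region Replication comes down to enabling versioning on both buckets and attaching a replication configuration to the source. A sketch with boto3 might look like the following; the bucket names and IAM role ARN are placeholders, and the role must grant S3 permission to replicate on your behalf.

```python
def replication_config(role_arn, dest_bucket, prefix=""):
    """Build an S3 replication configuration copying `prefix` to dest_bucket."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "failover-copy",
            "Prefix": prefix,          # "" replicates every object
            "Status": "Enabled",
            "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
        }],
    }

def enable_replication(src_bucket, dest_bucket, role_arn):
    """Requires boto3 and AWS credentials; bucket names are placeholders."""
    import boto3
    s3 = boto3.client("s3")
    # Versioning must be enabled on both sides before replication works.
    for bucket in (src_bucket, dest_bucket):
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"})
    s3.put_bucket_replication(
        Bucket=src_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket))
```

Note that replication only applies to objects written after it is enabled; existing objects have to be copied over separately.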

Unfortunately, in incidents like yesterday’s, Cross-Region Replication is not a complete solution. Since every bucket name is globally unique, switching to your S3 replica means manually updating and/or reconfiguring your applications to refer to the replica bucket. Still, if you want to increase the availability of your own application, it can be done.
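One way to soften the bucket-name problem at the application level is a simple read fallback: try the primary bucket first, and retry against the replica when it fails. A minimal sketch, where the `fetch` callable stands in for an S3 GetObject call:

```python
def get_with_fallback(fetch, key, buckets):
    """Try each bucket in order; return the first successful result.

    `fetch(bucket, key)` stands in for an S3 GetObject call and should
    raise an exception when the bucket is unreachable.
    """
    last_error = None
    for bucket in buckets:
        try:
            return fetch(bucket, key)
        except Exception as err:
            last_error = err
    raise last_error

# Usage sketch with boto3 (bucket names are placeholders):
# import boto3
# s3 = boto3.client("s3")
# body = get_with_fallback(
#     lambda b, k: s3.get_object(Bucket=b, Key=k)["Body"].read(),
#     "reports/latest.csv", ["my-primary-bucket", "my-replica-bucket"])
```

Reads fall back cleanly this way; writes are harder, since anything written to the replica during an outage must eventually be reconciled with the primary.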

Organizations may also want to consider implementing cross-region failover using Route 53 and Health Checks with their EC2 instances. Each AWS region was designed to be completely isolated from the other AWS regions. Launching instances in different Availability Zones within a region to provide fault tolerance against an Availability Zone outage is relatively simple thanks to VPC, and implementing fault tolerance across an entire region is not much more difficult – something that should definitely be looked at if you need a high-availability site or application.
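With Route 53, cross-region failover happens at the DNS level: a health check watches the primary endpoint, and PRIMARY/SECONDARY failover record sets shift traffic to the standby region when the check fails. A sketch of the change batch boto3 expects; the domain, IP addresses, health check ID, and hosted zone ID are all placeholders.

```python
def failover_change_batch(domain, primary_ip, secondary_ip, health_check_id):
    """Build a Route 53 change batch with PRIMARY/SECONDARY failover records."""
    def record(role, ip, **extra):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "SetIdentifier": f"{domain}-{role.lower()}",
                "Failover": role,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
                **extra,
            },
        }
    return {
        "Changes": [
            # Only the primary record carries the health check; Route 53
            # serves the SECONDARY record when the check reports unhealthy.
            record("PRIMARY", primary_ip, HealthCheckId=health_check_id),
            record("SECONDARY", secondary_ip),
        ]
    }

# Applying it requires boto3, AWS credentials, and a hosted zone (placeholders):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z_EXAMPLE",
#     ChangeBatch=failover_change_batch(
#         "app.example.com.", "203.0.113.10", "198.51.100.20", "hc-1234"))
```

The low TTL matters here: it bounds how long clients keep resolving to the failed region after Route 53 switches over.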

The key takeaway from yesterday’s AWS outage is that we shouldn’t forget to plan for outages. Even if you are the world’s largest cloud provider and are promising 99.99% availability, there is still the possibility of a service disruption. Fortunately, there are tools available to help you mitigate that risk if you want to incur the extra expense.

In today’s tech world there’s no reason for a startup to even consider purchasing hardware when there are so many cloud hosting options available. The real question that needs to be answered is WHICH hosting provider you should use.

From a startup’s perspective, I think you need to look at the following major areas:

• Cost – For your particular requirements and reasonable growth expectations going forward, how do the costs for the providers you are considering compare? Compute resource costs are generally more than 80% of your total cloud costs, so it is best to start there. However, differences in RAM and CPUs and whether you will be able to take advantage of discounts complicate the analysis process. To make this even more difficult, cloud providers are continually lowering prices with Azure attempting to match AWS pricing. In the end, pricing may not be the determining factor in the selection process.
• Grow with your business – While all of the major hosting providers offer autoscaling and load balancing, both extremely important as you grow, some of the minor providers may not. Also, what are your geographical needs? Of the major players, Azure and AWS have a good worldwide distribution of data centers, while Google had a late start and is working to catch up. How do the providers handle redundancy across regions and availability zones within regions? Are there contracts involved, or can you just add and remove resources as your business needs dictate?
• Features and Functionality – The first things everyone thinks about are computing and storage, but what else might you need or use? VPN, Virtual Private Cloud, Direct Connection to on-premises resources, Search, Transcoding, IoT, Caching, etc. If you’re a Microsoft shop, using Azure might be the best choice. While these may not be the first issues you think about, when looking longer term will they be important to you?
• Databases – Which databases are you going to use? What is supported by the provider?
• Government Contracts – Will you have government contracts and need ITAR and/or FedRAMP compliance?
• Security – How easy are the security features to use? What security features are offered? What security features do you need? Do you need Active Directory integration?
• Documentation – How extensive is the documentation? Is it easy to use?
• Ease of use – Is the administrative platform simple, intuitive and easy to use?
• Support – What support is available? What are the costs of that support?
• Partners – What 3rd party partners are working with the provider? What value are they adding?
• Experienced people – Can you readily find people experienced with the provider’s services? AWS has over 26,000 Certified Solutions Architects; it is much harder to find people with certified Azure or Google skills, and harder still for the minor providers.
• Provider’s commitment and investment in Cloud Services – What is the provider’s roadmap? Where do they fit on Gartner’s Magic Quadrant for Cloud IaaS? Gartner’s August 2016 report states that the cloud market has undergone significant consolidation around Azure and AWS leaving an uncertain future for other service providers and their customers.

Magic Quadrant

The bottom line is that the choice of a cloud provider needs to be looked at through the lens of the individual organization. Many factors need to be considered and the provider that is right for one startup may not be the best choice for another.