Yesterday, AWS suffered a rare service outage. Unfortunately, the outage affected S3 storage in the largest AWS region – us-east-1. Because S3 (Simple Storage Service) stores data objects such as files and is the primary repository for numerous other AWS services, the outage rippled through AWS and the many websites that use S3 (reports put the number of affected sites at 150,000).

As expected, there was shock and outrage from Twitter, customers and the rest of the web. Yet everyone needs to keep in mind that AWS does not promise 100% S3 availability; the service is designed for 99.99%. Yesterday’s outage significantly exceeded the downtime that 99.99% allows (less than 1 hour over the course of a year, while yesterday’s outage was in the 4 hour range), but an outage certainly shouldn’t be treated as a complete and total surprise. Anyone who has been in the IT field knows that outages and service disruptions happen. Many people in the IT world also realize that there are ways to mitigate the remaining risk, but that in many cases the cost and effort required outweigh the benefit, so they take their chances on the inevitable happening. The big problem is that over time, with increasingly reliable hardware, software and infrastructure, the expectation has become that 99.99% availability equals 100% availability.
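To put numbers on that, here is a quick back-of-the-envelope sketch in Python. It assumes a 365-day year and treats the 4 hour figure as a rough estimate rather than an official outage duration; the function names are just illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year that an availability target still permits."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


def achieved_availability_pct(outage_minutes):
    """Yearly availability if the given outage were the only downtime all year."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = allowed_downtime_minutes(99.99)  # about 52.6 minutes
outage = 4 * 60                           # rough 4 hour estimate from above

print(f"99.99% allows about {budget:.1f} minutes of downtime per year")
print(f"A {outage}-minute outage alone caps the year at "
      f"{achieved_availability_pct(outage):.3f}% availability")
```

In other words, a single outage of this size uses up several years’ worth of the downtime that a 99.99% target allows.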

While the outage was bad, the good news is that it only affected 1 of the 4 US regions (not including GovCloud). This outage should get some organizations to consider implementing S3 Cross-Region Replication, which AWS introduced on March 24, 2015. Cross-Region Replication makes it easier for users to keep copies of their S3 objects in a second AWS region. It is built on top of S3’s existing versioning functionality, is easy to set up, and automatically replicates objects to a designated bucket in a different region. The user pays only for the additional data storage and any data transfer charges.
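For anyone who wants to see roughly what the setup involves, here is a minimal sketch. It assumes you are using boto3 and pass in two S3 clients, one per region; the bucket names, rule ID and IAM role are placeholders, and the rule uses the original prefix-based schema (newer configurations can use a filter instead). Treat it as an illustration, not a production script.

```python
def replication_config(role_arn, dest_bucket, rule_id="replicate-everything"):
    """Build a ReplicationConfiguration with a single replicate-all rule."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy the objects
        "Rules": [
            {
                "ID": rule_id,
                "Prefix": "",  # empty prefix = every object in the bucket
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


def enable_replication(source_s3, dest_s3, source_bucket, dest_bucket, role_arn):
    """Turn on versioning for both buckets, then attach the replication rule.

    source_s3 and dest_s3 are assumed to be boto3 S3 clients created for the
    source and destination regions respectively.
    """
    versioning = {"Status": "Enabled"}
    source_s3.put_bucket_versioning(
        Bucket=source_bucket, VersioningConfiguration=versioning)
    dest_s3.put_bucket_versioning(
        Bucket=dest_bucket, VersioningConfiguration=versioning)
    source_s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket),
    )
```

With boto3 installed, a call might look like `enable_replication(boto3.client("s3", region_name="us-east-1"), boto3.client("s3", region_name="us-west-2"), "my-app-assets", "my-app-assets-replica", role_arn)`, where every name is hypothetical. Keep in mind that replication applies to objects written after it is enabled.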

Unfortunately, in instances similar to yesterday’s, Cross-Region Replication is not a complete solution. Since every bucket has a unique name, switching over to your S3 replica means manually updating and/or reconfiguring your applications to refer to the replica bucket. Still, if you want to increase the availability of your own application, this is something that can be done.
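One way to avoid a manual switch is to build the fallback into the application itself. The sketch below is only an illustration of that idea: `fetch` stands in for whatever function your application uses to read an object (for example a thin wrapper around an S3 client), and the bucket names are made up.

```python
def read_with_fallback(fetch, key, buckets):
    """Try each bucket in order; return (bucket, data) from the first success.

    fetch is any callable fetch(bucket, key) that returns the object's bytes
    or raises an exception when the bucket cannot be reached.
    """
    errors = []
    for bucket in buckets:
        try:
            return bucket, fetch(bucket, key)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{bucket}: {exc}")
    raise RuntimeError("all buckets failed: " + "; ".join(errors))


# Stand-in for a real S3 read; simulates the primary region being down.
def fake_fetch(bucket, key):
    if bucket == "my-app-assets":
        raise ConnectionError("us-east-1 unavailable")
    return b"contents of " + key.encode()


bucket, data = read_with_fallback(
    fake_fetch, "logo.png", ["my-app-assets", "my-app-assets-replica"])
print(bucket)  # my-app-assets-replica
```

This only covers reads. Writes during an outage are a harder problem, since anything written to the replica would need to be reconciled with the primary bucket afterwards.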

Organizations may also want to consider implementing Cross-Region failover for their EC2 instances using Route 53 and Health Checks. Each AWS Region is designed to be completely isolated from the other AWS Regions. Launching instances in different Availability Zones within a region, to provide fault tolerance for an outage in a single Availability Zone, is relatively simple thanks to VPC. Implementing fault tolerance for a complete region is not much more difficult, and it should definitely be looked at if you need a High-Availability site or application.
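As a rough illustration of the DNS side, the sketch below builds a pair of Route 53 failover records: a primary tied to a health check and a secondary in another region. The domain, IP addresses and health check ID are placeholders, and it assumes the resulting `change_batch` would be passed to the boto3 Route 53 client’s `change_resource_record_sets` call along with your hosted zone ID, with the health check itself created separately.

```python
def failover_record(name, ip_address, role, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record.

    role is "PRIMARY" or "SECONDARY"; both records share the same DNS name
    and are told apart by their SetIdentifier.
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Illustrative values only: documentation IP ranges and a placeholder ID.
change_batch = {
    "Comment": "Cross-region failover sketch",
    "Changes": [
        failover_record("www.example.com.", "192.0.2.10", "PRIMARY",
                        health_check_id="primary-health-check-id-placeholder"),
        failover_record("www.example.com.", "198.51.100.20", "SECONDARY"),
    ],
}
```

When the health check on the primary fails, Route 53 starts answering queries with the secondary record, so traffic moves to the other region without anyone editing DNS by hand.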

The key takeaway from yesterday’s AWS outage is that we shouldn’t forget to plan for outages. Even when your provider is the world’s largest cloud provider and targets 99.99% availability, there is still the possibility of a service disruption. Fortunately, there are tools available to help you mitigate that risk if you are willing to incur the extra expense.

In today’s tech world there’s no reason for a startup to even consider purchasing hardware when there are so many cloud hosting options available. The real question that needs to be answered is WHICH hosting provider you should use.

From a startup’s perspective, I think you need to look at the following major areas:

• Cost – For your particular requirements and reasonable growth expectations going forward, how do the costs for the providers you are considering compare? Compute resource costs are generally more than 80% of your total cloud costs, so it is best to start there. However, differences in RAM and CPUs, and whether you will be able to take advantage of discounts, complicate the analysis. To make this even more difficult, cloud providers are continually lowering prices, with Azure attempting to match AWS pricing. In the end, pricing may not be the determining factor in the selection process.
• Grow with your business – While all of the major hosting providers offer autoscaling and load balancing, both extremely important as you grow, some of the minor providers may not. Also, what are your geographical needs? Of the major players, Azure and AWS have a good worldwide distribution of data centers, while Google got a late start and is working to catch up. How do the providers handle redundancy across regions and availability zones within regions? Are there contracts involved, or can you just add and remove resources as your business needs dictate?
• Features and Functionality – The first things everyone thinks about are computing and storage, but what else might you need or use? Consider VPN, Virtual Private Cloud, Direct Connection to on-premises resources, Search, Transcoding, IoT, Caching, etc. If you’re a Microsoft shop, using Azure might be the best choice. These may not be the first issues you think about, but will they be important to you over the longer term?
• Databases – Which databases are you going to use? What is supported by the provider?
• Government Contracts – Will you have government contracts and need ITAR and/or FedRAMP compliance?
• Security – How easy are the security features to use? What security features are offered? What security features do you need? Do you need Active Directory integration?
• Documentation – How extensive is the documentation? Is it easy to use?
• Administration – Is the administrative platform simple, intuitive and easy to use?
• Support – What support is available? What are the costs of that support?
• Partners – What 3rd party partners are working with the provider? What value are they adding?
• Experienced people – Can you readily find people experienced with the provider’s services? AWS has over 26,000 Certified Solutions Architects, but it is much harder to find people with certified skills in Azure or Google, and harder still for the minor providers.
• Provider’s commitment and investment in Cloud Services – What is the provider’s roadmap? Where do they fit on Gartner’s Magic Quadrant for Cloud IaaS? Gartner’s August 2016 report states that the cloud market has undergone significant consolidation around Azure and AWS leaving an uncertain future for other service providers and their customers.

[Figure: Gartner Magic Quadrant for Cloud IaaS]

The bottom line is that the choice of a cloud provider needs to be looked at through the lens of the individual organization. Many factors need to be considered and the provider that is right for one startup may not be the best choice for another. 

 

I just read a CNN article that essentially says that launching the Obamacare Health Exchange was an impossible task (http://money.cnn.com/2013/10/03/technology/obamacare-glitch/index.html). (I also like the URL, which maintains that the issues are still glitches. I think everyone is beginning to realize that they are way past the ‘glitch’ stage.)

The CNN article quotes Neil Quinn, CTO for Prolexic:

The rule of thumb, according to cybersecurity firm Prolexic’s CTO Neil Quinn, is for a website-runner to prepare for two to five times the web traffic expected at peak levels.
“But obviously, they weren’t going to be given a huge amount of time or budget on this,” Quinn said. “In an ideal world, you get some more informed estimates of what you are going to see [in terms of traffic].”

WHAT????

Is he trying to tell us that they weren’t given an adequate amount of time and budget to implement the cornerstone of Obama’s 8 years as President??? You’ve got to be kidding me!!!! That’s just laughable. I wonder how many tech people CNN Money had to talk to to get that quote.

The article then goes on to state:

It’s also possible that the Affordable Care Act legislation doesn’t allow the government to do a limited test run or roll out the exchanges slowly, such as going state by state or starting with people whose last names begin with letters A through M.
Those concerns, along with privacy concerns and logistical issues, created what Prince called “essentially an impossible task.” Any option the government chose came with serious cons, he said.

Using a massive cloud service like Amazon’s would have handled the large traffic load — but it would also mean storing sensitive health care information in the cloud and on rented servers, which would have drawn criticism. Instead the government used its own system and servers that it could control, which adds security, but it didn’t have the funds to build robustly enough to handle week-one traffic.

While I agree that the legislation didn’t provide for going state by state or starting with people whose last names begin with letters A through M (which, based on reports, would have only mildly helped, since it would have only reduced traffic by ½), the legislation certainly didn’t prevent load testing, which should have easily identified these issues.

I also agree that a cloud service like Amazon’s would have handled the traffic load, and economically, I might add. However, what the article failed to mention is that Amazon has a special ‘region’, GovCloud, specifically designed for securing sensitive data. From Amazon Web Services:

AWS GovCloud (US) is an isolated AWS Region designed to allow US government agencies and customers to move sensitive workloads into the cloud by addressing their specific regulatory and compliance requirements. The AWS GovCloud (US) framework adheres to U.S. International Traffic in Arms Regulations (ITAR) regulations as well as the Federal Risk and Authorization Management Program (FedRAMP) requirements. FedRAMP is a U.S. government-wide program that provides a standardized approach to security assessment, authorization, and continuous monitoring for cloud products and services. AWS GovCloud (US) has received an Agency Authorization to Operate (ATO) from the US Department of Health and Human Services (HHS) utilizing a FedRAMP accredited Third Party Assessment Organization (3PAO).

So again, while people may try to make light of the issues by calling them ‘glitches’ or an ‘impossible task’, the bottom line is that what is happening with the exchange is an inexcusable, embarrassing, totally preventable fiasco brought on by incompetent leadership.