AWS Service Disruption: Lessons Learned

The recent disruption of services on the Amazon Web Services (AWS) platform has generated a great deal of discussion about the risks of relying on the public cloud. However, if you look past the hyperbolic headlines (which ranged from “The Internet is Broken!” to “Buy <insert name of non-AWS product> and you won’t have to worry about AWS outages again!”), there are valuable lessons to be learned about best practices for high-availability (HA) design and the risk of over-relying on a single cloud region.

While AWS has a history of reliability that exceeds what most companies can achieve in a non-cloud environment (with the added benefit of an unparalleled set of features), there have been some notable past service disruptions. The recent issues with the AWS S3 service – which AWS refers to as a “Tier 0” service given its use within other AWS services – led to widespread disruptions within the AWS us-east-1 (Northern Virginia) region. What’s notable, however, is that S3 was working just fine in the other 13 AWS global regions. A bit of history: us-east-1 is the oldest AWS region and is often the default region for customers running on AWS. As such, many customers are over-leveraged in this region and have not given proper consideration to multi-region designs for high-profile workloads.

Failure Is Always a Possibility

When working with AWS, you will frequently hear two mantras coined by AWS CTO Werner Vogels:

  1. “Everything fails, all the time.”
  2. “Design for failure and nothing will fail.”

If you take these mantras to heart, you are well-advised to plan for region-wide disruptions for critical workloads. The main challenges in this type of design relate to data synchronization, from both a cost and a timeliness standpoint. Cross-region data transfer does incur a charge and has historically been more expensive than intra-region (cross-AZ) data transfer. However, AWS has lowered inter-region data transfer costs to the point where you can transfer data between regions for as little as $0.01/GB (from us-east-1 / Northern Virginia to us-east-2 / Ohio). Timeliness is the bigger concern for most workloads: there is latency involved in copying data between regions, which can impact consistency and state management for certain applications. Even so, an acceptable design can generally be achieved based on your SLAs.
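To make the data-synchronization piece concrete, here is a minimal sketch (using boto3) of enabling S3 cross-region replication for a bucket of static assets. The bucket names, IAM role ARN, and regions are hypothetical placeholders – substitute your own resources.

```python
# Sketch: enable S3 cross-region replication so new objects written in
# us-east-1 are automatically copied to a bucket in us-east-2.
# All names and ARNs below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Versioning must be enabled on both the source and destination buckets
# before replication can be configured.
s3.put_bucket_versioning(
    Bucket="my-app-assets-use1",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="my-app-assets-use1",
    ReplicationConfiguration={
        # IAM role that grants S3 permission to replicate on your behalf
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [
            {
                "ID": "replicate-all-objects",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = replicate every object
                "Destination": {"Bucket": "arn:aws:s3:::my-app-assets-use2"},
            }
        ],
    },
)
```

Replication is asynchronous, which is exactly the timeliness trade-off described above; your design has to tolerate some replication lag.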

As a simple example, consider a basic 3-tier web application. You can run this type of workload in a single region using AWS Route 53, ELB, EC2, and RDS Database services:

[Figure: single-region three-tier web architecture using Route 53, ELB, EC2, and RDS]

A website running under this reference architecture might have been impacted by the S3 disruption. While this example does not use S3 directly, the other services rely on it for some of their features.

This same workload can be extended across multiple regions by replicating the data, static objects, and application configurations between them:

[Figure: multi-region three-tier web architecture with cross-region replication and Route 53 routing]
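For the database tier, one common way to keep data flowing between regions is an RDS cross-region read replica. The sketch below (boto3, with hypothetical instance names and account numbers) creates a replica in us-east-2 from a primary in us-east-1; the right mechanism for your workload depends on your database engine and consistency requirements.

```python
# Sketch: create an RDS cross-region read replica of the application
# database. Identifiers, ARNs, and instance class are hypothetical.
import boto3

# The replica is created by calling the *destination* region and
# pointing at the source instance by its full ARN in the primary region.
rds = boto3.client("rds", region_name="us-east-2")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="webapp-db-replica-use2",
    SourceDBInstanceIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:db:webapp-db"
    ),
    DBInstanceClass="db.m4.large",
    PubliclyAccessible=False,
)
```

In a regional failover, the replica can be promoted to a standalone primary; replication is asynchronous, so the same timeliness caveats apply.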

Note that while it may be tempting to design this as an “active/failover” configuration, it is certainly possible (depending on the data consistency and state requirements of your application) to run an “active/active” architecture by using Route 53 routing policies (Latency Routing – which directs each user along their optimal path – is a good choice for this type of design). If an outage is detected in one origin region, Route 53 will route traffic only to the healthy region. To take it a step further, you can enable auto-scaling for your web and application tiers to automatically add capacity in the surviving region in the event of a regional disruption. This approach lets you run a steady-state infrastructure sized for normal traffic patterns, while scaling out as necessary to absorb the increased traffic a regional outage would bring.
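As a rough illustration of the routing piece, the sketch below uses boto3 to create latency-routed alias records for two regional load balancers. The hosted zone ID, ELB DNS names, and ELB hosted zone IDs are hypothetical placeholders; with EvaluateTargetHealth enabled, Route 53 stops returning a region whose load balancer is unhealthy.

```python
# Sketch: latency-based routing across two regions with Route 53.
# Zone IDs and DNS names are hypothetical placeholders.
import boto3

route53 = boto3.client("route53")

def latency_record(region, elb_dns_name, elb_zone_id):
    """One latency-routed alias record pointing at that region's ELB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.example.com",
            "Type": "A",
            "SetIdentifier": region,   # one record set per region
            "Region": region,          # turns on latency-based routing
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,
                "DNSName": elb_dns_name,
                # Skip this region if its load balancer reports unhealthy
                "EvaluateTargetHealth": True,
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLEZONE",
    ChangeBatch={
        "Changes": [
            latency_record("us-east-1",
                           "web-use1.us-east-1.elb.amazonaws.com",
                           "ZELBZONEUSE1"),
            latency_record("us-east-2",
                           "web-use2.us-east-2.elb.amazonaws.com",
                           "ZELBZONEUSE2"),
        ]
    },
)
```

Swapping the routing policy to Failover (with explicit health checks) would give you the active/failover variant instead.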

How To Properly Leverage Your Cloud (Minus Unnecessary Panic!)

A further layer of protection against AWS-specific issues can be achieved by using non-AWS services in conjunction with your AWS deployment. A good example is fronting the website with a non-AWS Content Delivery Network (CDN) such as Akamai or Cloudflare. These services provide edge caching and DNS routing independent of AWS, and while they are typically used to speed up page delivery, they can also maintain availability during a service disruption by serving cached content and continuing to answer DNS queries. As an aside, AWS’s own CDN – CloudFront – was itself impacted by the recent service disruption at some of its edge locations.
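As one hedged example of what “fronting with a non-AWS CDN” might look like, the snippet below uses Cloudflare’s public v4 API to publish a proxied DNS record that sends traffic through Cloudflare’s edge before it reaches the AWS origin. The zone ID, API token, and hostnames are hypothetical; Akamai exposes comparable configuration APIs.

```python
# Sketch: put Cloudflare's edge (caching + DNS) in front of an AWS origin.
# Zone ID, token, and hostnames are hypothetical placeholders.
import requests

CF_API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "0123456789abcdef0123456789abcdef"
HEADERS = {"Authorization": "Bearer <api-token>"}

# A proxied CNAME routes requests through Cloudflare's CDN rather than
# resolving straight to the origin, so cached content keeps being served
# (and DNS keeps answering) even if the AWS origin is struggling.
resp = requests.post(
    f"{CF_API}/zones/{ZONE_ID}/dns_records",
    headers=HEADERS,
    json={
        "type": "CNAME",
        "name": "www.example.com",
        "content": "web-use1.us-east-1.elb.amazonaws.com",
        "proxied": True,  # False would make it a plain DNS-only record
    },
)
resp.raise_for_status()
```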

Any of these multi-region configurations would have mitigated – or entirely negated – the impact of the recent S3 disruption.

If you are interested in learning more about how to architect for high-availability on the public cloud, contact AHEAD today!



Author: Dan Ryerson
Dan Ryerson is a Principal Consultant at AHEAD where he focuses on enterprise cloud transformation.
