AWS Reliability Tools and Best Practices


AWS Well-Architected Framework

What is the AWS Well-Architected Framework? The AWS Well-Architected Framework (WAF) is built on five pillars that simplify the process of building secure, high-performing, resilient, and efficient infrastructure for cloud applications. It gives cloud architects a consistent process for reviewing and measuring architectures against AWS best practices.

As one of the five pillars of the AWS Well-Architected Framework, reliability is a key focus of the framework's best practices. Infrastructure reliability means different things to different people, but the framework defines it as "the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues." Of course, uptime remains a critical metric for network reliability.

The WAF reliability pillar emphasizes three areas of concern: Foundations, Change Management, and Failure Management. This blog looks at each area and reviews the AWS tools and best practices that can be used to address it.

AWS Reliability Foundations

Foundations best practices cover the reliability decisions that should be made before the system is architected. They fall into two areas: limit management and network topology planning.
  1. Limit management – limit management addresses the physical limitations and resource constraints of your network architecture. Physical limitations include making sure your AWS instances provide the bandwidth and storage capacity you need now and in the future. AWS also imposes soft limits, such as the number of requests, EC2 instances, and EBS volumes, which can be raised on request, and hard limits, such as the number of security groups and the number of rules per security group, which cannot be changed. Use the free AWS Trusted Advisor checks to test the adequacy of your architecture for performance, service limits, and security groups. AWS recommends tracking your limits by storing them in DynamoDB, or by integrating your Configuration Management Database with the AWS Support APIs, and setting alarms in CloudWatch for the limits it tracks (a sketch of this approach follows this list).
  2. Network topology planning – topology planning means planning for future growth in the number of IP addresses you will need and in the systems and networks you may need to integrate with. You also need to plan for resiliency: for possible failures, misconfigurations, attacks, and unexpected increases in traffic or service use. A best practice for IP addressing is to use Amazon VPC and allocate private address ranges, as identified in RFC 1918, for your VPC Classless Inter-Domain Routing (CIDR) blocks, either to provide non-Internet-accessible resources or to extend your data center. For resiliency, best practices are to make connections to your data centers redundant and to create a subnet or set of subnets in each Availability Zone to serve as a barrier between the Internet and your applications (a sketch of per-AZ subnet creation follows this list). AWS also has many attack-protection services, such as Web Application Firewalls, that can be used to deflect common attacks.
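
Resource-limit tracking can be wired together with a few service calls. The sketch below shows one minimal way to do it with boto3; the DynamoDB table name, its key, and the assumption of a support plan that exposes the AWS Support API are all illustrative additions, not details from the original guidance.

```python
# Sketch: record flagged Trusted Advisor "Service Limits" results in DynamoDB.
# Assumptions: a support plan that enables the AWS Support API (available only
# in us-east-1) and an existing table "service-limits" keyed on "resourceId".
import boto3

support = boto3.client("support", region_name="us-east-1")
limits_table = boto3.resource("dynamodb").Table("service-limits")  # hypothetical table

# Locate the "Service Limits" check among all Trusted Advisor checks.
checks = support.describe_trusted_advisor_checks(language="en")["checks"]
service_limits_check = next(c for c in checks if c["name"] == "Service Limits")

# Pull the current results and store anything flagged as a warning or error.
result = support.describe_trusted_advisor_check_result(
    checkId=service_limits_check["id"], language="en"
)["result"]

for resource in result.get("flaggedResources", []):
    if resource.get("status") in ("warning", "error"):
        limits_table.put_item(
            Item={
                "resourceId": resource["resourceId"],
                "status": resource["status"],
                "details": resource.get("metadata", []),
            }
        )
```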
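
The subnet-per-Availability-Zone practice can also be sketched in a few lines. The CIDR ranges below are illustrative RFC 1918 choices rather than recommendations from the post.

```python
# Sketch: a VPC carved from a private (RFC 1918) range with one subnet per
# Availability Zone in the current region. Address sizes are illustrative only.
import boto3

ec2 = boto3.client("ec2")

# Private address range for the VPC; size it for future growth.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

# One subnet per Availability Zone, so a single zone failure never takes out the tier.
zones = ec2.describe_availability_zones()["AvailabilityZones"]
for index, zone in enumerate(zones):
    ec2.create_subnet(
        VpcId=vpc_id,
        CidrBlock=f"10.0.{index}.0/24",
        AvailabilityZone=zone["ZoneName"],
    )
```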

AWS Change Management

Change management covers how you monitor your systems and how you plan for changes in demand and changes in execution.

  1. Monitoring – you can't manage what you can't measure, and monitoring is critical for effective change management. AWS has customizable hooks and visibility into everything from instance performance to network layers, down to the request APIs themselves. Identify the services and applications you want to monitor, define the metrics you care about, and learn how to access the logs for those metrics from AWS products and features. The key AWS service that supports monitoring is Amazon CloudWatch, which allows for easy creation of alarms that can automatically trigger scaling actions (see the alarm sketch after this list).
  2. Changes in demand – it is often when demand spikes that you become aware of architectural defects, just when you can least afford them. The best way to avoid scalability issues is to test your implementation rigorously under conditions as close to real as possible. Using AWS Auto Scaling is the best practice for automating instance replication: create Auto Scaling groups for specific resource types and use CloudWatch to set scaling triggers (see the Auto Scaling sketch after this list).
  3. Change execution – in the cloud, executing changes is a matter of software development. In an infrastructure-as-code environment, changes to infrastructure can be described as differences between the running environment and the objects held in source control. Set up development, test, and production environments that allow you to test your changes effectively before deploying them. You can also test complete deployments, with all the bells and whistles, in your production environment: networks, firewalls, data transmission, and so on. The key tool that enables infrastructure as code is AWS CloudFormation; you can deploy any part of your infrastructure and applications as distinct CloudFormation stacks (a deployment sketch follows this list).
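
To make the monitoring point concrete, here is a minimal CloudWatch alarm sketch. The Auto Scaling group name and SNS topic ARN are placeholders, not values from the post.

```python
# Sketch: alarm when average CPU across an Auto Scaling group stays above 80%
# for two consecutive five-minute periods, then notify an SNS topic.
# "web-asg" and the topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-asg-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,               # five-minute samples
    EvaluationPeriods=2,      # two breaches in a row before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```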
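
For changes in demand, a target-tracking policy is one way to let CloudWatch metrics drive scaling automatically. The group name, launch template, subnet IDs, and target value below are assumptions made for the sake of the sketch.

```python
# Sketch: an Auto Scaling group spread across two subnets with a target-tracking
# policy; Auto Scaling creates the underlying CloudWatch alarms itself.
# All names and IDs are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # one subnet per AZ
)

# Scale out and in to hold average CPU near 60 percent.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```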
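
And for change execution, here is a small sketch of deploying a version-controlled template as a CloudFormation stack. The stack name and bucket resource are hypothetical, and the template is inlined only to keep the example self-contained.

```python
# Sketch: infrastructure as code via CloudFormation. The template would normally
# live in source control; it is inlined here only for brevity.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  BackupBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
"""

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(StackName="backup-storage", TemplateBody=TEMPLATE)

# Block until the stack finishes creating before the pipeline moves on.
cloudformation.get_waiter("stack_create_complete").wait(StackName="backup-storage")
```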

AWS Failure Management

Failure management recognizes that failures will happen; what matters is knowing how to identify failures, respond to them, and prevent them from recurring. The WAF breaks failure management into three parts: data durability, withstanding component failures, and planning for recovery.
  1. Data durability – the loss of data is one of those things that keeps most IT leaders up at night. A general best practice is to define a recovery point objective (RPO), the maximum amount of data loss, measured in time, that you can tolerate in an incident, and a recovery time objective (RTO), the amount of time it should take to restore data. AWS recommends regularly testing backup and restore capabilities to define these thresholds and set policies accordingly (a backup-and-restore sketch follows this list). How durable is AWS? The key AWS service is S3 storage: designed for 99.999999999 percent (eleven nines) durability, S3 provides near-perfect data durability.
  2. Withstanding component failure – load sharing is the primary means of eliminating single points of failure that can damage or lose data. In the cloud, Mean Time To Recover (MTTR) is more important than Mean Time Between Failures (MTBF), because recovery can be automated based on calculated recovery times, and AWS can automatically take action and notify the appropriate personnel. A WAF best practice is to design your infrastructure so that systems are decoupled, avoiding a domino effect of cascading failures. AWS offers multiple load-sharing tools, including Availability Zones in multiple AWS Regions, Elastic Load Balancing, Application Load Balancers, and S3 storage. Another AWS best practice is to use the AWS SDKs to test how components withstand failure and to determine failure and recovery thresholds (a retry-configuration sketch follows this list).
  3. Planning for recovery – "expect the unexpected" might be the watchword for AWS failure management. It is critical to know what to do in the event of a serious system, service, or component failure. Testing for resiliency, performing disaster recovery drills until responding becomes second nature, keeping all versions of the network in sync, and using Availability Zones to shift operations to a working site and avoid business disruption are all critical AWS best practices. The key AWS service for recovery planning is AWS Identity and Access Management (IAM), which can be used to grant access to those who need it if disaster strikes. Regular backups to S3 are also critical, as is the ability to automate the delivery of all systems to another AWS Region or account (a cross-Region copy sketch follows this list).
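
A backup-and-restore drill against S3 can be as small as the sketch below; the bucket, key, and file names are placeholders, and in practice the restored copy would also be checked for integrity.

```python
# Sketch: a routine backup-and-restore test against S3, the kind of drill used
# to validate RPO/RTO assumptions. Bucket, key, and file names are placeholders.
import boto3

s3 = boto3.client("s3")

# Back up: push the file to an S3 bucket (ideally with versioning enabled).
s3.upload_file("orders.db", "example-backup-bucket", "backups/orders.db")

# Restore: pull it back down and confirm the copy is usable.
s3.download_file("example-backup-bucket", "backups/orders.db", "restored-orders.db")
```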
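
On the SDK side, retry and timeout settings are one simple way to keep a single slow dependency from cascading into a wider failure; the values below are illustrative, not recommendations from the post.

```python
# Sketch: SDK-level timeouts and retries so a client fails fast and retries
# transient errors instead of hanging and cascading the failure upstream.
import boto3
from botocore.config import Config

resilient_config = Config(
    connect_timeout=3,    # seconds allowed to establish a connection
    read_timeout=5,       # seconds allowed to wait for a response
    retries={"max_attempts": 5, "mode": "standard"},
)

# Any client built with this config retries transient errors with backoff.
dynamodb = boto3.client("dynamodb", config=resilient_config)
```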
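
Finally, for recovery planning, one way to keep a second Region ready is simply to copy backups across Regions on a schedule. The bucket names, prefix, and Regions below are placeholders.

```python
# Sketch: copy backup objects from the primary Region's bucket to a bucket in a
# second Region so recovery can proceed even if the primary Region is impaired.
# Bucket names, prefix, and Regions are placeholders; large buckets would need
# pagination over list_objects_v2.
import boto3

source = boto3.client("s3", region_name="us-east-1")
target = boto3.client("s3", region_name="us-west-2")

# Copy every object under the backups/ prefix into the disaster-recovery bucket.
listing = source.list_objects_v2(Bucket="example-backup-bucket", Prefix="backups/")
for obj in listing.get("Contents", []):
    target.copy_object(
        Bucket="example-backup-dr-bucket",
        Key=obj["Key"],
        CopySource={"Bucket": "example-backup-bucket", "Key": obj["Key"]},
    )
```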

The good news is that AWS has proven more reliable than traditional data centers when it comes to change and failure management. As Richard Cowley, Director of Operations at Slack, has said, "AWS does a much better job at security than we could ever do running a cage in a data center." Follow these best practices to ensure your infrastructure is as reliable and resilient as possible.

This blog summarizes a more detailed AWS document, “Reliability Pillar: Well-Architected Framework.”

Learn more about the other Well-Architected Framework pillars:
