Disaster Recovery on the Cloud


At TriNimbus, we see many enterprises choosing a disaster recovery (DR) project as their first cloud initiative, and it makes perfect sense. Instead of exposing themselves to the risks, unknowns and tight deadlines of migrating production environments to the cloud while those environments are actively serving customers, companies choose to build a DR version of their production environment (which is often located on-premise) on top of the AWS platform.

In this blog post, we expand on some of the methodologies, tools and practices we use in these types of engagements, and provide the rationale behind decisions that ensure the success of the project and overall customer satisfaction.

Advantages of a DR-first Approach

Since a DR site is, to a certain extent, a copy of your production environment, planning, architecting and implementing a DR solution confronts companies with the same questions and dilemmas that would come up if they were migrating their production environments, but without the risks that would otherwise hang over the heads of the Operations team. As an added bonus, because the DR environment is not yet serving your customers, you can perform thorough performance, load and functional testing without affecting your clients.

Another benefit of starting with a DR initiative is that your team can develop a completely new set of skills, familiarize themselves with the very important concept of Infrastructure as Code (IaC), and work with the new tools and methodologies required for developing and operating cloud solutions. These are very different from the traditional tools and skills required for operating in physical colocation facilities and data centers.

In the days before cloud computing, if an organization needed to operate a DR site, contracts with DR facilities had to be negotiated and signed, cages built, networks wired, hardware sufficient for running the production environment procured, provisioned and installed (or, as was very common, old end-of-life hardware from production facilities was repurposed for DR use), software installed and configured, and so on.

Processes used in DR differ enough from those used in production that the result is often confusing instructions and a lack of appropriate runbooks. On top of that, building and maintaining a DR site in a physical data center carries significant upfront investment and hefty operational expenditures. In short, many organizations see DR as a costly and cumbersome project and prioritize it accordingly.

By moving a DR site to the cloud, an organization can drastically reduce its DR budget and gain the ability to fail over to DR completely in a matter of minutes—as opposed to hours or even days in a traditional data center. This is achieved by maintaining a very small footprint of DR infrastructure in the cloud—called a “pilot light”—with the ability to quickly scale up and out in the event of a failover, while paying only for the capacity that is currently required and provisioned.

Leveraging Infrastructure as Code

Today, in the cloud, organizations can resolve many of the challenges that affect the speed and quality of infrastructure deployment by leveraging tried-and-true software development practices and provisioning their infrastructure with code.

For example: development needs a new isolated environment to test a web app, or operations needs to test a new egress firewall rule? A team can write a CloudFormation template that creates a VPC, subnets and some base security groups, and voilà!—you have the equivalent of an isolated piece of a data center all to yourself, and it took only 10 minutes and less than 100 lines of code. If your templates use intrinsic functions and parameters, you can reuse them over and over, in any AWS region across the globe. Then, in the same way, write some more code for your application servers, databases, directory services, firewalls, load balancers, certificate management, logging, security groups, and so on.
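To make this concrete, here is a minimal sketch of such a template; the resource names and CIDR ranges are illustrative, not taken from any real project:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal isolated test environment (illustrative sketch)

Parameters:
  EnvironmentName:
    Type: String
    Default: dev

Resources:
  Vpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: !Sub '${EnvironmentName}-vpc'

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.0.0/24
      # Intrinsic functions keep the template region-agnostic: pick the
      # first AZ of whatever region the stack is deployed into
      AvailabilityZone: !Select [0, !GetAZs '']
      Tags:
        - Key: Name
          Value: !Sub '${EnvironmentName}-public-a'

  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound HTTPS only
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
```

Because the template takes the environment name as a parameter and derives the availability zone at deploy time, the same file can be launched in any region, for any environment.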

The enormous benefit of IaC is that you can reuse the same templates across multiple environments, supplying different environment-specific parameters to create independent deployments. The same templates can be reused regardless of the AWS region you are deploying your application or infrastructure into, which makes your deployments consistent, whether the target is a dev, QA, UAT, prod or DR environment. The only things that need to differ between environments are a handful of parameter values and the application’s persistent data.
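For instance, a template fragment like the following (all values invented for illustration) keeps every environment-specific knob in the Parameters section, so the template body itself never changes between dev and DR:

```yaml
# Fragment: the environment-specific values live in Parameters, so one
# template serves dev, QA, UAT, prod and DR deployments alike
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, qa, uat, prod, dr]
  VpcCidr:
    Type: String
    Default: 10.10.0.0/16   # dev default; a DR deployment would pass its own CIDR
  AppInstanceType:
    Type: String
    Default: t3.small       # prod and DR deployments would pass a larger type
```

At deploy time, each environment supplies its own values for these parameters, and everything else in the stack stays identical.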

While working on a DR environment, the team should keep in mind that they are not just building DR; they are planning, architecting and designing the foundation for future production in the cloud.

Example Key Objectives of a DR-first Project

  • Build the infrastructure for DR in the cloud while maintaining any required compliance (e.g. HIPAA). Note: to meet requirements like these, it is imperative to include the Security team from the beginning of the project.
  • Reduce the existing recovery time objective (RTO), e.g. from 36 to 24 hours, and recovery point objective (RPO), e.g. from 24 to 12 hours.
  • Re-architect any directory services (e.g. Microsoft Active Directory) and DNS needs, and incorporate them into a managed-service (e.g. Amazon Route 53) where available.
  • Architect external networking requirements such as AWS Direct Connect or VPN connections back to on-premise offices or data centers.
  • Build any custom, third-party solutions the environment requires (e.g. an SMB NAS that can support very large amounts of data).
  • Architect, design and implement a CI/CD pipeline for application deployment.

Let’s take a look at how TriNimbus may solve some of these challenges. By using IaC, not only can we meet these key DR objectives, but we can also make the deployment of additional environments—like dev and QA, and eventually production—that much simpler, as all the components can be deployed anywhere, any time.

Naturally, you need to start with a good high-level plan and drill down until you reach the required granularity. When doing this, it is key to have a good naming convention and to use resource tags; we will return to this point when we discuss the templates, configurations and IaC principles.

Once the overall high-level architectural design is created and approved, it’s time to choose the tools, and start writing code.

Tooling

Here are some of the tools we use in addition to what’s provided by AWS:

  • FreeMarker: A templating engine used for generating and parametrizing CloudFormation templates
  • Jenkins: An automation and orchestration tool
  • Packer: Used to “bake” the Amazon Machine Images (AMIs) from which EC2 instances are launched
  • Consul & Vault: For secrets management
  • Spring Cloud Config: For application configuration management
  • Git: A source control management (SCM) tool
  • SMB NAS storage appliances: Third-party appliances for large-scale SMB file storage

Some of the AWS services we use include Amazon Virtual Private Cloud (VPC), Amazon Elastic Compute Cloud (EC2), AWS Identity and Access Management (IAM), AWS Certificate Manager (ACM), Amazon Elastic Block Store (EBS), AWS Elastic Load Balancing (ELB), Amazon Simple Storage Service (S3), AWS Systems Manager (SSM), Amazon Route 53, AWS Direct Connect, AWS CodeCommit, and AWS CloudTrail.

Using a templating engine like FreeMarker and a couple of data files, we are able to write one source template for each component of the solution, which in turn generates separate CloudFormation templates for each account, region, environment or sub-environment, all from a single source file. All this requires is a good naming convention, consistent tag usage and some logic.
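As a tiny illustration of the idea (the file name, fields and placeholder values are hypothetical, not the actual project files), the data files enumerate the deployment targets, and the build step renders each source template once per entry:

```yaml
# environments.yaml: hypothetical data file driving template generation.
# One CloudFormation template is generated per entry from each source template.
environments:
  - name: dr
    account: "123456789012"    # placeholder account id
    region: us-west-2
    vpcCidr: 10.40.0.0/16
    subnets:
      - { logicalId: AppSubnetA, name: app-a, cidr: 10.40.1.0/24, az: us-west-2a }
      - { logicalId: AppSubnetB, name: app-b, cidr: 10.40.2.0/24, az: us-west-2b }
  # ...further environments (dev, QA, prod) follow the same shape
```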

Infrastructure

One FreeMarker template created all the CloudFormation templates needed to deploy the infrastructure for each account, region and type of service (e.g. applications, AD, or shared services). This includes all the networking components, such as VPCs, VPC peering connections, subnets for each layer, gateways and routing. To illustrate the power of IaC: deploying those templates to multiple regions is the equivalent of travelling to a city on the other side of the country (or the world!) and setting up a data center, complete with server cages, core networking gear with VLANs and the required firewalls—mind-blowing for IT professionals who haven’t been exposed to the cloud yet!
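A hedged sketch of what such a source template can look like, consuming the hypothetical data file shown above (again, the file name and data model are assumptions for illustration):

```yaml
<#-- network.ftl: hypothetical FreeMarker source template. The variables
     env, region, vpcCidr and subnets come from the data file; rendering
     the template once per data-file entry emits a separate CloudFormation
     template for each account/region/environment combination. -->
AWSTemplateFormatVersion: '2010-09-09'
Description: Network stack for ${env} in ${region}

Resources:
  Vpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: ${vpcCidr}
      Tags:
        - Key: Name
          Value: ${env}-${region}-vpc
<#list subnets as subnet>
  ${subnet.logicalId}:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: ${subnet.cidr}
      AvailabilityZone: ${subnet.az}
      Tags:
        - Key: Name
          Value: ${env}-${subnet.name}
</#list>
```

Adding a subnet, or an entire new environment, then becomes a data-file change rather than a template change.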

Security

The following are a few of the security challenges and requirements that can arise in a DR migration project. For all components, we follow the principle of least privilege and access. Also, one of the fundamental requirements of various compliance regimes (e.g. HIPAA) is full end-to-end encryption—at rest and in transit—so challenges like this must be addressed as well.

  • Security groups: Similar to firewalls. A FreeMarker template is created using our naming convention, with logic that allows only the required inbound traffic for each individual application or service. This completely encapsulates each application to meet compliance requirements and beyond. When compiling the template, it’s easy to generate hundreds of security groups for each environment; if we want to add another application or deploy in a different region, we just add it to the common data file, re-compile and seamlessly re-deploy (a sketch of this pattern follows this list).
  • IAM roles, policies, and KMS: Used extensively to enforce least privilege and access across the environments. Yes, these are also created with templates.
  • TLS/SSL encryption in transit: Challenges can arise if you want to keep ACM certificates around in your DR environment to decrease future provisioning and validation time. Because the services in a cold DR environment aren’t created until you activate the environment, there is nothing to assign the dormant certificates to—and unused certificates aren’t automatically renewed. Some custom code can solve this issue, but it’s an in-depth and interesting solution that could be a blog post in itself.
  • Naturally, all data is encrypted at rest on EBS, EFS, S3, and third-party storage solutions.
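As a rough sketch of that security-group generation pattern (the data model, export names and fields are assumptions for illustration), the source template loops over an application list from the common data file and emits one tightly scoped group per entry:

```yaml
<#-- security-groups.ftl: hypothetical sketch. 'applications' comes from the
     common data file; each entry lists only the inbound rules that
     application needs. Assumes the network stack exports the VPC id and
     per-application security group ids for cross-stack references. -->
Resources:
<#list applications as app>
  ${app.logicalId}SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Inbound rules for ${app.name} in ${env}
      VpcId: !ImportValue ${env}-vpc-id
      SecurityGroupIngress:
  <#list app.ingress as rule>
        - IpProtocol: tcp
          FromPort: ${rule.port}
          ToPort: ${rule.port}
          SourceSecurityGroupId: !ImportValue ${env}-${rule.source}-sg-id
  </#list>
      Tags:
        - Key: Name
          Value: ${env}-${app.name}-sg
</#list>
```

Adding an application means adding one entry to the data file and re-compiling; the naming convention keeps the generated groups consistent across environments.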

Application Servers

Close to 100% of the application deployment templates can be created from a single FreeMarker source template. Using an associated data file, this template distinguishes between:

  • Application, account, and environment.
  • Windows or Linux OS.
  • The number of drives per instance.
  • Application Load Balancers (ALBs) with appropriate rules, ports, and TLS certificates.
  • Auto Scaling groups (ASGs) with min, max, and desired number of instances, and update policy attributes.
  • Appropriate IAM roles, policies, variables, and any other requirements.
  • Correct tags for each instance depending on application and other variables.

This one template can create all the custom CloudFormation templates needed for your DR account. When you need to deploy other environments (e.g. dev, QA, etc.), it’s just a matter of adding them to the data files.
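To make that concrete, an entry in such a data file might look like the following (all names and values are invented for illustration):

```yaml
# apps.yaml: hypothetical data file consumed by the application source template
applications:
  - name: billing-api
    account: dr
    environment: dr
    os: linux
    drives: 2                     # EBS volumes per instance
    alb:
      port: 443
      certificate: billing-api-dr # looked up in ACM by the template
    asg:
      min: 0                      # pilot light: kept at zero until failover
      max: 6
      desired: 0
    tags:
      Application: billing-api
      Environment: dr
```

Note how the Auto Scaling group sits at zero instances until a failover is declared, which is exactly the “pilot light” footprint described earlier.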

Automation

So far, we have gone over how to automate the creation of our CloudFormation templates, but that’s just one part of the puzzle. There’s also the creation of our AMIs and the actual configuration of the servers once they’re deployed. Here’s a bit on how those requirements can be met.

  • Packer: We use this to create our AMIs in a couple of stages: start from a base image, add some commonly used tools, and finish with the latest code updates. This process helps maintain immutable components.
  • Jenkins: The heart of the CI/CD workflow, used to deploy and bring things together.
  • Consul, Vault and Spring Cloud Config: As we launch instances from the AMIs, a script runs on each one that checks the instance’s tags to determine which application and environment it belongs to, then grabs the appropriate tokens, configs and secrets needed to configure the application—all done securely and accurately.
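One way to wire up that last step is a user-data script baked into the AMI. The sketch below shows the general shape as a cloud-config; the endpoints, file paths and the config-server hostname are assumptions, not the actual project values:

```yaml
#cloud-config
# Hypothetical bootstrap sketch: resolve this instance's tags, then fetch
# the matching secrets and configuration. Vault authentication is elided.
runcmd:
  # Discover who we are from the EC2 instance metadata service
  - INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  - REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
  # Read the Application and Environment tags assigned by the template
  - APP=$(aws ec2 describe-tags --region "$REGION" --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Application" --query "Tags[0].Value" --output text)
  - ENV=$(aws ec2 describe-tags --region "$REGION" --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Environment" --query "Tags[0].Value" --output text)
  # Pull the secrets and configuration for exactly this app/environment pair
  - vault kv get -field=db_password "secret/$ENV/$APP" > /etc/myapp/db_password
  - curl -s "http://config.internal:8888/$APP/$ENV" -o /etc/myapp/application.json
```

The instance needs an IAM role allowing ec2:DescribeTags, which the same IaC that launched it can grant.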

Conclusion

If you read this and find yourself facing a similar project on the horizon, take the time to carefully consider how the cloud fits into your long-term plans. A strategic plan to migrate hardware solutions to the cloud can be daunting to start. If you need help planning and assessing your cloud strategy, TriNimbus would be happy to assist.

In the meantime, we recently released a case study for a DR project with PointClickCare.  Please download it and have a read for a real-world example of a DR migration!
