By Brandon Pierce, Director of Engineering at Onica and Roy Kalamaro, Security Architect at Onica.
This post was originally published on the AWS Partner Network (APN) blog.
Although a hybrid cloud environment is not always ideal, it may be necessary for clients who are facing certain limitations in their workloads. If that’s your situation, we want to make sure you’re setting up your hybrid cloud architecture correctly.
What is AWS Systems Manager?
AWS Systems Manager is a valuable resource for quickly assessing operational insights and taking action in both AWS and on-premises environments. It gives you visibility and control of your infrastructure on AWS and allows you to automate operational tasks across your AWS resources. It provides a unified user interface so you can view operational data from multiple AWS services, thereby shortening the time it takes to find and fix operational problems and making it simple to manage your infrastructure securely at scale.
In this post, Onica’s Brandon Pierce and Roy Kalamaro share use cases and demonstrate how organizations can utilize AWS Systems Manager to simplify hybrid environment operations, enabling you to significantly reduce operational overhead and manual procedures.
Managing Hybrid Workloads with AWS Systems Manager
On-premises servers are a limiting factor in hybrid infrastructures and are often unable to integrate with the capabilities of cloud services or communicate seamlessly with cloud counterparts.
AWS Systems Manager offers unprecedented insights and access with a unified user interface (UI) that includes information from a multitude of AWS services and on-premises servers. AWS Systems Manager uses a lightweight agent installed on servers to provide visibility, eliminating communications challenges faced in most hybrid environments.
At Onica, we utilize AWS Systems Manager to simplify resource grouping while leveraging access to automated command execution. Previously, these actions would have been documented in a Standard Operating Procedure (SOP) or runbook and executed manually.
Through AWS Systems Manager, we can provide contextual information to Amazon CloudWatch Alarm notifications.
Below are some ways in which our team has found value in the automation provided by AWS Systems Manager.
Remote Management of Hybrid Environments at Scale
When we need to manage on-premises systems for clients, the AWS Systems Manager Agent (SSM Agent) allows for seamless management using the same console, API, automation, and tooling that we would utilize within AWS.
One of the main challenges we have faced when working in a hybrid environment has been utilizing a single management tool for control and orchestration of Windows and Linux OS across multiple hosting platforms.
SSM Agent is able to monitor the heartbeat of Amazon Elastic Compute Cloud (Amazon EC2) instances, as well as that of remote on-premises servers. Additionally, it allows our team to run commands and verify output regardless of the OS, hypervisor, or platform.
In Figure 1, you can see orchestration on Windows and Linux servers across multiple hosting platforms and AWS Regions using AWS Systems Manager.
Figure 1 – Orchestration on Windows and Linux servers using AWS Systems Manager.
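As a rough illustration of running the same task across Windows and Linux nodes, the sketch below groups managed nodes by platform and builds one Run Command request per platform. It assumes the boto3 SSM client and the AWS-managed documents AWS-RunShellScript and AWS-RunPowerShellScript; the helper names and sample commands are hypothetical and are not Onica's actual tooling.

```python
# Sketch only: the document names are AWS-managed SSM documents; the helper
# names and the commands passed in by the caller are hypothetical.
SHELL_DOCUMENTS = {
    "Linux": "AWS-RunShellScript",
    "Windows": "AWS-RunPowerShellScript",
}


def online_nodes_by_platform(instance_info):
    """Group managed nodes that report an Online heartbeat by platform type."""
    groups = {}
    for node in instance_info:
        if node.get("PingStatus") != "Online":
            continue  # skip nodes whose agent has stopped checking in
        groups.setdefault(node["PlatformType"], []).append(node["InstanceId"])
    return groups


def build_send_command_requests(instance_info, commands_by_platform):
    """Build one send_command request per platform that has commands defined."""
    requests = []
    grouped = online_nodes_by_platform(instance_info)
    for platform, node_ids in sorted(grouped.items()):
        if platform not in SHELL_DOCUMENTS or platform not in commands_by_platform:
            continue
        requests.append({
            "InstanceIds": node_ids,
            "DocumentName": SHELL_DOCUMENTS[platform],
            "Parameters": {"commands": list(commands_by_platform[platform])},
        })
    return requests


def run_everywhere(ssm_client, commands_by_platform):
    """Send the commands to every online node and return the command IDs.

    Reads only the first page of describe_instance_information; a larger
    fleet would need pagination and batching of instance IDs.
    """
    info = ssm_client.describe_instance_information()["InstanceInformationList"]
    return [
        ssm_client.send_command(**request)["Command"]["CommandId"]
        for request in build_send_command_requests(info, commands_by_platform)
    ]
```

With credentials configured, this could be called as run_everywhere(boto3.client("ssm"), {"Linux": ["uptime"], "Windows": ["Get-Date"]}). Passing the client in as an argument keeps the request-building logic testable without touching AWS.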
At Onica, we capture instance-level metrics using Amazon CloudWatch, whose agent collects custom and standard metrics from these instances and sends them to CloudWatch. We configured CloudWatch Alarms on specific metrics (CPUUtilization, RAM, DiskWriteOps, etc.) that are deemed critical for customer workloads to function effectively.
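An alarm of the kind described here might be defined along the following lines. This is a minimal sketch using the boto3 CloudWatch put_metric_alarm call; the threshold, period, alarm naming scheme, and SNS topic are illustrative assumptions rather than values from this post.

```python
def build_cpu_alarm(instance_id, sns_topic_arn, threshold_percent=90.0):
    """Build put_metric_alarm arguments for a high-CPU alarm on one instance.

    The threshold, period, and naming scheme are illustrative assumptions.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 2,   # consecutive breaching windows before ALARM
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this topic on ALARM
    }


def create_alarm(cloudwatch_client, alarm):
    """Create or update the alarm using a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**alarm)
```

Memory and disk metrics collected by the CloudWatch agent are published under a custom namespace, so an alarm on those would use that namespace and its dimensions instead of AWS/EC2.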
Alarm Enrichment Solution
One challenge we faced was meeting customer requirements for real-time notification to stakeholders with relevant Windows/Linux OS and application-level health data when these CloudWatch Alarms are triggered. In response, our team at Onica developed an Alarm Enrichment solution that utilizes Amazon CloudWatch and AWS Systems Manager services in a hybrid environment.
This solution utilizes AWS Systems Manager to collect additional information about the impacted system and includes that information in the ticket to an engineer.
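The enrichment flow can be sketched roughly as follows. This is not Onica's implementation; it assumes the alarm arrives as a CloudWatch Alarm notification delivered through Amazon SNS (whose JSON carries the instance ID under Trigger.Dimensions), that the target is a Linux node, and that the diagnostic commands shown are a reasonable stand-in. All function names are hypothetical.

```python
import json
import time


def instance_id_from_alarm(sns_message):
    """Extract the InstanceId dimension from a CloudWatch Alarm SNS message."""
    alarm = json.loads(sns_message)
    for dimension in alarm["Trigger"]["Dimensions"]:
        if dimension["name"] == "InstanceId":
            return dimension["value"]
    return None


def collect_diagnostics(ssm_client, instance_id, commands, polls=10, delay=2.0):
    """Run diagnostic commands through Run Command and return their output."""
    sent = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # assumes a Linux target
        Parameters={"commands": list(commands)},
    )
    command_id = sent["Command"]["CommandId"]
    for _ in range(polls):
        time.sleep(delay)  # give the agent time to register the invocation
        result = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if result["Status"] not in ("Pending", "InProgress", "Delayed"):
            return result["StandardOutputContent"]
    return "(diagnostics not available before timeout)"


def enrich_alarm(sns_message, ssm_client, commands=("uptime", "df -h"), **poll_options):
    """Return ticket text combining the alarm name with host diagnostics."""
    alarm = json.loads(sns_message)
    instance_id = instance_id_from_alarm(sns_message)
    if instance_id is None:
        return f"{alarm['AlarmName']}: no InstanceId dimension found"
    details = collect_diagnostics(ssm_client, instance_id, commands, **poll_options)
    return f"{alarm['AlarmName']} on {instance_id}\n{details}"
```

In practice a function like enrich_alarm would typically run in something subscribed to the alarm's SNS topic, with the returned text attached to the ticket sent to an engineer.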
Automated Runbook Execution
A typical operational challenge is the timely and proper execution of runbooks during an incident or maintenance exercise. Depending on the number of impacted systems, there may be a large number of engineers involved in remediation. More critically, there’s the risk of human error due to the improper following of SOPs.
Traditional solutions to these problems involve custom scripts or third-party orchestration software. These solutions often have large price tags or require separate efforts to maintain complex systems in and of themselves. They also don’t scale into the cloud very well, as they were not designed for such dynamic environments.
The goal for automated runbook execution is to reduce the engineering effort and any downtime associated with customer application failures. This can be achieved in a cloud-native manner by using AWS Systems Manager and Amazon CloudWatch. We monitor the CloudWatch Logs for specific values or patterns using CloudWatch Alarms to detect abnormal application or process-level errors and utilize AWS Systems Manager to perform the remediation activities.
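One way to wire this together is sketched below: a CloudWatch Logs metric filter turns matching log lines into a metric that an alarm can watch, and a small dispatcher maps the alarm name to remediation commands sent through Run Command. The log pattern, namespace, alarm name, and restart command are hypothetical placeholders, and the mapping from alarm to runbook is an assumption about how such a system might be organized.

```python
def build_metric_filter(log_group, pattern, metric_name, namespace="Custom/AppErrors"):
    """Build put_metric_filter arguments that count log events matching a pattern."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,  # hypothetical custom namespace
            "metricValue": "1",            # each matching event adds 1
        }],
    }


# Hypothetical mapping from alarm name to the shell steps of its runbook.
RUNBOOKS = {
    "app-oom-errors": ["sudo systemctl restart example-app"],
}


def remediate(ssm_client, alarm_name, instance_id, runbooks=RUNBOOKS):
    """Run the runbook mapped to an alarm; return the command ID, or None."""
    commands = runbooks.get(alarm_name)
    if commands is None:
        return None  # no automated runbook, so leave the alarm to an engineer
    sent = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # assumes a Linux target
        Parameters={"commands": list(commands)},
    )
    return sent["Command"]["CommandId"]
```

The filter arguments would be passed to the boto3 CloudWatch Logs client as logs_client.put_metric_filter(**build_metric_filter(...)), with an alarm then defined on the resulting metric. Keeping the runbook steps in data rather than in an SOP document is what removes the manual step from the incident path.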
To view the use cases and more details, please click here.
AWS Systems Manager simplifies hybrid cloud management. It makes the oversight of thousands of instances and virtual machines running over eight different operating systems no more challenging than the management of a few instances running in a single Availability Zone.
For our team at Onica, this has resulted in deprecating previous hybrid management solutions from third parties that are costly to implement and maintain.
With AWS Systems Manager, we have also been able to reduce weekly helpdesk tickets and automated alerts by 5%, translating to a roughly 10% reduction in the human effort required to support the same number of resources. We foresee additional savings over time as operational efficiency continues to increase and new workloads are launched with these strategies.