AI-powered Observability for AWS EKS
Dec 3, 2021. 10 min
We are excited to announce an AI-powered observability and monitoring solution for applications running on AWS EKS (Elastic Kubernetes Service). With CloudAEye, one can now add intelligence to their SRE/DevOps practices. Users can seamlessly monitor logs and metrics in real time with state-of-the-art anomaly detection and root cause analysis.
Last year, Google announced GKE Autopilot, which enables automatic management of a GKE Kubernetes cluster. However, application management remained the responsibility of the application developer. With AI support for AWS EKS from CloudAEye, application management on EKS has become intelligent and a whole lot simpler.
Kubernetes is the de facto open-source orchestration framework for deploying, scaling, and managing containerized applications. AWS EKS is a managed, certified Kubernetes-conformant service for running Kubernetes on AWS. The EKS service offers a managed control plane that automatically handles the availability and scalability of the Kubernetes API servers and the etcd persistence layer. EKS also lets you create, update, scale, and terminate nodes in managed node groups for your Kubernetes cluster.
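To make the managed node group idea concrete, here is a minimal sketch of attaching a node group to an existing EKS cluster with the AWS SDK for Python (boto3). This is an illustration, not necessarily how the demo cluster below was provisioned; the role ARN, subnet IDs, region, and instance type are placeholders.

```python
import boto3

# Placeholder identifiers -- replace with values from your own AWS account.
CLUSTER_NAME = "demo-chaos-cluster"
NODE_ROLE_ARN = "arn:aws:iam::123456789012:role/eks-node-role"
SUBNET_IDS = ["subnet-aaaa1111", "subnet-bbbb2222"]

eks = boto3.client("eks", region_name="us-east-1")

# The managed control plane is created separately (console, eksctl, etc.);
# here we only attach a managed node group that EKS scales for us.
eks.create_nodegroup(
    clusterName=CLUSTER_NAME,
    nodegroupName="demo-workers",
    nodeRole=NODE_ROLE_ARN,
    subnets=SUBNET_IDS,
    instanceTypes=["t3.large"],
    scalingConfig={"minSize": 2, "maxSize": 5, "desiredSize": 3},
)

# EKS reports the node group status while the nodes provision and join the cluster.
status = eks.describe_nodegroup(
    clusterName=CLUSTER_NAME, nodegroupName="demo-workers"
)["nodegroup"]["status"]
print(f"Node group status: {status}")
```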
Observability is a daunting task for organizations with large microservices architectures. Tracking the millions of logs and hundreds of metrics and key performance indicators (KPIs) generated every hour requires significant resources and human effort. Recruiting individuals with the SRE/DevOps expertise to serve this purpose is also challenging and expensive.
Neglecting the need for an efficient observability platform can wreak havoc in an organization when there is downtime. In today’s fast-paced environment, downtime can prove fatal. Having a system that can foresee a failure, narrow down the problem when it occurs, and come up with solutions to fix it is crucial.
Vanilla rule-based alert systems can hardly be of use during such times, and deciphering the cacophony of thousands of alarm bells ringing during a failure might actually hinder progress. Besides, because of the spammy nature of alert systems crying wolf even during minor inconveniences, developers often prefer to shut them down and turn a deaf ear to their warnings.
CloudAEye’s observability offering is a system that not only alerts the user to anomalous behaviour, but also picks out the relevant log messages from the ocean of logs and encapsulates all the details needed to get to the root cause of the problem. Along with the identified anomalous log messages and details of their origin, we also surface the top metrics that reflect the anomaly, giving a better understanding of the overall anomalous behaviour. The predictions are not rule based: the AI models combine information from both logs and metrics to decide what is anomalous. As a result, the system outperforms rule-based alerting and provides reliable, valuable information about an anomaly.
In this blog I will explain how I set up a sample application on EKS and ran a load generation script to simulate a real production environment. I then installed the agents for CloudAEye’s Logs and Metrics Services to export logs and metrics, and deployed the AI analyzers on the cluster. To simulate a real error scenario, I used chaos engineering to inject a network-related fault into the system. When the fault was injected, the AI analyzers immediately detected it, and I will walk through the details of the anomaly as displayed in the dashboards.
Spring PetClinic is a sample application designed to show how the Spring Framework can be used to build simple but powerful database-oriented applications. This version is built with Spring Boot and MySQL. The project is available on GitHub as Spring Petclinic Cloud.
Picture 0: PetClinic Architecture
The application consists of a number of microservices, each covering a part of the clinic’s functionality.
Picture 1: Screenshot of PetClinic’s UI
I hosted the application in a cluster named demo-chaos-cluster in EKS.
To simulate a real production environment, I used a load generation script written in Python to perform different activities that mimic users’ behaviour. This way I could generate sufficient log and metric data for the different ways in which one can use the application; a minimal sketch of such a script is shown after the screenshot below.
Picture 2: Load Generator Mimicking User’s Behaviour
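The actual load generator is not published with this post, but the idea can be sketched roughly as follows. The gateway URL and the PetClinic REST paths used here are assumptions for illustration, not the exact endpoints from my setup.

```python
import random
import time

import requests

# Assumed URL of the PetClinic API gateway exposed by the cluster.
BASE_URL = "http://demo-chaos-cluster.example.com"

# A few representative user actions against (assumed) PetClinic REST endpoints.
ACTIONS = [
    "/api/customer/owners",         # browse pet owners
    "/api/vet/vets",                # list veterinarians
    "/api/customer/owners/1/pets",  # view a specific owner's pets
]

def simulate_user(session: requests.Session) -> None:
    """Perform one random user action and log the outcome."""
    path = random.choice(ACTIONS)
    try:
        resp = session.get(BASE_URL + path, timeout=5)
        print(f"{path} -> {resp.status_code}")
    except requests.RequestException as exc:
        # Failed requests still generate useful error logs and metrics.
        print(f"{path} -> error: {exc}")

if __name__ == "__main__":
    with requests.Session() as session:
        while True:
            simulate_user(session)
            # Random think time so the traffic pattern looks organic.
            time.sleep(random.uniform(0.2, 2.0))
```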
Logs and Metrics Services are managed services that make the collection, parsing, storage, and analysis of logs and metrics easier. They enable operating and analyzing logs and metrics from distributed cloud applications at scale in a highly cost-effective manner, and they are built on fully open-source search and analytics engines.
The services can be created easily from the CloudAEye console. When the services are created, the user is prompted for details of their cluster, after which the commands to install the agents are made available.
Once installed in the cluster, the agents export logs and metrics to Elasticsearch and Prometheus endpoints, respectively.
Picture 3: Logs Service UI on CloudAEye’s Console
I created Logs and Metrics Services for the Spring PetClinic application and set up the agents using the commands provided in the console.
Picture 4: Setting up PetClinic for Installing Agents
Picture 5: CloudAEye Agent Installed on the Cluster
Once the agents are installed, both system and application logs and metrics are exported; they can be viewed via the provided OpenSearch link and on the CloudAEye Metrics console. Additional information on the system and application metrics is available on the provided Grafana dashboard. The sketches after the screenshots below show how these endpoints can also be queried directly.
Picture 6,7: Grafana Dashboard with Cluster and Container Metrics
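The cluster and container metrics behind these dashboards sit on a Prometheus-compatible endpoint, so they can also be pulled programmatically. Below is a minimal sketch using the standard Prometheus HTTP API; the endpoint URL and the namespace label are assumptions, not the actual values provisioned by the service.

```python
import requests

# Assumed Prometheus-compatible endpoint that the metrics agent ships data to.
PROMETHEUS_URL = "https://prometheus.example.com"

# Average CPU usage per pod over the last 5 minutes (standard cAdvisor metric).
query = 'sum(rate(container_cpu_usage_seconds_total{namespace="spring-petclinic"}[5m])) by (pod)'

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    timeout=10,
)
resp.raise_for_status()

# Each result carries the pod label and an instant value [timestamp, value].
for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "<unknown>")
    _, cpu = result["value"]
    print(f"{pod}: {float(cpu):.3f} CPU cores")
```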
Picture 8: Elasticsearch with System and Application Logs
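The logs can likewise be queried straight from the OpenSearch endpoint. This is a minimal sketch with the opensearch-py client; the host, credentials, index pattern, and field names are placeholders rather than the actual values created by the logs service.

```python
from opensearchpy import OpenSearch

# Assumed OpenSearch endpoint and credentials provided by the logs service.
client = OpenSearch(
    hosts=["https://opensearch.example.com:9200"],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Fetch the most recent error-level application log lines (index/field names assumed).
response = client.search(
    index="application-logs-*",
    body={
        "size": 10,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"match": {"log": "ERROR"}},
    },
)

for hit in response["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("log"))
```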
Logs and Metrics Analyzers are AI-powered managed services that surface anomalous logs and metrics from applications and provide actionable operational insights. They intelligently analyze the data in near real time using machine learning (ML) and deep learning (DL) models and can detect errors or other anomalous activity in user applications. This helps reduce the MTTD (mean time to detect).
I deployed the AI analyzers for the PetClinic application using the CloudAEye console.
Picture 9: Analyzers Deployed for Petclinic
Now that the analyzers and agents are set up, I will inject a fault to simulate a real production incident and observe the AI’s ability to detect it and provide insights.
For this demonstration, I injected a network-related fault, Network Delay, using Chaos Mesh.
From their website: Chaos Mesh is an open source cloud-native Chaos Engineering platform. It offers various types of fault simulation and has an enormous capability to orchestrate fault scenarios. Using Chaos Mesh, you can conveniently simulate various abnormalities that might occur in reality during the development, testing, and production environments and find potential problems in the system.
Network Delay is a fault that adds latency to the network connections of the target Pods. I targeted the database Pod and injected a delay into it. As a result, connections between the database and the other services were effectively cut off, and the database could no longer be reached. A sketch of such an experiment definition follows the screenshot below.
Picture 10: Deploying Network Delay
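Chaos Mesh experiments are plain Kubernetes custom resources, so a delay can be applied either from the Chaos Mesh dashboard or from code. Below is a minimal sketch that creates a NetworkChaos delay against the database Pod using the official Kubernetes Python client; the namespace, label selector, latency, and duration are assumptions about how PetClinic is deployed, not the exact experiment I ran.

```python
from kubernetes import client, config

# A NetworkChaos resource that adds latency to the PetClinic database Pod.
# Namespace and labels are assumptions -- adjust to match your deployment.
network_delay = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "db-network-delay", "namespace": "spring-petclinic"},
    "spec": {
        "action": "delay",
        "mode": "one",
        "selector": {
            "namespaces": ["spring-petclinic"],
            "labelSelectors": {"app": "mysql"},
        },
        "delay": {"latency": "2s"},
        "duration": "10m",
    },
}

config.load_kube_config()  # uses the local kubeconfig for the demo cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="spring-petclinic",
    plural="networkchaos",
    body=network_delay,
)
```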
The system was down and users started running into problems. The load generator script kept sending requests but could not proceed any further. The metrics started fluctuating and the logs endpoint filled up with error messages.
Within a matter of minutes I received an email notification from CloudAEye stating that anomalous behaviour had been detected in my application’s logs and metrics.
Picture 11, 12: Logs and Metrics Alert Notifications
The metrics alert directed me to the Grafana dashboard and the logs alert directed me to an anomaly dashboard with the details of the detected anomaly:
Picture 13: Single Anomaly Dashboard Overview
The anomalous log messages were caught along with additional information such as log counts, confidence levels, and details behind each message: the service name, container name and ID, namespace, timestamp, and so on.
Picture 14: Erroneous Logs Caught by the Model with Details
From the log messages we can clearly see that there is a communication link failure between the customer service and the database. Other services, such as vets-service, ran into similar errors and timeouts. In other words, the model detected a communication failure between the database and the rest of the services. Cross-checking with the top KPIs surfaced by the metrics analyzer, we can see that the HTTP-related metrics reflect the anomaly. Combining the information from the logs and metrics analyzers, we successfully identify a network issue between the services and the database.
Picture 15: Metrics Console Displaying the Anomalous Metrics
The view above shows only the error detected at that particular time; we can also inspect the overall health of the system from the anomaly dashboard.
Picture 16: Dashboard Overview
Picture 17: Distribution of Anomalies Over Time and Services
Clicking on a particular highlighted box leads to the details of that particular anomaly.
Picture 18: Anomalous Logs Message Counts with Confidence Levels
Clicking on a particular highlighted bar leads to the details of that particular anomaly.
Picture 19: Donut Chart of Confidence Levels of Overall Anomalies Detected
Picture 20: Error History
Picture 21: Anomaly Timeline
The logs timeline shows when the anomalies detected by the model occurred, together with their log counts.
Detecting this issue and analyzing its root cause took only a matter of minutes. The erroneous log messages were highlighted along with the details of the affected services, and the top anomalous KPIs were displayed alongside. This saves time and money by reducing the mean time to detect, and thus the overall time to get a system back up and running.
Picture 22: CloudAEye SaaS Sign up Page