Bayesian Optimization and Hyperparameter Tuning
Nov 12, 2021. 12 min
I am happy to announce AI-powered observability support for serverless applications running on AWS Lambda. My colleagues talked about observability for Lambda earlier. In this post, I will share how we are using AI to support AWS Lambda use cases.
With its ease of use and pay-per-invocation pricing, AWS Lambda has grown in popularity and is widely used in both small and large organizations. Although convenient and cost effective, Lambda comes with its own set of caveats, some of which have no concrete solution yet. One such obstacle is providing convenient, automated observability for monitoring applications. Observability for serverless applications has always been a challenge, and it becomes considerably harder as the applications grow in functionality.
An automated, AI-driven observability platform that takes the load off a human’s shoulders while being cost effective and time efficient is something many developers want. CloudAEye’s Lambda service bridges the gap between running a complex production serverless environment in AWS Lambda and having an efficient, automated observability platform to visualize its health and activity and to help maintain and debug it. The service also includes state-of-the-art AI-powered analyzers for detecting anomalies, errors, and abnormal activity, and for narrowing down the root cause of such behavior in your applications.
Picture 1: CloudAEye Lambda dashboard
Creating a logs service in CloudAEye enables our agent to collect your application and system logs. Our visualizations provide detailed information while remaining easy to use and follow. We also visualize the application metrics for each Lambda function, such as execution duration, memory usage, estimated cost, and number of invocations.
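To make the per-function metrics concrete, here is a minimal sketch of how duration, memory usage, and an estimated cost can be derived from the REPORT line that the Lambda runtime writes to CloudWatch Logs at the end of each invocation. This is not how the CloudAEye agent works internally. The function name `parse_report` and the price constant are my own assumptions for illustration, so check current AWS pricing for your region and architecture.

```python
import re

# Pattern for the REPORT line Lambda emits at the end of each invocation.
# Fields are separated by tabs in practice; \s+ tolerates tabs or spaces.
REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_mb>\d+) MB"
)

# ASSUMPTION: illustrative price per GB-second, not an authoritative figure.
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667


def parse_report(line):
    """Return a dict of metrics for a REPORT line, or None for other lines."""
    m = REPORT_RE.search(line)
    if m is None:
        return None
    billed_ms = float(m["billed_ms"])
    memory_mb = int(m["memory_mb"])
    # Compute cost is billed on configured memory (GB) times billed time (s).
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return {
        "request_id": m["request_id"],
        "duration_ms": float(m["duration_ms"]),
        "billed_ms": billed_ms,
        "memory_mb": memory_mb,
        "max_memory_mb": int(m["max_memory_mb"]),
        # Excludes the per-request charge; compute portion only.
        "estimated_cost_usd": gb_seconds * ASSUMED_PRICE_PER_GB_SECOND,
    }
```

Calling `parse_report` on every log line and keeping the non-`None` results gives one metrics record per invocation, which can then be aggregated per function.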
Compared to AWS’s existing logs solution on CloudWatch, our UI is more intuitive and addresses its usability issues. As developers ourselves, we know how daunting CloudWatch can be to use, so we designed our own solution for your convenience.
CloudAEye AI models detect anomalies in your Lambda-based applications. Once an anomaly is detected, you are notified via email and SMS. The runtime context of the system, along with the relevant details of the anomaly, is highlighted in the Lambda anomaly dashboard.
Tracking the behavior of applications at a low level is an arduous task, with millions of logs and noisy data flowing into your system every day. A user-friendly observability frontend with abstract visualizations for a large-scale production environment does not really help either.
Our AI analyzers constantly process each log message, track behavior, and monitor all system activity. If anything goes wrong, the AI models detect it in a matter of minutes and you are notified as early as possible.
In many cases, bugs in production go unnoticed until they start affecting performance significantly. By the time a bug is detected, it may have become part of the architecture, and removing it safely is risky and time consuming. Being notified of anomalous behavior early helps you nip problems in the bud and prevents them from having a large-scale impact on your overall system. Your observations, response times, and ability to debug and fix errors improve significantly, and your downtime is reduced, saving you time and money.
Deploying our AI on your applications will help you to:
Picture 2: CloudAEye Lambda Anomaly Dashboard
For demonstration purposes, I used the “Serverless Shopping Cart Microservice” app (https://github.com/aws-samples/aws-serverless-shopping-cart), a sample application provided by AWS that demonstrates how to implement a shopping cart microservice using AWS Lambda. This application uses AWS services such as AWS Lambda, Amazon API Gateway, Amazon Cognito, Amazon DynamoDB and AWS Amplify.
I deployed the app and set up a controller that automatically mimics real-time user behavior. It performed different sets of activities, such as adding items to the cart, listing the cart, removing items from the cart, checking out the cart, and proceeding with payment. Logs and metrics were thus continuously generated for each of the Lambda functions.
After deploying the app, I installed an agent to set up logs and metrics streaming. The step-by-step guide for starting a service and setting up an agent is available in the CloudAEye console. Similarly, I created Logs and Metrics AI analyzers.
Below is the command I used to set up the agent:
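The controller itself is not shown in this post, but the idea can be sketched in a few lines. The example below is a hypothetical stand-in, not the controller I ran: the action names, their weights, and the `simulate_user` function are all assumptions, and the actual request is abstracted behind a `send` callable so the sketch runs without any network access.

```python
import random
import time

# HYPOTHETICAL action names and weights, loosely mirroring the activities
# described above. The real controller and its endpoints are not shown here.
ACTIONS = {
    "add_to_cart": 0.35,
    "list_cart": 0.30,
    "remove_from_cart": 0.15,
    "checkout_cart": 0.12,
    "pay": 0.08,
}


def simulate_user(send, steps, rng, pause_seconds=0.0):
    """Perform `steps` weighted-random actions via `send`; return the actions.

    `send` is any callable taking an action name. In a real setup it would
    issue an HTTP request to the matching API Gateway route.
    """
    names = list(ACTIONS)
    weights = list(ACTIONS.values())
    performed = []
    for _ in range(steps):
        action = rng.choices(names, weights=weights, k=1)[0]
        send(action)
        performed.append(action)
        time.sleep(pause_seconds)  # spacing between simulated user actions
    return performed
```

For a dry run, `simulate_user(print, steps=5, rng=random.Random(42))` prints five action names. Passing a seeded `random.Random` keeps a session reproducible, which helps when comparing dashboards across runs.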
python3 aws_lambda_agent.py \
  --agent-mode create \
  --lambda-function-names 'GetProducts,GetProduct,CheckoutCart,MigrateCart,ListCart,AddToCart,UpdateCart,GetCartTotal,CartDBStream,DeleteFromCart' \
  --lambda-aws-region us-east-2 \
  --enable-logs 'true' \
  --cloud-env aws \
  --logs-destination XXXX \
  --destination-http-url XXXX \
  --app-name 'lambda-cart' \
  --app-key LC \
  --user-key XXXX \
  --user-secret XXXX
Picture 3: Architecture of Lambda Shopping Cart
Once the agent is installed, the logs are streamed and the Lambda serverless dashboard is set up.
The serverless dashboard (refer to picture 1 above) contains visualizations and information for the various metrics and behaviors of each Lambda function, along with the logs.
Picture 4: Logs Service
To mimic a real-time error in the production environment, I created a backup of the database and then took the existing one down for some time. Since the controller was still running, the automated users who tried to access the website were unable to do so. The resulting application failures also showed up in the logs.
Within minutes, I received an email (see picture 5) stating that anomalous behavior had been detected in my system.
Picture 5: Anomaly Email Alert
On clicking the link provided in the alert, I was redirected to the Lambda anomaly dashboard, which had several visualizations of the anomaly. The screenshots displayed below cover the injection of two faults and the anomalous behavior detected by the AI models over that time range:
The anomaly dashboard clearly shows the relevant log messages for an anomaly. This helps to pinpoint issues quickly. Refer to an example in picture 6 below.
Picture 6: Logs Messages for an Anomaly
The dashboard also shows which metrics were actually affected when the error occurred. From this view (see picture 7), we can see that the metric behaving abnormally was memory related.
Picture 7: Metrics for an Anomaly
The frequency of erroneous logs predicted in a time interval tells us that a critical issue is at hand and that it needs to be fixed immediately. See picture 7 above.
The anomaly count, with a confidence level for each anomaly, is displayed. Clicking on one of the errors redirects the user to the dashboard for that particular anomaly. See picture 8.
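As a rough illustration of what "frequency of erroneous logs per time interval" means, the sketch below buckets ERROR-level records into fixed-width windows and flags windows that reach a count threshold. The fixed threshold is only a stand-in for the AI models, which this post does not describe in detail. The function names and the `(timestamp, level)` record shape are assumptions for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def error_counts_per_interval(records, interval_seconds=60):
    """Count ERROR records per fixed-width time bucket.

    `records` is an iterable of (timestamp, level) pairs, where each
    timestamp is a timezone-aware datetime. Returns a dict mapping the
    bucket start (epoch seconds) to the number of ERROR records in it.
    """
    counts = Counter()
    for ts, level in records:
        if level == "ERROR":
            epoch = int(ts.timestamp())
            counts[epoch - epoch % interval_seconds] += 1
    return dict(counts)


def buckets_over_threshold(counts, threshold):
    """Return the sorted bucket starts whose count is at least `threshold`."""
    return sorted(start for start, n in counts.items() if n >= threshold)
```

A sudden run of flagged buckets, such as the one produced when the database was taken down, is the kind of spike that the error frequency chart makes visible.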
Picture 8: Anomaly Count
You may sign up for a 30-day free trial at console.cloudaeye.com/auth/register-page
Picture 9: CloudAEye SaaS Sign up Page