Observability and Monitoring of Applications and Services in the Cloud
Becoming cloud ready is one thing; continuously delivering and maintaining that readiness takes a lot of effort. In the cloud native approach to software development we tend to build the infrastructure by assembling distinct resources (e.g. compute instances, memory, disks, network interfaces, IP addresses). This makes the architecture of the infrastructure complex, and you need to monitor these resources together to know the overall health of the infrastructure.
We also tend to break a monolithic application or software system into microservices so that we can change each component independently and bring those changes to production faster. The microservices architecture and the communication between the services make the system more complex. On top of that, we use off-the-shelf services such as databases, messaging, identity management and user management so that we can focus on building the business logic. These dependent services make the whole architecture even more complex, and you need to constantly monitor their interaction with the services that implement your business logic as well.
Monitoring the status and health of the whole system and of each individual component is crucial given the steep availability promises we make to our customers. We need to build the product to be cloud ready, resilient and fault tolerant, and to achieve that we must observe the status of the system, monitor the infrastructure and service components, and react fast to actual or potential issues. First you need to know which tools exist, what you need to monitor and how, how you can get notified about current issues, and how you can act fast. In this article I will explain a few tools, metrics and practices that cloud native developers use to build their observability stack and monitor their systems.
Several observability tools help us monitor the cloud infrastructure and software components and learn about incidents as they happen. All cloud providers have their own tools. AWS offers CloudWatch for metrics, logs and alarms, and CloudTrail for audit logs of API activity. On Azure, Azure Monitor plays a role similar to CloudWatch, and Log Analytics is used to query the collected logs. In GCP, Google Cloud's operations suite is used for both monitoring and logging. Generic tools that work on any cloud infrastructure include Dynatrace, Grafana, Prometheus, Kibana, Loki and Fluentd, to name only a few.
For example, Dynatrace can automatically install agents in the Kubernetes Pods or containers where an application is running, and a dashboard can be created from the metrics they collect. Applications or services can also be instrumented to gather and publish specific custom metrics to Dynatrace. Similarly, we can use Prometheus to collect the metrics and build dashboards, or use Grafana to build a dashboard from the Prometheus metrics. As with Dynatrace, custom metrics can be published to Prometheus as well. Kibana, on the other hand, is used for exploring application logs, and Loki can be used for the same purpose. Kibana can also create dashboards. Now the question is which metrics we need to track and build dashboards on. There are hundreds, if not thousands, of metrics and we can't monitor all of them. We need to come up with a bare minimum that is meaningful for monitoring the health of the whole system, including the cloud infrastructure components and the cloud native microservices that together make up the cloud product.
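To make the idea of a custom metric more concrete, here is a minimal sketch of what an instrumented service might expose for Prometheus to scrape. It uses only the Python standard library and renders a counter in the Prometheus text exposition format; in a real service you would more likely use an official client library. The LabeledCounter class, the metric name orders_processed_total and the label names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class LabeledCounter:
    """A tiny in-process counter with labels (hypothetical helper, not a real client library)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running total.
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        """Increase the counter for the given label combination."""
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = "{" + label_text + "}" if label_text else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


orders = LabeledCounter("orders_processed_total", "Orders processed by the service.")
orders.inc(region="eu", status="ok")
orders.inc(region="us", status="failed")
print(orders.render())
```

The rendered text is what a service would typically return from its metrics endpoint, where the collector picks it up at each scrape.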
The tools discover many metrics without the application or service having to instrument its code. These generic metrics are available for all kinds of applications and services. Many of them may not be useful for your particular software system, but at least a few can be monitored directly: CPU utilization, memory utilization, disk utilization and network I/O. By monitoring these resource metrics you can identify issues that might occur or have already occurred. Metrics chosen for this purpose are known as Service Level Indicators or SLIs. SLIs are defined by you as indicators that reveal the state of your system at a particular point in time.
The RED method examines these three metrics for a microservices architecture:
(Request) Rate - the number of requests per second your services are serving.
(Request) Errors - the number of failed requests per second.
(Request) Duration - the distribution of the amount of time each request takes.
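As a rough illustration of the three RED metrics, the sketch below summarizes a batch of request records observed in one time window. The RequestRecord shape and the function names are hypothetical. Counting every 4xx or 5xx response as a failure and using nearest-rank percentiles for the duration distribution are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty list, for 0 < p <= 100."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_metrics(records: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors and Duration for the requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    durations = sorted(r.duration_ms for r in records)
    # Assumption: any 4xx or 5xx response counts as a failed request.
    failed = sum(1 for r in records if r.status >= 400)
    return {
        "rate_per_s": len(records) / window_seconds,
        "errors_per_s": failed / window_seconds,
        "p50_ms": percentile(durations, 50) if durations else 0.0,
        "p95_ms": percentile(durations, 95) if durations else 0.0,
    }
```

Reporting percentiles rather than a single average keeps slow outliers visible, which is the reason Duration is described as a distribution.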
The RED method works only for request-driven services, however, and is of little help for batch jobs, streaming services or message processing. Google’s site reliability engineers (SREs) monitor four metrics instead: latency, traffic, errors and saturation. These are called the Four Golden Signals.
Latency is the time it takes to send a request and receive a response. It can be measured from both the client and the server side. You should track the latency of both successful and failed requests. Customers of cloud products expect fast responses, and we aim for the fastest response possible, so this is an important metric to measure. Response time needs to be taken into consideration while designing each component, for example by using CDNs and by caching data in memory instead of fetching it from data storage. Code changes can affect latency, so it needs to be monitored constantly. A heavy load on a component can also cause slow responses, and actions can be taken manually or dynamically in response, such as using a load balancer or scaling in and out dynamically. We can also monitor how much of the total latency each microservice in a workflow consumed.
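The per-service view of latency can be sketched as follows. The example assumes that for one request you already have the time spent in each service, for instance from tracing data, and that these timings do not overlap. The function name and the service names in the usage note are hypothetical.

```python
def latency_breakdown(spans: list[tuple[str, float]]) -> dict[str, float]:
    """Return each service's share (0.0 to 1.0) of the total latency of one request.

    Each span is a (service_name, duration_ms) pair. Assumption: spans are
    sequential and non-overlapping, so their durations add up to the total.
    """
    totals: dict[str, float] = {}
    for service, duration_ms in spans:
        totals[service] = totals.get(service, 0.0) + duration_ms
    overall = sum(totals.values())
    if overall == 0:
        return {service: 0.0 for service in totals}
    return {service: ms / overall for service, ms in totals.items()}


def slowest_service(spans: list[tuple[str, float]]) -> str:
    """Name of the service that consumed the largest share of the latency."""
    shares = latency_breakdown(spans)
    if not shares:
        raise ValueError("no spans given")
    return max(shares, key=lambda name: shares[name])
```

For spans such as ("gateway", 10), ("orders", 30) and ("database", 60), the database accounts for 60 percent of the request, which tells you where optimization or caching is most likely to pay off.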
Traffic is a measure of the number of requests flowing across the network; both HTTP requests and messages have to be taken into account. The cloud product, or certain components or microservices, might not be designed to handle traffic beyond a certain limit. For example, certain components can become very slow or even crash. If we monitor the traffic when such incidents occur, we have better information on why they occurred and can enhance the design of the components accordingly, or design for dynamic scalability depending on the traffic. But before that we need to be able to analyze how the components behave, individually and together, at a given level of traffic.
Errors can be tracked as the number of errors per unit of time or as an error rate. In a microservices architecture errors are mainly HTTP responses in the 4xx and 5xx families, but a crash also counts as an error. We can also track errors of hard dependencies of the applications and services, such as databases. The cause can be a bug in the code, a dependency that is not working properly, or a network error.
Saturation is the load on your network and server resources. Every service consumes a certain amount of resources, the main ones being CPU, memory, disk and I/O operations. Beyond a certain load these resources are exhausted and the service starts performing poorly.
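One simple way to express saturation is the ratio of used capacity to total capacity per resource. The sketch below flags resources that are close to exhaustion. The function names and the 80 percent limit are assumptions made for illustration; a sensible limit depends on the resource and on how quickly you can add capacity.

```python
def saturation(used: float, capacity: float) -> float:
    """Fraction of a resource that is currently in use (0.0 to 1.0 and above)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def saturated_resources(
    usage: dict[str, tuple[float, float]], limit: float = 0.8
) -> list[str]:
    """Names of resources whose saturation is at or above the limit.

    usage maps a resource name to a (used, capacity) pair in the same unit.
    The default limit of 0.8 is an illustrative assumption.
    """
    return sorted(
        name for name, (used, capacity) in usage.items()
        if saturation(used, capacity) >= limit
    )
```

In Python the disk figures could, for instance, come from shutil.disk_usage, which returns the total, used and free bytes of a path.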
Often these generic metrics only give you an indication of what is going on or how the whole setup is working. To track down individual issues or failed workflows and understand why they failed, you need to build custom metrics that are specific to the business case and propagate them to the metrics collector you are using. And then, of course, you need to drill down into specific logs.
Sometimes not all the information you need is available in these tools, for example if you would like to include some metadata about your environment along with the metrics. For this reason tools such as Dynatrace and Prometheus provide APIs that you can use to extract data points from them. You can then combine these data points with your metadata and derive new data points or insights.
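As a sketch of this idea, the example below reads an instant query from the Prometheus HTTP API (the /api/v1/query endpoint) and merges each returned sample with environment metadata that you maintain yourself. The helper names, the metadata fields and the choice of the instance label as the join key are assumptions made for illustration, and the base URL has to point at your own Prometheus server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """URL of an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def fetch_instant_query(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run the query and return the decoded JSON response (needs a reachable server)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        return json.load(response)


def enrich(response: dict, metadata: dict[str, dict], join_label: str = "instance") -> list[dict]:
    """Combine each sample of an instant-vector result with your own metadata.

    metadata maps a label value (by default the instance) to extra fields.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    rows = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, raw_value = sample["value"]  # the value arrives as a string
        extra = metadata.get(labels.get(join_label, ""), {})
        rows.append({**labels, **extra, "value": float(raw_value)})
    return rows
```

The enriched rows can then be grouped by your own fields, for example by team or environment, which the monitoring tool itself may not know about.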
After you identify the SLIs that you are going to monitor, you need to define their upper and/or lower limits, in other words the thresholds for these SLIs. If the SLIs stay within the defined thresholds, you meet your objective. If the thresholds are crossed, your objective is violated and you need to take action to ensure that the objective is met again. Defining a threshold makes it clear whether the objectives are met. These objectives are known as Service Level Objectives or SLOs. For example, your SLO can be that each customer query generated from your SaaS application is served within 200 milliseconds. You then set the threshold for latency or response time at 190 milliseconds, leaving room to act when the threshold is crossed while still meeting your SLO. You constantly monitor the SLI, and when the threshold is crossed or an SLO is breached you take action. The action can be defined in advance or after the incident has been investigated, and it can be taken manually or automatically.
Service Level Agreements or SLAs, on the other hand, are SLOs agreed with your customers. When an SLA is breached, your customer is legally entitled to impose penalties on you. For this reason SLAs are generally less restrictive than SLOs. In terms of the previous latency example, our SLA could be 250 milliseconds.
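The relationship between the threshold, the SLO and the SLA in the latency example can be sketched in a few lines. The 190, 200 and 250 millisecond values come from the example above; the Status names and the function are hypothetical.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"            # threshold crossed, SLO still met
    SLO_BREACHED = "slo_breached"  # internal objective missed, SLA still met
    SLA_BREACHED = "sla_breached"  # agreement with the customer violated


def classify_latency(
    latency_ms: float,
    threshold_ms: float = 190.0,
    slo_ms: float = 200.0,
    sla_ms: float = 250.0,
) -> Status:
    """Classify a measured latency SLI against the threshold, the SLO and the SLA."""
    if not threshold_ms <= slo_ms <= sla_ms:
        raise ValueError("expected threshold <= SLO <= SLA")
    if latency_ms > sla_ms:
        return Status.SLA_BREACHED
    if latency_ms > slo_ms:
        return Status.SLO_BREACHED
    if latency_ms > threshold_ms:
        return Status.WARNING
    return Status.OK
```

The gap between 190 and 200 milliseconds is the room to react before the objective is missed, and the gap between 200 and 250 milliseconds is the margin before the customer agreement is violated.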
After you know which metrics you are interested in and where they are available, you need to build dashboards with charts and graphs that reveal insights. These charts can be analyzed by data scientists or anyone else interested. Often you need time series charts to see how a metric behaved over a span of time.
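Dashboards usually do this aggregation for you, but it can help to see what a time series chart is built from. The sketch below groups raw (timestamp, value) samples into fixed-width time buckets and averages each bucket, which yields the points a chart would plot. The function name and the choice of the average as the aggregate are assumptions for illustration.

```python
from collections import defaultdict


def bucket_average(
    samples: list[tuple[float, float]], bucket_seconds: int
) -> list[tuple[int, float]]:
    """Average raw samples into fixed-width time buckets.

    samples holds (unix_timestamp, value) pairs in any order. The result is a
    list of (bucket_start, average_value) pairs sorted by time.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [running total, count]
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        sums[start][0] += value
        sums[start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]
```

A wider bucket smooths the chart and hides short spikes, so the bucket width is worth choosing with the metric in mind.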
Even if you build very expressive dashboards that reveal information about your whole system, someone still needs to look at them to understand what’s going on. Alerts remove that need. You can define thresholds on a metric or a combination of metrics, and define alerts on those thresholds. Whenever a threshold is reached or crossed, an alert is created automatically and sent to channels such as Slack, email or SMS. Dynatrace, Kibana and Prometheus all support alerting rules from which alerts are generated. Different tools have different levels of integration with other channels to propagate the alerts. Alertmanager is a tool that specializes in receiving, processing and distributing alerts, and it exposes an API through which alerts can be sent to it.
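As an illustration of sending an alert through the Alertmanager API, the sketch below builds an alert and posts it as a JSON list to the /api/v2/alerts endpoint. The helper names, the alert name, the label values and the base URL are assumptions; consult the API description of your Alertmanager version for the full set of fields.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_alert(name: str, severity: str, summary: str, starts_at: datetime) -> dict:
    """Build one alert in the shape Alertmanager accepts (labels plus annotations)."""
    return {
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": summary},
        # Pass a timezone-aware datetime; it is converted to UTC here.
        "startsAt": starts_at.astimezone(timezone.utc).isoformat(),
    }


def send_alerts(base_url: str, alerts: list[dict], timeout: float = 5.0) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller would typically build the alert when a threshold is crossed and pass it to send_alerts with the address of its own Alertmanager instance, which then takes care of routing it to Slack, email or other receivers.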
With the observability stack in place you are better equipped to detect and analyze any issue that may occur in your system. You can even take proactive action, manually or automatically, with these tools at hand.