Updated: Mar 14
The software you offer in the cloud has to be resilient, keeping in mind that it is not a monolith but an assembly of different components. Many of these components are used off the shelf and are not developed by you. Because you don't control all of them, you need to put extra thought into the resilience of the components you develop and own. One of the features that makes an application or service more resilient is its ability to cope with high and low demand, adjusting itself dynamically and using resources cost-effectively. In cloud native development, scalability has to be built into the architecture of each component and of the service as a whole. Here are a few aspects you need to think about, and act on, to build a scalable application or service in the cloud.
What is scalability? How to achieve scalability?
In the cloud, the component you build, whether it is an off-the-shelf reusable solution or a microservice that implements part of the business logic, will usually depend on other components deployed in the cloud. Yet other components will depend on yours. Irrespective of those other components, your component should be able to adjust itself, within a reasonable amount of time, to the amount of work it has to do.
Let’s take the example of a microservice that executes some business logic, serves certain business use cases, and is part of a workflow that touches three other microservices and a database of the cloud application. Each of these microservices has its own local database to keep some information about the data it processes. Some of them also depend on other services such as message queues and caches. The microservices communicate with each other through RESTful APIs. Your microservice needs CPU and memory to process requests, has to save some information about that processing to its local database, and then sends the processed data to the microservices downstream.
Now, when a high number of requests reaches your microservice, it might not be able to process them once a certain threshold of requests per second is crossed, because it has limited CPU and memory and can only perform a certain number of database operations per second. In this scenario, your microservice needs to be able to add extra resources to cope with the extra load. In the worst case the microservice crashes and cannot process any requests, which affects all customers and all workflows. There are ways to restart a crashed microservice automatically, but you should avoid the crash in the first place.
On the other hand, when the number of requests is low but your microservice holds on to extra resources in anticipation of a sudden surge, those resources are underutilized and cost you more. You need to reduce the resources you consume and make sure they don't sit idle. There are two types of scaling: vertical scaling and horizontal scaling. This blog on cloudzero.com gives a good definition of vertical and horizontal scaling and a good comparison between them. Another blog on missioncloud.com also provides ample details. In the cloud we often scale out and scale in, meaning we increase and decrease the number of instances of the application or service deployed.
It's important to know the service or product you offer and how it behaves in different circumstances, so that you can take action to scale out or scale in.
Assess the load you need to serve
At first you need to know how many requests or messages reach your service on an average day. This gives you a baseline from which you decide how many instances of your service to run, and with how much CPU and memory. Then you need to know the highest load the service must support, meaning the requests that arrive at peak time. From that you know how much extra processing power you need to add. You also need to analyze whether the load is constant or fluctuating and unpredictable. Sometimes the usage patterns of the users make the load predictable, e.g. high load during business hours from Tuesday to Thursday, average load at other times on those days and on Monday, and low load on weekends. To gather detailed information about the load the service experiences, your best bet is an observability stack, which can provide exact numeric data on which to base your decisions. If you want to know more about observability you can read this article I have written on observability.
Once you know the load pattern, you can also analyze how the service behaves in those circumstances using further metrics from your observability stack. For example, you may discover that when load rises above 85% the service's processing time slows down, and when load rises above 95% the service crashes. At this point you have to come up with a plan for handling these situations. Will you analyze the situation and act manually, or set up rules that make the decision dynamically? For example, add another instance of the service when load reaches 85% and remove one when load drops to 50%, while keeping a limit on the minimum and maximum number of instances that can run at the same time.
It is also important to decide how you will react and implement the scaling: will you get an alert and increase or decrease instances manually, or will the scaling be done automatically? Depending on the kind of cloud infrastructure you use, there are different tools to define dynamic scaling rules and perform scaling automatically.
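The threshold rule from the example above (add an instance at 85% load, remove one at 50%, within minimum and maximum bounds) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names ScalingRule and desired_instances and the maximum of 10 instances are hypothetical and do not belong to any platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    scale_out_at: float = 0.85  # add an instance at or above this load
    scale_in_at: float = 0.50   # remove an instance at or below this load
    min_instances: int = 1      # never go below this many instances
    max_instances: int = 10     # hypothetical upper limit, adjust to your platform


def desired_instances(current: int, load: float,
                      rule: ScalingRule = ScalingRule()) -> int:
    """Return the instance count to run next, given average load in [0, 1]."""
    if load >= rule.scale_out_at:
        target = current + 1
    elif load <= rule.scale_in_at:
        target = current - 1
    else:
        target = current
    # Clamp the result so it always stays within the configured bounds.
    return max(rule.min_instances, min(rule.max_instances, target))
```

Calling desired_instances(2, 0.9) asks for a third instance, while a load between the two thresholds leaves the count unchanged. Keeping a gap between the scale-out and scale-in thresholds helps avoid adding and removing instances back and forth.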
Assess your scalability limits
Even if you scale out rather than up, sooner or later you will hit your scalability limit, because resources are not infinite. The deployment environment can have a hard limit on how many VMs, containers or pods you can spin up. The consumable network bandwidth also has a limit, so you cannot assume that you can grow indefinitely even if there is no limit on how many application or service instances you run. The services you depend on, for example your downstream services or your database, also have a limit on how many requests they can process per second. Scaling out does not come free either: it has a cost because you consume more resources. On many cloud platforms there is simply an upper limit or budget that you are allowed to consume. So it is important to know the limits of the resources you can consume. In many cases the lower limit is 1 unit of resource and 1 unit of the application or service, but depending on the architecture and use case it may not always be 1 unit. You need to validate this assumption too.
More than knowing the limits of the resources you can consume, it is about knowing when you need to scale. The programming model you use and the amount of processing needed for each unit of work differ with each use case, application and service. The unit capacity of the resources you get in your cloud environment is also not always the same. For this reason you need statistical data that tells you how many units of work one instance of the application or service can handle at a time while staying within your Service Level Objective (SLO). For example, you may decide to process each HTTP request within 200 milliseconds. You may then notice that once you reach one million concurrent requests the processing time grows for all concurrent requests, or that your CPU and memory reach their limit, or that the maximum number of concurrent database connections is reached.
These events indicate that you need to scale upward. The same applies to reducing resource consumption: you might notice that after the concurrent requests drop to half a million your resources are underutilized. That is an indication to scale too, but downward. An observability stack is a must-have, because it lets you discover these statistical data points and measure your system.
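As a rough sketch of how such a check against the SLO might look, the snippet below compares a latency percentile with the 200 millisecond objective from the example. The choice of the 95th percentile, the nearest-rank method and the function names are assumptions made for illustration; in practice your observability stack would normally compute these figures for you.

```python
import math

SLO_MS = 200.0  # from the example: each HTTP request within 200 ms


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_breached(latencies_ms, pct=95, slo_ms=SLO_MS):
    """True when the chosen latency percentile exceeds the SLO."""
    return percentile(latencies_ms, pct) > slo_ms
```

A breach of this check is one possible signal to scale upward; the same measurement staying far below the objective for a long time hints that resources may be underutilized.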
How and when do you scale?
Whether the service already has some scaling capability or none at all, you need to think about how it should scale. Does it scale vertically, by scaling up and down, or horizontally, by scaling out and in? In general, scaling vertically requires downtime of the service, whereas scaling horizontally does not, especially when the service is stateless. Keep in mind that you should also scale down or scale in to maximize resource utilization and minimize cost. Is the scaling done manually or automatically? Is it performed at a fixed time, or dynamically depending on the current load? Another point to consider is that you may not want to scale immediately when the load threshold is hit, because some sharp peaks go away rather quickly. You can scale only when the load stays high for some period of time, provided the service can withstand the load in the meantime without crashing.
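One simple way to ignore short peaks is to require that every sample in a recent window is above the threshold before scaling. The sketch below assumes load samples arrive at a regular interval; the class name SustainedLoadDetector and the default window of 5 samples are hypothetical choices.

```python
from collections import deque


class SustainedLoadDetector:
    """Signals only when load stays at or above a threshold for a full window."""

    def __init__(self, threshold=0.85, window=5):
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def observe(self, load):
        """Record one sample; return True only if the whole window is high."""
        self.samples.append(load)
        return (len(self.samples) == self.samples.maxlen
                and all(s >= self.threshold for s in self.samples))
```

With a window of three samples, the sequence 0.9, 0.5, 0.9, 0.9, 0.9 triggers only on the last sample, because the dip to 0.5 resets the run of high readings.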
Perform Load and Stress Test
Load testing and stress testing are the best-known ways to discover how the service behaves under high load and when a stress situation arises. This blog on loadninja.com describes different types of testing if you are wondering what load and stress testing mean. This blog on blazemeter.com compares the two. This kind of testing needs a lot of effort, but if you are able to perform it, it answers many unknown questions about
Know your Dependencies
You need to know the external resources and services, the downstream services or applications, and their limits. You need to know their SLAs and what impact they have on your system when they are unavailable, even for a short period of time. Knowing these factors lets you design your system accordingly. When you know that the CPU core you use can only handle a million instructions per second, you will measure that as your Service Level Indicator and monitor it. Whenever the limit is near, you apply your scaling mechanism. When you know the maximum time it takes to recover from an application or service crash, you understand how many crashes you can ignore or survive. You can ignore a service crash if your service is stateless, multiple instances are always running, recovery is automatic and takes a few seconds, and it is acceptable that around 100 requests or units of work are affected by the crash. When any of these is not true, you will try to avoid a crash.
For example, if you know how many requests your database can handle at a time, you might stop taking more requests, keep requests in a queue, or implement a caching mechanism to ease the load on your database. When you know your messaging service can be down for a total of 5 minutes in a whole year, you will implement a mechanism for getting acknowledgements for the messages you send. When you know the downstream service can fail once per thousand calls, you will apply retry mechanisms. These measures work together to improve the situation; scaling is not the solution to every problem. You should also monitor the resources and dependencies to collect the statistical data on which you can base algorithms to scale better and be more resilient.
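A retry mechanism for an occasionally failing downstream call could look like the following sketch. It uses exponential backoff with random jitter, which is one common choice rather than the only one; the function name call_with_retry, the default of 3 attempts and the base delay are assumptions for illustration.

```python
import random
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on an exception, retry with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * (2 ** attempt)
            # Random jitter spreads out retries from many clients.
            sleep(delay + random.uniform(0, delay))
```

The sleep parameter is injectable so the function can be tested without waiting. Retries only make sense for operations that are safe to repeat, so in a real service you would likely catch only specific transient errors rather than every exception.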
Know your Infrastructure
The decision whether to scale, or to apply load control mechanisms instead, also depends on the infrastructure on which your application or service is deployed: the resource limits it offers, the ways it helps you scale in any direction, and whether it allows you to scale dynamically and automatically. The infrastructure also provides other services and tools to help you scale. For example, a load balancer distributes traffic automatically among the instances of the service you run, so you don't have to take care of routing requests to a new instance when it comes up or of stopping routing to an instance that just went down. All public cloud providers and PaaS solutions have their own ways to detect and define limits and apply auto scaling rules. You need to know what the platform you run on gives you and what you need to implement on your own.
How to Handle
In many cases you can detect an overload situation, or an instance or node crash, and add a new instance automatically; you may also want to scale up. When there is not much load, it is better to scale down or scale in to use resources as fully as possible and be more cost-effective. There are other options too when you don't want to scale or have reached the scalability limits. It is mostly the upper limit that you are concerned with. When you don't get any requests you still cannot stop all your instances, because a request can come at any time and at least one instance needs to be running if you aim to be available 24x7. In some cases you are not aiming for 24x7 availability, or it simply does not make sense, for example because your service is only used during business hours. In this situation you can use automation to stop your service completely and start it just before the next business hours begin. You need to make sure that any history or previously processed data you might need is still available after the restart.
On the other hand, when you reach the upper limit you need to take other kinds of action. If you don't safeguard your service, bad things will happen: your service may crash, it may not recover automatically or quickly, or the state of the data may become inconsistent. For example, in many cases in the cloud we process web requests. To handle traffic beyond the scaling limit you may throttle the incoming requests. You can simply declare that you are full and reject requests. You may impose a rate limit, overall or per client, to ensure you never reach the scalability limit. You can also introduce a queueing mechanism, and even priority queueing for important requests. You can accept the request, schedule it for later processing, and implement a mechanism to deliver the result asynchronously. Even when limits are not reached, asynchronous request processing generally improves the situation.
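Rate limiting is often implemented with a token bucket, and the sketch below shows that technique as one possible option, not as the only way to throttle. The class name TokenBucket and its parameters are hypothetical; for a per-client limit you would keep one bucket per client identifier.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity` and a steady `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

When allow() returns False the service can reject the request, for example with HTTP status 429 (Too Many Requests), or place it in a queue for later processing.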
Architectural patterns to achieve scaling
The way the system is designed also determines how easy scaling is. When the architecture has a singleton that does all or part of the work on its own, adding new instances makes no sense or improves the situation only a little. If the nodes are not equal and work in a master-worker fashion, scaling only the workers does not help. When there is no automatic leader election among the workers after a master goes down, the architecture does not ensure scaling. There are architectural patterns that allow you to scale dynamically. The microservices architecture helps you break silos so that your scaling is effective and you can scale part of the system, but microservices by themselves are not an architectural pattern for achieving scaling.
Below are a few of the architectural patterns that allow scaling and help overcome scaling limits:
Using a Load Balancer where the instances do not share any information among them
Using a Load Balancer with multiple stateless instances that offer the same capability
Loose coupling of subsystems
Asynchronous processing of requests
CQRS or Command Query Responsibility Segregation
Anti Corruption Layer pattern
Circuit Breaker Pattern
Map Reduce/ Data flows
Tree of responsibility
These two blogs also describe a few of these patterns well.
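To make one of the listed patterns concrete, here is a minimal sketch of a circuit breaker. It stops calling a failing dependency for a while instead of piling more load on it. The class names, the failure threshold of 3 and the 30 second reset time are assumptions for illustration; production libraries offer more states and configuration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Otherwise half-open: let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller that receives CircuitOpenError can fall back to a cached value or a default answer, which keeps the rest of the workflow responsive while the dependency recovers.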
I have discussed many aspects of working with scalability to achieve better resilience and reliability in your cloud native development. It is not everything, but it will keep you going.