Cloud companies often provide Service Level Agreements (SLAs) in terms of availability to showcase how reliable their cloud products are. Each cloud native application and service must carefully adhere to technologies, mechanisms and processes that ensure the overall SLA of the cloud product. Disaster recovery is closely related: it ensures the capability and a well defined process (manual, automated or semi-automated) to bring a cloud component back to a usable condition after it becomes unavailable due to an unforeseen situation. RTO (Recovery Time Objective), RPO (Recovery Point Objective) and MTD (Maximum Tolerable Downtime) are indicators of how good your disaster recovery mechanism is. The lower the RTO, the easier it is for you to meet your overall SLA; the lower the RPO, the more resilient you can advertise your cloud product to be. This blog beautifully describes RTO, RPO and MTD, in case you are searching for a definition. I am going to discuss a few points you can use to assess your cloud components' readiness with respect to availability and disaster recovery.
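As a quick sanity check, an availability SLA can be translated into a concrete downtime budget. A minimal sketch (the SLA percentages are illustrative, not tied to any provider):

```python
# Convert an availability SLA into a yearly downtime budget.

def downtime_budget(availability_pct, period_hours=365 * 24):
    """Return the maximum allowed downtime (in hours) for the period."""
    return period_hours * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> {downtime_budget(sla):.2f} hours/year")
```

Three nines, for example, leave you less than nine hours of downtime per year, which is what your recovery process ultimately has to fit into.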
Deployment Architecture
The deployment architecture describes how the runtime artifacts are deployed and organized. In cloud native development, a cloud product is most of the time divided into several cloud components and follows a microservices architecture. Each of these components has its own life cycle; together they fulfill certain functionality of the cloud product and form its backend.
We need to identify whether there is a singleton in the component landscape. A singleton is a single point of failure and needs greater attention to be highly available. You need to think about whether the singleton can be replicated with similar instances, whether it will have one or more hot standbys, or how fast it can be recovered. A better approach is to divide the whole set of functionality into several dedicated components.
Once you achieve that, the next step is to run several instances of each component. That way you can scale easily and also perform blue-green or canary deployments. It also helps when, for some reason, one instance is down. The additional or secondary instances can work in different fashions: active-active or active-passive. In active-active mode all the instances are always active: the secondary instances can be used simultaneously along with the primary instance, so all the instances receive all kinds of requests in parallel, or you can accept only read or idempotent operations on the secondary instances. If the secondary instances are read only, there should be an automatic sync between the primary and the secondaries. In active-passive mode the secondaries are hot standbys: they are constantly synced with the primary but cannot be used while the primary is working. Once the primary is unavailable, a secondary becomes primary. The transition from secondary to primary is often automatic and takes almost no time. After the secondary becomes primary, either the old primary is repaired and relaunched, or a fresh instance is launched as the new standby. The good news is that several cloud services and cloud platforms (IaaS, PaaS) already have these mechanisms; you either need to opt in or just include them in your initial configuration.
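The active-passive failover described above can be sketched in a few lines. This is a toy model, not a production mechanism: the node names are hypothetical and the health probe is assumed to be supplied by the caller (in practice the platform provides it).

```python
class ActivePassivePair:
    """Toy active-passive pair: all traffic goes to the primary; the hot
    standby is promoted only when the primary's health probe fails."""

    def __init__(self, primary, standby, is_healthy):
        self.primary, self.standby = primary, standby
        self.is_healthy = is_healthy   # caller-supplied probe (hypothetical)

    def endpoint(self):
        if self.is_healthy(self.primary):
            return self.primary
        # Failover: promote the standby; a fresh standby would be
        # launched out of band to restore redundancy.
        self.primary, self.standby = self.standby, None
        return self.primary

health = {"node-a": False, "node-b": True}
pair = ActivePassivePair("node-a", "node-b", lambda n: health.get(n, False))
print(pair.endpoint())  # node-b: the standby was promoted
```

Note that after the promotion the pair runs without a standby until a replacement is launched, which is exactly the window your recovery process has to keep short.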
In some deployments all components are equal; in others they work in leader-follower or master-worker fashion. In that case, downtime of a leader or master has more impact than that of a follower or worker, as usually there are many followers but only one leader. Sometimes these special component architectures are provided by frameworks that come with their own deployment management features; for example, once the leader is down, the followers automatically elect a new leader among themselves. Here we need to know whether the loss of a leader causes a temporary downtime, or whether the deployment supports a read-only mode until there is a new leader or the old leader is back again.
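To make the election step concrete, here is a deliberately simplified sketch: the live node with the lowest id wins. Real coordination frameworks (ZooKeeper, Raft-based systems) add terms, quorums and fencing on top of this idea; the node names are illustrative.

```python
def elect_leader(nodes, alive):
    """Pick a new leader deterministically: the live node with the
    lowest id. A stand-in for real leader election protocols."""
    candidates = [n for n in nodes if alive(n)]
    if not candidates:
        raise RuntimeError("no live nodes; cluster is down")
    return min(candidates)

cluster = ["node-1", "node-2", "node-3"]
up = {"node-1": False, "node-2": True, "node-3": True}  # old leader is down
print(elect_leader(cluster, lambda n: up[n]))  # node-2 takes over
```

The interesting question from the availability point of view is not the election rule itself but how long the cluster is leaderless (or read-only) while it runs.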
Deployments that support several instances or leader-followers generally use several nodes to run those components and instances. All the nodes together make a cluster. Nodes can be added or removed at runtime, dynamically or manually, depending on need. For this to work smoothly, there should be a supervisor component, inside or outside the cluster, that can supervise and control the nodes: health detection, availability, deciding when to mark a node as unavailable, retiring or launching a node, load balancing traffic or work units between them, and much more.
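One small piece of such a supervisor, the traffic-distribution part, might look like the following sketch: a round-robin balancer that consults a health map (assumed to be kept up to date by the monitoring side) and skips unavailable nodes.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through cluster nodes, skipping any the health map
    currently marks as unavailable."""

    def __init__(self, nodes, health):
        self.nodes = nodes
        self.health = health            # node -> bool, updated by monitoring
        self._cycle = itertools.cycle(nodes)

    def next_node(self):
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if self.health.get(node, False):
                return node
        raise RuntimeError("no healthy node in the cluster")

health = {"n1": True, "n2": False, "n3": True}
lb = RoundRobinBalancer(["n1", "n2", "n3"], health)
print([lb.next_node() for _ in range(4)])  # n2 is skipped: n1, n3, n1, n3
```

Because the balancer reads the shared health map on every request, marking a node unavailable immediately drains new traffic away from it.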
High Availability, Replication and Redundancy Strategy
The cloud components should be deployed as redundant copies, preferably on physically separated hardware stacks and datacenter facilities. This complements the deployment architecture you have designed for high availability. The public cloud providers use the term availability zone to indicate completely separate data centers, geographically apart but still in the same region. They also provide easy deployment options so that you can deploy the cloud components across multiple availability zones. Other components, like load balancers and Content Delivery Networks, can be used to balance traffic between different availability zones. You can manage these components on your own or delegate that task to the cloud provider; when you delegate, the existence of such components is often hidden from you. When you use a Platform as a Service to deploy your components, distribution of multiple instances across availability zones is sometimes automatic. You always need to be careful and informed about the infrastructure you use, to understand whether your components can be deployed in different availability zones. Your architecture also should allow multiple instances to be deployed across multiple availability zones and data centers.

When the instances of a component need to sync with each other (for example, frontend servers of database solutions), you also need to think about how that is going to happen: will it be a regular load and unload of the whole data and state, just deltas, or real-time replication of new changes? With real-time replication, the clients or end users of the component can consume it in active-active fashion, with zero downtime to switch between primary and secondary; both can also work as primaries.
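The difference between a full reload and a delta sync can be illustrated with a small sketch. Here the state is a plain dictionary and the set of changed keys is assumed to be tracked elsewhere; a real replicator would stream a change log instead.

```python
def delta_sync(primary, replica, changed_keys):
    """Ship only the deltas (keys changed since the last sync) instead
    of reloading the whole state. Deletions are propagated too."""
    for key in changed_keys:
        if key in primary:
            replica[key] = primary[key]   # insert or update
        else:
            replica.pop(key, None)        # key was deleted on the primary

primary = {"a": 1, "b": 2, "c": 3}
replica = {"a": 1, "b": 0, "d": 9}        # stale value for b, deleted key d
delta_sync(primary, replica, changed_keys={"b", "c", "d"})
print(replica)  # {'a': 1, 'b': 2, 'c': 3}
```

The trade-off is the usual one: full reloads are simple but slow and disruptive, deltas are cheap but require bookkeeping, and real-time replication is the most work but gives the zero-downtime switch described above.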
If it's a stateless web service, you don't have to think about syncing those instances as long as they use the same application database. The database can provide high availability out of the box, which means the database takes care of syncing the different database instances. When it's a stateful web service, you can use application-level load balancers and implement session affinity, so that requests with the same session always go to the same instance of the component.
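The core of session affinity is a stable mapping from session to instance. A minimal sketch using a hash of the session id (real application load balancers typically implement this with a cookie; the instance names are illustrative):

```python
import hashlib

def pick_instance(session_id, instances):
    """Map a session id to an instance deterministically, so the same
    session always lands on the same instance while the list is stable."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return instances[int(digest, 16) % len(instances)]

instances = ["app-1", "app-2", "app-3"]
first = pick_instance("user-42-session", instances)
# Every request in the session hits the same instance:
assert all(pick_instance("user-42-session", instances) == first for _ in range(5))
```

Note the caveat in the comment: a plain modulo mapping reshuffles sessions when the instance list changes, which is why real systems prefer cookies or consistent hashing.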
You need a test strategy to verify the high availability of your component. For example, you can use chaos testing: shoot down an instance of a component and analyze how the component behaves in this situation, whether it is able to cope and to what extent. You can also generate real-life load, and increasingly more of it, through stress testing in a test environment, to check whether your component is resilient enough to handle these situations.
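The shape of such a chaos test can be shown with an in-process simulation (real chaos tools kill actual processes or VMs; the instance names here are made up):

```python
import random

def service_alive(instances):
    """The replicated service is up if at least one instance answers."""
    return any(instances.values())

# Chaos experiment: shoot down one random instance, then check the
# service as a whole still responds.
instances = {"web-1": True, "web-2": True, "web-3": True}
victim = random.choice(list(instances))
instances[victim] = False          # the "chaos" step

assert service_alive(instances), "service died after losing one instance"
print(f"killed {victim}; service still available")
```

The assertion is the important part: a chaos test without an explicit expectation of what "survived" means tells you very little.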
Make it a practice to perform this type of testing regularly, and build or use handy tools that help you perform such activities easily.
Even if you have the best high availability options on board, you need to detect a disaster and then take action. Taking action can be manual, semi-automatic or automatic. An example of a manual action is deploying a new instance of the component once you identify that an instance has crashed. An example of an automatic action is platform-provided or self-implemented monitoring that automatically spins up a new instance once it detects that an instance is down. Another example is automatically promoting the secondary to primary in an active-passive deployment.
Whether you have automatic or manual actions, platform-provided or in-house mechanisms to deal with a disaster situation, you will first need to identify that a disaster has happened, if not to take manual action then at least for future analysis and for determining whether you met your SLA. Disaster detection is also needed for your dependencies, the resources and services that a component depends on. When a cloud component can detect disaster or unavailability of its dependencies, the developers can use this information to build the component in the most resilient manner. For example, if your database is unavailable for a short time, you can survive the situation by implementing a cache or a temporary storage, e.g. Kafka. When a component can detect that a dependency is unavailable, it can automatically degrade its service but still be operational; for example, it can keep offering the part of the functionality that is not affected by the disaster.

There are many ways to decide that an instance or a node of a component is gone; one of the easiest is sending requests at periodic intervals, which either the component developer implements or the platform provides. In architectures like leader-worker or primary-secondary, where instances need to sync with each other, the components need to know that one of the nodes is unavailable and update themselves by removing the entry of the unavailable instance and adding an entry for its newly created replacement. This kind of functionality might come from the framework you are using, from the platform, or might need to be implemented in-house. Disaster detection is central to these kinds of deployment architectures. You also need to be careful not to cause any split-brain issues.
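The periodic-request idea boils down to a heartbeat with a timeout. A minimal sketch, assuming a 5-second timeout and illustrative instance names (the right timeout is something you tune per component, and too aggressive a timeout is one way split-brain situations start):

```python
import time

class FailureDetector:
    """Heartbeat-based detection: every instance pings periodically, and
    anything silent for longer than `timeout` seconds is declared down."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}            # instance -> time of last heartbeat

    def heartbeat(self, instance):
        self.last_seen[instance] = time.monotonic()

    def unavailable(self):
        now = time.monotonic()
        return [i for i, t in self.last_seen.items() if now - t > self.timeout]

fd = FailureDetector(timeout=5.0)
fd.heartbeat("db-primary")
fd.last_seen["db-replica"] = time.monotonic() - 10   # simulate 10s of silence
print(fd.unavailable())  # ['db-replica']
```

Whatever reacts to the `unavailable` list (alerting, replacement, promotion) is exactly the manual, semi-automatic or automatic action discussed above.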
In the worst scenario, you should be able to recover as fast as possible from a disaster. You need a well defined, automated process that keeps the Operations team in the loop but does not contain steps that are time consuming due to human intervention. You need to rehearse and determine how long it takes to recover from a failure.
Is the recovery automated or manual? When it is automated, does it also cover the post-recovery housekeeping steps that might exist, for example any resynchronization tasks? Some platforms, like Kubernetes and Cloud Foundry, provide automatic recovery options after a service fails. You also should know the recovery objectives of the resources and services that your component depends on, and the process you need to follow to recover those dependencies. The recovery options and process should be clearly known to everyone who needs to know them.
You have to define how you can recreate the component and deploy it. Do you have a backup? For example, do you store the latest version of the container image that contains everything needed to launch your component? Can you launch the component in a secondary or completely new location to restore business continuity as fast as possible? Do you know the answers to these questions for the database and the other services and resources that you depend on?
A planned disaster simulation can reveal a lot about your high availability claims and the effectiveness of your disaster recovery process. A disaster simulation has to be well thought out and should be considered in your system architecture; otherwise, simulation becomes much more difficult. You also need to decide whether you want to simulate a disaster in a live environment where your customers are working, or build a new environment to perform the simulation. You can, for example, shoot down some components completely or partially (say, kill a few instances but not all) and observe the effect on the other parts of your software system. In a microservices architecture, where all microservices together provide a set of functionality and constitute the backend, you can shoot down one microservice and watch its effect on the directly or indirectly dependent microservices and on the cloud native software as a whole. In another approach, you can shoot down some of your dependencies to understand to what extent your component can survive. For example, if you are using a database with a primary for reads and writes and a replica as read only, you can shoot down the replica to understand how much extra load your primary has to take and to what extent it can still perform write operations with acceptable latency. These kinds of activities are part of chaos engineering.
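The replica experiment can be modeled with a tiny routing sketch that counts where the load lands before and after the replica is shot down (the names and operation mix are made up for illustration):

```python
class ReadWriteRouter:
    """Primary/replica routing sketch: writes always go to the primary;
    reads go to the replica while it is up, otherwise they fall back to
    the primary, which then carries the extra load."""

    def __init__(self):
        self.replica_up = True
        self.load = {"primary": 0, "replica": 0}

    def route(self, op):
        target = "replica" if op == "read" and self.replica_up else "primary"
        self.load[target] += 1
        return target

router = ReadWriteRouter()
for op in ["read", "write", "read"]:
    router.route(op)
router.replica_up = False          # the chaos step: shoot down the replica
for op in ["read", "read", "write"]:
    router.route(op)
print(router.load)  # {'primary': 4, 'replica': 2}
```

In a real experiment you would watch the same shift in your metrics (requests per second and latency on the primary) rather than in a counter dictionary.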
Having SLIs, SLOs and SLAs defined is important to understand to what extent you need to support high availability: the more available you want to be, the more you pay for redundant infrastructure. Well defined metrics and a well defined goal allow you to understand the need. Metrics also allow you to detect a disaster and calculate how well the component is performing with respect to its SLA. A well defined observability stack is needed so that the actions you take toward the goal are data driven and statistically grounded. When you simulate a disaster, you need metrics and measurements that you can analyze quickly and easily; without data for those metrics and an analytical visualization tool, your disaster simulation is hardly meaningful.
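A common request-based availability SLI, and the error budget it implies, can be computed directly from request counts. A small sketch with made-up numbers and a 99.9% SLO as the assumed goal:

```python
def availability_sli(total_requests, failed_requests):
    """Availability SLI as the fraction of successful requests."""
    return 1 - failed_requests / total_requests

def error_budget_left(slo, total_requests, failed_requests):
    """Remaining error budget: how many more requests may fail in this
    period before the SLO (e.g. 0.999) is breached."""
    allowed_failures = total_requests * (1 - slo)
    return allowed_failures - failed_requests

total, failed = 1_000_000, 400
print(f"SLI: {availability_sli(total, failed):.4%}")
print(f"budget left: {error_budget_left(0.999, total, failed):.0f} requests")
```

Reviewing the remaining budget is also a practical gate for disaster simulations: running chaos experiments in production is much easier to justify while the budget is healthy.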
During and after a disaster, you may need to follow a compliance process set up in your company to inform anyone who is interested, including your customers. For example, you might need to report the outage internally to other stakeholders and to your customers, along with the progress of the situation until it is resolved. Sometimes you also need to capture information about the disaster for later introspection.
What I have discussed above is neither a complete solution for high availability nor a recipe for overcoming an outage. Rather, it is meant to stimulate your thinking about the most important aspects you need to take care of. When you have answers to the points discussed above, the component under scrutiny is in a good position in terms of high availability and disaster recovery.