Where should a dockerized application persist data?
Updated: Sep 11, 2021
Great! We want to host an application in a Docker container. We are taking a step towards a microservices architecture and cloud-native deployment. But the very next question from developers will be: where do we persist data, and how do we access data that is already stored? In this post we will look at the best practices.
Where we should store the data an application needs depends on the purpose of the data and its required longevity. Data can be short-lived and temporary, or it may need to live long in persistent storage. Files in a temporary folder are an example of short-lived data; data in a database or on a storage volume is an example of long-lived data.
Intermediate data produced by the application during processing
Data that the application produces while it is processing is intermediate data. The application keeps this data in the memory (RAM) of the host machine where the container is running. The docker run command has options to limit the memory available to a container: we can decide which type of memory (for example RAM or swap) a container may use and how much of it the container may consume. See the Docker documentation for details.
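As a minimal sketch of the memory options mentioned above, the helper below only builds a docker run argument list and does not start anything. The --memory and --memory-swap flags are real docker run options; the function name, the default values and the image name "my-app:latest" are placeholders chosen for illustration.

```python
def docker_run_with_memory_limit(image, memory="512m", memory_swap=None):
    """Build a `docker run` argument list that caps the container's memory."""
    args = ["docker", "run", "--rm", "--memory", memory]
    if memory_swap is not None:
        # --memory-swap is the combined memory + swap limit,
        # so it is expected to be at least as large as --memory.
        args += ["--memory-swap", memory_swap]
    args.append(image)
    return args


if __name__ == "__main__":
    # "my-app:latest" is a placeholder image name, not a real image.
    print(" ".join(docker_run_with_memory_limit("my-app:latest", "512m", "1g")))
```

The resulting list can be handed to subprocess.run on a machine where Docker is installed.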
Short-lived temporary data
In a few cases, an application needs to produce and store data temporarily for internal processing. This data is not needed after we restart the container or deploy a new one, and losing it does no harm.
When a container is started from an image, all the layers the image consists of are read-only. Docker creates a writable layer for the container on top of these read-only layers, and an application hosted in the container can write data into that layer. Short-lived temporary data qualifies to be stored there. The data survives stopping and restarting the container, because it lives in the container's writable layer, but it is lost when the container is deleted, because the writable layer is deleted with it.
Writing into the container also makes the container larger, and storing a large amount of data there is not a good idea because the I/O performance is poor. A storage driver is what allows us to write data into the container: it provides the container's filesystem using the Linux kernel and manages that filesystem. This extra abstraction layer is the reason for the less efficient I/O. We can choose from a number of storage drivers: overlay2, aufs, fuse-overlayfs, devicemapper, btrfs, zfs and vfs. Some of them work out of the box and need no configuration in your Docker installation.
As a developer, you can write programs that create, read and delete directories and files, and write to and read from those files, just as the application would when it is not running in a container. Read more about the different storage drivers in the Docker documentation.
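To illustrate the point that file handling code looks the same inside and outside a container, here is a small sketch that writes intermediate results to a scratch file, reads them back and cleans up. The function name and the upper-casing step are made up for the example. Inside a container, the temporary directory would normally sit in the writable layer unless a mount covers it.

```python
import os
import shutil
import tempfile


def process_with_scratch_file(lines):
    """Write intermediate results to a scratch file, read them back, then clean up."""
    scratch_dir = tempfile.mkdtemp(prefix="scratch-")
    scratch_path = os.path.join(scratch_dir, "intermediate.txt")
    try:
        # Stand-in for real processing: store an upper-cased copy of each line.
        with open(scratch_path, "w", encoding="utf-8") as f:
            for line in lines:
                f.write(line.upper() + "\n")
        with open(scratch_path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]
    finally:
        # Remove the scratch data so the container's writable layer does not keep growing.
        shutil.rmtree(scratch_dir, ignore_errors=True)
```

Deleting the scratch directory when the work is done keeps the container small, which matters given the I/O cost described above.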
Long-lived unstructured data
Unstructured data is often stored in files. Applications often want to read from, write to or append to a file, and these files need to persist even after the container and the application in it terminate. In this case it is better to store the files on the host machine or on a remotely accessible volume. Docker gives us two main options: volumes and bind mounts.
A volume is created the first time it is mounted into a container, if it does not already exist, and when we stop the container the volume and the data in it remain. If an empty volume is mounted onto a container directory that already contains data, that data is copied into the volume. The same volume can be mounted into multiple containers, which can then access the data simultaneously. A volume can live on the local host or on a remote machine in the cloud.
Using bind mounts, we can bind a file or directory of the host filesystem into a Docker container; it can be anywhere on the local host. When a host directory is bind mounted onto a container directory that already contains data, it behaves like mounting a filesystem over a non-empty directory (for example /mnt) with the Linux mount command. The existing data in the container is not removed, but it cannot be seen or accessed in the container as long as the bind mount is active.
We can also use a tmpfs mount when Docker is running on Linux, or a named pipe when Docker is running on Windows. To learn more, see the Docker documentation.
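The mount types above can all be requested with the --mount option of docker run. The sketch below only builds the option value; the type, source, target and readonly keys are real --mount fields, while the function name, the volume name "app-data" and the paths in the usage note are hypothetical. It assumes a Linux host, where bind mount sources are absolute paths starting with "/".

```python
def mount_flag(kind, target, source=None, readonly=False):
    """Return a value for `docker run --mount` for a volume, bind or tmpfs mount."""
    if kind in ("volume", "bind"):
        if not source:
            raise ValueError(f"{kind} mounts need a source")
        if kind == "bind" and not source.startswith("/"):
            # Assumption: Linux host, so the host path must be absolute.
            raise ValueError("bind mount source must be an absolute host path")
        parts = [f"type={kind}", f"source={source}", f"target={target}"]
        if readonly:
            parts.append("readonly")
    elif kind == "tmpfs":
        # tmpfs lives in host memory, so it has no source; this sketch does not
        # support the readonly option for it.
        if source or readonly:
            raise ValueError("this sketch takes only a target for tmpfs mounts")
        parts = ["type=tmpfs", f"target={target}"]
    else:
        raise ValueError(f"unsupported mount type: {kind}")
    return ",".join(parts)
```

For example, mount_flag("volume", "/var/lib/app", source="app-data") gives a value that can follow --mount on the docker run command line, and mount_flag("bind", "/app/config", source="/srv/config", readonly=True) does the same for a read-only bind mount.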
Long-lived relational and semi-structured data
Applications need persistence to read and write relational and semi-structured data. Examples of semi-structured data are JSON documents and key-value pairs. For relational data we need a relational database like PostgreSQL. To store JSON we need a document database like MongoDB. For key-value pairs we need a database like Redis. This kind of data has to outlive the container running the application that produced it. Containers are designed to be short-lived and redeployable. When a container dies, a new container can be created within seconds, and the new application instance in it should be able to access the data. Multiple instances of the same application running in separate containers should be able to access this data too. That is why this data has to live outside the container.
All these databases can be hosted in a different container, on the same or a different machine, with a persistent volume that protects against data loss. Even better, the databases can run in the cloud and be accessed with credentials such as a username and password. The application running in its own container should be able to connect to the host and port where the database is hosted, through an interface such as JDBC. This is much like an application in a container communicating with another application anywhere on the intranet or internet.
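One common way to let a containerized application find its database is to pass the host, port and credentials in from outside. The sketch below assembles a PostgreSQL connection URI from environment variables. The variable names DB_HOST, DB_PORT, DB_NAME, DB_USER and DB_PASSWORD and the default values are assumptions made for this example, not something Docker or PostgreSQL prescribes.

```python
import os
from urllib.parse import quote


def database_url(env=None):
    """Assemble a PostgreSQL connection URI from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "app")
    # Percent-encode the credentials so characters like "@" or "/" do not break the URI.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Because nothing about the database is baked into the image, a replacement container or a second instance started with the same settings reaches the same data.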
Secrets and sensitive data
Sensitive data like usernames, passwords, SSL certificates or SSH private keys can be stored using Docker secrets.
Secrets are stored in encrypted form. See the Docker documentation to learn how to create and use them. Docker secrets only work in Docker Swarm, though. Without Swarm, we can keep the secrets in a store such as etcd on the host machine and pass them to a Docker container directly through the --env option of the docker run command.
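Inside a Swarm service, Docker makes each secret available as a file under /run/secrets/ named after the secret. The sketch below reads a secret from that location and falls back to an environment variable, which covers the --env approach as well. The fallback convention (upper-casing the secret name) and the function name are assumptions made for this example.

```python
import os
from pathlib import Path


def read_secret(name, secrets_dir="/run/secrets", env=None):
    """Return a secret from a mounted secret file, else from an environment variable."""
    env = os.environ if env is None else env
    secret_file = Path(secrets_dir) / name
    if secret_file.is_file():
        # Secret files often end with a newline added by the tool that created them.
        return secret_file.read_text(encoding="utf-8").rstrip("\n")
    # Hypothetical fallback convention: secret "db_password" -> variable "DB_PASSWORD".
    value = env.get(name.upper())
    if value is None:
        raise KeyError(f"secret {name!r} not found in {secrets_dir} or environment")
    return value
```

With this, the application code stays the same whether the container runs as a Swarm service or is started with docker run and --env.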
Application configuration and environment variables
Non-sensitive data like application configuration can be stored and provided to containers using docker config. Docker configs are similar to Docker secrets, but they are not encrypted. They are mounted as files in the container's filesystem. This feature is also available only in Docker Swarm. When we are not using Docker Swarm, we can bind mount config files directly into Docker containers. Read more about docker config in the Docker documentation.
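Whether a config file arrives through docker config or through a bind mount, the application sees an ordinary file. The sketch below loads a JSON config from a path that can be overridden by an environment variable. The variable name APP_CONFIG_PATH, the default path and the JSON format are all assumptions made for illustration.

```python
import json
import os


def load_config(env=None, default_path="/etc/my-app/config.json"):
    """Load a JSON config file from a mounted path (path and variable name are hypothetical)."""
    env = os.environ if env is None else env
    path = env.get("APP_CONFIG_PATH", default_path)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Keeping the path configurable lets the same image read its configuration from whichever location it was mounted at.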
In this post we have learned that Docker provides several options for storing and accessing data, depending on the type of data.