Good suggestion. I've done blue-green professionally with services that are built for high availability and run in cloud environments. If I were to actually set up some form of that, I'd probably use ZFS send/recv to keep a backup server always 15 minutes behind and ready to go. I wouldn't deal with file-based backups that take an hour just to walk the dataset and figure out what's new. Probably not happening for now.
avidamoeba
When you say stopping the service for an instant you must mean pausing its execution or at least its IO. Actually stopping the service can't be guaranteed to take an instant. It can't be guaranteed to start in an instant. Worst of all, it can't even be guaranteed that it'll be able to start again. When I say stopping I mean systemctl stop or docker stop or pkill etc. In other words, delivering an orderly, graceful kill signal and waiting for the process(es) to stop execution.
That's the trivial scenario that we know won't fail - stopping the service during snapshot. The scenario that I was asking people's opinions on is not stopping the service during snapshot and what restoring from such backup would mean.
Let me contrast the two by completing your example:
docker start container
- Time passes
- Time to backup
docker stop container
- Make your snapshot
docker start container
- Time passes
- Shit happens and restore from backup is needed
docker stop container
- Restore from snapshot
docker start container
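For illustration, the backup half of this first scenario could be scripted roughly like this. It's only a sketch under assumptions: the container name immich and the dataset tank/immich are hypothetical, ZFS is assumed as the snapshot mechanism, and by default the script only prints the commands (DRY_RUN=1) instead of running them.

```shell
#!/bin/sh
# Sketch of the "stop, snapshot, start" backup. All names are hypothetical.
CONTAINER=${CONTAINER:-immich}     # assumed container name
DATASET=${DATASET:-tank/immich}    # assumed ZFS dataset holding its data
DRY_RUN=${DRY_RUN:-1}              # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_with_stop() {
    snap="$DATASET@backup-$1"
    run docker stop "$CONTAINER" || return 1   # graceful stop, data gets flushed
    run zfs snapshot "$snap"                   # taken while the service is down
    rc=$?
    run docker start "$CONTAINER"              # start again even if the snapshot failed
    return "$rc"
}

backup_with_stop "$(date +%Y%m%d-%H%M%S)"
```

The service is only down for the time it takes to stop, snapshot and start. Copying data out of the snapshot can happen afterwards while the service runs.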
Now here's the interesting scenario:
docker start container
- Time passes
- Time to backup
- Make your snapshot
- Time passes
- Shit happens and restore from backup is needed
docker stop container
- Restore from snapshot
docker start container
Notice that in the second scenario we are not stopping the container. The snapshot is taken while it's live. This means databases and other files are open, likely actively being written to. Some files are likely only partially written. There are also likely various temporary lock files present. All of that is stored in the snapshot. When we restore from this snapshot and start the service, it will see all of that.
Contrast this with the trivial scenario where the service is stopped. Upon stopping it, all data is synced to disk, in-flight database operations are completed or canceled, partial writes are completed or discarded, and lock files are cleaned up. When we restore from such a snapshot and start the service, it will "think" it's just starting after a clean stop, with nothing extra to do.
In the live snapshot scenario the service will have to do cleanup. For example, it will have to decide what to do with existing lock files. Are they there because another instance of the service is running and writing to the database, or did someone kill its process before it had the chance to go through its shutdown procedure? In the former case it might have to log an error and quit. In the latter it would have to remove the lock files. And so on and so forth.
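To make the lock file decision concrete, here's a minimal sketch of how a service might tell a live lock from a stale one. It assumes the lock file contains the PID of the process that created it, which is a common convention but not something every service does. The function name lock_state is made up for this example.

```shell
#!/bin/sh
# Assumption: the lock file holds the PID of the process that created it.
# Prints "none", "live" or "stale" for the given lock file path.
lock_state() {
    lock=$1
    [ -f "$lock" ] || { echo none; return; }
    pid=$(cat "$lock")
    # kill -0 delivers no signal, it only checks that the PID can be signalled.
    # Caveats: a process owned by another user looks "stale" here, and a PID
    # that got reused by an unrelated process looks "live".
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo live     # another instance seems to be running: log an error and quit
    else
        echo stale    # leftover from an unclean stop or a live snapshot: safe to remove
    fi
}
```

After restoring a live snapshot, a lock file from the original machine would normally come out as stale, because the PID it names doesn't belong to a running instance on the restored system.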
As for the effect of docker on any of this, whether you use docker stop container or systemctl stop service or pkill service, the effects on the process and its data are all the same. In fact the docker and systemctl commands will result in a termination signal (SIGTERM by default) being sent to the process of the service anyway.
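A toy demonstration of the difference between a graceful signal and kill -9, as a sketch only. The "service" here is a hypothetical shell function that creates a lock file and removes it when it receives SIGTERM. A real service does much more on shutdown, but the mechanism is the same.

```shell
#!/bin/sh
# Toy "service" (hypothetical): creates a lock file, removes it on SIGTERM.
state_dir=$(mktemp -d)

service() {
    trap 'rm -f "$state_dir/lock"; exit 0' TERM
    : > "$state_dir/lock"
    # sleep in the background and wait on it so the trap can run promptly
    while :; do sleep 1 & wait $!; done
}

# Start the toy service, send it the given signal, report what it left behind.
stop_with() {
    service &
    pid=$!
    sleep 1                        # give it time to install the trap and lock
    kill "-$1" "$pid"
    wait "$pid" 2>/dev/null
    if [ -e "$state_dir/lock" ]; then
        echo "$1: lock file left behind"
    else
        echo "$1: clean shutdown"
    fi
    rm -f "$state_dir/lock"        # reset for the next run
}

stop_with TERM   # what docker stop, systemctl stop and pkill send by default
stop_with KILL   # kill -9: the process gets no chance to clean up
```

With TERM the handler runs and the lock file is gone. With KILL the process is gone but the lock file stays, which is the same state a live snapshot would capture.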
Currently duplicity, but rsync took a similar amount of time. The incremental change is typically tens or hundreds of files, hundreds of megabytes total. They take very little time to transfer.
If I can keep the service up while it's backing up, I don't care much how long it takes. Snapshots really solve this well. Even if I stop the service while creating the snapshot, it's only down for a few seconds. I might even get rid of the stopping altogether but there's probably little point to that given how short the downtime is. I don't have to fulfill an SLA.
Yeah, if you're making a backup using the database system itself, then it would make sense for it to do something like that if it stays live while backing up. If you think about it, it's kinda similar to taking a snapshot of the volume where an app's data files are while it still runs. It keeps writing as normal while you copy the data from the snapshot, which is read-only. Of course there's no built-in way to get the newly written data without stopping the process. But you could get the downtime to a small number.
Oh interesting. I was under the impression that deletion in LVM was actually merging which took some time but I guess not. Thanks for the info!
Docker doesn't change the relationship between a running process and its data. At the end of the day you have a process running in memory that opens, reads, writes and closes files that reside on some filesystem. The process must be presented with a valid POSIX environment (or equivalent). What happens with the files when the process is killed instantly and what happens when it's started afterwards and it re-reads the files doesn't change based on where the files reside or where the process runs. You could run it in docker, in a VM, on Linux, on Unix, or even Windows. You could store the files in a docker volume, you could mount them in, have them on NFS, in the end they're available to the process via filesystem calls. In the end the effects are limited to the interactions between the process and its data. Docker cannot remove this interaction. If it did, the software would break.
It depends on the dataset. If the dataset itself is very large, just walking it to figure out what the incremental part is can take a while on spinning disks. Concrete example - Immich instance with 600GB of data, hundreds of thousands of files, sitting on a 5-disk RAIDz2 of 7200RPM disks. Just walking the directory structure and getting the ctimes takes over an hour. Suboptimal hardware, suboptimal workload. The only way I could think of speeding it up is using ZFS itself to do the backups with send/recv, thus avoiding the file operations altogether. But if I do that, I must use ZFS on the backup machine too.
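Roughly what the send/recv approach would look like, as a sketch under assumptions: the dataset tank/immich, the host backup.example and the snapshot labels are all hypothetical, the backup machine is assumed to already hold the previous snapshot, and by default the script only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Incremental ZFS replication sketch. All names are hypothetical.
DATASET=${DATASET:-tank/immich}
BACKUP_HOST=${BACKUP_HOST:-backup.example}
DRY_RUN=${DRY_RUN:-1}              # 1 = only print the commands

# replicate PREV NEW: snapshot as @NEW, send the delta since @PREV.
replicate() {
    prev=$1
    new=$2
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ zfs snapshot $DATASET@$new"
        echo "+ zfs send -i $DATASET@$prev $DATASET@$new | ssh $BACKUP_HOST zfs recv $DATASET"
    else
        zfs snapshot "$DATASET@$new" &&
            zfs send -i "$DATASET@$prev" "$DATASET@$new" |
            ssh "$BACKUP_HOST" zfs recv "$DATASET"
    fi
}

replicate 1200 1215
```

The incremental stream is built from the blocks that changed between the two snapshots, so there's no directory walk and no ctime comparison. The cost is what I mentioned: the receiving side has to be ZFS too.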
I've yet to meet any service that can't recover smoothly from a kill -9 equivalent, any that did sure wouldn't be in my list of stuff I run anymore.
My thoughts precisely.
Thanks for validating my reasoning. And yeah, this isn't Immich-specific, it would be valid for any process and its data.
And this implies you have tested such backups right?
Side Q, how long do those LVM snapshots take? How long does it take to merge them afterwards?
And I'm using Docker, but Docker isn't helping with the stopping/running during backup conundrum.
Good point. I guess kill -9 is somewhat less catastrophic than a power-yank. If a service is written well enough to handle the latter it should be able to handle the former. Should, subject to very interesting bugs that can hide in the difference.
I'm currently thinking of setting up automatic restore of these backups on the off-site backup machine. That is, the backups are transferred to the off-site machine, restored to the dirs of the services, then the services are started. This should cover the second half I think. Of course those services can't be used to store new data because they'll be regularly overwritten with every backup. In the event of a hard snafu where the main machine disappears, I could stop the auto restore on the off-site machine and start using the services from it, effectively making it the main machine. If this turns out to be reasonable and working, I might trash all of the file-based backup-and-transfer mechanisms and switch to ZFS send/recv. That should allow shrinking the data delta between main and off-site to minutes instead of hours or days. Does this make any sense?