Suspend and restore


What suspend and restore is and how it can be used with BitFolk VPSes.

Definition

Normally when a BitFolk host machine is subject to maintenance that requires a reboot, all VPSes on the machine will first be shut down. Once the maintenance is complete and the host is up again, all VPSes will be booted again.

It is possible instead to suspend a running VPS to permanent storage. When the host boots again, all suspended VPSes are restored from permanent storage before the remaining VPSes are booted.

This is very much like the hibernation feature which you may be familiar with when using desktop Linux. When the VPS is restored, everything that was running before should be running again.

Advantages

As the VPS simply picks up where it left off, this is faster and less disruptive than a shutdown and boot. The state of the VPS should be the same as it was before, except that the clock will have stood still for the elapsed time and very likely all TCP connections will have been torn down.

Possible issues

Warning: In August 2021 a bug present in Linux kernels older than 4.2 caused irreparable filesystem damage to two customer VMs during restore. Although this bug was fixed 6 years ago (as of the time of this writing), some BitFolk customers are still running such ancient and unsupported kernels. BitFolk does not recommend use of suspend+restore with kernels older than 4.2. As BitFolk cannot tell which kernel version customers are using, every customer VM was opted back out of suspend+restore on 30 September 2021 as a precaution.
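
Before opting back in, it may help to confirm which kernel the VPS is running. The following is a minimal sketch rather than an official BitFolk tool: the function names are made up for illustration, and it compares only the major and minor numbers reported by the running kernel against the 4.2 threshold from the warning. A version number alone does not show whether a distribution has backported a particular fix, so treat the result as a first check only.

```python
import platform
import re
from typing import Tuple

# Threshold taken from the warning above: kernels older than 4.2 are not recommended.
MIN_KERNEL = (4, 2)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.10.0-21-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_new_enough(release: str, minimum: Tuple[int, int] = MIN_KERNEL) -> bool:
    """Return True if the release is at or above the minimum (major, minor)."""
    # Tuples compare numerically, so (4, 19) is correctly newer than (4, 2).
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # same value that `uname -r` reports
    try:
        ok = kernel_is_new_enough(release)
    except ValueError as err:
        print(f"Could not check kernel version: {err}")
    else:
        verdict = "4.2 or newer" if ok else "older than 4.2, suspend+restore not recommended"
        print(f"Kernel {release}: {verdict}")
```

Run it inside the VPS itself, as it inspects the kernel of the machine it runs on.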

Restore sometimes does not work correctly. The failure mode is quite deterministic: if restore works once for a given VPS it should always work, but if it fails it will fail every time. At the moment it is difficult to predict whether it will work or not, so the only way to tell is to try it.

Sometimes the failure is in the kernel; in other cases it is an application which objects to time standing still for a long period.
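
An application that keeps its own timers can sometimes be made more tolerant by noticing when far more wall-clock time has passed than it expected. The sketch below is illustrative only. It assumes that the guest's wall clock is stepped forward at some point after restore (for example by NTP), which this page does not describe, and the function name and the 5 second tolerance are arbitrary choices.

```python
import time


def find_time_jump(previous_wall, current_wall, expected_interval, tolerance=5.0):
    """Return the unexpected extra seconds between two wall-clock samples.

    Returns 0.0 when the gap is within `tolerance` of the expected interval.
    """
    extra = (current_wall - previous_wall) - expected_interval
    return extra if extra > tolerance else 0.0


def run_periodic(task, interval=1.0, iterations=3):
    """Call `task` every `interval` seconds and report any large clock jump."""
    last = time.time()
    for _ in range(iterations):
        time.sleep(interval)
        now = time.time()
        jump = find_time_jump(last, now, interval)
        if jump:
            # Assumption: a large jump suggests the VPS was suspended meanwhile.
            print(f"Clock jumped by about {jump:.0f}s; resetting timers")
        task()
        last = now
```

An application could use the reported jump to re-establish connections or discard stale timeouts rather than treating the gap as an error.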

Therefore the default is to not use suspend/restore. All of BitFolk's own infrastructure VMs (30+) do use it however, so it is successful more often than not.

Since this feature was first offered, around 80 BitFolk customer VPSes have been suspended and restored many times each, with only a few failures.

How to use

You can enable suspension in the runtime preferences section of the Panel.

As these are early days for the suspend feature, if you enable it then BitFolk will add a Nagios ping monitor to your VPS so that BitFolk can tell if restore was successful next time there is a host reboot. Since failures tend to be immediate, this will alert BitFolk to the problem and allow for a speedy manual reboot to minimise your downtime.

Limitations

The suspend preference currently only matters for occasions when BitFolk is performing scheduled maintenance on the host machine. Suspend will not take place if there is for some reason an unexpected reboot or crash of the host machine.

You also won't be able to trigger a suspend/restore yourself - although it's unclear why you would want to do this.

Migration

BitFolk can also use the suspend and restore mechanism to migrate your running VM between two different hosts without shutting your VM down. Your VM is suspended as described above but the memory image is then transferred to the target host and restored there, instead of restoring it on the same host.

At the moment this is a slightly experimental procedure that is only used prior to scheduled maintenance, by request. Instructions for requesting it will be provided in the maintenance notifications.

BitFolk is working towards bulk use of migration to empty hosts prior to maintenance, thus making migration the default.