This article contains notes about the Linux Softdog software watchdog kernel module, the watchdog software which makes use of it, and their configuration and use at BitFolk.
What is a watchdog timer?
A watchdog timer is a device that triggers a system reset if it detects that the system has hung. A program running on the system is expected to service the watchdog timer periodically by writing a "service pulse." If the watchdog is not serviced within a set period of time, it assumes that the system has hung and triggers a system reset.
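The rule described above can be sketched as a small shell function. This is only an illustrative model, not how any real watchdog is implemented: the function name watchdog_expired and its three arguments are hypothetical, and all times are plain numbers of seconds.

```shell
# Hypothetical model of a watchdog's decision; all times are in seconds.
# Returns 0 (true) when the timeout has passed without a service pulse,
# which is the point at which a real watchdog would trigger a reset.
watchdog_expired() {
    last_service=$1   # time of the most recent service pulse
    now=$2            # current time
    timeout=$3        # longest allowed gap between service pulses
    [ $((now - last_service)) -ge "$timeout" ]
}
```

For example, with a 60 second timeout, `watchdog_expired 100 130 60` is false because only 30 seconds have passed, while `watchdog_expired 100 160 60` is true.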
A BitFolk VPS should be at least as reliable as a normal piece of hardware, and if you're experiencing regular problems you should contact support to see if they can be resolved. However, a system can become unresponsive for many reasons, and you might decide that having a watchdog to reboot it would be desirable.
What is Softdog?
Usually, watchdog timers are implemented as add-on cards, or as on-chip peripherals within microcontrollers. This is clearly not feasible for a virtual machine, but the Linux kernel can provide a software watchdog implemented using kernel timers.
How Softdog works
The Softdog driver is usually loaded as a kernel module. It's been shipped as a module with Debian and Ubuntu kernels for a long time, and probably CentOS too. Once it's loaded it creates a device node, /dev/watchdog. When something opens that device file, the kernel starts a timer that will expire in 60 seconds by default. Every time there's a write to the device node the timer is reset. If the timer expires the kernel will reboot itself.
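The open-then-write cycle described above can be sketched in shell. This is a minimal illustration under stated assumptions, not the watchdog daemon's actual code: the function name pet_watchdog and its arguments are hypothetical. The final 'V' is the "magic close" character from the kernel's watchdog API; writing it just before closing asks the driver to disarm the timer, which should work for Softdog provided the module was not loaded with its nowayout option.

```shell
# Hypothetical helper: service a watchdog device a fixed number of times.
# Pointing it at a scratch file instead of /dev/watchdog lets the loop be
# tried without arming a real timer.
pet_watchdog() {
    dev=$1        # device node, e.g. /dev/watchdog
    count=$2      # number of service writes to make
    interval=$3   # seconds between writes; keep well below the timeout
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3      # any write resets the kernel's countdown
            i=$((i + 1))
            sleep "$interval"
        done
        printf 'V' >&3          # magic close: ask the driver to disarm
    } 3>"$dev"                  # opened once, closed when the group ends
}
```

Run as root, something like `pet_watchdog /dev/watchdog 6 10` would hold the device open for about a minute. Be aware that this arms the real timer for that time, so if the script is killed partway through, the machine should be expected to reboot.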
On Debian and Ubuntu, the userland side of this is provided by the watchdog package. This provides a daemon which (amongst other things) will open the watchdog device and periodically write to it to let the kernel know that userland is still alive.
/etc/default/watchdog
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Load module before starting watchdog
watchdog_module="softdog"
# Specify additional watchdog options here (see manpage).
watchdog_options="-b"
/etc/watchdog.conf (a lot of comments and some defaults removed for brevity; see the man page for full options)
watchdog-device = /dev/watchdog
realtime = yes
priority = 1
In this configuration, the watchdog daemon will only write to the watchdog device. There are other tests that you can configure the watchdog daemon to make, such as being able to fork processes, allocate memory, stat files, etc., but this article is just about setting up a last ditch attempt to reboot if userland dies.
It is safe to stop the watchdog daemon because it will close the watchdog device node, and the kernel only runs the watchdog timer while the device node is open. If you'd like to test the function of the watchdog you could, for example, pause the watchdog daemon:
# kill -STOP $(cat /var/run/watchdog.pid)
60 seconds after this, your VPS should log "SoftDog: Initiating system reboot." to the console and then immediately reboot. This won't be a graceful reboot; it won't even take the time to sync filesystems, so be careful.
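If you only want to confirm that pausing the daemon has an effect, without going as far as the reboot, the daemon can be resumed before the timeout is reached. The following is a rough sketch: the function name pause_daemon is hypothetical, and it assumes that the resumed daemon goes straight back to writing to the device, which resets the timer.

```shell
# Hypothetical helper: pause the process named in a pid file, then resume it.
pause_daemon() {
    pidfile=$1    # e.g. /var/run/watchdog.pid
    seconds=$2    # how long to pause; keep well below the 60 second timeout
    pid=$(cat "$pidfile") || return 1
    kill -STOP "$pid" || return 1   # daemon stops servicing the watchdog
    sleep "$seconds"
    kill -CONT "$pid"               # daemon resumes and writes again
}
```

Run as root, `pause_daemon /var/run/watchdog.pid 30` would suspend the daemon for 30 seconds. Choosing a pause longer than the timeout turns this back into the full reboot test.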