Backups

All about BitFolk's free local backups service.

Overview

BitFolk is happy to provide a 6-times-daily incremental rsync-based backup service, storing your data locally (in the same data centre but on different hardware from your VPS; there is no data transfer charge).

The service itself is free and includes 2GiB of storage. Additional backup storage can be purchased at the normal rate for SSD-backed storage. If you don't need the full 10GiB basic storage allocation then you can use some of that, or you can purchase extra storage for this purpose.

This is not the cheapest or most secure way to back up your data, but it may be useful in that it is fairly simple to enable and comes with built-in features like incremental data transfer, deduplication and alerting without you having to think about it. Access to your backed-up data will always be available over NFS.

Please note that no guarantees are made of the integrity or availability of backups made; they are provided on a reasonable effort basis.

Setup

There's a few simple things you need to set up on your VPS in order to start making use of BitFolk's backups.

Install rsync

BitFolk's backups use rsync for transferring data, so you need to make sure you have it installed.
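
For example, on Debian or Ubuntu:

$ sudo apt install rsync    # other distributions will have an rsync package of their own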

Allow SSH access from BitFolk's backups hosts

The backups happen over SSH so please allow access to your SSH server from the following hosts:

  • backup0-vip.bitfolk.com
  • backup2-vip.bitfolk.com
  • backup3-vip.bitfolk.com
  • backup4-vip.bitfolk.com

If you have your SSH server on a non-standard port (i.e. not port 22) that is fine, just mention that when telling BitFolk about the paths to back up.
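
If you run a host firewall then you will need to permit these hosts through it as well. As a rough sketch, assuming ufw and sshd listening on port 22 (adjust both to your own setup), something like this would resolve each backup host and allow it:

# resolve each backup host to its IPv4 address(es) and allow them to reach sshd
for h in backup0-vip.bitfolk.com backup2-vip.bitfolk.com \
         backup3-vip.bitfolk.com backup4-vip.bitfolk.com; do
    for ip in $(getent ahostsv4 "$h" | awk '{print $1}' | sort -u); do
        sudo ufw allow from "$ip" to any port 22 proto tcp
    done
done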

If you are allowing connections by host name then for recent versions of OpenSSH you may need to allow DNS lookups in /etc/ssh/sshd_config:

UseDNS yes

as the default for UseDNS has changed to no. Alternatively, allow connections by IP address, as the IP addresses should not change. Most people do not try to do access control by host name, so this most likely won't be an issue for you.

Add the rsnapshot public key

BitFolk will authenticate using a public key, so you'll need to add the rsnapshot SSH public key to your root user's .ssh/authorized_keys file.

Please note that this file is PGP-signed by key ID 2099B64CBF15490B and the only line from the file that you should use is the one that starts with 'ssh-rsa'.
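
For example, assuming you have saved the signed key file locally as rsnapshot-key.asc (the filename here is hypothetical; use whatever you downloaded from BitFolk), something like this would check the signature and append only the key line:

$ gpg --verify rsnapshot-key.asc    # signature should be from key ID 2099B64CBF15490B
$ grep '^ssh-rsa' rsnapshot-key.asc | sudo tee -a /root/.ssh/authorized_keys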

This will give BitFolk's backup servers full root access to your VPS. If you're okay with that then you're done, but if you'd prefer to restrict this key to only using the rsync command you can use a wrapper script, such as the one described below.

This strategy uses the rrsync script which is usually distributed with rsync. On Debian you can find a copy of it in the /usr/share/doc/rsync/scripts/ directory. After you have copied it to somewhere else (e.g. /usr/local/bin/) and made it executable you would put it in the user's $HOME/.ssh/authorized_keys file like this:

command="/usr/local/bin/rrsync -ro /",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding ssh-rsa AAA... bitfolk backup

That should allow access by that key, only to call rsync for read-only operations.
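
For example, on Debian something like the following would install the wrapper described above (on some releases the script is shipped compressed as rrsync.gz, in which case decompress it first; newer releases also package it as a standalone rrsync command):

$ sudo cp /usr/share/doc/rsync/scripts/rrsync /usr/local/bin/rrsync
$ sudo chmod 755 /usr/local/bin/rrsync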

Contact support with your list of paths

Now is the time to contact Support with a list of paths that you want to have backed up. This could be simply / (the root), although there would be a lot of things under there that don't need backing up, so you would probably prefer to list off a few top level directories instead.

Excluding data from being backed up

You can exclude things inside your selected paths by using rsync filter syntax in a file called .bitfolk-rsync-filter in the directory that contains whatever you wish to exclude.

For example, if you have asked for /var/ to be backed up, but you wish to exclude /var/log/apache/, then you would create the file /var/log/.bitfolk-rsync-filter with the following content:

- apache/

Filters only apply to the directory that the .bitfolk-rsync-filter file is in.

Once BitFolk lets you know that this is set up, backups will then take place according to the schedule you've chosen. You will not be charged for the bandwidth this uses, although it will show up on your Grafana graphs.

Schedules

The backups run every 4 hours, so that's six times per day.

Also…

  • …once per day the oldest four-hourly snapshot will become a daily snapshot, and…
  • …once per week the oldest daily snapshot will become a weekly snapshot, and…
  • …once per month the oldest weekly snapshot will become a monthly snapshot.

The timings of the schedules are fixed, but you can choose how many iterations of each level will be kept. The default schedule keeps:

  • 6 four-hourly snapshots, and
  • 7 daily snapshots, and
  • 4 weekly snapshots, and
  • 6 monthly snapshots.

So at the most granular you'll have access to a day of four-hourly changes and at the least granular there will be versions going back 6 months. We refer to this default schedule as 6-7-4-6.

If that level of retention is not to your liking then you can pick a schedule with whatever levels of retention you like, e.g. 3-3-2-18 would be:

  • Every 4 hours, retain the last 3
  • Seven times per week, retain the last 3
  • Four times per month, retain the last 2
  • Twelve times per year, retain the last 18

Obviously the higher levels of retention will lead to more data being stored which will increase the amount of storage you need to pay for. The majority of customers making use of the backup service just stick with the default 6-7-4-6 schedule.

Incremental backups

The backups are made incrementally; only changes against the most recent four-hourly snapshot will be transferred, and only changed files with relation to the most recent four-hourly snapshot will be stored. Files that never change will only be stored once.

Access to backups

If you need to restore files from your backups you can mount them using NFSv3 over TCP. The mount points will be shown in the Backups section of your Panel account, one for each snapshot.

A typical entry in /etc/fstab would look like this:

85.119.80.241:/data/backup/rsnapshot.6-7-4-6/hourly.0/85.119.82.75/ /mnt/backups/hourly.0 nfs ro,hard,intr,noauto,nfsvers=3,tcp 0 0

Your backups are only available read-only, so there is no way that anything your VPS does can corrupt them. They're also locked down to only be available from the main IPv4 address of your VPS. That also means that you can access them from the Rescue VM as that will also use your main IPv4 address.
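
As a sketch of a restore, assuming the fstab entry above and a file path that is purely an example, you might do:

$ sudo mount /mnt/backups/hourly.0
$ sudo cp -a /mnt/backups/hourly.0/etc/nginx/nginx.conf /etc/nginx/nginx.conf   # copy the old version back into place
$ sudo umount /mnt/backups/hourly.0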

Statistics

In the backups section of your Panel account you'll find some useful statistics. The most basic information includes the total on-media size of the data you have backed up, as well as the limit of what you're paying for. You'll also find a list of the paths that are being backed up.

Further down you will find the differential usage and per-snapshot usage figures.

Differential usage

This shows the volume of changed bytes between each snapshot. If no files change at all then the differential usage would be zero bytes, even though mounting the snapshot over NFS would show the full content of the files.

Per-snapshot usage

This shows the amount of data present in each snapshot, before duplicate files are considered. If you mounted this snapshot over NFS and copied the data out, this shows how much there would be.

Alerting

When BitFolk sets up your backups they will also configure some alerting to make you aware of common problems related to the backups. The two types of alert you can receive relate to usage and to the age of your backups.

Usage

You have a set amount of space available for use with backups. If you reach 95% of your limit then you will start to receive warning alerts from BitFolk's monitoring.

Once you reach 99% you'll start to receive critical alerts. The alerts will continue even if you exceed 100%, and will include both the percentage and the absolute amount of storage used.

If you are over 100% usage then no more backups will take place. Aging of snapshots will continue though, with the most recent snapshot just being a clone, so eventually you will go back below your limit and a new backup will be taken.

Age

If the time since the last successful backup run passes a certain threshold then you will start to receive alerts about the age of your backups. The exact times depend on which schedule you've chosen, but for anyone with four-hourly backups (which includes everyone on the default schedule) it's 10 hours for a warning alert and 24 hours for a critical alert.

The most common reason for backups not happening successfully is that you're using too much space, in which case you would also have been receiving usage alerts. Other common reasons include your VPS having crashed, or SSH access being broken in some way, e.g. you've reinstalled and not allowed access to the rsnapshot public key, or your SSH fingerprints have changed. Check your Panel, check your logs, and then contact Support if necessary.

Please bear in mind that there are lots of things that can go wrong with backups that still involve backup runs appearing to be successful! Just because you are not receiving alerts about the age of your backups does not mean that your backups are correct. You need to verify that yourself.
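
One simple (and far from exhaustive) spot check, assuming the mount point used in the examples on this page, is to mount the newest snapshot and compare a file against the live copy:

$ sudo mount /mnt/backups/hourly.0
$ sudo diff /etc/passwd /mnt/backups/hourly.0/etc/passwd && echo "matches latest snapshot"
$ sudo umount /mnt/backups/hourly.0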

Frequently Asked Questions

How do I reduce the amount being backed up?

There are several ways you can control the amount of data that is backed up.

Remove some paths from your backups

At the moment you need to contact Support and ask for this to be done.

Exclude some things from being backed up

It's very likely that your backups contain things that don't need to be backed up. We can't tell you what needs to be backed up, so it's best to ask yourself the question, "if this data ceased to exist right now, and I needed it, how long would it take me to re-create it?" Data that comes with your operating system and its packages is probably going to be quite simple to restore. Temporary data and caches of all kinds may not be worth backing up at all.

You can exclude things yourself with the .bitfolk-rsync-filter files. They will eventually age out of your backups.
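
To get an idea of what is worth excluding, one quick approach (assuming /var is one of your backed-up paths) is to look for the largest directories under it:

$ sudo du -xh --max-depth=2 /var | sort -rh | head -20    # 20 biggest directories, largest first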

Switch to a schedule with less retention

Clearly the less data you retain the lower the amount of storage you need. You'll need to ask Support to do this for you.

Do bear in mind though that retaining data that rarely changes is cheap whereas constantly-changing data uses a lot of storage even for relatively low retention. Therefore you are usually better off finding things that you needn't be backing up and excluding them, than just keeping it all for a shorter period of time.

I've exceeded my backup quota and my backups have been disabled. How do I re-enable them?

In an ideal world you see the alerts coming and adjust things before 100% is reached, but it's not always possible. Once you're past 100% there's a few different things you can do.

Dedicate some more space to backups

Storage is still relatively cheap. On a monthly contract, 5GiB of storage is £0.40+VAT per month. It's a little less on quarterly and yearly contracts. You can just order some more and dedicate some or all of it to backups.

You could also ask for some unused storage to be taken away from your VPS and used for backups, though this is a bit more disruptive because shrinking filesystems requires a little bit of VPS down time.

Ask Support to nuke some snapshots

If you've gone over quota because you accidentally left some very large files in a place that was being backed up, the quickest way to return to normality may be to ask Support to remove the most recent snapshots that include the anomalous files. This will be done by cloning the newest snapshot that doesn't contain the files over the top of each one that does. Therefore this is not a good way to get rid of things that are very old as it would entail losing a lot of more recent snapshots.

Just wait for it to resolve itself

It isn't a great idea, but the problem definitely will resolve itself eventually because your snapshots continue being rotated without new data being backed up, so data at the limit of retention will be destroyed. For example, on the default 6-7-4-6 schedule all data will be gone after 6 months. As soon as you go below 100% the backups will start again so if you're just backing up too much data then you'll quickly go above 100% again.

Can Support just delete some files out of my backups?

In general, no. BitFolk would really rather not go digging through your backups. If we can resolve the problem by completely removing a snapshot then we'd rather do that. But if absolutely necessary then yes Support can do this.

Is there any easy way to see which files changed between snapshots?

Yes.

Examining small numbers of files

Mount two consecutive snapshots, e.g.:

$ grep backup /etc/fstab
85.119.80.213:/data/backup/rsnapshot/hourly.0/85.119.82.121/ /mnt/backups/hourly.0 nfs ro,hard,intr,noauto,nfsvers=3,tcp 0 0
85.119.80.213:/data/backup/rsnapshot/hourly.1/85.119.82.121/ /mnt/backups/hourly.1 nfs ro,hard,intr,noauto,nfsvers=3,tcp 0 0
$ sudo mkdir -vp /mnt/backups/hourly.{0,1}
mkdir: created directory `/mnt/backups/hourly.0'
mkdir: created directory `/mnt/backups/hourly.1'
$ sudo mount /mnt/backups/hourly.0
$ sudo mount /mnt/backups/hourly.1

Files that haven't changed will have the same inode. Using ls -i that would be the first column, e.g.:

$ ls -lai /mnt/backups/hourly.*/etc/passwd
918015 -rw-r--r-- 2 root root 2376 Apr 21  2016 /mnt/backups/hourly.0/etc/passwd
918015 -rw-r--r-- 2 root root 2376 Apr 21  2016 /mnt/backups/hourly.1/etc/passwd

Files that have changed will have different inodes, e.g.:

$ ls -lai /mnt/backups/hourly.*/home/andy/.bash_history 
921955 -rw------- 1 andy andy 17371 Dec 26 21:00 /mnt/backups/hourly.0/home/andy/.bash_history
920717 -rw------- 1 andy andy 17319 Dec 26 14:00 /mnt/backups/hourly.1/home/andy/.bash_history

Obviously changed files will also have different contents, so a sha256sum or similar would also show this. Comparing inodes is much quicker though, and shows you which versions of the file are actually taking up storage space.

Examining file changes in bulk

  1. Get the rsnapshot-diff program. On Debian/Ubuntu it can be found in the rsnapshot package, or you can just install it from upstream.
  2. Run it against your two snapshot directories:
$ sudo rsnapshot-diff -sv /mnt/backups/hourly.*
Comparing /mnt/backups/hourly.1 to /mnt/backups/hourly.0
+ 0 /mnt/backups/hourly.0/home/.backup_sentinel
- 0 /mnt/backups/hourly.1/home/.backup_sentinel
+ 17371 /mnt/backups/hourly.0/home/andy/.bash_history
+ 858 /mnt/backups/hourly.0/home/andy/.Xauthority
- 17319 /mnt/backups/hourly.1/home/andy/.bash_history
- 858 /mnt/backups/hourly.1/home/andy/.Xauthority
+ 21 /mnt/backups/hourly.0/home/andy/src/twitfolk-dg/etc/last_tweet
- 21 /mnt/backups/hourly.1/home/andy/src/twitfolk-dg/etc/last_tweet
+ 63460507 /mnt/backups/hourly.0/home/andy/src/twitfolk-dg/var/twitfolk.out
- 63445848 /mnt/backups/hourly.1/home/andy/src/twitfolk-dg/var/twitfolk.out
+ 0 /mnt/backups/hourly.0/etc/.backup_sentinel
- 0 /mnt/backups/hourly.1/etc/.backup_sentinel
+ 35 /mnt/backups/hourly.0/etc/openvpn/ipp.txt
+ 232 /mnt/backups/hourly.0/etc/openvpn/openvpn-status.log
- 35 /mnt/backups/hourly.1/etc/openvpn/ipp.txt
- 232 /mnt/backups/hourly.1/etc/openvpn/openvpn-status.log
Between /mnt/backups/hourly.1 and /mnt/backups/hourly.0:
  8 were added, taking 63479024 bytes;
  8 were removed, saving 63464313 bytes;

It can take a little bit of practice to decipher this output.

The + or - symbol tells you whether the file was added or removed. Then follows the size in bytes and finally the path.

We can see that a ~17kB file called hourly.1/home/andy/.bash_history went away, but a new ~17kB file called hourly.0/home/andy/.bash_history was added. Obviously the user typing a few extra commands caused the file contents to be different between backup runs so a completely new file was stored in the backups.

We could focus on the larger files that are being added with use of grep and sort:

$ sudo rsnapshot-diff -sv /mnt/backups/hourly.* | grep ^+ | sort -rnk 2 | head -5
+ 63460507 /mnt/backups/hourly.0/home/andy/src/twitfolk-dg/var/twitfolk.out
+ 17371 /mnt/backups/hourly.0/home/andy/.bash_history
+ 858 /mnt/backups/hourly.0/home/andy/.Xauthority
+ 232 /mnt/backups/hourly.0/etc/openvpn/openvpn-status.log
+ 35 /mnt/backups/hourly.0/etc/openvpn/ipp.txt

The grep is only keeping lines that start with a plus symbol. The sort is using the second key (-k 2) for a reverse (-r) numeric (-n) sort. head is only showing us the top 5 results (-5).

So, home/andy/src/twitfolk-dg/var/twitfolk.out used up ~60MiB, which is no surprise since it is an un-rotated log file which is always being appended to. It will change with every backup and should probably be excluded from backups, or at least rotated with only a single compressed version being kept.

Can you export my backups to somewhere else?

No, sorry. Backups are to be accessed only by your VPS's IP address. If your VPS isn't functional then you can boot the Rescue VM and access backups from there, as that will use your VPS's IP address(es).

Can my backup space be made writable so I can use it for additional storage?

No. Backups must remain read-only so we can have more confidence in their integrity. If you have bought additional backup storage but now don't need it and want to have it removed or added back to your VPS then just contact Support and ask for this. It may even be possible to do it without you having to reboot your VPS. Note that the initial 2GiB free allowance of backup space cannot be added to your VPS as regular storage.

Does it really need to run as root?

Most people will want to run the backups as root to ensure that access to all files is available. Other than file access there is no reason why it must connect as root though, so as long as you can arrange for all files to be accessible you can ask for any user you like. You might use filesystem ACLs to allow a non-root user to read files they wouldn't otherwise be able to.
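
As an illustrative sketch only, granting a hypothetical non-root user called bitfolk-backup read access to /etc with ACLs might look like this:

$ sudo setfacl -R -m u:bitfolk-backup:rX /etc                          # read access to existing files and directories
$ sudo find /etc -type d -exec setfacl -d -m u:bitfolk-backup:rX {} +  # default ACLs so newly-created files are covered too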

You can force commands with SSH public keys to restrict what connecting programs can do. You can't just force the rsnapshot public key to use rsync as it needs to pass some arguments, but you can use a wrapper script to try to ensure that only an rsync command is used.

Can it be made to follow symlinks?

The short answer is no.

The longer answer is: at the moment the backup system can't do per-customer rsync settings, and the default is not to follow symlinks, so that's what happens for everyone. It could possibly be extended at some point to allow per-customer settings, but so far all actual use cases have been solved by other means.

The typical reason for wanting the backup rsync to follow symlinks is so that directories to be backed up can be linked under some single directory tree like /srv/backup, that way easily showing what is being backed up and what isn't: if it exists under /srv/backup then it's being backed up otherwise not.

There is another school of thought that it would be better to back up everything and then exclude the things that shouldn't be backed up. That way you know it's being backed up unless it's being specifically excluded. The way to exclude things from a BitFolk backup is by using a .bitfolk-rsync-filter file.

If you are still in favour of only backing up things that exist under a single directory tree then you can use bind mounts instead of symlinks:

# mkdir -vp /srv/backup/etc
# mount --rbind /etc /srv/backup/etc

The above example makes your /etc directory available as /srv/backup/etc. This also works for files, not just directories. It is even possible to make such a mount read-only, while the source remains read-write.

Don't forget to put the bind mount in your /etc/fstab like other mounts, so that it re-appears at boot time.
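
One way to express the bind mount above in /etc/fstab is:

/etc  /srv/backup/etc  none  rbind  0  0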

Limitations

There are always trade-offs when deciding on a backup strategy and the solution offered here by BitFolk may not be the most suitable for you. It is important that you realise what its limitations are and make a decision for yourself. With that in mind, here's a bit more detail about some of the limitations. If it's not going to work for you then maybe one of the alternatives would be more suitable.

Backups are stored locally

All of BitFolk's servers are in the same facility. Although your backup data is physically separate from your VPS's data, if there is a major outage in that locality then you may not have access to your data for some time. A fire, bomb or some other serious event could physically destroy both your VPS's storage and the storage that your backups are on.

Backups are not compressed

Backed up data is stored bit-for-bit the same as it was read. As it is infrequently-accessed it could benefit from compression. Perhaps the btrfs or zfs filesystems could be used in future to provide compression.

No filesystem-level snapshots are being taken

BitFolk's backup system connects to your VPS by SSH and invokes rsync. The files it reads can change while it is reading them. Data that is changing all the time, e.g. database files, probably cannot be successfully backed up in this manner. You should take care to dump the contents of databases to static files and have those backed up instead.
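
As a sketch, a cron job could refresh a static dump once a day so that the dump, rather than the live database files, is what gets backed up. The database name, user and path here are only examples, and the dump directory must exist, be writable by the postgres user and be among your backed-up paths:

# /etc/cron.d/dump-databases (illustrative)
15 3 * * *  postgres  pg_dump --format=custom --file=/var/lib/postgresql/dumps/mydb.pgdump mydb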

If any level of the schedule has zero retention, all lower levels must also have zero retention

The schedules are tiered in that the oldest four-hourly snapshot becomes a daily, the oldest daily becomes a weekly, and so on. This means that if you don't actually have any retention at a given level then you can't have any retention at levels below it. So, 6-0-0-0 (keep just six four-hourly snapshots) is fine, but 6-0-4-0 is not valid. You'd at the very least need to do it as 6-1-4-0 (keep six four-hourly snapshots, one daily snapshot and four weekly snapshots).

Changes are stored on a per-file basis not a per-block basis

If a file changes then both copies of the file will be stored in their entirety. More sophisticated backup solutions would only store changed blocks between the two files.

Files are only compared against the exact same file path in the most recent snapshot

Files that alternate between two states will keep being stored in their entirety as the only deduplication being done is against the most recent copy. Also moving a file to a new name (a common thing to do with log file rotation) will result in both the old and new copies being stored. This can result in a large amount of storage being used to keep renamed copies of things there is already a backup of. Consider excluding some/all rotated logs.
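
For example, if /var/log/ is among your backed-up paths, a /var/log/.bitfolk-rsync-filter containing something like the following (the patterns are just examples) would skip rotated and compressed copies:

- *.1
- *.gz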

Any metadata change will cause a new copy to be stored

Deduplication in this system is provided by hard-linking the new file path to the old file path where the files are identical. As hardlinks all have identical metadata (owner, group, permissions, etc.), any change of metadata will force a new file to be stored.

Backups are only available over IPv4

At the moment the backups are only available for NFS mounting over IPv4 and backups are only taken over IPv4. This shouldn't matter too much because at the moment every BitFolk VPS comes with one free IPv4 address, so it is only an ideological issue for those who wish to make every service available over IPv6 (or who wish to run without IPv4).

At some point BitFolk will have to solve this as it is theoretically possible that a VPS will come with no IPv4 addresses as standard.

Backups are not stored encrypted

You will need to trust BitFolk staff not to access your backed up data. Also, given that this is a "pull" setup, should BitFolk's backup servers be compromised the attacker would have unrestricted root access to your VPS. You can restrict the SSH public key in use to only have access to the rsync command but that would still allow arbitrary access to your data.

Alternative backup strategies

If the limitations of this service are too great, or if it just doesn't work how you'd like it to work, you should find some other way to do your backups. Here are some suggestions. Please feel free to add more.

Do it yourself on S3

If you use Amazon's S3 as a back end and create the backup logic yourself you can end up with a very flexible and very cheap solution.

Many backup solutions like Duplicity can use S3 as a back end and also store the backups encrypted.

rsync.net

Tarsnap

Tarsnap is encrypted, deduplicated and compressed. It uses S3 as a back end. The Tarsnap client can only be used together with the (pay as you go) Tarsnap service.

Borg backup/Attic

Borg (a fork of Attic) deduplicates in blocks, compresses and encrypts. You can also mount backups. https://borgbackup.readthedocs.io/en/stable/ https://attic-backup.org/