Maintenance/2021-05-Re-racking


This is the wiki version of the announcement email. It may contain updates compared to the email version.

Update: 2021-05-27

We've been able to move all customers off snaps.bitfolk.com ahead of this work. We are still relocating that server, but no customers will be affected by its move. Customers on the other three servers will still be affected.

TL;DR:

We need to relocate some servers to a different rack within Telehouse.

On Thursday 27 May 2021 at some point in the 3 hour window starting at 22:00Z (23:00 BST) all customers on the following servers will have their VMs either powered off or suspended to storage:

  • elephant.bitfolk.com
  • limoncello.bitfolk.com
  • talisker.bitfolk.com
  • snaps.bitfolk.com

We expect the work on each server to take less than 30 minutes.
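If you are not in the UK, the window given above can be converted to your own timezone. The following is only a sketch: the start time and 3 hour length come from this announcement, while the helper name `window_in` is made up for illustration and BST is modelled as a fixed UTC+1 offset.

```python
from datetime import datetime, timedelta, timezone

# Window start and length as stated in this announcement.
WINDOW_START_UTC = datetime(2021, 5, 27, 22, 0, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(hours=3)

# BST is UTC+1; a fixed offset avoids depending on a timezone database.
BST = timezone(timedelta(hours=1), "BST")


def window_in(tz):
    """Return (start, end) of the maintenance window in the given timezone."""
    start = WINDOW_START_UTC.astimezone(tz)
    return start, start + WINDOW_LENGTH


if __name__ == "__main__":
    start, end = window_in(BST)
    print(f"{start:%a %d %b %H:%M %Z} to {end:%a %d %b %H:%M %Z}")
```

Pass any other `datetime.timezone` (or `zoneinfo.ZoneInfo`, where a timezone database is installed) to `window_in` to see the window in local time. Note that the window runs past midnight in BST.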

See "Frequently Asked Questions" at the bottom of this article for how to determine which server your VM is on.

If you can't tolerate a ~30 minute outage at these times then please contact support as soon as possible to ask for your VM to be moved to a server that won't be part of this maintenance.

Maintenance Background

Our colo provider needs to rebuild one of their racks, which houses 4 of our servers. This is required because the infrastructure in the rack (PDUs, switches etc.) is ten years old and all needs replacing. To facilitate this, all customer hardware in that rack will need to be moved to a different rack or sit outside the rack while it is rebuilt. We are going to have to move our 4 servers to a different rack.

This is a significant piece of work which is going to affect more than 25% of the customer base. Unfortunately it is unavoidable.

Networking Upgrade

We will also take the opportunity to install 10 gigabit NICs in the servers which are moved. The main benefit of this will be faster inter-server data transfer for when we want to move customer services about. The current 1GE NICs limit this to about 90MiB/sec.
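To give a feel for what the roughly 90MiB/sec figure means in practice, here is a rough calculation of how long it takes to copy a VM's storage at that rate. The disk sizes are hypothetical and chosen only for illustration, and `transfer_seconds` is not a BitFolk tool. It also ignores any overhead beyond raw throughput.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Practical 1GE throughput quoted in this announcement.
RATE_1GE_MIB_PER_SEC = 90


def transfer_seconds(size_gib, rate_mib_per_sec):
    """Seconds needed to copy size_gib GiB at rate_mib_per_sec MiB/sec."""
    return size_gib * GIB / (rate_mib_per_sec * MIB)


if __name__ == "__main__":
    # Hypothetical disk sizes, for illustration only.
    for size_gib in (10, 50, 200):
        minutes = transfer_seconds(size_gib, RATE_1GE_MIB_PER_SEC) / 60
        print(f"{size_gib:>4} GiB at 90 MiB/sec: about {minutes:.1f} minutes")
```

The same function can be called with a higher rate to estimate the effect of the 10 gigabit NICs, but we have not quoted a real-world figure for those here, so any such number would be a guess.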

Suspend & Restore

If you opt in to suspend & restore then instead of shutting your VM down we will suspend it to storage and then when the server boots again it will be restored. That means that you should not experience a reboot, just a period of paused execution. You may find this less disruptive than a reboot, but it is not without risk. Read more about that in the Suspend and restore article.

Avoiding the Maintenance

If you cannot tolerate a ~30 minute outage during the maintenance window listed above then please contact support to agree a time when we can move your VM to a server that won't be part of the maintenance.

Doing so will typically take just a few seconds, plus the time it takes your VM to shut down and boot again. Nothing will change about your VM.

If you have opted in to suspend & restore then we'll use this to do a "semi-live" migration. This will appear to be a minute or two of paused execution.

Moving your VM is extra work for us which is why we're not doing it by default for all customers, but if you prefer that to experiencing the outage then we're happy to do it at a time convenient to you, as long as we have time to do it and available spare capacity to move you to. If you need this then please ask as soon as possible to avoid disappointment.

It won't be possible to change the date/time of the planned work on an individual customer basis. This work involves 4 of our servers, will affect hundreds of our customers, and also had to be scheduled with our colo provider and some of their other customers. The only per-customer thing we may be able to do is move your service ahead of time at a time convenient to you.

Rolling Upgrades Confusion

We're currently in a cycle of rolling software upgrades to our servers. Many of you have already received individual support tickets to schedule that. It involves us moving your VM from one of our servers to another and full details are given in the support ticket.

This has nothing to do with the maintenance that's under discussion here and we realise that it's unfortunately very confusing to have both things happening at the same time. We did not know that moving our servers would be necessary when we started the rolling upgrades.

At the time of the original announcement we were moving customers off snaps.bitfolk.com (which will be part of this maintenance) and on to servers that won't be part of this maintenance.

We've now moved all customers off of snaps.bitfolk.com and that has reduced the number of customers who will be affected by this maintenance.

Further Notifications

Every customer is supposed to be subscribed to the announcement mailing list, but no doubt some aren't. The movement of customer services between our servers may also be confusing for people, so we will send a direct email notification to the main contact of affected customers a week before the work is due to take place.

So, on Thursday 20 May we'll send a direct email about this to customers that are hosted on the affected servers.

Frequently Asked Questions

How do I know if I will be affected?

If your VM is hosted on one of the servers that will be moved then you are going to be affected. There are a few different ways that you can tell which server you are on:

  1. It's listed on our web panel
  2. It's in DNS when you resolve <youraccountname>.console.bitfolk.com
  3. It's on your data transfer email summaries
  4. You can see it on a traceroute or mtr to or from your VPS.
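As a rough illustration of the DNS method, the sketch below resolves the console name and compares what comes back against the servers listed in this announcement. It assumes that the host server appears as the canonical name or an alias in the DNS answer; the exact record layout is not described here, so treat that as a guess. The function names are made up for this example.

```python
import socket
import sys

# Servers named in this announcement as part of the maintenance.
AFFECTED_SERVERS = {
    "elephant.bitfolk.com",
    "limoncello.bitfolk.com",
    "talisker.bitfolk.com",
    "snaps.bitfolk.com",
}


def is_affected(server_name):
    """True if server_name is one of the servers being moved."""
    return server_name.strip().rstrip(".").lower() in AFFECTED_SERVERS


def lookup_console_names(account_name):
    """Resolve the console name and return every hostname DNS reports.

    Assumption: the host server shows up as the canonical name or an
    alias in the answer.
    """
    console = f"{account_name}.console.bitfolk.com"
    canonical, aliases, _addresses = socket.gethostbyname_ex(console)
    return [canonical, *aliases]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_server.py <youraccountname>")
    names = lookup_console_names(sys.argv[1])
    hits = [name for name in names if is_affected(name)]
    print("Affected server found:" if hits else "No affected server seen in:",
          ", ".join(hits or names))
```

If the output is inconclusive, the web panel or your data transfer email summaries are the more direct ways to check.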

If you can "semi-live" migrate VMs, why don't you just do that?

  • This maintenance will involve some 25% of our customer base, so we don't actually have enough spare hardware to move customers to.
  • Moving the data takes significant time at 1GE network speeds.

For these reasons we think that it will be easier for most customers to just accept a ~30 minute outage. Those who can't tolerate such a disruption will be able to have their VMs moved to servers that aren't going to be part of the maintenance.

Further questions?

If there's anything we haven't covered or you need clarified please do ask on the users mailing list or privately to support.