The up-time race and why Vade are not starting

For those of you who have been shopping around, you will probably have noticed a lot of hosts saying they offer an uptime of N (where N is some randomly picked 99.999%-type number, or even 100% in some cases). What these numbers fail to tell you is what they actually mean. Asking for clarification will usually get you pointed towards an obtuse terms document. Some hosts will even try to pull the wool over your eyes regarding what happens when they don’t meet these numbers. In most cases there will be credits available if your site is offline for a certain amount of time; the standard seems to be 1 day of credit for every hour offline outside of the level quoted above.

The Numbers

Availability   Downtime per day   Downtime per month   Downtime per year
99.999%        00:00:00.9         00:00:26             00:05:15
99.99%         00:00:08           00:04:22             00:52:35
99.9%          00:01:26           00:43:49             08:45:56
99%            00:14:24           07:18:17             87:39:29
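The figures in a table like this come from one multiplication: the length of the period times the fraction of time the host is allowed to be down. Below is a minimal sketch of that calculation. The period lengths are an assumption (an average year of 365.2425 days, and a month of one twelfth of that), and `allowed_downtime` is a name made up for this illustration.

```python
from datetime import timedelta

# Assumed period lengths in seconds: an average Gregorian year of
# 365.2425 days, and a month taken as one twelfth of that year.
SECONDS = {
    "day": 86_400,
    "month": 86_400 * 365.2425 / 12,
    "year": 86_400 * 365.2425,
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the downtime permitted in one period at the given availability."""
    fraction_down = 1 - availability_percent / 100
    return timedelta(seconds=SECONDS[period] * fraction_down)


# Print the allowance for each availability level and period.
for level in (99.999, 99.99, 99.9, 99.0):
    row = ", ".join(f"{p}: {allowed_downtime(level, p)}" for p in SECONDS)
    print(f"{level}% -> {row}")
```

Whether a published figure is rounded or truncated to whole seconds is a presentation choice, so values can differ by a second from one table to another.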

So let’s say we were to offer a 99.99% up-time “guarantee” on our Starter Web Hosting Package. Over a year, your site could be offline for 1 hour, 52 minutes and 35 seconds (the 52 minutes and 35 seconds the figure allows, plus one full hour beyond it) before we would have to give a credit of 1 day, a credit that is actually only worth about €0.14. What is more, most of these “guarantees” have big get-out-of-jail sections allowing the host in question to say the outage was not caused by them, and therefore no credits are due.
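To make the arithmetic concrete, here is a rough sketch of how such a credit might be worked out. It assumes the "1 day per hour" rule only pays out for complete hours beyond the allowance, which is one possible reading of a typical terms document rather than any particular host’s policy. The function name `credit_days` is hypothetical, and the €0.14 day value is the figure quoted above.

```python
def credit_days(outage_seconds: int, allowance_seconds: int) -> int:
    """Days of credit owed, assuming 1 day per complete hour beyond the allowance."""
    excess = max(0, outage_seconds - allowance_seconds)
    return excess // 3600  # partial hours earn nothing under this assumption


ALLOWANCE_SECONDS = 52 * 60 + 35   # yearly allowance at 99.99%: 52 min 35 s
DAY_VALUE_EUR = 0.14               # approximate value of one day, as quoted above

outage = 3600 + ALLOWANCE_SECONDS  # 1 h 52 min 35 s offline in the year
days = credit_days(outage, ALLOWANCE_SECONDS)
print(days, round(days * DAY_VALUE_EUR, 2))
```

Under these assumptions, an outage one second shorter would earn no credit at all.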

Then comes the question of how the up-time is measured. Is an outage only counted after a defined period during which the monitoring software can’t contact the server, or is every failed attempt counted as an outage? How you check the server will also have an effect on the number you end up with, as will where the checking happens from. If checking happens in Ireland but then the international connectivity to Ireland is reduced (this happened a few months back) or even completely removed, does that count as an outage even though everyone in Ireland can still talk to the servers? It all remains very gray at the best of times.
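The effect of the counting rule is easy to see with a small example. The sketch below applies two hypothetical policies to the same made-up series of one-minute probe results: one counts every failed check, the other only counts a run of failures once it reaches a threshold length. Neither is taken from a real monitoring product, and the function names are invented for illustration.

```python
def downtime_every_failure(checks: list[bool], interval_s: int) -> int:
    """Count every failed check as one full interval of downtime."""
    return sum(interval_s for ok in checks if not ok)


def downtime_with_threshold(checks: list[bool], interval_s: int, threshold: int) -> int:
    """Count a run of failed checks only if it is at least `threshold` checks long."""
    total = 0
    run = 0
    for ok in checks + [True]:  # trailing sentinel closes a final run of failures
        if not ok:
            run += 1
        else:
            if run >= threshold:
                total += run * interval_s
            run = 0
    return total


# Made-up probe results, one per minute: True = reachable, False = failed.
checks = [True, False, True, True, False, False, False, True, False, True]

print(downtime_every_failure(checks, 60))      # 300 seconds
print(downtime_with_threshold(checks, 60, 3))  # 180 seconds
```

The same ten minutes of monitoring give either five or three minutes of downtime, depending only on which rule the host chose.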

Recently, when we here in Vade’s Web Hosting Team had a meeting about up-time, we calculated all the different bits that go into an up-time figure. This covers everything from the time between hard disk failures to the time involved in running a new network cable should an existing one fail. What we came up with was that we could provide 100% up-time 99.999% of the time; the other 0.001% would create outages that would destroy the up-time stats for years if they happened. In real terms, one of these major problems for every 100 000 other problems is all that it takes. And who’s to say when this one time will actually happen?
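To show how a single incident can dominate the statistics, the following sketch estimates how many years of otherwise perfect service would be needed before one outage fits inside a given availability target. The four-hour outage is a hypothetical figure chosen for illustration, the year length is an assumed 365.2425-day average, and `years_to_absorb` is an invented name.

```python
SECONDS_PER_YEAR = 86_400 * 365.2425  # assumed average year length


def years_to_absorb(outage_seconds: float, availability_percent: float) -> float:
    """Years of otherwise perfect service needed for one outage to meet the target."""
    yearly_allowance = SECONDS_PER_YEAR * (1 - availability_percent / 100)
    return outage_seconds / yearly_allowance


# Hypothetical four-hour outage measured against a 99.999% target.
print(round(years_to_absorb(4 * 3600, 99.999), 1))
```

With these inputs the answer comes out at several decades, which is the sense in which one bad day can wreck a five-nines record.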

For instance, all our servers have redundant hard drives, so one disk can fail and things will keep going without anyone noticing (except, of course, our on-call engineer, who has just been woken up to go and replace it; these things never fail during the daytime). If two hard disks were to go at once, we would need to install two new hard disks, rebuild the disk array, and then restore the files from the backup server (we have a rolling 7-day backup system). Total downtime could be a few hours or more, depending on which two disks failed and how far the on-call engineer was from the server with the problem (Vade runs a distributed network). The same is true of power supply failures, CPU failures, or the failure of any other part of the server. In most cases it is quicker to swap out the machine completely and move the hard disks across, but again that requires time to physically move things, a lot more than the 5 minutes per year available with 99.999% up-time.

So after considering all the facts, we decided we wouldn’t pick a figure on which to base our up-time requirements. Any outage is already viewed as a failure on our part and gets a full report on why it happened and what can be done to ensure it doesn’t happen again. Our customers are also welcome to request credits for an outage, and I will say that we don’t use the 1 hour = 1 day rule, although it has been a while since we have issued any credits. In fact, the last incident we issued credits for wasn’t even an outage that stopped anyone accessing the site. All I can say is that we are generous because without you, our customers, we wouldn’t be operating.

Edit: We decided to share one of the external monitoring services we use. There are issues with using it but it should give an indication of our uptime from an unbiased party. Just click on the image below. It changes to display the actual uptime too.

[Image: nslookup]
