Yesterday, a number of internet services, including LiveJournal, Sun.com, Yelp, Technorati, Craigslist, and a good section of Second Life, went belly up at the same time. The cause? A power outage in San Francisco, followed (more importantly) by some sort of backup power failure at a big-time co-location facility there.
For those not familiar with the term, a co-location facility (sometimes called a co-lo) is a beefy building with all of the infrastructure required for housing your servers. They charge lots of money and, in exchange, generally provide you with lots of air conditioning, protected rooms, and uninterruptible power. This particular facility, at 365 Main in San Francisco, touts itself as having 100% uptime in the recent past. As a matter of fact, just a couple of days ago, they put out a press release specifically bragging about their uptime. Of course, that press release has now been taken offline. Apparently a line (or “mob”, if you will) of sysadmins gathered outside the facility when the thing went down.
According to their white paper and specs (and, one would assume, their service level agreements), this wasn’t supposed to happen. Their “Ten 2.1-megawatt Hitec Continuous Power Systems”, “three 20,000 gallon double-lined fuel tanks”, and “extended protection time by means of the integrated diesel engine” aren’t worth anything if they don’t actually switch on when the power fails. Some folks were saying that a disgruntled employee was to blame. I find it somewhat unlikely that that would happen at the same moment as a city-wide power outage.
But, on the other hand, Robert Goulet running around and slamming all of the red power-interrupt buttons would explain why the backup power didn’t help any!
Anyway, the moral of the story for datacenter operators is: test, test, test. Unless, of course, you figure that the extra money you’re bilking out of your customers and saving by not testing will make up for the millions of dollars in lawsuits caused by the outage. And for the service operators and architects, putting all your eggs in one basket really still isn’t a good idea, even if it’s a big fancy basket with supposed “100% uptime.” But hey, those are the risks you’ve got to take. In some cases, it’s not worth it to build out or rent a totally redundant data center just to cover a once-in-a-blue-moon co-lo outage that lasts a few hours.
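If you want to put rough numbers on that last trade-off, here’s a back-of-the-envelope sketch in Python. Every figure in it (how often outages happen, how long they last, what an hour of downtime costs you, what a second site costs per year) is a made-up assumption for illustration, not data from 365 Main or any of the affected sites, and the function names are hypothetical.

```python
def expected_outage_cost(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly loss from downtime at a single facility."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_pays_off(outages_per_year, hours_per_outage, loss_per_hour,
                        redundancy_cost_per_year):
    """True if a second site costs less per year than the downtime it would cover.

    Simplifying assumption: the redundant site eliminates the downtime entirely.
    """
    loss = expected_outage_cost(outages_per_year, hours_per_outage, loss_per_hour)
    return loss > redundancy_cost_per_year


if __name__ == "__main__":
    # Hypothetical small site: one 3-hour outage every two years,
    # losing $10,000 per hour of downtime.
    loss = expected_outage_cost(0.5, 3, 10_000)
    print(f"Expected yearly downtime loss: ${loss:,.0f}")

    # Hypothetical price tag for a fully redundant second site: $200,000/year.
    print("Second site worth it?", redundancy_pays_off(0.5, 3, 10_000, 200_000))
```

With those invented numbers the second site doesn’t come close to paying for itself. Crank the hourly loss up to what a big commerce site might lose, though, and the answer can flip, which is really the whole point: it depends on your numbers, and this sketch ignores things like reputation damage and SLA credits.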