Yesterday, a number of internet services, including LiveJournal, Sun.com, Yelp, Technorati, Craigslist, and a good section of Second Life went belly up at the same time. The cause? A power outage in San Francisco, followed (more importantly) by some sort of failure of backup power at a big-time co-location facility there.
For those not familiar with the term, a co-location facility (sometimes called a co-lo) is a beefy building with all of the infrastructure required for housing your servers. They charge lots of money, and in exchange, generally provide you with lots of air conditioning, protected rooms, and uninterruptible power. This particular facility, at 365 Main in San Francisco, touts itself as having had 100% uptime in the recent past. As a matter of fact, just a couple of days ago, they put out a press release specifically bragging about their uptime. Of course, that release has now been taken offline. Apparently a line (or “mob”, if you will) of sysadmins gathered outside the facility when the thing went down.
According to their white paper and specs (and one would assume their service level agreements), this wasn’t supposed to happen. Their “Ten 2.1-megawatt Hitec Continuous Power Systems”, “three 20,000 gallon double-lined fuel tanks”, and “extended protection time by means of the integrated diesel engine” aren’t worth anything if they don’t actually switch on in the case of a power failure. Some folks were saying that a disgruntled employee was to blame. I find it somewhat unlikely that that would happen in combination with a city-wide power outage.
But, on the other hand, Robert Goulet running around and slamming all of the red power-interrupt buttons would explain why the backup power didn’t help any!
Anyway, the moral of the story for datacenter operators is: test, test, test. Unless, of course, you figure that the extra money you’re bilking out of your customers (and saving by not testing) will make up for the millions of dollars in lawsuits an outage like this can cause. And for the service operators and architects, putting all your eggs in one basket really still isn’t a good idea — even if it’s a big fancy basket with supposed “100% uptime.” But hey, those are the risks you’ve got to take. In some cases, it’s just not worth it to build out or rent a totally redundant data center to cover the few hours of a once-in-a-blue-moon co-lo facility outage.
#1 by Edison Was Right on July 25, 2007 - 1:24 pm
Interesting story — I hadn’t heard about this outage.
I wish more people would realize the benefits of using 48 VDC power distribution in data centers. Backup power for 120 VAC is inherently more complicated and less reliable than DC. DC distribution is also more efficient in that kind of environment — why on earth would anyone want to put hundreds of hot, inefficient AC power supplies in one room? It’s just silly.
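The efficiency argument comes down to how many lossy conversion stages sit between the utility feed and the server motherboard. Here’s a back-of-the-envelope sketch in Python; every efficiency figure is an assumed, representative value (not a measurement from 365 Main or any particular facility), and the stage breakdown is a simplification:

```python
# Rough comparison of cascaded conversion losses for a typical AC
# distribution path versus a 48 VDC path in a datacenter.
# All efficiency figures are assumed, representative values.

def chain_efficiency(*stages):
    """Overall efficiency of conversion stages wired in series."""
    eff = 1.0
    for s in stages:
        eff *= s
    return eff

# Assumed AC path: double-conversion UPS (~90%), PDU transformer (~98%),
# then the server's own AC power supply (~75%, not unusual for
# mid-2000s hardware).
ac = chain_efficiency(0.90, 0.98, 0.75)

# Assumed DC path: one centralized rectifier plant (~92%), then DC/DC
# conversion at the server (~90%).
dc = chain_efficiency(0.92, 0.90)

print(f"AC path: {ac:.1%}")   # roughly 66%
print(f"DC path: {dc:.1%}")   # roughly 83%
```

The point isn’t the exact numbers, which vary a lot by equipment: it’s that multiplying several sub-100% stages together eats efficiency fast, and every percent lost comes back out as heat the air conditioning then has to remove.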
#2 by benoc on July 25, 2007 - 1:50 pm
OOhhhhh yeah, DC datacenter distribution is actually a hot topic now. I remember back a few years ago when it was being pushed as a solution for in-rack power distribution.
That is, you would buy a special rack PDU that converted the incoming AC down to DC, and then fill the rack with the vendor’s special servers that accepted DC power. It never really caught on.
This, however, is one of the big pluses of the new fad of “blade servers.” But still, you’d get even more efficiency if there were some sort of system for distributing DC through the whole datacenter. Maybe all we need are some standards.