Bad Bad Bad Bad Bad


So I’m very much aware of the power issues at MIT’s W91 data center (which are still ongoing, AFAIK). But some pretty bad shit apparently went down at W92 last night as well. Unfortunately, the network and infrastructure services team over there aren’t the best at “communicating,” shall we say, but they did leave some gems on the 3down.mit.edu status site. Of course, as of noon today, there are still no incoming email messages for my mit.edu account, and up until about 11:00am, the web.mit.edu website was still unavailable (in spite of what the updates below state):

  • Thu, Nov 2nd, 12:15am: W92’s data center experienced a power failure when returning from generator power. Currently, web, email and related services are unavailable. A further update will be posted by 1am.
  • Thu, Nov 2nd, 1:45am: Services are returning. MIT’s Home Page and e-mail are restored. AFS servers are returning, as well. A further update by 3am.
  • Thu, Nov 2nd, 3am: Most services are returned and operating normally. Two AFS fileservers continue to return to service. We will continue to monitor all services closely for the next few hours. Our apologies for this unplanned service interruption and thank-you for your patience and understanding.
  • Thu, Nov 2nd, 10am: At approximately midnight, W92’s data center experienced an unplanned outage as NStar power was restored. This loss of power impacted the following key services: email, web, central windows domain, AFS servers, etc. Services were restored between 1am and 4am and now, most services are available. We are aware that some websites remain unavailable as one AFS server is still experiencing difficulty. As we learn which websites are experiencing problems, we will post alternate web addresses here.
  • Thu, Nov 2nd, 11am: We are currently experiencing delays in mail delivery from sites outside MIT due to backlogs from last night’s power outage. We are working on clearing the backlog at this time.

It’s nice that they can afford to buy everyone Aeron chairs, but apparently can’t afford the materials or manpower for a working UPS and generator transfer. Of course, there were probably some mitigating factors.

When I left there a few months ago, they hadn’t yet “discovered” log-based UFS for Solaris, so any power outage included at least an additional 3-4 hours of downtime for fsck and/or tape restores because of the corruption.

I get the feeling that their mail infrastructure was running pretty dangerously close to the “tipping-point” recently anyway, what with increases in spam and volume. They even had some unfortunate incident recently where somebody hacked a ton of accounts and used webmail to send out a boatload of spam through the authenticated servers. This had the effect of getting MIT’s outgoing mail servers on some blacklists out there for a while.

<TECHY SPEAK> When a mail system is running without much headroom, the deluge of queued mail from the internets will definitely send it into the weeds after even a brief outage. I don’t envy the task of the folks trying to bring it back up, because it’s a bit of a spiraling problem. A deluge of incoming messages overtakes the busy server, and the mail queues get long. As the queues get long, they take longer to process. This can be combated (at least in sendmail) by adding servers and by splitting your queue across multiple queue directories, with separate subdirectories for the qf, df, and xf files. On a busy mail system it also might become necessary to have up to a dozen separate sendmail queues, each with a properly tuned number of queue runners (new in sendmail 8.12 or thereabouts), and with queue groups dedicated to a specific subset of your mail load either randomly, or sorted by destination, or both.</TECHY SPEAK>
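For the curious, here is a rough Python sketch of the "sorted by destination" idea: hash the recipient's domain to pick one of a dozen queue directories, so no single directory ends up holding the whole backlog. This is only an illustration, not how sendmail assigns queue groups internally; the function names and the q00..q11 directory naming are made up for the example.

```python
import hashlib
from collections import Counter


def queue_dir_for(recipient: str, num_queues: int = 12) -> str:
    """Pick a (hypothetical) queue directory name from the recipient's domain."""
    # Everything after the last "@" is treated as the destination domain.
    domain = recipient.rsplit("@", 1)[-1].lower()
    # A stable hash keeps mail for the same domain in the same queue.
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % num_queues
    return f"q{index:02d}"


def spread(recipients, num_queues: int = 12) -> Counter:
    """Count how many messages would land in each queue directory."""
    return Counter(queue_dir_for(r, num_queues) for r in recipients)


if __name__ == "__main__":
    sample = [f"user{i}@site{i % 500}.example" for i in range(50_000)]
    for name, count in sorted(spread(sample).items()):
        print(name, count)
```

Sorting by destination means one slow remote site only clogs its own queue. Assigning messages randomly instead would spread the load more evenly but loses that isolation.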

I wonder what the chances are that the couple of folks who run the mail system over there have already done all of this, or are even using a relatively modern version of sendmail, or another MTA. Anyway, I remember back before we took these steps on the UIUC email system, an outage of an hour or so would back things up for several hours. Computers generally don’t like doing anything in directories with 50,000+ files in them.
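A back-of-the-envelope sketch of why an hour of outage can turn into several hours of backlog. The rates below are invented for illustration (they are not MIT's or UIUC's real numbers). It assumes every message that would have arrived during the outage gets retried once service returns, and it ignores the extra slowdown that long queues cause.

```python
def drain_hours(outage_hours: float, arrival_per_hour: float,
                capacity_per_hour: float) -> float:
    """Hours needed to clear the backlog once service is back."""
    # With no spare capacity, new mail arrives as fast as old mail is
    # cleared, so the backlog never shrinks.
    if capacity_per_hour <= arrival_per_hour:
        return float("inf")
    backlog = outage_hours * arrival_per_hour
    # Only the capacity left over after handling new arrivals eats
    # into the backlog.
    return backlog / (capacity_per_hour - arrival_per_hour)


if __name__ == "__main__":
    # Hypothetical: 40,000 messages/hour arriving, 50,000/hour capacity.
    print(drain_hours(1, 40_000, 50_000))  # 4.0
```

With only 20% spare capacity, a one-hour outage takes four hours to dig out of, and that is before the queue directories get big enough to slow everything down further.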

Maybe I should call the powers-that-be over there and ask if they could use some consulting help? (Just kidding. You guys are great. Well, most of you at least. Well, some of you at least.)

  1. #1 by 1/2 Two Grim Dudes on November 2, 2006 - 8:15 pm

    You misplaced your [techy speak] tags. They should bookend this entire POS entry.
    Also, amusing that MIT is having email server issues.
    -Libberal Ahts Majah

  2. #2 by benoc on November 2, 2006 - 11:18 pm

    Hmmmm maybe I should have tagged the entire entry “not for rubes like 1/2”!