Abandon hope all netids beginning with m,u,v,x and y
Recent troubles concerning a mail server at MIT reminded me that I should get around to documenting one of my favorite tales of woe. It happened back in February 2002 while I was working at what was then CCSO, in what was then PSG. I had only been working at UIUC for 6 months or so at that point, and had only been administering the campus student/staff cluster “dataservers” for less than that.
I must warn you: this is a long entry, and probably uninteresting to most people out there. However, I put it here for my own documentation (this is the story I tell at job interviews when asked about an arduous “troubleshooting experience”) and for the possible enjoyment of a small population of my peers.
Background
First, some background on “the cluster”. The UIUC student/staff cluster was a behemoth of a system with 8 or so front-ends behind a load-balancing switch, relaying mail delivery and IMAP mail reading traffic to 8 “dataservers”. Each dataserver had a specific subset of the campus’ 80,000+ email accounts on it based on what letter the netid of said account began with. Each dataserver was paired with another one in a multipathed SCSI setup (cleverly designed by Mark Roth). This way, the data could be mirrored across two separate arrays using Veritas Volume Manager, and each dataserver could take over duties from its partner if the need arose. This eliminated single points of failure. To this day, I remember the layout of which netids went to which datasrv, but I won’t bore you with those details.
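To give a flavor of just how simple the routing was, here is a tiny sketch of the idea in shell. It is illustrative only: the function name is made up, and it shows just the one bucket that matters for this story.

    # Route a netid to its dataserver based on nothing fancier than its first
    # letter.  Only the datasrv6 bucket is shown; the rest of the layout is
    # deliberately left out.
    netid_to_dataserver() {
        case "$1" in
            [muvxy]*) echo datasrv6 ;;
            # ...the remaining letters mapped onto datasrv1-5, 7 and 8...
            *)        echo "one of the others" ;;
        esac
    }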
Trouble Brews
I’m pretty sure the day was Wednesday, February 13th, when I strolled into my office at the somewhat late hour of 9:45am. I opened my email to see two help desk tickets that had been cc’d to me concerning folks with email problems. The symptom was that these two people were receiving other people’s email. Simple problem, I thought: someone’s just borked their .forward file somehow or messed up their email forwarding settings in PH (the university’s directory service). I delayed for a few minutes getting caught up on other email, and then investigated enough to realize that the problem wasn’t so simple. They had no .forward files, and their PH entries were in order. And here’s where things really started to go into the weeds.
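For the uninitiated, that first-pass check (the one that came up empty here) amounts to something like the following. This is illustrative only: the netid and home-directory path are made up, and the exact ph query form may have differed.

    NETID=msomebody                            # hypothetical affected netid
    # Is their mail being redirected by a .forward file in their home directory?
    ls -l /home/$NETID/.forward 2>/dev/null || echo "no .forward"
    # Or by the email/forwarding field of their PH entry?
    ph alias=$NETID return email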
Within a few minutes, I got about 3 or 4 more help desk tickets from other users noticing the same thing, and then my phone rang. It was Phil at the help desk; things were getting kind of crazy over there, with users calling in and reporting the same problem. You see, that’s one thing about working on a system with tens of thousands of users: if there’s a problem affecting even a small percentage of them, the phones start ringing off the hook.
At this point, with my office-mate (Jon Marks) and me sitting there scratching our heads, my boss (Mark Roth) and some other observers strolled in and started brainstorming with us. It was decided fairly quickly that we needed to shut down the cluster, at least temporarily, and that’s what I did (after submitting an appropriate message for status.uiuc.edu). We kicked all of the interlopers out of the office, and Jon, Mark and I started to work the problem. We looked at all of the netids that had been experiencing the problem and determined that they were all on datasrv6 (netids beginning with m, u, v, x or y). This was somewhat encouraging, since it narrowed the problem down. Also, since this was indeed an “exigent circumstance,” we opened up one of the affected inboxes and noticed that all of the mis-directed email also came from netids on datasrv6. So we had the problem completely isolated to one system; of course, that one system contained about 10,000 users. Satisfied, we brought the rest of the cluster back up, leaving datasrv6 off, and kept thinking.
That Sinking Feeling
We realized that it couldn’t be a procmail problem, since procmail executes as the destination user and wouldn’t have permission to write to other users’ inboxes. The same went for qpopper (our POP3 server). Jon made the breakthrough at that point and noted, correctly, that there was no way this was a software problem: the permissions on the home directories and mail spools would make it impossible. After a few more minutes of discussion, we came around to that thinking. Unfortunately, the alternative brings an all-too-familiar sinking feeling to us system admins: “hardware failure.” Or, more specifically, “data corruption.” It’s extremely rare for a disk, much less an array of disks, to just start corrupting data. However, a while back we had re-configured the RAIDs on some of the heavier-loaded dataservers (including datasrv6) to operate at level 0+1 rather than the slower but parity-checking level 5. In addition, the Veritas software mirroring between the two arrays added even more complexity. Sure enough, when we poked around in the system logs (saved on a remote log server), we were able to spot the tell-tale messages of filesystem errors and disk corruption. The corruption only showed up in users’ inboxes because those were the most heavily used files on the filesystems. At that point, most users were POP users, and qpopper would actually copy the entirety of a user’s inbox into a temporary “popdrop” folder, work on it, and then copy it back. With data corruption and all of that file activity, it was pretty much like people’s inboxes were decks of cards being shuffled together.
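Incidentally, the permissions argument is easy to see right in the spool. The listing below is what a typical mbox-style spool looks like; the usernames, sizes and dates are illustrative, not saved output from the real machine.

    $ ls -l /var/mail
    -rw-rw----   1 msmith   mail    1482331 Feb 13 09:50 msmith
    -rw-rw----   1 ujones   mail     204811 Feb 13 09:48 ujones
    # procmail delivering as msmith (or qpopper reading as msmith) has no write
    # access to ujones's spool file, so no userland bug could shuffle mail
    # between the two.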
At first, thinking that it could be just a VxFS fluke or something, we unmounted the filesystems and attempted to run fscks. M and V were damaged beyond repair (about 6,500 users). The small ones (U, X, and Y) were recovered with some time and pain. A tape restore and a data-loss incident were indeed in our future, but we had yet to learn just how bad it would get. Out of curiosity, I peeked through the ticketing system to see how far back these problems went. Sure enough, the first complaint had come in around 8pm the previous day, which was, of course, before that night’s backup. That would mean a full day of data loss. That’s bad. Actually, that’s just about as bad as things can get as a sysadmin. It was at this point that someone at the Help Desk wrote on the whiteboard in there, “abandon hope all netids beginning with m,u,v,x and y,” as their phones kept ringing and ringing and ringing.
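For the morbidly curious, that unmount-and-fsck pass was nothing exotic; it went roughly like this, assuming a Solaris-style VxVM/VxFS setup (the disk group, volume and mount-point names are made up):

    # One volume at a time: unmount it and let the VxFS fsck do a full check.
    umount /mail/m
    fsck -F vxfs -o full /dev/vx/rdsk/maildg/vol_m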
Recovery Number 1
So, we broke the Veritas mirror and disconnected what we had determined to be the bad MTI Gladiator disk array. We remade the destroyed filesystems and started the TSM restore; it would take the rest of the day to bring back all of that data. On the filesystems that weren’t completely destroyed, I wrote up a shell script using procmail and formail and god-only-knows what else to flip through everyone’s inbox and find and re-direct any misplaced messages. At about 7pm that night, things were done (or so we thought) and back online, with the filesystems mounted on dataserver 5 (just to be safe, because we were convinced that dataserver 6 was cursed). A targeted mass-mail was sent out to those users who had been screwed out of 1.5 days of data, but surprisingly enough, there wasn’t really much of a firestorm about it.
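The general idea of that script was something like the sketch below. To be clear, this is a reconstruction rather than the original: the spool path is assumed, the header parsing is deliberately crude, and a real version would also have to worry about locking and a pile of edge cases.

    #!/bin/sh
    # Rough sketch of the inbox-scrubbing idea, not the original script.
    # For each inbox in the (assumed) spool directory, split it into individual
    # messages with formail and hand anything addressed to a different local
    # user back to procmail so it lands in the right place.
    SPOOL=/var/mail

    for inbox in $SPOOL/*; do
        OWNER=`basename $inbox`; export OWNER
        work=/var/tmp/scrub.$OWNER.$$
        mv $inbox $work && touch $inbox        # set the original aside, start fresh

        # formail -s runs the command below once for every message in the mbox,
        # with that message on its standard input.
        formail -s sh -c '
            msg=/var/tmp/msg.$$
            cat > $msg
            # Crude: treat the local part of the To: header as the intended user.
            rcpt=`formail -zxTo: < $msg | sed "s/.*<//; s/>.*//; s/@.*//"`
            if [ -n "$rcpt" ] && [ "$rcpt" != "$OWNER" ]; then
                procmail -d "$rcpt"  < $msg    # misplaced: re-deliver it
            else
                procmail -d "$OWNER" < $msg    # belongs here: put it back
            fi
            rm -f $msg
        ' < $work
    done

The -d flag asks procmail to deliver as the named user, which is what lets a root-run scrub pass like this drop each message back into the correct inbox.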
Oops!
The next day (Valentine’s Day), I arrived somewhat early to keep an eye on things. With a sickening sense of déjà vu, my phone rang again, and it was Phil from the help desk: someone with a netid beginning with “m” was seeing someone else’s mail. Reacting quickly, I shut down sendmail and qpopper on dataserver 5 and alerted the operations center to post another status message. Luckily, this time I had caught it before any of the filesystems were completely borked. Looking back on our actions of the previous day, I realized I had made the almost-fatal mistake of removing the wrong Gladiator array: the evil, data-corrupting one was the one we had restored onto, and the one we had unplugged was actually the good one. We hooked up the other array, remounted the bad one read-only (this time being careful not to unmount it or even attempt an fsck), and began the slow and arduous task of copying everything over from one array to the other. At any moment, the bad array holding all of that data could have pooped out completely, and we would have lost another day of data. Luckily, it did not. After the data was copied over, plus a quick run of the inbox-unfux script from the previous day, things were all good at around 6pm. There was much rejoicing, and I even made it in time for Valentine’s Day dinner at Kennedy’s with my wife!
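As for the salvage copy itself, it was nothing fancy; something along these lines, again with made-up device and mount-point names:

    # Mount the suspect array strictly read-only and stream everything over to
    # the freshly attached good array, touching the bad one as little as possible.
    mount -F vxfs -o ro /dev/vx/dsk/baddg/vol_m /mnt/badarray
    cd /mnt/badarray && tar cf - . | (cd /mail/m && tar xpf -)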
Epilogue
From that day on, we never replaced dataserver 6, and we stuck with an odd number of backends (1, 2, 3, 4, 5, 7 and 8) until the cluster was consolidated onto faster servers and eventually retired. MTI was very helpful with diagnosing the demonic array. We rebuilt it as 8 separate volumes, one on each disk, and narrowed the corruption down to one particular disk. They sent that disk to Seagate’s failure analysis laboratory and received an acknowledgment of defect: a certain section of the disk’s platters was “under-deposited” with the proper magnetic coating, causing eventual surface failure and data corruption. If we had built the array as RAID level 5, the parity checks would have caught the mismatches and the drive would have been marked as failed, with no data loss. I guess that’s probably the moral of the story, isn’t it? Trading reliability for performance will have unintended negative consequences, and no matter how many levels of redundancy you think you have, you’re still not safe.