Any readers of this blog who are in my field, or in a similar field, understand what “Firefighting Mode” is. It’s when your job consists almost entirely of fighting fires (i.e. handling emergencies, servicing interrupts), rather than dealing with things in a rational, prioritized manner and devoting some time to medium-term and long-term infrastructure improvements. The problem is that once you’re in firefighting mode, the lack of medium- and long-term planning and of infrastructure improvements just causes more fires to pop up down the road. Hence, it is a vicious cycle, and it can eventually wreak havoc on the morale of the overworked staff. I’m trying to put this into appropriate words here, and what follows is my first attempt. I’m sure the “powers that be” here are already aware of the situation, but it does feel good to get it off my chest and into words:
The group is too bogged down with everyday tasks to do long-term or even medium-term development work, or to follow best practices for many of our operations. The security incident earlier this year was a direct result of this. SSH access lock-downs were supposed to be part of the planned system audit (over a year overdue) that was never accomplished because of a lack of available, knowledgeable manpower in the group. Likewise, I have been far too busy to make progress on Linux knowledge exchange, or Solaris Jumpstart service installation. Both of these projects, as well as other possible infrastructure projects (Tru64 infrastructure replacement, documentation wiki, deploying Solaris logging UFS, etc.), are essential to our ability to manage a multitude of servers effectively and efficiently.
Without proper manpower to attend to and improve systems administration infrastructure, organizations tend to fall into a vicious cycle: they never realize the administration efficiency improvements and better methodologies available to them, and therefore never improve the important metric of “man-hours of administration work required per server”. This, combined with an ever-increasing number of servers, makes the problem worse, keeping the group in “firefighting mode,” and eventually affecting staff morale as well.
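To make the arithmetic behind that metric concrete, here is a rough sketch in Python. Every number in it is made up for illustration (the server counts, the hours per server, and the staff hours are assumptions, not figures from my group), and `yearly_shortfall` is just a name I picked. It multiplies the number of servers by the hours each one needs per year and compares the result to the hours the staff actually has.

```python
def yearly_shortfall(start_servers, growth_per_year, hours_per_server,
                     staff_hours, years):
    """Return, for each year, how many admin hours are needed beyond
    what the staff can supply (0 if the staff can keep up).

    All inputs are hypothetical illustration values.
    """
    shortfall = []
    servers = start_servers
    for _ in range(years):
        # Total work = servers * man-hours of admin work per server.
        needed = servers * hours_per_server
        shortfall.append(max(0, needed - staff_hours))
        # The server count keeps growing every year.
        servers += growth_per_year
    return shortfall


# Hypothetical: 100 servers, 20 more each year, 4000 staff hours per year.
# With no infrastructure work, each server costs 40 hours per year.
print(yearly_shortfall(100, 20, 40, 4000, 3))
# If infrastructure improvements cut that to 30 hours per server:
print(yearly_shortfall(100, 20, 30, 4000, 3))
```

With these invented numbers, the unimproved case falls behind in the second year and the gap keeps widening, while the improved case stays roughly even for longer. The point is only that the per-server figure is the one lever that offsets growth in the server count.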
An interesting paper on the topic:
And I think I’ll also throw in Mark Roth’s “Sanity Through Organizational Evolution” paper here as well: