So in honor of Systems Administrator Appreciation Day, and because I have a new job in production systems at Athenahealth (more about that at the bottom of this post), I’m slightly modifying and reposting a blast from the recent-past.
“So what exactly would you say you do here?”
I’ve flown out to remote locations and been on-site for the build-out and spin-up of three new production data centers within the last year. I’ve been present for load tests and at public launches of new video games’ online services and product and feature launches to predict and solve system load issues from rushes of new customers hitting new code, networks, and servers. And yes, I’ve spent my share of all-nighters in war rooms and in server rooms troubleshooting incidents and complicated failure events that have taken parts of web sites, or entire online properties offline. I wasn’t personally involved in fixing healthcare.gov, but that team was made up of people I would consider my peers, and some people that have specifically have been co-workers of mine in the past.
Do you use the internet? Ever buy anything online? Use Facebook? Have a Netflix account? Ever do a search on Duck Duck Go or use Google? Do you have a bank account? Do you have a criminal record, or not? Ever been pulled over? Have you made a phone call in the past 10 years? Is your metadata being collected by the NSA? Have you ever been to the hospital, doctor’s office, or pharmacy? Do you play video games? If you’ve answered yes to any of the above questions, then a portion of your life (and livelihood) depend on a particular group of professional engineers that do what I do. No, we are not a secret society of illuminati or lizard people. We do, however, work mostly in the background, away from the spotlight, and ensure the correct operation of many parts of our modern, digital world.
So what do we call ourselves? That’s often the first challenge I face when someone asks me what I do for a living. My job titles, and the titles of my peers, have changed over the years. Some of us were called “operators” back in the early days of room-sized computers and massive tape drives. When I graduated college and got my first job I was referred to as a “systems administrator” or “sysadmin” for short. These days, the skill sets required to keep increasingly varied and complex digital infrastructure functioning properly have become specialized enough that this is almost universally considered a distinct field of engineering rather than just “administration” or “operations”. We often refer to ourselves now as “systems engineers,” “systems architects,” “production engineers,” or to use a term coined at Google but now used more widely, “site reliability engineers.”
What does my job entail specifically? There are scripting languages, automated configuration and server deployment packages, common technology standards, and large amounts of monitoring and metrics feedback from the complex systems that we create and work on. These are the tools we need to scale to handle growing populations of customers and increased traffic every day. This is a somewhat unique skill set and engineering field. Many of us have computer science degrees (I happen to), but many of us don’t. Most of the skills and techniques I use to do my job were not learned in school, but through my years of experience and an informal system of mentorship and apprenticeship in this odd guild. I wouldn’t consider myself a software engineer, but I know how to program in several languages. I didn’t write any of the code or design any of our website, but my team and teams like it are responsible for deploying that code and services, monitoring function, making sure the underlying servers, network and operating systems function properly, and maintaining operations through growth and evolution of the product, spikes in traffic, and any other unusual things.
Back in 2001, I was working for the University of Illiniois at Urbana-Champaign for the campus information services department (then known as CCSO) as a primary engineer of the campus email and file storage systems. Both were rather large by 2001 standards, with over 60,000 accounts and about a terabyte (omg a whole terabyte!) of storage. This was still in the early part of the exponential growth of the internet and digital services. I remember a presentation by Sun Microsystems in which they stated that given the current growth rates and server/admin ratios, by 2015 about ⅓ of the U.S. Population would need to be sysadmins of some sort. They were probably right, but the good news is that since then our job has shifted mostly to finding efficiencies and making the management of systems and services of ever-growing scales and complexity possible without actual manual administration or operation — so the server/admin ratio has gone down dramatically since then. Back then it was around 1 admin for every 25 servers in an academic environment like UIUC. Today, the common ratios in industry range from a few hundred to a few thousand servers per engineer. I don’t think I’m allowed to say publicly what the specific numbers are here at TripAdvisor, but it is within that range. But, we still need new engineers every day to meet needs as the internet scales, and as we need to find even more efficiencies to continue to crank that ratio up.
Where do the production operations engineers come from? Many of us are ex-military, went to trade schools, or came to the career through a desire to tinker unrelated to college training. As I stated earlier, while a degree in computer science helps a lot understanding the foundations of what I do, many of the best engineers I’ve had the pleasure of working with are art, philosophy, or rhetoric majors. In hiring, we look for people who have strong problem solving desires and abilities, people who handle pressure well, who sometimes like to break things or take them apart to see how they work, and people who are flexible and open to changing requirements and environments. I believe that, because for a while computers just “worked” for people, a whole generation of young people in college, or just graduating college, never had the need or interest to look under the hood at how systems and networks work. In contrast, while I was in college, we had to compile our own linux kernels to get video support working, and do endless troubleshooting on our own computers just to make them usable for coding and, in some cases, daily operation on the campus network.
So generally speaking, recent college graduates trained in computer science have tended to gravitate towards the more “glamorous” software engineering and design positions, and continue to. How do we attract more interest in our open positions, and in the career as an option as early as college? I don’t have a good answer for that. I’ve asked my peers, and many of them don’t know either. I was thrilled to go to the 2014 SREcon in Santa Clara last year (https://www.usenix.org/conference/srecon14), and to attend and present at SREcon 2015 this year. A huge portion of the discussion panels there and the engineers and managers there from all the big Silicon Valley outfits (Facebook, Google, Twitter, Dropbox, etc.) face the same hiring problems. It’s admittedly even worse for us at TripAdvisor as an east coast company fighting against the inexorable pull of Silicon Valley on the talent pool here.
One thing I’ve come to strongly believe, and which I think is becoming the norm in industry operations groups, is that we need to broaden our hiring windows more. We need to attract young talent and bring in the young engineers, who may not even be strictly sure that they want an operations or devops career, and show them how awesome and cool it really is (ok, at least I think it is). To this end, I gave a talk at MIT a little over a year ago on this subject — check out the slides and notes here. I didn’t know that this is what I wanted to do for sure until about a week before I graduated from MIT in 2000. I had two post-graduation job offers on the table, and I chose a position as an entry-level UNIX systems administrator at Massachusetts General Hospital (radiation oncology department, to be more specific) over the higher paying Java software engineering job at some outfit named Lava Stream (which as far as I can tell does not exist anymore). Turns out I made the right decision. The rest of my career history is in my LinkedIn profile (https://www.linkedin.com/profile/view?id=8091411) if anyone is curious. No, I’m not looking for a new job.
“Now (and forever) Hiring”
So, if anyone reading this is entering college, or just leaving college, or thinking of a career change, give operations some consideration. Maybe teach yourself some Linux skills. Take some online classes if you have time or think you need to. Brush up your python and shell scripting skills. At least become a hobbyist at home and figure out some of those skills you see in our open job positions (nagios, Apache, puppet, Hadoop, redis, whatever). Who knows, you might like it, and find yourself in a career where recruiters call you every other day and you can pretty much name your own salary and company you want to work for.
And specifically for my group at Athenahealth? We manage the production infrastructure for athenaNet. We are a cloud-based medical services company working to build the healthcare internet and to improve healthcare in the U.S. through our innovation. We are the #1 practice management system and #2 electronic health record (EHR) system in the country according to the 2014 KLAS survey. The infrastructure that my team runs is counted on and trusted by over 67,500 medical providers and millions of patients. Does any of this sound interesting to you? Even if you don’t think you fit any of the descriptions we have currently listed (like this one) but might be up for some mentoring/training and maybe an internship or more entry-level position, tweet at me or drop me an email and we’ll see what we can do. See you out there on the internets.