Most friends of mine will already know that this January, after being a Senior Systems Engineer for over 4 years at Linden Lab (the makers of Second Life), I embarked on a journey to find something new, different, fun, and challenging (no, I wasn’t looking for — nor did I receive — additional financial compensation for switching jobs). This blog entry is a bit late, I know, but I can’t readily sleep tonight and just got back from an amazing showing at E3 in Los Angeles, so it seems an appropriate time to put some things down in virtual ink.
After looking around for about two weeks and working with an awesome recruiter at Hollister in Boston, I found what I was looking for. Since January, I’ve been a Senior Systems Administrator in the LiveOps group at Harmonix Music Systems.
As the video game industry grows more internet-connected, social, and network-dependent in both synchronous and asynchronous multiplayer capabilities, the rhythm, dance, music, and beat-matching games that we make at Harmonix need to as well. And all of that requires server infrastructure. Not just a hodgepodge of a few dozen boxes in a closet in the back of the office run by one guy, but an actual redundant, reliable, and well-architected infrastructure that can hopefully serve the needs of all of our customers, contribute to the joy they get from playing video games, and grow and adapt for future titles with minimal financial and human investment.
As we saw from the piss-poor launch of Diablo III, this can be a challenging and daunting task, and I look forward to being on the team that makes sure all of our multiplayer and backend functionality keeps working without a hitch — even during the hopefully huge launches of new games coming later this year in the Rock Band and Dance Central franchises, both of which will rely heavily on a newly architected and constructed backend infrastructure.
No, it’s not as huge or as technically challenging as running a virtual world with 16,000 servers and over 150 well-organized and configuration-managed high-load mysql servers spread over 3 remote datacenters, but it’s somehow more “fun.” At least so far it has been, and I hope it remains that way. In short, I love my new job and I love the company (we’re hiring, by the way). The organizational, personal, and corporate-level challenges of designing, maintaining, and growing a smaller infrastructure with a smaller staff at a smaller company in many ways trump the far-out optimization of an infrastructure with tens of thousands of servers and crazy-over-optimized solid-state-disk mysql clusters with 20+ slaving nodes each, running an entire 3d virtual world the size of Denmark with an economy the size of Brazil’s (or whatever country it is equal to these days).
The stuff we do, and will do, at Harmonix might not be the basis of any papers I’d be able to present at ATC or LISA, but it’s actually more reproducible and applicable to the vast majority of systems operations groups out there that we rely upon in our daily connected lives. And I look forward to sharing some of it with you, my faithful readers, as I hope my time here at Harmonix draws on through several awesome upcoming projects.
I used to think that working on SL was cool; I would occasionally see news stories or emails from residents describing how they met their mate there, or escaped a debilitating mental or physical condition by existing in the virtual world, and I could easily grasp the impact of what we were doing. At Harmonix, in addition to my work at the office in beautiful Central Square in Cambridge, Massachusetts, I get a chance to accompany our amazing community team to demos, conferences, and expos (setting up on-site backends for the games, given the lack of internet connectivity) and see just how much the once-derided (by my parents, at least) video game industry has matured and risen to prominence in American culture. Harmonix’s games have brought fun and happiness to millions, and no doubt introduced (or re-introduced) many to the joy of music itself (and now, with the Dance Central series, dance). Remember the first time you played Rock Band with a few of your friends, and experienced that rush of pleasure and appreciation from playing through a good tune properly and getting in the “groove”? Talking with some of our customers and fans at PAX East and, more recently, E3, it really does hit home.
And besides, I’ve said it before, but I get great satisfaction from working for a company that actually makes something real. We ship code on discs (OK, it’s downloadable content too these days) for people to buy, and play, and get enjoyment out of. This is not a shady business model of exploiting our customers by gathering up their personal information to spread around to the highest bidder (*cough* Facebook *cough*), or of leveraging internet search (which, call me naive, is kind of a solved problem) and email account provisioning (also a solved and uninteresting problem) to gather up personal information, track users, bubble them into predetermined categories, and force-feed them advertisements, all the while violating their express wishes for privacy in many cases (*cough* Google *cough*).
So, what if my new job is less intense, less technically challenging or “awesome” in a geeky unix tech way? In many ways it is more rewarding, and I feel good about what I do when the day is done. I don’t think I could say that if I worked for any of those aforementioned silicon valley behemoths (despite being hounded by their recruiters regularly). But even more important on a personal level, and my main reason for switching jobs, is that it’s quite a shift out of my comfort zone, and challenging in other ways. I’m now working with a smaller group and company of diverse talents and far different attitudes, personalities, and skill sets than I got used to at the mostly-all-computer-geek IT departments of universities and Linden Lab where I previously made my living.
So let’s lift a glass to change, sometimes even if it’s just for change’s sake. And also to all of the different types of people that make the video game industry, and our lives, work — the artists, musicians, talkers, writers, dreamers and thinkers, along with us nerdy engineers. And most of all I propose a toast to fun and joy, both of which I hope to be contributing to for many millions of players during my time at Harmonix.
Gee, it’s been a while since this old thing has been updated, eh? There are lots of new things in my life since that last post down there. New job, new year, new haircut, to name a few.
As you may have already noticed, the blog looks a little bit different today. I went ahead and migrated from an old Movable Type v3.3 instance I was hosting myself onto the up-to-date platform at wordpress.com. It wasn’t without a few bumps and hiccups, but I’m pretty sure everything is here and future updates will go smoothly. And I do have some updates in the queue.
If you’re the least bit involved with the business end of internet services (as I certainly am), you’ll have already heard that a few weeks ago, Amazon Web Services (AWS)[aws.amazon.com] suffered a major outage of an east-coast region of their platform. This outage caused serious issues and downtime for many other internet systems (including, but not limited to: reddit, foursquare, heroku, quora, and my employer Linden Lab / Second Life) and services that have come to rely on AWS over the past few years as a reliable provider of what have become known as “cloud computing services.” [wikipedia] This is their official post-mortem of the incident.
What is particularly interesting and notable about this outage, in my opinion, is the set of lessons we in the industry can learn about putting our eggs in such a basket: what “high-availability” really means, the dangers of “sorcerer’s apprentice syndrome” and “auto-immune” vulnerabilities in redundancy engineering, and how to maintain a high level of service in this age of “cloud computing.” People are still arguing and wanking about where to place the blame for all of the havoc that this incident wreaked upon the internet, but the plain truth is that there’s more than enough blame to go around for everyone — the web sites and service providers, as well as Amazon itself. On one side, it’s true that engineers and administrators should have spread deployments across multiple AWS regions (not just availability zones). On the other, AWS has made it difficult to use multiple regions, and had maintained that spreading deployments across availability zones would provide adequate insurance against an outage — in this case they were very, very wrong — while pushing a new cloud storage service (EBS) that proved to be even less reliable, and in many cases incompatible with using multiple AWS sites.
Here’s a quick rundown:
- Amazon outage and the auto-immune vulnerabilities of resiliency
- Amazon EC2 outage: summary and lessons learned
- Heroku status and post-mortem from the AWS outage — an illustration of how NOT to do things if you want reliability, but an admirable case of owning up to and being honest about your faults.
- How SmugMug survived the Amazonpocalypse — an illustration of how you SHOULD do things if you want reliability from a company that used AWS but was able to stay up.
No, in the end it turns out that it wasn’t Skynet’s fault after all. Just some over-exuberance about the new hotness, and a distinct deficit in reliability engineering and availability paranoia.
- V – This Blog: With the advent of twitter (see notamateurhour) this blog has been lacking in content. Turns out it’s much easier to just spit out a few lines of text regularly than it is to construct several paragraphs worth of content that’s worth reading. We’ll see if that changes.
- V – Hanging Out: With only very rare exceptions, I’ve been lacking in the hanging-out department for the past two or three months for various reasons (work, laziness, etc.). I think this is going to be my new years resolution: “hang out more.” I don’t remember the last time I made a run for the Border Cafe or saw some good pickin’ at the Cantab lounge with my old homies.
- ^ – Work: I did some pretty neat things this year, and enjoyed doing them. A massive mysql upgrade and migration went off with minimal outage — onto solid state disk hardware, which has worked out pretty sweet. I also did a whole bunch of fiddling and reworking with our DNS system for added speed and reliability, and built tools for making database reslaves an order of magnitude faster by using LVM snapshots. All in all, I’d say probably well worth the raise and titular promotion that I got this quarter. I’m not going to deny that there have been issues at Linden this year, and it hasn’t been the happiest of times morale-wise, but from where I sit things seem to be looking up.
- V – Red Sox: A grim 2010 for the boys of summer. ‘Nuff said. But I’m definitely looking forward to the Sox of ’11.
- ^ – Green Lifestyle: With Kristy no longer needing the car to drive to rotations in Worcester or Cambridge, there was no logical reason remaining to own a car while living in the city. $250/month for parking plus $80/month for insurance plus the hundreds of dollars that the car was going to potentially start costing us in maintenance to get it fixed and keep running is far from worth it for something you only drive maybe once a week or so. Zipcar is more than affordable and convenient enough for those occasional jaunts to Costco or the mall, and getting back and forth between Boston and Cambridge is just a quick ride on the CT2 or 47 MBTA bus.
- ^ – Burning Man: This year I did Burning Man for my first, and definitely not last, time. It was an awesome experience pretty much beyond words. Hopefully I’ll find the time to write a bit about it here and post some pictures, but you really have to go for yourself. Definitely one of the highlights of 2010 for me.
- = – Life in General: 2010 has been kind of a doozy of a year. Lots of stuff going on, things done, places gone to, lessons learned. Whew, I feel kind of tired just thinking about it all. I met a few awesome new friends, and said farewell to a few as well as they headed out of town and on with their lives. But I guess that’s how the whole durned human comedy keeps perpetuatin’ itself, down through the generations, westward the wagons, across the sands o time until — aw, look at me I’m ramblin’ again. Well, I hope you folks enjoyed yourselves this year, and here’s to a happy and awesome 2011.
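For the technically curious, here’s roughly what the LVM-snapshot reslaving trick mentioned in the work item above looks like. This is a dry-run sketch, not our actual tooling: the host names, volume names, and paths are all made up, and with DRY_RUN=1 (the default) it just prints each step instead of executing it.

```shell
#!/bin/sh
# Sketch: rebuild a broken MySQL slave from an LVM snapshot of a healthy
# slave, instead of a slow mysqldump/restore cycle. Illustrative only.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() { PLAN="$PLAN $*"; if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# 1. On a healthy slave: pause replication, flush, and lock the tables.
run mysql -e 'STOP SLAVE; FLUSH TABLES WITH READ LOCK;'
run mysql -e 'SHOW SLAVE STATUS\G'   # record master log file + position

# 2. Snapshot the logical volume holding the datadir (near-instant).
run lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/mysql

# 3. Unlock and resume; the donor was read-locked only for a second or two.
run mysql -e 'UNLOCK TABLES; START SLAVE;'

# 4. Copy the snapshot's datadir to the broken slave, then drop the snapshot.
run mount /dev/vg0/mysql-snap /mnt/snap
run rsync -a /mnt/snap/ broken-slave:/var/lib/mysql/
run umount /mnt/snap
run lvremove -f /dev/vg0/mysql-snap

# 5. On the broken slave: point replication at the recorded coordinates.
#    (The real script would substitute the file/position noted in step 1.)
run ssh broken-slave "mysql -e 'CHANGE MASTER TO MASTER_LOG_FILE=..., MASTER_LOG_POS=...; START SLAVE;'"
```

The win is that the donor slave is only locked for the duration of the snapshot, and the copy happens from the snapshot at leisure — which is where the order-of-magnitude speedup over a dump-and-restore comes from.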
What does it look like when the Systems Engineers from around the world gather together in Second Life for our weekly meeting? I took a snapshot today (that’s me in the leftmost-foreground):
People have expressed interest in seeing some more details about our use of MySQL at Linden Lab, so I shall indulge in the next couple of posts. Over the past few months, I have been involved in two major projects involving our central MySQL database cluster. The Second Life central database cluster (known affectionately as mysql.agni) actually consists of a single read/write master database server, with a tree of somewhere around 15 read-only slaves hanging off of it that are split into groups of various purposes (some behind load balancers) and serve the bulk of the queries that make Second Life work.
MySQL 5 Upgrade
Up until this year, our mysql.agni master was running version 4.1. Shortly after I started at Linden Lab, there was a project to upgrade the entire tree to version 5.0. Our resources were limited back then, and the operation was a total failure. It turns out that MySQL 5.0 has a much different disk i/o usage profile for our query rate and load type. When the master was upgraded to version 5.0, it promptly collapsed under heavy disk i/o load and unacceptable response times. Worse still, we did not have a proper version 4.1 host slaving off of the new master, so there was extended downtime and some data loss as we failed back to the older version and had to replay queries from binary logs.
Fast forward to the end of 2009, and we were much smarter and better equipped to upgrade. By this point, all of the slaves in the tree had been upgraded, and it was becoming difficult and confusing to maintain a mixed-version slaving tree, with some on version 4 and some on version 5. We were also operating on upgraded hardware with more capacity and headroom — particularly concerning disk i/o.
My former co-worker, Charity Majors, authored an extremely thorough public blog post concerning this upgrade, and our planning to make it come off without a hitch. We built a robust load-testing and analysis rig, and put the new version and hardware through its paces multiple times, in multiple ways, to make sure that we’d be OK this time. The major difference was our consultation with Percona and our decision to use one of their high-performance tuned builds of MySQL. In addition, we came up with an awesome plan for performing the upgrade, and a fallback mechanism if we needed it.
The trouble with fallback in this case is that normally, it’s not possible to hook up a MySQL version 4.1 slave off of a 5.0 master. This means that once we upgraded to the new master and had writes and updates going to it, we’d have no version 4.1 server to fall back to without losing data. Charity was able to rig up a script that rotated the binary logs on the new 5.0 master every 5 minutes or so, and sent the contents of the log over the network to a fallback host, where we would feed the binary log queries (via mysqlbinlog piped into mysql and various filters to compensate for 4.1/5.0 incompatibilities) into the database and get it up to date and ready to fail over to if we had issues with version 5.0. Of course, performance wasn’t our only concern, and there were millions of lines of code just waiting to bite us in the butt and potentially require falling back.
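A rough dry-run sketch of what such a rotate-and-replay loop might look like (this is not Charity’s actual script — the host names, paths, log file name, and the 5.0-to-4.1 fixup filter are all made up for illustration; DRY_RUN=1 prints the steps instead of running them):

```shell
#!/bin/sh
# Sketch: every few minutes, rotate the 5.0 master's binary log, ship the
# finished log to a 4.1 fallback host, and replay it there.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() { PLAN="$PLAN $*"; if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

MASTER=new-master          # hypothetical 5.0 master
FALLBACK=fallback-41       # hypothetical 4.1 fallback host

# Close out the current binary log so there's a complete file to ship.
run mysql -h "$MASTER" -e 'FLUSH BINARY LOGS'

# The most recently *finished* log; the real script would parse this
# out of SHOW BINARY LOGS rather than hard-coding it.
LOG=mysql-bin.000042

# Fetch the log, decode it, filter out 5.0-isms that a 4.1 server would
# choke on, and feed the result into the fallback database.
run scp "$MASTER:/var/log/mysql/$LOG" /tmp/
run sh -c "mysqlbinlog /tmp/$LOG | sed -f 50-to-41-fixups.sed | mysql -h $FALLBACK"
```

Run from cron every 5 minutes or so, this keeps the 4.1 host within a few minutes of the new master, so a fallback would lose at most one rotation interval of writes rather than everything since the cutover.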
On January 5th, 2010, all went as well as could have possibly been imagined. We cut over to the version 5.0 master, and there were no issues. It’s so nice to finally be on a relatively recent version of the software, and more importantly, to have our master and slaves all be on the same version. And so our march onwards continues towards ever-improved performance and stability!
Randall Munroe of XKCD hits the nail on the head with this one:
“The weird sense of duty really good sysadmins have can border on the sociopathic, but it’s nice to know that it stands between the forces of darkness and your cat blog’s servers.”
It appears that Tuesday’s episode of Frontline will deal, at least in part, with Second Life. The trailer (embedded below) has some good bits in there about the use of SL for collaboration and both business and social/pleasure stuff. So, check your local PBS listings and set yer tivos and stuff!
For those who are not aware, when I went to MIT I lived in the East Campus dormitory, on the hall known as Second West (a.k.a. Putz). These days, Putz has a pretty neat group of people and an informal tradition of holding “slug talks,” or opportunities for someone to give a brief presentation and have a chance to share some knowledge with the group. Most of these so far have been computery in nature, and specifically computer-sciency. I hear there’s going to be a good one next week on neural plasticity and long term potentiation (LTP — it’s how we learn and form memories in our brains pretty much).
Since my day-to-day life deals with computer science in a more practical and hands-on way, the topics that I have chosen are more practical in nature. Last semester, I gave a talk about MySQL in production, and specifically how we use it at Linden Lab to make Second Life work.
Just last week, however, I gave a slugtalk describing how DNS works out here in the real world. I think it went pretty well. Download my slide deck as a PDF here.
An awesome retro news report from 1988 about the super scary Morris Worm. Centered on MIT, it features none other than jis himself a few times. And an Amish part-time virus hunter cum MIT student? Courtesy of The Scottographer.
Some good quotes:
“the students were safe … their computers were not.”
“the suspect, somewhere… a dark genius.”