The Work That Makes Civilized Life Possible (and finding the people to do it)


“So what exactly would you say you do here?”

I’ve flown out to remote locations and been on-site for the build-out and spin-up of three new production data centers within the last 10 months. I’ve been present for load tests, for public launches of new video games’ online services, and for product and feature launches, to predict and solve system load issues from rushes of new customers hitting new code, networks, and servers. And yes, I’ve spent my share of all-nighters in war rooms and server rooms troubleshooting incidents and complicated failure events that have taken parts of web sites, or entire online properties, offline. I wasn’t personally involved in fixing healthcare.gov late last year, but that team was made up of people I would consider my peers, including some who have been co-workers of mine in the past.

Do you use the internet? Ever buy anything online? Use Facebook? Have a Netflix account? Ever do a search on Duck Duck Go or use Google? Do you have a bank account? Do you have a criminal record, or not? Ever been pulled over? Have you made a phone call in the past 10 years? Is your metadata being collected by the NSA? Have you ever been to the hospital, doctor’s office, or pharmacy? Do you play video games? If you’ve answered yes to any of the above questions, then a portion of your life (and livelihood) depends on a particular group of professional engineers who do what I do. No, we are not a secret society of Illuminati or lizard people. We do, however, work mostly in the background, away from the spotlight, and ensure the correct operation of many parts of our modern, digital world.

So what do we call ourselves? That’s often the first challenge I face when someone asks me what I do for a living. My job titles, and the titles of my peers, have changed over the years. Some of us were called “operators” back in the early days of room-sized computers and massive tape drives. When I graduated college and got my first job I was referred to as a “systems administrator” or “sysadmin” for short. These days, the skill sets required to keep increasingly varied and complex digital infrastructure functioning properly have become specialized enough that this is almost universally considered a distinct field of engineering rather than just “administration” or “operations”. We often refer to ourselves now as “systems engineers,” “systems architects,” “production engineers,” or to use a term coined at Google but now used more widely, “site reliability engineers.”

What does my job entail specifically? There are scripting languages, automated configuration and server deployment packages, common technology standards, and large amounts of monitoring and metrics feedback from the complex systems that we create and work on. These are the tools we need to scale to handle growing populations of customers and increased traffic every day. This is a somewhat unique skill set and engineering field. Many of us have computer science degrees (I happen to), but many of us don’t. Most of the skills and techniques I use to do my job were not learned in school, but through my years of experience and an informal system of mentorship and apprenticeship in this odd guild. I wouldn’t consider myself a software engineer, but I know how to program in several languages. I didn’t write any of the code or design any of our website, but my team and teams like it are responsible for deploying that code and services, monitoring function, making sure the underlying servers, network and operating systems function properly, and maintaining operations through growth and evolution of the product, spikes in traffic, and any other unusual things.

“Skill shortage”

Back in 2001, I was working for the University of Illinois at Urbana-Champaign for the campus information services department (then known as CCSO) as a primary engineer of the campus email and file storage systems. Both were rather large by 2001 standards, with over 60,000 accounts and about a terabyte (omg a whole terabyte!) of storage. This was still in the early part of the exponential growth of the internet and digital services. I remember a presentation by Sun Microsystems in which they stated that given the current growth rates and server/admin ratios, by 2015 about ⅓ of the U.S. population would need to be sysadmins of some sort. They were probably right, but the good news is that since then our job has shifted mostly to finding efficiencies and making the management of systems and services of ever-growing scale and complexity possible without actual manual administration or operation, so the number of servers each admin can handle has gone up dramatically. Back then it was around 1 admin for every 25 servers in an academic environment like UIUC. Today, the common ratios in industry range from a few hundred to a few thousand servers per engineer. I don’t think I’m allowed to say publicly what the specific numbers are here at TripAdvisor, but it is within that range. But we still need new engineers every day to meet needs as the internet scales, and as we need to find even more efficiencies to continue to crank that ratio up.
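
To put those ratios in perspective, here is a rough back-of-the-envelope sketch. Only the 1-to-25 ratio comes from the story above; the fleet size of 10,000 servers and the 300 and 3,000 figures are made-up stand-ins for "a few hundred to a few thousand", not real numbers from any company.

```python
import math

def engineers_needed(servers, servers_per_engineer):
    """Round up: you cannot hire a fraction of an engineer."""
    return math.ceil(servers / servers_per_engineer)

FLEET = 10_000  # hypothetical fleet size, chosen only for illustration

for ratio in (25, 300, 3000):
    count = engineers_needed(FLEET, ratio)
    print(f"1 engineer per {ratio} servers -> {count} engineers")
```

Under those assumptions the same fleet goes from needing hundreds of people to needing a handful, which is the whole point of chasing a better ratio.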

Where do the production operations engineers come from? Many of us are ex-military, went to trade schools, or came to the career through a desire to tinker unrelated to college training. As I stated earlier, while a degree in computer science helps a lot in understanding the foundations of what I do, many of the best engineers I’ve had the pleasure of working with were art, philosophy, or rhetoric majors. In hiring, we look for people who have strong problem solving desires and abilities, people who handle pressure well, who sometimes like to break things or take them apart to see how they work, and people who are flexible and open to changing requirements and environments. I believe that, because for a while computers just “worked” for people, a whole generation of young people in college, or just graduating college, never had the need or interest to look under the hood at how systems and networks work. In contrast, while I was in college, we had to compile our own Linux kernels to get video support working, and do endless troubleshooting on our own computers just to make them usable for coding and, in some cases, daily operation on the campus network.

So generally speaking, recent college graduates trained in computer science have tended to gravitate towards the more “glamorous” software engineering and design positions, and continue to. How do we attract more interest in our open positions, and in the career as an option as early as college? I don’t have a good answer for that. I’ve asked my peers, and many of them don’t know either. I was thrilled to go to the 2014 SREcon in Santa Clara earlier this month (https://www.usenix.org/conference/srecon14), and for the most part the discussion panels there and the engineers and managers there from all the big Silicon Valley outfits (Facebook, Google, Twitter, Dropbox, etc.) face the same problem. It’s admittedly even worse for us at TripAdvisor as an east coast company fighting against the inexorable pull of Silicon Valley on the talent pool here.

One thing I’ve come to strongly believe, and which I think is becoming the norm in industry operations groups, is that we need to broaden our hiring windows more. We need to attract young talent and bring in the young engineers, who may not even be strictly sure that they want an operations or devops career, and show them how awesome and cool it really is (ok, at least I think it is). To this end, I gave a talk at MIT a little over a year ago on this subject — check out the slides and notes here. I didn’t know that this is what I wanted to do for sure until about a week before I graduated from MIT in 2000. I had two post-graduation job offers on the table, and I chose a position as an entry-level UNIX systems administrator at Massachusetts General Hospital (radiation oncology department, to be more specific) over the higher paying Java software engineering job at some outfit named Lava Stream (which as far as I can tell does not exist anymore). Turns out I made the right decision. The rest of my career history is in my LinkedIn profile (https://www.linkedin.com/profile/view?id=8091411) if anyone is curious. No, I’m not looking for a new job.

“Now (and forever) Hiring”

So, if anyone reading this is entering college, or just leaving college, or thinking of a career change, give operations some consideration. Maybe teach yourself some Linux skills. Take some online classes if you have time or think you need to. Brush up on your Python and shell scripting skills. At least become a hobbyist at home and figure out some of those skills you see in our open job positions (Nagios, Apache, Puppet, Hadoop, Redis, whatever). Who knows, you might like it, and find yourself in a career where recruiters call you every other day and you can pretty much name your own salary and company you want to work for.

And specifically for my group at TripAdvisor? We manage the world’s largest travel site’s production infrastructure. It’s a fast-moving speed-wins type of place (see my previous blog post) and we are hiring. Any of this sound interesting to you? Even if you don’t think you fit any of the descriptions below but might be up for some mentoring/training and maybe an internship or more entry-level position, tweet at me or drop me an email and we’ll see what we can do. See you out there on the internets.

Job Opening: Technical Operations Engineer

TripAdvisor is seeking a senior-level production operations engineer to
join our technical operations team. The primary focus of the technical
operations team is the build-out and ongoing management of TripAdvisor’s
production systems and infrastructure.

You will be designing, implementing, maintaining, and troubleshooting
systems that run the world's largest travel site across several
datacenters and continents. TripAdvisor is a very fast growing and
innovative site, and our technical operations engineers require the
flexibility and knowledge to adapt to and respond to challenging and
novel situations every day.

A successful candidate for this role must have strong system and network
troubleshooting skills, a desire for automation, and a willingness to
tackle problems quickly and at scale all the way from the hardware and
kernel level, up the stack to our database, backend, web services and
code.

Some Responsibilities:
- Monitoring/trending of production systems and network
- General Linux systems administration
- Troubleshooting performance issues
- DNS and Authentication administration
- Datacenter, network build-outs to support continued growth
- Network management and administration
- Part of a 24x7 emergency response team

Some Desired Qualifications:
- Deep knowledge of Linux
- Experienced in use of scripting and programming languages 
- Experience with high traffic, internet-facing services
- Experience with alerting and trending packages like Nagios, Cacti
- Experience with environment automation tools (puppet, kickstart, etc.)
- Experience with virtualization technology (KVM preferred)
- Experience with network switches, routers and firewalls

Job Opening: Information Security Engineer

TripAdvisor is seeking an Information Security Engineer to join our 
operations team. You will be charged with the responsibilities for 
overall information security for all the systems powering our sites, the 
information workflow for the sites and operational procedures, as well 
as the access of information from offices and remote work locations.

Do you have the talent to not only design, but actually implement and 
potentially automate firewall, IDS/IPS configuration changes and manage 
day-to-day operations? Can you implement and manage vulnerability scans, 
penetration tests and audit security infrastructure?

You will be collaborating with product owners, product engineers, and
operations engineers to understand business priorities and goals, company
culture, development processes, and operational processes to identify
risks and then work with teams on designing and implementing solutions or
mitigations. You will be the information security expert in the company
who tracks and monitors new/emerging vulnerabilities, exploitation
techniques and attack vectors, and evaluates their impact on
services in production and under development. You will provide support
for audit and remediation activities. You will be working hands-on on our 
production systems and network equipment to enact policy and maintain a 
secure and scalable environment.

Desired Skills and Experience

* BSc or higher degree in Computing Science or equivalent desired
* Relevant work experience (10+ yrs) in securing systems and infrastructure
* Prior experience in penetration testing, vulnerability management, forensics
* Prior experience required in the area of IDS/IPS, firewall config/management
* Experience with high traffic, Internet-facing services
* Ability to understand and integrate business drivers and priorities into design
* Strong problem solving and analytical skills
* Strong communication skills with both product management and engineering
* Familiar with OWASP Top-10
* Relevant certifications (CISSP, GIAC Gold/Platinum, and CISM) a plus


Heartbleed, Internet Security and What it Means to You


For those not in the know, or who haven’t caught any of the news stories popping up today in mainstream media, we are in the midst of dealing with a very serious vulnerability that has been discovered in the foundation of secure data transmission on the internet. While many of the news stories out there are filled with some ridiculous hyperbole, it would be dangerous to understate the criticality of what was discovered.

SSL (Secure Sockets Layer) is a protocol for letting your computer and other systems communicate across the internet with negotiated encryption (so people can’t snoop on your passwords and other sensitive transmitted information), and authentication (so you have a way of knowing that when you’re filling in information at your bank’s website it actually is going to your bank’s website). Anytime you’re at a website with “https” in the URL, or that little lock icon in your address bar, your communications are protected by this protocol, and code running in your browser and on the server you’re communicating with works on encrypting and decrypting the information flying through the tubes. The SSL protocol was initially developed by our old friends at Netscape in the mid-1990s, and is what makes e-commerce and a good portion of our modern economy and communications possible.

The Heartbleed Bug lets any attacker send a somewhat-carefully crafted message to a web server running this SSL code and get back arbitrary contents of the memory within that server. This is, sadly, not an uncommon type of bug (anyone who has ever programmed will recognize the horror and commonality of array bounds-checking and buffer overflow problems). On a web server, however, some things that get returned from memory when it is poked with this attack include:

  • The web server’s secret key – This is the key that’s used to actually encrypt all traffic. If you are running a secure website and were vulnerable to this bug, in my opinion, you should assume that your key has been compromised and generate a new key and certificate for encrypting future traffic. Fortunately, due to the “authentication” part of the SSL protocol, in order to take advantage of having a server private key and certificate, you’d have to launch a “man in the middle” attack — which takes a bit more work and often involves actually penetrating the network of your victim and/or hijacking internet DNS service for your victim. Still, this is a very bad thing to leak.
  • Sensitive Information – Usernames, passwords, things filled out in forms and submitted to the website by other customers at the time the attack is launched will be present in the server memory in plaintext and can be retrieved. It’s not a bad idea to change your passwords regularly on websites anyways, but this bug might provoke you to go and do it right now.
  • Session Cookies – Many secure websites keep track of which users are logged in and which aren’t by sharing a little bit of data with you known as a “cookie.” It’s pretty much a magic number that your browser can present to the website to say “hey it’s me again.” The web server will then look it up in the database to say “oh yeah, you logged in successfully a few hours ago, you’re still good.” This is how you can go to websites like Facebook repeatedly and not have to enter your password over and over again. Other users’ session cookies will be present in the server memory in plaintext and can be retrieved by this attack. This is called “sidejacking” and is (in my opinion) the most frightening aspect of this bug. This blog has a more detailed example of using this vulnerability to do a sidejacking, and confirms that this is possible on at least one “fairly popular website.”
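
The mechanics of the leak are easier to see in a toy model. The sketch below is not OpenSSL code; it just assumes a server that keeps a request's payload in memory right next to other data and echoes back as many bytes as the client claims to have sent. The function names and the sample "secrets" are invented for illustration.

```python
def build_memory(payload: bytes, neighbours: bytes) -> bytes:
    """Pretend server memory: the request payload, then unrelated data."""
    return payload + neighbours

def heartbeat_vulnerable(memory: bytes, claimed_len: int) -> bytes:
    # Trusts the length field supplied by the client, so a large claim
    # reads straight past the payload into whatever sits next to it.
    return memory[:claimed_len]

def heartbeat_checked(memory: bytes, payload_len: int, claimed_len: int) -> bytes:
    # Bounds check: ignore any request that claims more than it sent.
    if claimed_len > payload_len:
        return b""
    return memory[:claimed_len]

payload = b"bird"
memory = build_memory(payload, b"|session=abc123|password=hunter2")

leak = heartbeat_vulnerable(memory, 40)               # claims 40 bytes, sent 4
safe = heartbeat_checked(memory, len(payload), 40)    # same request, rejected
honest = heartbeat_checked(memory, len(payload), 4)   # well-formed request

print(leak)
print(safe)
print(honest)
```

The checked variant only mirrors the general shape of the remedy (refuse to echo more than actually arrived); the real patch differs in detail.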

This bug was disclosed in what we call a “responsible” manner. The researchers that were supposedly first to discover it did not release it to the public, but went directly to the OpenSSL project and, in turn, large stakeholders were notified several weeks ago. It can be assumed that sites like Google, Facebook, Akamai (which is good because they actually terminate a good portion of the web’s SSL, including TripAdvisor’s), and hosting providers like CloudFlare had already repaired the vulnerability before yesterday. Sadly, it appears that the publication of the vulnerability on April 7th was earlier than hoped. Linux distribution providers (Debian, CentOS, Red Hat, Ubuntu) who provide the OpenSSL code packages that people like me actually have to get to install on our web servers, were not providing a fix in some cases until late in the evening on the 7th, well after exploit code was in the wild. Furthermore, while I trust the researchers listed as the discoverers of this bug, I cannot (nor should anyone) be 100% certain that someone else hadn’t already discovered this problem and been attacking websites with it for several months, stealing private keys and sensitive information and credentials. So while it’s comforting that responsible disclosure and fast action on the part of the people that run the web sites you visit every day (people like me) have potentially mitigated the problem, the consequences of this vulnerability are (as you can see in the list above) far reaching and somewhat frightening.

“So as a regular person, how worried should I be?”

This is a common question a lot of people have been asking in the past day or two. I can’t pretend to understand your own risk and paranoia level, but I will attempt to convey how I feel. This is not a reason to stop trusting the little lock icon in your browser or the “https” in the URL. Bugs happen, sometimes information is leaked, and then they get fixed. Any damage done by this has already been done and there’s no reason to yank out your ethernet cables and delete your Facebook and Twitter accounts. What you should do (and should be doing already) is practice some common-sense web security techniques. If there’s a bright side to this bug, it’s that this may increase everyone’s awareness and get people to do the following:

  • Change your passwords: This is a no-brainer. If anyone gets your account information (through this vulnerability or any other means), it becomes useless to them once you change your passwords. I do this every few months.
  • Don’t use the same passwords on multiple sites: This is a common problem. Here at TripAdvisor the only thing your password protects is a bunch of travel reviews. You may think “oh whatever, big deal.” But research (and anecdotal evidence) shows that many people use the same exact password and username on many sites. The same username and password a user uses on TripAdvisor may very well be their gmail password, or the password for their online banking, or facebook or twitter. Websites get hacked all the time (none that I’m responsible for, of course, LOL [yes, I just typed LOL]) — sometimes without the public even knowing about it. So be smart. Even I don’t use a unique password for every website, but I have a set of four or five that I use for different classes of sites (social media password, email password, financial services password, shell login password, etc.).
  • Pick a good password: People have been saying this forever, but I will say it again. Quick story: when I was at UIUC running the campus Email and UNIX shell/file sharing services, we first ran a password cracker against our users’ accounts. The way that these “brute force” attacks work is that an attacker will attempt login using dictionary words, names and other things. The most common password, by far, was actually password. Among the top 5 were also fuckyou, ncc1701, various people’s names (obviously people choose their girlfriend/boyfriend/mother/father’s names for passwords), and in several dozen cases people actually used their usernames as their passwords. These days many websites will prevent you from using a weak password. So don’t be dumb. Pick a good password. It should not be dictionary-word based. Even replacing letters with numbers is easily decoded by brute-force attackers, so don’t think you’re fooling anyone. Don’t use anyone’s name in your password either. And don’t even use a combination of dictionary words, names, and l33t-sp34k numbering. The brute-force password crackers are at least as smart as you and have a lot more time and computing power.
“So as a website operator or systems engineer, what should I do?”

You should act immediately if you have not already. If you run your own web server, upgrade your OpenSSL package right this goddamn minute. Also, since the library is loaded in memory at service-start time, you will need to restart your web server or any other service relying on the flawed library. To be safe, just reboot after you upgrade the package. There also might be code that was built statically linked to the flawed library. In that case you’ll have to recompile and re-install it. Run common vulnerability scanners like Nessus (or other tools available) against everything you have running. If you have a website that’s hosted elsewhere, contact your hosting provider immediately. Make sure they are patched and no longer vulnerable. Also, replace your SSL key and certificate. Some will say that this step is overly paranoid, and your hosting provider might even give you shit for insisting that they generate a new key and certificate for you. As I stated above, while these researchers responsibly disclosed this bug, the possibility that this was out in the wild before cannot be dismissed.
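
Back on the password advice for a moment: the dictionary-style guessing described above, including the number-for-letter trick, can be sketched in a few lines. This is a simplified illustration, not any particular cracking tool; the substitution table, the function names, and the two-word list are all assumptions made up for the example, and real crackers also try combinations of words, names, and suffixes.

```python
import itertools

# A handful of common letter-to-digit swaps; real rule sets are far larger.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def variants(word):
    """Yield a word plus every l33t-substituted spelling of it."""
    options = [(ch, LEET[ch]) if ch in LEET else (ch,) for ch in word.lower()]
    for combo in itertools.product(*options):
        yield "".join(combo)

def is_guessable(password, username, wordlist):
    """True if the password is the username or a wordlist entry, l33t or not."""
    target = password.lower()
    return any(guess == target
               for word in [username, *wordlist]
               for guess in variants(word))

WORDLIST = ["password", "enterprise"]  # hypothetical and tiny on purpose

for candidate in ("password", "P455w0rd", "jdoe", "x7#Lq9!vB2"):
    print(candidate, is_guessable(candidate, "jdoe", WORDLIST))
```

Even this toy version catches the swapped-digit spelling immediately, which is why the substitution trick buys so little.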

Timeline:

  • December 2011: Bug is introduced into the heartbeat function of the OpenSSL library
  • March 14th 2012: OpenSSL v1.0.1 released into the wild with the bug
  • March? 2014: Bug is discovered by some combination of Neel Mehta at Google Security and Matti Kamunen, Antti Karjalainen and Riku Hietamäki from Codenomicon and reported to the OpenSSL project.
  • March-April 2014: NCSC-FI and OpenSSL work to notify some subset of stakeholders ahead of time of the vulnerability, apparently with a patch and a workaround
  • April 7th 2014: News breaks of the vulnerability and the NCSC-FI team needs to go public with it so the rest of the world can fix their web servers
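
For operators, one quick first check is whether an upstream OpenSSL version string falls in the affected range. That range is generally listed as 1.0.1 (the release in the timeline above) through 1.0.1f, with the fix in 1.0.1g. The sketch below assumes version strings shaped like the ones OpenSSL prints; the helper name is hypothetical, and the 1.0.2 betas are deliberately ignored.

```python
import re
import ssl

# Upstream releases 1.0.1 through 1.0.1f; the 1.0.2 betas are not handled.
_AFFECTED = re.compile(r"(?<![\w.])1\.0\.1([a-z]?)(?![\w.])")

def upstream_version_is_affected(version_string):
    """True if the string names an upstream release from 1.0.1 to 1.0.1f."""
    match = _AFFECTED.search(version_string)
    if match is None:
        return False
    letter = match.group(1)          # "" for plain 1.0.1, else the patch letter
    return letter == "" or letter <= "f"

if __name__ == "__main__":
    # This reports on the library Python itself was linked against,
    # which is not necessarily the one your web server uses.
    print(ssl.OPENSSL_VERSION, upstream_version_is_affected(ssl.OPENSSL_VERSION))
```

Distribution packages frequently backport fixes without changing the upstream version letter, so treat a positive result as a prompt to check your vendor's advisory rather than as proof, and a negative result says nothing about statically linked copies.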

The 2014 MIT Mystery Hunt – Alice Shrugged


Winning the hunt in 2013


Calling the winning team in 2014

So we ran the MIT Mystery Hunt this year (our dubious award for winning it last year). The experience is pretty well bookended by the above two pictures: one of Laura answering our phone in 2013 to hear that we had answered the final meta successfully and won, and another one of Laura calling the winning team in 2014 (One fish, two fish, random fish, blue fish) to congratulate them on answering the final meta successfully. I have no idea how or where to begin describing what it was like to do this this year. My team ran the hunt in 2004, but I was out of town in Champaign, IL at the time and played no part in that little misadventure. This time around, I was on the leadership committee, in charge of hunt systems, IT and infrastructure.

Thank You

First I’d like to thank the other members of the systems team. James Clark just about single-handedly wrote a new django app and framework (based loosely on techniques and code from the 2012 codex hunt) which we will be putting up on github soonish and we hope that other teams can make use of it in the future — we have dubbed it “spoilr”. It worked remarkably well, and has several innovative features that I think will serve hunt well for years to come. Joel Miller and Alejandro Sedeño were my on-site server admin helpers and helped keep things running and further adjusted code (although only slightly) during the hunt. Josh Randall was our veteran team member on call in England (which helped because he was available during shifted hours for us). And Matt Goldstein set up our HQ call center and auto-dial app with VoIP phones provided by IS&T.

With the exception of only a few issues (which I’ll try to address below), from the systems side of things the hunt ran extremely well. We were the first hunt in a while to actually start on time, and we were also the first hunt in a while to actually have the solutions and hunt solve statistics up and posted by wrap-up on Monday. This hunt had a record number of participants and a record number of teams (both higher than we planned when designing and testing the system), making our job all the more difficult. And of course, I’d like to join everyone else in thanking the entire team of Alice Shrugged that made this hunt possible. It was great working with you all year and pulling off what many feel was a fantastic hunt.

Hunt Archive and Favorites

To look at the actual hunt, including all puzzles and solutions, and some team and hunt statistics, go to the 2014 Hunt Archive. My favorite puzzles (since everyone seems to be asking) were: Callooh Callay World and Crow Facts. Okay, I guess Stalk Us Maybe was pretty neat too.

Apologies

First of all, I’d like to apologize on behalf of the systems team for the issues with Walk Across Some Dungeons. This performed artificially well during our load test, but load test clients are far better behaved than actual real people on a real network. There were several socket locking and connection starvation issues with the puzzle even after we spent all day (and night) Friday parallelizing it onto as many as 7 virtual servers. Eventually we patched the code to allow for better handling of dropped connections and to be more multi-threaded within each app instance and by late Saturday night it was working much better. The author has since ported it to JavaScript, and it should work fine well into the future. Lesson to future teams: don’t try to write your own socket-handling code or any server-side code for that matter that has to interact with the hunters. The issues surrounding the puzzle (lag, and the frequent reset back to stage 1 requiring many clicks to get back to your position) affected every team equally for the first day and were not fixed until at least the top 5 teams had already solved the puzzle in its “difficult” state. So luckily this was not a fairness issue, it just made a puzzle a LOT harder than it should have been.

Second, there was an interesting issue with our hunt republishing code. At times during the hunt there were errata (remarkably few, actually) and some point-value and unlocking-system changes (will mention more about this below) that required a full republish of the hunt for all teams. This is not unusual. However, with the number of teams and hunters and the pace our call handlers (particularly Zoz the queue-handling machine) were progressing teams through the hunt on Friday in particular, this created a race condition. If any puzzle unlocks happened during one of these republishes, they would be put back to the state they were in when the publish started. Since a republish takes a looooooong time for all of these teams and puzzles, a number of teams noticed “disappearing” puzzles and unlocks on Friday while we were updating our first errata in puzzles and then later Friday night when we changed the value of Wonderland train tickets to slow the hunt down a bit. We alleviated this slightly in the spoilr code by making the republish iterate team by team rather than take its state of the whole hunt at the beginning and then apply it to everyone. By later on Friday though, teams had enough puzzles unlocked that even just republishing for a team had a risk of coinciding with a puzzle unlock, so we simply froze the handling of the call-in queue while we were doing these. As a note for future teams, this could probably be fixed by making the republish work more transactional in the code.
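
The race described above, and one possible "more transactional" shape for the fix, can be sketched as follows. This is not spoilr's code: the class, the function names, and the version-counter approach (a form of optimistic concurrency) are assumptions chosen for illustration, and the interleaving is simulated in a single thread so the outcome is repeatable.

```python
class TeamState:
    """Hypothetical stand-in for one team's unlock state."""
    def __init__(self):
        self.unlocked = {"puzzle-1"}
        self.version = 0                  # bumped on every change

    def unlock(self, puzzle):
        self.unlocked.add(puzzle)
        self.version += 1

def republish_racy(team, slow_work):
    snapshot = set(team.unlocked)         # state as of the start
    slow_work()                           # unlocks may land in here...
    team.unlocked = snapshot              # ...and get overwritten here

def republish_checked(team, slow_work, max_tries=5):
    for _ in range(max_tries):
        seen = team.version
        snapshot = set(team.unlocked)
        slow_work()
        if team.version == seen:          # nothing changed: safe to commit
            team.unlocked = snapshot
            return True
        # Otherwise the snapshot is stale: throw it away and start over.
    return False

def one_unlock(team):
    """Slow work that lets puzzle-2 unlock the first time it runs."""
    pending = ["puzzle-2"]
    def work():
        if pending:
            team.unlock(pending.pop())
    return work

racy = TeamState()
republish_racy(racy, one_unlock(racy))

safe = TeamState()
committed = republish_checked(safe, one_unlock(safe))

print(sorted(racy.unlocked))   # puzzle-2 has "disappeared"
print(sorted(safe.unlocked))   # puzzle-2 survives the republish
```

In a real multi-process setup the compare-and-commit step would itself need to be atomic, for example by doing it inside a database transaction; the sketch only shows the overall shape.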

Release Rate, Fairness, and Fun

On this subject I cannot pretend to speak for the whole team (nor can anyone probably), but I will share what I experienced and what I think about it. Many medium and small-sized teams have written to congratulate us on running a hunt that was fun for them and that encouraged teams to keep hunting in some cases over 24 hours after the coin was found. On the flip side some medium and large-sized teams were a bit disappointed in the later stages of the hunt when puzzles unlocked at a slower rate (particularly once all rounds were unlocked) leaving them with fewer puzzles to work on and creating bottlenecks to finishing the hunt. One of the overriding principles of us writing this hunt was to make it fun for small teams, and fair for large teams. The puzzle release mechanism in the MIT round(s) was fast, furious and fun. Something like 30 teams actually solved the MIT Meta and got to go on the MIT runaround and get the “mid-hunt” reward. From the beginning of our design, the puzzle release mechanism for the Wonderland rounds (particularly the outer ones) was constrained to release puzzles in an already-opened round based only on correct answers in that round. The rate of how many answers in a round it took to open up the next set of puzzles in that round, and the order in which puzzles were released in a given round was designed to require focused effort on a smaller number of opened puzzles in order to progress to a point where those metas were solvable. This rate was, incidentally, tweaked to be somewhat lower on Friday night (but only for the two rounds no team had opened yet) in a concerted effort to make sure the coin wasn’t found as early as 6-8pm on Saturday. Coming from a large team myself, I have seen the effect of the explosion of team size on the dynamics of Mystery Hunt. This is an issue that teams will face for years to come, and everyone may choose to solve it a different way. But once again, our overriding goal was to make the hunt fun for small teams, and fair for large teams, and I think we did just that.
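
As a rough illustration of a per-round release rule like the one described above, here is a minimal sketch. It is not the hunt's actual unlock code: the function, its parameters, and their default values are all hypothetical, and the only properties taken from the description are that puzzles open in a fixed order and that only correct answers within the same round count.

```python
def open_puzzles(round_order, solved, initially_open=2,
                 answers_per_batch=1, batch_size=1):
    """Return the puzzles in one round that a team can currently see."""
    solved_here = sum(1 for puzzle in round_order if puzzle in solved)
    batches = solved_here // answers_per_batch
    count = min(len(round_order), initially_open + batches * batch_size)
    return round_order[:count]

ROUND = ["p1", "p2", "p3", "p4", "p5"]   # hypothetical release order
solved = {"p1", "elsewhere"}             # "elsewhere" belongs to another round

fast = open_puzzles(ROUND, solved, answers_per_batch=1)
slow = open_puzzles(ROUND, solved, answers_per_batch=2)

print(fast)
print(slow)
```

Raising `answers_per_batch` is the knob that slows a round down: the same single solve opens a new puzzle under the fast setting and nothing under the slow one.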

Architecture Overview

For the curious, and to those running the hunt next year, our server setup was fairly simple. We had one backend server which ran a database and all of the queue-handling and hunt HQ parts of the spoilr software (in django via mod_wsgi and apache). There were two frontend servers which shared a common filesystem mount with the backend server so all teams saw a consistent view of unlocks. Each team gets its own login and home directory which controls their view of the hunt when the spoilr software updates symlinks or changes the HTML files there. The spoilr software on the frontends handled answer submissions and contact HQ requests among some other things, but they were mostly just static web servers. We didn’t need two for load reasons, we just had both running for redundancy in case one pooped out over the weekend. However, splitting the dynamic queue operations and Hunt HQ dashboards off from the web servers that 1500+ hunters were hitting for the hunt was a necessity. Each of the front ends also acted as a full streaming replica of the database on the backend server, and we had a failover script ready so the hunt could continue even if the backend server and database failed somehow. There was also a streaming database replica and hunt server in another colocation facility in Chicago in case somehow both datacenters that the hunt servers were in failed or lost internet connectivity. I’d like to thank Axcelx Technologies for providing us with hosting and support, and would recommend them to anyone looking for a reasonably priced virtual server provider or colocation provider.
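
The per-team directory idea can be sketched as follows: unlocking a puzzle for a team amounts to linking shared puzzle content into that team's directory, and a static web server then serves whatever is visible there. The directory layout, the names, and the helper function are hypothetical, not spoilr's actual structure, and the example assumes a POSIX filesystem where creating symlinks needs no special privileges.

```python
import tempfile
from pathlib import Path

def unlock_for_team(team_home: Path, puzzle_dir: Path) -> Path:
    """Expose one puzzle to one team by linking it into the team's directory."""
    link = team_home / puzzle_dir.name
    if not link.is_symlink():             # unlocking twice is harmless
        link.symlink_to(puzzle_dir, target_is_directory=True)
    return link

with tempfile.TemporaryDirectory() as root:
    root = Path(root)

    # Shared puzzle content, stored once.
    puzzle = root / "puzzles" / "example_puzzle"
    puzzle.mkdir(parents=True)
    (puzzle / "index.html").write_text("<h1>puzzle body</h1>")

    # One team's home directory, initially empty.
    home = root / "teams" / "team_a"
    home.mkdir(parents=True)

    link = unlock_for_team(home, puzzle)
    unlock_for_team(home, puzzle)         # second call is a no-op

    visible = sorted(p.name for p in home.iterdir())
    served = (link / "index.html").read_text()

print(visible)
print(served)
```

One appeal of this shape is that the frontends stay simple: they serve files, and all of the unlock logic lives in whatever process maintains the links.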

As far as writing the hunt goes, we used the now-standard “puzzletron” software and made a lot of improvements to that and hope to get that pushed back up to gitweb for the next team to start writing with. We had dev and test instances of puzzletron running all year so we could deploy our new features quickly and safely as our team came up with neat new things to track with it. Beyond that, we set up a mediawiki wiki, and a phpbb bulletin board, as well as several mailman mailing lists and a jabber chat server (which nobody really used). As a large team, collaboration tools have always been very important for us in trying to win the hunt, and were even more important in writing it. In retrospect, we probably should have taken more time to develop an actual electronic ticketing system (or find one to use) for the run-time operations of the hunt. Instead we ended up using paper tickets which passed back and forth between characters, queue handlers, and the run-time people. Since this hunt had so many interactions and so many teams which needed to get through them, this got clumsy and some were dropped or not checked off early in the hunt (I’m very sorry if this happened to any teams and delayed unlocks of puzzles/rounds early on).

In Closing

In closing, I had a great time working on the hunt. I can’t say how great it would have been to go on it, since sadly I did not get to. But, hearing the generally positive comments from everyone thus far, I’m glad we didn’t screw it up :) The mailing list aliceshrugged@aliceshrugged.com will continue to work into the future, and I look forward to getting some of our code and documentation posted up for random to perhaps use and further improve upon next year, and for other teams to carry on the tradition for many years to come.


Oreo Insanity


I think someone must have slipped the product development team at Nabisco some meth.

Oreos are an amazing food product. They are, in fact, probably my favorite cookie (I’m a fan of the golden variety). But what on earth would possess the makers of the greatest sandwich cookie in the universe to go on this recent insane quest to make as many different new varieties as possible?

Okay, I kind of get the motivation for candy corn Oreos. It was Halloween, after all, and that was a novelty. But looking through Amazon, one is assaulted with all sorts of Oreo insanity. Aren’t they worried about brand dilution? Not to mention that some of these flavors sound potentially even more vile than candy corn:

Candy Cane Oreos

Winter Oreos

Gingerbread Oreos

Candy Corn Oreos

Mega Stuf Oreos

Watermelon Oreos

Cool Mint Oreos

Banana Split Oreos

Berry Burst Ice Cream Oreos

Halloween Oreos

Peanut Butter Oreos

Triple Double Oreos

Triple Double Neapolitan Oreos


On The Asshats at straightprideuk.com (and the Streisand Effect)


So there’s this group out there in the UK: Straight Pride UK. I hadn’t heard about them until today, and I doubt pretty much anyone in the world had (other than maybe their own immediate circle of homophobic, conservative nitwits). And that’s fine. If you want to read the whole story about what transpired with them, go check out the story at Popehat’s most excellent blog. But here’s the short version: a history student writes to them, identifying himself as a freelance journalist, and asks them a few pointed questions about their positions (mainly that straight people are getting “silenced and abused” and about the mounting censorship in the UK). They write back in a document labeled as a “press release.” Fine, so far so good, they’re entitled to their own opinions, and they’re letting people know. But here’s where they end up going over the edge: the student writes back asking for clarification on a couple of his questions, and mentions that he’s going to post their conversation on his blog. Straight Pride then responds with an angry DMCA takedown letter, complaining that the student did not have the right to publish the email/press release (uh yeah, what does “press release” mean again?). What really sucks is that WordPress.com actually took down his post preemptively, in violation of just about every bit of common sense one would expect them to have.

The Internet, however, detects censorship as damage and routes around it. Here is the Google cache of his original article. And his actual response to the bogus takedown and threats is on his blog. Now, the kind folks at Straight Pride UK are getting a serious taste of the Streisand Effect (the internet phenomenon whereby an attempt to censor information will actually increase publicity of that information exponentially and screw you over, named after a failed attempt by Barbra Streisand to censor aerial photos of her house way back when). Many other bloggers (including me) are re-posting the censored article / letter from Straight Pride UK. So, what was before just one dude pointing out a nutty homophobic group is now the whole internet pointing and laughing. And they still don’t get it:

http://www.popehat.com/wp-content/uploads/2013/08/STOPMAKINGFUNOFUS.jpg

Anyways, without further ado, here is the original post by Oliver exposing these asshats.  Almost makes me ashamed to be straight:

It’s Great When You’re Straight… Yeah

There has never been a better time to be gay in this country. LGBTI people will soon enjoy full marriage equality, public acceptance of homosexuality is at an all time high, and generally a consensus has developed that it’s really not that big of a deal what consenting adults do in the privacy of their bedrooms. The debate on Gay Marriage in the House of Commons was marred by a few old reactionaries, true, but generally it’s become accepted that full rights for LGBTI people is inevitable and desirable. Thank God.

But some are deeply troubled by this unfaltering march toward common decency, and they call themselves the Straight Pride movement.

Determined to raise awareness of the “heterosexual part of our society”, Straight Pride believe that a militant gay lobby has hijacked the debate on sexuality in this country, and encourage their members, among other things, to “come out” as straight, posting on their Facebook page that:

“Coming out as Straight or heterosexual in todays politically correct world is an extremely challenging experience. It is often distressing and evokes emotions of fear, relief, pride and embarrassment.”

I asked them some questions.

First of all, what prompted you to set up Straight Pride UK?

Straight Pride is a small group of heterosexual individuals who joined together after seeing the rights of people who have opposing views to homosexuality trampled over and, quite frankly, oppressed.

With the current political situation in the United Kingdom with Gay Marriage passing, everyone is being forced to accept homosexuals, and other chosen lifestyles and behaviours, no matter their opposing views. Straight Pride has seen people sued, and businesses affected, all because the homosexual community do not like people having a view or opinion that differs from theirs.

Are your beliefs linked to religion? How many of you derive your views from scripture?

Straight Pride aims are neutral and we do not follow religion, but we do support people who are oppressed for being religious. Only today, Straight Pride see that two homosexual parents are planning to sue the Church because they ‘cannot get what they want’. This is aggressive behaviour and this is the reason why people have strong objections to homosexuals.

You say that one of your goals is “to raise awareness of the heterosexual part of society”. Why do you feel this is necessary?

The Straight Pride mission is to make sure that the default setting for humanity is not forgotten and that heterosexuals are allowed to have a voice and speak out against being oppressed because of the politically correct Government.

Straight Pride feel need to raise awareness of heterosexuality, family values, morals, and traditional lifestyles and relationships.

Your website states that “Homosexuals have more rights than others”. What rights specifically do LGBTI people have that straight people are denied?

Homosexuals do currently have more rights than heterosexuals, their rights can trump those of others, religious or not. Heterosexuals cannot speak out against homosexuals, but homosexuals are free to call people bigots who don’t agree with homosexuality, heterosexuals, religious or not, cannot refuse to serve or accommodate homosexuals, if they do, they face being sued, this has already happened.

Straight Pride believe anyone should be able to refuse service and speak out against something they do not like or support.

There is a hotel in the south of England, called Hamilton Hall which only accepts homosexuals – if this is allowed, then hotels should have the choice and right to who they accommodate.


What has been the response to your campaign?

The response to Straight Pride’s formation has been as expected; hostile, threatening, and aggressive. Homosexuals do not like anyone challenging them or their behaviour.

We have had support from many people saying that if homosexuals can have a Pride March, and then equality should allow Heterosexuals to have one too. After all, the homosexual movement want everyone to have equality.

Why would you say that heterosexuality the “natural orientation”?

Heterosexuality is the default setting for the human race, this is what creates life, if everyone made the decision to be homosexual, life would stop. People are radicalised to become homosexual, it is promoted to be ‘okay’ and right by the many groups that have sprung up.

Marriage is a man and a woman, homosexuals had Civil Partnerships, which was identical to Marriage with all the same rights, they wanted to destroy Marriage and have successfully done so.

If you could pick one historical figure to be the symbol of straight pride (just as figures like Alan Turing, Judith Butler or Peter Tatchell would be for Gay Pride) which would you choose?

Straight Pride would praise Margaret Thatcher for her stance on Section 28, which meant that children were not taught about homosexuality, as this should not on the curriculum.

More recently, Straight Pride admire President Vladimir Putin of Russia for his stance and support of his country’s traditional values.

How do you react to anti-gay attacks and movements in Russia and parts of Africa?

Straight Pride support what Russia and Africa is doing, these country have morals and are listening to their majorities. These countries are not ‘anti-gay’ – that is a term always used by the Homosexual Agenda to play the victim and suppress opinions and views of those against it.

These countries have passed laws, these laws are to be respected and no other country should interfere with another country’s laws or legislation.

We have country wide events which our members attend, and ask people their opinions and views, on such event at Glastonbury this year was very positive with the majority of people we asked, replied they were happily heterosexual.


For the record, Straight Pride did not respond to these questions:

“Pride” movements such as Gay Pride and Black Pride were making the argument that the stigma against them meant that proclaiming their “pride” was an act of liberation from oppression. Can being heterosexually really compare?

A problem that Gay rights activists cite is the issue of bullying, and the effect this can have on young LGBT people. Do you think a similar problem exists with straight children being bullied by gay children?

I will obviously add to this if they do respond.


MIT Mystery Hunt 2013 (a.k.a. The Misery Hunt)


Happy June everyone. Back in January, I had the privilege of being on the winning team of the 2013 MIT IAP Mystery Hunt (pretty sure I already mentioned that a couple of posts ago). For those unaware, we were a huge team (100+ people), and the name of our team was the full text of Ayn Rand’s Atlas Shrugged. Whenever we’d communicate with hunt HQ, we’d continue reading the text until they made us stop (or let us stop). Some among us thought this would be a neat, clever idea, maybe even “cute.” In reality, however, it just added to the pain and misery of what turned out to be an already painful and misery-drenched (but also somewhat fun) hunt. Tired-sounding reading of the rambling Randian prose quickly became the leitmotif for the weekend.

Other people have already shared their opinions and experiences about the hunt (Google “2013: the year the mystery hunt broke” for an example). The organizers (Manic Sages) have, in my opinion, already gotten more than enough criticism dumped on them for putting together the longest weekend in hunt history, one that almost ended by decree or draw, which would have been disastrous for the 2013 hunt, as well as for the concept and tradition of the mystery hunt moving forward.

The word “grueling” was the one word I used most when people asked me what the hunt was like this year.  I’ve participated in the hunts with this team consistently for the past 5 years and on and off going back to 2004 (the last year our team won) and earlier.  I spent four years at MIT, with all of the all-nighters, failing grades, and frustration that that entails.  But I’ll be damned if the 2013 mystery hunt wasn’t one of the most intellectually demoralizing experiences of my life. Is that such a bad thing? In retrospect,  I’m not so sure. Challenging experiences and “rolling up one’s sleeves and getting to work” (sometimes by doing insane statistical analysis on endless streams of random numbers) are ways in which we attain personal growth — right?  Maybe if it was just as difficult, but shorter.  Maybe if it had some more fun and games mixed in.  Maybe if we didn’t decide to do that stupid Atlas Shrugged thing.  Maybe then the hunt would have been FUN as well as just grueling — and wouldn’t have left me with hunt PTSD.  Seriously, I’m not alone on my team in having had nightmares up to a week after hunt about still doing the hunt, or still needing to solve a meta.

I’m not going to bore everyone with detailed stories of extremely difficult puzzles with perhaps one-too-many “a-ha!” moments necessary to solve, or the detailed methods our team uses to keep fresh shifts of solvers moving in and out of the room, taking naps, and ultimately winning the battle of attrition that the last 24 hours of the weekend became.  But I will recount my tale of how the hunt ended (from my perspective).

The beginning of the end was 8pm Sunday night (already over 16 hours after the point the 2012 mystery hunt had ended in its weekend). We already knew by that point that this was going to be a hunt for the ages.  My team had that glazed-over deer-in-the-headlights look that comes from being up for 20-30+ hours in some cases doing extreme mental gymnastics. An email came in from HQ reading: “Our honest estimate of hunt’s end is Monday at 9AM given what we’ve seen of solving rates on our puzzles so far.”  At this time, I was on my way out to a room to sleep for a quick 4 hours (or until hunt ended).  It turns out that not only was hunt nowhere near ending, but with an end-time of 9AM, they were predicting it to surpass the 2004 hunt for the all-time duration record.  And who had written the 2004 hunt?  Our team — then known as “French Armada” (because the team wanted to wear funny hats — from what I hear).  When I woke up to my alarm 4 hours later, my disillusionment with the mystery hunt had turned into a sort of prideful anger.  How dare they assume that their hunt will be even more un-defeatable than the one we’d written (somewhat poorly) a decade earlier?  At around 2am, the late-night shift of fresh puzzlers dug in and I, for one, was hoping to prove the Manic Sages’ assumption wrong and to keep our dubious record of “longest hunt ever.”

But it was not to be. At 6AM, or so, I went back in for another brief nap. 9AM came and went, and another shift of freshly-napped hunters came in. The 2004 French Armada’s hunt length record had fallen.  Free answers to puzzles were getting handed out every 20 minutes now to help draw things to a close.  The requirements for finishing hunt were changed so that one full meta-puzzle (out of 5 total) could be skipped entirely. For those unfamiliar with the standard mechanics of a mystery hunt, there are generally puzzles in “rounds”, and then for each round (or group of rounds), the answers of the puzzles plug into a “meta” puzzle.  Once all meta-puzzles are complete, the team is eligible to go on a final “runaround” (involving, literally, running around and solving more puzzles) and ultimately win the hunt.  So, eliminating an entire meta-puzzle requirement was a big deal — and up until this year, unheard of (at least by me).

At some point on Monday morning, one of our freshmen (Lauren Herring) had been sitting in the same spot working on one of the metas (“The Enigma”) for what seemed like 12 hours.  She’d be sitting there when I left for a nap, and she’d be in the same spot, wild-eyed and turning those same infernal rolls of paper when I got back to the room hours later.  And so eventually, that meta got solved.  And that left us needing exactly one more meta to get to the runaround.  One of them (“Rubik”) seemed totally impossible and we had made little progress on it at all from what I could see.  The other one (“Indiana Jones”) was getting churned on slowly at a table by puzzlers including our “old guard” — the bleary-eyed Mark Feldmeier and Zoz Brooks — with the whole team cheering them on.  Actually it was more nervous pacing, drinking coffee, and watching vs. audible “cheering.”  We were getting close by 10-11am, and calling HQ regularly for hints and clues.  We had even called in to verify a partial answer for it — “hey are the first 8 letters of the answer this?” (this is also unheard of in a mystery hunt) — only to get rebuked.

And then something weird happened.  Our team phone rang, and Manic Sages’ HQ was on the other end.  The inimitable Laura Royden was “manning” the phone at the time and dealing with team-wide organization (we call it “puzzle bitch”ing).  It turns out that an offer of settlement/surrender had been made and was being brokered by the Manic Sages in the interest of ending hunt.  The terms were to stop hunting now and whatever team was deemed the “furthest ahead” by the Sages would be declared the winner.  After being at it for over 70 hours at that point, it was a tempting prospect.  The team huddled together, and Laura told HQ that we’d call them back in “a few minutes”.  How far could any other team possibly be if they were willing to make this offer? We heard rumors of other competitive teams giving up and packing in to go home by this point, leaving us as one of the few teams insane and stubborn enough to still be trying to win.  We knew we only needed one more meta but were, for the moment, scratching our heads on what we were doing wrong with “Indiana Jones.” A couple of the more senior puzzlers on our team (Dan Katz and Erin Rhode at least from what I remember) immediately leaned towards rejecting the offer.  Then we got another call from HQ, and they told us that they’d made a terrible mistake and our partial answer check from earlier was actually on the right track. At that point the choice was clear. Not only did we know that we were the farthest ahead, but we also knew we were potentially only minutes away from winning.  To accept the other team’s surrender at that point would have perhaps been merciful, but wouldn’t have been a good thing for the 2013 hunt, or hunt as a tradition and concept (in my opinion at least).  
Laura shouted clearly into the phone: “no we will not accept your offer!” And, sure enough, about 10 minutes later, she called back with the complete correct answer to our final required meta-puzzle and accepted congratulations that we had, for all intents and purposes (with the exception of the runaround), won the hunt.  Below is a picture commemorating that very moment.

[Photo: the moment our final meta answer was called in]

Once that was done, Manic Sages actually asked if we wanted to do the full runaround, or just be handed the coin and declared the winners at that point.  Staying true to tradition, even though it was almost noon on Monday at that point, we elected to make them put on the entire runaround for us.  At 3:30pm on Monday, a full 75 hours after the hunt began, the coin was found by the small subgroup of our team that was still awake (this did not include me, as I collapsed shortly after the final answer was called in and I knew we had won).

So now what?  Now our team is writing and running the 2014 IAP MIT Mystery Hunt, that’s what.  The experience of last year (and echoes of our 2004 hunt) sort of lend a feeling of “there but for the grace of god, go I” to the whole thing.  Each and every one of us knows (or should know) that it is very much possible, with the best of intentions and the smartest and most experienced people, to write a hunt that turns out to be “bad” or even maybe “a disaster.”  That’s kind of a lot of responsibility, isn’t it?  But alas, we will do our best.  Without further ado, I’ll wrap up here and introduce the board of directors of the 2014 IAP MIT Mystery Hunt.  For continued missives from our team, and guest writers talking about hunt, please visit our blog at http://mysteryhunt.wordpress.com/.  And oh yeah, good luck in 2014 everyone!

Galen Pickard (Executive Producer)

Anand Sarwate (Finance)

Erin Rhode (Director)

Benjamin O’Connor (IT and Infrastructure)

Pranjal Vachaspati (Operations and Logistics)

Laura Royden (Theme)

Harvey Jones (Quality Control)


New Job Observations: Farewell Harmonix, Hello TripAdvisor.


I know it’s April already, but happy new year everyone!

For those not in the know, I got a somewhat unexpected new job prospect (and offer, which I accepted) at the end of 2012. Since then, I’ve been a senior member of the technical operations team at TripAdvisor.

TripAdvisor is the world’s largest travel site, with over 100 million reviews, and over 100 million unique users per month. For people keeping track, this is the third company I worked for during the year 2012, and all three have been mentioned on The Office (Linden Lab [Second Life], Harmonix [Guitar Hero / Rock Band], and now TripAdvisor [check out the Schrute Farms episode]). However, it’s not just popularity or “hipness” that led me to shift around.

I like to tell people (and recruiters) that I have four rules for picking a place to work:

  • Must not be generally evil or tending towards evil (in my opinion) — This rules out most banks or the pharmaceutical industry, any petrochemical company, and probably currently most of Google and Facebook.
  • Must be a profitable venture — I’m too old to play the startup risk game.
  • Must be accessible to my apartment in Boston via public transportation commute of <30 minutes — I don’t own a car, don’t want one, and I’m not moving anywhere.
  • Operations and Systems must be critical to the core business and of the highest priority — My job is best executed when it has the highest respect and attention of the company and management (immediate as well as upper).

It was that last one that I forgot about when I ended up at Harmonix. After being at Linden for 4+ years, I could feel myself falling into the crotchety grizzled BOFH sysadmin role. Come to think of it, that probably happens to anyone in my field after a few years in an organization. Spending a year at Harmonix was a great chance to broaden my horizons, relax, and experience new perspectives on things. As I stated in an earlier blog post, I loved working there, and I do miss the place, people, and incredibly fun things happening in their awesome Central Square office. At TripAdvisor, we’re still in the business of providing joy to people. Rather than by selling some of the best video games, this time it’s by helping folks plan and take vacations.

Very similar to my time at Linden Lab, when I told people that I worked at Harmonix (makers of the Rock Band and Dance Central franchises) the first response was usually “wow that’s really cool.” However, the second response was more often than not, “are they still relevant? What are they working on now?” While it’s true that the heyday of plastic instruments (and maybe console gaming in general, according to some naysayers) has passed, I’m still rooting for the folks over there, and I happen to know that they are still a vital, awesome independent studio with the best people and some blow-your-mind projects in the pipeline. If I were still there, I’d be hustling alongside them doing my best to keep up and push forward the state of game network interaction and back ends. That being said, the effort that game developers (particularly independents) put into network features, operations, and backends is decreasing over time. And it should be. Great games are great because of the focus on art, gameplay, story, and other intangibles. Console manufacturers and third-party contractors can now be brought on to do the job of multi-player matchmaking and scoreboard databases, letting game makers stick to making awesome games and fostering and maintaining player communities, both things that Harmonix has done and will continue to do very well.

What drew me out to TripAdvisor (other than the folks I already know who work there — hi Laura and Drew!) was the scale. Honestly, I missed the excitement and challenges of running a huge infrastructure. At its peak, Second Life consisted of three data centers, 12,000+ servers, and received a new rack of 40 servers or so every couple of weeks. TripAdvisor isn’t quite that big infrastructure-wise (although we have 5 times as many employees), but we serve 2 billion ads a year, and are peaking at 600k web requests per minute (and growing tremendously still year-over-year). The company has a weekly release cycle, an innovative and freewheeling engineering culture, and an unofficial motto of “speed wins.”

At first, being a somewhat methodical systems engineer, the concept of putting velocity in front of “correctness” scared me a little bit. I’ve focused on things like proper cabling, thorough documentation, long planning cycles, enforcing automation prior to production, eliminating waste, etc. Here, though, I quickly learned that it’s important to keep moving and to cut a little slack to the folks that came before me for bad cabling, some missing documentation, or leaving a half dozen underutilized or unused servers around (sometimes literally powered-off in the racks or on the floor) while buying new ones in a hurry. If everyone takes the extra time (myself included) to do things the absolute correct way, we’ll lose our competitive advantage and then I’d be out of a job. So yeah, speed does win.

At this point, I’d be remiss if I didn’t offer you all potential jobs here. So, visit TripAdvisor Careers, find something you want to do, and drop me a line if I know you — I’d love to give a few hiring referrals, and yes we are hiring like crazy as the company expands!
