People have expressed interest in seeing some more details about our use of MySQL at Linden Lab, so I shall indulge them in the next couple of posts. Over the past few months, I have been involved in two major projects involving our central MySQL database cluster. The Second Life central database cluster (known affectionately as mysql.agni) consists of a single read/write master database server with a tree of roughly 15 read-only slaves hanging off of it; the slaves are split into groups by purpose (some behind load balancers) and serve the bulk of the queries that make Second Life work.
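For readers who haven't worked with this kind of topology, here is a minimal sketch of how an application might split traffic across it. The host names, credentials, schema name, and the mysql-connector-python usage are my own illustrative assumptions, not our actual code.

```python
# Minimal read/write-splitting sketch. All names here are made-up assumptions.
import random
import mysql.connector

MASTER_HOST = "mysql-master.example.com"      # hypothetical read/write master
READ_POOL = [
    "mysql-read-vip1.example.com",            # hypothetical load-balancer VIPs
    "mysql-read-vip2.example.com",            # fronting groups of read-only slaves
]

def get_connection(for_write):
    """Writes go to the single master; reads go to a randomly chosen read VIP."""
    host = MASTER_HOST if for_write else random.choice(READ_POOL)
    return mysql.connector.connect(host=host, user="app", password="secret",
                                   database="secondlife")

# Example usage: a read can be served by any slave, a write must hit the master.
conn = get_connection(for_write=False)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())
conn.close()
```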
MySQL 5 Upgrade
Up until this year, our mysql.agni master was running version 4.1. Shortly after I started at Linden Lab, there was a project to upgrade the entire tree to version 5.0. Our resources were limited back then, and the operation was a total failure. It turns out that MySQL 5.0 has a much different disk i/o profile under our query rate and load type. When the master was upgraded to 5.0, it promptly collapsed under heavy disk i/o load, with unacceptable response times. Worse still, we did not have a proper version 4.1 host slaving off of the new master, so failing back to the older version meant extended downtime and some data loss while we replayed queries from the binary logs.
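For anyone who hasn't had to do this, the basic pattern for replaying from binary logs is to decode them with mysqlbinlog and pipe the resulting SQL into a mysql client. A minimal sketch of that pattern follows; the file name, host name, and datetime are made up, and the actual recovery involved far more care than this.

```python
# Minimal sketch of replaying a binary log onto another server via
# mysqlbinlog | mysql. File names, host names, and the datetime are made up.
import subprocess

BINLOG = "/var/lib/mysql/mysql-bin.000123"    # hypothetical binary log file
TARGET_HOST = "db-fallback.example.com"       # hypothetical server to catch up

# mysqlbinlog decodes the binary log into SQL; --start-datetime limits the
# replay to statements written after the point the target already has.
dump = subprocess.Popen(
    ["mysqlbinlog", "--start-datetime=2009-01-01 00:00:00", BINLOG],
    stdout=subprocess.PIPE,
)
replay = subprocess.Popen(["mysql", "-h", TARGET_HOST], stdin=dump.stdout)
dump.stdout.close()   # let mysqlbinlog get SIGPIPE if mysql exits early
replay.wait()
```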
Fast forward to the end of 2009, and we were much smarter and better equipped to upgrade. By this point, all of the slaves in the tree had been upgraded, and it was becoming difficult and confusing to maintain a mixed-version slaving tree with some hosts on version 4 and some on version 5. We were also operating on upgraded hardware with more capacity and headroom, particularly for disk i/o.
My former co-worker, Charity Majors, authored an extremely thorough public blog post about this upgrade and our planning to make it come off without a hitch. We built a robust load-testing and analysis rig and put the new version and hardware through their paces multiple times, in multiple ways, to make sure that we'd be OK this time. The major difference was our consultation with Percona and our decision to use one of their high-performance tuned builds of MySQL. In addition, we came up with an awesome plan for performing the upgrade, and a fallback mechanism in case we needed it.
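Charity's post covers the actual distributed framework; purely to illustrate the idea, here is a toy sketch that replays a file of captured read-only queries against a candidate server and reports tail latency. The host, file, schema, and credentials are made-up assumptions, and the real rig was distributed and far more thorough.

```python
# Toy load-test sketch: replay captured SELECT queries against a candidate
# server and report a latency percentile. All names are made-up assumptions.
import time
import mysql.connector

CANDIDATE_HOST = "mysql50-test.example.com"   # hypothetical 5.0 test host
QUERY_LOG = "captured_selects.sql"            # hypothetical file, one SELECT per line

conn = mysql.connector.connect(host=CANDIDATE_HOST, user="loadtest",
                               password="secret", database="secondlife")
cur = conn.cursor()

latencies = []
with open(QUERY_LOG) as f:
    for line in f:
        query = line.strip()
        if not query:
            continue
        start = time.time()
        cur.execute(query)
        cur.fetchall()                        # drain the result set
        latencies.append(time.time() - start)

conn.close()
latencies.sort()
p95 = latencies[int(len(latencies) * 0.95)]
print("queries: %d  p95 latency: %.1f ms" % (len(latencies), p95 * 1000))
```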
The trouble with fallback in this case is that it's not normally possible to hook a MySQL version 4.1 slave up to a 5.0 master. This means that once we upgraded the master and had writes and updates going to it, we'd have no version 4.1 server to fall back to without losing data. Charity was able to rig up a script that rotated the binary logs on the new 5.0 master every 5 minutes or so and sent the contents of each log over the network to a fallback host. There we would feed the binary log queries (via mysqlbinlog piped into mysql, with various filters to compensate for 4.1/5.0 incompatibilities) into the database, keeping it up to date and ready to fail over to if we had issues with version 5.0. Of course, performance wasn't our only concern; there were millions of lines of code just waiting to bite us in the butt and potentially require falling back.
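To make the shape of that mechanism concrete, here is a rough sketch: rotate the master's binary logs on a timer, ship each completed log to the fallback host, and replay it there through mysqlbinlog piped into mysql. The host names and paths are made up, credentials are omitted, and the 4.1/5.0 filtering step is waved away; this is not the actual script.

```python
# Sketch of the binlog-shipping fallback loop described above. Assumes it runs
# on the master host. Names and paths are made up; filters are omitted.
import os
import subprocess
import time

MASTER_HOST = "mysql-master.example.com"      # hypothetical 5.0 master
FALLBACK_HOST = "db-fallback.example.com"     # hypothetical 4.1 fallback host
BINLOG_DIR = "/var/lib/mysql"                 # hypothetical binlog directory

def rotate_binlog():
    """Ask the master to close its current binary log and start a new one."""
    subprocess.run(["mysql", "-h", MASTER_HOST, "-e", "FLUSH LOGS"], check=True)

def completed_binlogs():
    """Return binary log names on the master; all but the last one are closed."""
    out = subprocess.run(
        ["mysql", "-h", MASTER_HOST, "-N", "-e", "SHOW BINARY LOGS"],
        capture_output=True, text=True, check=True).stdout
    names = [line.split()[0] for line in out.splitlines() if line.strip()]
    return names[:-1]

def ship_and_replay(name):
    """Copy one completed binlog to the fallback host and apply it there."""
    subprocess.run(["scp", os.path.join(BINLOG_DIR, name),
                    FALLBACK_HOST + ":/tmp/"], check=True)
    # A real version would pipe through filters here to rewrite statements
    # that a 4.1 server cannot accept from a 5.0 master.
    subprocess.run(["ssh", FALLBACK_HOST,
                    "mysqlbinlog /tmp/%s | mysql" % name], check=True)

shipped = set()
while True:
    rotate_binlog()
    for name in completed_binlogs():
        if name not in shipped:
            ship_and_replay(name)
            shipped.add(name)
    time.sleep(300)   # roughly every five minutes, as described above
```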
On January 5th, 2010, all went as well as could possibly have been imagined. We cut over to the version 5.0 master, and there were no issues. It's so nice to finally be on a relatively recent version of the software and, more importantly, to have our master and slaves all on the same version. And so our march continues onwards towards ever-improved performance and stability!
#1 by Sascha on March 17, 2010 - 8:35 am
Hi,
I’m wondering if the “distributed load testing framework” mentioned in the blog post will really pop up as open source sometime soon?
Thanks
Sascha