Upgrading GitHub to Rails 3 with Zero Downtime
GitHub is a fairly large production Ruby on Rails application. From a scale perspective, it serves hundreds of millions of requests per day.
Until now, we’ve been running an outdated, heavily-modified, unsupported fork of Rails, which we called 2.3.github. This choice has bitten us in the form of gem incompatibilities, manually backported security patches, missed core framework performance and feature improvements, and an inability to easily contribute back to the open source Rails project.
For those of you keeping score:
- Yes, Rails 3 was released four years ago
- Yes, the current stable version is Rails 4.1, which left us two major versions behind
We had work to do in order to live in the modern world again.
Over the last six months, we’ve had a team of 4 engineers working full time on upgrading to Rails 3.
Here are a few lessons we learned:
Instrumentation empowers change
One of our biggest concerns with this upgrade was performance.
How would this version of Rails perform against its predecessor? Would shipping it cause additional load for other services? Are the appropriate metrics being recorded so we can measure performance improvements/regressions?
To get an initial idea of how the app might perform on Rails 3, we did some local benchmarking against a random pull request:
```
$ script/benchmark -n 50 --url http://github.dev/github/linguist/pull/748
GitHub - Rails 3.0.x (development)
50 requests to http://github.dev/github/linguist/pull/748
    peak memory: 566 MB RSS
  response size: 9.37 MB total (192 KB/req)
  response time: 43,124ms total (862ms avg/req, 762ms - 1,143ms)
    render time: 39,426ms total (788ms avg/req, 696ms - 1,071ms)
       cpu time: 41,832ms total (836ms avg/req, 739ms - 1,117ms)
      idle time: 1,293ms total (25ms avg/req, 18ms - 45ms)
         oob gc: 50 / 2,960ms
  in-request gc: 32 / 2,390ms total (79ms avg/req, 27ms - 301ms)
    allocations: 46,334,670 objs total (926,693 objs avg/req, 926,307 objs - 927,696 objs)
        ar objs: 89,250 objs total (1,785 objs/req)
          smoke: 0 / 0ms total
          mysql: 4,150 / 1,429ms total (28ms avg/req, 20ms - 49ms)
          redis: 958 / 119ms total (2.39ms avg/req, 1.71ms - 3.26ms)
          cache: 15,327 / 324ms total (6.48ms avg/req, 5.45ms - 9.73ms)
        marshal: 3,450 / 25ms total (0.5ms avg/req, 0.42ms - 0.72ms)
           zlib: 0 / 0ms total
879410 live, 1185510 free slots
562 MB RSS
```
We then re-ran the benchmarking tests under Rails 2.3.github, compared the results, and dug in deeper to figure out what changed.
As we figured out which areas of the application could potentially suffer performance regressions, we started tracking metrics around those areas.
For example, we found it important to monitor the amount of time spent per request in garbage collection.
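One simple way to capture a metric like this is a Rack middleware that samples Ruby's built-in GC profiler around each request. This is a hedged sketch, not GitHub's actual instrumentation; the `GCTimeReporter` class and its `record_gc_time` method are hypothetical stand-ins for whatever stats backend (statsd, Graphite, etc.) you report to:

```ruby
# Hypothetical in-memory stats reporter; a real one would ship the
# value off to a metrics service instead of keeping it in an array.
class GCTimeReporter
  attr_reader :samples

  def initialize
    @samples = []
  end

  def record_gc_time(ms)
    @samples << ms
  end
end

# Rack middleware that reports GC time spent during each request,
# using Ruby's built-in GC::Profiler.
class GCInstrumentation
  def initialize(app, reporter)
    @app = app
    @reporter = reporter
  end

  def call(env)
    GC::Profiler.enable
    response = @app.call(env)
    # GC::Profiler.total_time is in seconds; report milliseconds.
    @reporter.record_gc_time(GC::Profiler.total_time * 1000.0)
    GC::Profiler.clear
    GC::Profiler.disable
    response
  end
end
```

Graphing this per-request number side by side for the old and new Rails versions makes a regression visible within minutes of a deploy.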
Knowing immediately if a change you’ve deployed is affecting your users' experience is a good thing.
Long-running branches lead to a life of merge conflicts
For the first several months of the project, we worked on a long-running rails3 git branch. However, given that many changes are made to the GitHub codebase each day, we were constantly running into merge conflicts:
```
$ git checkout rails3
... (make some changes)
$ git commit -am "Disabled loading plugins"
$ git merge master
...BOOM - MERGE CONFLICT!
```
In the beginning, we spent lots of time resolving these merge conflicts and keeping our long-running rails3 branch stable. It was a headache, to say the least.
Faced with this ongoing pain, the team discussed how we could get all of the changes from the rails3 branch into master in order to escape this life of merge conflict resolution.
Our approach involved enabling a dual-boot of the application under Rails 2 or Rails 3 by switching on an environment variable.
Boot the app on Rails 3:
```
$ RAILS3=true ./script/server
Bundler v/1.6.3
bootstrapping the Rails 3 gem environment...
```
Boot the app on Rails 2:
```
$ ./script/server
Bundler v/1.6.3
bootstrapping the Rails 2 gem environment...
```
Think of it like swapping out the V6 engine in your car with a V8. The car should still start, no matter what type of engine is under the hood.
With this environment variable in place, our Gemfile looked something like this:
```ruby
source "https://rubygems.org"

def rails3?
  ENV["RAILS3"]
end

if rails3?
  gem "rails", "3.0.20.github11"
else
  gem "rails",         "2.3.14.github50"
  gem "actionmailer",  "2.3.14.github50"
  gem "actionpack",    "2.3.14.github50"
  gem "activerecord",  "2.3.14.github50"
  gem "activesupport", "2.3.14.github50"
end

gem "will_paginate", rails3? ? "3.0.3" : "2.3.9.github"
```
In the GitHub application itself, we partitioned off version-specific features:
```ruby
# Disable plugins on Rails 3
GitHub.only_on_rails_3 do
  config.plugins = []
end

# Rails 2 needs the session to be configured before initializing
GitHub.only_on_rails_2 do
  require_relative "initializers/session_store"
end
```
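The post doesn't show how `only_on_rails_3` and `only_on_rails_2` are implemented, but a plausible version is just a pair of guards keyed off the same `RAILS3` environment variable the Gemfile uses. This is a sketch under that assumption, not GitHub's actual code:

```ruby
# Hypothetical implementation of the version guards used above,
# keyed off the same RAILS3 environment variable as the Gemfile.
module GitHub
  def self.rails3?
    !!ENV["RAILS3"]
  end

  # Run the block only when booted under Rails 3.
  def self.only_on_rails_3
    yield if rails3?
  end

  # Run the block only when booted under Rails 2.
  def self.only_on_rails_2
    yield unless rails3?
  end
end
```

Keeping the check behind a named helper, rather than scattering `ENV["RAILS3"]` conditionals around the codebase, also makes it trivial to find and delete every version-specific branch once the upgrade is done.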
Given that the application should behave exactly the same under either version of Rails, serving an equivalent feature set, a dual-boot environment also meant that we could run Rails 2 and Rails 3 side-by-side in production.
This was important because it allowed us to slowly roll out, monitor, and compare performance of the site under each version. Combined with instrumentation, this approach made it easy to see, for example, what percentage of our background workers were running on Rails 3 at a given point in time.
If we saw performance or behavior regressions in certain areas of the site, we scaled back. If we saw no change, and no exceptional behavior, we scaled up to more servers running Rails 3.
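A dial like this can be as simple as a percentage knob that each server consults at boot. The sketch below is hypothetical (the `Rollout` class and its storage are ours, not GitHub's); it hashes the hostname so a given server keeps its assignment across restarts while the percentage is unchanged:

```ruby
# Hypothetical rollout knob: each frontend or worker host decides at
# boot whether to run Rails 3, based on a percentage an operator can
# raise or lower (in a real system this would live in a shared store
# like Redis rather than a constructor argument).
class Rollout
  def initialize(percentage)
    @percentage = percentage # 0..100
  end

  # Deterministic per-host decision: hash the hostname into 0..99 and
  # compare against the current percentage. The same host always gets
  # the same answer until the percentage changes.
  def rails3?(hostname)
    (hostname.sum % 100) < @percentage
  end
end
```

Scaling up is then a one-line config change: bump the percentage, restart a batch of servers, and watch the graphs before going further.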
Get changes into master earlier
Getting the application into a state where it could boot under Rails 2 or Rails 3 required backporting all of the existing changes from our rails3 branch into master.
Once we got past that, it was magical.
Our diffs were smaller, and merge conflicts were rare.
We made sure to get any new upgrade-related changes into master (and out on production) in a timely manner.
This kept our diffs smaller, more reviewable, and more shippable. And we soothed our merge conflict headache.
With instrumentation in place, frontend servers able to run either Rails 2 or Rails 3, and diffs easier to review, we had far more time to actually fix the bugs and blocking performance issues we discovered during the progressive rollout.
As a result, we were able to pull off a zero-downtime rollout over the past month.
We’re already discussing how to apply these lessons to our next upgrade, and I hope you’ll be able to apply them to yours, too!
If you enjoyed reading this, you should follow me on Twitter.