Upgrading GitHub to Rails 3 with Zero Downtime

September 15, 2014

GitHub is a fairly large production Ruby on Rails application. From a scale perspective, it serves hundreds of millions of requests per day.

Until now, we’ve been running an outdated, heavily modified, unsupported fork of Rails, which we called 2.3.github. This choice has bitten us in several ways: gem incompatibilities, security patches that had to be backported by hand, missed performance and feature improvements in the core framework, and no easy way to contribute back to the open source Rails project.

For those of you keeping score:

We had work to do in order to live in the modern world again.

Over the last six months, we’ve had a team of four engineers working full time on upgrading to Rails 3.

Here are a few lessons we learned:

Instrumentation empowers change

One of our biggest concerns with this upgrade was performance.

How would this version of Rails perform against its predecessor? Would shipping it cause additional load for other services? Are the appropriate metrics being recorded so we can measure performance improvements/regressions?

To get an initial idea of how the app might perform on Rails 3, we did some local benchmarking against a random pull request:

$ script/benchmark -n 50 --url http://github.dev/github/linguist/pull/748
GitHub - Rails 3.0.x (development)

50 requests to http://github.dev/github/linguist/pull/748
    peak memory:  566 MB RSS

  response size:  9.37 MB total                  (192 KB/req)
  response time:  43,124ms total                 (862ms avg/req,  762ms - 1,143ms)
    render time:  39,426ms total                 (788ms avg/req,  696ms - 1,071ms)

       cpu time:  41,832ms total                 (836ms avg/req,  739ms - 1,117ms)
      idle time:  1,293ms total                  (25ms avg/req,  18ms - 45ms)

         oob gc:  50 / 2,960ms
  in-request gc:  32 / 2,390ms total             (79ms avg/req,  27ms - 301ms)
    allocations:  46,334,670 objs total          (926,693 objs avg/req,  926,307 objs - 927,696 objs)
        ar objs:  89,250 objs total              (1,785 objs/req)

          smoke:  0 / 0ms total
          mysql:  4,150 / 1,429ms total          (28ms avg/req,  20ms - 49ms)
          redis:  958 / 119ms total              (2.39ms avg/req,  1.71ms - 3.26ms)
          cache:  15,327 / 324ms total           (6.48ms avg/req,  5.45ms - 9.73ms)

        marshal:  3,450 / 25ms total             (0.5ms avg/req,  0.42ms - 0.72ms)
           zlib:  0 / 0ms total

  879410 live, 1185510 free slots
  562 MB RSS

We then re-ran the benchmarking tests under Rails 2.3.github, compared the results, and dug in deeper to figure out what changed.
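
script/benchmark is an internal tool, so its implementation is not shown here. As a rough stand-in, the response-time and allocation lines of a report like the one above could be gathered with a helper along these lines. This is a minimal sketch: the name `benchmark_requests` and the shape of the returned hash are assumptions, and the block passed in stands for "perform one request".

```ruby
# Hypothetical helper, not the real script/benchmark.
# Runs the given block `count` times and summarizes wall-clock time
# and object allocations per run.
def benchmark_requests(count)
  raise ArgumentError, "count must be positive" unless count.positive?

  timings_ms = []
  allocated_before = GC.stat(:total_allocated_objects)

  count.times do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    finished = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    timings_ms << (finished - started) * 1000.0
  end

  allocated = GC.stat(:total_allocated_objects) - allocated_before

  {
    total_ms: timings_ms.sum,
    avg_ms: timings_ms.sum / count,
    min_ms: timings_ms.min,
    max_ms: timings_ms.max,
    allocations_per_request: allocated / count
  }
end
```

Running the same block under each Rails version and comparing the two hashes gives a first, coarse view of where the versions differ.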

As we figured out which areas of the application could potentially suffer performance regressions, we started tracking metrics around those areas.

For example, we found it important to monitor the amount of time spent per request in object garbage collection:


Knowing immediately if a change you’ve deployed is affecting your users' experience is a good thing.
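
One way to record per-request garbage collection time is a small Rack-style middleware built on Ruby's standard `GC::Profiler`. The sketch below is an illustration, not the instrumentation described in this post: the class name `GCTimingMiddleware` and the `reporter` callable (which would forward the numbers to a metrics system) are assumptions.

```ruby
# Hypothetical Rack-style middleware: measures GC runs and GC time per request.
class GCTimingMiddleware
  def initialize(app, reporter)
    @app = app
    @reporter = reporter # any object responding to #call(hash)
    GC::Profiler.enable
  end

  def call(env)
    GC::Profiler.clear          # drop profile data from earlier requests
    runs_before = GC.count

    response = @app.call(env)

    @reporter.call({
      gc_runs: GC.count - runs_before,
      gc_time_ms: GC::Profiler.total_time * 1000.0 # total_time is in seconds
    })
    response
  end
end
```

Note that `GC::Profiler` is process-wide, so these numbers are only meaningful per request when each process handles one request at a time.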

Long-running branches lead to a life of merge conflicts

For the first several months of the project, we worked on a long-running rails3 git branch. However, given that many changes are made to the GitHub codebase each day, we were constantly running into merge conflicts:

$ git checkout rails3
# ... (make some changes)
$ git commit -am "Disabled loading plugins"
$ git merge master


In the beginning, we spent lots of time resolving these merge conflicts and keeping our long-running rails3 branch stable. It was a headache, to say the least.

Faced with this ongoing pain, the team discussed how we could get all of the changes from the rails3 branch into master in order to escape this life of merge conflict resolution.

Dual-boot application

Our approach involved enabling a dual-boot of the application under Rails 2 or Rails 3 by switching on an environment variable.

Boot the app on Rails 3:

$ RAILS3=true ./script/server
Bundler v/1.6.3 bootstrapping the Rails 3 gem environment...

Boot the app on Rails 2:

$ ./script/server
Bundler v/1.6.3 bootstrapping the Rails 2 gem environment...

Think of it like swapping out the V6 engine in your car with a V8. The car should still start, no matter what type of engine is under the hood.

With this environment variable in place, our Gemfile looked something like this:

source "https://rubygems.org"

# True when booting on Rails 3. The check against RAILS3=true is assumed
# from the boot commands above.
def rails3?
  ENV["RAILS3"] == "true"
end

if rails3?
  gem "rails",               "3.0.20.github11"
else
  gem "rails",               "2.3.14.github50"
  gem "actionmailer",        "2.3.14.github50"
  gem "actionpack",          "2.3.14.github50"
  gem "activerecord",        "2.3.14.github50"
  gem "activesupport",       "2.3.14.github50"
end

gem "will_paginate", rails3? ? "3.0.3" : "2.3.9.github"

In the GitHub application itself, we partitioned off version-specific features:

# Disable plugins on Rails 3
GitHub.only_on_rails_3 do
  config.plugins = []
end

# Rails 2 needs the session to be configured before initializing
GitHub.only_on_rails_2 do
  require_relative "initializers/session_store"
end
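
The implementation of `GitHub.only_on_rails_3` and `GitHub.only_on_rails_2` is not shown in this post. A minimal sketch of how such helpers could work, assuming they key off the same `RAILS3` environment variable used to boot the app (the `rails_3?` predicate is a hypothetical name):

```ruby
# Hypothetical sketch of version-gating helpers; the real ones may differ.
module GitHub
  # Assumption: the running version is decided by RAILS3=true.
  def self.rails_3?
    ENV["RAILS3"] == "true"
  end

  # Runs the block only when booted on Rails 3.
  def self.only_on_rails_3
    yield if rails_3?
  end

  # Runs the block only when booted on Rails 2.
  def self.only_on_rails_2
    yield unless rails_3?
  end
end
```

Because the helpers simply `yield`, the block runs in the caller's scope, which is why a local such as `config` stays reachable inside it.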

Given that the application should behave exactly the same under either version of Rails, serving an equivalent feature set, a dual-boot environment also meant that we could run Rails 2 and Rails 3 side by side in production.

This was important because it allowed us to slowly roll out, monitor, and compare performance of the site under each version:


This approach combined with instrumentation made it easy to see, for example, what percentage of our background workers were running on Rails 3 at a point in time:


If we saw performance or behavior regressions in certain areas of the site, we scaled back. If we saw no change, and no exceptional behavior, we scaled up to more servers running Rails 3.
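
This post does not say how servers were picked for each stage of the rollout. One common way to scale a rollout up and down is to hash each hostname into a stable bucket and compare it against a rollout percentage. The sketch below assumes that approach; `boot_rails3?` and `boot_environment` are hypothetical names.

```ruby
require "zlib"

# Hypothetical helper: decides whether a host boots Rails 3 at a given
# rollout level. The hostname is hashed into a stable bucket from 0 to 99.
def boot_rails3?(hostname, rollout_percent)
  bucket = Zlib.crc32(hostname) % 100
  bucket < rollout_percent
end

# Hypothetical wrapper: extra environment variables for the host's server
# process, reusing the RAILS3 switch from the dual-boot setup.
def boot_environment(hostname, rollout_percent)
  boot_rails3?(hostname, rollout_percent) ? { "RAILS3" => "true" } : {}
end
```

Because the bucket depends only on the hostname, raising the percentage adds hosts to the Rails 3 pool without moving existing ones, and lowering it scales back the most recently added hosts first.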

Get changes into master earlier

Getting the application into a state where it could boot under Rails 2 or Rails 3 required backporting all of the existing changes from our rails3 branch into master.

Once we got past that, it was magical.

Our diffs were smaller, and merge conflicts were rare.

We made sure to get any new upgrade-related changes into master (and out on production) in a timely manner.

This kept our diffs smaller, more reviewable, and more shippable. And we soothed our merge conflict headache.

Zero downtime

Once instrumentation was in place, frontend servers could run Rails 2 or Rails 3, and diffs were easier to review, we had a lot more time to make progress on actually fixing the bugs and blocking performance changes that we discovered during the progressive rollout.

As a result, we were able to pull off a zero-downtime rollout over the past month:


We’re already discussing how to apply these lessons to our next upgrade, and I hope you will too!

If you enjoyed reading this, you should follow me on Twitter.