Double Your Cache, Double Your Fun

Long, long ago, when I was at GitHub, the file servers were set up active-passive, with the passive replicas sitting relatively unused. In an effort to prevent them from twiddling their thumbs, they were used as the primary memcached cluster. Sharing resources like this had benefits, like lower cost and higher resource utilization, but as the application grew, isolation became more important, especially when resolving production issues.

In addition to keeping services separate, another solid reason to move memcached off the file servers was that it would allow the transition to the new git infrastructure (more active-active instead of active-passive). In the new infrastructure, replicas were rarely twiddling their thumbs (thanks to traffic being balanced across them), which meant that running memcached on them would not be a good idea.

Lastly, moving memcached off the file servers to a new home meant the move to the new infrastructure could be more aggressive, since the old Git file servers could be re-purposed for it. That made the move a win-win: better isolation and a faster transition to the new infra.

A Few Discarded Approaches

The problem with switching to a new cluster of cache servers is that the new servers are completely empty (aka cold 🥶). Simply updating our application to point at the new cluster and calling it a day would most assuredly cause increased latency 📈 and quite possibly make the application unavailable for a period of time, as all the work the cache was saving us would need to happen again, all at once. Knowing this, we spent some time coming up with a way to fill (aka warm) the new cluster prior to switching to it.

One of my favorite things about GitHub at this time was that not only the approach taken was valued, but the rejected ones as well, because software development is about tradeoffs. Rather than just tell you the route we took and leave you wondering if we thought about X, I'll go over a couple of approaches that were rejected.

Slow and Stale

A safer approach to swapping out clusters all at once is to slowly swap out old servers for new servers in your application's configuration. This is probably one of the more common approaches we have seen and even used in the past.

One problem with this approach was that it was going to take a lot of time and handholding. We estimated that we would need to deploy the application 20-30 times to completely rotate the old servers out and the new ones in safely. We were also not sure how long we would need to wait between deploys for each new server's cache to warm (minutes? hours?) enough that we could move on to swapping out the next. Certainly other developers could deploy while we were rolling these changes out, but it was going to take a while.

Another concern raised with this approach was the possibility of stale keys. At the time, the ketama hashing algorithm was used to consistently distribute the keys, but with each configuration change, a subset of keys would move.

💡
Keys moving is natural and not a concern when adding or removing a few servers, but the change in cluster size here was relatively dramatic (the new cluster had ~3-4x fewer servers than the old one). The concern was that a key could move to a different server, get updated there, and then, several configuration changes later, map back to its original server, where an old, now stale value was still sitting.

Even with adjusting the weight of the new servers, we did not feel comfortable with this approach without doing a fair amount of work simulating the key movement each configuration change would cause, to verify we would not end up with stale keys.
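
To make that concern concrete, here's a toy hash ring (not the real ketama implementation; the server names, point counts, and key counts are all made up for illustration) that shows how even a single-server swap moves a meaningful chunk of keys:

require "digest"

# Toy consistent hash ring, for illustration only. Each server claims many
# points on a ring; a key maps to the first server point at or above the
# key's hash.
class ToyRing
  POINTS_PER_SERVER = 160

  def initialize(servers)
    @ring = servers.flat_map { |server|
      (0...POINTS_PER_SERVER).map { |i| [hash_of("#{server}-#{i}"), server] }
    }.sort_by(&:first)
  end

  def server_for(key)
    point = hash_of(key)
    pair = @ring.find { |ring_point, _server| ring_point >= point } || @ring.first
    pair.last
  end

  private

  def hash_of(value)
    Digest::MD5.hexdigest(value)[0, 8].to_i(16)
  end
end

old_ring = ToyRing.new(%w[fs1 fs2 fs3 fs4 fs5 fs6 fs7 fs8])
new_ring = ToyRing.new(%w[fs1 fs2 fs3 fs4 fs5 fs6 fs7 cache1]) # one server swapped

keys  = (1..10_000).map { |i| "key:#{i}" }
moved = keys.count { |key| old_ring.server_for(key) != new_ring.server_for(key) }
puts "#{moved} of #{keys.size} keys changed servers"
# roughly a quarter of the keys: fs8's old share plus the share cache1 claims

Repeat that 20-30 times, with weights changing along the way, and it gets hard to reason about which keys might land back on a server that is still holding an old value.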

New and Unknown

Another common solution to this problem (again, this was 7+ years ago) was a proxy along the lines of mcrouter, twemproxy, or dynomite. These projects definitely looked interesting, but we had no production experience with them. As we were already introducing a fair amount of change, we were not comfortable with also throwing a new piece of software into the mix.

Double Your Fun

Rather than switching to a cold cache, intermingling the old and new clusters, or adding a new piece of software to the mix, I suggested we ship all cache operations to both clusters from the application, until the hit rate was high enough on the new cluster that we felt comfortable making the switch to it.

Conveniently, I had just wrapped up an audit of all our memcached usage in the application. This meant I was really familiar with what we were caching and how long we were caching it. I reached into my toolbox, pulled out the composite pattern and used it to make a cache client for the application that could talk to both clusters and easily be discarded once it had served its purpose.

💡
Temporary code that is thrown away after use isn't an anti-pattern. It's entirely ok to write some code that will help with a problem and then 🔥 it after it has served its purpose. Don't be afraid to ☠️ code.
class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)
    # Read from the secondary too, but always return the primary's result.
    @secondary.get(key, raw)
    result
  end

  # ... repeat for get_multi, incr, delete, add, set, etc...
end

The primary cache instance was pointed at the passive file server replicas and the secondary at the new, isolated cluster. For each method that was used in the application, I filled in the blanks similarly to the get method above.
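
For writes, the fill-in would have looked something along these lines (this is a reconstruction, not the original code; the ttl argument and exact signature are assumptions). The writes were what actually warmed the new cluster, since every set landed on both:

def set(key, value, ttl = 0, raw = false)
  # Write to both clusters; the secondary is the one being warmed.
  result = @primary.set(key, value, ttl, raw)
  @secondary.set(key, value, ttl, raw)
  result
end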

GitHub also used the fetch pattern (get => compute on miss => set) a fair amount, so I added a trick to the implementation of fetch to help prime the secondary faster on a primary hit and a secondary miss.
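
In sketch form, that trick could look like this (again, the signature and ttl handling are simplified assumptions, not the exact original code):

def fetch(key, ttl = 0, raw = false)
  result = @primary.get(key, raw)

  if result.nil?
    # Primary miss: compute the value and write it to both clusters via
    # the dual set shown above.
    result = yield
    set(key, result, ttl, raw)
  elsif @secondary.get(key, raw).nil?
    # Primary hit, secondary miss: copy the primary's value across so the
    # new cluster warms without waiting for the next write.
    @secondary.set(key, result, ttl, raw)
  end

  result
end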

Once I had DualMemcached working and well tested, I swapped it in as the main GitHub.cache instance used to talk to memcached.

passive_replicas_cache_client = GitHub::Cache::Client.new(...)
isolated_cache_client = GitHub::Cache::Client.new(...)
GitHub.cache = GitHub::Cache::DualMemcached.new(
  primary: passive_replicas_cache_client,
  secondary: isolated_cache_client
)

Safety First

At this point, I could have deployed my pull request and watched the secondary cluster start to warm, but I was far too cautious for that. I was relatively sure that the application could handle the additional latency of doubling all memcached operations, but I wanted to be able to test my theory in production safely, rather than just assume.

All I needed was a way to easily enable/disable a feature (secondary memcached operations) at runtime. Back then, we used flipper for conditional rollouts of new code/features. Adding flipper to the mix left the code looking a little more like this:

class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)
    if secondary_operations_enabled?
      @secondary.get(key, raw)
    end
    result
  end

  # get_multi, incr, delete, add, set, etc. ...

  private

  def secondary_operations_enabled?
    Flipper.enabled?(:memcached_secondary_cluster_operations)
  end
end

By default, all flipper features are disabled. This meant I could safely deploy this code and test the new DualMemcached code alone, without any secondary operations. Then, when I felt ready, I could slowly enable secondary operations (% of time rollout in flipper terms) and observe the effect on the application as a whole. If doubling the operations caused too much latency for the application or some other unforeseen issue popped up, I or anyone on call could easily disable the flipper feature to get things back to normal.
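
From a console, that dial looked roughly like this (the percentages are illustrative; the calls are Flipper's percentage-of-time gate and plain enable/disable):

# Start small and watch the latency/error dashboards...
Flipper.enable_percentage_of_time(:memcached_secondary_cluster_operations, 1)

# ...then keep turning the dial up...
Flipper.enable_percentage_of_time(:memcached_secondary_cluster_operations, 25)

# ...until secondary operations happen 100% of the time.
Flipper.enable(:memcached_secondary_cluster_operations)

# And the escape hatch if anything went sideways:
Flipper.disable(:memcached_secondary_cluster_operations)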

With flipper controlling access to the secondary cluster operations, I finally felt safe deploying and merging my pull request. Over the course of a few hours, I slowly cranked up the memcached_secondary_cluster_operations feature to 100%. My assumptions were correct and the application tolerated the additional operations without issue.

Testing Reads

After a few days of warming (sending all operations to both clusters), the hit rate on the new memcached cluster was close enough to the old cluster that I was ready to test reads.

Once again, I used flipper to safely test reads against the new cluster in production:

class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)

    if secondary_operations_enabled?
      secondary_result = @secondary.get(key, raw)
    end

    # Only serve the secondary's result when both features are enabled;
    # otherwise fall back to the primary's result as before.
    if secondary_operations_enabled? && use_secondary_reads_enabled?
      secondary_result
    else
      result
    end
  end

  # get_multi, incr, delete, add, set, etc. ...

  private

  def secondary_operations_enabled?
    Flipper.enabled?(:memcached_secondary_cluster_operations)
  end

  def use_secondary_reads_enabled?
    Flipper.enabled?(:memcached_use_secondary_reads)
  end
end

Over the course of an hour, I cranked the memcached_use_secondary_reads feature up to 100%. At this point, all operations were still going to both clusters, but we were returning results from the secondary cluster instead of the primary. I let this change simmer over the weekend and, early the next week, cut one final pull request that reverted us to the original code (without DualMemcached), but with the configuration pointed at the new, isolated cluster.
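
The end state, wiring-wise, was back to a single client, now pointed at the new cluster (constructor arguments elided, as before):

GitHub.cache = GitHub::Cache::Client.new(...) # isolated cluster config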

Conclusion

All in all, this approach worked great. The total timeframe was a few weeks, mostly due to being overly cautious, but the amount of work involved was less than we estimated for any other approach and, more importantly, it was the safest thanks to flipper.

I know this was from many years ago, but my hope is that by writing up previous ways I've used flags, it'll help spark ideas for you going forward, so you can make big changes safely too.