Long, long ago, when I was at GitHub, the file servers were set up active-passive, with the passive replicas sitting relatively unused. In an effort to prevent them from twiddling their thumbs, they were used as the primary memcached cluster. Sharing resources like this had benefits, like lower cost and higher resource utilization, but as the application grew, isolation became more important, especially when resolving production issues.
In addition to keeping services separate, another solid reason to move memcached off the file servers was that it would allow the transition to the new git infrastructure (active-active instead of active-passive). In the new infrastructure, replicas were rarely twiddling their thumbs (thanks to balancing traffic across them), which meant that running memcached on them would not be a good idea.
Lastly, if we moved memcached off the file servers to a new home, the migration could be more aggressive, re-purposing the old Git file servers for the new infrastructure, which would be a win-win (for isolation and for the migration).
A Few Discarded Approaches
The problem with switching to a new cluster of cache servers is that the new servers are completely empty (aka cold 🥶). Simply updating our application to point at the new cluster and calling it a day would most assuredly cause increased latency 📈 and quite possibly make the application unavailable for a period of time, as all the work the cache had been saving us from would need to happen again, all at once. Knowing this, we spent some time coming up with a way to fill (aka warm) the new cluster prior to switching to it.
One of my favorite things about GitHub at this time was that not just the approach taken was valued, but the rejected ones as well, because software development is about tradeoffs. Rather than just tell you the route we took and leave you wondering if we thought about X, I'll go over a couple of approaches that were rejected.
Slow and Stale
A safer approach to swapping out clusters all at once is to slowly swap out old servers for new servers in your application's configuration. This is probably one of the more common approaches we have seen and even used in the past.
One problem with this approach was that it was going to take a lot of time and handholding. We estimated that we would need to deploy the application 20-30 times to completely rotate the old servers out and the new ones in safely. We were also not sure how much time we would need to wait in between deploys for the new server's cache to warm (minutes? hours?) enough that we could move on to swapping out the next one. Certainly other developers could deploy while we were rolling these changes out, but it was going to take a while.
Another concern raised with this approach was the possibility of stale keys. At the time, the ketama hashing algorithm was used to consistently distribute the keys, but with each configuration change, a subset of keys would move.
Even with adjusting the weights of the new servers, we did not feel comfortable with this approach without first doing a fair amount of work simulating the key movement each configuration change would cause, to verify we would not end up with stale keys.
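To make the key movement concrete, here is the kind of simulation I mean. This is a simplified consistent-hash ring, not the actual ketama implementation the client used, but it shows how swapping even one server remaps a chunk of the key space:

require "digest/md5"

# A simplified consistent-hash ring (ketama-ish, not the real client's exact
# algorithm) to illustrate how swapping a server remaps part of the key space.
class Ring
  POINTS_PER_SERVER = 160

  def initialize(servers)
    @ring = servers.flat_map { |server|
      (0...POINTS_PER_SERVER).map { |i|
        [Digest::MD5.hexdigest("#{server}-#{i}")[0, 8].to_i(16), server]
      }
    }.sort_by(&:first)
  end

  def server_for(key)
    point = Digest::MD5.hexdigest(key)[0, 8].to_i(16)
    hash_and_server = @ring.find { |hash, _| hash >= point } || @ring.first
    hash_and_server.last
  end
end

old_ring = Ring.new(%w[fs1 fs2 fs3 fs4])
new_ring = Ring.new(%w[fs1 fs2 fs3 cache1]) # swap one file server for a new cache host

keys  = (1..10_000).map { |i| "repo:#{i}:stats" }
moved = keys.count { |key| old_ring.server_for(key) != new_ring.server_for(key) }
puts "#{moved} of #{keys.size} keys now map to a different server"

The stale key risk comes from keys that move away from a server and later move back to it after the underlying data has changed, because the old server can still be holding the outdated value.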
New and Unknown
Another common solution to this problem (again this was 7+ years ago) was a proxy along the lines of mcrouter, twemproxy, or dynomite. These projects definitely looked interesting, but we had no production experience with them. As we were already introducing a fair amount of change, we were not comfortable with also throwing a new piece of software in the mix.
Double Your Fun
Rather than switching to a cold cache, intermingling the old and new clusters, or adding a new piece of software to the mix, I suggested we ship all cache operations to both clusters from the application, until the hit rate was high enough on the new cluster that we felt comfortable making the switch to it.
Conveniently, I had just wrapped up an audit of all our memcached usage in the application. This meant I was really familiar with what we were caching and how long we were caching it. I reached into my toolbox, pulled out the composite pattern and used it to make a cache client for the application that could talk to both clusters and easily be discarded once it had served its purpose.
class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)
    @secondary.get(key, raw)
    result
  end

  # ... repeat for get_multi, incr, delete, add, set, etc...
end
The primary cache instance was pointed at the passive file server replicas and the secondary at the new, isolated cluster. For each method that was used in the application, I filled in the blanks similar to the get method above.
GitHub also used the fetch pattern (get => compute on miss => set) a fair amount, so I added a trick to the implementation of fetch to help prime the secondary faster on a primary hit and a secondary miss.
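Sketched out, the idea was along these lines (the signature and body here are illustrative rather than the production code):

# Inside DualMemcached; the exact fetch signature is illustrative.
def fetch(key, ttl = 0, raw = false, &block)
  # Primary behaves as usual: get, compute on miss, set.
  result = @primary.fetch(key, ttl, raw, &block)

  # Primary hit (or freshly computed value) plus a secondary miss: copy the
  # value over so the new cluster warms faster than natural traffic alone.
  if @secondary.get(key, raw).nil?
    @secondary.set(key, result, ttl, raw)
  end

  result
end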
Once I had DualMemcached working and well tested, I swapped it in as the main GitHub.cache instance that was used to talk to memcached.
passive_replicas_cache_client = GitHub::Cache::Client.new(...)
isolated_cache_client = GitHub::Cache::Client.new(...)

GitHub.cache = GitHub::Cache::DualMemcached.new(
  primary: passive_replicas_cache_client,
  secondary: isolated_cache_client
)
Safety First
At this point, I could have deployed my pull request and watched the secondary cluster start to warm, but I was far too cautious for that. I was relatively sure that the application could handle the additional latency of doubling all memcached operations, but I wanted to be able to test my theory in production safely, rather than just assume.
All I needed was a way to easily enable/disable a feature (secondary memcached operations) at runtime. Back then, we used flipper for conditional rollouts of new code/features. Adding flipper to the mix left the code looking a little more like this:
class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)

    if secondary_operations_enabled?
      @secondary.get(key, raw)
    end

    result
  end

  # get_multi, incr, delete, add, set, etc. ...

  private

  def secondary_operations_enabled?
    Flipper.enabled?(:memcached_secondary_cluster_operations)
  end
end
By default, all flipper features are disabled. This meant I could safely deploy this code and test the new DualMemcached code alone, without any secondary operations. Then, when I felt ready, I could slowly enable secondary operations (% of time rollout in flipper terms) and observe the effect on the application as a whole. If doubling the operations caused too much latency for the application or some other unforeseen issue popped up, I or anyone on call could easily disable the flipper feature to get things back to normal.
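A percentage-of-time rollout in flipper boils down to a handful of calls like these (the steps and percentages here are illustrative, not the exact ones used):

# Start small and watch latency and error graphs before each bump.
Flipper.enable_percentage_of_time(:memcached_secondary_cluster_operations, 1)
Flipper.enable_percentage_of_time(:memcached_secondary_cluster_operations, 25)
Flipper.enable_percentage_of_time(:memcached_secondary_cluster_operations, 100)

# The escape hatch if doubling the operations caused trouble.
Flipper.disable(:memcached_secondary_cluster_operations)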
With flipper controlling access to the secondary cluster operations, I finally felt safe deploying and merging my pull request. Over the course of a few hours, I slowly cranked up the memcached_secondary_cluster_operations feature to 100%. My assumptions were correct and the application tolerated the additional operations without issue.
Testing Reads
After a few days of warming (sending all operations to both clusters), the hit rate on the new memcached cluster was close enough to the old cluster that I was ready to test reads.
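If you ever want to do the same comparison, memcached's own counters are enough to eyeball a cluster's hit rate. Here is a rough sketch using Dalli (an assumption for illustration, not necessarily the client we used), with placeholder server lists:

require "dalli"

# Rough hit rate from memcached's own counters, summed across a cluster.
# OLD_SERVERS / NEW_SERVERS are placeholders for the two clusters' addresses.
def hit_rate(servers)
  stats = Dalli::Client.new(servers).stats # one hash of counters per server

  hits   = stats.values.sum { |s| s["get_hits"].to_i }
  misses = stats.values.sum { |s| s["get_misses"].to_i }

  hits.to_f / (hits + misses)
end

puts "old cluster hit rate: #{hit_rate(OLD_SERVERS).round(3)}"
puts "new cluster hit rate: #{hit_rate(NEW_SERVERS).round(3)}"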
Once again, I used flipper to safely test reads against the new cluster in production:
class DualMemcached
  def initialize(primary:, secondary:)
    @primary = primary
    @secondary = secondary
  end

  def get(key, raw = false)
    result = @primary.get(key, raw)

    if secondary_operations_enabled?
      secondary_result = @secondary.get(key, raw)
    end

    if secondary_operations_enabled? && use_secondary_reads_enabled?
      secondary_result
    else
      result
    end
  end

  # get_multi, incr, delete, add, set, etc. ...

  private

  def secondary_operations_enabled?
    Flipper.enabled?(:memcached_secondary_cluster_operations)
  end

  def use_secondary_reads_enabled?
    Flipper.enabled?(:memcached_use_secondary_reads)
  end
end
Over the course of an hour I enabled the memcached_use_secondary_reads feature to 100%. At this point, all operations were still going to both clusters, but we were returning results from the secondary cluster instead of the primary. I let this change simmer over the weekend and early the next week cut one final pull request that reverted us to the original code (without DualMemcached), but with the configuration pointed at the new, isolated cluster.
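The end state was simple, roughly:

# After the final pull request: no more DualMemcached, just one client
# pointed at the new, isolated cluster (constructor arguments elided).
GitHub.cache = GitHub::Cache::Client.new(...)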
Conclusion
All in all, this approach worked great. The total timeframe was a few weeks, mostly due to being overly cautious, but the amount of work involved was less than we estimated for any other approach and, more importantly, it was the safest thanks to flipper.
I know this was from many years ago, but my hope is that by writing up previous ways I've used flags, it'll help spark ideas for you going forward, so you can make big changes safely too.