The first hesitation most people have with a cloud-based feature flag service is assuming that their feature flag checks will now require a network connection to said cloud. Many do, but ours doesn't.
When I considered starting Flipper Cloud, my overriding concern was ensuring that we never take someone else's app down. Today I'm going to dive into why and how, but first some relevant background on my past experience with both feature flags and scaling.
I worked at GitHub for ~7 years (2011 - 2018). But I wasn't building shiny new features. Most of my time there was spent in (what I affectionately refer to as) the dark corners – performance and availability. In fact, the header image for this post is a re-enactment of me working in one of GitHub's dark corners (thanks DALL-E 🤣). Pretty spot on.
Enough AI, let's build some trust...
My first project at GitHub was starting the analytics team, which I've written about before. The tl;dr is we built a pretty darn resilient system and the first public result of that was the repo traffic graphs you know and love. I won't go into details here, since they are already written up in the link you just read past, but give it a read if you want more.
After analytics, I helped move notifications (aka the blue dot and all the emails you get) off the primary database cluster to a shiny new one (using lots of feature flags, of course). A co-worker (shout out to Rocio) and I followed that work up with a big resiliency project for notifications.
We added circuit breakers and response objects as a way to insulate github.com (as a whole) from notifications being unavailable. Makes sense, right? You should be able to commit, browse and make issues while notifications are struggling.
Following the notifications work, I joined a team that was making GitHub.com work from a second data center. The speed of light is a real thing, so adding a data center on the other side of the country (to help with SSL termination for our friends across the ocean) is no easy task.
At this point, GitHub was a top 100 or whatever website in the world. I was a very small part of that, but I was part of it – enough so that I understand what works and what doesn't at scale.
I say all this just to establish that I'm not some rando on the internet, but that's enough background. Let's get into the nitty gritty.
Most feature flagging systems merely adopted local storage.
Flipper was born in it. Molded by it.
I promise this corny reference will make sense shortly.
Cache rules everything around me
Long, long ago, I worked on Words with Friends during a period where it grew from 50k to > 1M rpm. I like to refer to this time period as when I wrote memcached code for a living.
Whenever we rolled out a new caching feature, the site went down trying to warm up the new cache. To combat this, we started using rollout (which was Redis-only at the time). We'd slowly roll out new caching using a percentage-based approach until we were at 100%. Life was great.
But before too long Redis became our biggest point of failure, so we forked rollout and fronted it with memcached (which we had a metric ton of).
One gem to store them all
It always bothered me that we had to fork and maintain a separate project. Most of the logic was in the gem. The storage in rollout was just a handful of simple operations (set, get, delete). Around that time I was reading some book and stumbled on the adapter pattern (again, this was long ago, I'm old).
One weekend I felt like hacking and decided a version of rollout based on the adapter pattern with (IMO) a more pleasant interface was what the world needed. So I built it.
Before long, there were adapters for ActiveRecord, Sequel, Redis, Mongo, and even Rollout. Creating a new adapter was a matter of defining a handful of methods. Store your feature flag data wherever you want. 💥
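To make that concrete, here's a toy sketch of the adapter idea. The class and method names below are illustrative, not the exact Flipper adapter protocol: the point is that an adapter is just an object implementing a handful of storage methods, so the backing store can be anything.

```ruby
# Illustrative in-memory adapter (not the real Flipper adapter interface).
# Any object with these few methods could back the gem's flag logic.
class MemoryAdapter
  def initialize
    # Each feature maps to a hash of its gates (boolean, percentage, etc.).
    @store = Hash.new { |hash, key| hash[key] = {} }
  end

  # Read everything stored for a feature.
  def get(feature)
    @store[feature]
  end

  # Turn a gate on for a feature with a given value.
  def enable(feature, gate, value)
    @store[feature][gate] = value
    true
  end

  # Turn a gate off for a feature.
  def disable(feature, gate)
    @store[feature].delete(gate)
    true
  end
end

adapter = MemoryAdapter.new
adapter.enable(:search, :percentage_of_actors, 25)
adapter.get(:search) # => { percentage_of_actors: 25 }
```

Swap the hash for Redis calls, ActiveRecord queries, or Mongo operations and the rest of the gem never knows the difference.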
Back to Cloud
This is where Cloud comes in. How do we design an adapter that ensures your flags are available when Cloud is not? Well... everything in systems design is about trade offs. So let's think about this for a sec.
The critical path of feature flags is reads – is this feature enabled for this actor? Reads happen many orders of magnitude more often than writes.
In fact (if memory serves me), GitHub did billions of flag reads per day compared to tens of flag writes. So the answer to being highly available for this type of system is to keep reads local to the app. Writes can be slower because they happen less often, but reads need to be quick.
To accomplish this, we created a dual write adapter. Perhaps it would be better named LocalReadDualWrite or something, but I'm lazy and that felt really long.
The dual write adapter sends all reads to a local adapter and all writes to a remote adapter (and then local). Follow those links for the source or check out the abbreviated version below with one example of each.
def initialize(local, remote)
  @local = local
  @remote = remote
end

# [snipped for brevity...]
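To see the whole shape at once, here's a self-contained sketch of the dual write idea using toy adapters. This is illustrative only – the real classes (and the full method set) live in the Flipper source linked above.

```ruby
# Toy adapter backed by a plain hash, standing in for Redis/ActiveRecord/etc.
class HashAdapter
  attr_reader :store

  def initialize
    @store = {}
  end

  def get(feature)
    @store[feature]
  end

  def enable(feature, value)
    @store[feature] = value
  end
end

# Sketch of a dual write adapter: reads stay local, writes go to both.
class DualWrite
  def initialize(local, remote)
    @local = local
    @remote = remote
  end

  # Reads only ever touch the local adapter – no network on the hot path.
  def get(feature)
    @local.get(feature)
  end

  # Writes go to the remote adapter first, then to local.
  def enable(feature, value)
    @remote.enable(feature, value)
    @local.enable(feature, value)
  end
end

local  = HashAdapter.new
remote = HashAdapter.new
flags  = DualWrite.new(local, remote)

flags.enable(:search, true)
flags.get(:search) # => true, served entirely from local storage
```

If the remote is unreachable, only writes notice; every read still resolves against data your app already holds.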
If you are using the Redis adapter and you switch to Cloud, we configure Redis as your local adapter and Cloud as your remote. All your reads go only to the Redis adapter. Same goes for ActiveRecord, Mongo or whatever else you choose.
So at this point, Cloud works the same as open source Flipper. You have not lessened your availability or resiliency. Cool. In fact, the only HTTP requests (or connections of any sort) from your app to Cloud happen either in a background thread or in a single webhook request.
This means that, in the worst case scenario, you have a background thread occasionally attempting an HTTP request and failing, or an occasional failed webhook.
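That background behavior can be sketched roughly like this. The names here are hypothetical (not the actual flipper-cloud internals): the idea is simply to copy remote flag state into local storage on an interval, and to swallow failures so reads never notice.

```ruby
# Stand-in for a remote HTTP adapter that can fail like a real network call.
class FakeRemote
  def initialize(data, failing: false)
    @data = data
    @failing = failing
  end

  def features
    raise IOError, "connection refused" if @failing
    @data
  end
end

# Copy remote flag state into local storage. On any failure, log and keep
# serving the last known-good local state instead of raising.
def sync(local, remote)
  local.replace(remote.features)
  true
rescue StandardError => e
  warn "flag sync failed, keeping stale local flags: #{e.message}"
  false
end

# The background thread: failures never escape into the app's request path.
def start_sync(local, remote, interval: 10)
  Thread.new do
    loop do
      sync(local, remote)
      sleep interval
    end
  end
end

local = { "search" => true }
sync(local, FakeRemote.new({ "search" => true, "dark_mode" => false }))
sync(local, FakeRemote.new({}, failing: true)) # => false; local keeps last good state
```

The worst case is exactly what the post describes: a thread that logs a failed request once in a while, with your app none the wiser.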
I'm a big fan of mental models. They make it easy to fit the world into your brain. The best mental model for Flipper Cloud is "Dropbox for feature flags".
If Dropbox is down, you can still access the files on your computer. But it will be more difficult to sync your files across devices (maybe switch to Bluetooth or, God forbid, cables).
Similarly, if Flipper Cloud (or your internet) is down, your app will continue to just work. But syncing won't, which will make it more difficult to change your features (aka files in the Dropbox analogy). Sure, it's inconvenient but it's not a blocker.
Lastly, while we're talking mental models, I'd love for you to picture Cloud as an adapter that wraps your local adapter in a big, cuddly hug (aka extra functionality you neither have to build nor maintain).
The best part is that as the hug ends, Cloud drops audit history, point-in-time rollback, collaboration with teammates, per-environment permissions, useful analytics, and a gorgeous (mobile-friendly) UI in your pocket as parting gifts.
Hopefully this clears everything up, but if you have any questions, concerns, or high fives, definitely reach out.