Switch ActiveJob backends without disruption

Want to know the secret to making big changes? Don't. Make small changes, one step at a time.

Earlier this year, we helped a customer migrate their Rails application to a new ActiveJob backend. As with most applications, their background jobs performed critical workloads, like emailing users, communicating with external services, and keeping things tidy. Even a momentary blip in background job processing can quickly become a user-facing issue.

So here's how we did the migration, one job at a time, to ensure no user-visible interruptions.

1. Add the new backend

We migrated to GoodJob, which we are big fans of thanks to its multithreading, its use of Postgres for storage, and its tight integration with ActiveJob. You can use whatever backend suits your needs. In Rails, ActiveJob decouples applications from specific job queue backends, but there are still implementation and operational details that could prevent jobs from running successfully.

The temptation when replacing a system is to add the new one and remove the old one at the same time. But why make drastic changes when you can run both systems side by side, test the new one as you stand it up, and then safely switch from one to the other?

Our first step is to add the new backend. We followed GoodJob's setup instructions, except for the part where it says to "Configure the ActiveJob adapter":

# Don't do this yet
# config.active_job.queue_adapter = :good_job

Instead of a wholesale change, add a new job, and set it to run on the new queue:

# app/jobs/new_worker_job.rb
class NewWorkerJob < ApplicationJob
  self.queue_adapter = :good_job
  
  def perform
    # nothing to do here
  end
end

Guess what?! This code is ready for production. Yup, seriously. Deploy now and do whatever is needed to run the new worker in production. For example, we added this to the Procfile and deployed to Heroku:

+ # Old worker
  worker: bin/rake jobs:work
+ # New worker
+ worker_next: bundle exec good_job start

Now we can verify that the new worker is working in production and able to process jobs by scheduling our new job from rails console:

NewWorkerJob.perform_later

✅ Small step 1. Users disrupted by this deploy: 0.

2. Make it flippable

Our second small step in this big change is to try some of our real jobs on the new backend. To flip specific jobs and have the ability to flip them back if we don't like how they're performing, we need to add a feature flag.

To review, a feature flag is just a conditional that performs different logic based on whether or not a feature is enabled for the given context. If you haven't already, you'll want to set up Flipper in your app.
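To make that definition concrete, here's a minimal plain-Ruby sketch of a feature flag that doesn't depend on Flipper at all. TinyFlags and pick_backend are hypothetical names used only for illustration. Flipper does far more (persistent storage, actors, percentages), so treat this as a mental model, not as how Flipper is implemented.

```ruby
require "set"

# Hypothetical in-memory stand-in for a feature flag store.
# Flipper persists flags and supports actors and percentages; this does not.
class TinyFlags
  def initialize
    @enabled = Set.new
  end

  def enable(name)
    @enabled.add(name.to_sym)
  end

  def disable(name)
    @enabled.delete(name.to_sym)
  end

  def enabled?(name)
    @enabled.include?(name.to_sym)
  end
end

# The flag is just a conditional: the same call site picks different behavior.
def pick_backend(flags)
  flags.enabled?(:worker_next) ? :new_backend : :old_backend
end

flags = TinyFlags.new
puts pick_backend(flags) # flag is off, so the old path is chosen
flags.enable(:worker_next)
puts pick_backend(flags) # flag is on, so the new path is chosen
```

The important property is that flipping the flag changes behavior at runtime, with no deploy in between.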

So we might revise the new job we added to override its queue_adapter class method, something like this:

# app/jobs/new_worker_job.rb
class NewWorkerJob < ApplicationJob
  cattr_reader :new_queue_adapter, default: GoodJob::Adapter.new

  def self.queue_adapter
    Flipper.enabled?(:worker_next) ? new_queue_adapter : super
  end
  
  def perform
    # nothing to do here
  end
end

Now we can toggle this job from the old queue adapter to the new one with Flipper.enable(:worker_next). But instead of adding this logic to every job, we can add a feature flag that works for any job.

With Flipper, you can enable and disable features based on the "actor", which is any Ruby object that responds to the flipper_id method. So if we make our job class an actor, we can override how queue_adapter works for every job class, and enable the feature flag for specific jobs.

Here's a module that we can include into ActiveJob that will allow us to flip any job to the new adapter:

# config/initializers/active_job_flipper.rb

module ActiveJob::Flipper
  extend ActiveSupport::Concern

  included do
    # Initialize an adapter for the new job backend
    cattr_reader :new_queue_adapter, default: GoodJob::Adapter.new
  end

  class_methods do
    # Allow enabling the new adapter for specific jobs.
    # Run `Flipper.enable(:worker_next, JobClass)`
    def queue_adapter
      ::Flipper.enabled?(:worker_next, self) ? new_queue_adapter : super
    end

    # Allow this class to be treated as an actor by Flipper
    def flipper_id
      "job;#{name}"
    end
  end
end

# Include into ActiveJob so any job (including those defined by Rails)
# can be enabled
ActiveJob::Base.include ActiveJob::Flipper

With that little bit of code, we added a feature flag that maintains the current behavior by default, and allows us to flip to the new queue adapter for specific jobs.
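The override leans on ordinary Ruby method lookup: class methods mixed in later sit in front of the ones mixed in earlier, so super falls through to the original adapter whenever the flag is off. Here's a sketch of just that mechanism, with no Rails, Flipper, or GoodJob involved. FakeFlags, DefaultAdapter, BaseJob, FlippableAdapter, CriticalJob, and the symbol "adapters" are all hypothetical stand-ins.

```ruby
require "set"

# Hypothetical stand-in for Flipper: remembers which actor ids are enabled.
module FakeFlags
  @enabled_ids = Set.new

  def self.enable(actor)
    @enabled_ids.add(actor.flipper_id)
  end

  def self.disable(actor)
    @enabled_ids.delete(actor.flipper_id)
  end

  def self.enabled?(actor)
    @enabled_ids.include?(actor.flipper_id)
  end
end

# Stand-in for the class methods the framework provides by default.
module DefaultAdapter
  def queue_adapter
    :old_adapter
  end
end

# Stand-in for ActiveJob::Base.
class BaseJob
  extend DefaultAdapter
end

# Stand-in for the concern's class_methods block.
module FlippableAdapter
  def queue_adapter
    FakeFlags.enabled?(self) ? :new_adapter : super
  end

  # Responding to flipper_id is what lets a job class act as an actor.
  def flipper_id
    "job;#{name}"
  end
end

# Extended after DefaultAdapter, so this version is found first.
BaseJob.extend(FlippableAdapter)

class UnimportantJob < BaseJob; end
class CriticalJob < BaseJob; end

FakeFlags.enable(UnimportantJob)
puts UnimportantJob.queue_adapter # enabled for this class only
puts CriticalJob.queue_adapter    # untouched, falls through to super
```

Because the flag is checked every time queue_adapter is called, disabling it for a class sends that class straight back to the old adapter.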

Go ahead, deploy this on Friday afternoon. Nothing changes. While that's not usually the goal of deploying, that's exactly what we want in this case.

✅ Small step 2. Users disrupted by this deploy: 0.

3. Turn and burn!

When you are good and ready, you can enable specific jobs from rails console or Flipper Cloud. Start with the jobs that have the least impact on end users:

Flipper.enable_actor :worker_next, UnimportantJob

After monitoring how that one job is performing, you can enable more of your jobs, or some of Rails' internal jobs:

Flipper.enable :worker_next, ActiveStorage::PurgeJob
Flipper.enable :worker_next, ActiveStorage::AnalyzeJob
Flipper.enable :worker_next, ActionMailer::MailDeliveryJob

Feeling good? Maybe enable for 50% of your jobs:

Flipper.enable_percentage_of_actors :worker_next, 50
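Since the actors here are job classes, a percentage of actors selects roughly that share of job classes, not that share of individual job runs, and a given class gets a consistent answer rather than a coin flip on every enqueue. The sketch below illustrates that kind of deterministic bucketing with a CRC32 hash. It shows the general idea and is not meant to reproduce Flipper's exact algorithm; in_rollout? is a hypothetical helper.

```ruby
require "zlib"

# Deterministically decide whether an actor falls inside a percentage rollout.
# Hashing the feature name together with the actor id means the same actor
# always lands in the same bucket for a given feature.
def in_rollout?(feature_name, actor_id, percentage)
  bucket = Zlib.crc32("#{feature_name}#{actor_id}") % 100
  bucket < percentage
end

job_ids = %w[
  job;UnimportantJob
  job;ActiveStorage::PurgeJob
  job;ActiveStorage::AnalyzeJob
  job;ActionMailer::MailDeliveryJob
]

job_ids.each do |id|
  backend = in_rollout?(:worker_next, id, 50) ? "new backend" : "old backend"
  puts "#{id}: #{backend}"
end
```

With only a handful of job classes, "50%" can easily turn out to be one class or all but one, so it's worth checking which jobs actually flipped before moving on.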

Still looking good? If not, you can disable specific jobs or everything:

Flipper.disable :worker_next, YourJobName
Flipper.disable_percentage_of_actors :worker_next
Flipper.disable :worker_next

Keep enabling specific jobs until you feel confident to just flip it on for everything:

Flipper.enable :worker_next

✅ Small step 3. Users disrupted by this step: 0? Or, worst case, only a handful, and only for a brief moment.

4. It's not so hard to say goodbye to yesterday

If you are anything like me, this next step will be your favorite, because we get to delete code!

Remove any remaining references to the old backend or the :worker_next feature in the code and comments, including the active_job_flipper.rb initializer we added in step 2. Delete the gem from the Gemfile. Delete database tables. Disable the old worker in production. Shut down any unused service dependencies.

Ahh, I feel so free! Before you deploy, just make sure tests pass and nothing else breaks. This step is where I tend to make a careless mistake because I am less nervous about affecting users.
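One cheap guard against that kind of careless mistake is a script that lists any leftover references before you deploy. Here's a minimal sketch; leftover_references is a hypothetical helper, and the file glob and patterns are assumptions you'd adjust for your app and your old backend.

```ruby
# Hypothetical helper: list lines that still mention any of the given patterns.
# Returns an array of "path:line_number: text" strings.
def leftover_references(root, patterns, glob: "**/{*.rb,*.yml,*.erb,*.rake,Procfile,Gemfile}")
  Dir.glob(File.join(root, glob)).sort.flat_map do |path|
    next [] unless File.file?(path)

    File.readlines(path, chomp: true).each_with_index.filter_map do |line, index|
      # Line numbers are 1-based to match what an editor shows.
      "#{path}:#{index + 1}: #{line.strip}" if patterns.any? { |pattern| line.match?(pattern) }
    end
  end
end
```

From the application root, something like `puts leftover_references(".", [/worker_next/, /new_queue_adapter/])` should print nothing once the cleanup is complete.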

✅ Small step 4. Users disrupted by this deploy: 0.

…One giant leap…

We deployed three times: first to add the new backend, then to add a feature flag that can run any job on the new backend, and finally to remove the old backend. In exchange for two extra deployments, which added maybe an hour to this migration, we removed almost all of the risk.

That's how we make big changes one small step at a time.