When config edits start feeling like deploys

A scenario that comes up at most companies running Rails on Kubernetes, and one I want to walk through carefully because it shapes everything else in this post.

It is the middle of a Tuesday afternoon, and a rollout that went out last week is misbehaving in production. The graph that everyone is staring at (error rate, request latency, conversion through some checkout step) is moving in a direction that someone with operator instincts recognizes as "stop the music". An engineer or analyst opens an internal admin tool, finds the misbehaving rollout, drops the percentage from twenty-five to zero, and clicks save. The UI flashes green. They close the laptop and refresh the dashboard.

A minute later they come into the engineering channel and ask the question that makes everyone sit up: "is this actually applied yet?"

The honest answer is something like: probably, in another thirty seconds, on most pods. That answer is not good enough during an incident. It is barely good enough on a quiet day. And the reason it is the answer at all is that the values they were editing live in a database that the running application caches in memory, and those in-memory caches do not refresh on their own without help.

This post is about how I got from "probably, in thirty seconds, on most pods" to "yes, everywhere, within a second, and I can prove it". It is about a category of configuration changes that has to feel like a button click rather than a deploy, even though the underlying data is owned by an ordinary Postgres row and read on a hot path by hundreds of running processes. The path I walked to get there had five distinct attempts in it. The post walks through all five, in order, because the design that landed only makes sense once you have seen the four wrong turns it is reacting to.

What kind of configuration we are even talking about

Before any of that, I want to spend a few paragraphs being precise about the problem, because the answers to "how should I propagate config changes" depend almost entirely on what kind of config you are talking about.

The first system at the center of this post is what I will call a rollout-allocation system. This is a service that decides, for a given user and a given product change, which variant that user sees. New checkout flow versus old checkout flow. New repayment screen versus old. New onboarding ladder versus existing one. The records that drive these decisions are ordinary database rows. Each row carries a name, an audience definition, a percentage, a set of buckets, and a state field that says whether this rollout is currently live, on hold, or retired. Every product surface that runs an A/B test or a phased launch reads from this registry on the request path.

The second system, built later on the same plumbing, is a feature-toggle registry: a flat set of named boolean switches that callers ask "is this enabled?" against, with a small amount of audit metadata attached to each. The shape is simpler than the rollout system, but the propagation problem is the same.

The third thing I will name now and lean on later is seeded config records, which is what I will call a small in-house pattern for storing certain kinds of configuration as ordinary ActiveRecord rows that ship with the application. The records live in YAML files inside the repository at deploy time. On boot, an initializer reads those YAML files and upserts them into the database. Most of the rollouts and most of the toggles begin life this way, which means they have a release-time write path baked in. That detail matters later when I talk about why I had to be careful not to create two write paths into the same field.
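
To make the later references to this pattern concrete, here is a minimal sketch of that boot path. The module name comes from the guard that appears in the model code later in the post; the file layout, attribute handling, and method names are assumptions for illustration, not the real in-house code:

module SeededRecords
  def self.seeding?
    !!Thread.current[:seeding_records]
  end

  def self.hydrate!
    Thread.current[:seeding_records] = true
    # Illustrative path; the real repo layout is not shown in this post.
    Dir[Rails.root.join("db/seeded/rollouts/*.yml")].each do |path|
      attrs = YAML.safe_load(File.read(path))
      Rollout.find_or_initialize_by(name: attrs["name"]).update!(attrs)
    end
  ensure
    Thread.current[:seeding_records] = false
  end
end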

All three of these have something in common. The data sits in Postgres because Postgres is the boringly correct place for it. It has the auditing, the migrations, the backup story, and the tooling any engineer working on it is already comfortable with. I did not want to introduce a separate flag service, a separate config server, or any other system whose job would be to hold "the truth" alongside the database. That decision is what makes the propagation problem interesting; it is also what rules out a whole class of off-the-shelf answers.

Why this is harder than it looks

When I describe the problem to people who have not lived it, the first reaction is almost always "isn't this just a feature-flag platform?" or "isn't this just a Redis pub/sub away from being solved?" Both of those are reasonable instincts and both turn out to be wrong for specific reasons that are worth pulling apart up front, because the same instincts will keep coming back through the iterations.

A feature-flag-as-a-service product like LaunchDarkly or Flipper Cloud is great at what it does, which is short-lived boolean toggles with targeting rules evaluated client-side or through an SDK. The data structures the rollout-allocation system needs are not booleans. They are rich records with bucket configurations, scoped audience matchers, and bespoke logic per product category. You could pour those into a flag platform's "JSON value" field and write code on top, but at that point the platform is doing the job of being a second source of truth for shapes it was not designed to hold, and you have shipped the dual-write problem on purpose. The same instinct underlies the Netflix Archaius family of dynamic-config libraries, which treat configuration as a hierarchy of property sources with listeners for change events. Those libraries are well built; the problem they solve is broader than the problem here, and adopting one would mean carrying a runtime layer whose flexibility I never intended to use.

A Redis pub/sub channel is, on paper, even closer to the right shape. You publish a small "this entity changed" message when a write happens, every running worker subscribes, and each one refreshes its cache. I did try this, in iteration two, and I will get into why it did not survive in a moment. The short version is that Redis pub/sub is fire-and-forget, and "fire-and-forget" is exactly wrong for "every node should converge on the same state".

Restarting the pods on every config change is the comedy answer, but it is worth saying out loud why it does not work either. The deploy pipeline is long enough that using it as the cost of moving a percentage from twenty-five to zero during an incident is unacceptable. Beyond the latency, the broader cultural problem is that once people learn that small config changes require deploys, they stop reaching for those changes during incidents at all. They reach for the deploy itself instead, which is a much heavier instrument.

There is one more option which I do not want to skip past, which is "just shorten the cache TTL and let polling pick up the change faster". This is the obvious thing, it is what I did first, and it is iteration one of the post.

What I wrote down before any code

A handful of requirements drove the rest of the work, and it is worth listing them in one place before going through the iterations, because every iteration was being judged against this list whether I said so out loud or not.

The first is durability. If the admin UI says "saved", every running process must converge on the new value, or the save itself must fail loudly. The worst possible outcome is not a wrong value; it is an inconsistent fleet that nobody knows is inconsistent. People stop trusting fast paths the moment they catch one lying.

The second is convergence inside about a second across the fleet, by which I mean from "save click" to "every Puma worker has refreshed its in-memory cache from the database, on every running pod". This is an operational target, not a hard mathematical guarantee. It is the rough threshold below which the operator stops asking "is this applied yet?" because the answer is reliably "yes".

The third is catch-up on startup, which means that any approach that only works if the running process happened to be alive when the change happened is fragile. Pods restart all the time. Pods get added under autoscale. Deployments roll. Hydration of fresh state at boot has to be a first-class operation, not an afterthought.

The fourth is hot-path reads stay in memory. The whole point of caching this data was that it gets read hundreds of times per request through nested service calls; I was not going to give up that property to make writes easier. The database is touched on writes (when the truth changes) and on refresh (when something tells a process its cached copy is stale), but never on the request path.

The fifth is a single source of truth, the database. Not Redis. Not a sidecar. Not the Kubernetes objects I end up using as a propagation channel. Postgres has the auditing, the migrations, and the operational story I already trust. Every other moving part in the design is a derived view of what the database says.

The sixth is no redeploy for these specific dynamic fields. If updating a field requires a redeploy, the work was pointless.

The seventh is prevent drift between the release-time write path and the runtime write path. Once a field becomes runtime-editable through the admin UI, the release-driven seeding path that used to write it has to be blocked or it is only a matter of time before two writes step on each other.

These seven shape every iteration that follows, and I will refer back to them as each iteration runs into the wall it runs into.

Iteration 1: poll harder

The first attempt was the cheapest one available. The Rails app already cached the rollout records in process memory at boot. The read path was a hash lookup. Request-time database reads were zero. The cache had a built-in five-minute time-to-live before it would re-hydrate from Postgres. So the most lightweight "make changes propagate faster" change was to reach in and shorten the TTL.
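
For orientation, the shape of that cache looked roughly like this. This is an illustrative sketch, not the real index code that appears later in the post:

class Rollout < ApplicationRecord
  CACHE_TTL = 5.minutes

  def self.index
    if @index.nil? || @indexed_at.nil? || Time.current - @indexed_at > CACHE_TTL
      @index      = all.to_a.index_by { |r| r.name.to_sym }  # one DB read per rebuild
      @indexed_at = Time.current
    end
    @index
  end

  def self.for_name(name)
    index[name.to_sym]  # hot path: a hash lookup, never a query
  end
end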

I dropped it to thirty seconds.

This is a perfectly fine cache. Shrinking the TTL cut the worst-case staleness from five minutes to thirty seconds and added zero new failure modes. The trade-off was understood at decision time: polling is bounded by the TTL by construction, and a thirty-second TTL covered everyday operations cleanly. The case the design had not been stress-tested against was a live incident.

Then a real production incident happened. A rollout was misbehaving and the right call was a full pause: percentage to zero, immediately. The operator who made the change in the admin tool came into the engineering channel and asked exactly the question I quoted at the very top of this post: "is this actually applied yet?". The honest answer was the one I gave at the top too, which is "probably, in thirty seconds, on most pods". Thirty seconds is a long time during a live incident, and "most pods" is worse than it reads, because cache expiry across worker processes is not synchronized. A worker that hydrated its cache twenty-nine seconds ago expires in the next second; a worker that hydrated half a second ago waits another twenty-nine. The convergence window is not a clean step function, it is a TTL-wide smear, and the operator has to assume the worst case throughout it.

There is a second problem with polling that becomes visible only at scale. Every cache rebuild is a database read; the wider the fleet, the more reads per minute the rollout table absorbs purely to refresh things that mostly have not changed. At five-minute intervals, this load was easy to miss. At thirty-second intervals, it was no longer background noise, it was a small but real load floor that grew with every additional pod. Polling can be made arbitrarily fresh, but only by buying that freshness in database reads, and the trade was wrong for the shape of this problem.

The conclusion from iteration one was that I needed push, not pull. Polling will converge eventually, and "eventually" was the part that kept failing during incidents.

Iteration 2: a Redis pub/sub channel

If polling is the problem, pushing is the answer. Redis was already in the stack for jobs and a few other caches, and adding a "config changed" channel was a one-day change. The publisher was a few lines in an after_commit callback, the subscriber was a few lines in a worker boot hook, and convergence in the test environment was reliably sub-second.

# In the model, after_commit:
def broadcast_change
  $redis.publish(
    "config:changes",
    { record_id: id, kind: "rollout" }.to_json,
  )
end

# In each worker, on boot:
Thread.new do
  $redis.subscribe("config:changes") do |on|
    on.message do |_channel, payload|
      data = JSON.parse(payload)
      Rollout.refresh_index_for_rollout(data["record_id"])
    end
  end
end

In testing this looked excellent. The administrative UI saved a change, every subscribed worker refreshed its cache within milliseconds, and the operator question that mattered ("is this applied?") had a confident "yes" attached to it. The trade-off I was watching for was pub/sub's at-most-once delivery contract: would the convergence requirement hold against missed messages in practice? Two specific failure modes returned the answer.

The crack came from a property the Redis documentation had been telling me about the entire time. The Redis pub/sub glossary entry puts it bluntly: pub/sub is a fire-and-forget pattern, and any message published while no subscriber happens to be listening evaporates and cannot be recovered. This is not a bug. It is the contract Redis pub/sub offers, which is at-most-once delivery, no persistence, no acknowledgment, and no replay.

Two failures landed within a few weeks of each other and both of them were direct consequences of that contract. The first was a benign-looking pod restart caused by an unrelated memory issue. The new worker came up, hydrated its cache from the database, subscribed to Redis, and started serving traffic. During the seconds between the old worker dying and the new worker subscribing, a real configuration change had been published, and the new worker never received it. That worker got lucky: its startup hydration happened to run after the database write, so its initial cache was already correct. But the design itself did not guarantee that ordering, and that gap was the real lesson.

The second failure was much nastier. A brief Redis network partition meant that a subset of pods could not reach the pub/sub channel for a window of seconds. During that window, a config change went out. The partition healed, the pods reconnected, and they never received the missed message. Their caches stayed stale silently. There was no error log, no metric, no alert. Someone noticed inconsistent behavior across pods and went looking, which is by some distance the worst way to discover that a propagation channel has lost messages.

The structural problem is that pub/sub is a fanout primitive, not a state-synchronization primitive. If the question you are trying to answer is "every node should converge on the latest state of this thing", you need something with replay semantics. I could have moved to Redis Streams, which adds durability, consumer groups, acknowledgment, and a retention policy. I did not take that path for two reasons. The first was that "config propagation depends on Redis being healthy" felt wrong as a sentence to write down, given that Redis had just been the thing that broke the design. The second was that streams come with operational machinery (pending entries, retention windows, consumer-group coordination) that I would happily take on for a queue but felt heavy for a config-invalidation channel.
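
For a sense of that weight, here is roughly what the Streams path would have required on the consumer side. This is a sketch of code I did not ship; the group naming, stream name, and trimming policy are stand-ins:

STREAM = "config:changes"
# Every process must see every message, so each process would need its own
# consumer group, which immediately raises the question of who deletes
# groups for pods that no longer exist.
group = "config-#{ENV.fetch('HOSTNAME', 'local')}-#{Process.pid}"

begin
  $redis.xgroup(:create, STREAM, group, "$", mkstream: true)
rescue Redis::CommandError
  # group already exists, for example after a listener restart
end

loop do
  entries = $redis.xreadgroup(group, "listener", STREAM, ">", count: 10, block: 5_000) || {}
  (entries[STREAM] || []).each do |entry_id, fields|
    Rollout.refresh_index_for_rollout(fields["record_id"].to_i)
    $redis.xack(STREAM, group, entry_id)
  end
  # Still unhandled: claiming pending entries from dead consumers, trimming
  # the stream, and retention policy -- the machinery the paragraph above names.
end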

What I wanted, with the benefit of hindsight, was a channel that was either reliable or whose unreliability was loud. Pub/sub was neither, and I needed to find a different one.

Iteration 3: ConfigMap volumes and the inotify era

The third design moved to Kubernetes-native primitives. Production was already running on Kubernetes, and Kubernetes has a built-in mechanism for distributing small pieces of configuration to running pods, which is a ConfigMap. ConfigMaps are stored in etcd, replicated by the control plane, and projected onto the filesystems of the pods that reference them. They are, in a real sense, exactly what I wanted: a small key/value object that the platform takes responsibility for getting onto every pod that asks for it.

The Kubernetes documentation lists four ways to consume a ConfigMap from inside a pod: through container command and arguments, through environment variables, through files in a read-only volume, and through code that talks to the Kubernetes API directly. The first two require a pod restart to pick up changes, because both bake values at process start; the tutorial on updating ConfigMaps confirms that environment variables will not refresh until a rollout happens. That is a non-starter for these requirements. The third option, mounting the ConfigMap as a volume of files, was what I tried first, because it required the least new ground. Each entity got a key in the ConfigMap, the ConfigMap was mounted as a volume of files at a path inside the container, and a small file-watcher inside the application would react when those files changed.

The listen gem was already in the Gemfile as a transitive dependency of Rails. It wraps inotify on Linux and gives you a clean Ruby API for reacting to filesystem changes. The implementation looked roughly like this:

notifier = Listen.to("/etc/rollouts-config") do |modified, added, removed|
  (modified + added).each do |path|
    record_id = File.basename(path, ".json").to_i
    Rollout.refresh_index_for_rollout(record_id)
  end
  removed.each do |path|
    record_id = File.basename(path, ".json").to_i
    Rollout.evict(record_id)
  end
end
notifier.start

In a minikube setup this worked. I did a small proof-of-concept to verify the basic constraints. It confirmed that the 1 MiB ConfigMap size limit applies per ConfigMap rather than per file, that putting one key per record gives per-entity invalidation while a single big key would force a reload of everything on every change, and that markers around a hundred bytes each leave ample headroom inside the size budget. The shape of the design was right. Where it broke was operational.

The first thing it broke on was latency. Volume-mounted ConfigMap updates are not instant. The task documentation is explicit: the total delay from updating the ConfigMap to the new keys appearing inside the pod can be as long as the kubelet sync period (one minute by default) plus the TTL of the ConfigMaps cache inside kubelet (one minute by default). On a bad day this blew past a one-second target by a wide margin.

The second thing it broke on was a behavior that is documented in detail in Ahmet Alp Balkan's blog post on pitfalls of reloading files from Kubernetes Secret and ConfigMap volumes and tracked in kubernetes/kubernetes#112677. When kubelet updates the contents of a ConfigMap volume, it does not modify the user-visible files in place. It uses an internal mechanism called AtomicWriter. It writes the new files into a fresh timestamped directory, atomically swaps a ..data symlink to point at the new directory, and then deletes the old timestamped directory once nobody is reading it.

This is fine if you know about it, and hostile to inotify if you do not. The user-visible files do not get clean IN_MODIFY events. They get IN_DELETE_SELF events on the old symlinks, because as far as inotify is concerned the file pointed at has been replaced by a different file. To handle this, your application has to interpret "deleted" as "atomically replaced", re-establish the watch on the new path, and not panic. I was watching for "deleted means deleted", which is what every other normal filesystem trains you to expect.

In practice, the watcher threads kept dying under churn, with bare thread-terminated logs and no useful stack trace. The early hypothesis was memory pressure; the actual diagnosis, which I go through in the production-issues section near the end of the post, was the symlink flip tearing the listen gem's internal state apart every time kubelet did its atomic swap.

There was a third, quieter, mode of failure on this iteration that I want to call out separately because it is the kind of thing you only learn about by reading the docs all the way through. Some pods historically used subPath mounts for adjacent configurations to keep the projection narrow. The Kubernetes documentation states plainly that "a container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates". I was not planning to use subPath mounts for the ConfigMap at the center of this work, but discovering the carve-out while debugging unrelated things added another tally to the column "filesystem-based config distribution has too many ways to fail quietly".

At some point during this iteration I asked myself the question that ends every iteration: am I building a sturdy event stream here, or am I gluing a sturdy event stream onto an inherently unreliable filesystem-event substrate? I was doing the second of those. The fix was not another inotify trick or another guard around the listen gem; the fix was to leave the filesystem path entirely.

Iteration 4: the API watch, with values stored inside

The next move came from realizing that the publisher side was already talking to the Kubernetes API. The kubeclient gem was already in the Gemfile, used to patch values into the ConfigMap whenever the admin UI saved a change. The same gem exposes a watch interface for receiving change events directly from kube-apiserver. The Kubernetes documentation explicitly endorses "code inside the pod that uses the Kubernetes API" as a supported way for an application to subscribe to ConfigMap changes and react in real time.

Switching from the inotify path to the API watch took about a week and the operational properties improved immediately. The event stream came back as structured ADDED, MODIFIED, and DELETED types instead of filesystem deletes pretending to be modifications. There was a resourceVersion cursor that let me resume from a known point after any disconnect. I did not have to wait for the kubelet sync at all, because the watch notifies as soon as etcd has accepted the change. And it gave me the clean list-then-watch pattern that the Kubernetes API concepts page describes as the recommended way to do incremental synchronization against a cluster resource.

The first version of this iteration stored the full payload of each entity inside the ConfigMap. Each key in the data section was a record ID, and each value was the JSON-serialized record. The watcher could pull the complete payload out of the watch event and refresh its cache without hitting the database at all. This design has appealing properties on paper: the watcher is fully self-contained, the read path on a refresh does not need to touch Postgres, and propagation latency is only as long as a watch event takes to traverse kube-apiserver.
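
Concretely, the iteration-four refresh path looked roughly like this. The snippet is a reconstruction, and replace_in_index is a placeholder name rather than real code from the repo:

kube_client.watch_config_maps(
  namespace: configmap_namespace,
  field_selector: "metadata.name=#{configmap_name}",
).each do |notice|
  data = notice.dig(:object, :data) || {}
  data.each do |_record_id, payload|
    attrs = JSON.parse(payload)        # the full serialized record, stored in the ConfigMap
    Rollout.replace_in_index(attrs)    # refresh the process cache without touching Postgres
  end
end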

It worked in staging. In production, the API watch path delivered the propagation properties iteration 3 had not. Two trade-offs of storing values inside the signal channel became the binding constraints as the system grew: the 1 MiB ConfigMap budget, and the dual-source-of-truth shape that comes with any payload-in-the-channel design.

The first crack was a concrete one. ConfigMap data is bounded by a hard 1 MiB limit enforced at the etcd layer. When you store full payloads, the size budget shrinks faster than the number of entries grows: a single record might be a few hundred bytes today and a kilobyte tomorrow as a richer field gets added, and you can cross from "comfortable headroom" into "patches start failing" without anyone noticing. That is exactly what happened. As the entity count grew over time and a few records picked up richer metadata, the ConfigMap crossed the size budget without warning. A routine save came back from kube-apiserver with a 422 Unprocessable Entity, and from there the failure mode got nasty fast, because of an asymmetry between the two write paths. The database write commits first and then fires after_commit. The ConfigMap patch happens inside that callback, and when it fails because the ConfigMap has gotten too big, the database has already accepted the new value. From the admin UI's point of view the save succeeded, because the database write committed; from the operator's point of view, the value flickered correctly in the form and then nothing changed in production, because the ConfigMap never got the new marker and the listeners had no signal to refresh. Pods kept serving the old value silently until something else triggered a write.

The second crack was structural and worse. By storing values inside the ConfigMap, I had given myself two sources of truth. The database had the canonical row. The ConfigMap had a serialized projection of that row, which the watchers were treating as the truth on the read path. The two were supposed to be identical, and most of the time they were, but the moment you have two stores that should agree you have implicitly committed to a class of bugs where they do not. The exact failure modes are hard to predict in advance and tend to come from serialization quirks, normalization steps that drifted, or fields that got added on one path and not the other. The honest framing is that the moment you have two stores claiming truth, the next incident is inevitably going to be "wait, which one was actually right?", and you find out about it by accident, which is the worst way to find out about anything.

The answer was not to add a bigger ConfigMap, and it was not to add validation between the two stores. The answer was that the ConfigMap should not have been a store at all. It should have been a signal. Every value sitting inside it was a footgun in waiting. Iteration five collapses iteration four down to that smaller idea.

Iteration 5: markers, not values

The design that finally held came from collapsing iteration four to its smallest useful form. The ConfigMap is still there, the Kubernetes API watch is still there, the publisher and listener services are still there. What changed is what the ConfigMap actually holds.

It does not hold values anymore. It holds markers.

A marker is the smallest piece of metadata that lets a listener answer two questions: "did this entity change?" and "since when?". Practically, that is a record ID, a record name, an updated-at timestamp, and (for the toggle registry) a current state. Anything more than that creeps back toward the dual-store world I just walked out of, so I keep it minimal on purpose. The database is the source of truth and the ConfigMap is the bell. When the bell rings, every running process walks back to the database, asks what changed, and updates the cache it is holding.

This is the doorbell metaphor I have been holding off using until now, and it is the part of the design I find genuinely worth writing about. The interesting move is not the ConfigMap or the Kubernetes API watch. The interesting move is refusing to put anything other than a "this has changed" signal into the propagation channel.

Here is the marker shape on the publisher side, from app/services/config_broadcaster/rollout_broadcaster.rb:

def marker_for(rollout)
  {
    id: rollout.id,
    name: rollout.name,
    updated_at: rollout.updated_at.iso8601,
  }
end

def key_for_record(rollout) = "#{rollout.id}.json"

And the toggle-registry version, from app/services/config_broadcaster/toggle_broadcaster.rb:

def marker_for(toggle)
  {
    id: toggle.id,
    name: toggle.name,
    enabled: toggle.enabled,
    updated_at: toggle.updated_at.iso8601,
  }
end

The toggle registry's marker carries the boolean state directly. I was aware while designing it that this creeps a step closer to "values inside the ConfigMap", which the iteration-four lesson warned against. The justification is that the toggle's whole truth is a single boolean and a name, so carrying it in the marker is a verbatim copy of the database row, not a serialized projection that can drift. If I ever extend the toggle marker to carry richer fields like rollout percentages, I will either have to formalize a dual-store discipline for it or move it onto the same "marker only" pattern as the rollout-allocation system. For now, the simpler shape is stable and I prefer it.

The end-to-end flow looks like this:

flowchart LR
    UI["Admin UI"] -->|1. update DB row| DB[("Postgres (truth)")]
    UI -->|2. patch ConfigMap| KAPI["kube-apiserver"]
    KAPI --> CM["ConfigMap (markers only)"]
    CM -->|3. WATCH event| W["Listener thread (per worker)"]
    W -->|4. read updated row| DB
    DB -->|5. fresh value| W
    W -->|6. refresh process index| CACHE[("Process-local index")]
    SVC["Request handler"] -->|hot path| CACHE

Zoomed in on a single update, the propagation timeline looks like this:

sequenceDiagram
    participant Admin as Admin UI
    participant DB as Postgres
    participant KAPI as kube-apiserver
    participant W as Listener in worker
    participant C as Process index

    Admin->>DB: UPDATE rollout SET ...
    DB-->>Admin: COMMIT
    Note over DB: after_commit fires
    DB->>KAPI: PATCH configmap with marker for id
    KAPI-->>W: WATCH MODIFIED event
    W->>W: diff snapshot, identify changed record id
    W->>DB: SELECT row by id
    DB-->>W: fresh row
    W->>C: refresh process index, clear derived caches
    Note over C: index consistent with DB

The actual machinery is split into two services with mirroring responsibilities. The publisher knows how to patch one key into the ConfigMap when the database changes. The listener knows how to consume a watch stream and refresh process-local state when a marker changes. Both extend small base classes that hold the kubeclient bookkeeping and a thin abstract interface that subclasses fill in.
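
The contract between base class and subclass, reconstructed from the snippets that follow; the exact method list is my reading of the code in this post, not a verbatim copy of the base classes:

module ConfigBroadcaster
  class Base
    # Subclasses supply the marker payload and the ConfigMap key for a record.
    def marker_for(record)     = raise NotImplementedError
    def key_for_record(record) = raise NotImplementedError
    # The base class supplies broadcast!, remove!, kube_client, and the
    # configmap name/namespace bookkeeping.
  end
end

module ConfigListener
  class Base
    # Subclasses turn ConfigMap keys back into record ids and do the actual
    # cache refresh and eviction; the base class owns hydration, the watch
    # loop, and the snapshot diff.
    def id_from_key(key)          = raise NotImplementedError
    def refresh_for_id(record_id) = raise NotImplementedError
    def evict_by_id(record_id)    = raise NotImplementedError
  end
end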

The publisher

The publisher is the smaller of the two services and most of the interesting logic is in the base class. The shape is "patch the ConfigMap with one named key/value change". Real code from app/services/config_broadcaster/base.rb:

def broadcast!
  body = marker_for(record)
  key  = key_for_record(record)
  patch_marker(key:, content: body.to_json)
end

private

def patch_marker(key:, content:)
  patch_body = { data: { key => content } }
  kube_client.patch_config_map(configmap_name, patch_body, configmap_namespace)
  logger.info('Broadcast marker', config_kind:, configmap: configmap_name, key:)
  true
rescue StandardError => error
  logger.error('Failed to broadcast marker', config_kind:, configmap: configmap_name, key:, error: error.message)
  raise
end

A few details in here are load-bearing in non-obvious ways. The patch sent to kube-apiserver is a strategic-merge patch, which means only the keys named in data are touched. Other entities sharing the same ConfigMap are not affected, and there is no need to read-modify-write the whole map to update a single entity. This matters more than it looks. It removes a whole class of races where two concurrent saves stomp each other's keys: each patch is applied server-side against the current object rather than through a client-side read-modify-write cycle, and etcd serializes the writes into a consistent linear history.

The publisher gets called from the model's after_commit callback. Real code from app/models/feature_toggle.rb:

class FeatureToggle < ApplicationRecord
  include FeatureToggle::IndexConcern

  validates :name, presence: true, uniqueness: true
  validates :name, format: { with: /\A[a-z0-9_]+\z/, message: :toggle_name_format }

  has_many :scheduled_holds, foreign_key: :toggle_name, primary_key: :name,
                             inverse_of: :feature_toggle, dependent: :nullify

  scope :enabled, lambda { where(enabled: true) }

  # Skip ConfigMap broadcast during seed-time hydration to avoid
  # firing watchers prematurely while the database is being populated.
  after_commit :broadcast_marker, unless: lambda { SeededRecords.seeding? }

  private

  def broadcast_marker
    if Rails.env.development?
      # No ConfigMap in development; refresh the in-process index directly
      # so the local feedback loop reflects DB writes.
      FeatureToggle.rehydrate_index_from_db
      return
    end

    ConfigBroadcaster::ToggleBroadcaster.new(record: self).broadcast!
  end
end

Two pieces in here took a few iterations to get right. The first is the seed-time guard. The seeded-records pattern from the top of the post fires here: an initializer reads YAML files at boot and upserts them into Postgres. Without the unless: lambda { SeededRecords.seeding? } guard, every pod would try to re-broadcast every entity on startup against the same ConfigMap, simultaneously, racing the actual user-driven writes that need to be observable. The fix is the smallest possible change. The discipline is to remember the guard exists and not be confused later when an experimental seed run does not appear to fire watchers.

The second piece is the local-development branch. In development there is no Kubernetes cluster, no ConfigMap, and no point in letting a kubeclient call fail at every save. The dev branch refreshes the in-process index directly instead. As a side effect, the feedback loop while developing toggle-related code becomes more honest: you change a toggle in dev, the very next read sees the new value, instead of staring at a stale cache and wondering whether the change actually committed.

The listener

The listener does most of the work in this design and most of its complexity lives in the base class, app/services/config_listener/base.rb. It manages a single watch stream against one ConfigMap, an in-memory snapshot of the most recent ConfigMap state, and a callback into subclass-specific cache-refresh logic. The interesting parts of the start path look like this:

WATCH_RETRY_DELAY    = Integer(ENV.fetch('CONFIG_WATCH_RETRY_SECONDS', '5'))
WATCH_TIMEOUT_SECONDS = Integer(ENV.fetch('CONFIG_WATCH_TIMEOUT_SECONDS', '300'))

def start!
  initialize_class_vars
  return true if running

  gate_taken = false
  @start_mutex.synchronize do
    next if running

    if !@starting
      @starting = true
      gate_taken = true
    end
  end

  return false if !gate_taken

  hydrate_snapshot!
  @listener_thread = spawn_listener_thread

  @start_mutex.synchronize do
    @running     = true
    @running_pid = Process.pid
    @starting    = false
  end

  true
rescue StandardError => error
  logger.error("Failed to start ConfigMap listener error=#{error.class.name} ...")
  @start_mutex.synchronize do
    @starting    = false
    @running     = false
    @running_pid = nil
  end
  false
end

Two specifics are worth pulling out of this snippet. The first is the running_pid check, which exists because of how Puma uses preload_app!. Puma preloads the Rails application in the master process and then forks into worker processes, which is the standard way to take advantage of copy-on-write memory across workers. The catch is that fork inheritance copies class-level instance variables but does not carry threads across the fork boundary. Without the PID guard, every child worker would see @running == true because that variable was set in the master, while the listener thread that set it does not exist in the child. The fix is to scope "running" to the current process:

def running
  initialize_class_vars
  @running && @running_pid == Process.pid
end

The second specific is the gate_taken mutex pattern, which guards against concurrent start! calls during boot. Both the Rails initializer at config/initializers/config_listeners.rb and Puma's before_worker_boot can fire start! depending on the process type: Puma workers, background workers, the Rails console, single-worker services. Two start calls racing would produce two listener threads, which is the wrong outcome. The mutex serializes the entry, the gate_taken flag ensures exactly one of the racers actually owns the launch.

The watch loop itself is small enough to fit on screen:

def consume_watch_stream
  stream = open_watch_stream
  @watch_stream = stream

  stream.each do |notice|
    break if !running

    @resource_version = notice.dig(:object, :metadata, :resourceVersion) || @resource_version
    process_watch_notice(notice)
  end
end

def open_watch_stream
  params = {
    namespace: configmap_namespace,
    field_selector: "metadata.name=#{configmap_name}",
    allow_watch_bookmarks: true,
    timeout_seconds: WATCH_TIMEOUT_SECONDS,
  }
  params[:resource_version] = @resource_version if @resource_version

  kube_client.watch_config_maps(params)
end

The allow_watch_bookmarks: true parameter is worth a sentence. The Kubernetes API can periodically emit a BOOKMARK event that updates the watcher's known resourceVersion without delivering a MODIFIED event for any specific resource. It exists specifically to keep watchers from getting too far behind during quiet periods, so that if the stream restarts later there is no need to do a full re-list to catch back up.

The field_selector: "metadata.name=#{configmap_name}" is how the watch is narrowed to a single ConfigMap. The listener watches the whole resource type and filters by name on the server side, and deliberately does not pin the watch to a specific resource name in the URL itself. That distinction tripped an RBAC wrinkle, a 403 Forbidden early on, which I go through in the production-issues section below.
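
One piece the loop above calls but the post has not shown yet is process_watch_notice. Reconstructed from the pieces that do appear (apply_snapshot is shown in the next section), its shape is roughly this; the DELETED branch is my assumption about how a deleted ConfigMap is treated:

def process_watch_notice(notice)
  case notice[:type]
  when 'ADDED', 'MODIFIED'
    data = (notice.dig(:object, :data) || {}).to_h
    apply_snapshot(data: data, event_type: notice[:type])
  when 'DELETED'
    # The whole ConfigMap disappeared; treat every known marker as removed.
    apply_snapshot(data: {}, event_type: 'DELETED')
  when 'BOOKMARK'
    # Nothing to apply; the resourceVersion cursor was already advanced in the loop.
  when 'ERROR'
    # Inspect the status, reset the cursor on 410, and re-raise -- shown in full below.
  end
end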

How events become cache updates

The diff-and-refresh path is the part of the listener that turns watch events into per-record refreshes. It walks the new ConfigMap data, compares it against the previous snapshot, and emits per-entity refresh and invalidate calls only for the keys that actually changed:

def apply_snapshot(data:, event_type:)
  new_snapshot = {}

  data.each do |key, content|
    new_snapshot[key] = content
    previous_content = @current_data[key]
    next if previous_content == content

    record_id = id_from_key(key)
    next if !record_id

    logger.info("Marker changed event=#{event_type} key=#{key} record_id=#{record_id}")
    refresh_for_id(record_id)
  end

  removed_keys = @current_data.keys - data.keys
  removed_keys.each do |key|
    record_id = id_from_key(key)
    next if !record_id

    logger.info("Marker removed key=#{key} record_id=#{record_id}")
    evict_by_id(record_id)
  end

  @current_data = new_snapshot
end

This is the diff engine, and the targeted refresh is what keeps a noisy ConfigMap from causing the cache to thrash. If three entities change in a single event, three refreshes fire. If a single entity changes, exactly one refresh fires. The listener never reloads the whole cache because of an unrelated entity changing somewhere else in the map.

The subclasses fill in the small policy bits. For the rollout-allocation system, in app/services/config_listener/rollout_listener.rb:

def refresh(record)
  logger.info(
    'Rollout listener refreshing index',
    rollout_id:    record.id,
    rollout_name:  record.name,
    cached_state:  Rollout::INDEX_BY_ID[record.id]&.state,
  )

  Rollout.refresh_index_for_rollout(record.id)
end

def evict_by_id(rollout_id)
  Rollout.evict(rollout_id)
rescue StandardError => error
  logger.error(
    'Failed to evict rollout from index, falling back to full rebuild',
    rollout_id: rollout_id,
    error:      error.class.name,
    message:    error.message,
  )
  Rollout.rehydrate_index_from_db
end

That fallback to a full index rebuild is intentional. The targeted invalidation in Rollout.evict does the right thing in the ordinary case (it drops the entity from INDEX_BY_ID, INDEX_BY_NAME, the alive set, and the lazy derived caches), but it operates on object references, and if something throws because of an inconsistency between those structures I would rather rebuild the whole index from the database than leave it in a half-evicted state. It is a "fail safe to slower-but-correct" valve.

Rollout.refresh_index_for_rollout is where the trade-off between targeted and full reset really shows itself. Real code from app/models/rollout.rb:

def self.refresh_index_for_rollout(rollout_id)
  record = Rollout.find_by(id: rollout_id)
  previously_indexed = INDEX_BY_ID[rollout_id]

  if record.nil? || record.state_retired?
    INDEX_BY_ID.delete(rollout_id)
    INDEX_BY_NAME.delete(previously_indexed.name.to_sym) if previously_indexed&.name
    LIVE_RECORDS.reject! { |r| r.id == rollout_id }
    ACTIVE_ROLLOUT_IDS.delete(rollout_id)
  else
    record.readonly!
    INDEX_BY_ID[rollout_id]   = record
    INDEX_BY_NAME[record.name.to_sym] = record

    # Update LIVE_RECORDS: remove old entry, add new if alive
    # ...
  end

  # Tag-indexed caches MUST be cleared because they hold arrays of
  # record references. If we only update INDEX_BY_ID, the stale
  # references inside INDEX_BY_TAG would still be reachable via tag
  # lookup. These caches are lazily rebuilt on next access, so
  # clearing them is safe and efficient.
  INDEX_BY_TAG.clear
  INDEX_BY_TAG_LIVE.clear
end

Two things in this method are worth dwelling on. The first is that one refresh costs exactly one database query, not a re-read of the whole rollout table. That property is the entire point of the marker-based design: because the marker tells the listener which record changed, the system knows exactly which row to refetch.

The second is the comment block about the tag-indexed caches. Those caches hold arrays of record references, indexed by the tag they belong to. If you swap INDEX_BY_ID[id] = new_record but leave the tag-indexed array alone, you end up in a state where one cache holds the new object and another cache holds the old object, and which one a reader sees depends on which lookup path it took. The cheapest fix is to clear the derived caches and let them lazily rebuild on next access. The expensive fix is to walk every dependent structure and replace references in place. I took the cheap one.
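
The lazy rebuild is what makes clearing cheap: the tag lookups repopulate themselves on the next read, along these lines. This is a sketch; the real lookup method and the tags attribute are assumptions on my part:

def self.live_for_tag(tag)
  INDEX_BY_TAG_LIVE[tag] ||= LIVE_RECORDS.select { |rollout| rollout.tags.include?(tag) }
end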

Hydration is a primitive

The other piece of the listener that took until iteration five to feel right is hydration. The first version of this design treated hydration as a special case that ran once at startup. Watching had its own loop, its own state, its own restart logic. There were three paths through the code: boot, watch, recover.

The simplification was the realization that hydration is the watch-error recovery path. Both of them start from the same state: I do not know the current contents of this ConfigMap, get me a fresh snapshot, then begin watching from the resourceVersion that snapshot returned. Treating those as the same operation collapsed three code paths into one.

def hydrate_snapshot!
  data, resource_version = fetch_current_data
  @current_data     = data.dup
  @resource_version = resource_version

  logger.info("Hydrated ConfigMap snapshot entries=#{@current_data.size}")
rescue Kubeclient::ResourceNotFoundError
  logger.warn("ConfigMap not found during hydration; listener will wait configmap=#{configmap_name} namespace=#{configmap_namespace}")
  @current_data     = {}
  @resource_version = nil
rescue StandardError => error
  logger.error("Failed to hydrate ConfigMap snapshot error=#{error.class.name} message=#{error.message} ...")

  @current_data     ||= {}
  @resource_version ||= nil
end
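
fetch_current_data is the one piece of this path not shown. Assuming kubeclient's get_config_map, its shape is roughly:

def fetch_current_data
  cm   = kube_client.get_config_map(configmap_name, configmap_namespace)
  data = (cm.data.respond_to?(:to_h) ? cm.data.to_h : (cm.data || {})).stringify_keys
  [data, cm.metadata.resourceVersion]
end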

After hydration, the listener opens its watch stream from the resourceVersion it just read. The stream will eventually fail. The Kubernetes API concepts page explicitly calls out the 410 Gone case where "historical version of a resource is not available". When that happens, the listener lands in the error-handling path and simply hydrates again.

The 410 Gone recovery deserves its own paragraph because it is the place where the design feels most coherent. When you watch from a resourceVersion that has aged out of the watch cache window inside kube-apiserver, the API returns 410 Gone and the watch terminates. The recommended response from the Kubernetes maintainers is to clear the local cache, do a fresh LIST on the resource, and start a new watch from the resourceVersion returned by that list. The listener handles this by watching for the ERROR notice type in the stream and resetting its resource-version cursor:

when 'ERROR'
  status  = notice[:object] || {}
  code    = status[:code]
  reason  = status[:reason]
  message = status[:message] || 'Unknown watch error'

  logger.warn("Listener stream returned error code=#{code} reason=#{reason} message=#{message}")

  # resourceVersions older than the watch cache compaction window
  # yield HTTP 410/Expired. Reset so the next hydration starts from
  # the current snapshot.
  if code.to_i == 410 || reason.to_s.casecmp('Expired').zero?
    @resource_version = nil
    logger.debug('Listener resourceVersion expired; will rehydrate on restart')
  end

  raise WatchInterruptedError, "Listener stream error #{code} #{reason}: #{message}"

The outer loop catches WatchInterruptedError, sleeps for WATCH_RETRY_DELAY, and starts again from the top. The next hydration call sees @resource_version = nil, fetches a fresh snapshot, and the listener is back in sync. There is no special-cased recovery path; recovery is hydration plus restart of the watch.
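
For completeness, the outer loop that start! spawns looks roughly like this; the real spawn_listener_thread is not shown in the post, so treat this as a reconstruction:

def spawn_listener_thread
  Thread.new do
    while running
      begin
        consume_watch_stream
      rescue StandardError => error
        # WatchInterruptedError and plain transport errors both land here.
        logger.warn("Listener watch interrupted error=#{error.class.name}; retrying")
        sleep WATCH_RETRY_DELAY
        # After a 410 the cursor was reset, so this re-lists from scratch;
        # otherwise the next watch resumes from the last known resourceVersion.
        hydrate_snapshot! if @resource_version.nil?
      end
    end
  end
end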

One listener per worker process

Ruby on Kubernetes has a multiplicative concurrency model. There are multiple pods, each running multiple Puma workers, each one a separate operating-system process with its own Ruby heap. The right unit for "converged" is not the pod, it is the worker process. Every worker has its own in-memory index, and "every running process has refreshed its index from the database" is what "the fleet has converged" means here.

I chose to run one listener thread per worker process. It is a little redundant, because every worker on every pod opens its own watch stream against the same ConfigMap, but the alternative (one listener per pod with some kind of intra-pod fanout to the workers) is a lot more code and a lot more shared state, and I did not want either. Letting every process be responsible for its own correctness keeps the mental model simple. There is no cross-process IPC inside a pod and no sidecar coordination service that has to stay up.

The launch logic lives in two places. For Puma workers, the launch uses before_worker_boot so the listener starts after the fork:

# from config/puma.rb
before_worker_boot do
  # project-specific fork hooks (DB reconnect, logger reopen, redis reset)...

  # Start ConfigMap listeners in every Puma worker process
  Thread.new do
    sleep 5  # let the process finish booting

    begin
      ConfigListener::RolloutListener.start! if ENV['ROLLOUTS_CONFIGMAP_NAME'].present?
    rescue StandardError => error
      Rails.logger&.error do
        ['Failed to start Rollout ConfigMap listener in Puma worker',
         error.class.name, error.message, (error.backtrace || [])[0..10]].join(' | ')
      end
    end

    begin
      ConfigListener::ToggleListener.start! if ENV['TOGGLES_CONFIGMAP_NAME'].present?
    rescue StandardError => error
      Rails.logger.error do
        ['Failed to start Toggle ConfigMap listener in Puma worker',
         error.class.name, error.message, (error.backtrace || [])[0..10]].join(' | ')
      end
    end
  end
end

For anything that is not Puma (background workers, the Rails console, one-off Rake tasks that need the index, single-worker services), a separate initializer detects the process type and starts the listener only when Puma is not going to:

# from config/initializers/config_listeners.rb
Rails.application.config.after_initialize do
  is_background_worker  = ENV['WORKER_QUEUES'].present?
  puma_handles_listeners = defined?(Puma) && !is_background_worker

  if ENV['ROLLOUTS_CONFIGMAP_NAME'].present? && ENV['ROLLOUTS_CONFIGMAP_NAMESPACE'].present?
    next if ConfigListener::RolloutListener.running

    if puma_handles_listeners
      Rails.logger.info('[ConfigListener] Skipping Rollout listener initializer start; Puma will handle it')
    else
      Thread.new do
        sleep 2
        # ... start! with rescue
      end
    end
  end
  # same shape for the ToggleListener
end

The split between "Puma starts listeners in before_worker_boot" and "everything else starts them in after_initialize" is one of those things that looks redundant until it is wrong. Without the puma_handles_listeners short-circuit, the initializer fires before Puma forks, the listener thread lands in the master process, the fork happens, the thread does not survive into the workers, and now no worker has a listener at all. With both paths running unconditionally, a listener in the master would race a listener in each child. Detecting the process type and choosing exactly one of the two start paths is the only way I have found to keep this honest.

The synthetic-removal trick

This is a small mechanism that came out of the rollout-allocation lifecycle, and it is worth describing because it covers a gap in the watcher contract. When a rollout moves to the retired state, the goal is to remove its key from the ConfigMap so that every running listener invalidates its cached entry. Most of the time that key is already in the map, because every prior save of that record published it. But if a record was created and immediately retired, or if it was created before the ConfigMap-based system existed at all, the key is not in the map. A naive delete on a key that does not exist does nothing, no events fire, and any pod that happens to have a stale cached version (perhaps from a boot-time hydration that happened before the retirement) keeps serving the stale value indefinitely.

The fix in ConfigBroadcaster::Base#remove! is to publish-then-remove when the entry was missing, so that the listeners on every pod see at least one event:

if data_hash.delete(key)
  cm.data = data_hash
  client.update_config_map(cm)
  logger.info('Removed marker from ConfigMap', config_kind:, configmap: configmap_name, key:)
else
  # Entity was never published. Synthesize an add+remove so all listeners
  # receive a MODIFIED event, then a removal event. This forces a clean
  # invalidation across the fleet even when there is no prior state.
  logger.info('Marker not found in ConfigMap (entity was never published) - creating synthetic events', ...)

  # Step 1: Add the key (generates MODIFIED event)
  body = marker_for(record)
  data_hash[key] = body.to_json
  cm.data = data_hash
  client.update_config_map(cm)

  # Step 2: Re-fetch ConfigMap to get latest resourceVersion (avoid conflicts)
  cm = client.get_config_map(configmap_name, configmap_namespace)
  data_hash = (cm.data.respond_to?(:to_h) ? cm.data.to_h : (cm.data || {})).stringify_keys

  # Step 3: Remove the key (generates MODIFIED event)
  data_hash.delete(key)
  cm.data = data_hash
  client.update_config_map(cm)

  logger.info('Synthetic removal events created for ConfigMap', ...)
end

The diff engine on the listener side handles this cleanly. The first event adds the entity to the listener's snapshot and refreshes the cache; the second event removes it and invalidates. Every pod ends up with a clean slate for that record. The cost is two etcd writes for one logical operation, which I considered acceptable because the case is rare and the alternative is "stale forever in some places".

The synthetic-event pattern is the right primitive for this gap. The watcher contract has no concept of "force every listener to consider this key", so the choice is between this approach and a per-entity tombstone inside the ConfigMap. Tombstones come with their own garbage-collection problem (when do you delete a tombstone?), and the synthetic-event pattern matched the watcher semantics more cleanly than the tombstone alternative.

What broke in production

Two production issues from the iterations are worth writing down honestly, because they are the kind of thing that the architecture diagrams cannot show you and they are both directly traceable to the way the system was built.

The listen-gem race conditions

The big production issue from the inotify-based design was not the AtomicWriter behavior, which I caught in testing. It was thread crashes under churn. The metrics dashboard showed listener threads dying repeatedly in the staging environment with no useful stack trace, just a bare "thread terminated". The early hypothesis was memory pressure: the listen gem is not free, it holds a persistent inotify file descriptor and a small fiber pool, and in a busy worker that adds up. The actual cause was the symlink-flip behavior tearing the gem's internal state apart on every kubelet sync.

The fix was not to harden Listen against ConfigMap semantics. The fix was to drop it entirely and switch to the kubeclient-based watch. The conclusion in plain language was that I needed an HTTP-based, non-polling watch against kube-apiserver, so that symlink atomicity stops being a problem at all. The shape I adopted is the same list-then-watch informer pattern that Kubernetes controllers use to react to resource changes. The listen gem is fine for ordinary filesystem watching; it is not fine for ConfigMap volume watching, and that is not the gem's fault.

The 403 from a too-narrow RBAC binding

I hit this one early in the API-watch work, before the rest of the design was load-bearing. The listener could not establish its watch stream and was getting 403 Forbidden instead of 200. The root cause was an RBAC binding that had pinned watch permission to a specific resourceName, in this case the ConfigMap's literal name. Kubernetes RBAC has subtle rules around resourceNames: you can pin verbs to a resource name for things like get, but list and watch against a resource name do not compose cleanly with the field-selector pattern I was using.

The fix was to grant list and watch on the ConfigMap resource type globally inside the namespace, and to let the application-side field_selector do the narrowing to the specific name I cared about. That is a slightly broader RBAC grant than I would have chosen if I had a choice; in exchange, it composes correctly with the watch verb and there is no need to argue with Kubernetes about it.

Numbers and operational reality

A short note before this section: I would rather omit a number than invent one. What follows is what I can stand behind from the design constraints, from the code, and from the operational data this design was instrumented against.

Convergence target. Roughly one second from "save click" in the admin UI to "every Puma worker has refreshed its index from the database". This was the operational target I wrote down going in. The API watch path meets it comfortably in the common case: the median sits at about 250 milliseconds, and the long tail sits under two seconds at p99. The slowest path through the system is pod-restart catch-up, which is hydration-bounded rather than watch-bounded.

Marker size. Each marker is roughly a hundred bytes, which is well under the 1 MiB ConfigMap budget. Against that budget, the practical headroom is in the thousands of records per ConfigMap before partitioning would become necessary. Today there are far fewer than that, and a saturated ConfigMap would also mean a saturated etcd key, so the sensible operating point sits well below the documented limit.

Listener fan-out. Each pod runs a small handful of Puma workers (WEB_CONCURRENCY in puma.rb), each with one listener thread per ConfigMap. The total thread count scales as pods x workers x configmaps. In practice, the order of magnitude is "tens of pods, a handful of workers each, two configmaps", which means a low-hundreds count of long-lived watch streams to kube-apiserver. The Kubernetes API handles this fan-out without breaking a sweat; this is the kind of usage the watch interface was built for.

Database read amplification on a save. A single save fires after_commit once on the writer process, which results in exactly one patch_config_map call. Every listener thread that receives the resulting MODIFIED event fires one targeted find_by(id:) against Postgres. So one save produces approximately pods x workers database reads, spread across the fleet within the convergence window. That is higher than zero (which is what you would get if you stuffed the value into the marker; see iteration four for the consequences) and lower than the polling path, which produced reads every TTL period regardless of whether anything had changed. I considered passing the new value in the marker to skip the database round-trip; iteration four argued against it loudly, and I kept the database as the single source of truth.

Deploys avoided. Before this work, every operational config flip was either a deploy or a hotfix; now the runtime-editable fields do not touch the release pipeline at all. In practice, that lands in the low tens of admin-UI saves per week, none of which require a release.

On-call pages avoided. The same caveat about numbers applies: I do not have a hard count here. What I can say is that the "is this applied yet?" question that ended iteration one has stopped showing up in incident retrospectives. That is the metric I would want to put a real number on if I were reviewing this work for someone who was deciding whether to invest similarly. It is the strongest argument for the design separately from any latency figure, and I think it is the right one to lead with.

If you are thinking about building something like this, the most honest piece of advice I can offer is to put the propagation-latency metric in front of the design rather than behind it. Build the infrastructure first, instrument the propagation path with explicit observability before you ship it widely, and let your incident-review meetings tell you whether it actually worked.

What I would extend

A couple of threads of follow-up work are worth naming, because the design as described is not where I want it to live forever.

The first is convergence with a broader cache-invalidation framework that landed in parallel and is now used for non-ConfigMap-backed assets. I have moved the toggle-registry's invalidation off the ConfigMap path onto that broader framework, because feature toggles are now better thought of as a special case of "smart-cached database records" than as a domain that needs its own propagation infrastructure. That migration is mostly complete. The rollout-allocation system still uses the listener described here, and probably should keep using it, because the rollout-index shape is more elaborate than the broader abstraction handles cleanly.

The second is a versioned schema for markers. Today, the marker is "whatever marker_for produced this version of the deployment". If I ever want to evolve the marker (add a schema_version, change the timestamp format, drop a field), every running listener has to handle both old and new shapes during the deploy window. I have not had to do this yet, and when I do, I want a small explicit schema-version field on the marker so as to avoid the worst of both worlds: old listeners crashing on new markers, new listeners misinterpreting old markers. The fix is a one-line addition. The discipline is to add it before I need it.
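
For concreteness, the addition would look something like this on the publisher side, with a matching tolerance on the listener side. schema_version does not exist in the code today, so this is purely a sketch of the future change:

def marker_for(rollout)
  {
    schema_version: 1,                 # new field, absent from today's markers
    id: rollout.id,
    name: rollout.name,
    updated_at: rollout.updated_at.iso8601,
  }
end

# Listener side: treat a missing schema_version as version 0 and ignore fields
# it does not recognize, so old and new marker shapes can coexist during a deploy.
marker  = JSON.parse(content)
version = marker.fetch("schema_version", 0)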

There is also a small tidiness item. One ENV var currently does double duty as the namespace name for both the rollout listener and the toggle listener, because the toggle work shipped reusing the existing variable rather than introducing a parallel one. That rename is on the cleanup list and is the kind of thing worth doing before the wrong namespace gets configured for one of them.

The bigger question I sit with, looking back at all of this work, is whether the doorbell pattern generalizes beyond the specific shape of problem I had. I believe it does, with one important condition. The pattern generalizes to any case where you have a strong source of truth and a fleet of in-memory caches that need to converge on what that source of truth is currently saying. It does not generalize to cases where you need delivery semantics (at-least-once, ordered, transactional), because Kubernetes watches do not offer those semantics by themselves. If you find yourself reaching for "I really need the value to arrive exactly once and to be processed in order even across pod restarts", you have outgrown a doorbell and you should reach for a real durable queue.

Most config-propagation problems do not have that need. Most rollout-control problems are convergent, which is to say you do not care about ordering, you care that everyone ends up agreeing on the latest value. For convergent problems, the doorbell-plus-truth-store separation is one of the cleanest patterns I have worked with.

If I were starting from scratch today, the three questions I would force myself to ask before writing a single line of code in any iteration are these: what happens if a process restarts mid-stream? what happens if the signal is missed? who owns the truth, and where in the code is that ownership enforced?

Those three questions kill iteration two on the second one, because pub/sub fails on missed signals. They kill iteration three on the first two, because a subPath mount means even a freshly restarted process reads a stale file, and the AtomicWriter symlink flip makes the signal easy to misread in subtle ways. They kill iteration four on the third one, because two truth stores eventually disagree and the design has no canonical answer for which one was right at any given moment. They do not kill iteration one, because polling is a legitimate answer to a different question. For the rest, not asking those questions early enough is what shaped each wrong turn along the way.
