Zero-downtime secret rotation in Python

The hard part of rotation is not generating a new credential; it is swapping it into a running service without failing a single request. The approach is a refreshing cache plus an overlap window during which both the old and the new credential are accepted. This page implements that approach, extending Automated Secret Rotation Patterns.

Problem 1: restart-to-rotate

# ANTI-PATTERN: the only way to pick up a new secret is a redeploy
SECRET = fetch_secret()    # module-level; rotation requires restarting every pod

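A module-level secret is read once, at import time, so every rotation requires restarting every process. That means a manual step and, unless the restart is rolling, downtime.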

Problem 2: a race on refresh

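# ANTI-PATTERN: every thread that sees the expiry refetches at the same time
def get_secret():
    global SECRET
    if expired:                    # placeholder for the TTL check
        SECRET = fetch_secret()    # no lock: duplicate fetches, last writer wins
    return SECRET

Without a lock, every request that notices the expiry calls the secret store at once, and a slow, older fetch can overwrite a newer value. Rebinding a single name is atomic in CPython, so the risk is not a torn string; but if the credential is more than one field, such as a username and password assigned separately, readers can also see a half-updated pair.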
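
The cost of the unlocked refresh can be observed with a small experiment. The sketch below is illustrative only: fetch_secret here is a hypothetical stand-in that sleeps to mimic store latency, and the barrier exists purely to make the threads arrive together.

```python
import threading
import time

fetch_calls = 0
count_lock = threading.Lock()      # guards the counter only, not the refresh
barrier = threading.Barrier(8)     # releases all 8 threads at the same moment
SECRET = None

def fetch_secret() -> str:
    """Hypothetical stand-in for a slow call to the secret store."""
    global fetch_calls
    with count_lock:
        fetch_calls += 1
    time.sleep(0.2)                # simulated network latency
    return "s3cr3t"

def get_unlocked() -> str:
    global SECRET
    barrier.wait()
    if SECRET is None:             # several threads pass this check together
        SECRET = fetch_secret()    # ...and each of them hits the store
    return SECRET

threads = [threading.Thread(target=get_unlocked) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"store calls for one expiry: {fetch_calls}")
```

On a typical run most or all of the threads reach the store, so the count is well above one; the exact number depends on scheduling.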

Secure implementation

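# secrets/refresh.py
import threading
import time
from typing import Callable

from pydantic import SecretStr

class RefreshingSecret:
    """Caches a secret and refetches it at most once per TTL expiry."""

    def __init__(self, fetch: Callable[[], SecretStr], ttl: int = 300):
        self._fetch, self._ttl = fetch, ttl     # keep ttl < overlap window
        self._value: SecretStr | None = None
        self._at = 0.0                          # monotonic time of the last fetch
        self._lock = threading.Lock()

    def get(self) -> SecretStr:
        now = time.monotonic()
        if self._value is None or now - self._at > self._ttl:    # cheap check, no lock
            with self._lock:                     # one refresh under contention
                # re-check: another thread may have refreshed while we waited
                if self._value is None or time.monotonic() - self._at > self._ttl:
                    self._value, self._at = self._fetch(), time.monotonic()
        return self._value

def on_rotation(pool, secret: RefreshingSecret) -> None:
    # pool.recreate() stands in for whatever graceful-reload hook your pool exposes.
    # get() returns the cached value until the TTL expires, so call this after a refresh.
    pool.recreate(password=secret.get().get_secret_value())   # graceful pool reload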

The double-checked lock lets only one thread refresh per expiry while the others wait and reuse its result. The TTL stays below the store’s overlap window, so the old credential is still valid while the cache catches up. Pools are recreated rather than torn down mid-request.
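
To check the single-refresh claim, the same double-checked pattern can be run under contention. This is a minimal sketch, not the module above: RefreshingValue is a simplified stand-in that stores a plain str so the example has no pydantic dependency, and slow_fetch is a hypothetical store call.

```python
import threading
import time

class RefreshingValue:
    """Simplified stand-in for RefreshingSecret, holding a plain str."""

    def __init__(self, fetch, ttl: float = 300.0):
        self._fetch, self._ttl = fetch, ttl
        self._value = None
        self._at = 0.0
        self._lock = threading.Lock()

    def get(self):
        if self._value is None or time.monotonic() - self._at > self._ttl:
            with self._lock:
                # Re-check inside the lock: a waiting thread finds the fresh value.
                if self._value is None or time.monotonic() - self._at > self._ttl:
                    self._value, self._at = self._fetch(), time.monotonic()
        return self._value

calls = []

def slow_fetch() -> str:
    """Hypothetical store call; records each invocation."""
    calls.append(1)
    time.sleep(0.05)               # simulated network latency
    return "v1"

cache = RefreshingValue(slow_fetch)
barrier = threading.Barrier(8)     # make all threads call get() together
results = []

def worker() -> None:
    barrier.wait()
    results.append(cache.get())

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls), set(results))
```

Eight threads contend for the first value, one performs the fetch, and the other seven block on the lock and then return the cached result.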

Gotchas & version-specific behaviour

  • The cache TTL must be shorter than the credential overlap window or requests fail at the cutover.
  • Use double-checked locking so a refresh under load does not stampede the secret store.
  • Recreate connection pools on change; do not close active connections abruptly.
  • Use time.monotonic() for all timing so wall-clock adjustments cannot stretch or shorten the TTL.
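
The pool advice above can be sketched as a swap-then-drain: new requests get a pool built with the new credential, and the old pool closes only after its in-flight work has been returned. Pool and PoolHolder are hypothetical classes written for this illustration, not the API of any real driver; a real pool would hold connections rather than a counter.

```python
import threading
from contextlib import contextmanager

class Pool:
    """Hypothetical pool: counts in-flight checkouts and closes once drained."""

    def __init__(self, password: str):
        self.password = password
        self.in_flight = 0
        self.retired = False
        self.closed = False
        self._lock = threading.Lock()

    def checkout(self) -> None:
        with self._lock:
            self.in_flight += 1

    def checkin(self) -> None:
        with self._lock:
            self.in_flight -= 1
            self._close_if_drained()

    def retire(self) -> None:
        with self._lock:
            self.retired = True
            self._close_if_drained()

    def _close_if_drained(self) -> None:
        # Close only when retired AND nothing is still using the pool.
        if self.retired and self.in_flight == 0:
            self.closed = True

class PoolHolder:
    """Hands out the current pool and swaps it on rotation."""

    def __init__(self, password: str):
        self._pool = Pool(password)
        self._lock = threading.Lock()

    @contextmanager
    def connection(self):
        with self._lock:               # read and check out atomically
            pool = self._pool
            pool.checkout()
        try:
            yield pool
        finally:
            pool.checkin()

    def rotate(self, new_password: str) -> None:
        with self._lock:               # new requests see the new pool from here on
            old, self._pool = self._pool, Pool(new_password)
        old.retire()                   # closes now, or when the last user checks in

holder = PoolHolder("old-password")

with holder.connection() as old_pool:          # a request is in flight
    holder.rotate("new-password")              # rotation happens mid-request
    closed_mid_request = old_pool.closed       # the old pool is still open here

with holder.connection() as new_pool:          # the next request gets the new pool
    new_password = new_pool.password
```

Reading the pool reference and checking out under the same lock is the design choice that matters: it prevents a pool from being retired between the moment a request picks it and the moment the request is counted as in flight.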

Production parity checklist

  • Rotation needs no restart — the refreshing cache picks up new values.
  • Cache TTL is below the overlap window.
  • Refresh is thread-safe (double-checked lock).
  • Pools reload gracefully on credential change.
  • A staging drill forces rotation and asserts zero failed requests.
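
The last item can be rehearsed in miniature before it is run against staging. The sketch below simulates the drill with a fake clock; FakeClock, FakeStore, Cache and drill are all hypothetical names for this illustration, the cache is single-threaded for brevity, and a real drill would replay traffic against the actual store.

```python
class FakeClock:
    """Simulated monotonic clock, advanced manually by the drill."""
    def __init__(self):
        self.now = 0.0
    def __call__(self) -> float:
        return self.now

class FakeStore:
    """Hypothetical store: after rotate(), the previous credential
    stays valid for `overlap` seconds."""
    def __init__(self, overlap: float, clock):
        self._overlap, self._clock = overlap, clock
        self._current, self._previous = "cred-1", None
        self._rotated_at = 0.0
        self._version = 1

    def rotate(self) -> None:
        self._version += 1
        self._previous, self._current = self._current, f"cred-{self._version}"
        self._rotated_at = self._clock()

    def fetch(self) -> str:
        return self._current

    def accepts(self, cred: str) -> bool:
        if cred == self._current:
            return True
        in_overlap = self._clock() - self._rotated_at <= self._overlap
        return cred == self._previous and in_overlap

class Cache:
    """Single-threaded TTL cache with an injectable clock (no lock needed here)."""
    def __init__(self, fetch, ttl: float, clock):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._value, self._at = None, 0.0

    def get(self) -> str:
        if self._value is None or self._clock() - self._at > self._ttl:
            self._value, self._at = self._fetch(), self._clock()
        return self._value

def drill(ttl: float, overlap: float, rotate_at: int = 10, steps: int = 300) -> int:
    """Send one request per simulated second, rotate once, count rejections."""
    clock = FakeClock()
    store = FakeStore(overlap, clock)
    cache = Cache(store.fetch, ttl, clock)
    failures = 0
    for step in range(steps):
        clock.now = float(step)
        if step == rotate_at:
            store.rotate()
        if not store.accepts(cache.get()):
            failures += 1
    return failures

safe_failures = drill(ttl=30, overlap=60)      # TTL below the overlap window
unsafe_failures = drill(ttl=120, overlap=60)   # TTL above the overlap window
print(safe_failures, unsafe_failures)
```

With the TTL below the overlap window the cache refreshes while the old credential is still accepted, so no request is rejected. With the TTL above it, the cached credential outlives the window and requests fail until the next refresh.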

Conclusion

A thread-safe refreshing cache with a TTL below the overlap window turns rotation into a non-event: no restart, no failed request. For the overlap-window mechanics in the store, see Automated Secret Rotation Patterns.