Resilience documentation
Understand the drift window
The contract that bounds how stale a replica is allowed to be relative to the authoritative source — and why it's the load-bearing decision for a deployment's risk posture.
The drift window is the maximum permitted staleness of a replica relative to the upstream state it consumes. Every Authonomy instance maintains a canonical view that is a replica of the customer’s authoritative identity source (and, in database-resident topologies, a replica of the credential store). The drift window is the contract that bounds how out-of-date that replica is allowed to be.
Picking the right drift window is the single most consequential operational decision a deployment makes. Everything else — the failover ladder, the placement, the enrollment policy — sits inside the envelope the window defines.
What the window guarantees
The contract states four guarantees, applied to both the identity stream and the credential stream. Stated for identity:
Provisioning reaches replicas within the window. A user provisioned at the authoritative source becomes authenticable against every replica within the window.
Deprovisioning reaches replicas within the window. A user deprovisioned (or disabled) at the authoritative source loses the ability to authenticate against every replica within the window.
Attribute changes reach replicas within the window. Changes affecting authentication or application access propagate within the window.
The window extends by the duration of severance. If a replica is severed from its authority for a period longer than the window — a WAN outage at a site, for example — the effective staleness during the severance is the window plus the severance duration. The contract is a sync-lag contract, not an outage-duration contract; it does not promise that a ten-hour severance carries a one-hour staleness. It promises that absent severance, staleness is bounded by the window, and staleness resumes converging to the window when severance ends.
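The fourth guarantee reduces to simple arithmetic, sketched below. This is only an illustration of the contract as stated above: the function name `effective_staleness_bound` and its parameters are hypothetical and not part of any Authonomy API.

```python
from datetime import timedelta


def effective_staleness_bound(
    window: timedelta, severance: timedelta = timedelta(0)
) -> timedelta:
    """Upper bound on replica staleness: the window, extended by any severance.

    Hypothetical helper for illustration; not a product API.
    """
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if severance < timedelta(0):
        raise ValueError("severance cannot be negative")
    return window + severance


# A one-hour window with a ten-hour severance does not carry one-hour staleness.
print(effective_staleness_bound(timedelta(hours=1), timedelta(hours=10)))  # 11:00:00
```

With no severance the bound is the configured window, which is the case the first three guarantees describe.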
What it doesn’t promise
The drift window doesn’t promise propagation faster than its width. If a deployment configures a four-hour window, a deprovisioning event at the authoritative source may take up to four hours to reach every replica. During those four hours, a terminated user’s existing sessions remain valid and a new authentication attempt at a replica that hasn’t yet absorbed the change will succeed.
This is the most important operational-risk statement in the design. It is named explicitly because every other resiliency decision sits on top of it.
The width is a deployment choice
Typical deployments operate with drift windows of one to four hours. The two endpoints of the range trade off in opposite directions:
Shorter windows (toward one hour) move state faster. They reduce the deprovisioning-lag surface and shorten the gap between provisioning a new user and that user being able to authenticate at a replica. They cost higher sync traffic, tighter health requirements on the sync engine, and more upstream load on the authoritative source.
Longer windows (toward four hours, or beyond) lighten sync traffic but widen the lag surface. They are supportable, but two consequences are explicitly named for any deployment considering them:
- Deprovisioning lag grows. A terminated employee can retain the ability to authenticate at replicas for up to the full window after the authoritative source records the termination. Deployments whose threat model requires faster deprovisioning should configure a shorter window or invoke targeted sync (see below) for high-risk deprovisioning events.
- Authentication gaps for newly provisioned users lengthen. A newly created user cannot authenticate at a replica until the replica has absorbed the provisioning event. Deployments that frequently onboard users who must authenticate quickly should configure a shorter window or invoke targeted sync for onboarding events.
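The policy choice described above can be framed as a per-event comparison between the configured window and the lag a deployment is willing to tolerate for that kind of event. The sketch below assumes a hypothetical `choose_path` helper, and the tolerable-lag values are invented for illustration; neither comes from the product.

```python
from datetime import timedelta
from enum import Enum


class Path(Enum):
    ROUTINE = "routine"    # wait for the regular sync cadence
    TARGETED = "targeted"  # operator-initiated targeted sync


def choose_path(window: timedelta, max_tolerable_lag: timedelta) -> Path:
    """Routine sync suffices only when the window fits inside the tolerable lag."""
    return Path.ROUTINE if window <= max_tolerable_lag else Path.TARGETED


window = timedelta(hours=4)
# Illustrative tolerances only: 15 minutes for a termination, 8 hours for a routine change.
print(choose_path(window, timedelta(minutes=15)))  # Path.TARGETED
print(choose_path(window, timedelta(hours=8)))     # Path.ROUTINE
```

If most event types come out as targeted, that is a signal the everyday window is too wide for the deployment's threat model.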
The right window for a deployment is the result of a policy choice, not a default.
The escape hatch: targeted sync
A targeted sync is an operator-initiated push of a specific change — an urgent deprovisioning, an urgent onboarding, an emergency credential revocation — independent of the incremental cadence. It propagates the change immediately rather than waiting for the next sync cycle. Targeted syncs are recorded in the audit trail with the operator’s identity and the reason given.
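As a rough sketch of what such an audit entry might carry, the example below models a record holding the operator's identity, the stated reason, and the kind of change. Every name here (`TargetedSyncRecord`, `ChangeKind`, `record_targeted_sync`, the `subject` field) is an assumption made for illustration; the actual audit schema is not described in this document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ChangeKind(Enum):
    DEPROVISIONING = "deprovisioning"
    ONBOARDING = "onboarding"
    CREDENTIAL_REVOCATION = "credential_revocation"


@dataclass(frozen=True)
class TargetedSyncRecord:
    """Hypothetical audit entry for one targeted sync."""
    operator: str
    reason: str
    kind: ChangeKind
    subject: str  # assumed identifier of the affected user or credential
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def record_targeted_sync(
    audit_trail: list, operator: str, reason: str, kind: ChangeKind, subject: str
) -> TargetedSyncRecord:
    """Append an audit entry, refusing requests without an operator or a reason."""
    if not operator.strip():
        raise ValueError("operator identity is required")
    if not reason.strip():
        raise ValueError("a reason is required")
    record = TargetedSyncRecord(operator, reason, kind, subject)
    audit_trail.append(record)
    return record


trail: list = []
record_targeted_sync(
    trail, "alice", "termination for cause", ChangeKind.DEPROVISIONING, "user-123"
)
print(len(trail), trail[0].kind.value)  # 1 deprovisioning
```

Rejecting an empty reason at record time keeps the audit trail useful when targeted syncs are reviewed later.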
Targeted sync is the lever that lets a deployment configure a longer everyday drift window while still handling individual high-risk events at human-operational speed. A four-hour window with targeted sync available for emergencies is operationally distinct from a four-hour window with no escape hatch.
During severance
How the window behaves during severance is the most frequently asked question in any review. Stated plainly: the window grows by the duration of severance for each frozen stream. If the WAN at a site is severed for two hours and the configured window is one hour, the staleness bound at that site climbs from one hour when the severance begins to three hours when it ends. A deprovisioning issued at the authoritative source while the site is severed reaches the site only when the WAN returns and the sync catches up.
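The last point implies a floor on exposure: a deprovisioning issued mid-severance cannot reach the site before the link returns. The sketch below computes only that floor. The function name `minimum_exposure` and the timestamps are hypothetical, and the time the sync needs to catch up after reconnection is deliberately not modeled.

```python
from datetime import datetime, timedelta


def minimum_exposure(
    issued_at: datetime, severance_start: datetime, severance_end: datetime
) -> timedelta:
    """Shortest time a deprovisioning can stay unapplied at a severed site.

    Lower bound only: catch-up time after the WAN returns comes on top of it.
    """
    if severance_end < severance_start:
        raise ValueError("severance_end precedes severance_start")
    if not (severance_start <= issued_at <= severance_end):
        return timedelta(0)  # severance imposes no floor on this event
    return severance_end - issued_at


start = datetime(2024, 1, 1, 9, 0)
end = start + timedelta(hours=2)          # two-hour WAN severance
issued = start + timedelta(minutes=30)    # termination recorded upstream at 09:30
print(minimum_exposure(issued, start, end))  # 1:30:00
```

A targeted sync cannot shorten this floor at the severed site, since the push has no path to the replica until the link is restored.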
This is the tradeoff site placement expresses: continuity at the cost of bounded lag during severance. A deployment selects the window width with this tradeoff in mind, then makes its placement decision against the same tradeoff.