Health monitoring

How the platform knows what it knows — signal sources, state classification, the stabilization window, and what metrics get emitted to your observability stack.

Health monitoring is what makes Failover behavior health-aware rather than merely configuration-aware. The router doesn’t wait for a method to declare its own failure; it continuously observes the state of every configured method and selects on the basis of current health. This page describes how that “current health” is computed, what state transitions look like, and what is emitted for operators to consume.

Three signal sources

The state of each configured login method is computed from three independent sources:

Synthetic probes. Authonomy exercises each method on a cadence by performing a real protocol exchange against it using a dedicated service identity — a client-credentials grant against an OIDC provider, a service-identity assertion against a SAML provider, a verification against a service-enrolled factor for native authentication. The probe goes through the same code path production traffic uses, so it exercises the token-issuance and response-normalization path end-to-end. A probe that completes within the expected latency and returns a response of the expected shape is a healthy observation; one that fails, times out, or returns malformed output is unhealthy.

Probe outcomes are excluded from authentication-success metrics that would otherwise attribute probe behavior to real traffic.

Live-traffic observations. Metrics from real authentication requests — error rates, latency distributions, response-shape conformance — feed the same classifier. A method that’s intermittently failing real traffic reaches a degraded or unavailable state without waiting for the next synthetic probe cycle. The classifier uses moving-window aggregates rather than raw counts, so a single failed request doesn’t drive a state transition.

Upstream signals where available. Some identity providers publish status feeds, explicit health endpoints, or event streams that name outages as they occur. Where such signals are reliable enough to consume programmatically, they’re integrated as a third input. They’re not the sole input — upstream signals lag the outage that produces them, and they occasionally disagree with observed reality — but where present, they accelerate state transitions.
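As an illustration of how probe and live-traffic observations could feed one moving-window aggregate, here is a minimal Python sketch. It is not the platform's implementation: the names `Observation`, `is_healthy`, and `SlidingErrorRate`, the latency bound, and the required response fields are all assumptions chosen for the example (the fields follow the usual shape of an OAuth 2.0 token response).

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical expectations for one method; real values are deployment-specific.
MAX_LATENCY_MS = 800.0
REQUIRED_FIELDS = {"access_token", "token_type", "expires_in"}


@dataclass
class Observation:
    latency_ms: Optional[float]  # None means the request timed out
    error: Optional[str]         # transport or protocol error, if any
    body: Optional[dict]         # parsed response, if any


def is_healthy(obs: Observation) -> bool:
    """Healthy only if it completed in time with the expected response shape."""
    if obs.error is not None or obs.latency_ms is None:
        return False
    if obs.latency_ms > MAX_LATENCY_MS:
        return False
    return isinstance(obs.body, dict) and REQUIRED_FIELDS.issubset(obs.body)


class SlidingErrorRate:
    """Failure ratio over a moving time window rather than a raw count."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop observations that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, timestamp: float, failed: bool) -> None:
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)
```

In this sketch a probe result and a live request are recorded the same way, for example `window.record(now, failed=not is_healthy(obs))`, so a single failure among many successes moves the ratio only slightly.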

The classifier produces one of three states

The classifier combines the three sources into one state per configured method:

Healthy. The method is routable. Both synthetic probes and live-traffic observations indicate the method is behaving within its expected envelope.

Degraded. The method is functioning but showing elevated error rates or latency. Degraded methods are still routable; the router consults the application’s routing configuration to decide whether to prefer an alternative.

Unavailable. The method is not routable. Traffic is directed to the next method in the ladder.
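The way the three states could drive selection down a ladder can be sketched as follows. This is an assumption-laden illustration, not the router's actual logic: `select_method`, the `prefer_healthy_over_degraded` flag (standing in for the application's routing configuration), and the method names in the usage note are hypothetical.

```python
from enum import Enum
from typing import Mapping, Optional, Sequence


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"


def select_method(
    ladder: Sequence[str],
    states: Mapping[str, State],
    prefer_healthy_over_degraded: bool,
) -> Optional[str]:
    """Pick a method from the ladder, highest priority first.

    Unavailable methods (and methods with no known state) are skipped.
    Degraded methods stay routable; the flag decides whether a healthy
    alternative lower in the ladder should be preferred over them.
    """
    routable = [
        m for m in ladder
        if states.get(m, State.UNAVAILABLE) is not State.UNAVAILABLE
    ]
    if not routable:
        return None
    if prefer_healthy_over_degraded:
        for method in routable:
            if states[method] is State.HEALTHY:
                return method
    # Either no preference is configured or nothing is fully healthy.
    return routable[0]
```

With a ladder of `["oidc-primary", "saml-secondary", "native"]` and a degraded primary, the sketch returns `"saml-secondary"` when the flag is set and `"oidc-primary"` when it is not.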

A stabilization window sits between the raw classifier evaluation and the state the router consults. The window is configurable; the default favors stability over responsiveness. Its purpose is narrow: a method that fails a single probe or produces a single spike of elevated errors does not flip the router’s view, because a single transient does not represent a sustained change in availability. The window is short enough that the router still transitions during a genuine failure event and long enough to ride out a transient.

A parallel stabilization window applies in the other direction (unavailable to healthy) so that a method flapping briefly healthy mid-outage doesn’t cause the router to ping-pong back to it.
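The two stabilization windows can be pictured as a small debouncer that only adopts a new raw state once it has persisted long enough. The sketch below is one way to do that under stated assumptions: the `Stabilizer` class, the separate `fail_window_s` and `recover_window_s` parameters, and the example durations are hypothetical, and the platform's real window semantics may differ.

```python
from typing import Optional

SEVERITY = {"healthy": 0, "degraded": 1, "unavailable": 2}


class Stabilizer:
    """Holds the router-visible state until a raw state persists long enough."""

    def __init__(self, fail_window_s: float, recover_window_s: float,
                 initial: str = "healthy") -> None:
        self.fail_window_s = fail_window_s        # applies when moving to a worse state
        self.recover_window_s = recover_window_s  # applies when moving to a better state
        self.stable = initial
        self._candidate: Optional[str] = None
        self._since = 0.0

    def observe(self, raw: str, now: float) -> str:
        """Feed one raw classifier result; return the stabilized state."""
        if raw == self.stable:
            # The raw state agrees with the stable one: any pending change is void.
            self._candidate = None
            return self.stable
        if raw != self._candidate:
            # A new candidate state starts its own timer.
            self._candidate = raw
            self._since = now
        worsening = SEVERITY[raw] > SEVERITY[self.stable]
        window = self.fail_window_s if worsening else self.recover_window_s
        if now - self._since >= window:
            self.stable = raw
            self._candidate = None
        return self.stable
```

In this sketch a single failed evaluation followed by a healthy one resets the timer, so the stable state never moves, and a brief healthy reading in the middle of an outage is discarded the same way.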

Sync lag is a first-class health signal

Each instance exposes its own health surface: reachability of its configured methods, the database’s freshness against the upstream paths that feed it, keystore reachability where the topology includes one, and native-credential verification success rate.

Sync lag at an instance is treated as a first-class health condition, not a silent degradation. A lag that exceeds the configured drift window is operator-visible. The condition does not interrupt the instance’s ability to serve authentication: the instance continues to serve against whatever state it has. It is surfaced so that an operator can decide whether to intervene, for example by investigating the sync path, invoking a targeted sync, or reducing traffic to the instance by policy.
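A minimal sketch of that rule, assuming the instance tracks the timestamp of its last successful sync: the names `SyncHealth` and `evaluate_sync_lag` are hypothetical, and the point is only that exceeding the drift window raises a visible condition without turning off serving.

```python
from dataclasses import dataclass


@dataclass
class SyncHealth:
    lag_seconds: float
    exceeds_drift_window: bool  # operator-visible condition
    serving: bool               # the instance keeps serving either way


def evaluate_sync_lag(last_sync_ts: float, now: float,
                      drift_window_s: float) -> SyncHealth:
    """Report sync lag as a health condition without gating authentication."""
    lag = max(0.0, now - last_sync_ts)
    return SyncHealth(
        lag_seconds=lag,
        exceeds_drift_window=lag > drift_window_s,
        serving=True,  # lag is surfaced, not enforced
    )
```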

Metrics

The platform emits the following metrics for consumption by the customer’s observability stack. All carry tenant and deployment labels where multi-tenancy or multi-deployment applies.

Per-method authentication-request rate, success rate, and latency histogram. One set per configured method per application, at the granularity the customer’s stack ingests.

Per-method health-state transitions. A counter incrementing on each transition between healthy, degraded, and unavailable, labeled with the destination state.

Router decision distribution. Which rule was selected per application, aggregated across requests. Answers “how often is the secondary being used” without per-request audit replay.

Fallback events. A counter incrementing each time a routing decision selected a method other than the highest-priority rule. Decomposable by application and by reason.

Sync engine cadence and per-run outcome. For each sync mode (incremental, full, targeted, orphan), the run interval and the success or failure of each run.

Identity-sync drift. Estimated staleness of the deployment database against the customer’s authoritative identity source, as a gauge.

Keystore reachability (externalized-keystore topologies). Reachability and latency to the keystore from each instance, as a gauge per instance.

Operations-framework queue depth and retry distribution. Depth per queue, retry-attempt histogram, dead-letter rate.

Local-factor verification success rate. One metric per instance, capturing the success rate of native-auth factor verifications served by the instance.
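To make the counter semantics above concrete, the following sketch models two of the metrics with in-memory counters. It is illustrative only: the function names, label sets, and reason strings are assumptions, and a real deployment would publish these through whatever exporter the customer's stack uses.

```python
from collections import Counter
from typing import Sequence

# Hypothetical in-memory stand-ins for counters an exporter would publish.
health_transitions = Counter()  # key: (tenant, method, destination_state)
fallback_events = Counter()     # key: (tenant, application, reason)


def record_transition(tenant: str, method: str, destination: str) -> None:
    """Count a health-state transition, labeled with the destination state."""
    health_transitions[(tenant, method, destination)] += 1


def record_decision(tenant: str, application: str, ladder: Sequence[str],
                    selected: str, reason: str) -> None:
    """Count a fallback only when the selected method is not the top rule."""
    if selected != ladder[0]:
        fallback_events[(tenant, application, reason)] += 1


def fallbacks_by_application(application: str) -> int:
    """Decompose the fallback counter by application, across tenants and reasons."""
    return sum(
        count for (_, app, _), count in fallback_events.items()
        if app == application
    )
```

A decision that selects the highest-priority rule leaves the fallback counter untouched, which is what lets the counter answer how often a secondary is in use.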

On-call runbook

The platform ships a baseline runbook that the operations team adapts. The following scenarios are covered, each with expected behavior, verification steps, and the point at which operator intervention is appropriate:

Primary identity provider reported unhealthy. Router falls through to secondary (or native). Verify that the per-application routing view shows the fallback active, that the fallback event rate is elevated as expected, and that audit entries record the transition. Intervene if the fallback is also unhealthy or the application’s authentication error rate is not recovering.

Secondary unhealthy while primary is also unhealthy. Router falls through to native. Verify that native is enrolled for the affected population. Intervene if the affected population is not enrolled; this is a policy gap, not a platform defect.

Instance offline at a site. Applications at the site lose authentication until the instance is restored. Verify instance process status, local storage integrity, and reachability from the site’s applications. Recover by restart or restoration. Deployments that require continuity through single-instance failure run a redundant set of instances at the site.

Sync lag exceeded tolerance. The sync lag for an instance has grown beyond the configured drift window. Verify the sync engine queue and per-run outcomes, verify network reachability to the relevant upstream, and confirm that authentication is still succeeding against current state. Intervene by invoking a targeted sync for urgent records or a full sync to re-establish the baseline.

Each runbook entry includes the relevant audit-trail searches, the pre-written queries an operator runs while investigating, so the investigation path doesn’t require constructing queries from scratch under pressure.

Recommended fault-injection scenarios

The platform’s resiliency behavior can be trusted only to the extent that it has been exercised in the customer’s environment. Running the following scenarios before production use is recommended:

Simulated primary outage. Block reachability to the primary; observe the router falling through to the next method, the audit recording the transition, the fallback metric incrementing. Restore reachability; observe return-to-primary after stabilization.

Simulated correlated external outage. Block reachability to every configured external provider simultaneously. Observe the router falling through to native for the enrolled population.

Site WAN severance. Sever WAN between an instance and its external providers. Observe the instance continuing to serve against whichever methods remain reachable, with native as the floor where credentials are available. Restore WAN; observe sync and propagation catchup.

Flapping health signal. Introduce intermittent failures such that the raw classifier state oscillates. Observe the stabilization window preventing the router from ping-ponging.

Sync interruption. Interrupt the sync engine mid-run for one replica. Observe the incremental cursor not advancing past the interruption point and resuming on the next run, with no duplicate events applied.
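The verification point for the sync-interruption scenario can be reasoned about with a small model of an incremental cursor. The sketch assumes events carry monotonically increasing sequence numbers and that the cursor advances only after an event is applied; `run_incremental` and its `fail_at` parameter are hypothetical names used to simulate the interruption, not part of the sync engine.

```python
from typing import List, Optional, Sequence, Tuple


def run_incremental(
    events: Sequence[Tuple[int, str]],
    cursor: int,
    applied: List[str],
    fail_at: Optional[int] = None,
) -> int:
    """Apply events with sequence numbers above the cursor, in order.

    Returns the new cursor. If the run is interrupted at `fail_at`, the
    cursor stays at the last event that was actually applied.
    """
    for seq, payload in events:
        if seq <= cursor:
            continue  # already applied in an earlier run; never re-apply
        if fail_at is not None and seq == fail_at:
            return cursor  # simulated interruption before this event is applied
        applied.append(payload)
        cursor = seq  # advance only after the event has been applied
    return cursor
```

Running the function once with `fail_at=3` and then again without it should leave every event applied exactly once, which mirrors the observable the scenario asks for.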

Each scenario has a verification point: a concrete observable confirming that the platform behaved as described. Running these scenarios against a deployment provides the empirical evidence for the behavior this documentation describes.