Resilience documentation

Failover behavior

How the router walks the ladder — health detection, method selection, stabilization, and operator controls.

The three failover variants in the spec — between cloud identity providers, between cloud and on-premises Active Directory, and into the native floor — are not three separate features. They are three shapes of the same mechanism, applied at instances whose reachable set of methods has narrowed in different ways. This page describes the mechanism.

The router and the health monitor

Routing configuration is per application, held in Authonomy. Each application has an ordered list of rules; each rule names a login method (a cloud IdP, an on-prem directory, the native floor) and, where relevant, the subset of users or requests the rule applies to. At request time, the router walks the list in priority order, consults the health monitor for each candidate method, and dispatches to the first method reporting healthy.
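The walk described above can be sketched in a few lines. This is a minimal illustration, not Authonomy's implementation: the `Rule` type, the `select_method` function, the method names, and the `is_healthy` callback standing in for the health monitor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    method: str  # hypothetical method name, e.g. "cloud-idp-a"
    # Optional scope: which requests the rule applies to (default: all).
    applies_to: Callable[[dict], bool] = lambda request: True


def select_method(
    rules: List[Rule],
    request: dict,
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Walk the rules in priority order; return the first applicable healthy method."""
    for rule in rules:
        if not rule.applies_to(request):
            continue  # rule is scoped to other users or requests
        if is_healthy(rule.method):
            return rule.method
    return None  # no rule matched a healthy method


ladder = [Rule("cloud-idp-a"), Rule("cloud-idp-b"), Rule("native-floor")]
health = {"cloud-idp-a": False, "cloud-idp-b": True, "native-floor": True}

selected = select_method(ladder, {"user": "alice"}, lambda m: health[m])
print(selected)  # cloud-idp-b
```

The list order carries the priority; the health lookup is the only thing that changes between a normal request and a failed-over one.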

Three properties of this design are worth stating plainly.

The router is health-aware, not position-aware. A method's place in the list sets its priority, but it is the method's current health that decides whether it is used. The router does not wait for a method to declare its own failure or for any upstream signal that another method should take over. The health monitor observes the state of every configured method continuously, and the router selects on the basis of that current health. This is what keeps failover robust when a method is failing in ways that prevent it from announcing the failure.

Decisions are logged. Every routing decision — which rule matched, which method was selected, which methods were skipped and why — is written to the audit trail synchronously with the request. The operations console renders the same data as a time-ordered view per application, after the fact.
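As an illustration of what such a record might carry, the sketch below builds one decision entry per request and appends it before the request returns. The field names, the `route_and_record` helper, the JSON encoding, and the in-memory list standing in for the audit trail are assumptions made for the example, not the product's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RoutingDecision:
    application: str
    matched_rule: Optional[int] = None     # index of the rule that served the request
    selected_method: Optional[str] = None
    skipped: List[dict] = field(default_factory=list)  # methods passed over, with reasons


def route_and_record(application, ladder, health, audit_trail):
    """Select a method and write the decision record in the same call."""
    decision = RoutingDecision(application=application)
    for index, method in enumerate(ladder):
        if health.get(method, False):
            decision.matched_rule = index
            decision.selected_method = method
            break
        decision.skipped.append({"method": method, "reason": "unhealthy"})
    # Appended before returning, i.e. synchronously with the request.
    audit_trail.append(json.dumps(asdict(decision), sort_keys=True))
    return decision


audit_trail = []
decision = route_and_record(
    "payroll",
    ["cloud-idp-a", "cloud-idp-b", "native-floor"],
    {"cloud-idp-a": False, "cloud-idp-b": True, "native-floor": True},
    audit_trail,
)
print(audit_trail[0])
```

A time-ordered console view is then a matter of reading these records back per application.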

The application’s experience is uniform. The application receives the same shape of authentication artifact regardless of which method served it. It does not need to branch on “which provider issued this token,” because from its perspective, Authonomy issued it. This property makes failover transparent to the application, and it lets a security reviewer evaluate a single trust path between the application and Authonomy rather than a separate trust path per upstream method.

What happens when a method fails

A method becomes unhealthy in the monitor’s view when the classifier, which combines synthetic probes, live-traffic error-rate observations, latency, and any upstream status signal the provider exposes, durably reports failure. A single transient failure does not flip routing. A stabilization window holds the current state until the signal is durable; the windows are configurable per deployment, with defaults that favor stability over responsiveness.
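One way to picture the classifier's raw verdict, before any stabilization is applied, is as a function over the four signal types. The sketch below is an assumption-heavy illustration: the `Signals` fields, the thresholds, and the rule that any single failing signal marks the interval as failing are all invented for the example and are not documented product behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    probe_success_rate: float   # fraction of synthetic probes succeeding, 0..1
    live_error_rate: float      # fraction of live requests failing, 0..1
    p95_latency_ms: float
    upstream_status_ok: Optional[bool] = None  # None when the provider exposes no status


# Illustrative thresholds only; not product defaults.
MIN_PROBE_SUCCESS = 0.9
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_MS = 2000.0


def looks_failing(s: Signals) -> bool:
    """Raw, unstabilized verdict for a single observation interval."""
    return (
        s.probe_success_rate < MIN_PROBE_SUCCESS
        or s.live_error_rate > MAX_ERROR_RATE
        or s.p95_latency_ms > MAX_P95_LATENCY_MS
        or s.upstream_status_ok is False
    )


print(looks_failing(Signals(1.0, 0.0, 120.0)))   # False
print(looks_failing(Signals(1.0, 0.2, 120.0)))   # True
```

This verdict alone does not move traffic; it only feeds the stabilization window.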

Once the method’s unhealthy state is durable, the next request the router handles walks past it to the next rule on the ladder. The application, the user, and the upstream provider continue along the new path without coordination. There is no fallback handshake, no “second-try” logic at the application layer, no consent re-prompt. The router decides; the request proceeds.

A parallel stabilization window applies to recovery. When a previously unhealthy method begins reporting healthy probes again, the router continues serving on the fallback method until the recovered method’s healthy state is durable. This prevents a flapping primary from oscillating traffic.
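The two windows together behave like a small hysteresis state machine: the state the router acts on changes only after the contrary signal has persisted for the full window. The sketch below assumes time-based windows measured in seconds; the class name, field names, and the 30 and 60 second values are hypothetical, chosen only to make the trace readable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StabilizedHealth:
    fail_window: float = 30.0      # seconds of sustained failure before flipping (hypothetical)
    recover_window: float = 60.0   # seconds of sustained health before flipping back (hypothetical)
    healthy: bool = True           # the state the router acts on
    _pending_since: Optional[float] = None  # when the contrary signal began

    def observe(self, raw_healthy: bool, now: float) -> bool:
        """Feed one raw observation; return the stabilized state."""
        if raw_healthy == self.healthy:
            self._pending_since = None  # signal agrees with current state; reset
            return self.healthy
        if self._pending_since is None:
            self._pending_since = now
        window = self.recover_window if raw_healthy else self.fail_window
        if now - self._pending_since >= window:
            self.healthy = raw_healthy  # contrary signal is now durable
            self._pending_since = None
        return self.healthy


m = StabilizedHealth()
trace = [
    m.observe(False, 0.0),    # True: a single transient does not flip routing
    m.observe(True, 5.0),     # True: transient cleared, timer reset
    m.observe(False, 10.0),   # True: failure begins again
    m.observe(False, 40.0),   # False: failure has lasted the full 30 s
    m.observe(True, 50.0),    # False: recovery not yet durable
    m.observe(True, 110.0),   # True: healthy for the full 60 s
]
print(trace)  # [True, True, True, False, False, True]
```

Making the recovery window longer than the failure window is one way to favor stability over responsiveness; the actual defaults are a deployment setting.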

Two operator controls sit over the automation

The router’s automated decisions can be overridden by an operator in two ways:

Override. An operator can force a method into a state independent of the monitor’s evaluation — to drain traffic ahead of planned provider maintenance, to hold traffic off a method after a suspected compromise, or to exclude a method from the router’s consideration during an incident. Overrides are written to the audit trail and remain until explicitly cleared. The monitor continues evaluating underneath; the override wins.
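The precedence rule, with the monitor still evaluating and the override winning, can be shown with a small lookup layer. The `OverrideTable` class, its method names, and the tuple-shaped audit entries below are hypothetical stand-ins for illustration.

```python
class OverrideTable:
    """Operator-forced states layered over the monitor's evaluation (illustrative)."""

    def __init__(self):
        self._forced = {}       # method -> forced healthy/unhealthy state
        self.audit_trail = []   # stand-in for the real audit trail

    def force(self, method, healthy, operator, reason):
        self._forced[method] = healthy
        self.audit_trail.append(("override_set", method, healthy, operator, reason))

    def clear(self, method, operator):
        # Overrides persist until someone explicitly clears them.
        self._forced.pop(method, None)
        self.audit_trail.append(("override_cleared", method, operator))

    def effective_health(self, method, monitor_state):
        # The monitor's view is still computed and passed in; the override wins.
        if method in self._forced:
            return self._forced[method]
        return monitor_state.get(method, False)


overrides = OverrideTable()
monitor_state = {"cloud-idp-a": True}

overrides.force("cloud-idp-a", False, operator="op-1", reason="planned maintenance")
during = overrides.effective_health("cloud-idp-a", monitor_state)   # False: drained

overrides.clear("cloud-idp-a", operator="op-1")
after = overrides.effective_health("cloud-idp-a", monitor_state)    # True: monitor's view again
print(during, after)  # False True
```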

Targeted method activation. An operator can promote a specific method (typically the native floor, but in principle any configured method) to the active method for a specified scope: an application, a user population, a duration. This is the control used when a deliberate migration or a declared external-provider incident warrants serving authentication from a chosen method immediately, without waiting for the monitor to drive the transition.
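A scoped activation can be modeled as a record with the three scope dimensions named above. In this sketch the `Activation` type and `choose` function are hypothetical, and it assumes an in-scope activation takes precedence over the health-driven walk without consulting the monitor; the source does not specify that interaction.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Activation:
    method: str
    application: str
    users: Optional[FrozenSet[str]]   # None means every user of the application
    expires_at: float                 # end of the activation's duration (epoch seconds)

    def applies(self, application: str, user: str, now: float) -> bool:
        return (
            application == self.application
            and (self.users is None or user in self.users)
            and now < self.expires_at
        )


def choose(ladder, health, activation, application, user, now):
    """Use the activated method when in scope; otherwise walk the ladder as usual."""
    if activation is not None and activation.applies(application, user, now):
        return activation.method  # assumption: activation bypasses the health walk
    return next((m for m in ladder if health.get(m, False)), None)


ladder = ["cloud-idp-a", "cloud-idp-b", "native-floor"]
health = {"cloud-idp-a": True, "cloud-idp-b": True, "native-floor": True}
act = Activation("native-floor", "payroll", frozenset({"alice"}), expires_at=1000.0)

print(choose(ladder, health, act, "payroll", "alice", now=500.0))   # native-floor
print(choose(ladder, health, act, "payroll", "bob", now=500.0))     # cloud-idp-a
print(choose(ladder, health, act, "payroll", "alice", now=1500.0))  # cloud-idp-a
```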

Both controls are tenant-scoped under multi-tenant configuration; an operator can act only on applications that belong to the operator’s own tenant.

What this looks like in practice

A typical three-rung deployment configures cloud IdP A as primary, cloud IdP B as secondary, and the native floor as the bottom rung. In normal operation, every request resolves at the primary; the router walks one entry of the ladder and dispatches.

When the primary’s region has a software incident, the monitor’s classifier observes elevated error rates in both the live traffic and the synthetic probes. The stabilization window keeps traffic on the primary until the signal is durable; after that, the next request walks past the primary to the secondary. The application sees one slightly slower request while the protocol exchange is established against the secondary. From there, traffic resolves at the secondary until the primary’s recovery is itself durable.

If a correlated outage takes both cloud rungs out simultaneously (for example, an enterprise-edge network failure that severs the path to both providers), the router walks past both and resolves at the floor for any user enrolled in a locally verifiable factor. This is the case the floor exists for, described separately in The native floor.
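The correlated-outage case adds one condition to the walk: the floor serves only users enrolled in a locally verifiable factor. The sketch below uses hypothetical method names and a hypothetical `local_factor_enrolled` flag, and it assumes that a user with no eligible method simply gets no result; the source does not say what happens to unenrolled users.

```python
from typing import Optional

LADDER = ["cloud-idp-a", "cloud-idp-b", "native-floor"]  # hypothetical method names


def resolve(user: dict, health: dict) -> Optional[str]:
    """Walk the ladder; the floor is eligible only for locally enrolled users."""
    for method in LADDER:
        if not health.get(method, False):
            continue  # skipped: unhealthy
        if method == "native-floor" and not user.get("local_factor_enrolled", False):
            continue  # skipped: user has no locally verifiable factor
        return method
    return None  # assumption: no eligible method means no result


# Enterprise-edge failure: both cloud providers unreachable, floor still healthy.
edge_outage = {"cloud-idp-a": False, "cloud-idp-b": False, "native-floor": True}

print(resolve({"name": "alice", "local_factor_enrolled": True}, edge_outage))   # native-floor
print(resolve({"name": "bob", "local_factor_enrolled": False}, edge_outage))    # None
```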

The router’s logic does not change across these scenarios. The three rungs are not three code paths. They are three priorities in one ordered list, evaluated against one health signal, on every request.