Resilience documentation

Audit continuity

How the audit trail is written, where it lives, and why it doesn't lose events across an outage or a failover boundary.

The audit trail’s single most important property for a resiliency review is that it does not lose events across an outage. The design commits to this as a guarantee, not as a best-effort goal. This page describes how that guarantee is realized, and what it does and doesn’t promise.

What gets captured

Every event the instance produces lands in the audit trail: routing decisions (which rule matched, which method was selected, which methods were skipped and why), factor verifications, token issuances, operator overrides, and sync operations. The trail is the source of truth for what happened at the instance, regardless of which method served any particular authentication.

A token issuance from the primary, one from the secondary, and one from the native floor are all captured in the same shape, in the same trail, with the serving method recorded as a property of the entry. A reviewer asking for “every session issued yesterday” asks the audit trail one question and gets one answer, not one per possible upstream method.
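A minimal sketch of what "same shape, same trail" could look like to a reviewer. The AuditEntry fields, the event-type strings, and the sessions_issued helper are illustrative assumptions, not the platform's actual schema or query API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical field names; the real entry schema is not specified here.
    timestamp: datetime
    event_type: str      # e.g. "token_issuance", "routing_decision"
    serving_method: str  # e.g. "primary", "secondary", "native_floor"
    subject: str


def sessions_issued(trail, start, end):
    """Return every token issuance in [start, end), whichever method served it."""
    return [
        e for e in trail
        if e.event_type == "token_issuance" and start <= e.timestamp < end
    ]


def utc(hour):
    # Fixed illustrative date so the example is reproducible.
    return datetime(2024, 5, 1, hour, tzinfo=timezone.utc)


trail = [
    AuditEntry(utc(9), "token_issuance", "primary", "alice"),
    AuditEntry(utc(10), "routing_decision", "secondary", "bob"),
    AuditEntry(utc(11), "token_issuance", "secondary", "bob"),
    AuditEntry(utc(12), "token_issuance", "native_floor", "carol"),
]

# One question, one answer: the serving method is a property, not a separate log.
issued = sessions_issued(trail, utc(0), utc(23))
```

Because the serving method is just a field on the entry, the query does not need to know how many upstream methods exist.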

Three structural commitments

The continuity guarantee rests on three commitments:

Audit is written synchronously with the action that produces it. A routing decision, a method selection, a token issuance, an override, a sync operation — each writes to the instance’s audit trail before the action is considered complete. If the audit write fails, the action does not complete. This is a deliberate choice of correctness over availability for the audit path: a deployment that loses the ability to write audit briefly loses the ability to authenticate briefly, rather than serving authentications that go unrecorded.

Audit lives at the instance. Each Authonomy instance writes its own audit trail to its own durable storage. The trail survives instance restarts. Durability does not depend on a replay path or on aggregation: it is a property of the local trail, not of any aggregation pipeline.

Aggregation is a separate concern. A multi-instance deployment that requires aggregated audit (for compliance reporting, SIEM ingestion, cross-instance investigation) runs aggregation separately, outside the request path. Aggregation does not change the durability or correctness of any instance’s local audit trail; it consumes that trail rather than constituting it.

The third commitment is the one a reviewer most often misses on a first read. It means the audit trail does not depend on the aggregation pipeline being healthy. If your SIEM ingestion is broken for six hours, you do not lose six hours of audit events — they’re still on the instance, ready to be ingested when the pipeline returns.
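The first commitment (audit before completion) can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: LocalTrail, AuditWriteError, and issue_token are hypothetical names, and an in-memory list stands in for the instance's durable storage.

```python
class AuditWriteError(Exception):
    """Raised when the local audit trail cannot accept a write."""


class LocalTrail:
    # Stand-in for durable local storage; a real trail would persist to disk.
    def __init__(self):
        self.entries = []
        self.writable = True

    def append(self, entry):
        if not self.writable:
            raise AuditWriteError("local audit storage unavailable")
        self.entries.append(entry)


def issue_token(trail, subject, serving_method):
    # The audit write happens first. If it raises, the function never
    # reaches the return, so no token is handed out unrecorded.
    trail.append({
        "event_type": "token_issuance",
        "subject": subject,
        "serving_method": serving_method,
    })
    return f"token-for-{subject}"
```

In this sketch a failed audit write surfaces as an exception to the caller, which corresponds to the trade-off described above: authentication is briefly unavailable rather than briefly unaudited.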

What the trail looks like across a failover

The audit trail’s continuity property is most useful at the failover boundary. The trail does not stop at the boundary; it captures the transition itself.

When the router transitions from the primary to the secondary, the audit trail records: the last successful authentication served by the primary, the health signal that triggered the transition (with the underlying classifier inputs), the stabilization window’s resolution, the routing decisions made against the secondary thereafter, and (when it happens) the return-to-primary transition. From the trail’s perspective, an outage is a continuous sequence of authentication events served against whichever method was healthy at each moment, plus the operator-visible state transitions in between.
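As an illustration of how such a trail might read, the sketch below lays out a hypothetical failover sequence and collapses the authentications into per-method segments. The event-type names, the field names, and the serving_segments helper are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical trail excerpt across a failover and the return to primary.
trail = [
    {"seq": 1, "event_type": "authentication", "serving_method": "primary"},
    {"seq": 2, "event_type": "authentication", "serving_method": "primary"},
    {"seq": 3, "event_type": "health_signal", "detail": "primary classified unhealthy"},
    {"seq": 4, "event_type": "stabilization_resolved", "detail": "fail over"},
    {"seq": 5, "event_type": "routing_transition", "from": "primary", "to": "secondary"},
    {"seq": 6, "event_type": "authentication", "serving_method": "secondary"},
    {"seq": 7, "event_type": "routing_transition", "from": "secondary", "to": "primary"},
    {"seq": 8, "event_type": "authentication", "serving_method": "primary"},
]


def serving_segments(trail):
    """Collapse consecutive authentications into (method, count) segments."""
    auths = [e for e in trail if e["event_type"] == "authentication"]
    return [
        (method, len(list(group)))
        for method, group in groupby(auths, key=lambda e: e["serving_method"])
    ]


segments = serving_segments(trail)
# segments == [("primary", 2), ("secondary", 1), ("primary", 1)]
```

The authentications form one unbroken sequence; the transition entries between them explain why the serving method changed.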

A site-deployed instance during WAN severance writes its trail locally. Every authentication served against the native floor during the severance is captured at the instance, in the same shape, with the serving method recorded. When the WAN returns, those events are part of the trail; aggregation pipelines pick them up alongside the rest.

Audit during severance

Severance does not change where audit is written. The instance continues writing to its own durable storage. If aggregation to a central system was happening over the WAN, that aggregation pauses during the severance and resumes on reconnect — the local trail is the source of truth in either case.
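One way to picture the pause-and-resume behavior is a consumer that tracks a cursor into the local trail. This is a sketch under assumptions: the Aggregator class, its cursor, and the wan_up flag are hypothetical, and a plain list stands in for the instance's durable trail.

```python
class Aggregator:
    """Hypothetical out-of-path consumer of an instance's local trail."""

    def __init__(self):
        self.cursor = 0    # index of the next local entry to ship
        self.shipped = []  # stand-in for the central system

    def run_once(self, local_trail, wan_up):
        if not wan_up:
            return 0  # paused: nothing shipped, nothing lost
        pending = local_trail[self.cursor:]
        self.shipped.extend(pending)
        self.cursor += len(pending)
        return len(pending)


local_trail = ["auth-1"]
agg = Aggregator()
agg.run_once(local_trail, wan_up=True)               # ships auth-1

# Severance: the instance keeps writing locally; aggregation pauses.
local_trail += ["auth-2-native-floor", "auth-3-native-floor"]
paused = agg.run_once(local_trail, wan_up=False)     # ships nothing

# Reconnect: the consumer catches up from where it stopped.
caught_up = agg.run_once(local_trail, wan_up=True)   # ships the two held entries
```

The local trail is never modified by the consumer, which is the sense in which aggregation consumes the trail rather than constituting it.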

This matters operationally because some compliance frameworks have audit-log requirements that an aggregation-dependent design would struggle to meet during a severance. The instance’s local trail is durable on its own storage and survives an instance restart; an audit query against the instance returns the events it served regardless of whether they’ve been aggregated upstream yet.

What audit doesn’t do

The trail records what happened. It does not interpret what happened. Three explicit non-claims:

It does not aggregate across instances at the request-path layer. A query for “every session issued at this deployment yesterday” requires either querying each instance, or running an aggregation pipeline outside the request path. The platform supports both — the right choice depends on the deployment’s compliance posture and observability stack.
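The "query each instance" option amounts to a merge on the reviewer's side. A minimal sketch, assuming each instance returns its results already ordered by timestamp and that timestamps are ISO 8601 strings in UTC; the instance names, field names, and merge_instance_results helper are hypothetical.

```python
import heapq


def merge_instance_results(per_instance):
    """Merge per-instance result lists (each sorted by timestamp) into one timeline."""
    return list(heapq.merge(*per_instance.values(), key=lambda e: e["timestamp"]))


per_instance = {
    "site-a": [
        {"timestamp": "2024-05-01T09:00:00+00:00", "instance": "site-a", "subject": "alice"},
        {"timestamp": "2024-05-01T11:30:00+00:00", "instance": "site-a", "subject": "carol"},
    ],
    "site-b": [
        {"timestamp": "2024-05-01T10:15:00+00:00", "instance": "site-b", "subject": "bob"},
    ],
}

timeline = merge_instance_results(per_instance)
```

An aggregation pipeline would do the same work ahead of time; which approach fits depends on the deployment, as noted above.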

It does not retroactively classify events. If an authentication served during an outage was against a subject whose authoritative record had since been deprovisioned, the audit captures the authentication and the eventual deprovisioning as two distinct events with their respective timestamps. A reviewer reasoning about the deprovisioning lag does so by reading both. The platform does not flag the prior authentication as “should-not-have-happened” after the fact.
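Since the trail only records the two events, any lag calculation is the reviewer's own. A minimal sketch of that arithmetic, assuming ISO 8601 timestamps with an explicit UTC offset; the field names and the gap_seconds helper are illustrative, not part of the platform.

```python
from datetime import datetime


def gap_seconds(auth_event, deprovision_event):
    """Signed gap: positive when the deprovisioning entry is later than the authentication."""
    a = datetime.fromisoformat(auth_event["timestamp"])
    d = datetime.fromisoformat(deprovision_event["timestamp"])
    return (d - a).total_seconds()


auth = {"event_type": "authentication", "subject": "alice",
        "timestamp": "2024-05-01T10:00:00+00:00"}
deprov = {"event_type": "deprovisioning", "subject": "alice",
          "timestamp": "2024-05-01T10:45:00+00:00"}

gap = gap_seconds(auth, deprov)  # 2700.0 seconds, i.e. 45 minutes
```

Neither entry is altered or annotated by this calculation; the interpretation stays with the reviewer.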

It is not a substitute for the customer’s broader observability stack. Audit captures what the instance served. Application-layer telemetry, network observability, and the customer’s own SIEM rules sit outside the platform. The audit trail integrates with those rather than replacing them.