How Nexa stays available
Disruption events are not a good time for the platform that handles disruption events to have a bad day. A Tier-1 hub closure compresses tens of thousands of stranded passengers and dozens of operators into a window measured in single-digit minutes — at exactly the same moment the underlying internet, partner APIs, and cloud infrastructure are most likely to be under stress themselves. Nexa is engineered against the assumption that something in the stack will be broken on any given day, and that the platform must keep serving anyway.
This article describes the layered availability model behind that promise. The companion piece How Nexa survives operational panic covers the application-level failure modes (vendor surges, contention, refresh storms). This one covers everything below: the network, the edge, DNS, regions, and clouds.
Two principles shape the whole design:
- No single thing can take the platform down — not one region, not one cloud, not one vendor, not one DNS provider. Where a single point of failure cannot be eliminated cheaply, the failure mode and recovery path are documented so the on-call doesn't improvise during an incident.
- The smaller the blast radius, the faster the recovery. A vendor outage recovers in seconds. A regional event recovers in minutes. A cloud-provider event recovers in hours, and only matters to customers whose contracts require it. The platform absorbs disruption at the narrowest layer that can handle it, so the wider layers stay quiet.
The layers of disruption
| Layer | What can go wrong | How the platform absorbs it | Typical recovery |
|---|---|---|---|
| Browser → edge | TLS issues, packet loss, mobile carrier glitches, DDoS targeting one URL | Globally-distributed edge with anycast, WAF, DDoS scrubbing, and serve-stale-on-error caching | Seconds, transparent to the passenger |
| Edge → DNS | DNS provider outage, slow propagation during failover | Short TTLs on tenant-facing names, secondary DNS at an independent provider, health-check-driven steering | A few minutes |
| DNS → region | A cloud region degrades or goes offline | Multi-region topology with active-active read paths and automated traffic steering | Single-digit minutes for the visible-impact window |
| Region → cloud | The entire cloud provider is unreachable for a tenant | Per-tenant cross-cloud disaster-recovery target available for Tier-1 contracts | Hours, contractually scoped |
| Cloud → vendor | An external partner (PSS, GDS, payment, notifications) is down or rate-limiting | Per-vendor circuit breakers, adaptive traffic shaping, and fallback chains across multiple providers | Seconds to minutes — see Operational Panic |
1. The edge
Every public hostname Nexa serves — operator console, passenger PWA, partner API, docs, marketing — sits behind a globally-distributed edge network. There is no path from the public internet to a Nexa origin that does not transit the edge. That single rule unlocks several otherwise-expensive properties.
| Capability | What it means for customers |
|---|---|
| Anycast | A passenger in São Paulo terminates their connection on the nearest edge location, not in another hemisphere. First-byte latency is meaningfully lower, and routing automatically steers around regional internet incidents. |
| TLS termination | Certificates are managed at the edge with zero-touch rotation. Origins accept connections only from the edge, enforced by mutual authentication. The public surface and the origin surface are separate trust boundaries. |
| Layer 3/4 + Layer 7 protection | Volumetric attacks are scrubbed at the edge before they reach origin. A Nexa-tuned web application firewall blocks the credential-stuffing and enumeration patterns specific to disruption traffic. |
| CDN + edge cache | Static UI assets are served from the edge, not from origin. The passenger BFF cooperates with the edge cache so a refresh storm hits the edge, not the read store. |
| Per-tenant rate-limiting | Each tenant has its own quota at the edge. A surge against one airline's hostname cannot consume capacity meant for another tenant. |
| Tenant-aware routing | The edge maps each tenant's hostname to that tenant's isolated origin so cross-tenant routing is not even physically possible. |
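The per-tenant quota and tenant-aware routing rows above are easiest to picture as a token bucket keyed by the tenant's hostname: each hostname draws from its own bucket, so a surge against one tenant exhausts only that tenant's quota. The sketch below is illustrative only; the hostnames, quota numbers, and the admit helper are assumptions, not Nexa's actual edge configuration.

```typescript
// Illustrative sketch: a per-tenant token bucket keyed by hostname, refilled
// continuously, so one tenant's surge cannot consume another tenant's quota.
// Hostnames, quota values, and this shape are assumptions for illustration.

interface TenantQuota {
  capacity: number;        // burst size
  refillPerSecond: number; // sustained rate
}

interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

const quotas = new Map<string, TenantQuota>([
  ["alpha-air.nexa.example", { capacity: 500, refillPerSecond: 100 }],
  ["beta-air.nexa.example",  { capacity: 200, refillPerSecond: 40 }],
]);

const buckets = new Map<string, Bucket>();

/** Returns true if the request is admitted, false if it should receive a 429. */
export function admit(hostname: string, nowMs: number = Date.now()): boolean {
  const quota = quotas.get(hostname);
  if (!quota) return false; // unknown tenant hostname: reject at the edge

  const bucket = buckets.get(hostname) ?? { tokens: quota.capacity, lastRefillMs: nowMs };
  const elapsedSec = (nowMs - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(quota.capacity, bucket.tokens + elapsedSec * quota.refillPerSecond);
  bucket.lastRefillMs = nowMs;

  if (bucket.tokens < 1) {
    buckets.set(hostname, bucket);
    return false; // this tenant is over quota; other tenants are unaffected
  }
  bucket.tokens -= 1;
  buckets.set(hostname, bucket);
  return true;
}
```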
Fail-static as a last resort
If every regional origin for a tenant is unreachable — which should never happen, but is designed for anyway — the edge can serve a degraded, read-only view of each passenger's last-known offer. The passenger sees their hotel address, voucher, and transport instructions even when the platform is fully down. Accept and decline are temporarily disabled; the rest of the experience continues. This degraded mode is acceptable because the information the passenger needs has already been pushed out to the edge — they are not transacting in that window, only consuming.
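A minimal sketch of that fail-static behavior, written as a generic edge handler. The snapshot store and the header names are assumptions for illustration, not the production edge configuration.

```typescript
// Illustrative sketch of fail-static at the edge: if no regional origin is
// reachable, serve the passenger's last-known offer read-only. The cache shape
// and header names are assumptions, not Nexa's production configuration.

interface StaticSnapshot {
  body: string;     // last successfully rendered offer payload
  storedAt: string; // ISO timestamp, surfaced to the client as "last updated"
}

const failStaticCache = new Map<string, StaticSnapshot>();

export async function handle(request: Request): Promise<Response> {
  const key = new URL(request.url).pathname;
  try {
    const origin = await fetch(request);
    if (origin.ok) {
      // Refresh the fail-static snapshot on every successful origin response.
      failStaticCache.set(key, {
        body: await origin.clone().text(),
        storedAt: new Date().toISOString(),
      });
      return origin;
    }
    throw new Error(`origin returned ${origin.status}`);
  } catch {
    const snapshot = failStaticCache.get(key);
    if (!snapshot) return new Response("Service unavailable", { status: 503 });
    // Degraded, read-only view: the offer stays visible, transactions do not.
    return new Response(snapshot.body, {
      status: 200,
      headers: { "X-Degraded-Mode": "fail-static", "X-Last-Updated": snapshot.storedAt },
    });
  }
}
```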
2. DNS and client behavior
DNS is the most common ungraceful failure path during a regional event — a long TTL means a long outage even when the standby region is healthy. Nexa's DNS posture is built to fail over fast.
- Short TTLs on tenant-facing hostnames. Health-check-driven failover can steer traffic between regions within a tight window, and recovery has to be observable in the same window. Marketing hostnames keep longer TTLs because they don't carry application traffic.
- Active health checks against a deep handler. A "green" probe means the region is transactionally ready — its operational store, cache, event bus, and vendor egress are all responsive — not just that a process accepts TCP. Passive signals (5xx rate at the edge) are a secondary input; the active probe is authoritative. A sketch of such a probe follows this list.
- Secondary DNS at an independent provider. For Tier-1 carriers, Nexa publishes the tenant zone at a second DNS provider in addition to the primary. If the primary's control plane has an incident, resolvers transparently fall back to the secondary. End-customer mobile clients don't notice.
- Token-bucketed drains. When the edge drains traffic between regions, it ramps gradually rather than dumping the full load on the survivor instantly. The autoscaler in the surviving region has time to react before the queue depth becomes a problem.
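A minimal sketch of what "transactionally ready" means in code, under assumed dependency names and an assumed two-second deadline; the real probe targets Nexa's actual operational store, cache, event bus, and vendor egress.

```typescript
// Illustrative sketch of a "deep" health probe: green only when every
// dependency answers within a deadline. Dependency names, probe bodies, and
// the deadline value are assumptions standing in for the real checks.

type Check = { name: string; probe: () => Promise<void> };

const checks: Check[] = [
  { name: "operational-store", probe: async () => { /* e.g. SELECT 1 against the store */ } },
  { name: "cache",             probe: async () => { /* e.g. PING the cache */ } },
  { name: "event-bus",         probe: async () => { /* e.g. publish to a health topic */ } },
  { name: "vendor-egress",     probe: async () => { /* e.g. verify the egress proxy answers */ } },
];

const DEADLINE_MS = 2_000; // a slow dependency is treated the same as a down one

async function withDeadline(p: Promise<void>, ms: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<void>((_, reject) => {
    timer = setTimeout(() => reject(new Error("deadline exceeded")), ms);
  });
  try {
    await Promise.race([p, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

/** 200 only when every dependency answers in time; 503 otherwise. */
export async function deepHealth(): Promise<{ status: number; failed: string[] }> {
  const failed: string[] = [];
  await Promise.all(checks.map(async (c) => {
    try { await withDeadline(c.probe(), DEADLINE_MS); }
    catch { failed.push(c.name); }
  }));
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```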
What the client does
The mobile PWA and the operator console are not blind to outages. Both ship with a small backoff library:
- Jittered exponential retry on network errors. A passenger refreshing during an incident does not refresh every five seconds; the client staggers its retries to avoid stampedes. A sketch of the retry loop follows this list.
- Retry-After is honored. A 429 from the edge or a 503 from origin includes a Retry-After header. The client respects it, even when the user taps the refresh button. The button is debounced server-side, not by the user's patience.
- Cached snapshot survives outages. The passenger PWA caches the last successful payload locally. On a fetch failure, the UI re-renders from cache and shows a small "last updated" timestamp instead of a broken page.
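A sketch of that retry loop, with assumed delay constants and a simplified snapshot cache; the shipped library persists the snapshot and handles more status codes, but the shape is the same.

```typescript
// Illustrative sketch of the client behavior above: full-jitter exponential
// backoff, Retry-After honored, cached snapshot as the fallback. Delay
// constants and the snapshot handling are assumptions, not the shipped library.

const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;
const MAX_ATTEMPTS = 6;

let lastGoodSnapshot: unknown = null; // the real client persists this locally

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function fetchOffer(url: string): Promise<unknown> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) {
        lastGoodSnapshot = await res.json();
        return lastGoodSnapshot;
      }
      // 429 from the edge or 503 from origin: honor Retry-After (seconds form).
      const retryAfter = res.headers.get("Retry-After");
      if ((res.status === 429 || res.status === 503) && retryAfter) {
        await sleep(Number(retryAfter) * 1000);
        continue;
      }
      throw new Error(`unexpected status ${res.status}`);
    } catch {
      // Full jitter: a random delay up to the exponential cap, so thousands of
      // stranded passengers do not retry in lockstep.
      const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
      await sleep(Math.random() * cap);
    }
  }
  // Out of attempts: re-render from the cached snapshot with a "last updated" hint.
  return lastGoodSnapshot;
}
```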
The combined behavior — edge cache, short DNS TTLs, client backoff — means the typical passenger experience during a regional failover is a brief stale UI, then a fresh one. Not a broken app.
3. Multi-region
The platform's target topology pins each tenant to two regions chosen for the tenant's traffic geography (for example, a North American region paired with a South American region for Latin-American carriers). Both regions are active for reads; writes for any given tenant are anchored to that tenant's primary region at any moment in time.
This is a deliberate asymmetry, not active-passive. Active-passive's recovery floor is the time it takes to wake up the cold side, validate it, and switch traffic — typically tens of minutes for a database the size of a Tier-1 case load. Active-active concentrates the failover into the DNS and edge layer, which is the only layer fast enough to matter during a disruption.
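One way to picture the per-tenant topology is a small routing table: both regions serve reads, and writes always target the anchor. The tenant names, region names, and routeFor helper below are invented for illustration; the shape, not the values, is the point.

```typescript
// Illustrative sketch of the per-tenant topology: two active regions for
// reads, writes anchored to one of them at any moment. Names are invented.

interface TenantTopology {
  tenant: string;
  regions: [string, string]; // both serve reads
  writePrimary: string;      // one of `regions`; moved during a failover
}

const topologies: TenantTopology[] = [
  { tenant: "alpha-air", regions: ["na-east-1", "sa-east-1"], writePrimary: "na-east-1" },
  { tenant: "beta-air",  regions: ["eu-west-1", "me-south-1"], writePrimary: "eu-west-1" },
];

/** Reads go to a healthy region; writes always go to the anchor. */
export function routeFor(tenant: string, op: "read" | "write", healthy: Set<string>): string {
  const t = topologies.find((x) => x.tenant === tenant);
  if (!t) throw new Error(`unknown tenant ${tenant}`);
  if (op === "write") return t.writePrimary;
  const candidates = t.regions.filter((r) => healthy.has(r));
  if (candidates.length === 0) throw new Error("no healthy region: fail static at the edge");
  return candidates[0]; // in practice: closest-by-latency among healthy regions
}
```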
What customers experience during a regional event
When a region degrades, the platform follows a documented sequence:
1. The deep health check turns red. Within a tight window the edge drains affected tenants' traffic to the standby region.
2. Stateful infrastructure auto-promotes its standby in the surviving region to primary. Replication lag is measured in seconds, not minutes; recovery is measured in single-digit minutes for the visible-impact window.
3. Background workers in the surviving region pick up where the failed region left off. The async workflow engine is built around a durable record of every step, so in-flight work resumes from the last committed point — nothing is lost, nothing is duplicated (a sketch of this resume step follows the list).
4. The passenger PWA continues to render from its cache during the brief unavailability window. The operator UI displays a regional-failover banner and briefly disables submission for the affected tenant; reads continue.
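A sketch of that resume-from-the-durable-record step, under an assumed record shape: every step carries an idempotency key, and the durable pointer only advances after a step's effects are committed, which is what makes "nothing lost, nothing duplicated" possible.

```typescript
// Illustrative sketch of resuming in-flight work after a regional failover.
// The record shape, step names, and commit callback are assumptions.

interface WorkflowRecord {
  caseId: string;
  steps: string[];                          // ordered steps for this disruption case
  committed: number;                        // count of steps whose effects are already durable
  idempotencyKey: (step: string) => string; // stable key per (case, step)
}

// Stand-in for a real side effect (booking, voucher, notification). The vendor
// adapter recognizes a repeated idempotency key and treats the call as a no-op.
async function runStep(_step: string, _idempotencyKey: string): Promise<void> {}

export async function resume(
  record: WorkflowRecord,
  commit: (completedSteps: number) => Promise<void>, // persists the durable pointer
): Promise<void> {
  for (let i = record.committed; i < record.steps.length; i++) {
    const step = record.steps[i];
    await runStep(step, record.idempotencyKey(step));
    await commit(i + 1); // advance only after the step's effects are durable
  }
}
```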
The passenger-visible impact during a clean failover is a short window of stale UI. After the window, the passenger is reading fresh data from the standby region. No data is lost; no booking is duplicated.
Why regions, not zones
Cloud zones inside the same region share fate during a regional event — networking, control plane, sometimes power. A cross-zone-only deployment is not multi-region; it is a more reliable single region. Nexa's recovery objectives presume cross-region capability for Tier-1 customers because no zone topology defends against the class of events that take a region offline.
4. Multi-cloud
Nexa's default deployment runs on a single major hyperscaler, chosen for the dependency graph that supports the most cost-effective build (managed event bus, managed document store, managed cache, managed identity, managed AI). A single-cloud default is the right starting point because a multi-cloud-from-day-one posture would have stalled the platform in foundational engineering before any customer benefit.
Multi-cloud is per-tenant and contractually negotiated. For Tier-1 carriers whose risk-management policy disallows single-cloud dependency, the platform supports a parallel deployment on a second hyperscaler as the disaster-recovery target. The platform is engineered so that the application surface stays identical across clouds:
| What stays identical | Why it stays identical |
|---|---|
| Container images | Every Nexa service is a self-contained container; the orchestrator differs, the binary doesn't. |
| The canonical data model | Every external partner adapter translates to the same internal shape. Where the binary runs is invisible to the data. |
| All public APIs | Operator, passenger, partner, and webhook contracts do not change. A customer's URL is unchanged; only the resolution behind it differs. |
| The event topology | Topic naming, partitioning, and authorization are control-plane concerns, not redesigns. |
What necessarily differs across clouds:
- The platform identity model. The credential bootstrap differs by cloud; the platform's tenant claims and authorization semantics do not.
- AI provider behavior. The platform's AI agents (policy synthesis, exception triage) are pinned to specific model versions per cloud and exercised in CI to ensure consistent behavior.
- Cost. The alternate cloud is typically more expensive for the equivalent shape; the gap is part of the Tier-1 contractual line item.
Active-active across two clouds simultaneously, with cross-cloud replication, is documented as a per-tenant negotiable. It is not the default because cross-cloud replication for stateful infrastructure is an effort measured in months rather than weeks; the cost is justified only for tenants whose risk policy explicitly requires it.
5. Layered backoff
Backoff is not one mechanism. It is the same idea applied at every layer of the stack, and each layer absorbs what the layers below cannot:
- Browser / PWA → exponential + jittered retry, local cache fallback
- Edge → upstream retry, serve-stale-on-error
- DNS → short TTLs + health-check drain
- BFF / API → 503 with Retry-After, hedged reads inside the cluster
- Vendor egress → per-vendor adaptive traffic shaping + circuit breakers
Two design rules make these compose cleanly:
- Backoff windows compose. A short client retry plus a short edge retry plus a brief DNS drain plus a regional failover gives a worst-case end-to-end recovery window measured in single-digit minutes. That number is the recovery floor; everything else is implementation detail. Customer-facing SLOs are sized to cover this composed window with margin.
- Each layer fails closed for itself, fails open for the layer above. The edge returns stale content rather than a 5xx. The API returns 503 with Retry-After rather than hanging on a vendor. The vendor egress opens the circuit rather than calling a failing vendor and getting rate-limited. Each layer absorbs as much as it can, then degrades gracefully. Failures never compound.
Recovery objectives
Recovery objectives are a customer commitment, not a marketing number. The published targets cover the failure classes Nexa is engineered against:
| Failure class | Approximate visible-impact window |
|---|---|
| External partner outage | Seconds to minutes — fallback chain takes over |
| A single platform process restarts | Seconds — reroute is automatic |
| Stateful infrastructure failover within a region | Tens of seconds |
| Regional event | Single-digit minutes for read traffic; submission resumes shortly after |
| Edge control-plane incident | A few minutes — secondary DNS keeps name resolution working |
| Cloud-provider event (Tier-1 contracts) | Hours, with the customer notified at the start of the window |
Detailed per-surface SLOs and recovery objectives are listed in Operations & SLA.
Failure-mode walkthroughs
The following are not hypotheticals. Each describes a class of event the platform has been designed against; the customer-visible behavior is what matters for confidence.
Scenario A — A regional event during a hub closure
The primary region for a tenant degrades while a hub closure is already in progress. The edge's deep probe detects the failure and drains traffic to the standby region. Stateful infrastructure auto-promotes its replicated standby. Background workers in the surviving region pick up in-flight work from the durable workflow record. The operator UI shows a brief failover banner; submissions resume within a few minutes. The passenger PWA keeps rendering from its local cache through the unavailable window, then refreshes from the standby. No bookings are lost or duplicated; no passenger reservation is corrupted.
Scenario B — An edge control-plane incident
The edge provider's control plane is degraded — new rules can't be deployed and the dashboard is unreachable, but the data plane continues serving with the last known-good configuration. Health-check-driven failover continues working. Rate-limit rules continue working. The on-call falls back to secondary controls if a new threat appears during the window. Secondary DNS at an independent provider keeps name resolution working throughout. The customer sees no service interruption.
Scenario C — An external partner outage
A booking partner's API starts returning errors. The per-vendor circuit breaker counts the failures and opens within seconds; new booking attempts route to the next provider in the configured fallback chain. The asynchronous workflow engine retries against the chain transparently, with idempotency guarantees that prevent double-booking. Operators see the affected vendor flagged in the console health view. Passengers see no degradation.
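A sketch of the breaker-plus-fallback-chain behavior in Scenario C, with assumed thresholds and an assumed vendor interface; the production egress layer adds adaptive traffic shaping on top, but the control flow is the same: skip vendors whose breaker is open, try the next one in the chain, and carry the idempotency key so retries cannot double-book.

```typescript
// Illustrative sketch of a per-vendor circuit breaker with a fallback chain.
// Thresholds, vendor names, and the BookingVendor interface are assumptions.

interface BookingVendor {
  name: string;
  book: (request: unknown, idempotencyKey: string) => Promise<unknown>;
}

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  isOpen(now = Date.now()): boolean {
    return this.failures >= this.threshold && now - this.openedAt < this.cooldownMs;
  }
  recordSuccess(): void { this.failures = 0; }
  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now;
  }
}

const breakers = new Map<string, CircuitBreaker>();

/** Try each vendor in the configured chain, skipping any whose breaker is open. */
export async function bookWithFallback(
  chain: BookingVendor[],
  request: unknown,
  idempotencyKey: string,
): Promise<unknown> {
  for (const vendor of chain) {
    const breaker = breakers.get(vendor.name) ?? new CircuitBreaker();
    breakers.set(vendor.name, breaker);
    if (breaker.isOpen()) continue; // open breaker: skip this vendor entirely

    try {
      // The idempotency key makes a retried booking recognizable as a duplicate.
      const result = await vendor.book(request, idempotencyKey);
      breaker.recordSuccess();
      return result;
    } catch {
      breaker.recordFailure(); // a few of these in a row opens the breaker
    }
  }
  throw new Error("all vendors in the fallback chain are unavailable");
}
```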
These are the failure classes the platform absorbs by construction. Others surface during operations and become runbook entries, post-incident reviews, and where structural, future sections in this document.
Where to next
- How Nexa survives operational panic — the application-level resilience story for vendor surges, contention, and refresh storms.
- Architecture Overview — the high-level system topology.
- Operations & SLA — the SLOs and recovery objectives this availability model underwrites.
- Status page — live system health.