How Nexa survives operational panic

A Tier-1 hub closure is the engineering problem Nexa is built around. When weather grounds a major airport for half a day, the airline's operations workbench fans tens of thousands of disrupted passengers across the platform in a window measured in single-digit minutes. Twenty airport agents open the same flight manifest simultaneously. Every passenger reaches for their phone. External partners — global distribution systems, payment providers, notification gateways — don't lift their published rate limits just because a customer is having a bad day.

A naive system would break in five places at once: connection pools exhaust, partner APIs ban the caller, the operator UI freezes, the passenger app falls back to refresh-spamming, and inventory counts drift into permanent loss. Each of those is a real risk we've designed against, and each is the subject of a deliberate platform pattern. This article describes them at the level a customer needs to evaluate confidence — the what and why-it's-reliable, not the implementation details.

The five failure modes

| # | Failure | What goes wrong without mitigation | The pattern that prevents it |
|---|---------|------------------------------------|------------------------------|
| 1 | Operator burst pegs the API | Synchronous partner calls block the operator UI for everyone, not just the submitter | Asynchronous workflow with a durable submit path |
| 2 | Partner rate-limit cascade | A burst overruns a partner's published rate; recovery requires manual partner intervention | Per-partner adaptive traffic shaping at platform level |
| 3 | Two operators editing the same passenger | Last-writer-wins; lost decisions; duplicated work | Real-time presence-driven entity locks |
| 4 | Two operators booking the last room | Inventory accounting drifts under crash; capacity gets permanently lost | Per-attempt soft-holds with self-healing expiry |
| 5 | Passenger refresh storm | Passenger reads contend with the operator transactional path; UI slows for both | Read-side isolation behind a denormalized snapshot |

1. The asynchronous submit path

If an operator's submit blocked on Amadeus, every operator would wait. Partner APIs have realistic worst-case latencies measured in seconds; with twenty operators on the same wave, the connection pool serving operators saturates within seconds and the UI freezes for everyone: not just the operator who submitted, but every operator, including the ones still trying to load the manifest.

The platform structurally refuses to do this:

  1. The operator's submit is recorded against the passenger's case in a single durable step. The handler returns a fast acknowledgement. The connection is freed.
  2. The actual partner work — inventory search, booking, voucher issuance — runs out-of-process behind the scenes, against the partner directly.
  3. If the platform restarts mid-workflow, the durable record means in-flight work resumes from where it stopped. There is no "we updated state but lost the next step" failure mode.
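
A minimal sketch of the shape of this path, assuming a hypothetical durable `case_store` and a `work_queue` behind a workflow engine; the names and payloads are illustrative, not Nexa's actual API:

```python
# Illustrative sketch only. `case_store` and `work_queue` stand in for a
# durable store and a workflow/queue engine; none of these names are Nexa's API.
import uuid


def handle_operator_submit(case_store, work_queue, passenger_id, decision):
    """Record the operator's decision durably, then acknowledge immediately."""
    submission_id = str(uuid.uuid4())

    # 1. One durable write records the intent against the passenger's case.
    case_store.append_event(passenger_id, {
        "type": "REBOOKING_REQUESTED",
        "submission_id": submission_id,
        "decision": decision,
    })

    # 2. The partner-facing work is enqueued; this handler never calls a partner.
    work_queue.enqueue("execute_rebooking", {
        "passenger_id": passenger_id,
        "submission_id": submission_id,
    })

    # 3. Fast acknowledgement; the connection is freed in milliseconds.
    return {"status": "accepted", "submission_id": submission_id}
```

Because the intent is durable before the acknowledgement, a restart between steps never loses the submission; the workflow engine resumes it from the recorded state.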

Two structural properties matter for customer confidence:

  • The operator UI never blocks on a partner. Partner latency is metabolized by the workflow engine, not the connection pool serving operators. A partner going slow shows up as queue depth on the internal dashboard — never as a frozen UI.
  • The submit path is idempotent end-to-end. A duplicate retry of any step finds the case in its current state and proceeds correctly. Customers don't see double-bookings or double-charges from in-flight retries.
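
Continuing the sketch above, the out-of-process step that talks to the partner can be written so that a duplicate delivery or retry is a no-op; again, every name here is hypothetical:

```python
def execute_rebooking(case_store, partner_client, passenger_id, submission_id):
    """Out-of-process worker step; safe to retry or redeliver at any point."""
    case = case_store.load(passenger_id)

    # Duplicate retry: the case already reflects this submission, so do nothing.
    if case.has_confirmation_for(submission_id):
        return

    offer = partner_client.search_inventory(case.itinerary)

    # The submission id doubles as the idempotency key sent to the partner,
    # so a retried call cannot produce a second booking or a second charge.
    confirmation = partner_client.book(offer, idempotency_key=submission_id)

    case_store.append_event(passenger_id, {
        "type": "REBOOKING_CONFIRMED",
        "submission_id": submission_id,
        "confirmation": confirmation,
    })
```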

2. Per-partner traffic shaping

Every external partner publishes a rate limit. Nexa shapes traffic to that limit at the platform level, not per-process — per-process limiters break under autoscaling, where each process believes it has the full budget and the platform exceeds it the moment it scales out.

Properties of the platform's traffic shaping:

  • Platform-wide. Every replica of every worker shares the same budget. The configured rate is the actual rate, regardless of replica count.
  • Per-partner isolation. A slowdown at one partner does not slow calls to a different partner. Each fallback in the chain has its own budget.
  • Backpressure into the queue. Workers wait for budget rather than calling and getting rate-limited. The queue absorbs the surge; the autoscaler reacts to depth, never bursts the partner.
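
One way to picture a platform-wide, per-partner budget is a counter in a store shared by every replica. The sketch below uses a fixed one-second window in Redis purely as an assumed illustration; the partner names and limits are examples, not real configuration:

```python
import time

import redis

# One shared store for every worker replica, so the configured rate is the actual rate.
r = redis.Redis()

# Illustrative per-partner budgets, in requests per second.
PARTNER_LIMITS = {"amadeus": 50, "hotel_partner": 20}


def acquire_budget(partner: str) -> None:
    """Block until this call fits inside the partner's platform-wide budget."""
    limit = PARTNER_LIMITS[partner]
    while True:
        window = int(time.time())              # one-second fixed window
        key = f"rate:{partner}:{window}"
        count = r.incr(key)                    # atomic across all replicas
        if count == 1:
            r.expire(key, 2)                   # the window cleans itself up
        if count <= limit:
            return                             # budget granted, safe to call the partner
        time.sleep(0.05)                       # backpressure: wait instead of getting rate-limited
```

A worker calls `acquire_budget("amadeus")` before every outbound request; waiting here is what turns a burst into queue depth rather than a partner-side ban.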

This pattern is a release-blocker primitive: shipping to a Tier-1 customer without it would guarantee a partner banning Nexa's traffic during the first significant disruption. It is not an optimization; it is a precondition.

3. Real-time entity locks

When an operator clicks a sub-case, two failure modes need protection:

  • A second operator should not be able to start editing the same sub-case. The first operator's screen should show a lock indicator in real time.
  • If the first operator's tab closes, the lock must release fast. Stuck locks make passengers un-servable for as long as the lock survives — which during operational panic is unacceptable at any timescale beyond seconds.

Nexa's solution couples a sub-case's lock lifetime to the operator's live session, not to a wall-clock timer. While the operator is connected, the lock is held; the moment the connection drops — closed tab, network blip, evicted process, anything — the lock releases within a fraction of a second. A wall-clock fallback exists as defense-in-depth in case the platform itself fails between detection and release, but the live-session signal is the authoritative path.

A key property: the lock state propagates to every other operator in real time. An operator anywhere in the world sees the lock indicator appear on the sub-case the moment another operator opens it, and disappear the moment that operator releases it. The operator UI never gives the impression of "stuck" passengers.
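
A compressed sketch of the idea follows, with the session and broadcast plumbing (for example, a WebSocket hub) assumed rather than shown; all names are illustrative:

```python
import time


class LockRegistry:
    """Locks keyed by sub-case, owned by a live operator session."""

    def __init__(self, hub):
        self.hub = hub           # assumed: broadcasts lock events to every connected operator UI
        self.locks = {}          # sub_case_id -> (operator_id, session_id, acquired_at)

    def acquire(self, sub_case_id, operator_id, session_id):
        if sub_case_id in self.locks:
            return False         # another operator is already editing this sub-case
        self.locks[sub_case_id] = (operator_id, session_id, time.time())
        self.hub.broadcast({"event": "locked", "sub_case": sub_case_id, "by": operator_id})
        return True

    def release(self, sub_case_id):
        if self.locks.pop(sub_case_id, None) is not None:
            self.hub.broadcast({"event": "unlocked", "sub_case": sub_case_id})

    def on_session_closed(self, session_id):
        """Authoritative release path: the owning connection dropped."""
        for sub_case_id, (_, owner_session, _) in list(self.locks.items()):
            if owner_session == session_id:
                self.release(sub_case_id)

    def sweep_stale(self, max_age_seconds=120):
        """Wall-clock fallback, defense-in-depth only."""
        now = time.time()
        for sub_case_id, (_, _, acquired_at) in list(self.locks.items()):
            if now - acquired_at > max_age_seconds:
                self.release(sub_case_id)
```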

4. Per-attempt inventory soft-holds

The most subtle of the five failure modes. The naive design is to maintain a shared mutable counter — decrement when an operator starts a booking, increment back if it fails. Under crash, that approach has a fatal property:

If the booking process crashes between the decrement and the partner confirmation, the counter stays decremented forever. The room is permanently lost to the system. Operators see "sold out" on a hotel that has actual capacity. Recovery requires manual reconciliation against the partner.

Nexa avoids this entirely by never maintaining a shared mutable counter. Available inventory is computed from the live set of active hold records plus confirmed reservations. Each booking attempt acquires its own per-attempt hold record with a self-healing expiry. If the attempt succeeds, the hold becomes a confirmed reservation. If the attempt fails or the operator abandons the flow, the hold expires on its own — no compensating action required, no drift.
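
The shape of the idea in a few lines. The in-memory lists stand in for hold and reservation records in the operational store, and in the real path the availability check and the hold insert would be a single atomic operation; everything here is an illustrative sketch:

```python
import time
import uuid

HOLD_TTL_SECONDS = 120   # illustrative self-healing expiry


def active_holds(holds, room_type):
    now = time.time()
    return [h for h in holds if h["room_type"] == room_type and h["expires_at"] > now]


def available(capacity, holds, reservations, room_type):
    # Availability is always derived; there is no counter to decrement or repair.
    confirmed = sum(1 for r in reservations if r["room_type"] == room_type)
    return capacity[room_type] - confirmed - len(active_holds(holds, room_type))


def try_hold(capacity, holds, reservations, room_type):
    """Each booking attempt gets its own hold; a crashed or abandoned attempt simply expires."""
    if available(capacity, holds, reservations, room_type) <= 0:
        return None              # the loser of a race sees "sold out" immediately
    hold = {"id": str(uuid.uuid4()), "room_type": room_type,
            "expires_at": time.time() + HOLD_TTL_SECONDS}
    holds.append(hold)
    return hold


def confirm(hold, holds, reservations):
    """A successful partner booking turns the hold into a confirmed reservation."""
    holds.remove(hold)
    reservations.append({"room_type": hold["room_type"], "hold_id": hold["id"]})
```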

The customer-visible properties:

  • No permanent capacity loss under crash. A failed attempt's capacity is restored automatically.
  • When two operators legitimately compete for the last room, one wins immediately and the other receives an instant "sold out." There is no race window where both could think they had the room.
  • Operator-abandoned flows clean up after themselves. No reconciliation jobs are needed.

Inventory protection is structural rather than dependent on keeping a counter in sync, an important distinction when a tenant's auditors ask how the platform prevents double-booking under failure.

5. Read-side isolation for passengers

A disruption is also an event for the passenger side. The airline sends an SMS link; passengers tap it; then they refresh, and refresh, and refresh — every few seconds, every passenger, all at once. If the mobile app queries the same database the operator UI does, passenger refresh traffic competes with operator transactional work and slows both.

Nexa isolates passenger traffic with a deliberate read-side / write-side split:

  • The operator side performs all transactional work — opening cases, allocating rooms, issuing vouchers — against an operational store sized for transactional consistency.
  • The passenger side reads from a separate, denormalized snapshot that the operator-side workflow publishes to whenever a passenger's situation changes.
  • The passenger app only ever reads from the snapshot. It never touches the operational store.
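
A sketch of the two sides of the split, with `snapshot_store` as a hypothetical document store and the field names purely illustrative:

```python
import json
import time


def publish_passenger_snapshot(snapshot_store, case):
    """Write side: called by the operator-side workflow whenever a passenger's situation changes."""
    doc = {
        "passenger_id": case.passenger_id,
        "status": case.status,                 # e.g. "rebooked", "awaiting_choice"
        "new_flight": case.new_flight,
        "hotel_voucher": case.hotel_voucher,
        "published_at": time.time(),
    }
    snapshot_store.put(case.passenger_id, json.dumps(doc))


def get_passenger_view(snapshot_store, passenger_id):
    """Read side: the only query the passenger app ever makes."""
    raw = snapshot_store.get(passenger_id)
    return json.loads(raw) if raw else {"status": "pending"}
```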

Properties:

  • Tens of thousands of passenger refreshes per disruption cause zero contention with the operator UI. The two surfaces are physically isolated.
  • Eventual consistency between operator-side state and passenger-side reads. The lag is sub-second under normal conditions and a few seconds under burst — well within the customer-facing freshness target. Passengers are consuming information, not transacting in tight loops; eventual consistency is the right trade.
  • Passenger writes still go through the workflow. When a passenger taps Accept or Decline, the request publishes an intent that the operator-side workflow consumes, with the same idempotency and durability guarantees as operator-initiated changes. The snapshot reflects the result on the next refresh.
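
The Accept/Decline path could then look like the sketch below, reusing the same hypothetical `work_queue` as the operator submit sketch; `request_id` is an illustrative deduplication token supplied by the app:

```python
def handle_passenger_choice(work_queue, passenger_id, offer_id, choice, request_id):
    """choice is 'accept' or 'decline'; request_id deduplicates retries from the app."""
    work_queue.enqueue("apply_passenger_choice", {
        "passenger_id": passenger_id,
        "offer_id": offer_id,
        "choice": choice,
        "idempotency_key": request_id,   # same durability and idempotency path as operator submits
    })
    # The snapshot reflects the outcome on the passenger's next refresh.
    return {"status": "accepted"}
```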

Putting it together

A Tier-1 hub closure with thousands of disrupted passengers and dozens of operators looks, from inside the platform, like an unusually busy day. The patterns above are not optimizations layered on after first-customer GA; each one is a precondition for that GA. Together they make the worst day in the airline's quarter a non-event in Nexa's metrics.

| Surface | Load shape | What absorbs it |
|---------|------------|-----------------|
| Operator submit | Thousands per hour, bursty | Fast acknowledgement; partner work runs in the workflow engine |
| Partner calls | Capped at each partner's published rate | Platform-wide traffic shaping; queue depth absorbs bursts |
| Two operators on the same passenger | Real-time lock indicator across the cluster | Presence-driven entity locks |
| Two operators competing for the last room | One wins, the other sees "sold out" instantly | Per-attempt soft-holds |
| Passenger refreshes (tens of thousands in a short window) | Read-only against an isolated, denormalized snapshot | Read-side isolation |

The platform doesn't avoid operational panic. It is engineered for it.
