
Operations & SLA

Nexa is mission-critical for airlines during hub-shutdown events. This page documents the operational commitments behind the platform: support tiers, incident response, throughput, data freshness, and recovery objectives.

Support model

Nexa offers three tiers of support, mapped per customer at contract time.

| Tier | Hours | Channels | First response (Sev-1) | First response (Sev-2) | First response (Sev-3) |
| --- | --- | --- | --- | --- | --- |
| Tier-1 (24/7 follow-the-sun) | 24×7 | PagerDuty + Slack Connect + war-room bridge | 15 min | 1 h | next business day |
| Tier-2 (business hours) | 09:00–19:00 local | Slack Connect + email | n/a | 4 h | next business day |
| Tier-3 (engineering escalation) | On-call | Internal paging | 30 min | n/a | n/a |

Each Tier-1 carrier gets a named incident commander on the Nexa side and a named primary contact on the airline side. Contact rotations are kept current and exercised quarterly.

Incident severity matrix

| Severity | Definition | MTTR target | Customer paging |
| --- | --- | --- | --- |
| Sev-1 | Operator UI down OR booking pipeline stalled OR PII leak suspected | 2 h | Yes — at detection |
| Sev-2 | Single-domain degradation (e.g., wallet webhook lag > 5 min) without operator-UI impact | 4 h | Yes — within first-response window |
| Sev-3 | Cosmetic / non-blocking | next business day | No |

Declared-incident protocol

  1. PagerDuty fires. On-call acknowledges within target window.
  2. Incident commander opens the war room: a Slack Connect channel plus a bridge with the airline contact. Initial status is posted within the first-response window.
  3. Status updates. Every 15 min for Sev-1; every 60 min for Sev-2.
  4. Resolution criteria are documented before close. No "looks fine now."
  5. Post-incident review within 5 business days for Sev-1, 10 for Sev-2. The PIR is delivered in writing to the airline contact.

Service Level Objectives (SLOs)

| Surface | SLI | SLO target |
| --- | --- | --- |
| Operator UI — case load | p99 latency | < 1.5 s |
| Operator UI — lock acquire | p99 latency | < 200 ms |
| Booking — scatter-gather search | p95 wall time | 2.0 s (time-boxed) |
| Booking — reservation confirmed | end-to-end p95 | < 90 s under steady-state load |
| Wallet — card issuance | end-to-end p95 | < 120 s |
| Notifications — disruption-to-first-message | p90 | ≤ 30 s |
| Audit — outbox to warm tier | p95 lag | < 5 s |
| Flight predictor — disruption forecast freshness | p95 age | < 5 min |
| Snapshot DB (passenger BFF) — read freshness vs. last write | p95 age | < 10 s |

Each SLO carries an associated error budget. Sustained budget burn over a rolling window auto-fires a Sev-2 incident.
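
As an illustration of how a burn-rate check could open that Sev-2, here is a minimal sketch. The multi-window rule, the 14.4× threshold, and every name in it are assumptions for explanation, not Nexa's published implementation:

```typescript
interface SloWindow {
  good: number;  // requests that met the SLI target in this window
  total: number; // all requests in this window
}

const SLO_TARGET = 0.999; // e.g. 99.9% of lock acquires under 200 ms

// A burn rate of 1.0 means the error budget is being consumed exactly
// at the rate the SLO allows; higher means faster.
function burnRate(w: SloWindow): number {
  const errorRate = w.total === 0 ? 0 : (w.total - w.good) / w.total;
  return errorRate / (1 - SLO_TARGET);
}

// Classic multi-window rule: page only when both a long and a short
// window burn fast, so a single transient blip never opens a Sev-2.
function shouldOpenSev2(oneHour: SloWindow, fiveMin: SloWindow): boolean {
  const FAST_BURN = 14.4; // assumed threshold, per common SRE practice
  return burnRate(oneHour) > FAST_BURN && burnRate(fiveMin) > FAST_BURN;
}
```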

Capacity & throughput

The platform is engineered for Tier-1 hub-closure scale. Reference benchmarks:

| Workload | Target throughput |
| --- | --- |
| Operator submissions | 60 / minute / case (300 sustained per hub) |
| Vendor egress (per vendor) | Per published TPS, cluster-wide; typically 5–10 TPS for Amadeus / Hotelbeds / Pomelo |
| Passenger snapshot reads | 10,000 RPS sustained per tenant, 50,000 RPS burst |
| Saga end-to-end throughput | Disruption-to-resolved p95 < 5 minutes (single-leg booking + wallet) |
| Manifest ingest | 5,000 PNRs / minute per tenant |

The platform's resilience patterns (Operational Panic) make these throughput numbers achievable without overrunning vendor rate limits; the token-bucket pattern is sketched after the list:

  • Async saga + token bucket → never exceed vendor TPS.
  • CQRS + snapshot DB → passenger reads don't compete with operator writes.
  • Per-attempt soft-holds → no inventory drift under crash.
  • WebSocket-presence locks → no zombie locks, no operator UI freezes.
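
A minimal in-process token-bucket sketch for the vendor-egress limit. The 5 TPS rate, burst size, and class names are illustrative, and the capacity table above specifies cluster-wide limits, which would require a shared counter (e.g., a distributed store) rather than this single-process variant:

```typescript
// Token-bucket limiter for vendor egress (illustrative values; not the
// platform's real configuration).
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private ratePerSec: number, private burst: number) {
    this.tokens = burst;
  }

  tryAcquire(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at the burst size.
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    // Caller re-queues the saga step instead of hammering the vendor.
    return false;
  }
}

// e.g. 5 TPS toward Amadeus, allowing short bursts of 10
const amadeusBucket = new TokenBucket(5, 10);
```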

Recovery objectives

| Surface | RPO | RTO |
| --- | --- | --- |
| Operational store (cases, booking, wallet) | < 1 min | < 30 min |
| Coordination layer (locks, soft-holds) | n/a (transient, self-healing) | < 5 min |
| Read-side snapshot store | < 5 min (rebuildable from durable workflow record) | < 30 min |
| Audit log — warm tier | < 5 min | < 60 min |
| Audit log — cold tier | < 1 day | < 4 h |
| Workflow bus | < 1 min | < 30 min |

Rebuild-from-workflow-record: the read-side snapshot and most downstream projections can be fully rebuilt from the platform's durable workflow record. Retention exceeds any plausible recovery scenario.
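
A sketch of what that rebuild could look like, assuming a hypothetical event shape and an idempotent projection store; none of these names come from the platform's actual schema:

```typescript
interface WorkflowEvent {
  caseId: string;
  seq: number;  // monotonically increasing per case; replay is in seq order
  type: string; // e.g. "BOOKING_CONFIRMED", "WALLET_ISSUED"
  payload: unknown;
}

interface SnapshotStore {
  apply(event: WorkflowEvent): Promise<void>; // idempotent upsert
}

async function rebuildSnapshot(
  readEventsFrom: (afterSeq: number) => AsyncIterable<WorkflowEvent>,
  store: SnapshotStore,
): Promise<void> {
  // Start from zero: the snapshot is a pure projection, so a full replay
  // reproduces it exactly. Idempotent applies make the rebuild restartable
  // after a crash mid-replay.
  for await (const event of readEventsFrom(0)) {
    await store.apply(event);
  }
}
```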

Maintenance windows

Routine maintenance is performed during agreed-upon windows per tenant — typically a 2-hour window in the lowest-traffic regional period. Tenants receive 14 days' notice for routine maintenance and 48 hours' notice for emergency maintenance.

Zero-downtime deployments are the norm — most maintenance windows are notification-only, and the platform stays live throughout.

Data freshness during a disruption

Under a Tier-1 hub closure, the platform commits:

  • Disruption-detected to case-opened: < 60 seconds.
  • Manifest fetch: < 30 seconds for ≤ 200 PNRs; < 5 minutes for ≤ 5,000.
  • Booking saga end-to-end: < 5 minutes p95 (single leg).
  • Passenger SMS dispatched: < 30 seconds after OFFER_READY.
  • Snapshot read freshness: < 10 seconds p95 vs. operational state.

These commitments hold during burst load — they are the engineered floor, not the typical-day target.
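
One way to verify the snapshot-freshness number end to end is a probe that writes a timestamped marker through the operational path and times how long the read side takes to serve it. The hosts, routes, and marker shape below are placeholders, not real Nexa endpoints:

```typescript
// Hypothetical freshness probe: write a marker via the operational path,
// then poll the passenger snapshot path until it appears.
async function probeSnapshotFreshness(tenant: string): Promise<number> {
  const writtenAt = Date.now();
  await fetch(`https://ops.internal.example/${tenant}/freshness-marker`, {
    method: "PUT",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ writtenAt }),
  });

  const deadline = writtenAt + 60_000; // give up after one minute
  while (Date.now() < deadline) {
    const res = await fetch(
      `https://snapshot.internal.example/${tenant}/freshness-marker`,
    );
    const body = (await res.json()) as { writtenAt: number };
    if (body.writtenAt === writtenAt) return Date.now() - writtenAt; // lag in ms
    await new Promise((r) => setTimeout(r, 250)); // poll every 250 ms
  }
  throw new Error("marker never became visible on the read side");
}
```

Aggregating the measured lags into a p95 gives a number directly comparable to the < 10 s commitment above.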

Escalation paths

| Issue | First contact | Escalation |
| --- | --- | --- |
| Operator UI bug | Tier-1 ticket / Slack | Tier-3 engineering on-call |
| Vendor outage (Amadeus / Hotelbeds / Pomelo) | Tier-1 incident commander | Vendor-side ops + Nexa platform on-call |
| Tenant-onboarding question | Customer Success | Solutions engineering |
| Security concern | security@nexa.ai | CISO + incident commander |
| Compliance / audit request | compliance@nexa.ai | CTO + legal |

Status & telemetry visibility

  • Status page: status.nexa.ai. Every domain's health is published there with a 1-minute refresh cadence.
  • Incident history: 90 days, public.
  • Customer dashboards (per-tenant): in the operations console, every operator sees per-domain health, vendor circuit-breaker state, and queue depth. Live data, not summarized.

Audit log retention

| Tier | Retention | Use |
| --- | --- | --- |
| Hot (queryable from operator console) | 90 days | Operator forensics, customer support |
| Warm (queryable via API) | 12 months | Compliance investigations |
| Cold (archival) | 7 years | Regulatory retention (LGPD / GDPR / ANAC) |

Every audit row carries the W3C correlation URN, before/after snapshots, and the actor URN. Cold-tier rows include cryptographic hash-chain signatures for tamper-evidence.
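
As a sketch of how a hash chain yields tamper-evidence (SHA-256 and the field names here are assumptions; this page does not specify the actual signature scheme), each row's hash covers its content plus the previous row's hash, so altering any row invalidates every hash after it:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit-row shape, not the real schema.
interface AuditRow {
  correlationUrn: string;
  actorUrn: string;
  before: unknown;
  after: unknown;
}

function chainHash(prevHash: string, row: AuditRow): string {
  // Real implementations would use a canonical serialization so that
  // key ordering cannot change the hash.
  return createHash("sha256")
    .update(prevHash)
    .update(JSON.stringify(row))
    .digest("hex");
}

// Verification walks the chain and recomputes each link; any edited or
// deleted row breaks every hash from that point forward.
function verifyChain(rows: AuditRow[], hashes: string[], genesis: string): boolean {
  let prev = genesis;
  for (let i = 0; i < rows.length; i++) {
    prev = chainHash(prev, rows[i]);
    if (prev !== hashes[i]) return false; // tamper detected at row i
  }
  return true;
}
```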
