Operations & SLA
Nexa is mission-critical for an airline during a hub-shutdown event. This page documents the operational commitments behind the platform: support tiers, incident response, throughput, data freshness, and recovery objectives.
Support model
Support is offered in three tiers, mapped per customer at contract time.
| Tier | Hours | Channels | First response (Sev-1) | First response (Sev-2) | First response (Sev-3) |
|---|---|---|---|---|---|
| Tier-1 (24/7 follow-the-sun) | 24×7 | PagerDuty + Slack Connect + war-room bridge | 15 min | 1 h | next business day |
| Tier-2 (business hours) | 09:00–19:00 local | Slack Connect + email | n/a | 4 h | next business day |
| Tier-3 (engineering escalation) | On-call | Internal paging | 30 min | n/a | n/a |
Each Tier-1 carrier gets a named incident commander on the Nexa side and a named primary contact on the airline side. Contact rotations are kept current and exercised quarterly.
Incident severity matrix
| Severity | Definition | MTTR target | Customer paging |
|---|---|---|---|
| Sev-1 | Operator UI down OR booking pipeline stalled OR PII leak suspected | 2 h | Yes — at detection |
| Sev-2 | Single-domain degradation (e.g., wallet webhook lag > 5 min) without operator-UI impact | 4 h | Yes — within first response window |
| Sev-3 | Cosmetic / non-blocking | next business day | No |
Declared-incident protocol
- PagerDuty fires. On-call acknowledges within target window.
- Incident commander opens war-room. Slack Connect channel + bridge with the airline contact; initial status posted within first-response window.
- Status updates. Every 15 min for Sev-1; every 60 min for Sev-2.
- Resolution criteria are documented before close. No "looks fine now."
- Post-incident review within 5 business days for Sev-1, 10 for Sev-2. The PIR is delivered in writing to the airline contact.
Service Level Objectives (SLOs)
| Surface | SLI | SLO target |
|---|---|---|
| Operator UI — case load | p99 latency | < 1.5 s |
| Operator UI — lock acquire | p99 latency | < 200 ms |
| Booking — scatter-gather search | p95 wall time | 2.0 s (time-boxed) |
| Booking — reservation confirmed | end-to-end p95 | < 90 s under steady-state load |
| Wallet — card issuance | end-to-end p95 | < 120 s |
| Notifications — disruption-to-first-message | p90 | ≤ 30 s |
| Audit — outbox to warm tier | p95 lag | < 5 s |
| Flight predictor — disruption forecast freshness | p95 age | < 5 min |
| Snapshot DB (passenger BFF) — read freshness vs. last write | p95 age | < 10 s |
Each SLO carries an associated error budget. Exhausting the budget over its rolling window automatically fires a Sev-2 incident.
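The error-budget mechanics can be sketched as follows. This is an illustrative model only: the `Slo` class, field names, and the 100%-burn threshold are assumptions, not the platform's actual alerting configuration.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    target: float        # e.g. 0.999 -> 99.9% of events must meet the SLI bound
    window_events: int   # total events observed in the rolling window

    @property
    def error_budget(self) -> int:
        # Number of "bad" events the window tolerates before the SLO is breached.
        return int(self.window_events * (1.0 - self.target))

def budget_burned(bad_events: int, slo: Slo) -> float:
    """Fraction of the rolling-window error budget consumed (can exceed 1.0)."""
    return bad_events / max(slo.error_budget, 1)

def should_fire_sev2(bad_events: int, slo: Slo, threshold: float = 1.0) -> bool:
    # Auto-fire a Sev-2 once the window's budget is fully burned.
    return budget_burned(bad_events, slo) >= threshold

slo = Slo(target=0.999, window_events=1_000_000)  # budget = 1,000 bad events
assert slo.error_budget == 1000
assert not should_fire_sev2(500, slo)   # 50% burned: no page yet
assert should_fire_sev2(1200, slo)      # over budget: Sev-2 fires
```

A real deployment would typically alert on burn *rate* across multiple windows rather than a single threshold, but the budget arithmetic is the same.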
Capacity & throughput
The platform is engineered for Tier-1 hub-closure scale. Reference benchmarks:
| Workload | Target |
|---|---|
| Operator submissions | 60 / minute / case (300 sustained per hub) |
| Vendor egress (per vendor) | Published vendor TPS, enforced cluster-wide; typically 5–10 TPS for Amadeus / Hotelbeds / Pomelo |
| Passenger snapshot reads | 10,000 RPS sustained per tenant; 50,000 RPS burst |
| Saga end-to-end | Disruption-to-resolved p95 < 5 minutes (single-leg booking + wallet) |
| Manifest ingest | 5,000 PNRs / minute per tenant |
The platform's resilience patterns (Operational Panic) make these throughput numbers achievable without overrunning vendor rate limits:
- Async saga + token bucket → never exceed vendor TPS.
- CQRS + snapshot DB → passenger reads don't compete with operator writes.
- Per-attempt soft-holds → no inventory drift under crash.
- WebSocket-presence locks → no zombie locks, no operator UI freezes.
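The first bullet, "async saga + token bucket → never exceed vendor TPS," can be sketched as a classic token bucket. This is a minimal illustration of the pattern, not the platform's code; the class, its injected clock, and the example rates are assumptions.

```python
class TokenBucket:
    """Caps vendor egress at a published TPS; refills continuously."""

    def __init__(self, rate_tps: float, burst: int, now: float = 0.0):
        self.rate = rate_tps          # refill rate = vendor's published TPS
        self.capacity = burst         # max tokens held (burst allowance)
        self.tokens = float(burst)
        self.last = now               # callers pass a monotonic clock

    def try_acquire(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # caller re-queues the saga step instead of calling the vendor

bucket = TokenBucket(rate_tps=10, burst=5, now=0.0)    # e.g. a 10-TPS vendor cap
assert all(bucket.try_acquire(0.0) for _ in range(5))  # burst drains instantly
assert not bucket.try_acquire(0.0)                     # over TPS: step deferred, vendor untouched
assert bucket.try_acquire(0.1)                         # 100 ms later, one token has refilled
```

Because the saga is asynchronous, a `False` result just re-queues the step; nothing is dropped, and the vendor never sees traffic above its published rate.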
Recovery objectives
| Surface | RPO | RTO |
|---|---|---|
| Operational store (cases, booking, wallet) | < 1 min | < 30 min |
| Coordination layer (locks, soft-holds) | n/a (transient, self-healing) | < 5 min |
| Read-side snapshot store | < 5 min (rebuildable from durable workflow record) | < 30 min |
| Audit log — warm tier | < 5 min | < 60 min |
| Audit log — cold tier | < 1 day | < 4 h |
| Workflow bus | < 1 min | < 30 min |
Rebuild-from-workflow-record: the read-side snapshot and most downstream projections can be fully rebuilt from the platform's durable workflow record, whose retention comfortably exceeds any plausible recovery window.
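Rebuild-from-workflow-record is an event-replay fold: stream the durable record in order and re-derive the projection. The sketch below is illustrative; the event names, fields, and snapshot shape are assumptions, not the platform's actual schema.

```python
def rebuild_snapshot(events):
    """Fold an ordered workflow record into a passenger-facing read snapshot."""
    snapshot = {}
    for ev in events:  # events are assumed strictly ordered per case
        case = snapshot.setdefault(ev["case_id"], {"status": "open", "bookings": []})
        if ev["type"] == "BOOKING_CONFIRMED":
            case["bookings"].append(ev["pnr"])
        elif ev["type"] == "CASE_RESOLVED":
            case["status"] = "resolved"
        # Unknown event types are skipped: projections only consume what they need.
    return snapshot

record = [
    {"case_id": "c1", "type": "BOOKING_CONFIRMED", "pnr": "ABC123"},
    {"case_id": "c1", "type": "CASE_RESOLVED"},
]
snap = rebuild_snapshot(record)
assert snap["c1"] == {"status": "resolved", "bookings": ["ABC123"]}
```

Because the fold is deterministic, a lost snapshot store is an availability problem, not a data-loss problem: replaying the record reproduces the same state, which is what keeps the read-side RPO bounded.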
Maintenance windows
Routine maintenance is performed during agreed-upon windows per tenant — typically a 2-hour window in the lowest-traffic regional period. Tenants receive 14 days' notice for routine maintenance and 48 hours' notice for emergency maintenance.
Zero-downtime deployments are the norm — most maintenance windows are notification-only, and the platform stays live throughout.
Data freshness during a disruption
Under a Tier-1 hub closure, the platform commits to:
- Disruption-detected to case-opened: < 60 seconds.
- Manifest fetch: < 30 seconds for ≤ 200 PNRs; < 5 minutes for ≤ 5,000.
- Booking saga end-to-end: < 5 minutes p95 (single leg).
- Passenger SMS dispatched: < 30 seconds after OFFER_READY.
- Snapshot read freshness: < 10 seconds p95 vs. operational state.
These commitments hold during burst load — they are the engineered floor, not the typical-day target.
Escalation paths
| Issue | First contact | Escalation |
|---|---|---|
| Operator UI bug | Tier-1 ticket / Slack | Tier-3 engineering on-call |
| Vendor outage (Amadeus / Hotelbeds / Pomelo) | Tier-1 incident commander | Vendor-side ops + Nexa platform on-call |
| Tenant-onboarding question | Customer Success | Solutions engineering |
| Security concern | security@nexa.ai | CISO + incident commander |
| Compliance / audit request | compliance@nexa.ai | CTO + legal |
Status & telemetry visibility
- Status page: status.nexa.ai. Every domain's health is published there with a 1-minute refresh cadence.
- Incident history: 90 days, public.
- Customer dashboards (per-tenant): in the operations console, every operator sees per-domain health, vendor circuit-breaker state, and queue depth. Live data, not summarized.
Audit log retention
| Tier | Retention | Use |
|---|---|---|
| Hot (queryable from operator console) | 90 days | Operator forensics, customer support |
| Warm (queryable via API) | 12 months | Compliance investigations |
| Cold (archival) | 7 years | Regulatory retention (LGPD / GDPR / ANAC) |
Every audit row carries the W3C correlation URN, before/after snapshots, and the actor URN. Cold-tier rows include cryptographic hash-chain signatures for tamper-evidence.
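The hash-chain idea behind the cold-tier tamper evidence can be sketched as follows: each row's hash covers its payload plus the previous row's hash, so altering any row invalidates every hash after it. The field names and SHA-256 choice here are illustrative assumptions, not the platform's concrete format.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first row

def chain(rows):
    """Append prev/hash fields so each row is linked to its predecessor."""
    prev, out = GENESIS, []
    for row in rows:
        payload = json.dumps(row, sort_keys=True).encode()
        h = hashlib.sha256(prev.encode() + payload).hexdigest()
        out.append({**row, "prev": prev, "hash": h})
        prev = h
    return out

def verify(chained):
    """Recompute every link; any edited row breaks the chain from that point on."""
    prev = GENESIS
    for row in chained:
        body = {k: v for k, v in row.items() if k not in ("prev", "hash")}
        payload = json.dumps(body, sort_keys=True).encode()
        if row["prev"] != prev:
            return False
        if row["hash"] != hashlib.sha256(prev.encode() + payload).hexdigest():
            return False
        prev = row["hash"]
    return True

rows = chain([{"actor": "urn:op:42", "action": "CASE_CLOSE"}])
assert verify(rows)
rows[0]["action"] = "CASE_REOPEN"  # tamper with the archived row
assert not verify(rows)
```

A production scheme would additionally sign checkpoint hashes so an attacker cannot simply re-chain after editing, but the linked-hash structure above is what makes silent in-place edits detectable.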
Where to next
- Compliance — GDPR, LGPD, ISO 27001, SOC 2 mapping.
- Operational Panic — how the platform survives hub-closure load.
- Status page — live system health.