Operations & SLA
Nexa is mission-critical for an airline during a hub-shutdown event. This page documents the operational commitments behind the platform: support tiers, incident response, throughput, data freshness, and recovery objectives.
Support model
Support is offered in three tiers, mapped per customer at contract time.
| Tier | Hours | Channels | First response (Sev-1) | First response (Sev-2) | First response (Sev-3) |
|---|---|---|---|---|---|
| Tier-1 (24/7 follow-the-sun) | 24×7 | PagerDuty + Slack Connect + war-room bridge | 15 min | 1 h | next business day |
| Tier-2 (business hours) | 09:00–19:00 local | Slack Connect + email | n/a | 4 h | next business day |
| Tier-3 (engineering escalation) | On-call | Internal paging | 30 min | n/a | n/a |
Each Tier-1 carrier gets a named incident commander on the Nexa side and a named primary contact on the airline side. Contact rotations are kept current and exercised quarterly.
Incident severity matrix
| Severity | Definition | MTTR target | Customer paging |
|---|---|---|---|
| Sev-1 | Operator UI down OR booking pipeline stalled OR PII leak suspected | 2 h | Yes — at detection |
| Sev-2 | Single-domain degradation (e.g., wallet webhook lag > 5 min) without operator-UI impact | 4 h | Yes — within first response window |
| Sev-3 | Cosmetic / non-blocking | next business day | No |
Declared-incident protocol
- PagerDuty fires. On-call acknowledges within target window.
- Incident commander opens war-room. Slack Connect channel + bridge with the airline contact; initial status posted within first-response window.
- Status updates. Every 15 min for Sev-1; every 60 min for Sev-2.
- Resolution criteria are documented before close. No "looks fine now."
- Post-incident review within 5 business days for Sev-1, 10 for Sev-2. The PIR is delivered in writing to the airline contact.
Service Level Objectives (SLOs)
| Surface | SLI | SLO target |
|---|---|---|
| Operator UI — case load | p99 latency | < 1.5 s |
| Operator UI — lock acquire | p99 latency | < 200 ms |
| Booking — scatter-gather search | p95 wall time | 2.0 s (time-boxed) |
| Booking — reservation confirmed | end-to-end p95 | < 90 s under steady-state load |
| Wallet — card issuance | end-to-end p95 | < 120 s |
| Notifications — disruption-to-first-message | p90 | ≤ 30 s |
| Audit — outbox to warm tier | p95 lag | < 5 s |
| Flight predictor — disruption forecast freshness | p95 age | < 5 min |
| Snapshot DB (passenger BFF) — read freshness vs. last write | p95 age | < 10 s |
Each SLO carries an associated error budget. Exhausting the budget over its rolling window automatically fires a Sev-2 incident.
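The error-budget mechanics can be sketched as follows. This is an illustrative model only: the `Slo` class, field names, and the 100%-burn threshold are assumptions, not the platform's actual alerting configuration.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    target: float        # e.g. 0.999 -> 99.9% of events must meet the SLI bound
    window_events: int   # total events observed in the rolling window

    @property
    def error_budget(self) -> int:
        # Number of "bad" events the window tolerates before the SLO is breached.
        return int(self.window_events * (1.0 - self.target))

def budget_burned(bad_events: int, slo: Slo) -> float:
    """Fraction of the rolling-window error budget consumed (can exceed 1.0)."""
    return bad_events / max(slo.error_budget, 1)

def should_fire_sev2(bad_events: int, slo: Slo, threshold: float = 1.0) -> bool:
    # Auto-fire a Sev-2 once the window's budget is fully burned.
    return budget_burned(bad_events, slo) >= threshold

slo = Slo(target=0.999, window_events=1_000_000)  # budget = 1,000 bad events
assert slo.error_budget == 1000
assert not should_fire_sev2(500, slo)   # 50% burned: no page yet
assert should_fire_sev2(1200, slo)      # over budget: Sev-2 fires
```

A real deployment would typically alert on burn *rate* across multiple windows rather than a single threshold, but the budget arithmetic is the same.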
Capacity & throughput
The platform is engineered for Tier-1 hub-closure scale. Reference benchmarks:
| Workload | Target |
|---|---|
| Operator submissions | 60 / minute / case (300 sustained per hub) |
| Vendor egress (per vendor) | Published vendor TPS, enforced cluster-wide; typically 5–10 TPS for Amadeus / Hotelbeds / Pomelo |
| Passenger snapshot reads | 10,000 RPS sustained per tenant; 50,000 RPS burst |
| Saga end-to-end | Disruption-to-resolved p95 < 5 minutes (single-leg booking + wallet) |
| Manifest ingest | 5,000 PNRs / minute per tenant |
The platform's resilience patterns (Operational Panic) make these throughput numbers achievable without overrunning vendor rate limits:
- Async saga + token bucket → never exceed vendor TPS.
- CQRS + snapshot DB → passenger reads don't compete with operator writes.
- Per-attempt soft-holds → no inventory drift under crash.
- WebSocket-presence locks → no zombie locks, no operator UI freezes.
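The first bullet, "async saga + token bucket → never exceed vendor TPS," can be sketched as a classic token bucket. This is a minimal illustration of the pattern, not the platform's code; the class, its injected clock, and the example rates are assumptions.

```python
class TokenBucket:
    """Caps vendor egress at a published TPS; refills continuously."""

    def __init__(self, rate_tps: float, burst: int, now: float = 0.0):
        self.rate = rate_tps          # refill rate = vendor's published TPS
        self.capacity = burst         # max tokens held (burst allowance)
        self.tokens = float(burst)
        self.last = now               # callers pass a monotonic clock

    def try_acquire(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # caller re-queues the saga step instead of calling the vendor

bucket = TokenBucket(rate_tps=10, burst=5, now=0.0)    # e.g. a 10-TPS vendor cap
assert all(bucket.try_acquire(0.0) for _ in range(5))  # burst drains instantly
assert not bucket.try_acquire(0.0)                     # over TPS: step deferred, vendor untouched
assert bucket.try_acquire(0.1)                         # 100 ms later, one token has refilled
```

Because the saga is asynchronous, a `False` result just re-queues the step; nothing is dropped, and the vendor never sees traffic above its published rate.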
Recovery objectives
| Surface | RPO | RTO |
|---|---|---|
| Operational store (cases, booking, wallet) | < 1 min | < 30 min |
| Coordination layer (locks, soft-holds) | n/a (transient, self-healing) | < 5 min |
| Read-side snapshot store | < 5 min (rebuildable from durable workflow record) | < 30 min |
| Audit log — warm tier | < 5 min | < 60 min |
| Audit log — cold tier | < 1 day | < 4 h |
| Workflow bus | < 1 min | < 30 min |
Rebuild-from-workflow-record: the read-side snapshot and most downstream projections can be fully rebuilt from the platform's durable workflow record, whose retention comfortably exceeds any plausible recovery window.
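Rebuild-from-workflow-record is an event-replay fold: stream the durable record in order and re-derive the projection. The sketch below is illustrative; the event names, fields, and snapshot shape are assumptions, not the platform's actual schema.

```python
def rebuild_snapshot(events):
    """Fold an ordered workflow record into a passenger-facing read snapshot."""
    snapshot = {}
    for ev in events:  # events are assumed strictly ordered per case
        case = snapshot.setdefault(ev["case_id"], {"status": "open", "bookings": []})
        if ev["type"] == "BOOKING_CONFIRMED":
            case["bookings"].append(ev["pnr"])
        elif ev["type"] == "CASE_RESOLVED":
            case["status"] = "resolved"
        # Unknown event types are skipped: projections only consume what they need.
    return snapshot

record = [
    {"case_id": "c1", "type": "BOOKING_CONFIRMED", "pnr": "ABC123"},
    {"case_id": "c1", "type": "CASE_RESOLVED"},
]
snap = rebuild_snapshot(record)
assert snap["c1"] == {"status": "resolved", "bookings": ["ABC123"]}
```

Because the fold is deterministic, a lost snapshot store is an availability problem, not a data-loss problem: replaying the record reproduces the same state, which is what keeps the read-side RPO bounded.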
Maintenance windows
Routine maintenance is performed during agreed-upon windows per tenant — typically a 2-hour window in the lowest-traffic regional period. Tenants receive 14 days' notice for routine maintenance and 48 hours' notice for emergency maintenance.
Zero-downtime deployments are the norm — most maintenance windows are notification-only, and the platform stays live throughout.
Data freshness during a disruption
Under a Tier-1 hub closure, the platform commits to:
- Disruption-detected to case-opened: < 60 seconds.
- Manifest fetch: < 30 seconds for ≤ 200 PNRs; < 5 minutes for ≤ 5,000.
- Booking saga end-to-end: < 5 minutes p95 (single leg).
- Passenger SMS dispatched: < 30 seconds after OFFER_READY.
- Snapshot read freshness: < 10 seconds p95 vs. operational state.
These commitments hold during burst load — they are the engineered floor, not the typical-day target.
Escalation paths
| Issue | First contact | Escalation |
|---|---|---|
| Operator UI bug | Tier-1 ticket / Slack | Tier-3 engineering on-call |
| Vendor outage (Amadeus / Hotelbeds / Pomelo) | Tier-1 incident commander | Vendor-side ops + Nexa platform on-call |
| Tenant-onboarding question | Customer Success | Solutions engineering |
| Security concern | security@nexa.ai | CISO + incident commander |
| Compliance / audit request | compliance@nexa.ai | CTO + legal |
Status & telemetry visibility
- Status page: status.nexa.ai. Every domain's health is published there with a 1-minute refresh cadence.
- Incident history: 90 days, public.
- Customer dashboards (per-tenant): in the operations console, every operator sees per-domain health, vendor circuit-breaker state, and queue depth. Live data, not summarized.
Audit log retention
| Tier | Retention | Use |
|---|---|---|
| Hot (queryable from operator console) | 90 days | Operator forensics, customer support |
| Warm (queryable via API) | 12 months | Compliance investigations |
| Cold (archival) | 7 years | Regulatory retention (LGPD / GDPR / ANAC) |
Every audit row carries the W3C correlation URN, before/after snapshots, and the actor URN. Cold-tier rows include cryptographic hash-chain signatures for tamper-evidence.
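The hash-chain idea behind the cold-tier tamper evidence can be sketched as follows: each row's hash covers its payload plus the previous row's hash, so altering any row invalidates every hash after it. The field names and SHA-256 choice here are illustrative assumptions, not the platform's concrete format.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first row

def chain(rows):
    """Append prev/hash fields so each row is linked to its predecessor."""
    prev, out = GENESIS, []
    for row in rows:
        payload = json.dumps(row, sort_keys=True).encode()
        h = hashlib.sha256(prev.encode() + payload).hexdigest()
        out.append({**row, "prev": prev, "hash": h})
        prev = h
    return out

def verify(chained):
    """Recompute every link; any edited row breaks the chain from that point on."""
    prev = GENESIS
    for row in chained:
        body = {k: v for k, v in row.items() if k not in ("prev", "hash")}
        payload = json.dumps(body, sort_keys=True).encode()
        if row["prev"] != prev:
            return False
        if row["hash"] != hashlib.sha256(prev.encode() + payload).hexdigest():
            return False
        prev = row["hash"]
    return True

rows = chain([{"actor": "urn:op:42", "action": "CASE_CLOSE"}])
assert verify(rows)
rows[0]["action"] = "CASE_REOPEN"  # tamper with the archived row
assert not verify(rows)
```

A production scheme would additionally sign checkpoint hashes so an attacker cannot simply re-chain after editing, but the linked-hash structure above is what makes silent in-place edits detectable.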
Where to next
- Compliance — GDPR, LGPD, ISO 27001, SOC 2 mapping.
- Operational Panic — how the platform survives hub-closure load.
- Status page — live system health.