Background Jobs & Queues

Nexa runs every long-running step of a case (allocation, booking, notifications, payments, agent runs) on background workers. Customers do not operate queues or workers; the platform runs them. This page describes how the platform behaves so integrators can reason about timing, retries, and failure modes from the API.

Pipeline at a glance

A single disruption case flows through five queues:

| Stage | Typical duration | Retries | Backoff |
| --- | --- | --- | --- |
| Allocation | 1–10 s | 3 | Exponential, 2 s base |
| Booking | 1–30 s per group | 5 | Exponential, 5 s base |
| Payment / token capture | < 1 s | 5 | Exponential, 3 s base |
| Notification (email / SMS) | < 1 s | 5 | Exponential, 3 s base |
| Exception agent | 5–30 s | 1 | None |
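
To read the Backoff column: delays are assumed to double from the listed base on each attempt (Nexa does not publish the exact formula or whether jitter is applied). Under that assumption, a booking that keeps failing is retried after roughly 5 s, 10 s, 20 s, 40 s, and 80 s across its five retries:

# Illustration only: assumed schedule delay_n = base * 2^n; the exact
# formula and any jitter are not documented.
base=5   # Booking stage backoff base, in seconds
for n in 0 1 2 3 4; do
  echo "retry $((n + 1)) after $((base * 2 ** n)) s"
done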

When you POST /cases, the API enqueues an allocation job and responds immediately with status: OPEN. Subsequent stages happen asynchronously.

Subscribe to GET /events/stream for real-time progress, or poll the case for state transitions (OPEN → ALLOCATING → BOOKING → CONFIRMED).
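
A minimal polling sketch, assuming the case is readable at GET /cases/{caseUrn} with the state in a top-level status field (the endpoint path and the jq dependency are assumptions; $NEXA_BASE_URL and $TOKEN match the replay example below):

while :; do
  status=$(curl -s "$NEXA_BASE_URL/cases/$CASE_URN" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status')
  echo "case state: $status"
  # Stop on a terminal state; MANUAL_REVIEW is reachable when retries exhaust.
  case "$status" in CONFIRMED|MANUAL_REVIEW) break ;; esac
  sleep 2
done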

Idempotency keys

Every background job carries an idempotency key so re-enqueues, retries, and replays are safe:

| Stage | Key |
| --- | --- |
| Allocation | (caseUrn, demandRequestUrn, waveSeq) |
| Booking | (caseUrn, waveUrn, groupId) |
| Payment | (caseUrn, reservationUrn) |
| Notification | (groupId, type, channel, reservationUrn) |
| Agent | (itemUrn, runSeq) |

If your DCS replays the same disruption event during a network blip, Nexa returns the existing case rather than creating a duplicate. See Idempotency for details.
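
A sketch of that behavior, assuming the disruption payload lives in a local event.json (hypothetical file) and the standard JSON headers:

curl -X POST "$NEXA_BASE_URL/cases" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data @event.json

# Re-sending the identical payload after a network blip is safe:
# Nexa resolves the same idempotency key and returns the existing
# case instead of opening a duplicate.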

Retries and dead letters

Retries use exponential backoff with the budgets in the table above. When the retry budget is exhausted, Nexa never silently drops the work. Instead, it:

  1. Writes an audit entry with the last error.
  2. Opens a manual-review item with the appropriate category.
  3. Transitions the case to MANUAL_REVIEW (where applicable).

Operators see the failure in the console and resolve it via Manual Review.

Concurrency limits you'll observe

The limits you will encounter as an API consumer:

  • Allocation: bounded by upstream provider rate limits (Amadeus + Hotelbeds). Effective ceiling ≈ 10 concurrent searches per tenant.
  • Booking: bounded by provider rate limits and PSP cap. Effective ceiling ≈ 20 concurrent bookings per tenant.
  • Notifications: bounded by your SendGrid plan. The dashboard shows burn rate against quota; upgrade the plan if you saturate it.
  • Agent: bounded by LLM provider concurrency. Default 3 concurrent triages per tenant.

Beyond these ceilings, jobs queue rather than fail.

Real-time observability

Every stage emits events to /events/stream:

  • prediction.created — the predictor finished a forecast.
  • signal.ingested { source, count } — emitted once per provider per cycle.
  • job.activity { queue, state, detail? } — emitted at the start, finish, and error of each job.
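
To watch the stream from a terminal, a plain SSE subscription is enough; -N disables curl's buffering so events print as they arrive (the Accept header is an assumption about the endpoint):

curl -N "$NEXA_BASE_URL/events/stream" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: text/event-stream"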

Per-tenant Grafana dashboards (queue depth, job duration, failure rate by reason) are available on request.

Replay and reprocess

If your team needs to replay a failed booking after fixing an upstream issue (for example a contract that was uploaded late), the operations console exposes a one-click "Retry" on the manual-review item. Server-to-server replay is also available:

curl -X POST "$NEXA_BASE_URL/manual-review/$ITEM_URN/retry" \
  -H "Authorization: Bearer $TOKEN"

Idempotency keys make the replay safe — if a booking already succeeded, the replay returns the existing reservation.
