Background Jobs & Queues

Nexa runs every long-running step of a case (allocation, booking, notifications, payments, agent runs) on background workers. Customers do not operate queues or workers; the platform runs them. This page describes how the platform behaves so integrators can reason about timing, retries, and failure modes from the API.

Pipeline at a glance

A single disruption case flows through five queues:

| Stage | Typical duration | Retries | Backoff |
| --- | --- | --- | --- |
| Allocation | 1–10 s | 3 | Exponential, 2 s base |
| Booking | 1–30 s per group | 5 | Exponential, 5 s base |
| Payment / token capture | < 1 s | 5 | Exponential, 3 s base |
| Notification (email / SMS) | < 1 s | 5 | Exponential, 3 s base |
| Exception agent | 5–30 s | 1 | None |
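
To read the Backoff column: delays are assumed to double from the listed base on each attempt (Nexa does not publish the exact formula or whether jitter is applied). Under that assumption, a booking that keeps failing is retried after roughly 5 s, 10 s, 20 s, 40 s, and 80 s across its five retries:

# Illustration only: assumed schedule delay_n = base * 2^n; the exact
# formula and any jitter are not documented.
base=5   # Booking stage backoff base, in seconds
for n in 0 1 2 3 4; do
  echo "retry $((n + 1)) after $((base * 2 ** n)) s"
done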

When you POST /cases, the API enqueues an allocation job and responds immediately with status: OPEN. Subsequent stages happen asynchronously.

Subscribe to GET /events/stream for real-time progress, or poll the case for state transitions (OPEN → ALLOCATING → BOOKING → CONFIRMED).
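
A minimal polling sketch, assuming the case is readable at GET /cases/{caseUrn} with the state in a top-level status field (the endpoint path and the jq dependency are assumptions; $NEXA_BASE_URL and $TOKEN match the replay example below):

while :; do
  status=$(curl -s "$NEXA_BASE_URL/cases/$CASE_URN" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.status')
  echo "case state: $status"
  # Stop on a terminal state; MANUAL_REVIEW is reachable when retries exhaust.
  case "$status" in CONFIRMED|MANUAL_REVIEW) break ;; esac
  sleep 2
done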

Idempotency keys

Every background job carries an idempotency key so re-enqueues, retries, and replays are safe:

| Stage | Key |
| --- | --- |
| Allocation | (caseUrn, demandRequestUrn, waveSeq) |
| Booking | (caseUrn, waveUrn, groupId) |
| Payment | (caseUrn, reservationUrn) |
| Notification | (groupId, type, channel, reservationUrn) |
| Agent | (itemUrn, runSeq) |

If your DCS replays the same disruption event during a network blip, Nexa returns the existing case rather than creating a duplicate. See Idempotency for details.
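
A sketch of that behavior, assuming the disruption payload lives in a local event.json (hypothetical file) and the standard JSON headers:

curl -X POST "$NEXA_BASE_URL/cases" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data @event.json

# Re-sending the identical payload after a network blip is safe:
# Nexa resolves the same idempotency key and returns the existing
# case instead of opening a duplicate.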

Retries and dead letters

Retries use exponential backoff with the budgets in the table above. When the retry budget is exhausted, Nexa never silently drops the work. Instead, it:

  1. Writes an audit entry with the last error.
  2. Opens a manual-review item with the appropriate category.
  3. Transitions the case to MANUAL_REVIEW (where applicable).

Operators see the failure in the console and resolve it via Manual Review.

Concurrency limits you'll observe

The limits you will encounter as an API consumer:

  • Allocation: bounded by upstream provider rate limits (Amadeus + Hotelbeds). Effective ceiling ≈ 10 concurrent searches per tenant.
  • Booking: bounded by provider rate limits and PSP cap. Effective ceiling ≈ 20 concurrent bookings per tenant.
  • Notifications: bounded by your SendGrid plan. The dashboard shows burn rate against quota; upgrade the plan if you saturate it.
  • Agent: bounded by LLM provider concurrency. Default 3 concurrent triages per tenant.

Beyond these ceilings, jobs queue rather than fail.

Real-time observability

Every stage emits events to /events/stream:

  • prediction.created — the predictor finished a forecast.
  • signal.ingested { source, count } — emitted once per provider per cycle.
  • job.activity { queue, state, detail? } — emitted at the start, finish, and error of each job.
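
To watch the stream from a terminal, a plain SSE subscription is enough; -N disables curl's buffering so events print as they arrive (the Accept header is an assumption about the endpoint):

curl -N "$NEXA_BASE_URL/events/stream" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: text/event-stream"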

Per-tenant Grafana dashboards (queue depth, job duration, failure rate by reason) are available on request.

Replay and reprocess

If your team needs to replay a failed booking after fixing an upstream issue (for example a contract that was uploaded late), the operations console exposes a one-click "Retry" on the manual-review item. Server-to-server replay is also available:

curl -X POST "$NEXA_BASE_URL/manual-review/$ITEM_URN/retry" \
  -H "Authorization: Bearer $TOKEN"

Idempotency keys make the replay safe — if a booking already succeeded, the replay returns the existing reservation.
