Inside the Flight Predictor
The Nexa Flight Predictor is a Nexa AI Model that lets the platform start sourcing hotel inventory before an airline officially declares a disruption. From a consumer's point of view it's three numbers per flight (cancelProbability, predictedDelayMinutes, confidenceScore) plus a list of human-readable factors. Underneath, it's a multi-agent signal layer, a 14-feature snapshot, a dual-head trained model, and a deterministic baseline that — honestly — is what's moving the needle in production while the trainer catches up to the runtime.
This article covers the predictor at the depth a technical reader would want, without wandering into implementation details. The customer-facing API surface lives in the Flight Disruption Predictor guide.
What it predicts
For every scheduled flight in the next 24 hours of every watched airline, the predictor emits:
- cancelProbability ∈ [0, 1]
- predictedDelayMinutes ∈ [0, 720]
- confidenceScore ∈ [0, 1]
These numbers are produced via two complementary paths:
- Hot path — when an upstream flight-data event lands (a new flight, a status change), a per-flight forecast runs in seconds.
- Cold path — every five minutes, a global sweep re-forecasts every non-terminal flight as a safety net for missed events.
A boot-time reconciliation step closes the gap when the service has been unreachable (see Downtime recovery).
The unit of work: airline, not airport
The first design decision that shapes everything else: the predictor is organized around watched airlines, not airports.
When a tenant marks an airline as watched, three things happen:
- A push subscription is opened against the upstream flight-data source for each ICAO in the carrier's group. LATAM, for example, expands to LAN, TAM, LCO, LPE, LAE.
- A pull of the next 24 hours of scheduled departures runs every five minutes per ICAO.
- A nightly historical backfill is scheduled to feed accuracy reports and retraining data.
A subsidiary roll-up keeps the operator UX consistent ("LATAM" is a single brand) while the platform internally operates per ICAO. Every alert subscription is keyed by ICAO, every data-quota burn is captured per ICAO, and every push event is routed by ICAO.
A predecessor "enable forecast per airport" mode still exists for legacy callers but is deprecated.
Ingestion cycle (every five minutes)
The recurring cycle, per watched airline:
- Expand the ICAO group for the airline (handle subsidiaries).
- For each ICAO, ensure an alert subscription exists upstream (idempotent).
- List scheduled departures for the next 24 hours, capturing rate-limit headers so the dashboard can show data-quota burn vs. ceiling per airline.
- Bulk-upsert flights with a synthetic key flightId = ${airlineIata}${flightNumber}@${departureScheduledAt} — this prevents collisions when two carriers share a flight-number code.
- Fan out per-flight forecasts only for flights that are newly discovered (don't re-forecast a flight whose forecast is minutes old).
- Advance the watermark flightData.lastSeenAt. This single value enables the downtime recovery mechanism described below.
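The synthetic key used in the bulk-upsert step can be sketched directly; a minimal TypeScript transcription, with field names taken from the article and the record type assumed:

```typescript
// Hypothetical shape of a scheduled-flight record (the real runtime type is assumed).
interface ScheduledFlight {
  airlineIata: string;          // e.g. "LA"
  flightNumber: string;         // e.g. "800" — shared codes collide without the key below
  departureScheduledAt: string; // ISO timestamp, no sub-second precision
}

function buildFlightId(f: ScheduledFlight): string {
  // ${airlineIata}${flightNumber}@${departureScheduledAt}: two carriers sharing a
  // flight number still produce distinct keys because the IATA prefix and the
  // scheduled departure differ.
  return `${f.airlineIata}${f.flightNumber}@${f.departureScheduledAt}`;
}
```

For example, `buildFlightId({ airlineIata: "LA", flightNumber: "800", departureScheduledAt: "2025-06-01T10:00:00Z" })` yields `LA800@2025-06-01T10:00:00Z`.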
A few real caveats worth surfacing:
- The upstream flight-data source rejects ISO timestamps with sub-second precision, so the cycle strips milliseconds before serializing query strings.
- If credentials are missing, the cycle errors loudly — there is no silent fallback, since ingestion without an upstream is no ingestion.
- Status normalization is fail-open: any unknown status from upstream collapses to scheduled. Conscious tradeoff — Nexa would rather produce a sub-optimal forecast than drop a flight because the upstream introduced a new vocabulary token.
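The millisecond-stripping caveat amounts to a one-line normalization; a minimal sketch (function name is an assumption):

```typescript
// The upstream rejects ISO timestamps with sub-second precision, so drop the
// ".SSS" segment that Date.prototype.toISOString always emits.
function toUpstreamTimestamp(d: Date): string {
  return d.toISOString().replace(/\.\d{3}Z$/, "Z");
}
```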
Per-flight forecast (hot path)
When a webhook arrives or the cycle discovers a new flight, a per-flight forecast runs:
- Look up the flight by ID.
- Pull historical outcomes for the route and a slice for the airline.
- Run the decision agent (the orchestrator that combines all the pieces below).
- Persist the result, mirror it to long-term storage, and emit a real-time event.
A useful detail: forecasts are bucketed per minute per flight — bursts of change events (typical when a flight is reprogrammed several times) collapse to one prediction per flight per minute. The cost saving on bursty schedules is non-trivial.
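A sketch of that per-minute bucketing, assuming a simple in-memory dedupe set (the real implementation may live in the job queue):

```typescript
// Bursts of change events for the same flight collapse to one forecast job
// per flight per minute: the bucket key is flightId plus the minute index.
const seenBuckets = new Set<string>();

function shouldEnqueueForecast(flightId: string, eventAtMs: number): boolean {
  const bucket = `${flightId}:${Math.floor(eventAtMs / 60_000)}`; // minute bucket
  if (seenBuckets.has(bucket)) return false; // duplicate within the same minute
  seenBuckets.add(bucket);
  return true;
}
```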
Flights in terminal states (landed, cancelled, diverted) are skipped.
The multi-agent signal layer
Before the model runs, an ingestion layer collects external signals from independent sources. Each agent is small, isolated, and runs in parallel — a downed source contributes zero signals but never aborts the prediction.
| Agent | Source category | Signal type | Notes |
|---|---|---|---|
| Flight Ops | Upstream flight-data source | flightOps | Honors a manual flightOpsSeverity override on the input — useful for ad-hoc what-ifs. |
| Weather | Open meteorological data source | weather | Maps weather code → severity. |
| Labor & Civil Unrest | Open geopolitical-events source | labor | Strikes / civil-unrest events near the airport. |
| News & Traffic | News aggregator | traffic, news | Keyword search (closure, crash, strike, …). |
| Natural Hazard | Open hazard-feeds + curated overlay | hazard | Seismic / volcanic events. |
| Airline Reliability | Curated reliability feed (or derived from history) | airlineReliability | If no curated source, derived from historical outcomes. |
| Aircraft Rotation | Internal rotation chain | rotationChain | Carries forward the delay of the previous leg of the same tail (empirical carry-over ~40%). |
| Cancellations Board | Live-cancellations crawl | cancellationsBoard | Match by airline → fallback by airport. |
| History Pattern | Historical outcomes | historyPattern | Seasonality, day-of-week. |
Every signal carries: signalType, airportIata, destinationIata?, severity ∈ [0, 1], confidence ∈ [0, 1], source, observedAt, optional metadata.
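As a type, the signal record described above might look like this (field names from the article; the exact runtime types are an assumption):

```typescript
// Signal types from the agent table above.
type SignalType =
  | "flightOps" | "weather" | "labor" | "traffic" | "news"
  | "hazard" | "airlineReliability" | "rotationChain"
  | "cancellationsBoard" | "historyPattern";

interface Signal {
  signalType: SignalType;
  airportIata: string;
  destinationIata?: string;
  severity: number;    // ∈ [0, 1]
  confidence: number;  // ∈ [0, 1]
  source: string;
  observedAt: string;  // ISO timestamp
  metadata?: Record<string, unknown>;
}
```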
Attribution: ranking factors for humans
A separate attribution agent ranks signals for the operator UI. It computes:
weight = severity × confidence × sourceCredibility(source)
It then groups signals by type, sums the weights, and returns the top five with a human-readable rationale ("Derived from <signalType> via <source category>").
Crucially, attribution does not feed the model. It feeds the UI and the topFactors field of the response. This separation is deliberate — the model should not be biased by the explanation it produces.
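A hedged sketch of that ranking — the credibility table and its values are illustrative assumptions, only the weight formula comes from the article:

```typescript
interface RankableSignal { signalType: string; severity: number; confidence: number; source: string; }

// Assumed credibility table; the article only says credibility is per source.
function sourceCredibility(source: string): number {
  const table: Record<string, number> = { upstream: 1.0, openData: 0.8, crawl: 0.6 };
  return table[source] ?? 0.5; // assumed default for unknown sources
}

function topFactors(signals: RankableSignal[], limit = 5): { signalType: string; weight: number }[] {
  const byType = new Map<string, number>();
  for (const s of signals) {
    // weight = severity × confidence × sourceCredibility(source)
    const w = s.severity * s.confidence * sourceCredibility(s.source);
    byType.set(s.signalType, (byType.get(s.signalType) ?? 0) + w);
  }
  return [...byType.entries()]
    .map(([signalType, weight]) => ({ signalType, weight }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, limit);
}
```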
The feature snapshot
The model's input is a 14-feature numeric vector. Each prediction is auditable: every snapshot is persisted, so any prediction can be reconstructed to its exact input.
| Feature | Definition |
|---|---|
| historicalCancelRate | Cancel rate for the route across the entire history. |
| historicalAvgDelayMinutes | Average delay for the route across history. |
| aggregatedSignalSeverity | Σ severity × confidence over all signals — a proxy for total external pressure on the origin. |
| destinationSignalPressure | Same aggregate, filtered to signals matching the destination IATA. |
| recentCancelRate | Cancel rate over the last 14 days (falls back to historical when there's no recent data). |
| recentAvgDelayMinutes | Average delay over the last 14 days. |
| seasonalityWeight | Simple seasonal multiplier: 1.08 on Friday/Sunday, 1.0 otherwise. |
| routeRecentIssueScore | Cancellations + delays ≥30 min in the last 30 days, normalized against a target of 20 observations, clipped to [0, 1]. |
| airlineCancelRate | 90-day cancel rate for the airline. |
| airlineAvgDelayMinutes | 90-day average delay for the airline. |
| airlineRecentCancelRate | 14-day cancel rate for the airline. |
| airlineRecentAvgDelayMinutes | 14-day average delay for the airline. |
| prevLegRiskScore | Severity of the rotation-chain signal — how badly the previous leg of the same tail was running. |
| prevLegDelayMinutes | Observed delay of the previous leg (carry-over to the current flight). |
Plus categorical metadata: season, dayOfWeek, hourOfDay, and a stable key (preferentially the flightId, falling back to a composite key).
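The table above translates into a flat numeric record; as a type sketch (the runtime representation is an assumption):

```typescript
// The 14-feature snapshot, one number per row of the table above.
interface FeatureSnapshot {
  historicalCancelRate: number;
  historicalAvgDelayMinutes: number;
  aggregatedSignalSeverity: number;
  destinationSignalPressure: number;
  recentCancelRate: number;
  recentAvgDelayMinutes: number;
  seasonalityWeight: number;
  routeRecentIssueScore: number;
  airlineCancelRate: number;
  airlineAvgDelayMinutes: number;
  airlineRecentCancelRate: number;
  airlineRecentAvgDelayMinutes: number;
  prevLegRiskScore: number;
  prevLegDelayMinutes: number;
}
```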
Inference: trained model first, deterministic baseline fallback
prediction = trainedModelForecast(snapshot, topFactors)
?? baselineForecast(snapshot, topFactors)
If the Nexa AI Model is unreachable for any reason, the deterministic baseline runs locally inside the orchestrator. The platform never blocks on model availability.
The trained model
The Nexa AI Model is a dual-head trained model:
- Cancel head — binary classifier returning cancelProbability.
- Delay head — regressor returning predictedDelayMinutes (clamped to [0, 720]).
Why this shape:
- Gradient-boosted trees — the dataset is small, the features are tabular and mixed (rates, severities, normalized counters), and the relationship between, say, signal severity and cancel probability is non-linear and smooth. Trees are also far easier to explain to operators than a neural network.
- Shallow trees — depth-3 ensembles to avoid overfitting on a few thousand training examples and to keep inference cheap.
- Different ensemble sizes for the two heads — the regression head is wider because the delay signal is much noisier (variance across [0, 720] minutes) than the binary cancel signal.
Training cadence
The Nexa AI Model is retrained on a managed cadence:
- Weekly incremental retrain on new outcomes.
- Quarterly full retrain with a hyperparameter sweep.
- On demand after material schedule changes (new route footprint, terminal reopen).
Training joins features and labels by flightId — joining by (route, scheduled_at) would create code-share collisions; joining by flightId doesn't.
There is a minimum-sample threshold: below it the trainer falls back to a synthetic dataset generated with a fixed seed so the build → upload → register → deploy pipeline always succeeds. The bundle records training_source: "synthetic" and a lower base_confidence (0.62 vs 0.78) for auditability.
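The "fixed seed" property is the important part of the synthetic fallback: rebuilding the bundle yields byte-identical training data. An illustrative sketch — the generator and row shape here are assumptions, not the trainer's actual code:

```typescript
// Deterministic PRNG (classic LCG constants) so the synthetic dataset is
// reproducible across builds of the build → upload → register → deploy pipeline.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (1664525 * state + 1013904223) >>> 0;
    return state / 2 ** 32; // uniform in [0, 1)
  };
}

// Hypothetical row shape: a cancel label probability and a delay in [0, 720].
function syntheticRows(n: number, seed = 42): { cancelProb: number; delayMin: number }[] {
  const rand = lcg(seed);
  return Array.from({ length: n }, () => ({
    cancelProb: rand(),
    delayMin: Math.round(rand() * 720),
  }));
}
```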
Every trained model is registered with a version. Rolling back is a single registry pointer change — no service interruption.
A real technical caveat: runtime/trainer feature drift
The kind of thing a technical article exists to surface honestly.
The runtime emits 14 features with content-oriented names (the table above). The current trainer declares a different feature contract — eight pressure-oriented features inherited from an earlier iteration:
weatherPressure, laborPressure, trafficPressure, hazardPressure,
flightOpsPressure, destinationPressure, recentCancelRate, seasonalityWeight
Only recentCancelRate and seasonalityWeight overlap literally. The runtime sends the rich 14-key snapshot to the trained model; the model reads only the eight keys it knows about; the missing keys resolve to 0.0.
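Mechanically, the skew behaves like a projection with zero-fill; a minimal sketch of what the model effectively sees:

```typescript
// The trainer's declared feature contract (eight pressure-oriented names).
const TRAINER_FEATURES = [
  "weatherPressure", "laborPressure", "trafficPressure", "hazardPressure",
  "flightOpsPressure", "destinationPressure", "recentCancelRate", "seasonalityWeight",
] as const;

// The model reads only the eight keys it knows; keys the runtime doesn't send
// resolve to 0.0. Only recentCancelRate and seasonalityWeight overlap with the
// 14-feature runtime snapshot, so six of the eight inputs are constantly zero.
function projectForTrainer(snapshot: Record<string, number>): number[] {
  return TRAINER_FEATURES.map((k) => snapshot[k] ?? 0.0);
}
```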
The implications are real:
- In low-sample regimes (synthetic training), the trained model is operating on dimensions the runtime is not filling in. It's effectively near-constant in production.
- The deterministic baseline (which operates on the full 14-feature snapshot) is, today, probably a better predictor than the trained model. This is acknowledged technical debt; the trainer is being updated to consume the modern 14-feature schema.
- For now, the formula in the next section is what most often determines a Nexa forecast in practice. We document it instead of pretending otherwise.
The deterministic baseline
The baseline is an explicit linear combination over the 14-feature snapshot. It runs whenever the trained model is unreachable, the prediction endpoint is unset, or the call fails.
Cancel probability:
cancelProbability = clamp01(
historicalCancelRate × 0.22
+ recentCancelRate × 0.18
+ airlineRecentCancelRate × 0.20
+ aggregatedSignalSeverity × 0.22
+ destinationSignalPressure × 0.10
+ routeRecentIssueScore × 0.05
+ (seasonalityWeight − 1) × 0.03
+ prevLegRiskScore × 0.18
)
The weights were chosen so that no single term can saturate the result on its own (every coefficient is below 0.25). The rotation term (prevLegRiskScore × 0.18) sits at the same weight as the route's recent cancel rate — encoding the operational intuition that an aircraft arriving very late has a high probability of cancelling its next leg rather than absorbing more delay.
Predicted delay (minutes):
predictedDelayMinutes = max(0, round(
historicalAvgDelayMinutes × 0.30
+ recentAvgDelayMinutes × 0.30
+ airlineRecentAvgDelayMinutes × 0.25
+ aggregatedSignalSeverity × 90
+ destinationSignalPressure × 35
+ prevLegDelayMinutes × 0.40
+ prevLegRiskScore × 25
))
The multipliers 90 and 35 translate signal severities ([0, 1]) into minutes. Maximum external pressure adds up to roughly 125 minutes of structural delay — plausible for a severe weather event. The prevLegDelayMinutes × 0.40 coefficient encodes the empirical carry-over ratio (~40%) for the rotation chain.
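Transcribed directly into code, the two baseline heads are (the Snapshot type and clamp01 helper are local sketches; the coefficients are the article's):

```typescript
// Subset of the 14-feature snapshot that the baseline formulas consume.
interface Snapshot {
  historicalCancelRate: number; recentCancelRate: number;
  airlineRecentCancelRate: number; aggregatedSignalSeverity: number;
  destinationSignalPressure: number; routeRecentIssueScore: number;
  seasonalityWeight: number; prevLegRiskScore: number;
  historicalAvgDelayMinutes: number; recentAvgDelayMinutes: number;
  airlineRecentAvgDelayMinutes: number; prevLegDelayMinutes: number;
}

const clamp01 = (x: number) => Math.min(1, Math.max(0, x));

function baselineCancelProbability(s: Snapshot): number {
  return clamp01(
    s.historicalCancelRate * 0.22 +
    s.recentCancelRate * 0.18 +
    s.airlineRecentCancelRate * 0.20 +
    s.aggregatedSignalSeverity * 0.22 +
    s.destinationSignalPressure * 0.10 +
    s.routeRecentIssueScore * 0.05 +
    (s.seasonalityWeight - 1) * 0.03 +
    s.prevLegRiskScore * 0.18,
  );
}

function baselineDelayMinutes(s: Snapshot): number {
  return Math.max(0, Math.round(
    s.historicalAvgDelayMinutes * 0.30 +
    s.recentAvgDelayMinutes * 0.30 +
    s.airlineRecentAvgDelayMinutes * 0.25 +
    s.aggregatedSignalSeverity * 90 +   // severity → minutes
    s.destinationSignalPressure * 35 +  // severity → minutes
    s.prevLegDelayMinutes * 0.40 +      // ~40% empirical carry-over
    s.prevLegRiskScore * 25,
  ));
}
```

A quiet day (all features zero, seasonalityWeight 1) yields a zero forecast; maximum origin severity alone contributes 0.22 to cancel probability and 90 structural minutes of delay.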
Confidence is not probability
After producing cancelProbability and predictedDelayMinutes, the system computes a separate confidenceScore:
confidenceScore = clamp01(
0.42
+ 0.07 × topFactors.length
+ 0.15 × recentCancelRate
+ 0.10 × airlineRecentCancelRate
)
confidenceScore does not come from the model. It's a heuristic that grows with the number of independent factors supporting the forecast and with the historical track record of the route and airline. It exists so an operator can quickly see whether a forecast is backed by three corroborating factors or just one.
A consumer downstream that needs a calibrated probability of cancellation should use cancelProbability (which passes through the classifier) and treat confidenceScore as a coverage proxy — not a calibrated estimate of certainty.
Webhook handling
The upstream flight-data source pushes events to a webhook protected by a shared token. For each event, Nexa:
- Persists the raw payload to an append-only audit row — webhooks are never dropped, even if processing fails.
- Normalizes the kind to filed | departure | arrival | cancelled | diverted | change | unknown.
- Resolves a flightId.
- Branches by kind:
  - filed / change → upsert + enqueue forecast.
  - departure → status active + forecast.
  - arrival → upsert outcome + reconciliation forecast.
  - cancelled → outcome with cancelled=true, no forecast (terminal).
  - diverted → status diverted + create a synthetic external signal so neighboring flights pick up the disruption laterally.
- Advances the watermark.
Diverted flights are not re-forecasted, but they do create a synthetic origin signal with severity=0.7, confidence=0.9. This is how lateral contagion is modelled — without explicitly modelling cross-flight cancellations.
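A sketch of the signal a diversion emits — the severity and confidence values come from the article; the signal's type and source strings are assumptions:

```typescript
// Synthetic origin signal created on a diversion. Neighboring flights at the
// same airport pick it up through the normal signal layer ("lateral contagion").
function divertSignal(originIata: string, observedAt: string) {
  return {
    signalType: "flightOps" as const, // assumption: surfaced via the ops channel
    airportIata: originIata,
    severity: 0.7,
    confidence: 0.9,
    source: "diversion",              // assumed source label
    observedAt,
  };
}
```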
Downtime recovery
Webhooks are fire-and-forget. If Nexa is unreachable, those events are lost and the flight state goes stale. Recovery is based on a single watermark and a bounded replay.
- Watermark: flightData.lastSeenAt — bumped on every successful webhook process and on every successful ingestion cycle close.
- Trigger: at boot, after all dependencies are healthy and the recurring jobs are seeded; or manually via an admin endpoint.
- Window resolution: explicit override > watermark-based > 60-minute default; capped at 7 days; skipped entirely if the watermark is fresher than 5 minutes.
- Flow: list historical flights for each watched ICAO in the window, bulk-upsert with the observed status (including cancelled / landed / diverted), upsert outcomes, re-enqueue forecasts for non-terminal flights, advance the watermark.
The per-minute bucketing on forecast jobs means re-enqueuing a flight that just received a forecast is idempotent — there's no risk of duplicate forecasts during reconciliation.
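The window-resolution rule can be sketched as a small pure function (names, millisecond units, and the precedence of the freshness check over the override are assumptions):

```typescript
const MINUTE = 60_000, HOUR = 60 * MINUTE, DAY = 24 * HOUR;

// explicit override > watermark-based > 60-minute default; capped at 7 days;
// skipped entirely if the watermark is fresher than 5 minutes.
function resolveReplayWindow(
  nowMs: number,
  watermarkMs?: number,
  overrideStartMs?: number,
): { startMs: number; endMs: number } | null {
  if (watermarkMs !== undefined && nowMs - watermarkMs < 5 * MINUTE) {
    return null; // watermark is fresh: nothing to replay
  }
  let startMs =
    overrideStartMs ??    // explicit override wins
    watermarkMs ??        // otherwise replay from the watermark
    nowMs - 60 * MINUTE;  // otherwise a 60-minute default
  startMs = Math.max(startMs, nowMs - 7 * DAY); // bounded replay: 7-day cap
  return { startMs, endMs: nowMs };
}
```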
Outcomes, accuracy, retraining
Outcomes (the ground truth) are captured from three places:
- A nightly historical backfill.
- A daily outcomes capture.
- The boot-time reconcile (above).
The accuracy report endpoints expose:
- Cancel (binary, threshold 0.5): TP/TN/FP/FN, accuracy, precision, recall, F1, Brier score.
- Delay (regression): MAE, RMSE, bias (mean(predicted − actual); positive means the model over-predicts), median absolute error, within15m / within30m / within60m buckets, and IATA on-time accuracy (a departure ≥15 minutes late counts as not on-time).
- Daily breakdown and top airports.
Matching is strict on no-look-ahead: for each outcome the most recent prediction with generatedAt < scheduled_out is used. Outcomes without a preceding prediction count toward coverageRate but are excluded from the accuracy metrics.
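The no-look-ahead rule reduces to a filter-and-sort; a minimal sketch with assumed record shapes:

```typescript
interface Prediction { flightId: string; generatedAtMs: number; cancelProbability: number; }

// For each outcome, use the most recent prediction generated strictly before
// the scheduled departure. Returning undefined means the outcome counts toward
// coverageRate but is excluded from the accuracy metrics.
function matchPrediction(
  predictions: Prediction[],
  flightId: string,
  scheduledOutMs: number,
): Prediction | undefined {
  return predictions
    .filter((p) => p.flightId === flightId && p.generatedAtMs < scheduledOutMs)
    .sort((a, b) => b.generatedAtMs - a.generatedAtMs)[0];
}
```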
Quirks worth knowing
A few decisions a careful reader will want to be told about:
- The runtime/trainer feature skew (above). The most relevant technical debt today.
- No explicit upstream rate-limit throttling. The 5-minute cadence × bounded watched-airline count is the de-facto throttle. The operator dashboard surfaces burn so the ICAO list can be tuned.
- Status normalization is fail-open. Unknown statuses become scheduled. We'd rather forecast a flight in a wrong-but-recoverable state than drop it because the upstream introduced a new token.
- Diverted flights propagate via a synthetic signal, not via explicit cross-flight modelling. Lateral contagion is data-driven.
- The attribution agent is not in the inference loop. Separation of concerns: the model should never be biased by the explanation it generates.
- Confidence is coverage, not calibration. Use cancelProbability for thresholds; use confidenceScore to sort or filter by how much material the model had.
Glossary
- Hot path / cold path — Hot = per-flight forecast (seconds). Cold = full sweep every 5 min (minutes).
- Watermark — Timestamp of the last successfully processed event. Enables gap detection after downtime.
- Signal — An external piece of information with severity ∈ [0, 1] and confidence ∈ [0, 1], produced by an agent and consumed by the feature builder and the attribution agent.
- Snapshot — The numeric feature vector that enters the model. Audited: every prediction stores the exact snapshot that produced it.
- Nexa AI Model — Nexa's trained model served on Google. Customers consume it through the predictor API; they do not provision, train, or operate any third-party AI service.