Evaluation Metrics
Five tests run weekly against the live prediction + signal evaluation tables. Each test has a numeric success gate from phase-05-evaluation.md.
Test 1 · Envelope calibration
Does the envelope widen on event days?
spread_ratio = mean(p90 - p10) on event days / mean(p90 - p10) on non-event days within ±7 days of the event
Target: spread_ratio ≥ 1.5.
Per event type (FOMC, CPI, NFP) — expect each to pass independently.
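A minimal sketch of the spread-ratio check (the function name `spread_ratio` and the toy arrays are illustrative, not from the eval scripts):

```python
import numpy as np

def spread_ratio(p90, p10, is_event_day):
    """Mean envelope width on event days divided by the mean width on non-event days."""
    width = np.asarray(p90) - np.asarray(p10)
    mask = np.asarray(is_event_day, dtype=bool)
    return width[mask].mean() / width[~mask].mean()

# toy data: the envelope is twice as wide on the two event days
ratio = spread_ratio(
    p90=[4.0, 2.0, 4.0, 2.0],
    p10=[0.0, 0.0, 0.0, 0.0],
    is_event_day=[True, False, True, False],
)
print(ratio)  # 2.0 — passes the ≥ 1.5 gate
```

In the real test this would be computed once per event type (FOMC, CPI, NFP), since each must pass independently.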
Test 2 · Directional accuracy post-surprise
Does p50 shift in the expected direction after a surprise?
expected_direction = map(event, asset_class)  # e.g. CPI beat → bearish risk assets → expected_direction = -1 for equities
actual_direction = sign(price_change)
predicted_shift = sign(p50_event - p50_base)
directional_accuracy = P(predicted_shift == expected_direction)
Target: ≥ 60 % on surprise events with |z| > 0.5.
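The accuracy computation can be sketched directly from the pseudocode above (the function name and toy values are illustrative):

```python
import numpy as np

def directional_accuracy(p50_event, p50_base, expected_direction):
    """Fraction of surprise events where the p50 shift matches the expected sign."""
    predicted_shift = np.sign(np.asarray(p50_event) - np.asarray(p50_base))
    return float(np.mean(predicted_shift == np.asarray(expected_direction)))

# toy: 3 of 4 p50 shifts agree with the mapped expected direction
acc = directional_accuracy(
    p50_event=[101, 99, 102, 98],
    p50_base=[100, 100, 100, 100],
    expected_direction=[1, -1, 1, 1],
)
print(acc)  # 0.75 — clears the 60 % target
```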
Test 3 · Non-event-day regression
Does adding event conditioning hurt predictions on quiet days?
mae_diff = |mean(|event_pred - actual|) - mean(|base_pred - actual|)|
dir_diff = |dir_accuracy(event) - dir_accuracy(base)|
Target: mae_diff < 2 %, dir_diff < 2 %.
This is the critical backward-compatibility gate — if event conditioning regresses quiet-day performance, ship-stop.
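A sketch of the MAE half of the gate, assuming the percentage target is interpreted relative to baseline MAE (function name and toy errors are illustrative):

```python
import numpy as np

def quiet_day_mae_gap(event_pred, base_pred, actual):
    """Relative MAE gap between event-conditioned and baseline predictors on quiet days."""
    event_pred, base_pred, actual = map(np.asarray, (event_pred, base_pred, actual))
    mae_event = np.abs(event_pred - actual).mean()
    mae_base = np.abs(base_pred - actual).mean()
    # expressed as a fraction of baseline MAE so the < 2 % gate is scale-free
    return abs(mae_event - mae_base) / mae_base

# toy: baseline errors of 0.100, event-conditioned errors of 0.101 → ~1 % gap
gap = quiet_day_mae_gap(
    event_pred=[0.101, -0.101],
    base_pred=[0.100, -0.100],
    actual=[0.0, 0.0],
)
print(gap < 0.02)  # True — within the backward-compatibility gate
```

The `dir_diff` half is the analogous absolute difference of the two directional-accuracy numbers.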
Test 4 · Walk-forward backtest
Train: 2021-01-01 → 2024-12-31
Test: 2025-01-01 → 2026-04-23
Strategy:
Long when p50 > last_close AND p90 - p10 < threshold
Short when p50 < last_close AND p90 - p10 < threshold
Flat when p90 - p10 > threshold (uncertainty too high)
Compared across all four predictors: kronos-base, kronos-event, chronos-2, ensemble.
Target: kronos-event Sharpe ≥ kronos-base Sharpe — any improvement counts.
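The position rule and the Sharpe comparison can be sketched as follows (function names, the toy inputs, and the zero risk-free-rate assumption are mine, not from the backtest code):

```python
import numpy as np

def positions(p50, last_close, p90, p10, threshold):
    """+1 long, -1 short, 0 flat, per the envelope-width rule above."""
    width = np.asarray(p90) - np.asarray(p10)
    pos = np.sign(np.asarray(p50) - np.asarray(last_close))
    pos[width > threshold] = 0  # uncertainty too high → stand aside
    return pos

def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a daily return series (risk-free rate assumed 0)."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

pos = positions(p50=[101, 99, 100], last_close=[100, 100, 100],
                p90=[102, 101, 110], p10=[100, 98, 90], threshold=5)
print(pos.tolist())  # [1, -1, 0] — long, short, flat
```

Running this per predictor over the 2025-01-01 → 2026-04-23 test window and comparing `sharpe()` of the resulting strategy returns gives the kronos-event ≥ kronos-base gate.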
Test 5 · Ensemble meta-learner (LightGBM)
Per-asset-class stacking (crypto / forex+commodity / equity) over signal_evaluations. See Ensemble Design for features and target.
Target: ensemble_sharpe ≥ best_single_sharpe + 0.1.
If the ensemble doesn’t beat the best single model by ≥ 0.1 Sharpe, fall back to the single predictor and cut the complexity.
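The fallback rule can be sketched as plain selection logic, separate from the LightGBM stacker itself (the function name `select_predictor` and the Sharpe values are illustrative):

```python
def select_predictor(sharpes, min_lift=0.1):
    """Keep the ensemble only if it beats the best single model by >= min_lift Sharpe."""
    singles = {name: s for name, s in sharpes.items() if name != "ensemble"}
    best_single = max(singles, key=singles.get)
    if sharpes.get("ensemble", float("-inf")) >= singles[best_single] + min_lift:
        return "ensemble"
    return best_single  # cut the complexity, fall back to the single predictor

choice = select_predictor({"kronos-base": 0.8, "kronos-event": 1.0,
                           "chronos-2": 0.9, "ensemble": 1.05})
print(choice)  # kronos-event — ensemble lift is only 0.05, below the 0.1 gate
```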
Why these specific metrics
| Metric | What it really tells us |
|---|---|
| Spread ratio | Does the model know when it’s uncertain? |
| Directional accuracy | Can it react correctly to surprise shocks? |
| Non-event MAE diff | Did we break the baseline while chasing event gains? |
| Walk-forward Sharpe | Would this have made money out-of-sample? |
| Ensemble lift | Is the complexity actually paying for itself? |
Together they catch the two failure modes of a financial forecaster: (a) overconfident calibration; (b) getting the direction right but the magnitude wrong.
Weekly evaluation cron
0 8 * * 0 → run eval/event_eval.py
0 9 * * 0 → run eval/ensemble_meta.py (after event_eval)
Writes JSON + markdown to plans/reports/eval-{YYMMDD}-event-conditioned.md. The first passing week sets a baseline; subsequent weeks compare against it.
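A small sketch of how the report path could be derived for a given Sunday run, assuming `{YYMMDD}` means a two-digit year (the helper name is illustrative):

```python
from datetime import date

def report_path(run_date: date) -> str:
    """Weekly eval report path, e.g. plans/reports/eval-260419-event-conditioned.md."""
    return f"plans/reports/eval-{run_date:%y%m%d}-event-conditioned.md"

print(report_path(date(2026, 4, 19)))  # plans/reports/eval-260419-event-conditioned.md
```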