
Evaluation Metrics

Five tests run weekly against the live prediction and signal-evaluation tables. Each test has a numeric success gate defined in phase-05-evaluation.md.

Test 1 · Event-day spread ratio

Does the envelope widen on event days?

spread_ratio = mean(p90 - p10) on event days
             / mean(p90 - p10) on non-event days within ±7 days of the event

Target: spread_ratio ≥ 1.5

Per event type (FOMC, CPI, NFP) — expect each to pass independently.
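The spread-ratio computation can be sketched as follows — a minimal version assuming a pandas frame with hypothetical `p10`, `p90`, and `is_event` columns, already restricted to the ±7-day windows around events:

```python
import pandas as pd

def spread_ratio(df: pd.DataFrame) -> float:
    """Mean p90-p10 envelope on event days divided by the same mean on
    the surrounding non-event days. Column names are assumptions."""
    width = df["p90"] - df["p10"]
    event_width = width[df["is_event"]].mean()
    base_width = width[~df["is_event"]].mean()
    return event_width / base_width

# Toy frame: envelope twice as wide on the two event days
df = pd.DataFrame({
    "p10": [90, 95, 80, 75],
    "p90": [110, 115, 120, 115],
    "is_event": [False, False, True, True],
})
print(spread_ratio(df))  # 2.0 -> clears the >= 1.5 gate
```

For the per-event-type check, the same function would be applied to each FOMC / CPI / NFP subset separately.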

Test 2 · Directional accuracy post-surprise


Does p50 shift in the expected direction after a surprise?

expected_direction = map(event, asset_class)
# e.g. CPI beat → bearish risk assets → expected_direction = -1 for equities
actual_direction = sign(price_change)
predicted_shift = sign(p50_event - p50_base)
directional_accuracy = P(predicted_shift == expected_direction)

Target: ≥ 60 % on surprise events with |z| > 0.5.
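A sketch of the metric under stated assumptions — the event-to-direction map here is hypothetical (the real mapping lives in `map(event, asset_class)`), and downside surprises are handled by flipping the expected sign:

```python
import numpy as np

# Hypothetical map: expected price direction per asset class after an
# *upside* surprise (e.g. CPI beat -> bearish risk assets -> -1 for equities).
EXPECTED_ON_BEAT = {
    ("CPI", "equity"): -1,
    ("CPI", "crypto"): -1,
    ("NFP", "equity"): +1,
}

def directional_accuracy(events, z_min=0.5):
    """P(predicted_shift == expected_direction) over surprise events.

    events: dicts with keys event, asset_class, z, p50_event, p50_base.
    Only events with |z| > z_min count, matching the Test 2 gate.
    """
    hits = total = 0
    for e in events:
        if abs(e["z"]) <= z_min:
            continue  # not a surprise; excluded from the metric
        expected = EXPECTED_ON_BEAT[(e["event"], e["asset_class"])]
        if e["z"] < 0:
            expected = -expected  # downside surprise flips the expectation
        predicted = int(np.sign(e["p50_event"] - e["p50_base"]))
        total += 1
        hits += predicted == expected
    return hits / total if total else float("nan")
```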

Test 3 · Non-event regression check

Does adding event conditioning hurt predictions on quiet days?

mae_diff = |mean(|event_pred - actual|) - mean(|base_pred - actual|)|
dir_diff = |dir_accuracy(event) - dir_accuracy(base)|

Target: mae_diff < 2 %, dir_diff < 2 %.

This is the critical backward-compatibility gate: if event conditioning regresses quiet-day performance, it is a ship-stop.
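A minimal sketch of the gate. One assumption to flag: the spec's `mae_diff` formula is an absolute difference while the target is stated as a percentage, so this version interprets the 2% as relative to the base model's MAE:

```python
import numpy as np

def quiet_day_gate(actual, base_pred, event_pred,
                   base_dir_acc, event_dir_acc,
                   mae_tol=0.02, dir_tol=0.02):
    """Backward-compatibility gate (Test 3); arrays cover non-event days only.

    Returns True when both the MAE delta and the directional-accuracy
    delta between event and base models stay under 2%.
    """
    actual = np.asarray(actual, dtype=float)
    mae_base = np.mean(np.abs(np.asarray(base_pred, dtype=float) - actual))
    mae_event = np.mean(np.abs(np.asarray(event_pred, dtype=float) - actual))
    mae_diff = abs(mae_event - mae_base) / mae_base  # relative (assumption)
    dir_diff = abs(event_dir_acc - base_dir_acc)
    return bool(mae_diff < mae_tol and dir_diff < dir_tol)
```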

Test 4 · Walk-forward backtest

Train: 2021-01-01 → 2024-12-31
Test: 2025-01-01 → 2026-04-23
Strategy:
Long when p50 > last_close AND p90-p10 < threshold
Short when p50 < last_close AND p90-p10 < threshold
Flat when p90-p10 > threshold (uncertainty too high)

Compared across all four predictors: kronos-base, kronos-event, chronos-2, ensemble.

Target: kronos-event Sharpe ≥ kronos-base Sharpe — any improvement counts.
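The trading rule and the Sharpe comparison can be sketched as below; `threshold` and the return series are placeholders, and the Sharpe helper assumes a near-zero risk-free rate:

```python
import statistics

def position(p50, last_close, p90, p10, threshold):
    """Daily position for the walk-forward backtest (Test 4)."""
    if p90 - p10 > threshold:
        return 0       # flat: predictive uncertainty too high
    if p50 > last_close:
        return 1       # long
    if p50 < last_close:
        return -1      # short
    return 0           # p50 == last_close: no edge either way

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe for comparing the four predictors."""
    mu = statistics.mean(daily_returns)
    sd = statistics.stdev(daily_returns)
    return mu / sd * periods_per_year ** 0.5
```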

Test 5 · Ensemble meta-learner (LightGBM)


Per-asset-class stacking (crypto / forex+commodity / equity) over signal_evaluations. See Ensemble Design for features and target.

Target: ensemble_sharpe ≥ best_single_sharpe + 0.1.

If the ensemble doesn’t beat the best single model by ≥ 0.1 Sharpe, fall back to the single predictor and cut the complexity.
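The ship/fall-back decision itself is simple to pin down (the predictor names here mirror the four from Test 4; the Sharpe values are placeholders):

```python
def ensemble_gate(ensemble_sharpe, single_sharpes, min_lift=0.1):
    """Test 5 decision: ship the ensemble only if it clears the +0.1
    Sharpe lift over the best single model; otherwise name the single
    predictor to fall back to.

    single_sharpes: predictor name -> walk-forward Sharpe.
    """
    best_name, best_sharpe = max(single_sharpes.items(), key=lambda kv: kv[1])
    if ensemble_sharpe >= best_sharpe + min_lift:
        return "ensemble"
    return best_name  # fall back and cut the complexity
```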

Metric             | What it really tells us
-------------------|------------------------------------------------------
Spread ratio       | Does the model know when it’s uncertain?
Directional accuracy | Can it react correctly to surprise shocks?
Non-event MAE diff | Did we break the baseline while chasing event gains?
Walk-forward Sharpe | Would this have made money out-of-sample?
Ensemble lift      | Is the complexity actually paying for itself?

Together they catch the two core failure modes of a financial forecaster: (a) overconfident, miscalibrated intervals; (b) getting the direction right but the magnitude wrong.

0 8 * * 0 → run eval/event_eval.py (Sundays, 08:00)
0 9 * * 0 → run eval/ensemble_meta.py (Sundays, 09:00 — after event_eval completes)

Writes JSON + markdown to plans/reports/eval-{YYMMDD}-event-conditioned.md. First passing week sets a baseline; subsequent weeks compare against it.