Evaluation Metrics
Five tests run weekly against the live prediction + signal evaluation tables. Each test has a numeric success gate from phase-05-evaluation.md.
Test 1 · Envelope calibration
Does the envelope widen on event days?
spread_ratio = mean(p90 - p10) on event days / mean(p90 - p10) on non-event days within ±7 days of the event
Target: spread_ratio ≥ 1.5.
Per event type (FOMC, CPI, NFP) — expect each to pass independently.
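A minimal sketch of the spread-ratio check (the function name `spread_ratio` and the toy arrays are illustrative, not from the eval scripts):

```python
import numpy as np

def spread_ratio(p90, p10, is_event_day):
    """Mean envelope width on event days divided by the mean width on non-event days."""
    width = np.asarray(p90) - np.asarray(p10)
    mask = np.asarray(is_event_day, dtype=bool)
    return width[mask].mean() / width[~mask].mean()

# toy data: the envelope is twice as wide on the two event days
ratio = spread_ratio(
    p90=[4.0, 2.0, 4.0, 2.0],
    p10=[0.0, 0.0, 0.0, 0.0],
    is_event_day=[True, False, True, False],
)
print(ratio)  # 2.0 — passes the ≥ 1.5 gate
```

In the real test this would be computed once per event type (FOMC, CPI, NFP), since each must pass independently.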
Test 2 · Directional accuracy post-surprise
Does p50 shift in the expected direction after a surprise?
expected_direction = map(event, asset_class)  # e.g. CPI beat → bearish risk assets → expected_direction = -1 for equities
actual_direction = sign(price_change)
predicted_shift = sign(p50_event - p50_base)
directional_accuracy = P(predicted_shift == expected_direction)
Target: ≥ 60 % on surprise events with |z| > 0.5.
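The accuracy computation can be sketched directly from the pseudocode above (the function name and toy values are illustrative):

```python
import numpy as np

def directional_accuracy(p50_event, p50_base, expected_direction):
    """Fraction of surprise events where the p50 shift matches the expected sign."""
    predicted_shift = np.sign(np.asarray(p50_event) - np.asarray(p50_base))
    return float(np.mean(predicted_shift == np.asarray(expected_direction)))

# toy: 3 of 4 p50 shifts agree with the mapped expected direction
acc = directional_accuracy(
    p50_event=[101, 99, 102, 98],
    p50_base=[100, 100, 100, 100],
    expected_direction=[1, -1, 1, 1],
)
print(acc)  # 0.75 — clears the 60 % target
```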
Test 3 · Non-event-day regression
Does adding event conditioning hurt predictions on quiet days?
mae_diff = |mean(|event_pred - actual|) - mean(|base_pred - actual|)|
dir_diff = |dir_accuracy(event) - dir_accuracy(base)|
Target: mae_diff < 2 %, dir_diff < 2 %.
This is the critical backward-compatibility gate — if event conditioning regresses quiet-day performance, ship-stop.
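A sketch of the MAE half of the gate, assuming the percentage target is interpreted relative to baseline MAE (function name and toy errors are illustrative):

```python
import numpy as np

def quiet_day_mae_gap(event_pred, base_pred, actual):
    """Relative MAE gap between event-conditioned and baseline predictors on quiet days."""
    event_pred, base_pred, actual = map(np.asarray, (event_pred, base_pred, actual))
    mae_event = np.abs(event_pred - actual).mean()
    mae_base = np.abs(base_pred - actual).mean()
    # expressed as a fraction of baseline MAE so the < 2 % gate is scale-free
    return abs(mae_event - mae_base) / mae_base

# toy: baseline errors of 0.100, event-conditioned errors of 0.101 → ~1 % gap
gap = quiet_day_mae_gap(
    event_pred=[0.101, -0.101],
    base_pred=[0.100, -0.100],
    actual=[0.0, 0.0],
)
print(gap < 0.02)  # True — within the backward-compatibility gate
```

The `dir_diff` half is the analogous absolute difference of the two directional-accuracy numbers.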
Test 4 · Walk-forward backtest
Train: 2021-01-01 → 2024-12-31
Test: 2025-01-01 → 2026-04-23
Strategy:
Long when p50 > last_close AND p90 - p10 < threshold
Short when p50 < last_close AND p90 - p10 < threshold
Flat when p90 - p10 > threshold (uncertainty too high)
Compared across all four predictors: kronos-base, kronos-event, chronos-2, ensemble.
Target: kronos-event Sharpe ≥ kronos-base Sharpe — any improvement counts.
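The position rule and the Sharpe comparison can be sketched as follows (function names, the toy inputs, and the zero risk-free-rate assumption are mine, not from the backtest code):

```python
import numpy as np

def positions(p50, last_close, p90, p10, threshold):
    """+1 long, -1 short, 0 flat, per the envelope-width rule above."""
    width = np.asarray(p90) - np.asarray(p10)
    pos = np.sign(np.asarray(p50) - np.asarray(last_close))
    pos[width > threshold] = 0  # uncertainty too high → stand aside
    return pos

def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a daily return series (risk-free rate assumed 0)."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

pos = positions(p50=[101, 99, 100], last_close=[100, 100, 100],
                p90=[102, 101, 110], p10=[100, 98, 90], threshold=5)
print(pos.tolist())  # [1, -1, 0] — long, short, flat
```

Running this per predictor over the 2025-01-01 → 2026-04-23 test window and comparing `sharpe()` of the resulting strategy returns gives the kronos-event ≥ kronos-base gate.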
Test 5 · Ensemble meta-learner (LightGBM)
Per-asset-class stacking (crypto / forex+commodity / equity) over signal_evaluations. See Ensemble Design for features and target.
Target: ensemble_sharpe ≥ best_single_sharpe + 0.1.
If the ensemble doesn’t beat the best single model by ≥ 0.1 Sharpe, fall back to the single predictor and cut the complexity.
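The fallback rule can be sketched as plain selection logic, separate from the LightGBM stacker itself (the function name `select_predictor` and the Sharpe values are illustrative):

```python
def select_predictor(sharpes, min_lift=0.1):
    """Keep the ensemble only if it beats the best single model by >= min_lift Sharpe."""
    singles = {name: s for name, s in sharpes.items() if name != "ensemble"}
    best_single = max(singles, key=singles.get)
    if sharpes.get("ensemble", float("-inf")) >= singles[best_single] + min_lift:
        return "ensemble"
    return best_single  # cut the complexity, fall back to the single predictor

choice = select_predictor({"kronos-base": 0.8, "kronos-event": 1.0,
                           "chronos-2": 0.9, "ensemble": 1.05})
print(choice)  # kronos-event — ensemble lift is only 0.05, below the 0.1 gate
```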
Why these specific metrics
| Metric | What it really tells us |
|---|---|
| Spread ratio | Does the model know when it’s uncertain? |
| Directional accuracy | Can it react correctly to surprise shocks? |
| Non-event MAE diff | Did we break the baseline while chasing event gains? |
| Walk-forward Sharpe | Would this have made money out-of-sample? |
| Ensemble lift | Is the complexity actually paying for itself? |
Together they catch the two failure modes of a financial forecaster: (a) overconfident calibration; (b) getting the direction right but the magnitude wrong.
Weekly evaluation cron
0 8 * * 0 → run eval/event_eval.py
0 9 * * 0 → run eval/ensemble_meta.py (after event_eval)
Writes JSON + markdown to plans/reports/eval-{YYMMDD}-event-conditioned.md. The first passing week sets a baseline; subsequent weeks compare against it.
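A small sketch of how the report path could be derived for a given Sunday run, assuming `{YYMMDD}` means a two-digit year (the helper name is illustrative):

```python
from datetime import date

def report_path(run_date: date) -> str:
    """Weekly eval report path, e.g. plans/reports/eval-260419-event-conditioned.md."""
    return f"plans/reports/eval-{run_date:%y%m%d}-event-conditioned.md"

print(report_path(date(2026, 4, 19)))  # plans/reports/eval-260419-event-conditioned.md
```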