
Phase 5 · Ensemble Evaluation

Priority: Medium · Status: Pending · Depends on: Phase 4 (production predictions)

  • Signal live testing already exists: signal_evaluations, signal_performance tables
  • Walk-forward replay: kronos_replay.py with as_of timestamps
  • Base Kronos predictions cached in ml_predictions

Evaluate three predictors and their ensemble: kronos-base, kronos-event (new), and chronos-2. Metrics cover envelope calibration, post-event directional accuracy, non-event-day regression, and ensemble meta-learner weighting via LightGBM on signal_evaluations.

  • Compare p10/p50/p90 envelopes on event days vs non-event days
  • Measure directional accuracy after surprise events
  • Walk-forward backtest: train 2021-2024, test 2025-2026
  • Regression check: non-event day predictions vs base model
  • Automated evaluation script runnable as cron job
  • Results saved as JSON + markdown report
Test 1: Envelope Calibration

For each high-impact event type (FOMC, CPI, NFP):
event_spread = mean(p90 - p10) on event days
baseline_spread = mean(p90 - p10) on non-event days within ±7 days of the event
spread_ratio = event_spread / baseline_spread
Target: spread_ratio >= 1.5 (50% wider on event days)
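The Test 1 ratio can be computed directly from cached envelopes. A minimal sketch; the function name and `[p10, p90]` array layout below are illustrative, not the project's actual API:

```python
import numpy as np

def spread_ratio(event_days: np.ndarray, baseline_days: np.ndarray) -> float:
    """Ratio of mean p90-p10 spread on event days vs nearby non-event days.

    Each input is an (n, 2) array of [p10, p90] per day.
    Hypothetical helper, not part of event_eval.py yet.
    """
    event_spread = np.mean(event_days[:, 1] - event_days[:, 0])
    baseline_spread = np.mean(baseline_days[:, 1] - baseline_days[:, 0])
    return event_spread / baseline_spread

# Toy check: event-day envelopes twice as wide as baseline
events = np.array([[98.0, 102.0], [97.0, 103.0]])    # spreads 4 and 6 -> mean 5
baseline = np.array([[99.0, 101.5], [99.0, 101.5]])  # spreads 2.5 -> mean 2.5
ratio = spread_ratio(events, baseline)               # 2.0, above the 1.5 target
```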

Test 2: Directional Accuracy Post-Surprise

For each event with known surprise direction:
expected_direction = map_event_surprise_to_asset_direction(event, asset_class)
actual_direction = sign(actual_price_change)
predicted_direction = sign(p50 - last_close)
did_envelope_widen = (p90-p10) > baseline_spread
did_p50_shift = sign(p50_event - p50_base) == expected_direction
Metrics:
envelope_widen_rate = % of events where spread increased
directional_accuracy = % of events where p50 shifted correctly
surprise_detection = envelope_widen_rate * directional_accuracy
Target: directional_accuracy >= 60%
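The Test 2 aggregation can be sketched as follows; `surprise_metrics` and its input encoding (+1/-1 directions, boolean widening flags) are hypothetical, not the project's event_eval API:

```python
import numpy as np

def surprise_metrics(expected_dir, p50_shift_dir, spread_widened):
    """Aggregate the Test 2 metrics over a list of surprise events.

    expected_dir / p50_shift_dir: sequences of +1/-1 per event;
    spread_widened: booleans, True where (p90-p10) exceeded baseline_spread.
    """
    expected_dir = np.asarray(expected_dir)
    p50_shift_dir = np.asarray(p50_shift_dir)
    spread_widened = np.asarray(spread_widened)
    envelope_widen_rate = spread_widened.mean()
    directional_accuracy = (p50_shift_dir == expected_dir).mean()
    surprise_detection = envelope_widen_rate * directional_accuracy
    return envelope_widen_rate, directional_accuracy, surprise_detection

# 4 toy events: 3 widened envelopes, 3 correct p50 shifts
widen, acc, det = surprise_metrics(
    expected_dir=[1, -1, 1, 1],
    p50_shift_dir=[1, -1, -1, 1],
    spread_widened=[True, True, False, True])
# widen = 0.75, acc = 0.75, det = 0.5625
```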
Test 3: Non-Event Day Regression

For all non-event days in the test period:
base_pred = base Kronos p50
event_pred = event-conditioned p50 (events=None)
mae_diff = |mean(abs(event_pred - actual)) - mean(abs(base_pred - actual))|
dir_accuracy_diff = |dir_accuracy(event) - dir_accuracy(base)|
Target: mae_diff < 2%, dir_accuracy_diff < 2%
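The Test 3 comparison reduces to two MAEs and their absolute difference. A sketch assuming predictions and actuals are expressed as normalized returns, so `tol=0.02` maps to the 2% target; `regression_check` is a hypothetical helper:

```python
import numpy as np

def regression_check(base_p50, event_p50, actual, tol=0.02):
    """Test 3: with no active events, the event-conditioned model should
    track the base model. Returns (mae_diff, passed)."""
    actual = np.asarray(actual)
    base_mae = np.mean(np.abs(np.asarray(base_p50) - actual))
    event_mae = np.mean(np.abs(np.asarray(event_p50) - actual))
    mae_diff = abs(event_mae - base_mae)
    return mae_diff, bool(mae_diff < tol)

# Toy non-event days where both models are equally close to the actuals
mae_diff, passed = regression_check(
    base_p50=[0.010, -0.005], event_p50=[0.012, -0.003],
    actual=[0.011, -0.004])
```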
Test 4: Walk-Forward Backtest

Train: 2021-01-01 to 2024-12-31
Test: 2025-01-01 to 2026-04-23
Strategy: Long when p50 > last_close AND p90-p10 < threshold
Short when p50 < last_close AND p90-p10 < threshold
Flat when p90-p10 > threshold (uncertainty too high)
Compare: kronos-base / kronos-event / chronos-2 / ensemble Sharpe
Target: kronos-event Sharpe >= base Sharpe (any improvement counts)
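The long/short/flat rule above can be sketched as a single function; `position` and `spread_threshold` are illustrative names, with the threshold left as a tuning parameter:

```python
def position(p50: float, last_close: float, p10: float, p90: float,
             spread_threshold: float) -> int:
    """Test 4 trading rule: +1 long, -1 short, 0 flat.

    Hypothetical sketch of the backtest's position logic, not the
    production strategy.
    """
    if p90 - p10 > spread_threshold:
        return 0   # envelope too wide: uncertainty too high, stay flat
    if p50 > last_close:
        return 1   # median forecast above last close: long
    if p50 < last_close:
        return -1  # median forecast below last close: short
    return 0

assert position(101.0, 100.0, 99.0, 102.0, spread_threshold=5.0) == 1
assert position(99.0, 100.0, 98.0, 101.0, spread_threshold=5.0) == -1
assert position(100.5, 100.0, 95.0, 103.0, spread_threshold=5.0) == 0  # spread 8 > 5
```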

Test 5: Ensemble Meta-Learner (LightGBM Stacking)

Features per (symbol, timestamp):
- kronos_base_p50, kronos_base_spread (p90-p10)
- kronos_event_p50, kronos_event_spread
- chronos2_p50, chronos2_spread
- regime_flag (trending / ranging / volatile)
- event_flag (any high-impact event active)
- asset_class (crypto / forex / equity / commodity)
Target: realized 1h / 1d return direction (binary or sign(return))
Train: signal_evaluations table, 2025 data
Test: 2026 Q1 walk-forward
Per-asset-class model (3 classes: crypto / forex+commodity / equity) — NOT per-symbol
(≥500 labeled evals per class required)
Metric:
ensemble_sharpe = strategy(weighted p50) Sharpe
best_single_sharpe = max over {kronos-base, kronos-event, chronos-2}
Target: ensemble_sharpe >= best_single_sharpe + 0.1
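The stacking setup can be sketched end-to-end on synthetic rows. Two stated assumptions: LightGBM is swapped for sklearn's GradientBoostingClassifier (same gradient-boosting family) to keep the sketch dependency-light, and the feature rows are random stand-ins for signal_evaluations, not real data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for rows pulled from signal_evaluations, one row per
# (symbol, timestamp). Categorical features (regime_flag, asset_class) are
# omitted for brevity; in production there is one model per asset class.
n = 400
X = np.column_stack([
    rng.normal(size=n),        # kronos_base_p50 (as a normalized return)
    rng.uniform(0.5, 2.0, n),  # kronos_base_spread (p90 - p10)
    rng.normal(size=n),        # kronos_event_p50
    rng.uniform(0.5, 2.0, n),  # kronos_event_spread
    rng.normal(size=n),        # chronos2_p50
    rng.uniform(0.5, 2.0, n),  # chronos2_spread
    rng.integers(0, 2, n),     # event_flag
])
# Toy label: realized direction follows the average of the three p50s
y = (X[:, [0, 2, 4]].mean(axis=1) > 0).astype(int)

meta = GradientBoostingClassifier(random_state=0).fit(X, y)
train_acc = meta.score(X, y)  # in-sample fit only; the real eval is walk-forward
```

In production the Sharpe comparison, not accuracy, decides whether the ensemble beats the best single model.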
  1. Create kronos-service/eval/event_eval.py
  2. Implement envelope calibration metric
  3. Implement directional accuracy metric with event surprise mapping
  4. Implement non-event regression comparison
  5. Implement walk-forward backtest (reuse signal_evaluations table)
  6. Create kronos-service/eval/ensemble_meta.py — LightGBM stacking on signal_evaluations
  7. Per-asset-class meta-learner training (3 classes)
  8. Generate evaluation report (JSON + markdown) comparing all 3 predictors + ensemble
  9. Add to cron: run evaluation weekly alongside score_signals
  • Create: kronos-service/eval/event_eval.py
  • Create: kronos-service/eval/ensemble_meta.py
  • Read: scripts/score_signals.py (evaluation pattern)
  • Read: supabase/migrations/20260422_signal_live_testing.sql (signal schema)
  • Read: supabase/migrations/20260423_ml_predictions_model_name.sql (multi-model schema from Phase 0)
  • Evaluation script runs end-to-end on test period data
  • Report shows all 4 metrics with clear pass/fail
  • Envelope calibration: spread_ratio >= 1.5 on event days
  • Directional accuracy: >= 60% post-surprise
  • Non-event regression: MAE within 2% of base
  • Walk-forward: event-conditioned Sharpe >= base Sharpe
  • Ensemble: ensemble Sharpe >= best single model Sharpe + 0.1
  • Per-asset-class meta-learner: ≥500 evals per class before training