Phase 5 · Ensemble Evaluation
Priority: Medium · Status: Pending · Depends on: Phase 4 (production predictions)
Context
- Signal live testing already exists: signal_evaluations, signal_performance tables
- Walk-forward replay: kronos_replay.py with as_of timestamps
- Base Kronos predictions cached in ml_predictions
Overview
Evaluate three predictors and their ensemble: kronos-base, kronos-event (new), and chronos-2. Metrics cover envelope calibration, post-event directional accuracy, non-event-day regression, and ensemble weighting via a LightGBM meta-learner trained on signal_evaluations.
Requirements

Functional
Section titled “Functional”- Compare p10/p50/p90 envelopes on event days vs non-event days
- Measure directional accuracy after surprise events
- Walk-forward backtest: train 2021-2024, test 2025-2026
- Regression check: non-event day predictions vs base model
Non-Functional
- Automated evaluation script runnable as a cron job
- Results saved as JSON + markdown report
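The JSON + markdown output requirement can be met with a small writer; a minimal sketch, assuming a results layout of test name mapped to value/target/pass (the layout and file names are illustrative, not an existing schema):

```python
import json
from pathlib import Path

def write_report(results: dict, out_dir: str = "eval_reports") -> None:
    """Persist one evaluation run as report.json plus report.md.

    `results` maps test name -> {"value": float, "target": str, "passed": bool};
    this shape is an assumption for illustration, not an existing schema.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Machine-readable copy for downstream tooling
    (out / "report.json").write_text(json.dumps(results, indent=2))
    # Human-readable markdown summary table
    lines = ["# Ensemble Evaluation Report", "",
             "| Test | Value | Target | Pass |",
             "|---|---|---|---|"]
    for name, r in results.items():
        lines.append(f"| {name} | {r['value']:.3f} | {r['target']} | "
                     f"{'PASS' if r['passed'] else 'FAIL'} |")
    (out / "report.md").write_text("\n".join(lines) + "\n")
```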
Evaluation Metrics

Test 1: Envelope Calibration
For each high-impact event type (FOMC, CPI, NFP):

    event_spread = mean(p90 - p10) on event days
    baseline_spread = mean(p90 - p10) on non-event days within a ±7-day window
    spread_ratio = event_spread / baseline_spread
Target: spread_ratio >= 1.5 (envelopes 50% wider on event days)
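The Test 1 computation reduces to a few array operations; a minimal sketch (function name and input layout are illustrative, and callers are assumed to have already selected event days and the ±7-day baseline window):

```python
import numpy as np

def spread_ratio(event_rows: np.ndarray, baseline_rows: np.ndarray) -> float:
    """Ratio of mean p90-p10 spread on event days vs nearby non-event days.

    Each input is an (n, 2) array of [p10, p90] rows.
    """
    event_spread = np.mean(event_rows[:, 1] - event_rows[:, 0])
    baseline_spread = np.mean(baseline_rows[:, 1] - baseline_rows[:, 0])
    return float(event_spread / baseline_spread)

# Example: event-day spreads of 4 and 6 (mean 5) vs baseline 2 and 3 (mean 2.5)
events = np.array([[98.0, 102.0], [97.0, 103.0]])
baseline = np.array([[99.0, 101.0], [98.5, 101.5]])
ratio = spread_ratio(events, baseline)  # -> 2.0, passes the >= 1.5 target
```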
Test 2: Directional Accuracy Post-Surprise
For each event with a known surprise direction:

    expected_direction = map_event_surprise_to_asset_direction(event, asset_class)
    actual_direction = sign(actual_price_change)
    predicted_direction = sign(p50 - last_close)
    did_envelope_widen = (p90 - p10) > baseline_spread
    did_p50_shift = sign(p50_event - p50_base) == expected_direction

Metrics:

    envelope_widen_rate = % of events where the spread increased
    directional_accuracy = % of events where p50 shifted in the expected direction
    surprise_detection = envelope_widen_rate * directional_accuracy
Target: directional_accuracy >= 60%
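A sketch of the Test 2 aggregation over a batch of surprise events (helper name and array layout are assumptions, not from the codebase):

```python
import numpy as np

def surprise_metrics(expected_dir, p50_event, p50_base, spreads, baseline_spread):
    """Aggregate Test 2 metrics over a list of surprise events.

    expected_dir: +1/-1 expected direction per event
    p50_event / p50_base: event-conditioned vs base p50 per event
    spreads: p90 - p10 per event; baseline_spread: scalar non-event mean
    """
    expected_dir = np.asarray(expected_dir)
    # Did the event-conditioned p50 shift in the expected direction?
    shift_ok = np.sign(np.asarray(p50_event) - np.asarray(p50_base)) == expected_dir
    # Did the envelope widen relative to the non-event baseline?
    widened = np.asarray(spreads) > baseline_spread
    envelope_widen_rate = widened.mean()
    directional_accuracy = shift_ok.mean()
    return {
        "envelope_widen_rate": float(envelope_widen_rate),
        "directional_accuracy": float(directional_accuracy),
        "surprise_detection": float(envelope_widen_rate * directional_accuracy),
    }

m = surprise_metrics([+1, -1, +1, +1],
                     p50_event=[101, 99, 102, 100.5],
                     p50_base=[100, 100, 100, 101],
                     spreads=[5, 6, 4, 2], baseline_spread=3.0)
# 3 of 4 correct shifts and 3 of 4 widened envelopes
```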
Test 3: Non-Event Day Regression
For all non-event days in the test period:

    base_pred = base Kronos p50
    event_pred = event-conditioned p50 (events=None)
    mae_diff = |mean(abs(event_pred - actual)) - mean(abs(base_pred - actual))|
    dir_accuracy_diff = |dir_accuracy(event) - dir_accuracy(base)|
Target: mae_diff < 2%, dir_accuracy_diff < 2%
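The Test 3 check could be sketched as below. Note the spec does not say whether the 2% MAE threshold is absolute or relative; this sketch assumes relative to the base model's MAE, and the function name and argument layout are illustrative:

```python
import numpy as np

def regression_check(base_p50, event_p50, actual, last_close, tol=0.02):
    """Test 3: event conditioning must not degrade non-event days.

    Returns (mae_diff, dir_diff, passed) over aligned 1-D arrays.
    Direction accuracy compares the sign of the predicted vs realized
    change from the previous close.
    """
    base_p50, event_p50, actual, last_close = map(
        np.asarray, (base_p50, event_p50, actual, last_close))
    mae_base = np.mean(np.abs(base_p50 - actual))
    mae_event = np.mean(np.abs(event_p50 - actual))
    # Assumption: the 2% threshold is relative to the base MAE.
    mae_diff = abs(mae_event - mae_base) / mae_base
    actual_sign = np.sign(actual - last_close)
    dir_base = np.mean(np.sign(base_p50 - last_close) == actual_sign)
    dir_event = np.mean(np.sign(event_p50 - last_close) == actual_sign)
    dir_diff = abs(dir_event - dir_base)
    return mae_diff, dir_diff, bool(mae_diff < tol and dir_diff < tol)
```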
Test 4: Walk-Forward Backtest
Train: 2021-01-01 to 2024-12-31
Test: 2025-01-01 to 2026-04-23

Strategy:
- Long when p50 > last_close AND (p90 - p10) < threshold
- Short when p50 < last_close AND (p90 - p10) < threshold
- Flat when (p90 - p10) > threshold (uncertainty too high)
Compare Sharpe ratios across kronos-base / kronos-event / chronos-2 / ensemble.
Target: kronos-event Sharpe >= base Sharpe (any improvement counts)
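The strategy rule above maps directly to a position function; a minimal sketch (the threshold value is not specified in this phase and must be chosen per asset):

```python
def position(p50: float, last_close: float, spread: float, threshold: float) -> int:
    """Test 4 trading rule: +1 long, -1 short, 0 flat.

    Flat whenever the p90-p10 spread exceeds the uncertainty threshold;
    otherwise trade in the direction of the p50 forecast.
    """
    if spread > threshold:
        return 0  # uncertainty too high
    if p50 > last_close:
        return 1
    if p50 < last_close:
        return -1
    return 0  # no edge when p50 == last_close

assert position(p50=105, last_close=100, spread=2, threshold=5) == 1
assert position(p50=95, last_close=100, spread=2, threshold=5) == -1
assert position(p50=105, last_close=100, spread=8, threshold=5) == 0
```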
Test 5: Ensemble Meta-Learner (LightGBM Stacking)
Features per (symbol, timestamp):
- kronos_base_p50, kronos_base_spread (p90 - p10)
- kronos_event_p50, kronos_event_spread
- chronos2_p50, chronos2_spread
- regime_flag (trending / ranging / volatile)
- event_flag (any high-impact event active)
- asset_class (crypto / forex / equity / commodity)
Target: realized 1h / 1d return direction (binary, or sign(return))
Train: signal_evaluations table, 2025 data
Test: 2026 Q1, walk-forward
Scope: per-asset-class model (3 classes: crypto / forex+commodity / equity), NOT per-symbol; ≥500 labeled evals per class required before training
Metrics:

    ensemble_sharpe = Sharpe of strategy(weighted p50)
    best_single_sharpe = max Sharpe over {kronos-base, kronos-event, chronos-2}

Target: ensemble_sharpe >= best_single_sharpe + 0.1
Implementation Steps
- Create kronos-service/eval/event_eval.py
- Implement envelope calibration metric
- Implement directional accuracy metric with event surprise mapping
- Implement non-event regression comparison
- Implement walk-forward backtest (reuse signal_evaluations table)
- Create kronos-service/eval/ensemble_meta.py (LightGBM stacking on signal_evaluations)
- Per-asset-class meta-learner training (3 classes)
- Generate evaluation report (JSON + markdown) comparing all 3 predictors + ensemble
- Add to cron: run evaluation weekly alongside score_signals
Key Files
- Create: kronos-service/eval/event_eval.py
- Create: kronos-service/eval/ensemble_meta.py
- Read: scripts/score_signals.py (evaluation pattern)
- Read: supabase/migrations/20260422_signal_live_testing.sql (signal schema)
- Read: supabase/migrations/20260423_ml_predictions_model_name.sql (multi-model schema from Phase 0)
Success Criteria
- Evaluation script runs end-to-end on test-period data
- Report shows all five tests with a clear pass/fail for each
- Envelope calibration: spread_ratio >= 1.5 on event days
- Directional accuracy: >= 60% post-surprise
- Non-event regression: MAE within 2% of base
- Walk-forward: event-conditioned Sharpe >= base Sharpe
- Ensemble: ensemble Sharpe >= best single model Sharpe + 0.1
- Per-asset-class meta-learner: ≥500 evals per class before training