# Ensemble Design
Single-model prediction is fragile to regime shifts. The ensemble combines three token-based forecasters spanning two distinct architectures (two Kronos variants plus Chronos-2), topped with a lightweight stacking layer trained on our own signal evaluations.
## Ensemble layout
```mermaid
flowchart TB
    OHLCV[OHLCV + context] --> K1[Kronos-base<br/>frozen LoRA]
    OHLCV --> K2[Kronos-event<br/>20-ch conditioning]
    OHLCV --> C2[Chronos-2<br/>Amazon, Apache 2.0]
    K1 --> MP[ml_predictions<br/>model_name='kronos-base']
    K2 --> MP2[ml_predictions<br/>model_name='kronos-event']
    C2 --> MP3[ml_predictions<br/>model_name='chronos-2']
    MP --> META[LightGBM Meta-Learner<br/>per asset class]
    MP2 --> META
    MP3 --> META
    SIG[signal_evaluations<br/>labels] --> META
    META --> OUT[Ensemble p50 + spread]
```

## Why three models, not one
| Model | Strengths | Weaknesses |
|---|---|---|
| Kronos-base | Proven on pre-training corpus; stable baseline | No event awareness; stale distribution |
| Kronos-event | Event-day envelope widening; surprise-directional p50 shift | New; risk of overfitting event examples |
| Chronos-2 | Zero-shot robust to new regimes; Amazon pre-training corpus | Not fine-tuned to our instruments |
Disagreement is itself a signal: when all three agree on direction, confidence is high; when they diverge, the meta-learner learns which model tends to win under which conditions.
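For illustration, cross-model directional agreement can be collapsed into a single number. A minimal sketch, assuming the p50s are price-level forecasts compared against the last close (the function and its signature are hypothetical, not from the codebase):

```python
def direction_agreement(p50s: dict[str, float], last_close: float) -> float:
    """Fraction of models whose p50 forecast points the same way as the majority.

    p50s: median forecasts keyed by model name, e.g.
          {"kronos-base": ..., "kronos-event": ..., "chronos-2": ...}
    Returns 1.0 when all three agree on direction, ~0.67 on a 2-vs-1 split.
    """
    directions = [1 if p > last_close else -1 for p in p50s.values()]
    ups = sum(1 for d in directions if d == 1)
    majority = max(ups, len(directions) - ups)
    return majority / len(directions)

# All three p50s above the last close: full agreement
assert direction_agreement(
    {"kronos-base": 101.0, "kronos-event": 102.5, "chronos-2": 100.4},
    last_close=100.0,
) == 1.0
```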
## Meta-learner features
LightGBM is trained on `signal_evaluations`:
| Feature group | Count | Example |
|---|---|---|
| Per-model p50 | 3 | kronos_base_p50, kronos_event_p50, chronos2_p50 |
| Per-model spread | 3 | p90 - p10 per model |
| Regime flag | 1 | trending / ranging / volatile |
| Event flag | 1 | any high-impact event active |
| Asset class | 1 | crypto / forex+commodity / equity |
Target: realized direction (binary) at 1 h / 1 d horizon.
Training scope: per asset class (three meta-learners, one per class), not per symbol; this avoids overfitting when some symbols have fewer than 500 labelled evaluations.
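The feature table above maps to a flat nine-element row. A hypothetical sketch of that layout; the model-name keys match the `model_name` values in `ml_predictions`, but the encodings and function name are assumptions:

```python
# Illustrative categorical encodings (assumed, not from the pipeline)
REGIMES = {"trending": 0, "ranging": 1, "volatile": 2}
ASSET_CLASSES = {"crypto": 0, "forex+commodity": 1, "equity": 2}

MODELS = ("kronos-base", "kronos-event", "chronos-2")

def feature_row(preds: dict, regime: str, event_active: bool,
                asset_class: str) -> list[float]:
    """Flatten one evaluation into the 9-feature layout from the table above.

    preds: {"kronos-base": {"p10": ..., "p50": ..., "p90": ...}, ...}
    """
    row = []
    for model in MODELS:
        row.append(preds[model]["p50"])                  # per-model p50 (3)
    for model in MODELS:
        q = preds[model]
        row.append(q["p90"] - q["p10"])                  # per-model spread (3)
    row.append(float(REGIMES[regime]))                   # regime flag (1)
    row.append(1.0 if event_active else 0.0)             # event flag (1)
    row.append(float(ASSET_CLASSES[asset_class]))        # asset class (1)
    return row
```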
## Selection rules for live serving
At inference time:

```python
if meta_confidence(prediction) < 0.55:
    # No clear winner — serve the envelope with the widest spread
    # (most uncertain) to communicate low conviction
    return widest_spread_model.quantiles
else:
    # Weighted blend
    return weighted_avg(all_three, meta_weights)
```
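A self-contained sketch of the same rule, assuming quantile forecasts arrive as p10/p50/p90 triples; the `Quantiles` type, the blend form, and the weight normalization are assumptions, while the 0.55 threshold and the widest-spread fallback come from the rule above:

```python
from dataclasses import dataclass

@dataclass
class Quantiles:
    p10: float
    p50: float
    p90: float

    @property
    def spread(self) -> float:
        return self.p90 - self.p10

def serve(quantiles: dict[str, Quantiles],
          meta_confidence: float,
          meta_weights: dict[str, float],
          threshold: float = 0.55) -> Quantiles:
    """Selection rule: widest envelope below threshold, weighted blend above."""
    if meta_confidence < threshold:
        # No clear winner: serve the widest (most uncertain) envelope
        return max(quantiles.values(), key=lambda q: q.spread)
    # Weighted blend of all three quantile sets
    total = sum(meta_weights.values())
    return Quantiles(
        p10=sum(meta_weights[m] * q.p10 for m, q in quantiles.items()) / total,
        p50=sum(meta_weights[m] * q.p50 for m, q in quantiles.items()) / total,
        p90=sum(meta_weights[m] * q.p90 for m, q in quantiles.items()) / total,
    )
```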
## Why not just average them?

Simple averaging hurts when one model is clearly wrong. Consider a FOMC day: Kronos-base misses the volatility, Kronos-event widens its envelope correctly, and Chronos-2 picks up the cross-asset reaction. Equal weighting dilutes the signal from the one model that got it right. Stacking fixes this by learning conditional weights.
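A quick arithmetic sketch of the dilution argument, with invented numbers: suppose the realized FOMC-day move is +2.0% and only Kronos-event comes close.

```python
# Hypothetical p50 return forecasts (%) on an event day
preds = {"kronos-base": -0.1, "kronos-event": 1.8, "chronos-2": 0.9}
realized = 2.0

equal = sum(preds.values()) / len(preds)   # simple average, ~0.87

# Hypothetical conditional weights the meta-learner might assign on event days
weights = {"kronos-base": 0.1, "kronos-event": 0.6, "chronos-2": 0.3}
stacked = sum(weights[m] * p for m, p in preds.items())   # ~1.34

# The stacked forecast sits much closer to the realized +2.0 than the
# equal-weight average, which is dragged down by the one wrong model.
```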
## Regime-shift canary
Beyond the stacked output, Chronos-2 plays a canary role. Its zero-shot nature means it responds quickly to regime shifts (new asset-class behaviour, structural breaks) while Kronos still operates under pre-training assumptions. Divergence between Kronos and Chronos-2 flags a potential model-update need and feeds into the Phase 6 rolling fine-tune triggers.
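One way such a canary could be operationalized (purely a sketch; the window, threshold, and relative-gap metric are all assumptions): track the rolling mean gap between the two p50 streams and flag when it stays wide.

```python
from collections import deque

class RegimeCanary:
    """Flags a potential fine-tune trigger when Kronos and Chronos-2 p50
    forecasts persistently diverge. Window and threshold are illustrative."""

    def __init__(self, window: int = 50, threshold: float = 0.02):
        self.gaps = deque(maxlen=window)
        self.threshold = threshold

    def update(self, kronos_p50: float, chronos2_p50: float) -> bool:
        # Relative gap between the two median forecasts
        base = max(abs(kronos_p50), abs(chronos2_p50), 1e-12)
        self.gaps.append(abs(kronos_p50 - chronos2_p50) / base)
        mean_gap = sum(self.gaps) / len(self.gaps)
        # True -> candidate for a rolling fine-tune trigger
        return mean_gap > self.threshold
```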
## Success criterion
From `phase-05-evaluation.md`:
```
ensemble_sharpe >= best_single_sharpe + 0.1
```
If the ensemble doesn’t beat the best single model by ≥0.1 Sharpe, the extra complexity isn’t justified and we fall back to a single predictor.
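The gate reduces to a one-line check; the 0.1 margin comes from `phase-05-evaluation.md`, while the function name and input shape are ours:

```python
def ensemble_justified(ensemble_sharpe: float,
                       single_sharpes: dict[str, float],
                       margin: float = 0.1) -> bool:
    """True if the ensemble beats the best single model by >= `margin` Sharpe."""
    return ensemble_sharpe >= max(single_sharpes.values()) + margin

# Example (invented numbers): best single Sharpe 1.2, so the ensemble
# must reach at least 1.3 to justify the extra complexity.
```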