
# Ensemble Design

Single-model prediction is fragile to regime shifts. The ensemble combines three token-based forecasters spanning two distinct architectures (two Kronos variants and Chronos-2), plus a lightweight stacking layer trained on our own signal evaluations.

```mermaid
flowchart TB
    OHLCV[OHLCV + context] --> K1[Kronos-base<br/>frozen LoRA]
    OHLCV --> K2[Kronos-event<br/>20-ch conditioning]
    OHLCV --> C2[Chronos-2<br/>Amazon, Apache 2.0]
    K1 --> MP[ml_predictions<br/>model_name='kronos-base']
    K2 --> MP2[ml_predictions<br/>model_name='kronos-event']
    C2 --> MP3[ml_predictions<br/>model_name='chronos-2']
    MP --> META[LightGBM Meta-Learner<br/>per asset class]
    MP2 --> META
    MP3 --> META
    SIG[signal_evaluations<br/>labels] --> META
    META --> OUT[Ensemble p50 + spread]
```
| Model | Strengths | Weaknesses |
|---|---|---|
| Kronos-base | Proven on pre-training corpus; stable baseline | No event awareness; stale distribution |
| Kronos-event | Event-day envelope widening; surprise-directional p50 shift | New; risk of overfitting event examples |
| Chronos-2 | Zero-shot robustness to new regimes; Amazon pre-training corpus | Not fine-tuned to our instruments |

Disagreement = signal. When all three agree on direction, confidence is high. When they diverge, the meta-learner picks up the pattern of which model wins under which conditions.
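A minimal sketch of the unanimity check, using a sign vote over the three per-model p50s (the helper name and tuple return shape are illustrative, not part of the pipeline):

```python
def direction_vote(p50s, last_price):
    """Vote on direction across the three model medians.

    Returns (direction, unanimous): unanimous agreement is treated as
    high confidence; a split vote is deferred to the meta-learner.
    """
    signs = [1 if p > last_price else -1 for p in p50s]
    ups = signs.count(1)
    if ups == 3:
        return "up", True
    if ups == 0:
        return "down", True
    return ("up" if ups >= 2 else "down"), False
```

In practice the meta-learner sees the raw p50s and spreads rather than a collapsed vote, so no information is thrown away.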

The meta-learner is a LightGBM classifier trained on `signal_evaluations`:

| Feature group | Count | Example |
|---|---|---|
| Per-model p50 | 3 | `kronos_base_p50`, `kronos_event_p50`, `chronos2_p50` |
| Per-model spread | 3 | `p90 - p10` per model |
| Regime flag | 1 | trending / ranging / volatile |
| Event flag | 1 | any high-impact event active |
| Asset class | 1 | crypto / forex+commodity / equity |

Target: realized direction (binary) at 1 h / 1 d horizon.

Training scope: per asset class (three meta-models), not per symbol — this avoids overfitting when some symbols have fewer than 500 labelled evaluations.
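A sketch of how the nine-feature row from the table above could be assembled before the LightGBM fit (the function name and integer encodings are assumptions; the actual encoding scheme is not specified here):

```python
# Hypothetical categorical encodings for the two flag features
REGIMES = {"trending": 0, "ranging": 1, "volatile": 2}
ASSET_CLASSES = {"crypto": 0, "forex+commodity": 1, "equity": 2}

def build_feature_row(preds, regime, event_active, asset_class):
    """Assemble one meta-learner feature vector.

    preds: {model_name: {"p10": float, "p50": float, "p90": float}}
    Returns the 9 features: 3 p50s, 3 spreads (p90 - p10), regime flag,
    event flag, asset class.
    """
    order = ["kronos-base", "kronos-event", "chronos-2"]
    p50s = [preds[m]["p50"] for m in order]
    spreads = [preds[m]["p90"] - preds[m]["p10"] for m in order]
    return p50s + spreads + [
        REGIMES[regime],
        1 if event_active else 0,
        ASSET_CLASSES[asset_class],
    ]
```

Rows like this, labelled with realized direction from `signal_evaluations`, would then be fed to one LightGBM classifier per asset class.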

At inference time:

```python
if meta_confidence(prediction) < 0.55:
    # No clear winner — serve the envelope with the widest spread
    # (most uncertain) to communicate low conviction
    return widest_spread_model.quantiles
else:
    # Weighted blend of per-model quantiles using learned meta-weights
    return weighted_avg(all_three, meta_weights)
```
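The weighted blend in the confident branch can be sketched as a per-quantile convex combination (the function name and dict shape are illustrative):

```python
def blend_quantiles(quantiles, weights):
    """Weighted average of per-model quantile dicts.

    quantiles: list of {"p10": ..., "p50": ..., "p90": ...}, one per model
    weights: meta-learner weights for the current conditions, summing to 1
    """
    keys = ("p10", "p50", "p90")
    return {k: sum(w * q[k] for w, q in zip(weights, quantiles)) for k in keys}
```

Blending each quantile separately keeps the output envelope well-ordered as long as the inputs are, since a convex combination preserves p10 <= p50 <= p90.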

Simple averaging hurts when one model is clearly wrong. Consider a FOMC day: Kronos-base misses the volatility, Kronos-event widens envelope correctly, Chronos-2 picks up the cross-asset reaction. Equal weight dilutes the signal from the right model. Stacking fixes this by learning conditional weights.

Beyond the stacked output, Chronos-2 plays a canary role: its zero-shot nature means it adapts quickly to regime shifts (new asset-class behaviour, structural breaks) while Kronos still operates under its pre-training assumptions. Divergence between Kronos and Chronos-2 therefore flags a potential model-update need and feeds into the Phase 6 rolling fine-tune triggers.
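One way the canary signal could be quantified — normalising the p50 gap by recent volatility is an assumption here, not a specified design choice:

```python
def canary_divergence(kronos_p50, chronos_p50, recent_volatility):
    """Kronos vs. Chronos-2 median gap, scaled by recent volatility
    (e.g. ATR) so it is comparable across symbols and regimes.

    A persistently elevated value suggests Kronos's pre-training
    assumptions have drifted and a rolling fine-tune may be due.
    """
    return abs(kronos_p50 - chronos_p50) / recent_volatility
```

A rolling mean of this ratio crossing a threshold would be the natural trigger shape for the Phase 6 fine-tune check.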

From phase-05-evaluation.md:

```
ensemble_sharpe >= best_single_sharpe + 0.1
```

If the ensemble doesn’t beat the best single model by ≥0.1 Sharpe, the extra complexity isn’t justified and we fall back to a single predictor.