
# Ensemble Design

Single-model prediction is fragile to regime shifts. The ensemble combines three token-based forecasters spanning two distinct architectures (two Kronos variants and Chronos-2), plus a lightweight stacking layer trained on our own signal evaluations.

```mermaid
flowchart TB
    OHLCV[OHLCV + context] --> K1[Kronos-base<br/>frozen LoRA]
    OHLCV --> K2[Kronos-event<br/>20-ch conditioning]
    OHLCV --> C2[Chronos-2<br/>Amazon, Apache 2.0]
    K1 --> MP[ml_predictions<br/>model_name='kronos-base']
    K2 --> MP2[ml_predictions<br/>model_name='kronos-event']
    C2 --> MP3[ml_predictions<br/>model_name='chronos-2']
    MP --> META[LightGBM Meta-Learner<br/>per asset class]
    MP2 --> META
    MP3 --> META
    SIG[signal_evaluations<br/>labels] --> META
    META --> OUT[Ensemble p50 + spread]
```
| Model | Strengths | Weaknesses |
|---|---|---|
| Kronos-base | Proven on pre-training corpus; stable baseline | No event awareness; stale distribution |
| Kronos-event | Event-day envelope widening; surprise-directional p50 shift | New; risk of overfitting event examples |
| Chronos-2 | Zero-shot robustness to new regimes; Amazon pre-training corpus | Not fine-tuned to our instruments |

Disagreement = signal. When all three agree on direction, confidence is high. When they diverge, the meta-learner picks up the pattern of which model wins under which conditions.
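A minimal sketch of the unanimity check, using a sign vote over the three per-model p50s (the helper name and tuple return shape are illustrative, not part of the pipeline):

```python
def direction_vote(p50s, last_price):
    """Vote on direction across the three model medians.

    Returns (direction, unanimous): unanimous agreement is treated as
    high confidence; a split vote is deferred to the meta-learner.
    """
    signs = [1 if p > last_price else -1 for p in p50s]
    ups = signs.count(1)
    if ups == 3:
        return "up", True
    if ups == 0:
        return "down", True
    return ("up" if ups >= 2 else "down"), False
```

In practice the meta-learner sees the raw p50s and spreads rather than a collapsed vote, so no information is thrown away.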

The meta-learner is a LightGBM classifier trained on `signal_evaluations`:

| Feature group | Count | Example |
|---|---|---|
| Per-model p50 | 3 | `kronos_base_p50`, `kronos_event_p50`, `chronos2_p50` |
| Per-model spread | 3 | `p90 - p10` per model |
| Regime flag | 1 | trending / ranging / volatile |
| Event flag | 1 | any high-impact event active |
| Asset class | 1 | crypto / forex+commodity / equity |

Target: realized direction (binary) at 1 h / 1 d horizon.

Training scope: per asset class (three meta-models), not per symbol — this avoids overfitting when some symbols have fewer than 500 labelled evaluations.
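A sketch of how the nine-feature row from the table above could be assembled before the LightGBM fit (the function name and integer encodings are assumptions; the actual encoding scheme is not specified here):

```python
# Hypothetical categorical encodings for the two flag features
REGIMES = {"trending": 0, "ranging": 1, "volatile": 2}
ASSET_CLASSES = {"crypto": 0, "forex+commodity": 1, "equity": 2}

def build_feature_row(preds, regime, event_active, asset_class):
    """Assemble one meta-learner feature vector.

    preds: {model_name: {"p10": float, "p50": float, "p90": float}}
    Returns the 9 features: 3 p50s, 3 spreads (p90 - p10), regime flag,
    event flag, asset class.
    """
    order = ["kronos-base", "kronos-event", "chronos-2"]
    p50s = [preds[m]["p50"] for m in order]
    spreads = [preds[m]["p90"] - preds[m]["p10"] for m in order]
    return p50s + spreads + [
        REGIMES[regime],
        1 if event_active else 0,
        ASSET_CLASSES[asset_class],
    ]
```

Rows like this, labelled with realized direction from `signal_evaluations`, would then be fed to one LightGBM classifier per asset class.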

At inference time:

```python
if meta_confidence(prediction) < 0.55:
    # No clear winner — serve the envelope with the widest spread
    # (most uncertain) to communicate low conviction
    return widest_spread_model.quantiles
else:
    # Weighted blend of per-model quantiles using learned meta-weights
    return weighted_avg(all_three, meta_weights)
```
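The weighted blend in the confident branch can be sketched as a per-quantile convex combination (the function name and dict shape are illustrative):

```python
def blend_quantiles(quantiles, weights):
    """Weighted average of per-model quantile dicts.

    quantiles: list of {"p10": ..., "p50": ..., "p90": ...}, one per model
    weights: meta-learner weights for the current conditions, summing to 1
    """
    keys = ("p10", "p50", "p90")
    return {k: sum(w * q[k] for w, q in zip(weights, quantiles)) for k in keys}
```

Blending each quantile separately keeps the output envelope well-ordered as long as the inputs are, since a convex combination preserves p10 <= p50 <= p90.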

Simple averaging hurts when one model is clearly wrong. Consider a FOMC day: Kronos-base misses the volatility, Kronos-event widens envelope correctly, Chronos-2 picks up the cross-asset reaction. Equal weight dilutes the signal from the right model. Stacking fixes this by learning conditional weights.

Beyond the stacked output, Chronos-2 plays a canary role: its zero-shot nature means it adapts quickly to regime shifts (new asset-class behaviour, structural breaks) while Kronos still operates under its pre-training assumptions. Divergence between Kronos and Chronos-2 therefore flags a potential model-update need and feeds into the Phase 6 rolling fine-tune triggers.
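One way the canary signal could be quantified — normalising the p50 gap by recent volatility is an assumption here, not a specified design choice:

```python
def canary_divergence(kronos_p50, chronos_p50, recent_volatility):
    """Kronos vs. Chronos-2 median gap, scaled by recent volatility
    (e.g. ATR) so it is comparable across symbols and regimes.

    A persistently elevated value suggests Kronos's pre-training
    assumptions have drifted and a rolling fine-tune may be due.
    """
    return abs(kronos_p50 - chronos_p50) / recent_volatility
```

A rolling mean of this ratio crossing a threshold would be the natural trigger shape for the Phase 6 fine-tune check.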

From phase-05-evaluation.md:

```
ensemble_sharpe >= best_single_sharpe + 0.1
```

If the ensemble doesn’t beat the best single model by ≥0.1 Sharpe, the extra complexity isn’t justified and we fall back to a single predictor.