# Kronos Model Architecture
Kronos is a token-based autoregressive time-series transformer. Unlike patch-based models (TimesFM, PatchTST) that embed continuous values directly, Kronos quantizes OHLCV candles into discrete tokens — more analogous to a language model for financial series. This enables sampling multiple futures, natural quantile estimation, and clean additive conditioning on exogenous signals.
## End-to-end flow

```mermaid
flowchart LR
    A[OHLCV candles<br/>6 features] -->|z-score per window| B[BSQ Tokenizer<br/>frozen]
    B --> C[Token pairs<br/>s1, s2]
    C --> D[Hierarchical<br/>Embedding]
    T[Temporal features<br/>hour, weekday, month] --> TE[TemporalEmbedding]
    E[Event channels<br/>20 dims] --> EE[EventEmbedding<br/>NEW]
    D --> ADD((+))
    TE --> ADD
    EE --> ADD
    ADD --> F[Transformer<br/>decoder · LoRA-adapted]
    F --> G[DualHead]
    G --> H1[s1 logits]
    G --> H2[s2 logits]
    H1 --> I[Sample · decode · p10/p50/p90]
    H2 --> I
```

## Components

### 1. BSQ Tokenizer (frozen)
Binary Spherical Quantization compresses each 6-feature candle into a pair of discrete tokens (s1, s2). Trained once by the Kronos authors on ~2B candles; we keep it frozen — it’s the model’s “alphabet.”
- Input: `(B, T, 6)` z-score normalised OHLCV
- Output: `(s1_ids, s2_ids)`, each a `(B, T)` long tensor in a fixed vocabulary
- Why not patch embedding? Discrete tokens make sampling clean (temperature, top-k) and let us build quantile envelopes from empirical sample distributions rather than learnt Gaussian heads.
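The per-window z-score normalisation applied before tokenization can be sketched in plain Python. This is a minimal illustration, not the actual pipeline code: the real system applies it per feature across each `(B, T, 6)` window, and the guard for flat windows is an assumption.

```python
import math

def zscore_window(values):
    """Per-window z-score: subtract the window mean, divide by the
    window standard deviation. Hypothetical helper; the real pipeline
    vectorises this per feature over the whole (B, T, 6) window."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against flat windows (assumption)
    return [(v - mean) / std for v in values]
```

After this step every window has zero mean and unit variance, so the tokenizer sees a scale-free shape rather than raw prices.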
### 2. Hierarchical Embedding

Maps (s1, s2) pairs into d_model = 256 dimensions. Shared across timesteps.
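The module's internals aren't spelled out here; one plausible minimal sketch, assuming separate lookup tables for the two token streams summed into d_model (the vocabulary sizes below are placeholders, not Kronos's actual values):

```python
import torch
import torch.nn as nn

class HierarchicalEmbeddingSketch(nn.Module):
    """Illustrative stand-in (not Kronos's actual module): embed s1 and
    s2 tokens in separate tables and sum into a shared d_model space."""

    def __init__(self, s1_vocab=1024, s2_vocab=1024, d_model=256):
        super().__init__()
        self.emb_s1 = nn.Embedding(s1_vocab, d_model)
        self.emb_s2 = nn.Embedding(s2_vocab, d_model)

    def forward(self, s1_ids, s2_ids):
        # s1_ids, s2_ids: (B, T) long tensors -> (B, T, d_model)
        return self.emb_s1(s1_ids) + self.emb_s2(s2_ids)
```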
### 3. Temporal Embedding (existing)

Sinusoidal encodings of {minute, hour, weekday, day_of_month, month} — added to the token embedding.
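As a sketch of why sinusoidal encodings suit calendar features: a (sin, cos) pair keeps cyclic boundaries adjacent, so hour 23 maps next to hour 0 on the unit circle. The formulation below is illustrative, not Kronos's exact encoding.

```python
import math

def cyclic_encode(value, period):
    """Encode a cyclic calendar feature (hour, weekday, month) as a
    (sin, cos) pair so wrap-around neighbours stay close."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```

With a plain integer feature, hour 0 and hour 23 would be 23 units apart; encoded this way they are nearly coincident, while hour 0 and hour 12 land on opposite sides of the circle.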
### 4. Event Embedding (Phase 1–2, new)

A small Linear(20, 256) projection added to the combined embedding the same way temporal features are. The 20 channels cover macro event flags, continuous surprise z-scores, sinusoidal days-until-event, and cross-asset leader returns. See Phase 1 for the full schema.
```python
class EventEmbedding(nn.Module):
    def __init__(self, num_event_channels=20, d_model=256):
        super().__init__()
        self.proj = nn.Linear(num_event_channels, d_model)

    def forward(self, events):
        if events is None:
            return 0  # additive identity — backward compatible
        return self.proj(events.float())
```

### 5. Transformer Decoder
Standard causal decoder. During fine-tuning, LoRA adapters (rank 8) are attached to q_proj and v_proj in each block. Only adapters + event embedding train; the rest is frozen. ~1M trainable parameters out of 102.3M total.
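A minimal sketch of the LoRA mechanism (not the actual adapter implementation used here): the frozen base projection plus a trainable low-rank bypass, zero-initialised so the wrapped layer starts as an exact no-op.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha/r) * B(A(x)),
    with W frozen and only the low-rank A, B matrices trainable."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained projection
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # bypass starts as zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In this scheme, wrapping q_proj and v_proj in each block adds roughly `2 * r * d_model` parameters per wrapped layer, which is how the trainable count stays around 1M.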
### 6. DualHead (frozen)

Two output heads predict s1 and s2 tokens independently given the shared transformer trunk.
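In sketch form, assuming two independent linear projections over the trunk state (vocabulary sizes are placeholders):

```python
import torch
import torch.nn as nn

class DualHeadSketch(nn.Module):
    """Illustrative dual head: two separate projections from the shared
    trunk output to the s1 and s2 token vocabularies."""

    def __init__(self, d_model=256, s1_vocab=1024, s2_vocab=1024):
        super().__init__()
        self.head_s1 = nn.Linear(d_model, s1_vocab)
        self.head_s2 = nn.Linear(d_model, s2_vocab)

    def forward(self, h):
        # h: (B, T, d_model) trunk output -> two logit tensors
        return self.head_s1(h), self.head_s2(h)
```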
## Why additive conditioning?

Kronos already uses additive conditioning for temporal features. Event conditioning reuses the exact same pattern — a few added lines in forward():

```python
x = self.embedding([s1_ids, s2_ids])
if stamp is not None:
    x = x + self.time_emb(stamp)
x = x + self.event_emb(events)  # NEW — returns 0 when events=None
x = self.token_drop(x)
```

This gives us zero regression on non-event days (events=None ⇒ identical output to base Kronos) while letting the model learn event-specific behaviour via LoRA.
## Inference: p10 / p50 / p90 envelopes

For each forecast:

- Sample n=30 continuations from the model with temperature T=1.0.
- Decode each token pair back to (open, high, low, close, volume) via the frozen BSQ tokenizer’s inverse.
- Compute empirical quantiles over the 30 samples at each horizon step.
- Surface the result as a {p10, p50, p90} envelope.
Result: a fan chart where width encodes uncertainty and centre encodes direction.
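The quantile step can be sketched with plain Python: linear interpolation over the sorted samples (the exact interpolation rule used in production is an assumption).

```python
def empirical_quantiles(samples, qs=(0.10, 0.50, 0.90)):
    """Empirical quantiles of a small sample set, e.g. the n=30 sampled
    closes at one horizon step, via linear interpolation."""
    xs = sorted(samples)
    n = len(xs)
    out = {}
    for q in qs:
        pos = q * (n - 1)          # fractional index into sorted samples
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out[f"p{int(q * 100)}"] = xs[lo] * (1 - frac) + xs[hi] * frac
    return out
```

Running this per horizon step over the decoded samples yields the three envelope series that the fan chart plots.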
## Ensemble with Chronos-2

Phase 0 adds Amazon’s Chronos-2 (200M params, Apache 2.0, multivariate ICL via group attention) as a second predictor. Both write to ml_predictions with distinct model_name. A LightGBM meta-learner trained on signal_evaluations weights the two per asset class. See Ensemble Design for details.
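At its simplest, the per-asset-class weighting amounts to a convex blend of the two predictors' outputs. The weights and asset-class names below are made-up placeholders standing in for what the LightGBM meta-learner would produce:

```python
# Hypothetical per-asset-class weights; in the real system these come
# from the LightGBM meta-learner trained on signal_evaluations.
WEIGHTS = {"crypto": 0.65, "equity": 0.45}

def blend_p50(kronos_p50, chronos_p50, asset_class):
    """Convex blend of the two models' median forecasts."""
    w = WEIGHTS.get(asset_class, 0.5)  # fall back to an even split
    return [w * k + (1 - w) * c for k, c in zip(kronos_p50, chronos_p50)]
```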
## Hardware footprint
Section titled “Hardware footprint”| Resource | Value |
|---|---|
| Parameters | 102.3 M (Kronos-base) |
| VRAM (inference, bf16) | ~240 MB |
| VRAM (LoRA training, batch 32) | ~4.5 GB |
| GPU | RTX 4060, 8 GB |
| Inference latency (cached) | ~50 ms |
| Inference latency (cold) | ~800 ms |
## Known limitations

- Context window: 512 tokens — ~21 days at 1 h, ~2 y at 1 d.
- RoPE cache issue — inference semaphore pinned to 1 to avoid numerical drift at high concurrency.
- Distribution drift — pre-trained on static historical data; addressed via Phase 6 rolling fine-tune.
- No multi-modal fusion — text/news/macro not consumed natively; ensemble + event encoder are the current stopgaps.