# Kronos Model Architecture
Kronos is a token-based autoregressive time-series transformer. Unlike patch-based models (TimesFM, PatchTST) that embed continuous values directly, Kronos quantizes OHLCV candles into discrete tokens — more analogous to a language model for financial series. This enables sampling multiple futures, natural quantile estimation, and clean additive conditioning on exogenous signals.
## End-to-end flow

```mermaid
flowchart LR
    A[OHLCV candles<br/>6 features] -->|z-score per window| B[BSQ Tokenizer<br/>frozen]
    B --> C[Token pairs<br/>s1, s2]
    C --> D[Hierarchical<br/>Embedding]
    T[Temporal features<br/>hour, weekday, month] --> TE[TemporalEmbedding]
    E[Event channels<br/>20 dims] --> EE[EventEmbedding<br/>NEW]
    D --> ADD((+))
    TE --> ADD
    EE --> ADD
    ADD --> F[Transformer<br/>decoder · LoRA-adapted]
    F --> G[DualHead]
    G --> H1[s1 logits]
    G --> H2[s2 logits]
    H1 --> I[Sample · decode · p10/p50/p90]
    H2 --> I
```

## Components

### 1. BSQ Tokenizer (frozen)
Binary Spherical Quantization compresses each 6-feature candle into a pair of discrete tokens (s1, s2). Trained once by the Kronos authors on ~2B candles; we keep it frozen — it’s the model’s “alphabet.”
- Input: `(B, T, 6)` z-score normalised OHLCV
- Output: `(s1_ids, s2_ids)`, each a `(B, T)` long tensor in a fixed vocabulary
- Why not patch embedding? Discrete tokens make sampling clean (temperature, top-k) and let us build quantile envelopes from empirical sample distributions rather than learnt Gaussian heads.
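The per-window z-score normalisation applied before tokenization can be sketched in plain Python. This is a minimal illustration, not the actual pipeline code: the real system applies it per feature across each `(B, T, 6)` window, and the guard for flat windows is an assumption.

```python
import math

def zscore_window(values):
    """Per-window z-score: subtract the window mean, divide by the
    window standard deviation. Hypothetical helper; the real pipeline
    vectorises this per feature over the whole (B, T, 6) window."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against flat windows (assumption)
    return [(v - mean) / std for v in values]
```

After this step every window has zero mean and unit variance, so the tokenizer sees a scale-free shape rather than raw prices.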
### 2. Hierarchical Embedding

Maps (s1, s2) pairs into d_model = 256 dimensions. Shared across timesteps.
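The module's internals aren't spelled out here; one plausible minimal sketch, assuming separate lookup tables for the two token streams summed into d_model (the vocabulary sizes below are placeholders, not Kronos's actual values):

```python
import torch
import torch.nn as nn

class HierarchicalEmbeddingSketch(nn.Module):
    """Illustrative stand-in (not Kronos's actual module): embed s1 and
    s2 tokens in separate tables and sum into a shared d_model space."""

    def __init__(self, s1_vocab=1024, s2_vocab=1024, d_model=256):
        super().__init__()
        self.emb_s1 = nn.Embedding(s1_vocab, d_model)
        self.emb_s2 = nn.Embedding(s2_vocab, d_model)

    def forward(self, s1_ids, s2_ids):
        # s1_ids, s2_ids: (B, T) long tensors -> (B, T, d_model)
        return self.emb_s1(s1_ids) + self.emb_s2(s2_ids)
```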
### 3. Temporal Embedding (existing)

Sinusoidal encodings of {minute, hour, weekday, day_of_month, month} — added to the token embedding.
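As a sketch of why sinusoidal encodings suit calendar features: a (sin, cos) pair keeps cyclic boundaries adjacent, so hour 23 maps next to hour 0 on the unit circle. The formulation below is illustrative, not Kronos's exact encoding.

```python
import math

def cyclic_encode(value, period):
    """Encode a cyclic calendar feature (hour, weekday, month) as a
    (sin, cos) pair so wrap-around neighbours stay close."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```

With a plain integer feature, hour 0 and hour 23 would be 23 units apart; encoded this way they are nearly coincident, while hour 0 and hour 12 land on opposite sides of the circle.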
### 4. Event Embedding (Phase 1–2, new)

A small Linear(20, 256) projection added to the combined embedding the same way temporal features are. The 20 channels cover macro event flags, continuous surprise z-scores, sinusoidal days-until-event, and cross-asset leader returns. See Phase 1 for the full schema.
```python
class EventEmbedding(nn.Module):
    def __init__(self, num_event_channels=20, d_model=256):
        super().__init__()
        self.proj = nn.Linear(num_event_channels, d_model)

    def forward(self, events):
        if events is None:
            return 0  # additive identity — backward compatible
        return self.proj(events.float())
```

### 5. Transformer Decoder
Standard causal decoder. During fine-tuning, LoRA adapters (rank 8) are attached to q_proj and v_proj in each block. Only adapters + event embedding train; the rest is frozen. ~1M trainable parameters out of 102.3M total.
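A minimal sketch of the LoRA mechanism (not the actual adapter implementation used here): the frozen base projection plus a trainable low-rank bypass, zero-initialised so the wrapped layer starts as an exact no-op.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha/r) * B(A(x)),
    with W frozen and only the low-rank A, B matrices trainable."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained projection
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # bypass starts as zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In this scheme, wrapping q_proj and v_proj in each block adds roughly `2 * r * d_model` parameters per wrapped layer, which is how the trainable count stays around 1M.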
### 6. DualHead (frozen)

Two output heads predict s1 and s2 tokens independently given the shared transformer trunk.
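In sketch form, assuming two independent linear projections over the trunk state (vocabulary sizes are placeholders):

```python
import torch
import torch.nn as nn

class DualHeadSketch(nn.Module):
    """Illustrative dual head: two separate projections from the shared
    trunk output to the s1 and s2 token vocabularies."""

    def __init__(self, d_model=256, s1_vocab=1024, s2_vocab=1024):
        super().__init__()
        self.head_s1 = nn.Linear(d_model, s1_vocab)
        self.head_s2 = nn.Linear(d_model, s2_vocab)

    def forward(self, h):
        # h: (B, T, d_model) trunk output -> two logit tensors
        return self.head_s1(h), self.head_s2(h)
```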
## Why additive conditioning?

Kronos already uses additive conditioning for temporal features. Event conditioning reuses the exact same pattern — a few added lines in forward():

```python
x = self.embedding([s1_ids, s2_ids])
if stamp is not None:
    x = x + self.time_emb(stamp)
x = x + self.event_emb(events)  # NEW — returns 0 when events=None
x = self.token_drop(x)
```

This gives us zero regression on non-event days (events=None ⇒ identical output to base Kronos) while letting the model learn event-specific behaviour via LoRA.
## Inference: p10 / p50 / p90 envelopes

For each forecast:

- Sample n=30 continuations from the model with temperature T=1.0.
- Decode each token pair back to (open, high, low, close, volume) via the frozen BSQ tokenizer’s inverse.
- Compute empirical quantiles over the 30 samples at each horizon step.
- Surface the result as a {p10, p50, p90} envelope.
Result: a fan chart where width encodes uncertainty and centre encodes direction.
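The quantile step can be sketched with plain Python: linear interpolation over the sorted samples (the exact interpolation rule used in production is an assumption).

```python
def empirical_quantiles(samples, qs=(0.10, 0.50, 0.90)):
    """Empirical quantiles of a small sample set, e.g. the n=30 sampled
    closes at one horizon step, via linear interpolation."""
    xs = sorted(samples)
    n = len(xs)
    out = {}
    for q in qs:
        pos = q * (n - 1)          # fractional index into sorted samples
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out[f"p{int(q * 100)}"] = xs[lo] * (1 - frac) + xs[hi] * frac
    return out
```

Running this per horizon step over the decoded samples yields the three envelope series that the fan chart plots.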
## Ensemble with Chronos-2

Phase 0 adds Amazon’s Chronos-2 (200M params, Apache 2.0, multivariate ICL via group attention) as a second predictor. Both write to ml_predictions with distinct model_name. A LightGBM meta-learner trained on signal_evaluations weights the two per asset class. See Ensemble Design for details.
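At its simplest, the per-asset-class weighting amounts to a convex blend of the two predictors' outputs. The weights and asset-class names below are made-up placeholders standing in for what the LightGBM meta-learner would produce:

```python
# Hypothetical per-asset-class weights; in the real system these come
# from the LightGBM meta-learner trained on signal_evaluations.
WEIGHTS = {"crypto": 0.65, "equity": 0.45}

def blend_p50(kronos_p50, chronos_p50, asset_class):
    """Convex blend of the two models' median forecasts."""
    w = WEIGHTS.get(asset_class, 0.5)  # fall back to an even split
    return [w * k + (1 - w) * c for k, c in zip(kronos_p50, chronos_p50)]
```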
## Hardware footprint
Section titled “Hardware footprint”| Resource | Value |
|---|---|
| Parameters | 102.3 M (Kronos-base) |
| VRAM (inference, bf16) | ~240 MB |
| VRAM (LoRA training, batch 32) | ~4.5 GB |
| GPU | RTX 4060, 8 GB |
| Inference latency (cached) | ~50 ms |
| Inference latency (cold) | ~800 ms |
## Known limitations

- Context window: 512 tokens — ~21 days at 1 h, ~2 y at 1 d.
- RoPE cache issue — inference semaphore pinned to 1 to avoid numerical drift at high concurrency.
- Distribution drift — pre-trained on static historical data; addressed via Phase 6 rolling fine-tune.
- No multi-modal fusion — text/news/macro not consumed natively; ensemble + event encoder are the current stopgaps.