
# Kronos Model Architecture

Kronos is a token-based autoregressive time-series transformer. Unlike patch-based models (TimesFM, PatchTST) that embed continuous values directly, Kronos quantizes OHLCV candles into discrete tokens — more analogous to a language model for financial series. This enables sampling multiple futures, natural quantile estimation, and clean additive conditioning on exogenous signals.

```mermaid
flowchart LR
    A[OHLCV candles<br/>6 features] -->|z-score per window| B[BSQ Tokenizer<br/>frozen]
    B --> C[Token pairs<br/>s1, s2]
    C --> D[Hierarchical<br/>Embedding]
    T[Temporal features<br/>hour, weekday, month] --> TE[TemporalEmbedding]
    E[Event channels<br/>20 dims] --> EE[EventEmbedding<br/>NEW]
    D --> ADD((+))
    TE --> ADD
    EE --> ADD
    ADD --> F[Transformer<br/>decoder · LoRA-adapted]
    F --> G[DualHead]
    G --> H1[s1 logits]
    G --> H2[s2 logits]
    H1 --> I[Sample · decode · p10/p50/p90]
    H2 --> I
```

## BSQ Tokenizer (frozen)

Binary Spherical Quantization (BSQ) compresses each 6-feature candle into a pair of discrete tokens (s1, s2). The tokenizer was trained once by the Kronos authors on ~2B candles and is kept frozen here: it is the model's "alphabet."

- **Input:** `(B, T, 6)` z-score-normalised OHLCV
- **Output:** `(s1_ids, s2_ids)`, each a `(B, T)` `LongTensor` over a fixed vocabulary
- **Why not patch embedding?** Discrete tokens make sampling clean (temperature, top-k) and let us build quantile envelopes from empirical sample distributions rather than learnt Gaussian heads.
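The per-window z-score normalisation that feeds the tokenizer can be sketched as follows. This is a minimal illustration, not the Kronos preprocessing code; details such as the epsilon and the std estimator are assumptions.

```python
import torch

def zscore_window(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalise each of the 6 OHLCV features within its own window.

    x: (B, T, 6) raw candle features.
    Returns the same shape with per-window zero mean and unit-ish std.
    """
    mean = x.mean(dim=1, keepdim=True)   # (B, 1, 6)
    std = x.std(dim=1, keepdim=True)     # (B, 1, 6)
    return (x - mean) / (std + eps)      # eps guards flat (constant) windows
```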

## Hierarchical Embedding

Maps each (s1, s2) token pair into d_model = 256 dimensions. The embedding weights are shared across timesteps.

## TemporalEmbedding

Sinusoidal encodings of {minute, hour, weekday, day_of_month, month}, added to the token embedding.
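One common way to build such sinusoidal calendar features is a sin/cos pair per cyclic quantity, so that period boundaries (hour 23 → 0, December → January) wrap smoothly. A sketch, not necessarily the exact Kronos formulation:

```python
import math
import torch

def cyclic_encode(value: torch.Tensor, period: float) -> torch.Tensor:
    """Encode a cyclic calendar feature (e.g. hour with period 24)
    as a (sin, cos) pair appended on a new trailing dimension."""
    angle = 2 * math.pi * value / period
    return torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)
```

With this encoding, hour 0 and hour 24 map to the same point, which a raw integer feature would not.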

## EventEmbedding (new)

A small Linear(20, 256) projection added to the combined embedding in the same way the temporal features are. The 20 channels cover macro event flags, continuous surprise z-scores, sinusoidal days-until-event encodings, and cross-asset leader returns. See Phase 1 for the full schema.

```python
import torch.nn as nn

class EventEmbedding(nn.Module):
    def __init__(self, num_event_channels: int = 20, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(num_event_channels, d_model)

    def forward(self, events):
        if events is None:
            return 0  # additive identity: backward compatible with base Kronos
        return self.proj(events.float())
```

## Transformer decoder

Standard causal decoder. During fine-tuning, LoRA adapters (rank 8) are attached to q_proj and v_proj in each block. Only the adapters and the event embedding are trained; everything else stays frozen, leaving ~1M trainable parameters out of 102.3M total.
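The LoRA pattern itself is small enough to sketch from scratch. The real fine-tune presumably uses a library such as PEFT; `LoRALinear` below is illustrative only, showing why the trainable count stays tiny: only the rank-8 factors A and B receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A is small-random, B is zero, so the adapter starts as a no-op.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Because B starts at zero, a freshly wrapped layer produces exactly the base model's output, mirroring the zero-regression property of the event embedding.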

## DualHead

Two output heads predict the s1 and s2 tokens independently on top of the shared transformer trunk.

## Additive event conditioning

Kronos already uses additive conditioning for temporal features. Event conditioning reuses exactly the same pattern, amounting to three lines in forward():

```python
x = self.embedding([s1_ids, s2_ids])
if stamp is not None:
    x = x + self.time_emb(stamp)
x = x + self.event_emb(events)  # NEW: returns 0 when events is None
x = self.token_drop(x)
```

This gives us zero regression on non-event days (events=None ⇒ identical output to base Kronos) while letting the model learn event-specific behaviour via LoRA.
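The backward-compatibility claim is easy to check numerically: the integer 0 that EventEmbedding returns for events=None is a true additive identity, so the activations are bit-identical to base Kronos.

```python
import torch

x = torch.randn(2, 16, 256)  # combined token + temporal embedding
event_term = 0               # what EventEmbedding.forward(None) returns
assert torch.equal(x + event_term, x)  # non-event path is exactly base Kronos
```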

## Quantile sampling

For each forecast:

  1. Sample n=30 continuations from the model with temperature T=1.0.
  2. Decode each token pair back to (open, high, low, close, volume) via the frozen BSQ tokenizer’s inverse.
  3. Compute empirical quantiles over the 30 samples at each horizon step.
  4. Surface as {p10, p50, p90} envelope.

Result: a fan chart where width encodes uncertainty and centre encodes direction.
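Step 3 is a plain empirical-quantile computation over the sample axis. A sketch, assuming the decoded samples arrive as an (n_samples, horizon) array:

```python
import numpy as np

def quantile_envelope(samples: np.ndarray) -> dict:
    """samples: (n_samples, horizon) decoded values, one row per sampled future.
    Returns the p10/p50/p90 fan-chart envelope per horizon step."""
    p10, p50, p90 = np.quantile(samples, [0.10, 0.50, 0.90], axis=0)
    return {"p10": p10, "p50": p50, "p90": p90}
```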

## Ensemble

Phase 0 adds Amazon's Chronos-2 (200M params, Apache 2.0, multivariate in-context learning via group attention) as a second predictor. Both write to ml_predictions under distinct model_name values. A LightGBM meta-learner trained on signal_evaluations weights the two per asset class. See Ensemble Design for details.

| Resource | Value |
| --- | --- |
| Parameters | 102.3 M (Kronos-base) |
| VRAM (inference, bf16) | ~240 MB |
| VRAM (LoRA training, batch 32) | ~4.5 GB |
| GPU | RTX 4060, 8 GB |
| Inference latency (cached) | ~50 ms |
| Inference latency (cold) | ~800 ms |
- **Context window:** 512 tokens (~21 days at 1 h bars, ~2 y at 1 d bars).
- **RoPE cache issue:** the inference semaphore is pinned to 1 to avoid numerical drift at high concurrency.
- **Distribution drift:** pre-trained on static historical data; addressed by the Phase 6 rolling fine-tune.
- **No multi-modal fusion:** text/news/macro are not consumed natively; the ensemble and the event encoder are the current stopgaps.