
Phase 6 · Rolling Fine-Tune

Priority: Medium · Status: Pending · Depends on: Phase 3 (trained event-conditioned model)

  • Kronos-base is pre-trained on static historical data
  • Distribution drift is real: 2021–2022 crypto bull ≠ 2024–2025 regime ≠ 2026 regime
  • Official Kronos fine-tune scripts released Q4 2025 (standard full-param recipe)
  • Existing kronos-service/finetune/ has base infra (shared.py, export_training_data.py)

Automate monthly rolling fine-tune on recent 2Y window per asset class. Combats distribution drift without touching architecture. Produces new LoRA checkpoints that slot into existing event-conditioned model.

  • Rolling 2Y training window per asset class (crypto / forex+commodity / equity)
  • Fine-tune the LoRA adapter from Phase 3 (not the frozen base)
  • Save per-asset-class checkpoints: lora_crypto_2026-04.pt, lora_equity_2026-04.pt, etc.
  • Predictor loads correct checkpoint based on target symbol’s asset class
  • Fallback: if no class-specific checkpoint, use the pooled one from Phase 3
  • Runs via Railway cron on the RTX 4060 service
  • Monthly cadence initially; drop to quarterly after 3 stable months
  • Completes in <8 hours per asset class (serialized to avoid GPU contention)
  • Preserves old checkpoints (rolling 6-month retention for rollback)

Drop from monthly to quarterly when:

```
ROLLING_3_MONTH_SHARPE_DELTA < 0.05
# i.e. last 3 monthly retrains showed < 0.05 Sharpe improvement over prior version
# → drift is slow, monthly is overkill, switch to quarterly
```
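The cadence rule above can be sketched as a small helper. Names are illustrative, not from the codebase; the per-retrain Sharpe deltas are assumed to be tracked somewhere such as each checkpoint's metadata.json.

```python
def next_cadence(sharpe_deltas: list[float], threshold: float = 0.05) -> str:
    """Decide retrain cadence from validation-Sharpe improvements.

    sharpe_deltas[i] is the Sharpe improvement of retrain i over the
    prior checkpoint (hypothetical bookkeeping).
    """
    recent = sharpe_deltas[-3:]
    # Switch to quarterly only once three consecutive monthly retrains
    # each improved Sharpe by less than the threshold.
    if len(recent) == 3 and all(d < threshold for d in recent):
        return "quarterly"
    return "monthly"
```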
```
kronos-service/checkpoints/
├── base/                        # Phase 3 output (pooled, all asset classes)
│   ├── lora_adapter.pt
│   └── event_emb.pt
└── rolling/
    ├── crypto/
    │   ├── 2026-04-01/
    │   │   ├── lora_adapter.pt
    │   │   ├── event_emb.pt
    │   │   └── metadata.json    # {train_start, train_end, val_sharpe, val_mae}
    │   └── 2026-03-01/
    ├── forex_commodity/
    └── equity/
```

Per-class differences from Phase 3:

  • Data: last 2Y only (rolling window)
  • Initial weights: current in-production LoRA (warm start, not random)
  • Epochs: 5 (fewer — just adaptation, not from-scratch learning)
  • LR: 5e-5 (half of Phase 3 — finer adaptation)
  • All other params: same as Phase 3
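Expressed as a config delta over the Phase 3 hyperparameters, the differences above amount to a few overrides (a sketch; the `phase3_cfg` keys and the function name are assumptions):

```python
def rolling_config(phase3_cfg: dict, asset_class: str, prev_ckpt_path: str) -> dict:
    """Build the per-class rolling fine-tune config from the Phase 3 config."""
    cfg = dict(phase3_cfg)               # all other params: same as Phase 3
    cfg.update({
        "asset_class": asset_class,
        "data_window_years": 2,          # rolling 2Y window
        "epochs": 5,                     # adaptation only, not from-scratch learning
        "lr": phase3_cfg["lr"] / 2,      # e.g. 1e-4 -> 5e-5
        "init_weights": prev_ckpt_path,  # warm start from in-production LoRA
    })
    return cfg
```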
Checkpoint selection on the predictor side:

```python
import json
from pathlib import Path

def get_lora_checkpoint(symbol: str) -> Path:
    asset_class = classify_asset(symbol)  # "crypto" | "forex_commodity" | "equity"
    class_dir = Path("checkpoints/rolling") / asset_class
    base = Path("checkpoints/base")
    latest = find_latest_subdir(class_dir)
    if latest is None:
        return base / "lora_adapter.pt"  # no class-specific checkpoint yet, fallback
    # Optional: guard against regression vs the pooled base checkpoint
    latest_mae = json.loads((latest / "metadata.json").read_text())["val_mae"]
    base_mae = json.loads((base / "metadata.json").read_text())["val_mae"]
    if latest_mae > base_mae * 1.1:
        return base / "lora_adapter.pt"  # rolling checkpoint regressed >10%, use base
    return latest / "lora_adapter.pt"
```
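The `classify_asset` helper (step 2 below) might look like the following. The symbol tables are hypothetical placeholders; the real mapping should come from the trading system's instrument metadata rather than suffix heuristics.

```python
# Hypothetical symbol tables -- replace with the system's instrument metadata.
CRYPTO_SUFFIXES = ("USDT", "USDC")
FOREX_COMMODITY = {"EURUSD", "GBPUSD", "USDJPY", "XAUUSD", "XAGUSD"}

def classify_asset(symbol: str) -> str:
    """Map a symbol to "crypto" | "forex_commodity" | "equity"."""
    s = symbol.upper()
    if s in FOREX_COMMODITY:
        return "forex_commodity"
    if s.endswith(CRYPTO_SUFFIXES):
        return "crypto"
    return "equity"
```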
  1. Create scripts/kronos-rolling-finetune.py (wraps Phase 3 training with rolling data)
  2. Add asset-class classifier helper (crypto / forex_commodity / equity)
  3. Implement warm-start from previous checkpoint (not random init)
  4. Implement checkpoint retention (keep last 6, delete older)
  5. Modify kronos-service/predictor.py to select LoRA checkpoint per symbol
  6. Add regression guard — validate new checkpoint before marking active
  7. Add Railway cron service: kronos-rolling-finetune (monthly, 1st of month 06:00 UTC)
  8. Implement cadence auto-adjustment (compare 3 monthly deltas → switch to quarterly if stable)
  9. Add evaluation hook — after training, run Phase 5 eval on new checkpoint
  10. Add alerting — notify if val_mae regresses >10% vs prior month
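The retention step (item 4 above) reduces to a sweep over the date-named checkpoint directories; a minimal sketch:

```python
import shutil
from pathlib import Path

def prune_checkpoints(class_dir: str, keep: int = 6) -> None:
    """Keep the `keep` newest checkpoint dirs under class_dir, delete the rest."""
    root = Path(class_dir)
    if not root.is_dir():
        return
    subdirs = sorted((p for p in root.iterdir() if p.is_dir()),
                     key=lambda p: p.name, reverse=True)  # newest first
    for old in subdirs[keep:]:
        shutil.rmtree(old)  # beyond the 6-month retention window
```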
  • Create: scripts/kronos-rolling-finetune.py
  • Modify: kronos-service/predictor.py — checkpoint selection logic
  • Modify: kronos-service/finetune/train_event_conditioned.py — accept --warm-start flag
  • Create: kronos-service/checkpoints/rolling/ directory structure
| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Monthly retrain overfits recent regime | Medium | Keep base checkpoint as fallback; regression guard |
| Warm-start drifts model too far from base | Medium | LR halved + only 5 epochs; strict val_mae threshold |
| GPU contention with live inference | Low | Schedule during low-traffic window (06:00 UTC Sunday); acquire inference semaphore |
| Checkpoint storage bloat | Low | 6-month retention ≈ 300 MB per class × 3 classes = <1 GB |
  • Monthly cron completes without manual intervention for 3 consecutive months
  • Per-class checkpoint val_mae ≤ pooled base val_mae
  • Inference correctly selects latest per-class checkpoint
  • Regression guard fires if new checkpoint is worse (verified in test)
  • Cadence auto-switches to quarterly after 3 stable months
  • Old checkpoints retained for rollback (last 6 months)