
Phase 6 · Rolling Fine-Tune

Priority: Medium · Status: Pending · Depends on: Phase 3 (trained event-conditioned model)

  • Kronos-base is pre-trained on static historical data
  • Distribution drift is real: 2021–2022 crypto bull ≠ 2024–2025 regime ≠ 2026 regime
  • Official Kronos fine-tune scripts released Q4 2025 (standard full-param recipe)
  • Existing kronos-service/finetune/ has base infra (shared.py, export_training_data.py)

Automate monthly rolling fine-tune on recent 2Y window per asset class. Combats distribution drift without touching architecture. Produces new LoRA checkpoints that slot into existing event-conditioned model.

  • Rolling 2Y training window per asset class (crypto / forex+commodity / equity)
  • Fine-tune the LoRA adapter from Phase 3 (not the frozen base)
  • Save per-asset-class checkpoints: lora_crypto_2026-04.pt, lora_equity_2026-04.pt, etc.
  • Predictor loads correct checkpoint based on target symbol’s asset class
  • Fallback: if no class-specific checkpoint, use the pooled one from Phase 3
  • Runs via Railway cron on the RTX 4060 service
  • Monthly cadence initially; drop to quarterly after 3 stable months
  • Completes in <8 hours per asset class (serialized to avoid GPU contention)
  • Preserves old checkpoints (rolling 6-month retention for rollback)

Drop from monthly to quarterly when:

```
ROLLING_3_MONTH_SHARPE_DELTA < 0.05
# i.e. last 3 monthly retrains showed < 0.05 Sharpe improvement over prior version
# → drift is slow, monthly is overkill, switch to quarterly
```
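The cadence rule above can be sketched as a small helper. Names are illustrative, not from the codebase; the per-retrain Sharpe deltas are assumed to be tracked somewhere such as each checkpoint's metadata.json.

```python
def next_cadence(sharpe_deltas: list[float], threshold: float = 0.05) -> str:
    """Decide retrain cadence from validation-Sharpe improvements.

    sharpe_deltas[i] is the Sharpe improvement of retrain i over the
    prior checkpoint (hypothetical bookkeeping).
    """
    recent = sharpe_deltas[-3:]
    # Switch to quarterly only once three consecutive monthly retrains
    # each improved Sharpe by less than the threshold.
    if len(recent) == 3 and all(d < threshold for d in recent):
        return "quarterly"
    return "monthly"
```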
```
kronos-service/checkpoints/
├── base/                        # Phase 3 output (pooled, all asset classes)
│   ├── lora_adapter.pt
│   └── event_emb.pt
└── rolling/
    ├── crypto/
    │   ├── 2026-04-01/
    │   │   ├── lora_adapter.pt
    │   │   ├── event_emb.pt
    │   │   └── metadata.json    # {train_start, train_end, val_sharpe, val_mae}
    │   └── 2026-03-01/
    ├── forex_commodity/
    └── equity/
```

Per-class differences from Phase 3:

  • Data: last 2Y only (rolling window)
  • Initial weights: current in-production LoRA (warm start, not random)
  • Epochs: 5 (fewer — just adaptation, not from-scratch learning)
  • LR: 5e-5 (half of Phase 3 — finer adaptation)
  • All other params: same as Phase 3
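Expressed as a config delta over the Phase 3 hyperparameters, the differences above amount to a few overrides (a sketch; the `phase3_cfg` keys and the function name are assumptions):

```python
def rolling_config(phase3_cfg: dict, asset_class: str, prev_ckpt_path: str) -> dict:
    """Build the per-class rolling fine-tune config from the Phase 3 config."""
    cfg = dict(phase3_cfg)               # all other params: same as Phase 3
    cfg.update({
        "asset_class": asset_class,
        "data_window_years": 2,          # rolling 2Y window
        "epochs": 5,                     # adaptation only, not from-scratch learning
        "lr": phase3_cfg["lr"] / 2,      # e.g. 1e-4 -> 5e-5
        "init_weights": prev_ckpt_path,  # warm start from in-production LoRA
    })
    return cfg
```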
Checkpoint selection on the predictor side:

```python
import json
from pathlib import Path

def get_lora_checkpoint(symbol: str) -> Path:
    asset_class = classify_asset(symbol)  # "crypto" | "forex_commodity" | "equity"
    class_dir = Path("checkpoints/rolling") / asset_class
    base = Path("checkpoints/base")
    latest = find_latest_subdir(class_dir)
    if latest is None:
        return base / "lora_adapter.pt"  # no class-specific checkpoint yet, fallback
    # Optional: guard against regression vs the pooled base checkpoint
    latest_mae = json.loads((latest / "metadata.json").read_text())["val_mae"]
    base_mae = json.loads((base / "metadata.json").read_text())["val_mae"]
    if latest_mae > base_mae * 1.1:
        return base / "lora_adapter.pt"  # rolling checkpoint regressed >10%, use base
    return latest / "lora_adapter.pt"
```
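The `classify_asset` helper (step 2 below) might look like the following. The symbol tables are hypothetical placeholders; the real mapping should come from the trading system's instrument metadata rather than suffix heuristics.

```python
# Hypothetical symbol tables -- replace with the system's instrument metadata.
CRYPTO_SUFFIXES = ("USDT", "USDC")
FOREX_COMMODITY = {"EURUSD", "GBPUSD", "USDJPY", "XAUUSD", "XAGUSD"}

def classify_asset(symbol: str) -> str:
    """Map a symbol to "crypto" | "forex_commodity" | "equity"."""
    s = symbol.upper()
    if s in FOREX_COMMODITY:
        return "forex_commodity"
    if s.endswith(CRYPTO_SUFFIXES):
        return "crypto"
    return "equity"
```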
  1. Create scripts/kronos-rolling-finetune.py (wraps Phase 3 training with rolling data)
  2. Add asset-class classifier helper (crypto / forex_commodity / equity)
  3. Implement warm-start from previous checkpoint (not random init)
  4. Implement checkpoint retention (keep last 6, delete older)
  5. Modify kronos-service/predictor.py to select LoRA checkpoint per symbol
  6. Add regression guard — validate new checkpoint before marking active
  7. Add Railway cron service: kronos-rolling-finetune (monthly, 1st of month 06:00 UTC)
  8. Implement cadence auto-adjustment (compare 3 monthly deltas → switch to quarterly if stable)
  9. Add evaluation hook — after training, run Phase 5 eval on new checkpoint
  10. Add alerting — notify if val_mae regresses >10% vs prior month
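The retention step (item 4 above) reduces to a sweep over the date-named checkpoint directories; a minimal sketch:

```python
import shutil
from pathlib import Path

def prune_checkpoints(class_dir: str, keep: int = 6) -> None:
    """Keep the `keep` newest checkpoint dirs under class_dir, delete the rest."""
    root = Path(class_dir)
    if not root.is_dir():
        return
    subdirs = sorted((p for p in root.iterdir() if p.is_dir()),
                     key=lambda p: p.name, reverse=True)  # newest first
    for old in subdirs[keep:]:
        shutil.rmtree(old)  # beyond the 6-month retention window
```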
  • Create: scripts/kronos-rolling-finetune.py
  • Modify: kronos-service/predictor.py — checkpoint selection logic
  • Modify: kronos-service/finetune/train_event_conditioned.py — accept --warm-start flag
  • Create: kronos-service/checkpoints/rolling/ directory structure
| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Monthly retrain overfits recent regime | Medium | Keep base checkpoint as fallback; regression guard |
| Warm-start drifts model too far from base | Medium | LR halved + only 5 epochs; strict val_mae threshold |
| GPU contention with live inference | Low | Schedule during low-traffic window (06:00 UTC Sunday); acquire inference semaphore |
| Checkpoint storage bloat | Low | 6-month retention ≈ 300 MB per class × 3 classes = <1 GB |
  • Monthly cron completes without manual intervention for 3 consecutive months
  • Per-class checkpoint val_mae ≤ pooled base val_mae
  • Inference correctly selects latest per-class checkpoint
  • Regression guard fires if new checkpoint is worse (verified in test)
  • Cadence auto-switches to quarterly after 3 stable months
  • Old checkpoints retained for rollback (last 6 months)