Phase 6 · Rolling Fine-Tune
Priority: Medium
Status: Pending
Depends on: Phase 3 (trained event-conditioned model)
Context
- Kronos-base is pre-trained on static historical data
- Distribution drift is real: 2021–2022 crypto bull ≠ 2024–2025 regime ≠ 2026 regime
- Official Kronos fine-tune scripts released Q4 2025 (standard full-param recipe)
- Existing `kronos-service/finetune/` has base infra (`shared.py`, `export_training_data.py`)
Overview
Automate a monthly rolling fine-tune on the most recent 2Y window for each asset class. This combats distribution drift without touching the architecture and produces new LoRA checkpoints that slot into the existing event-conditioned model.
Requirements
Functional
- Rolling 2Y training window per asset class (crypto / forex+commodity / equity)
- Fine-tune the LoRA adapter from Phase 3 (not the frozen base)
- Save per-asset-class checkpoints: `lora_crypto_2026-04.pt`, `lora_equity_2026-04.pt`, etc.
- Predictor loads the correct checkpoint based on the target symbol’s asset class
- Fallback: if no class-specific checkpoint, use the pooled one from Phase 3
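The routing requirement above depends on mapping each symbol to one of the three asset classes. A minimal sketch of such a helper follows, assuming a plain lookup table; the symbol sets shown are hypothetical placeholders (the plan does not specify the symbol universe or its naming scheme), and raising on unknown symbols is likewise an assumption.

```python
# Hypothetical symbol tables: the real universe and naming scheme are not
# specified in this plan, so these entries are placeholders only.
CRYPTO = {"BTCUSDT", "ETHUSDT"}
FOREX_COMMODITY = {"EURUSD", "XAUUSD"}
EQUITY = {"AAPL", "MSFT"}

_CLASS_BY_SYMBOL = {
    **{s: "crypto" for s in CRYPTO},
    **{s: "forex_commodity" for s in FOREX_COMMODITY},
    **{s: "equity" for s in EQUITY},
}


def classify_asset(symbol: str) -> str:
    """Map a symbol to "crypto", "forex_commodity", or "equity"."""
    try:
        return _CLASS_BY_SYMBOL[symbol.upper()]
    except KeyError:
        # Assumption: unknown symbols are an error rather than a silent default.
        raise ValueError(f"unknown symbol: {symbol!r}") from None
```

A caller that prefers the pooled Phase 3 checkpoint for unknown symbols could catch the `ValueError` and fall back to it instead.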
Non-Functional
- Runs via Railway cron on the RTX 4060 service
- Monthly cadence initially; drop to quarterly after 3 stable months
- Completes in <8 hours per asset class (serialized to avoid GPU contention)
- Preserves old checkpoints (rolling 6-month retention for rollback)
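The retention requirement could be implemented roughly as follows. This is a sketch that assumes each checkpoint lives in a subdirectory named `YYYY-MM-DD` under its asset-class directory; the function name `prune_checkpoints` is hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def prune_checkpoints(class_dir: Path, keep: int = 6) -> list[Path]:
    """Delete all but the `keep` newest dated subdirectories; return what was removed."""
    dated = []
    for child in class_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            dated.append((date.fromisoformat(child.name), child))
        except ValueError:
            continue  # ignore directories that are not named as ISO dates
    dated.sort(reverse=True)  # newest first
    stale = [path for _, path in dated[keep:]]
    for path in stale:
        shutil.rmtree(path)
    return stale
```

With a monthly cadence, `keep=6` corresponds to the 6-month rollback window; after a switch to quarterly the same count would span a longer period, so the value may need revisiting.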
Architecture
Cadence Decision Rule
Drop from monthly to quarterly when:
```
ROLLING_3_MONTH_SHARPE_DELTA < 0.05
# i.e. last 3 monthly retrains showed < 0.05 Sharpe improvement over prior version
# → drift is slow, monthly is overkill, switch to quarterly
```
Checkpoint Layout
```
kronos-service/checkpoints/
├── base/                       # Phase 3 output (pooled, all asset classes)
│   ├── lora_adapter.pt
│   └── event_emb.pt
└── rolling/
    ├── crypto/
    │   ├── 2026-04-01/
    │   │   ├── lora_adapter.pt
    │   │   ├── event_emb.pt
    │   │   └── metadata.json   # {train_start, train_end, val_sharpe, val_mae}
    │   └── 2026-03-01/
    ├── forex_commodity/
    └── equity/
```
Training Config
Per-class differences from Phase 3:
- Data: last 2Y only (rolling window)
- Initial weights: current in-production LoRA (warm start, not random)
- Epochs: 5 (fewer — just adaptation, not from-scratch learning)
- LR: 5e-5 (half of Phase 3 — finer adaptation)
- All other params: same as Phase 3
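The overrides above can be captured in a small config object. The sketch below is illustrative only: `RollingConfig` and `rolling_window` are hypothetical names, the Phase 3 learning rate of 1e-4 is inferred from "5e-5 (half of Phase 3)", and treating the run date as the inclusive end of the window is an assumption.

```python
from dataclasses import dataclass
from datetime import date

PHASE3_LR = 1e-4  # inferred from "LR: 5e-5 (half of Phase 3)"


@dataclass(frozen=True)
class RollingConfig:
    asset_class: str
    train_start: date
    train_end: date
    warm_start: str              # path to the current in-production LoRA
    epochs: int = 5              # adaptation only, not from-scratch learning
    lr: float = PHASE3_LR / 2    # 5e-5


def rolling_window(run_date: date, years: int = 2) -> tuple[date, date]:
    """Return (train_start, train_end) for a window ending on run_date."""
    try:
        start = run_date.replace(year=run_date.year - years)
    except ValueError:
        # run_date is Feb 29 and the target year is not a leap year
        start = run_date.replace(year=run_date.year - years, day=28)
    return start, run_date
```

For a run on 2026-04-01 this yields a window from 2024-04-01 to 2026-04-01, which would then populate `train_start` and `train_end` in `metadata.json`.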
Checkpoint Selection at Inference
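```python
import json
from pathlib import Path

BASE_DIR = Path("checkpoints/base")

def read_val_mae(ckpt_dir: Path) -> float | None:
    """Read val_mae from metadata.json; None if the file is absent."""
    meta = ckpt_dir / "metadata.json"
    return json.loads(meta.read_text())["val_mae"] if meta.exists() else None

def get_lora_checkpoint(symbol: str) -> Path:
    # classify_asset and find_latest_subdir are project helpers defined elsewhere
    asset_class = classify_asset(symbol)  # "crypto" | "forex_commodity" | "equity"
    latest = find_latest_subdir(Path("checkpoints/rolling") / asset_class)
    if latest is None:
        return BASE_DIR / "lora_adapter.pt"  # fallback
    # Optional: guard against regression (skipped if either metadata file is missing)
    new_mae, base_mae = read_val_mae(latest), read_val_mae(BASE_DIR)
    if new_mae is not None and base_mae is not None and new_mae > base_mae * 1.1:
        return BASE_DIR / "lora_adapter.pt"  # rolling failed, use base
    return latest / "lora_adapter.pt"
```
Implementation Steps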
- Create `scripts/kronos-rolling-finetune.py` (wraps Phase 3 training with rolling data)
- Add asset-class classifier helper (crypto / forex_commodity / equity)
- Implement warm-start from previous checkpoint (not random init)
- Implement checkpoint retention (keep last 6, delete older)
- Modify `kronos-service/predictor.py` to select LoRA checkpoint per symbol
- Add regression guard: validate new checkpoint before marking active
- Add Railway cron service: `kronos-rolling-finetune` (monthly, 1st of month 06:00 UTC)
- Implement cadence auto-adjustment (compare 3 monthly deltas → switch to quarterly if stable)
- Add evaluation hook: after training, run Phase 5 eval on new checkpoint
- Add alerting: notify if val_mae regresses >10% vs prior month
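The cadence auto-adjustment step could look roughly like this. It is a sketch under one reading of the decision rule, namely that each of the last three monthly retrains must have improved validation Sharpe by less than 0.05 over the prior version; the function name `next_cadence` and the list-of-deltas input are hypothetical.

```python
SHARPE_DELTA_THRESHOLD = 0.05
STABLE_MONTHS = 3


def next_cadence(sharpe_deltas: list[float], current: str = "monthly") -> str:
    """Return "quarterly" once the last three monthly retrains each improved
    validation Sharpe by less than the threshold; otherwise keep `current`."""
    if current != "monthly" or len(sharpe_deltas) < STABLE_MONTHS:
        return current
    recent = sharpe_deltas[-STABLE_MONTHS:]
    if all(delta < SHARPE_DELTA_THRESHOLD for delta in recent):
        return "quarterly"
    return current
```

Note that a negative delta also satisfies the rule as written, so a retrain that made things worse counts as "stable" here; rejecting such checkpoints is left to the regression guard.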
Key Files
- Create: `scripts/kronos-rolling-finetune.py`
- Modify: `kronos-service/predictor.py` (checkpoint selection logic)
- Modify: `kronos-service/finetune/train_event_conditioned.py` (accept `--warm-start` flag)
- Create: `kronos-service/checkpoints/rolling/` directory structure
| Risk | Likelihood | Mitigation |
|---|---|---|
| Monthly retrain overfits recent regime | Medium | Keep base checkpoint as fallback; regression guard |
| Warm-start drifts model too far from base | Medium | LR halved + only 5 epochs; strict val_mae threshold |
| GPU contention with live inference | Low | Schedule during low-traffic window (06:00 UTC Sunday); acquire inference semaphore |
| Checkpoint storage bloat | Low | 6-month retention = ~300MB per class × 3 classes = <1GB |
Success Criteria
- Monthly cron completes without manual intervention for 3 consecutive months
- Per-class checkpoint val_mae ≤ pooled base val_mae
- Inference correctly selects latest per-class checkpoint
- Regression guard fires if new checkpoint is worse (verified in test)
- Cadence auto-switches to quarterly after 3 stable months
- Old checkpoints retained for rollback (last 6 months)
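The "verified in test" criterion for the regression guard might be covered by a unit test along these lines. This is a sketch that isolates the 10% tolerance as a pure function; `passes_regression_guard` is a hypothetical name, not an existing helper in the service.

```python
def passes_regression_guard(new_val_mae: float, ref_val_mae: float,
                            tolerance: float = 0.10) -> bool:
    """True when the new checkpoint's val_mae is within `tolerance` of the reference."""
    return new_val_mae <= ref_val_mae * (1.0 + tolerance)


def test_regression_guard_fires_on_worse_checkpoint():
    assert passes_regression_guard(0.105, 0.100)      # 5% worse: accepted
    assert not passes_regression_guard(0.120, 0.100)  # 20% worse: rejected
    assert passes_regression_guard(0.090, 0.100)      # improvement: accepted
```

The reference value would be the pooled base `val_mae` for checkpoint selection and the prior month's `val_mae` for alerting.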