Data Gaps
Source: researcher-260423-1502-kronos-data-gaps.md (full analysis).
Per-channel status
| Ch | Channel | Status | Source | Notes |
|---|---|---|---|---|
| 0 | is_fomc | ✅ HAVE | economic_calendar | Binary, daily |
| 1 | is_cpi | ✅ HAVE | economic_calendar | Monthly |
| 2 | is_nfp | ✅ HAVE | economic_calendar | Monthly |
| 3 | is_gdp | ✅ HAVE | economic_calendar | Quarterly |
| 4 | is_pce | ✅ HAVE | economic_calendar | Monthly |
| 5 | cpi_surprise_z | 🟡 PARTIAL | economic_calendar.actual − forecast | Need rolling-20 std cache |
| 6 | fomc_hawkish_score | ❌ MISSING | — | No Fed statement text stored |
| 7 | nfp_surprise_z | 🟡 PARTIAL | Same as CPI | Same rolling-std work |
| 8 | is_earnings | ✅ HAVE | economic_calendar filtered | title ILIKE '% Earnings' |
| 9 | is_rate_decision | ✅ HAVE | central_bank_rates | Via bank_code + change_date |
| 10 | days_to_fomc_sin | ✅ HAVE | Derivable | Pure compute from is_fomc |
| 11 | days_to_fomc_cos | ✅ HAVE | Derivable | — |
| 12 | days_to_cpi_sin | ✅ HAVE | Derivable | — |
| 13 | days_to_cpi_cos | ✅ HAVE | Derivable | — |
| 14 | btc_log_return_1h | ✅ HAVE | ohlcv_1h, BTC-USD | 40 symbols tracked |
| 15 | spy_log_return_1h | ✅ HAVE | ohlcv_1h, SPY | NYSE hours — mask when closed |
| 16 | dxy_log_return_1h | 🟡 PARTIAL | ohlcv_1h, DX-Y.NYB | Only <35 d history — need ≥6 mo backfill |
| 17 | vix_level_z | 🟡 PARTIAL | FRED VIXCLS daily | Daily only, no hourly |
| 18–19 | reserved | — | — | — |
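
The days_to_*_sin / days_to_*_cos channels are marked as pure compute. A minimal sketch of one way to derive them, assuming the feature is a unit-circle encoding of the day count to the next event. The 45-day period (roughly the spacing of eight FOMC meetings a year) and both function names are assumptions for illustration, not values taken from the researcher's analysis.

```python
import math
from datetime import date
from typing import Optional, Sequence, Tuple


def days_to_next_event(today: date, event_dates: Sequence[date]) -> Optional[int]:
    """Days until the next event on or after `today`; None if none is scheduled."""
    upcoming = [d for d in event_dates if d >= today]
    if not upcoming:
        return None
    return (min(upcoming) - today).days


def cyclical_encode(days: int, period_days: float = 45.0) -> Tuple[float, float]:
    """Return (sin, cos) of the day count mapped onto a circle of `period_days`.

    period_days is a hypothetical default; pick it per event type.
    """
    angle = 2.0 * math.pi * (days % period_days) / period_days
    return math.sin(angle), math.cos(angle)


# Example: FOMC dates here are placeholders, not real calendar rows.
fomc_dates = [date(2024, 1, 31), date(2024, 3, 20)]
days = days_to_next_event(date(2024, 1, 1), fomc_dates)
if days is not None:
    fomc_sin, fomc_cos = cyclical_encode(days)
```

Encoding the count as a sin/cos pair keeps the feature continuous when the countdown resets after each event, which a raw day count would not.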
Readiness summary
| Status | Channels | % |
|---|---|---|
| ✅ HAVE — ready now | 14 | 70 % |
| 🟡 PARTIAL — proxy available | 4 | 20 % |
| ❌ MISSING — needs new source | 2 | 10 % |
Estimated quality impact
| Path | Channels usable | Expected quality vs. oracle |
|---|---|---|
| P0 only (6 h work, proxies) | 18 / 20 | ~75–80 % |
| P0 + P1 (backfill, ~16 h) | 19 / 20 | ~85–90 % |
| All backfills (~32 h) | 20 / 20 | ~95 %+ |
Recommendation: proceed to Phase 3 training with P0 proxies; upgrade to Tier-1 sources after seeing training results. See the Backfill Plan for details.
The six gaps, briefly
🟡 Surprise z-scores (channels 5, 7)
economic_calendar.actual and .forecast exist. The missing piece is the rolling-20 standard deviation per event type. Materialise it as a view or cache table; ~4 h.
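A minimal sketch of the rolling-20 computation, in plain Python rather than as the view or cache table itself. It assumes the z-score is the surprise (actual minus forecast) divided by the sample standard deviation of the previous 20 surprises for the same event type, excluding the current release to avoid look-ahead. That definition, the row shape, and the function name are assumptions.

```python
from collections import defaultdict, deque
from statistics import stdev

WINDOW = 20  # rolling window length named in the text


def surprise_z_scores(rows, window=WINDOW):
    """Compute surprise z-scores per event type.

    rows: iterable of (event_type, actual, forecast), sorted by release time.
    Returns one value per row: a float, or None while fewer than `window`
    prior surprises exist for that event type (or their std is zero).
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for event_type, actual, forecast in rows:
        surprise = actual - forecast
        past = history[event_type]
        if len(past) == window:
            sd = stdev(past)  # sample std of the prior `window` surprises
            out.append(surprise / sd if sd > 0 else None)
        else:
            out.append(None)
        past.append(surprise)  # the current release only affects later rows
    return out
```

The same function covers channels 5 and 7, since the history is keyed by event type. The early None values would need a policy (mask, zero-fill, or shorter warm-up window) before training.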
❌ FOMC hawkish score (channel 6)
No FOMC statement text in the DB. Two paths:
- Tier 1 (full, 16 h): Scrape Fed statements → PyPDF2 → FinBERT sentiment → fomc_statements table.
- Tier 2 (proxy, 2 h): sign(rate_change) × |delta| / 0.25. Drops accuracy but unblocks Phase 3 in a morning.
Ship Tier 2 to start; upgrade to Tier 1 later.
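The Tier 2 formula can be sketched as follows. This assumes rate_change and delta both refer to the same policy-rate change in percentage points, so the score is the change expressed in 25 bp steps; the function name is hypothetical.

```python
import math

STEP = 0.25  # normalising step from the Tier 2 formula, in percentage points


def hawkish_proxy(rate_change: float, step: float = STEP) -> float:
    """sign(rate_change) * |rate_change| / step; 0.0 when the rate is unchanged.

    Assumes `rate_change` is the decision's change in percentage points
    (positive = hike, negative = cut).
    """
    if rate_change == 0:
        return 0.0
    return math.copysign(1.0, rate_change) * abs(rate_change) / step


# A 50 bp hike scores +2.0, a 25 bp cut scores -1.0, a hold scores 0.0.
scores = [hawkish_proxy(x) for x in (0.50, -0.25, 0.0)]
```

A hold always maps to 0.0 under this proxy, so it cannot distinguish a hawkish hold from a dovish one; that is the accuracy cost Tier 1 is meant to recover.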
🟡 DXY hourly OHLCV (channel 16)
DX-Y.NYB is in tickers and ohlcv_1h exists, but only ~35 days of hourly history. Backfill 2021→today from Yahoo DX=F or Alpha Vantage. ~8 h.
🟡 VIX — hourly missing (channel 17)
FRED VIXCLS provides daily VIX, which is fine for daily-horizon predictions. Hourly-horizon models need intraday VIX from CBOE (subscription) or a proxy derived from SPY realised vol (~72 % correlation). P3; not blocking v1.
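A rough sketch of the SPY realised-vol proxy, assuming it is the annualised sample standard deviation of hourly log returns. The annualisation factor (252 trading days × 7 hourly bars per NYSE session) and the function names are assumptions; the ~72 % correlation figure comes from the researcher's analysis and is not reproduced here.

```python
import math
from statistics import stdev

BARS_PER_YEAR = 252 * 7  # assumed: ~7 hourly bars per session, 252 sessions


def log_returns(closes):
    """Log returns between consecutive closes."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def realised_vol(closes, bars_per_year=BARS_PER_YEAR):
    """Annualised realised volatility in percent from a window of hourly closes."""
    rets = log_returns(closes)
    if len(rets) < 2:
        raise ValueError("need at least three closes to estimate volatility")
    return stdev(rets) * math.sqrt(bars_per_year) * 100.0


# Placeholder prices, not real SPY data.
vol_pct = realised_vol([100.0, 101.0, 100.0, 102.0])
```

The window should only span bars from open sessions, in line with the NYSE-hours masking noted for channel 15; overnight gaps would otherwise inflate the estimate.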
🟡 Earnings coverage expansion
economic_calendar has earnings entries for some tickers but coverage is uneven. Finnhub earnings API can fill gaps. ~2 h.
Unresolved questions (from researcher)
- Error tolerance for VIX Tier-2 proxy (SPY realised vol, ~72 % correlation)?
- Fed-website HTML structure stable enough for a long-lived scraper?
- DXY bar_time timezone in ohlcv_1h: aligned to UTC or ET?
- Compute earnings surprise z-score from earnings_estimates, or treat as binary-only?
These won’t block Phase 3 kickoff, but resolving them will tighten the final model.