Backfill Plan
Per-gap remediation for the gaps listed in the researcher report. Work is tiered so that Phase 3 training can start after about 6 h of work, with higher-quality data layered in later.
P0 · Unblock Phase 3 training (≈ 6 h total)
These three tasks are the minimum to start LoRA training with 18 / 20 usable channels.
1. Surprise z-score cache (4 h)
Gap: channels 5 (cpi_surprise_z) and 7 (nfp_surprise_z) need rolling-20 standard deviation per event type.
Action:
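    CREATE TABLE event_surprise_stats (
        event_type  TEXT PRIMARY KEY,
        rolling_std NUMERIC NOT NULL,
        sample_size INTEGER NOT NULL,
        updated_at  TIMESTAMPTZ DEFAULT now()
    );

    -- Populate (re-runnable, so the daily cron can refresh rows in place)
    INSERT INTO event_surprise_stats (event_type, rolling_std, sample_size)
    SELECT event_type,
           stddev_samp(actual - forecast) AS rolling_std,
           count(*) AS sample_size
    FROM economic_calendar
    WHERE actual IS NOT NULL AND forecast IS NOT NULL
    GROUP BY event_type
    HAVING count(*) >= 2  -- stddev_samp is NULL for a single row
    ON CONFLICT (event_type) DO UPDATE
        SET rolling_std = EXCLUDED.rolling_std,
            sample_size = EXCLUDED.sample_size,
            updated_at  = now();

The encoder looks up rolling_std per event type at inference. A cron job refreshes the table daily. Note that this query aggregates over the full history of each event type; restricting it to the trailing 20 releases needs a window over the release timestamp in economic_calendar.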
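The gap asks for a rolling-20 standard deviation, so it may help to see the windowed calculation spelled out. The following is a minimal sketch, not the encoder's actual code: the function name `surprise_z` and the choice to divide the raw surprise by the rolling standard deviation (without subtracting a rolling mean) are assumptions made for illustration.

```python
from statistics import stdev
from typing import Optional, Sequence

WINDOW = 20  # trailing releases per event type, taken from the gap description


def surprise_z(
    history: Sequence[float],
    actual: float,
    forecast: float,
    window: int = WINDOW,
) -> Optional[float]:
    """Scale the latest surprise by the std of the trailing `window` surprises.

    `history` holds past (actual - forecast) values, oldest first, and must
    not include the release being scored. Hypothetical helper, not an
    existing function in the codebase.
    """
    recent = list(history)[-window:]
    if len(recent) < 2:
        return None  # sample standard deviation is undefined
    rolling_std = stdev(recent)
    if rolling_std == 0.0:
        return None  # avoid dividing by zero on a flat history
    return (actual - forecast) / rolling_std
```

Returning `None` rather than 0.0 lets the caller decide whether to mask the channel when the history is too short.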
2. FOMC hawkish-score Tier-2 proxy (2 h)
Gap: channel 6 (fomc_hawkish_score) has no source.
Proxy formula:
    def hawkish_score_tier2(rate_change_bps: int) -> float:
        """
        Map a policy-rate change to a hawkish / dovish score.

        +1 = 25 bps hike
         0 = no change
        -1 = 25 bps cut

        Clipped to [-2, +2]: a 50 bps move maps to ±2 and larger moves saturate.
        """
        return max(-2.0, min(2.0, rate_change_bps / 25.0))

This is good enough to let LoRA learn a first-order hawkish / dovish response. The Tier-1 NLP backfill is a post-launch upgrade.
3. VIX daily z-score (already available, 0 h)
Channel 17 is marked PARTIAL, but the daily VIX is already in economic_indicators via FRED VIXCLS. Use daily-resolution VIX for now: either mask the hourly bars that have no value or forward-fill the daily value into them, with an explicit mask channel. Zero new work.
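Forward-filling with a mask channel can be sketched as follows. This is an illustration under stated assumptions, not the existing pipeline: the function name is hypothetical, and each hourly bar is given the most recent daily value from a strictly earlier date, on the assumption that a daily close should not be visible to bars of the same day.

```python
from bisect import bisect_left
from datetime import date, datetime
from typing import Dict, List, Tuple


def ffill_daily_to_hourly(
    bar_times: List[datetime],
    daily_values: Dict[date, float],
) -> List[Tuple[float, float]]:
    """Return one (value, mask) pair per hourly bar.

    mask is 1.0 when a prior daily value exists and 0.0 when the value is
    only a placeholder. Hypothetical helper for illustration.
    """
    days = sorted(daily_values)
    out: List[Tuple[float, float]] = []
    for t in bar_times:
        # Number of observation days strictly before this bar's date.
        i = bisect_left(days, t.date())
        if i == 0:
            out.append((0.0, 0.0))  # nothing to carry forward yet
        else:
            out.append((daily_values[days[i - 1]], 1.0))
    return out
```

Weekend and holiday bars simply inherit the last available value, which is the usual forward-fill behaviour.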
P1 · Layered improvements (≈ 10 h, background during training)
4. DXY hourly backfill (8 h)
Gap: channel 16 has only ~35 days of hourly history.
Source: Yahoo Finance DX=F (or Alpha Vantage if Yahoo coverage is insufficient).
Action:
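    from datetime import datetime, timedelta, timezone

    import yfinance as yf

    # Yahoo typically serves 1h bars only for about the last 730 days, so a
    # fixed start such as 2021-01-01 is likely to be rejected. Request the
    # widest recent window instead; older history needs the fallback source.
    start = datetime.now(timezone.utc) - timedelta(days=729)
    df = yf.Ticker("DX=F").history(interval="1h", start=start)
    # Transform + insert into ohlcv_1h with symbol='DX-Y.NYB'

Guard: verify bar_time timezone alignment with existing hourly bars before insert (UTC vs. ET).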
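The timezone guard can be sketched as a small normalisation step that runs before the insert. This is a sketch under assumptions: the function name and the `expected_minute` parameter are hypothetical, and it assumes the existing ohlcv_1h bars are stored in UTC and start on the hour.

```python
from datetime import datetime, timezone
from typing import Iterable, List


def to_utc_bars(
    bar_times: Iterable[datetime],
    expected_minute: int = 0,
) -> List[datetime]:
    """Convert tz-aware bar timestamps to UTC and reject ambiguous ones.

    Hypothetical helper: `expected_minute` is the minute offset at which
    existing hourly bars are assumed to start.
    """
    out: List[datetime] = []
    for t in bar_times:
        if t.tzinfo is None or t.utcoffset() is None:
            raise ValueError(f"naive timestamp, cannot align safely: {t!r}")
        u = t.astimezone(timezone.utc)
        if (u.minute, u.second, u.microsecond) != (expected_minute, 0, 0):
            raise ValueError(f"bar is off the expected boundary: {u.isoformat()}")
        out.append(u)
    return out
```

Failing loudly on naive timestamps is deliberate: a silent UTC vs. ET mix-up would shift every bar by several hours.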
5. Earnings coverage expansion (2 h)
Gap: channel 8 has earnings for some tickers, but coverage is uneven across our 23 instruments.
Source: Finnhub earnings calendar (existing gateway integration).
Action: Loop instrument list → Finnhub earnings calendar → UPSERT into economic_calendar with event_type = 'earnings'.
P3 · Tier-1 upgrades (post-launch, ≈ 28 h)
These cement quality to ~95 % but aren’t on the critical path.
6. FOMC Tier-1 hawkish NLP (16 h)
Action:
- Scrape federalreserve.gov/monetarypolicy/fomccalendars.htm for meeting dates + statement links.
- Download statement PDFs, extract text via pdfplumber.
- Score with FinBERT (ProsusAI/finbert) → [-1, +1] score (dovish ↔ hawkish, matching the Tier-2 sign convention).
- Store in new fomc_statements table; back-populate history from 2019.
Schema:

    CREATE TABLE fomc_statements (
        meeting_date        DATE PRIMARY KEY,
        statement_text      TEXT NOT NULL,
        hawkish_score       REAL,
        hawkish_score_model TEXT DEFAULT 'ProsusAI/finbert',
        rate_change_bps     INTEGER,
        created_at          TIMESTAMPTZ DEFAULT now()
    );

Risk: the Fed website structure may shift, and PDF parsing is fragile. FinBERT needs a GPU for batch inference.
7. VIX hourly backfill (12 h)
Gap: channel 17 intraday resolution.
Source options:
- CBOE historical intraday — authoritative but requires a data subscription.
- Derived from SPY realised vol — free, ~72 % correlation, acceptable when envelope-widening is the primary use.
Recommendation: Skip for v1. Revisit if VIX channel shows high feature importance in the Phase 5 meta-learner.
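If the derived option is revisited, the realised-volatility calculation itself is short. The sketch below uses the standard annualised standard deviation of log returns; the function name is hypothetical, and `bars_per_year` is left to the caller because the number of hourly SPY bars per session depends on the data vendor.

```python
import math
from statistics import stdev
from typing import Sequence


def realised_vol(closes: Sequence[float], bars_per_year: float) -> float:
    """Annualised realised volatility from consecutive closes.

    Returns a fraction (0.20 means 20 %); multiply by 100 to compare with
    VIX index points. Hypothetical helper for illustration.
    """
    if len(closes) < 3:
        raise ValueError("need at least 3 closes for a sample standard deviation")
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return stdev(log_returns) * math.sqrt(bars_per_year)
```

Realised volatility looks backward while VIX is an implied, forward-looking measure, which is consistent with the imperfect correlation quoted above.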
Ownership + timeline
| Task | Tier | Owner | Target |
|---|---|---|---|
| Surprise z-score cache | P0 | Data infra | W1 D1 |
| FOMC Tier-2 proxy | P0 | ML | W1 D1 |
| DXY hourly backfill | P1 | Data infra | W1 D3 |
| Earnings expansion | P1 | Data infra | W2 |
| FOMC Tier-1 NLP | P3 | ML | Post-launch |
| VIX hourly | P3 | Data infra | Deferred |
All P0 work lands before Phase 3 LoRA training starts — total 6 hours of blocking effort. P1 work runs in parallel with Phase 2 model modification.
Go / no-go gate
Phase 3 training is GO when:
1. event_surprise_stats table populated with ≥20 samples per primary event type (FOMC, CPI, NFP)
2. FOMC Tier-2 proxy integrated into EventEncoder._fomc_hawkish_score()
3. DXY hourly coverage ≥180 days (6 months), or accept a masked channel during coverage gaps
If (3) slips, train with DXY masked and retrofit when backfill completes. This keeps the critical path moving.
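The gate can be expressed as a small check that separates the hard blockers from the DXY fallback. This is a sketch, assuming the event types are stored under the keys "FOMC", "CPI" and "NFP"; the function name and its inputs are hypothetical and would be fed from event_surprise_stats and the ohlcv_1h coverage query.

```python
from typing import Dict, Tuple

PRIMARY_EVENT_TYPES = ("FOMC", "CPI", "NFP")  # assumed key spelling
MIN_SAMPLES = 20
MIN_DXY_DAYS = 180


def phase3_gate(
    sample_sizes: Dict[str, int],
    fomc_proxy_integrated: bool,
    dxy_coverage_days: int,
) -> Tuple[bool, bool]:
    """Return (go, mask_dxy) for the Phase 3 go / no-go decision.

    Conditions 1 and 2 block training; condition 3 only decides whether
    the DXY channel is trained masked. Hypothetical helper.
    """
    stats_ok = all(
        sample_sizes.get(event, 0) >= MIN_SAMPLES for event in PRIMARY_EVENT_TYPES
    )
    go = stats_ok and fomc_proxy_integrated
    mask_dxy = dxy_coverage_days < MIN_DXY_DAYS
    return go, mask_dxy
```

Keeping `mask_dxy` separate from `go` mirrors the fallback above: a late DXY backfill changes how training runs, not whether it starts.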