Backfill Plan

Per-gap remediation from the researcher report. Work is tiered so Phase 3 training can start after ≈ 6 h of blocking work, with higher-quality data layered in later.

P0 · Unblock Phase 3 training (≈ 6 h total)

These three tasks are the minimum to start LoRA training with 18 / 20 usable channels.

1. Surprise z-score cache (4 h)

Gap: channels 5 (cpi_surprise_z) and 7 (nfp_surprise_z) need rolling-20 standard deviation per event type.

Action:

CREATE TABLE event_surprise_stats (
  event_type TEXT PRIMARY KEY,
  rolling_std NUMERIC NOT NULL,
  sample_size INTEGER NOT NULL,
  updated_at TIMESTAMPTZ DEFAULT now()
);

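-- Populate; safe to re-run from the daily cron. event_time is an assumed column name.
INSERT INTO event_surprise_stats (event_type, rolling_std, sample_size)
SELECT event_type,
       stddev_samp(actual - forecast) AS rolling_std,
       count(*) AS sample_size
FROM (
  SELECT event_type, actual, forecast,
         row_number() OVER (PARTITION BY event_type ORDER BY event_time DESC) AS rn
  FROM economic_calendar
  WHERE actual IS NOT NULL AND forecast IS NOT NULL
) recent
WHERE rn <= 20  -- rolling-20 window: the 20 most recent releases per event type
GROUP BY event_type
HAVING count(*) >= 2  -- stddev_samp is NULL for a single row; rolling_std is NOT NULL
ON CONFLICT (event_type) DO UPDATE
  SET rolling_std = EXCLUDED.rolling_std, sample_size = EXCLUDED.sample_size, updated_at = now();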

Encoder looks up rolling_std per event type at inference. Cron refreshes daily.
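
A minimal sketch of the lookup step, assuming the encoder holds an in-memory copy of event_surprise_stats keyed by event type. The function name surprise_z, the min_samples threshold, and the clip value are illustrative assumptions and are not taken from the existing encoder.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical in-memory copy of event_surprise_stats:
# event_type -> (rolling_std, sample_size)
Stats = Mapping[str, Tuple[float, int]]


def surprise_z(event_type: str, actual: float, forecast: float,
               stats: Stats, min_samples: int = 20,
               clip: float = 5.0) -> Optional[float]:
    """Return (actual - forecast) / rolling_std, or None if the channel should be masked."""
    row = stats.get(event_type)
    if row is None:
        return None  # no stats for this event type yet
    rolling_std, sample_size = row
    if sample_size < min_samples or rolling_std <= 0:
        return None  # too few samples or a degenerate std
    z = (actual - forecast) / rolling_std
    # Clip so a single outlier release cannot dominate the channel (assumed bound).
    return max(-clip, min(clip, z))
```

Returning None rather than 0.0 lets the caller tell "no surprise" apart from "no usable statistics" and mask the channel in the latter case.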

2. FOMC hawkish-score Tier-2 proxy (2 h)

Gap: channel 6 (fomc_hawkish_score) has no source.

Proxy formula:

def hawkish_score_tier2(rate_change_bps: int) -> float:
    """
    +1  = 25 bps hike
     0  = no change
    -1  = 25 bps cut
    ±2 = 50 bps move; larger moves (e.g. 75 bps) are clipped to [-2, +2].
    """
    return max(-2.0, min(2.0, rate_change_bps / 25.0))

Good enough to let LoRA learn a first-order hawkish / dovish response. Tier-1 NLP backfill is a post-launch upgrade.

3. VIX daily z-score (already available, 0 h)

Channel 17 is marked PARTIAL, but the daily VIX is already in economic_indicators via FRED VIXCLS. Use daily-resolution VIX for now: forward-fill it into hourly bars and carry an explicit mask channel for bars that have no valid value. Zero new work.
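
One way to sketch the daily-to-hourly fill with a mask, using only the standard library. The function name, the max_age_days staleness cutoff, and the choice to use the previous day's close (to avoid look-ahead, since a daily close is not known during that day's bars) are assumptions for illustration.

```python
from bisect import bisect_left
from datetime import date, datetime
from typing import Dict, List, Sequence, Tuple


def ffill_daily_to_hourly(daily: Dict[date, float],
                          bar_times: Sequence[datetime],
                          max_age_days: int = 5) -> List[Tuple[float, int]]:
    """Return one (value, mask) pair per hourly bar; mask is 1 when the value is valid."""
    days = sorted(daily)
    out: List[Tuple[float, int]] = []
    for ts in bar_times:
        # Index of the latest daily value strictly before the bar's calendar date.
        i = bisect_left(days, ts.date()) - 1
        if i < 0 or (ts.date() - days[i]).days > max_age_days:
            out.append((0.0, 0))  # nothing recent enough: masked
        else:
            out.append((daily[days[i]], 1))
    return out
```

The same helper works whether the daily series holds raw VIX levels or precomputed z-scores, because it only forwards values and never recomputes them.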

P1 · Layered improvements (≈ 10 h, background during training)

4. DXY hourly backfill (8 h)

Gap: channel 16 has only ~35 days of hourly history.

Source: Yahoo Finance DX=F (or Alpha Vantage if Yahoo coverage insufficient).

Action:

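from datetime import datetime, timedelta, timezone
import yfinance as yf
# Yahoo serves 1h bars only for roughly the last 730 days, so a 2021 start date fails.
start = datetime.now(timezone.utc) - timedelta(days=729)
df = yf.Ticker("DX=F").history(interval="1h", start=start)
# Transform + insert into ohlcv_1h with symbol='DX-Y.NYB'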

Guard: verify bar_time timezone alignment with existing hourly bars before insert (UTC vs. ET).
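
The guard could be sketched as below, assuming the existing hourly bars are stored in UTC on the top of the hour and that naive incoming timestamps are in US Eastern time. Both assumptions, and the helper names, are hypothetical and should be checked against ohlcv_1h before use.

```python
from datetime import datetime, timezone
from typing import Iterable, List
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")


def to_utc_bar_time(ts: datetime, assume_tz: ZoneInfo = ET) -> datetime:
    """Convert a bar timestamp to UTC; naive inputs are assumed to be in assume_tz."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=assume_tz)
    return ts.astimezone(timezone.utc)


def off_grid(bar_times: Iterable[datetime], expected_minute: int = 0) -> List[datetime]:
    """Return UTC bar times that do not sit on the expected minute of the hour."""
    utc_times = [to_utc_bar_time(t) for t in bar_times]
    return [t for t in utc_times if (t.minute, t.second) != (expected_minute, 0)]
```

An empty result from off_grid is a reasonable precondition for the insert; a non-empty one suggests the source uses a different bar boundary (for example :30) than the existing table.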

5. Earnings coverage expansion (2 h)

Gap: channel 8 has earnings for some tickers but coverage is uneven for our 23 instruments.

Source: Finnhub earnings calendar (existing gateway integration).

Action: Loop instrument list → Finnhub earnings calendar → UPSERT into economic_calendar with event_type = 'earnings'.

P3 · Tier-1 upgrades (post-launch, ≈ 28 h)

These raise data quality to ~95 % but are not on the critical path.

6. FOMC Tier-1 hawkish NLP (16 h)

Action:

Scrape federalreserve.gov/monetarypolicy/fomccalendars.htm for meeting dates + statement links.
Download statement PDFs, extract text via pdfplumber.
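Score with FinBERT (ProsusAI/finbert) and map to [-1, +1], dovish ↔ hawkish, matching the Tier-2 sign convention. FinBERT emits positive / negative / neutral sentiment, so the mapping to a hawkish score needs validation.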
Store in new fomc_statements table; back-populate history from 2019.

CREATE TABLE fomc_statements (
  meeting_date DATE PRIMARY KEY,
  statement_text TEXT NOT NULL,
  hawkish_score REAL,
  hawkish_score_model TEXT DEFAULT 'ProsusAI/finbert',
  rate_change_bps INTEGER,
  created_at TIMESTAMPTZ DEFAULT now()
);

Risk: Fed website structure may shift; PDF parsing is fragile. FinBERT needs GPU for batch inference.

7. VIX hourly backfill (12 h)

Gap: channel 17 intraday resolution.

Source options:

CBOE historical intraday — authoritative but requires a data subscription.
Derived from SPY realised vol — free, ~72 % correlation, acceptable when envelope-widening is the primary use.
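
For the derived option, a minimal sketch of annualised realised volatility from consecutive SPY closes. The bars_per_year default (252 sessions × 7 hourly bars) and the function name are assumptions; the window length and any rescaling to VIX levels are left open.

```python
import math
from statistics import stdev
from typing import Sequence


def realised_vol_pct(closes: Sequence[float], bars_per_year: int = 252 * 7) -> float:
    """Annualised realised volatility, in percent, from consecutive closes."""
    if len(closes) < 3:
        raise ValueError("need at least 3 closes to estimate volatility")
    # Log returns between consecutive bars.
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    # Sample std of per-bar returns, scaled to a yearly horizon.
    return stdev(log_returns) * math.sqrt(bars_per_year) * 100.0
```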

Recommendation: Skip for v1. Revisit if VIX channel shows high feature importance in the Phase 5 meta-learner.

Ownership + timeline

Task	Tier	Owner	Target
Surprise z-score cache	P0	Data infra	W1 D1
FOMC Tier-2 proxy	P0	ML	W1 D1
DXY hourly backfill	P1	Data infra	W1 D3
Earnings expansion	P1	Data infra	W2
FOMC Tier-1 NLP	P3	ML	Post-launch
VIX hourly	P3	Data infra	Deferred

All P0 work lands before Phase 3 LoRA training starts — total 6 hours of blocking effort. P1 work runs in parallel with Phase 2 model modification.

Go / no-go gate

Phase 3 training is GO when:

(1) event_surprise_stats table populated with ≥20 samples per primary event type (FOMC, CPI, NFP)
(2) FOMC Tier-2 proxy integrated into EventEncoder._fomc_hawkish_score()
(3) DXY hourly coverage ≥180 days (6 months), or accept a masked channel during coverage gaps

If (3) slips, train with DXY masked and retrofit when backfill completes. This keeps the critical path moving.
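
The gate can be expressed as a small check. This sketch reads the criteria so that the first two block training while the DXY criterion only decides whether channel 16 is masked; that reading, and the function and parameter names, are assumptions.

```python
from typing import Dict, Tuple

PRIMARY_EVENTS = ("FOMC", "CPI", "NFP")


def phase3_gate(sample_sizes: Dict[str, int],
                fomc_proxy_integrated: bool,
                dxy_hourly_days: int) -> Tuple[bool, bool]:
    """Return (go, mask_dxy) for the Phase 3 go / no-go decision."""
    # Every primary event type needs at least 20 samples in event_surprise_stats.
    stats_ok = all(sample_sizes.get(e, 0) >= 20 for e in PRIMARY_EVENTS)
    go = stats_ok and fomc_proxy_integrated
    # Short DXY history does not block training; it only forces the masked channel.
    mask_dxy = dxy_hourly_days < 180
    return go, mask_dxy
```

With the current ~35 days of DXY history and the other two criteria met, this returns (True, True): train now with DXY masked.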