Backfill Plan

Per-gap remediation plan derived from the researcher report. Work is tiered so that Phase 3 training can start after about 6 h of blocking work, with higher-quality data layered in later.

P0 · Unblock Phase 3 training (≈ 6 h total)

These three tasks are the minimum to start LoRA training with 18 / 20 usable channels.

Gap: channels 5 (cpi_surprise_z) and 7 (nfp_surprise_z) need rolling-20 standard deviation per event type.

Action:

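CREATE TABLE event_surprise_stats (
    event_type   TEXT PRIMARY KEY,
    rolling_std  NUMERIC NOT NULL,
    sample_size  INTEGER NOT NULL,
    updated_at   TIMESTAMPTZ DEFAULT now()
);

-- Populate or refresh: std of the 20 most recent surprises per event type.
-- NOTE: event_time is an assumed name for the release-timestamp column.
INSERT INTO event_surprise_stats (event_type, rolling_std, sample_size)
SELECT event_type,
       stddev_samp(actual - forecast) AS rolling_std,
       count(*) AS sample_size
FROM (
    SELECT event_type, actual, forecast,
           row_number() OVER (PARTITION BY event_type
                              ORDER BY event_time DESC) AS rn
    FROM economic_calendar
    WHERE actual IS NOT NULL AND forecast IS NOT NULL
) recent
WHERE rn <= 20
GROUP BY event_type
HAVING count(*) >= 2  -- stddev_samp is NULL for a single row
ON CONFLICT (event_type) DO UPDATE
SET rolling_std = EXCLUDED.rolling_std,
    sample_size = EXCLUDED.sample_size,
    updated_at  = now();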

Encoder looks up rolling_std per event type at inference. Cron refreshes daily.
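The lookup step can be sketched as follows. This is a minimal illustration, not the encoder's actual code: `SurpriseStats`, `surprise_z`, the `MIN_SAMPLES` threshold, and the idea of returning a mask alongside the z-score are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Mapping

MIN_SAMPLES = 20  # assumed threshold, mirroring the rolling-20 window


@dataclass(frozen=True)
class SurpriseStats:
    """One cached row of event_surprise_stats (hypothetical in-memory form)."""
    rolling_std: float
    sample_size: int


def surprise_z(event_type: str, actual: float, forecast: float,
               stats: Mapping[str, SurpriseStats]) -> tuple[float, float]:
    """Return (z_score, mask); mask is 0.0 when no trustworthy std exists."""
    row = stats.get(event_type)
    if row is None or row.sample_size < MIN_SAMPLES or row.rolling_std <= 0:
        return 0.0, 0.0  # neutral value, flagged as unusable
    return (actual - forecast) / row.rolling_std, 1.0
```

Returning a mask instead of raising keeps the channel usable for event types that have not yet accumulated enough history.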

Gap: channel 6 (fomc_hawkish_score) has no source.

Proxy formula:

def hawkish_score_tier2(rate_change_bps: int) -> float:
    """
    +1 = 25 bps hike
     0 = no change
    -1 = 25 bps cut
    A 50 bps move maps to +/-2; anything larger is clipped to [-2, +2].
    """
    return max(-2.0, min(2.0, rate_change_bps / 25.0))

Good enough to let LoRA learn a first-order hawkish / dovish response. Tier-1 NLP backfill is a post-launch upgrade.
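To illustrate how the proxy behaves across a run of meetings, the sketch below derives the basis-point change from consecutive policy-rate targets and applies the formula. `scores_from_targets` and the sample targets are hypothetical; the proxy itself is restated only so the snippet runs on its own.

```python
def hawkish_score_tier2(rate_change_bps: int) -> float:
    """Tier-2 proxy as defined in this plan, restated for a runnable example."""
    return max(-2.0, min(2.0, rate_change_bps / 25.0))


def scores_from_targets(targets_pct: list[float]) -> list[float]:
    """Score each meeting from the change in target rate (in percent).

    Hypothetical helper: assumes one target value per meeting, oldest first.
    """
    scores = []
    for prev, curr in zip(targets_pct, targets_pct[1:]):
        change_bps = round((curr - prev) * 100)  # percent -> basis points
        scores.append(hawkish_score_tier2(change_bps))
    return scores


# Illustrative targets: +25, +50, +75, 0, -25 bps between meetings.
print(scores_from_targets([0.25, 0.50, 1.00, 1.75, 1.75, 1.50]))
```

Note that the 50 bps and 75 bps hikes both score +2, so the proxy cannot distinguish large moves from very large ones.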

3. VIX daily z-score (already available, 0 h)

Channel 17 is marked PARTIAL, but daily VIX is already in economic_indicators via the FRED VIXCLS series. Use daily-resolution VIX for now: forward-fill it into hourly bars (or leave uncovered hours masked) and carry an explicit mask channel alongside it. Zero new work.
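A minimal pandas sketch of the forward-fill plus mask approach. The function name, the column names `vix` and `vix_mask`, and the convention that the daily series is indexed by the UTC time each value became available are assumptions for illustration.

```python
import pandas as pd


def daily_to_hourly(daily: pd.Series, hourly_index: pd.DatetimeIndex) -> pd.DataFrame:
    """Forward-fill a daily series onto hourly bars and emit a validity mask.

    Assumes `daily` is indexed by the UTC time at which each value became
    available, so forward-filling cannot leak future information.
    """
    filled = daily.sort_index().reindex(hourly_index, method="ffill")
    mask = filled.notna().astype("float64")  # 1.0 = value present, 0.0 = gap
    return pd.DataFrame({"vix": filled.fillna(0.0), "vix_mask": mask})
```

Hours before the first daily observation get a value of 0.0 with mask 0.0, so the model can tell a missing reading apart from a genuinely low one.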

P1 · Layered improvements (≈ 10 h, background during training)

Gap: channel 16 has only ~35 days of hourly history.

Source: Yahoo Finance DX=F (or Alpha Vantage if Yahoo coverage insufficient).

Action:

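# scripts/backfill_dxy_hourly.py
import yfinance as yf

# Yahoo serves 1h bars only for roughly the last 730 days, so a 2021 start
# date is rejected; request the maximum intraday window instead.
df = yf.Ticker("DX=F").history(interval="1h", period="730d")
# Transform + insert into ohlcv_1h with symbol='DX-Y.NYB'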

Guard: verify bar_time timezone alignment with existing hourly bars before insert (UTC vs. ET).
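One way to implement this guard is sketched below. `align_to_existing` is a hypothetical helper: it normalises the new bars to UTC and rejects them if their minute-of-hour offsets do not match those already present in the existing hourly bars. It assumes both indexes are timezone-aware.

```python
import pandas as pd


def align_to_existing(new_bars: pd.DataFrame,
                      existing_index: pd.DatetimeIndex) -> pd.DataFrame:
    """Return new_bars re-indexed in UTC, or raise if alignment looks wrong."""
    if new_bars.index.tz is None:
        raise ValueError("bar index is timezone-naive; refusing to guess UTC vs. ET")
    out = new_bars.copy()
    out.index = out.index.tz_convert("UTC")
    # Minute offsets (e.g. :00 or :30) must match what is already stored.
    expected = set(existing_index.tz_convert("UTC").minute)
    unexpected = set(out.index.minute) - expected
    if unexpected:
        raise ValueError(f"unexpected minute offsets in new bars: {sorted(unexpected)}")
    return out
```

This catches the two most likely failure modes, naive timestamps and bars opening at a different minute, but it does not detect a whole-hour shift; comparing a few overlapping bars by price remains a useful manual check.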

Gap: channel 8 has earnings for some tickers but coverage is uneven for our 23 instruments.

Source: Finnhub earnings calendar (existing gateway integration).

Action: Loop instrument list → Finnhub earnings calendar → UPSERT into economic_calendar with event_type = 'earnings'.
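The transform step of that loop might look like the sketch below. It assumes the gateway returns Finnhub's earnings-calendar payload unchanged (an `earningsCalendar` list with `symbol`, `date`, `epsActual`, and `epsEstimate` fields), which should be verified against the gateway. The target column names `symbol` and `event_date` are hypothetical; `actual`, `forecast`, and `event_type` follow the economic_calendar usage elsewhere in this plan.

```python
from typing import Any, Iterable


def earnings_to_calendar_rows(payload: dict[str, Any],
                              instruments: Iterable[str]) -> list[dict[str, Any]]:
    """Map an earnings-calendar payload to rows ready for UPSERT."""
    wanted = set(instruments)
    rows = []
    for item in payload.get("earningsCalendar", []):
        if item.get("symbol") not in wanted:
            continue  # skip tickers outside our instrument list
        rows.append({
            "event_type": "earnings",
            "symbol": item["symbol"],          # assumed column name
            "event_date": item["date"],        # assumed column name
            "actual": item.get("epsActual"),   # None until reported
            "forecast": item.get("epsEstimate"),
        })
    return rows
```

Rows with a missing `actual` are kept so that upcoming earnings dates still appear in the calendar; the surprise statistics query already filters out NULL actuals.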

P3 · Tier-1 upgrades (post-launch, ≈ 28 h)

These upgrades raise data quality to roughly 95 % but aren’t on the critical path.

Action:

  1. Scrape federalreserve.gov/monetarypolicy/fomccalendars.htm for meeting dates + statement links.
  2. Download statement PDFs, extract text via pdfplumber.
  3. Score with FinBERT (ProsusAI/finbert) → [-1, +1], dovish ↔ hawkish (positive = hawkish, matching the sign of the Tier-2 proxy).
  4. Store in new fomc_statements table; back-populate history from 2019.

CREATE TABLE fomc_statements (
    meeting_date         DATE PRIMARY KEY,
    statement_text       TEXT NOT NULL,
    hawkish_score        REAL,
    hawkish_score_model  TEXT DEFAULT 'ProsusAI/finbert',
    rate_change_bps      INTEGER,
    created_at           TIMESTAMPTZ DEFAULT now()
);

Risk: Fed website structure may shift; PDF parsing is fragile. FinBERT batch inference is best run on a GPU.
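ProsusAI/finbert classifies text as positive, negative, or neutral, and BERT-style models accept at most 512 tokens per input, so a statement will likely need to be scored in pieces and aggregated. The sketch below shows one possible aggregation over per-sentence class probabilities. The function name and, importantly, the mapping from sentiment labels to a hawkish sign are assumptions; this plan does not specify them, and the mapping would need validation against known hawkish and dovish statements.

```python
from typing import Mapping, Sequence

# ASSUMPTION: treating "positive" as hawkish is a placeholder, not a finding.
DEFAULT_SIGN: Mapping[str, float] = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def statement_score(sentence_probs: Sequence[Mapping[str, float]],
                    sign: Mapping[str, float] = DEFAULT_SIGN) -> float:
    """Average signed class probability over sentences, clipped to [-1, +1]."""
    if not sentence_probs:
        raise ValueError("no sentences to score")
    per_sentence = [
        sum(sign[label] * p for label, p in probs.items())
        for probs in sentence_probs
    ]
    score = sum(per_sentence) / len(per_sentence)
    return max(-1.0, min(1.0, score))
```

Passing the label mapping as a parameter makes it easy to flip or re-weight once the scores have been checked against rate_change_bps in the same table.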

Gap: channel 17 intraday resolution.

Source options:

  • CBOE historical intraday — authoritative but requires a data subscription.
  • Derived from SPY realised vol — free, ~72 % correlation, acceptable when envelope-widening is the primary use.
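For reference, the SPY-derived option amounts to computing annualised realised volatility from hourly closes. A minimal sketch, assuming roughly 7 hourly bars per US session and 252 sessions per year (both figures are assumptions, as is the function name):

```python
import math
from typing import Sequence


def realised_vol(closes: Sequence[float], bars_per_year: float = 252 * 7) -> float:
    """Annualised realised volatility from consecutive closing prices."""
    if len(closes) < 3:
        raise ValueError("need at least three closes for a sample variance")
    # Log returns between consecutive bars.
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * bars_per_year)
```

The result is a volatility level, not a VIX value; it would still need to be z-scored the same way as the daily VIX channel before use.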

Recommendation: Skip for v1. Revisit if VIX channel shows high feature importance in the Phase 5 meta-learner.

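| Task | Tier | Owner | Target |
| --- | --- | --- | --- |
| Surprise z-score cache | P0 | Data infra | W1 D1 |
| FOMC Tier-2 proxy | P0 | ML | W1 D1 |
| DXY hourly backfill | P1 | Data infra | W1 D3 |
| Earnings expansion | P1 | Data infra | W2 |
| FOMC Tier-1 NLP | P3 | ML | Post-launch |
| VIX hourly | P3 | Data infra | Deferred |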

All P0 work lands before Phase 3 LoRA training starts — total 6 hours of blocking effort. P1 work runs in parallel with Phase 2 model modification.

Phase 3 training is GO when:

  1. event_surprise_stats table populated with ≥20 samples per primary event type (FOMC, CPI, NFP)
  2. FOMC Tier-2 proxy integrated into EventEncoder._fomc_hawkish_score()
  3. DXY hourly coverage ≥180 days (6 months) — or accept masked channel during coverage gaps

If (3) slips, train with DXY masked and retrofit when backfill completes. This keeps the critical path moving.
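The gate and its fallback can be expressed as a small check. This is a sketch of the decision logic only; `phase3_gate` and its inputs are hypothetical, and how the sample sizes and coverage days are obtained is left out.

```python
from typing import Mapping

PRIMARY_EVENTS = ("FOMC", "CPI", "NFP")
MIN_SAMPLES = 20
MIN_DXY_DAYS = 180


def phase3_gate(sample_sizes: Mapping[str, int],
                proxy_integrated: bool,
                dxy_days: int) -> tuple[bool, bool]:
    """Return (go, mask_dxy).

    Criteria 1 and 2 are blocking; criterion 3 only decides whether the
    DXY channel is trained masked.
    """
    stats_ok = all(sample_sizes.get(e, 0) >= MIN_SAMPLES for e in PRIMARY_EVENTS)
    go = stats_ok and proxy_integrated
    mask_dxy = dxy_days < MIN_DXY_DAYS
    return go, mask_dxy
```

For example, with all three event types at 20 or more samples, the proxy integrated, and only 35 days of DXY history, the result is `(True, True)`: training proceeds with DXY masked.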