Commit Graph

4 Commits

Author SHA1 Message Date
valknar 187ee2e312 fix: parse Wikipedia 12h time format and sort upcoming matches with NULLS LAST
Wikipedia stores match times as "6:00 p.m." (1-digit hour) which didn't
match the \d{2}:\d{2} regex, producing NULL for those matches. Introduced
parseTime12h() to handle 1-2 digit hours + AM/PM and convert to 24h.
Also sort upcomingMatches by NULLS LAST so unscheduled games appear after
timed ones rather than first. Dropped "openfootball" data attribution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:50:30 +02:00
valknar 42063cdfda fix: extend group heading regex from [a-h] to [a-z] for 2026 Groups I-L
2026 FIFA World Cup has 12 groups (A-L). The previous regex only matched A-H,
causing Groups I, J, K, L to fall through undetected and collapse into Group H.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:39:38 +02:00
valknar b832b62f5e fix: normalize Bosnia & Herzegovina and USA team name variants
Add TEAM_ALIASES to lib/wiki-scraper.ts applied at extraction time so both
scraper and sync consistently produce canonical names. Removes the duplicate
alias map from seed.ts in favour of the shared normalizeTeam() export.

Aliases added:
  Bosnia & Herzegovina  → Bosnia and Herzegovina
  USA                   → United States

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:33:05 +02:00
valknar f885e4312c refactor: extract lib/wiki-scraper.ts, make scraper composable, sync from Wikipedia
Move all scraping logic (fetchWikiHtml, scrapeYear, scrapeSquads and all
helpers) into lib/wiki-scraper.ts as exported functions shared by both scripts.

scrape-wikipedia.ts becomes a composable CLI:
  pnpm scrape [year]             — matches + squads (default)
  pnpm scrape [year] --matches   — matches/meta/stadiums only
  pnpm scrape [year] --squads    — squads only

sync.ts drops the openfootball GitHub dependency entirely and scrapes
Wikipedia directly. Incremental: completed groups (all matches have FT
scores) are detected via DB query and their sub-pages are skipped each run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:23:17 +02:00