- Detect Wikipedia's plain-text rate-limit response ("You are making too many
  requests") and wait 30s before retrying instead of failing silently
- Increase inter-attempt delay from 3s to 15s per attempt
- Increase group subpage delay from 1.2s to 3s, year delay from 0.6s to 2s
- Re-scrape 1982, 1998, 2002 and 2006, which had failed groups; all groups
  are now complete (e.g. 2002 now has 64 matches, including Group E with
  Germany/Klose)
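
A minimal sketch of the detection and retry behaviour described above. The
helper names (isRateLimited, fetchWithBackoff) and the injected fetchText and
sleep parameters are hypothetical, and a fixed 15s pause between failed
attempts is an assumption; only the marker text and the 30s/15s values come
from this commit.

```typescript
// Marker text and delays are from the commit message; everything else is assumed.
const RATE_LIMIT_MARKER = "You are making too many requests";
const RATE_LIMIT_WAIT_MS = 30_000; // pause after a rate-limit response
const RETRY_DELAY_MS = 15_000; // pause after any other failed attempt

type Fetcher = (url: string) => Promise<string>;
type Sleeper = (ms: number) => Promise<void>;

const defaultSleep: Sleeper = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// True when the body is the plain-text rate-limit notice rather than a page.
export function isRateLimited(body: string): boolean {
  return body.includes(RATE_LIMIT_MARKER);
}

// Retries up to maxAttempts times, rethrowing the last error if all fail.
export async function fetchWithBackoff(
  url: string,
  fetchText: Fetcher,
  maxAttempts = 3,
  sleep: Sleeper = defaultSleep,
): Promise<string> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const body = await fetchText(url);
      if (!isRateLimited(body)) return body;
      lastError = new Error(`rate limited on attempt ${attempt}`);
      if (attempt < maxAttempts) await sleep(RATE_LIMIT_WAIT_MS);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await sleep(RETRY_DELAY_MS);
    }
  }
  throw lastError;
}
```

Throwing after the final attempt, rather than returning an empty result, is
what keeps a rate-limited group from being dropped silently.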
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wikipedia abbreviates the goal scorer's display text (e.g. "Müller"), but the
<a title="Thomas Müller"> attribute always has the full name. Switch
parseGoals() to prefer the title attribute and strip disambiguation suffixes
such as "(soccer, born 1993)". This ensures Gerd Müller and Thomas Müller get
separate player pages.
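
The name selection could look roughly like the sketch below. The function
names stripDisambiguation and scorerName are hypothetical, and the real
parseGoals() may structure this differently; the sketch only assumes that a
disambiguation suffix is a single trailing parenthesised group.

```typescript
// Removes one trailing "(...)" group, e.g. "(soccer, born 1993)".
export function stripDisambiguation(title: string): string {
  return title.replace(/\s*\([^()]*\)\s*$/, "").trim();
}

// Prefers the link's title attribute (full name) over the abbreviated
// display text, falling back to the display text when no title exists.
export function scorerName(
  titleAttr: string | null | undefined,
  displayText: string,
): string {
  const title = titleAttr?.trim();
  return title ? stripDisambiguation(title) : displayText.trim();
}
```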
Also preserve the UTC offset from Wikipedia's ftime (e.g. "12:00 UTC-4")
so that isLive() can compute the kickoff time in UTC instead of treating
local time as UTC. upcomingMatches now sorts on the HH:MM portion,
extracted with SPLIT_PART, so the timezone suffix is ignored.
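
As an illustration of the offset arithmetic, here is a sketch that converts a
stored value such as "12:00 UTC-4" into a UTC instant. kickoffUtc is a
hypothetical helper, not the actual isLive() code, and the ISO date argument
and the accepted suffix formats (ASCII or Unicode minus, optional minutes)
are assumptions.

```typescript
// Converts a local kickoff time plus its UTC offset into a UTC Date.
// Returns null when either input does not match the expected shape.
export function kickoffUtc(isoDate: string, ftime: string): Date | null {
  const d = /^(\d{4})-(\d{2})-(\d{2})$/.exec(isoDate);
  const t = /^(\d{1,2}):(\d{2})(?:\s*UTC\s*([+\-\u2212])(\d{1,2})(?::(\d{2}))?)?$/.exec(
    ftime.trim(),
  );
  if (!d || !t) return null;

  const localMinutes = Number(t[1]) * 60 + Number(t[2]);

  // A missing suffix is treated as UTC+0.
  let offsetMinutes = 0;
  if (t[3]) {
    const magnitude = Number(t[4]) * 60 + Number(t[5] ?? 0);
    offsetMinutes = t[3] === "+" ? magnitude : -magnitude;
  }

  // UTC = local - offset; Date.UTC normalises minute overflow across days.
  return new Date(
    Date.UTC(Number(d[1]), Number(d[2]) - 1, Number(d[3]), 0, localMinutes - offsetMinutes),
  );
}
```

For example, kickoffUtc("2026-06-11", "12:00 UTC-4") yields 16:00 UTC on the
same day, while a 21:00 kickoff at UTC-4 rolls over to 01:00 UTC on the next
day.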
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wikipedia stores match times as "6:00 p.m." (1-digit hour) which didn't
match the \d{2}:\d{2} regex, producing NULL for those matches. Introduced
parseTime12h() to handle 1-2 digit hours + AM/PM and convert to 24h.
Also sort upcomingMatches with NULLS LAST so unscheduled games appear after
timed ones rather than first. Dropped the "openfootball" data attribution.
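
A sketch of what a 12-hour parser along these lines might look like. The
signature and the accepted spellings ("p.m.", "PM", "pm") are assumptions;
the actual parseTime12h() may differ.

```typescript
// Parses "6:00 p.m." style times (1-2 digit hour, AM/PM marker) and
// returns a zero-padded 24h "HH:MM" string, or null if nothing matches.
export function parseTime12h(text: string): string | null {
  const m = /(\d{1,2}):(\d{2})\s*([ap])\.?\s*m\.?/i.exec(text);
  if (!m) return null;

  let hour = Number(m[1]);
  const minute = Number(m[2]);
  if (hour < 1 || hour > 12 || minute > 59) return null;

  const isPm = (m[3] ?? "").toLowerCase() === "p";
  if (hour === 12) {
    hour = isPm ? 12 : 0; // 12 a.m. is midnight, 12 p.m. is noon
  } else if (isPm) {
    hour += 12;
  }

  return `${String(hour).padStart(2, "0")}:${String(minute).padStart(2, "0")}`;
}
```

Because \s in JavaScript regular expressions also matches a non-breaking
space, the pattern tolerates "6:00" and "p.m." being joined by one.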
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 2026 FIFA World Cup has 12 groups (A-L). The previous regex only matched A-H,
causing Groups I, J, K, L to fall through undetected and collapse into Group H.
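
For illustration, a widened matcher might look like the sketch below. The
pattern and the parseGroupLetter name are hypothetical; the regex actually
used by the scraper is not shown in this commit.

```typescript
// Matches "Group A" through "Group L"; the character class previously
// stopped at H under the old 8-group format.
const GROUP_HEADING = /\bGroup ([A-L])\b/;

// Returns the group letter, or null when the heading is not a group heading.
export function parseGroupLetter(heading: string): string | null {
  const m = GROUP_HEADING.exec(heading);
  return m?.[1] ?? null;
}
```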
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TEAM_ALIASES to lib/wiki-scraper.ts applied at extraction time so both
scraper and sync consistently produce canonical names. Removes the duplicate
alias map from seed.ts in favour of the shared normalizeTeam() export.
Aliases added:
  Bosnia & Herzegovina → Bosnia and Herzegovina
  USA → United States
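
A sketch of how the shared alias lookup could work. The two entries are the
ones listed above; representing TEAM_ALIASES as a Map and trimming the input
are assumptions, and the real export in lib/wiki-scraper.ts may be shaped
differently.

```typescript
// Alias -> canonical name. Only these two entries come from the commit.
export const TEAM_ALIASES: ReadonlyMap<string, string> = new Map([
  ["Bosnia & Herzegovina", "Bosnia and Herzegovina"],
  ["USA", "United States"],
]);

// Returns the canonical team name, or the trimmed input when no alias exists.
export function normalizeTeam(name: string): string {
  const trimmed = name.trim();
  return TEAM_ALIASES.get(trimmed) ?? trimmed;
}
```

Applying the lookup once at extraction time means every downstream consumer
sees the same canonical name without keeping its own alias map.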
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all scraping logic (fetchWikiHtml, scrapeYear, scrapeSquads and all
helpers) into lib/wiki-scraper.ts as exported functions shared by both scripts.
scrape-wikipedia.ts becomes a composable CLI:
  pnpm scrape [year]             matches + squads (default)
  pnpm scrape [year] --matches   matches/meta/stadiums only
  pnpm scrape [year] --squads    squads only
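
The flag handling for these three invocations might be sketched as follows.
parseScrapeArgs and ScrapeOptions are hypothetical names, and treating both
flags together as "scrape everything" is an assumption.

```typescript
export interface ScrapeOptions {
  year: number | null; // null when no year argument was given
  matches: boolean;
  squads: boolean;
}

// Interprets the arguments after "pnpm scrape": an optional 4-digit year
// plus optional --matches / --squads selectors. No selector means both.
export function parseScrapeArgs(argv: string[]): ScrapeOptions {
  const flags = new Set(argv.filter((arg) => arg.startsWith("--")));
  const yearArg = argv.find((arg) => /^\d{4}$/.test(arg));

  const wantsMatches = flags.has("--matches");
  const wantsSquads = flags.has("--squads");
  const neither = !wantsMatches && !wantsSquads;

  return {
    year: yearArg ? Number(yearArg) : null,
    matches: wantsMatches || neither,
    squads: wantsSquads || neither,
  };
}
```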
sync.ts drops the openfootball GitHub dependency entirely and scrapes
Wikipedia directly. Incremental: completed groups (all matches have FT
scores) are detected via DB query and their sub-pages are skipped each run.
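
The completeness rule can be illustrated in memory as below. The commit
performs this check with a DB query, so the MatchRow shape, the
completedGroups name, and the use of null scores to mean "no FT score yet"
are all assumptions made for the sketch.

```typescript
export interface MatchRow {
  group: string | null; // null for knockout matches
  homeScore: number | null; // null until a full-time score exists
  awayScore: number | null;
}

// A group counts as complete only if every one of its matches has an FT score.
export function completedGroups(rows: MatchRow[]): Set<string> {
  const status = new Map<string, boolean>();
  for (const row of rows) {
    if (row.group === null) continue;
    const finished = row.homeScore !== null && row.awayScore !== null;
    status.set(row.group, (status.get(row.group) ?? true) && finished);
  }

  const done = new Set<string>();
  status.forEach((isDone, group) => {
    if (isDone) done.add(group);
  });
  return done;
}
```

A sync run would then skip the sub-page fetch for any group letter present
in the returned set.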
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>