World Cup
A full-stack World Cup statistics web app covering every tournament from 1930 to 2026. Built with Next.js 16, TailwindCSS 4, GraphQL, and PostgreSQL. Historical data is scraped from English Wikipedia and committed to the repo; live 2026 results are synced from Wikipedia on a schedule so scores appear within minutes of the final whistle.
Features
- Live 2026 matches — detected automatically when today's date matches a scheduled fixture; Apollo polls every 60 seconds for score updates
- All-time statistics — goals, hat-tricks, biggest wins, highest-scoring games, penalty stats, goals-by-minute heatmap, confederation performance, title counts
- Group standings — computed from match results for every tournament, with 0-row entries seeded so all groups appear even before any matches are played
- Deep-linked pages — every tournament, team, and player has a permanent URL (`/tournaments/1966`, `/teams/brazil`, `/players/Pelé`) with server-side metadata for SEO
- Full-text search — across teams, tournaments, and players
- Squad data — 26-man rosters for 2026 with position, shirt number, and date of birth
- Qualification playoffs — 2026 inter-confederation playoff results stored separately
- Country flags — via `flag-icons` CSS classes, ~200 nations covered
- Dark pitch aesthetic — Bebas Neue headings, Space Grotesk body, green-on-black design
Pages
| Route | Content |
|---|---|
| `/` | Home: live matches, stat pills, latest result, upcoming fixtures, Golden Boot race |
| `/groups` | All 12 group tables for 2026 (P/W/D/L/GD/Pts) with results and upcoming fixtures |
| `/stats` | Historical stats: goals chart, top scorers, hat-tricks, biggest wins, goals by minute, ET/shootout stats, confederation stats |
| `/history` | All 23 tournament cards (1930 to 2026) newest-first, each with host, winner, top scorer |
| `/search?q=…` | Full-text search across teams, players, tournaments |
| `/tournaments/[year]` | Tournament detail: group stage with standings + matches, knockout rounds, scorer sidebar |
| `/teams/[slug]` | Team profile: all-time record, top scorers, WC appearances |
| `/players/[name]` | Player profile: goals by tournament, penalties vs open play breakdown |
Tech stack
| Layer | Technology |
|---|---|
| Framework | Next.js 16.2 (App Router, standalone output) |
| Styling | TailwindCSS 4 (CSS-first @theme config) |
| GraphQL server | GraphQL Yoga in /api/graphql Next.js route |
| GraphQL client | Apollo Client 4 with 60 s poll for live matches |
| ORM | Drizzle ORM with postgres driver |
| Database | PostgreSQL 16 |
| Flags | flag-icons npm package |
| Fonts | Bebas Neue + Space Grotesk (Google Fonts) |
| Container | Docker multi-stage build, Traefik-compatible |
Data pipeline
Data flows through three scripts that are run at different times and for different purposes.
1. Scrape — one-time developer task
```bash
pnpm scrape                  # all years (1930–2022), matches + squads
pnpm scrape 2002             # single year
pnpm scrape 2002 --matches   # matches, meta, stadiums, groups only
pnpm scrape 2002 --squads    # squads only
```
Fetches structured match data from English Wikipedia using the MediaWiki parse API and writes JSON files to data/{year}/. These files are committed to git so the production build never needs to hit Wikipedia for historical data.
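As a rough illustration of what a request to the MediaWiki parse API looks like, the sketch below builds the URL for one tournament page. The `prop`, `format`, and `formatversion` values are assumptions about what the scraper asks for, and `buildParseUrl` and `outputPath` are hypothetical helper names, not functions taken from `lib/wiki-scraper.ts`.

```typescript
// Sketch only: parameter choices below are assumptions, not the scraper's confirmed settings.
const WIKI_API = "https://en.wikipedia.org/w/api.php";

// Build a MediaWiki "parse" API URL for an English Wikipedia page title.
function buildParseUrl(page: string): string {
  const params = new URLSearchParams({
    action: "parse",      // MediaWiki parse module
    page,                 // page title, e.g. "2002_FIFA_World_Cup"
    prop: "text",         // rendered HTML of the page (assumed)
    format: "json",
    formatversion: "2",   // flatter JSON response shape (assumed)
  });
  return `${WIKI_API}?${params.toString()}`;
}

// Hypothetical helper mirroring the data/{year}/ output layout.
function outputPath(year: number, file: string): string {
  return `data/${year}/${file}`;
}
```

For example, `buildParseUrl("2002_FIFA_World_Cup")` yields a URL whose JSON response would be parsed and written to `outputPath(2002, "worldcup.json")`.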
Each year produces up to five files:
| File | Content |
|---|---|
| `worldcup.json` | Matches with scores (FT/HT/ET/P) and goal-scorer events |
| `worldcup.meta.json` | Tournament metadata: host, winner, runner-up, team count |
| `worldcup.stadiums.json` | Stadium names and cities |
| `worldcup.groups.json` | Group compositions (teams per group) |
| `worldcup.squads.json` | Player rosters (where available on Wikipedia) |
The scraper has built-in rate-limit handling: it detects Wikipedia's plain-text "You are making too many requests" response, waits 30 seconds, and retries with an increasing back-off of 15 s × attempt number (linear growth, not exponential), up to 6 attempts per page. Group sub-pages are fetched with a 3-second delay between requests.
2. Seed — initial database population
```bash
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm seed
DATABASE_URL="..." pnpm seed --force   # drop and re-seed from scratch
```
Reads the committed data/{year}/ JSON files and loads them into the database. Also creates all tables (if they do not exist). Intended for first-time setup and for re-seeding after schema changes. Covers 1930–2022 only — 2026 data is handled by sync.
Seed is idempotent and skips silently if data is already present (unless --force is passed).
3. Sync — scheduled live updates (2026 only)
```bash
DATABASE_URL="..." pnpm sync           # normal run
DATABASE_URL="..." pnpm sync --force   # clear and re-fetch all 2026 data
```
Fetches the current state of the 2026 Wikipedia pages and upserts everything into the database. Historical years (1930–2022) are not touched — they come from the committed JSON files via seed.
What sync does on each run:
- Fetches `2026_FIFA_World_Cup` via the MediaWiki API
- Determines which groups are fully complete (all matches have FT scores) and skips their sub-pages to save requests
- Upserts matches, scores, and goal events
- Fetches `2026_FIFA_World_Cup_squads` and upserts squad rosters
- Recomputes group standings from match results
- Seeds 0-row standing entries for groups with no played matches yet (so all groups appear in the UI)
- Updates tournament aggregates (total goals, matches played, avg goals/game)
Sync is designed to run on a 10-minute cron in production. Each run is safe to repeat — all writes use ON CONFLICT DO UPDATE.
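The incremental group detection step in the list above can be sketched as a pure function. This is a minimal illustration, assuming each match row carries a nullable group name and nullable full-time scores. The `GroupMatch` shape and `completedGroups` name are hypothetical and not taken from `scripts/sync.ts`.

```typescript
// Hypothetical row shape; field names are assumptions for illustration.
interface GroupMatch {
  groupName: string | null;    // null for knockout matches
  scoreFtHome: number | null;  // null until a full-time score exists
  scoreFtAway: number | null;
}

// Returns the names of groups in which every match already has a full-time score.
function completedGroups(matches: GroupMatch[]): Set<string> {
  const total = new Map<string, number>();
  const played = new Map<string, number>();

  for (const m of matches) {
    if (m.groupName === null) continue; // ignore knockout matches
    total.set(m.groupName, (total.get(m.groupName) ?? 0) + 1);
    if (m.scoreFtHome !== null && m.scoreFtAway !== null) {
      played.set(m.groupName, (played.get(m.groupName) ?? 0) + 1);
    }
  }

  const done = new Set<string>();
  total.forEach((count, group) => {
    if (played.get(group) === count) done.add(group);
  });
  return done;
}
```

A sync run could then skip fetching the sub-page of any group contained in the returned set.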
Database schema
```text
tournaments      year PK, host, winner, runner_up, third_place, fourth_place,
                 teams_count, matches_count, total_goals, avg_goals_per_game

teams            id, name UNIQUE, iso2, fifa_code, continent, confederation

stadiums         id, tournament_year FK, name, city, country_code,
                 capacity, timezone, coordinates

matches          id, tournament_year FK, round, group_name, date, time_local,
                 stadium_id FK, team1_id FK, team2_id FK,
                 score_ft_home, score_ft_away,
                 score_ht_home, score_ht_away,
                 score_et_home, score_et_away,
                 score_p_home, score_p_away,
                 is_quali_playoff

goals            id, match_id FK, team_id FK, player_name,
                 minute, minute_offset, is_penalty, is_own_goal

group_standings  tournament_year FK, group_name, team_id FK,
                 pos, played, won, drawn, lost,
                 goals_for, goals_against, goal_diff, pts

squads           id, tournament_year FK, team_id FK, player_name,
                 shirt_number, position, date_of_birth
```
Local development
Prerequisites: Node.js 22+, pnpm 10+, Docker
```bash
# 1. Clone and install
git clone <repo-url> worldcup
cd worldcup
pnpm install

# 2. Start the database
docker compose -f docker-compose.dev.yml up -d

# 3. Seed historical data (1930–2022) from committed JSON files
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm seed

# 4. Sync 2026 data from Wikipedia
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm sync

# 5. Start the dev server
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm dev
```
Open http://localhost:3000.
To stop the database: docker compose -f docker-compose.dev.yml down
If you need to re-scrape historical data (e.g. after a Wikipedia article correction):
```bash
pnpm scrape 2002   # re-scrape a single year
git add data/2002/ && git commit -m "chore: refresh 2002 scraped data"
```
Environment variables
| Variable | Required | Description |
|---|---|---|
| `DATABASE_URL` | Yes | PostgreSQL connection string |
| `NEXT_PUBLIC_SITE_URL` | Production | Public base URL, e.g. `https://worldcup.example.com` — used for sitemap and OG metadata |
| `DB_PASSWORD` | Production | Password for the `wc` DB user (used by `docker-compose.yml`) |
| `TRAEFIK_ENABLED` | Production | Set to `true` to activate Traefik router labels |
| `TRAEFIK_HOST` | Production | Public hostname, e.g. `worldcup.example.com` |
| `NETWORK_NAME` | Production | Name of the external Docker network Traefik is attached to |
| `UMAMI_ID` | Optional | Umami analytics site ID |
| `UMAMI_SRC` | Optional | Umami analytics script URL |
Copy .env.example to .env and fill in the values before deploying.
Deployment (Coolify + Traefik)
The app is designed for self-hosted deployment via Coolify behind a Traefik reverse proxy.
1. Configure environment
In Coolify's environment variable editor set:
```
DB_PASSWORD=<strong-random-password>
DATABASE_URL=postgres://wc:<DB_PASSWORD>@db:5432/worldcup
NEXT_PUBLIC_SITE_URL=https://worldcup.yourdomain.com
TRAEFIK_ENABLED=true
TRAEFIK_HOST=worldcup.yourdomain.com
NETWORK_NAME=<your-traefik-network-name>
```
2. Deploy
Coolify builds the Docker image via `docker compose up` and attaches the container to the Traefik network automatically. TLS certificates are issued by the `resolver` cert resolver configured in Traefik.
3. Initial data load
After the first deployment, seed historical data and then sync 2026:
```bash
# In Coolify's terminal for the app container:
pnpm seed   # loads 1930–2022 from committed JSON files
pnpm sync   # fetches 2026 from Wikipedia
```
4. Scheduled sync (live updates)
In Coolify → your service → Scheduled Tasks, add:
| Field | Value |
|---|---|
| Command | pnpm sync |
| Schedule | */10 * * * * |
| Container | app |
This re-syncs 2026 from Wikipedia every 10 minutes. New match results appear within 10 minutes of the final whistle.
Project structure
```text
worldcup/
├── app/
│   ├── layout.tsx                  # Root layout: nav, fonts, Apollo provider, global metadata
│   ├── robots.ts                   # robots.txt (Next.js convention)
│   ├── sitemap.ts                  # sitemap.xml — dynamic, rendered at request time
│   ├── page.tsx                    # Home — server wrapper (exports metadata)
│   ├── client.tsx                  # Home — Apollo/interactive client component
│   ├── groups/
│   │   ├── page.tsx                # Groups — server wrapper
│   │   └── client.tsx              # Groups — client component
│   ├── stats/page.tsx + client.tsx
│   ├── history/page.tsx + client.tsx
│   ├── search/page.tsx + client.tsx
│   ├── tournaments/[year]/
│   │   ├── page.tsx                # generateMetadata fetches tournament from DB
│   │   └── client.tsx              # Tournament detail, group standings, bracket
│   ├── teams/[slug]/page.tsx + client.tsx
│   ├── players/[name]/page.tsx + client.tsx
│   └── api/graphql/route.ts        # GraphQL Yoga endpoint
├── components/
│   ├── apollo-provider.tsx         # Apollo Client provider wrapper
│   ├── nav.tsx                     # Top navigation bar
│   ├── team-flag.tsx               # flag-icons wrapper component
│   ├── match-card.tsx              # Match result / fixture card
│   └── live-badge.tsx              # Pulsing LIVE indicator
├── lib/
│   ├── db/
│   │   ├── schema.ts               # Drizzle table definitions
│   │   └── index.ts                # DB connection singleton
│   ├── graphql/
│   │   ├── schema.ts               # GraphQL SDL
│   │   ├── resolvers/index.ts      # All resolvers
│   │   ├── hooks.ts                # Apollo v4 useQuery wrapper
│   │   └── client.ts               # Apollo Client factory
│   ├── wiki-scraper.ts             # Wikipedia HTML parser (cheerio), rate-limit retry
│   └── iso-codes.ts                # Team name → ISO2 country code map
├── scripts/
│   ├── scrape-wikipedia.ts         # Developer-only: scrape Wikipedia → data/{year}/
│   ├── seed.ts                     # Initial DB load from data/{year}/ JSON files
│   └── sync.ts                     # Scheduled: sync 2026 live data from Wikipedia
├── data/
│   ├── 1930/ … 2022/               # Committed Wikipedia scrape output (per-year JSON)
│   └── {year}/
│       ├── worldcup.json           # Matches + goals
│       ├── worldcup.meta.json      # Tournament metadata
│       ├── worldcup.stadiums.json  # Stadiums
│       ├── worldcup.groups.json    # Group compositions
│       └── worldcup.squads.json    # Squad rosters (where available)
├── docker-compose.yml              # Production (Traefik + external network)
├── docker-compose.dev.yml          # Local dev (DB only, port 5432 exposed)
├── Dockerfile                      # Multi-stage pnpm build
├── .env.example                    # Environment variable template
├── next.config.ts                  # standalone output, serverExternalPackages
├── drizzle.config.ts               # Drizzle Kit config
└── tsconfig.json
```
Architecture notes
Live match detection — A match is considered live when its date equals today and the current time falls between 5 minutes before kick-off and 125 minutes after it. Kick-off times are stored as "HH:MM UTC±N" strings; the resolver computes the UTC timestamp at query time using PostgreSQL interval arithmetic. Apollo's pollInterval: 60_000 re-queries liveMatches and recentMatches every minute.
UTC kickoff ordering — Both upcomingMatches (ascending) and recentMatches (descending) sort by computed UTC kickoff time using a CASE expression that parses the time_local string and subtracts the UTC offset as an interval. This keeps ordering correct across time zones: matches are ranked by the instant they actually kick off, not by their local clock time or database ID.
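Both notes above depend on turning a date plus a "HH:MM UTC±N" string into a single UTC instant. The resolver does this in SQL. The TypeScript below is only a sketch of the same arithmetic, assuming dates are stored as `YYYY-MM-DD` and that the offset sign may be an ASCII hyphen or a Unicode minus. The `Fixture` shape and the function names are hypothetical.

```typescript
// Hypothetical fixture shape; the date format "YYYY-MM-DD" is an assumption.
interface Fixture {
  id: number;
  date: string;       // e.g. "2026-06-11"
  timeLocal: string;  // e.g. "13:00 UTC-6"
}

const MINUTE_MS = 60_000;

// Convert local kick-off time plus UTC offset into epoch milliseconds (UTC).
// Returns null when either string does not match the expected format.
function kickoffUtcMs(date: string, timeLocal: string): number | null {
  const t = /^(\d{1,2}):(\d{2})\s+UTC([+\-\u2212])(\d{1,2})(?::(\d{2}))?$/.exec(timeLocal.trim());
  const d = /^(\d{4})-(\d{2})-(\d{2})$/.exec(date);
  if (!t || !d) return null;
  const sign = t[3] === "+" ? 1 : -1;
  const offsetMin = sign * (Number(t[4]) * 60 + Number(t[5] ?? 0));
  const localAsUtc = Date.UTC(Number(d[1]), Number(d[2]) - 1, Number(d[3]), Number(t[1]), Number(t[2]));
  return localAsUtc - offsetMin * MINUTE_MS; // UTC = local time minus offset
}

// Live window from the text: 5 minutes before kick-off to 125 minutes after.
function isLive(f: Fixture, nowMs: number): boolean {
  const k = kickoffUtcMs(f.date, f.timeLocal);
  return k !== null && nowMs >= k - 5 * MINUTE_MS && nowMs <= k + 125 * MINUTE_MS;
}

// Sort by UTC kick-off; unparseable times go last, ties fall back to id.
function sortByKickoff(fixtures: Fixture[], direction: "asc" | "desc"): Fixture[] {
  const factor = direction === "asc" ? 1 : -1;
  return [...fixtures].sort((a, b) => {
    const ka = kickoffUtcMs(a.date, a.timeLocal);
    const kb = kickoffUtcMs(b.date, b.timeLocal);
    if (ka === null && kb === null) return a.id - b.id;
    if (ka === null) return 1;
    if (kb === null) return -1;
    return ka !== kb ? factor * (ka - kb) : a.id - b.id;
  });
}
```

With this arithmetic, an 18:00 kick-off at UTC-7 sorts after a 20:00 kick-off at UTC-4 on the same date, because the second match starts an hour earlier in UTC.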
Server/client split — All pages use a server wrapper page.tsx that exports metadata (or generateMetadata) and a client.tsx that contains the Apollo query and interactive rendering. This lets Next.js generate accurate <title>, OpenGraph, and Twitter card tags for each route without requiring server-side data fetching in client components.
NEXT_PUBLIC_SITE_URL — The public hostname is read from this environment variable in sitemap.ts, robots.ts, and layout.tsx (metadataBase). All per-page openGraph.url values use relative paths (/groups, /tournaments/2026, etc.) which Next.js resolves against metadataBase automatically. The sitemap is marked export const dynamic = 'force-dynamic' so it runs at request time when the database is reachable, not at build time.
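To make the URL handling concrete, here is a minimal sketch of how sitemap entries could be assembled from the base URL. It is written as pure functions so it runs outside Next.js. The localhost fallback, the list of static paths, and the `siteUrl` and `buildSitemap` names are assumptions. In the app, something like this would sit behind the default export of `app/sitemap.ts`, with the years and slugs read from the database.

```typescript
// Shape modelled on a Next.js sitemap entry (url + lastModified).
interface SitemapEntry {
  url: string;
  lastModified: Date;
}

// Read the public base URL; the localhost fallback is an assumption for local dev.
function siteUrl(env: Record<string, string | undefined>): string {
  return (env.NEXT_PUBLIC_SITE_URL ?? "http://localhost:3000").replace(/\/+$/, "");
}

// Hypothetical helper: combine static routes with per-tournament and per-team routes.
function buildSitemap(base: string, years: number[], teamSlugs: string[], now: Date): SitemapEntry[] {
  const staticPaths = ["/", "/groups", "/stats", "/history"];
  const paths = [
    ...staticPaths,
    ...years.map((y) => `/tournaments/${y}`),
    ...teamSlugs.map((s) => `/teams/${encodeURIComponent(s)}`),
  ];
  // Resolve each relative path against the base, as metadataBase does for page URLs.
  return paths.map((p) => ({ url: new URL(p, base).toString(), lastModified: now }));
}
```

Because the years and slugs come from the database, generating this list at request time avoids a build-time dependency on a reachable database.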
Apollo Client v4 — This project uses Apollo Client 4 which moved hooks to @apollo/client/react and core utilities to @apollo/client/core. A thin wrapper in lib/graphql/hooks.ts re-exports useQuery typed as Record<string, any> to avoid the v4 TData = {} default breaking all field accesses.
Standalone Docker output — next.config.ts sets output: 'standalone' which produces a self-contained server.js. The scripts/, lib/, and data/ directories are copied separately into the runner stage so pnpm seed and pnpm sync work inside the container without needing a full Node/TypeScript toolchain reinstall.
Group standings — Standings are computed live from match results via a SQL GROUP BY query in the groupStandings resolver. After each sync, 0-row standing entries are inserted for all teams in all 2026 groups, ensuring every group appears in the UI even before its first match is played.
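The resolver computes standings in SQL. The sketch below shows the same idea in TypeScript, including the zero rows for teams that have not played yet. It assumes 3 points for a win and 1 for a draw, and a simplified tie-break (points, goal difference, goals scored, then name) that may differ from the official rules and from the actual query. All type and function names are hypothetical.

```typescript
interface PlayedMatch {
  groupName: string;
  team1: string;
  team2: string;
  scoreFtHome: number | null; // null until the match has a full-time score
  scoreFtAway: number | null;
}

interface StandingRow {
  groupName: string; team: string;
  played: number; won: number; drawn: number; lost: number;
  goalsFor: number; goalsAgainst: number; goalDiff: number; pts: number;
}

function emptyRow(groupName: string, team: string): StandingRow {
  return { groupName, team, played: 0, won: 0, drawn: 0, lost: 0,
           goalsFor: 0, goalsAgainst: 0, goalDiff: 0, pts: 0 };
}

// Assumed scoring: 3 points for a win, 1 for a draw.
function applyResult(row: StandingRow, scored: number, conceded: number): void {
  row.played += 1;
  row.goalsFor += scored;
  row.goalsAgainst += conceded;
  row.goalDiff = row.goalsFor - row.goalsAgainst;
  if (scored > conceded) { row.won += 1; row.pts += 3; }
  else if (scored === conceded) { row.drawn += 1; row.pts += 1; }
  else { row.lost += 1; }
}

function computeStandings(groups: Record<string, string[]>, matches: PlayedMatch[]): StandingRow[] {
  const rows = new Map<string, StandingRow>();
  // Seed a zero row for every team so groups appear before any match is played.
  for (const groupName of Object.keys(groups)) {
    for (const team of groups[groupName]) {
      rows.set(`${groupName}|${team}`, emptyRow(groupName, team));
    }
  }
  for (const m of matches) {
    if (m.scoreFtHome === null || m.scoreFtAway === null) continue; // not played yet
    const home = rows.get(`${m.groupName}|${m.team1}`);
    const away = rows.get(`${m.groupName}|${m.team2}`);
    if (!home || !away) continue; // team not listed in this group
    applyResult(home, m.scoreFtHome, m.scoreFtAway);
    applyResult(away, m.scoreFtAway, m.scoreFtHome);
  }
  // Simplified tie-break, see the assumptions in the lead-in.
  return Array.from(rows.values()).sort((a, b) =>
    a.groupName.localeCompare(b.groupName) || b.pts - a.pts ||
    b.goalDiff - a.goalDiff || b.goalsFor - a.goalsFor || a.team.localeCompare(b.team));
}
```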
Wikipedia scraper rate limits — The MediaWiki API occasionally returns a plain-text "You are making too many requests to the API" response instead of JSON. The scraper detects this by reading the response as text first and parsing JSON only if the body does not start with that phrase. On a rate-limit response (or HTTP 429), it waits 30 seconds before retrying. The retry delay then grows linearly with the attempt number (15 s × attempt), up to 6 attempts per page.
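The retry logic described above might look roughly like the following. This is a sketch, not the code in `lib/wiki-scraper.ts`. The text does not spell out how the 30-second wait and the 15 s × attempt delay combine, so the sketch assumes one possible reading: wait the larger of the two. The `fetcher` and `sleep` parameters are injected so the function can be exercised without network access or real delays.

```typescript
// Minimal response shape so the sketch does not depend on a specific fetch typing.
type TextResponse = { status: number; text(): Promise<string> };
type Fetcher = (url: string) => Promise<TextResponse>;

const RATE_LIMIT_PREFIX = "You are making too many requests";
const MAX_ATTEMPTS = 6;

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumed combination of the two delays from the text: at least 30 s, else 15 s x attempt.
function retryDelayMs(attempt: number): number {
  return Math.max(30_000, 15_000 * attempt);
}

async function fetchWikiJson(
  url: string,
  fetcher: Fetcher,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<unknown> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const res = await fetcher(url);
    const body = await res.text(); // read as text first; the rate-limit reply is not JSON
    const limited = res.status === 429 || body.trim().startsWith(RATE_LIMIT_PREFIX);
    if (!limited) return JSON.parse(body);
    if (attempt < MAX_ATTEMPTS) await sleep(retryDelayMs(attempt));
  }
  throw new Error(`Rate-limited after ${MAX_ATTEMPTS} attempts: ${url}`);
}
```

In a Node 22 script the global `fetch` can be passed directly as the fetcher, for example `await fetchWikiJson(url, (u) => fetch(u))`.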
GraphQL API
The GraphQL playground is available at /api/graphql in development.
Key queries:
```graphql
# Live matches right now
{ liveMatches { id date time team1 { name } team2 { name } scoreFt isLive } }

# All-time top scorers
{ topScorers(limit: 10) { playerName goals penalties team { name iso2 } } }

# 2026 group standings
{ groupStandings(year: 2026) { groupName pos team { name iso2 } played won drawn lost goalsFor goalsAgainst pts } }

# Tournament detail
{ tournament(year: 2022) { year host winner totalGoals avgGoalsPerGame } }

# Team stats
{ team(slug: "brazil") { name stats { appearances wins losses titles goalsFor } } }

# Full-text search
{ search(query: "Ronaldo") { teams { name } players { playerName goals } } }

# Hat-tricks in World Cup history
{ hatTricks { playerName goals year round team { name } opponent { name } } }

# Global stats
{ tournamentStats { totalTournaments totalMatches totalGoals avgGoalsPerGame } }
```
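Outside the playground, these queries can be sent as an ordinary JSON POST. The sketch below only builds the request. `graphqlRequest` is a hypothetical helper, and the base URL assumes the local dev server from the setup steps.

```typescript
interface GraphQLRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build a standard GraphQL-over-HTTP POST request for the /api/graphql endpoint.
function graphqlRequest(
  baseUrl: string,
  query: string,
  variables: Record<string, unknown> = {},
): GraphQLRequest {
  return {
    url: new URL("/api/graphql", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ query, variables }),
    },
  };
}

// Example query taken from the list above.
const TOP_SCORERS = "{ topScorers(limit: 10) { playerName goals penalties team { name iso2 } } }";
const topScorersRequest = graphqlRequest("http://localhost:3000", TOP_SCORERS);
```

The result can be passed straight to `fetch(topScorersRequest.url, topScorersRequest.init)`, and the response body parsed with `res.json()`.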