valknar 1ebe4613ce docs: rewrite README with accurate data pipeline documentation
- Replace openfootball references with Wikipedia scraper workflow
- Document all three scripts: scrape (dev), seed (init), sync (scheduled)
- Explain rate-limit handling, incremental group detection, UTC kickoff ordering
- Add NEXT_PUBLIC_SITE_URL to env vars table
- Update project structure with data/, client.tsx pattern, wiki-scraper.ts
- Add architecture notes for server/client split, dynamic sitemap, standings seeding

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-16 07:50:12 +02:00

World Cup

A full-stack World Cup statistics web app covering every tournament from 1930 to 2026. Built with Next.js 16, TailwindCSS 4, GraphQL, and PostgreSQL. Historical data is scraped from English Wikipedia and committed to the repo; live 2026 results are synced from Wikipedia on a schedule so scores appear within minutes of the final whistle.

Features

  • Live 2026 matches — detected automatically when today's date matches a scheduled fixture; Apollo polls every 60 seconds for score updates
  • All-time statistics — goals, hat-tricks, biggest wins, highest-scoring games, penalty stats, goals-by-minute heatmap, confederation performance, title counts
  • Group standings — computed from match results for every tournament, with 0-row entries seeded so all groups appear even before any matches are played
  • Deep-linked pages — every tournament, team, and player has a permanent URL (/tournaments/1966, /teams/brazil, /players/Pelé) with server-side metadata for SEO
  • Full-text search — across teams, tournaments, and players
  • Squad data — 26-man rosters for 2026 with position, shirt number, and date of birth
  • Qualification playoffs — 2026 inter-confederation playoff results stored separately
  • Country flags — via flag-icons CSS classes, ~200 nations covered
  • Dark pitch aesthetic — Bebas Neue headings, Space Grotesk body, green-on-black design

Pages

Route                 Content
/                     Home: live matches, stat pills, latest result, upcoming fixtures, Golden Boot race
/groups               All 12 group tables for 2026 (P/W/D/L/GD/Pts) with results and upcoming fixtures
/stats                Historical stats: goals chart, top scorers, hat-tricks, biggest wins, goals by minute, ET/shootout stats, confederation stats
/history              All 23 tournament cards (1930 to 2026) newest-first, each with host, winner, top scorer
/search?q=…           Full-text search across teams, players, tournaments
/tournaments/[year]   Tournament detail: group stage with standings + matches, knockout rounds, scorer sidebar
/teams/[slug]         Team profile: all-time record, top scorers, WC appearances
/players/[name]       Player profile: goals by tournament, penalties vs open play breakdown

Tech stack

Layer           Technology
Framework       Next.js 16.2 (App Router, standalone output)
Styling         TailwindCSS 4 (CSS-first @theme config)
GraphQL server  GraphQL Yoga in /api/graphql Next.js route
GraphQL client  Apollo Client 4 with 60 s poll for live matches
ORM             Drizzle ORM with postgres driver
Database        PostgreSQL 16
Flags           flag-icons npm package
Fonts           Bebas Neue + Space Grotesk (Google Fonts)
Container       Docker multi-stage build, Traefik-compatible

Data pipeline

Data flows through three scripts that are run at different times and for different purposes.

1. Scrape — one-time developer task

pnpm scrape                   # all years (1930–2022), matches + squads
pnpm scrape 2002              # single year
pnpm scrape 2002 --matches    # matches, meta, stadiums, groups only
pnpm scrape 2002 --squads     # squads only

Fetches structured match data from English Wikipedia using the MediaWiki parse API and writes JSON files to data/{year}/. These files are committed to git so the production build never needs to hit Wikipedia for historical data.

Each year produces up to five files:

File                    Content
worldcup.json           Matches with scores (FT/HT/ET/P) and goal-scorer events
worldcup.meta.json      Tournament metadata: host, winner, runner-up, team count
worldcup.stadiums.json  Stadium names and cities
worldcup.groups.json    Group compositions (teams per group)
worldcup.squads.json    Player rosters (where available on Wikipedia)

The scraper has built-in rate-limit handling: it detects Wikipedia's plain-text "You are making too many requests" response (or HTTP 429), waits 30 seconds, and retries with a linearly increasing back-off (15 s × attempt number between retries, up to 6 attempts). Group sub-pages are fetched with a 3-second delay between requests.

2. Seed — initial database population

DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm seed
DATABASE_URL="..." pnpm seed --force   # drop and re-seed from scratch

Reads the committed data/{year}/ JSON files and loads them into the database, creating all tables first if they do not exist. Intended for first-time setup and for re-seeding after schema changes. Covers 1930–2022 only; 2026 data is handled by sync.

Seed is idempotent and skips silently if data is already present (unless --force is passed).

3. Sync — scheduled live updates (2026 only)

DATABASE_URL="..." pnpm sync           # normal run
DATABASE_URL="..." pnpm sync --force   # clear and re-fetch all 2026 data

Fetches the current state of the 2026 Wikipedia pages and upserts everything into the database. Historical years (1930–2022) are not touched; they come from the committed JSON files via seed.

What sync does on each run:

  1. Fetches 2026_FIFA_World_Cup via the MediaWiki API
  2. Determines which groups are fully complete (all matches have FT scores) and skips their sub-pages to save requests
  3. Upserts matches, scores, and goal events
  4. Fetches 2026_FIFA_World_Cup_squads and upserts squad rosters
  5. Recomputes group standings from match results
  6. Seeds 0-row standing entries for groups with no played matches yet (so all groups appear in the UI)
  7. Updates tournament aggregates (total goals, matches played, avg goals/game)

Sync is designed to run on a 10-minute cron in production. Each run is safe to repeat — all writes use ON CONFLICT DO UPDATE.
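Step 2 of the list above (skipping groups whose matches all have full-time scores) could look roughly like the sketch below. The GroupMatch shape and the groupsToFetch name are hypothetical and not taken from scripts/sync.ts; the sketch also assumes the fixture list already contains every group match, with null scores for unplayed ones.

```typescript
// Hypothetical shape: one row per group-stage match as parsed from the main page.
type GroupMatch = {
  group: string;              // e.g. "A"
  scoreFtHome: number | null; // null until a full-time score is known
  scoreFtAway: number | null;
};

// Returns the groups that still have at least one match without an FT score.
// Only these groups would need their sub-pages fetched on this run.
function groupsToFetch(matches: GroupMatch[]): string[] {
  const pending = new Set<string>();
  for (const m of matches) {
    if (m.scoreFtHome === null || m.scoreFtAway === null) {
      pending.add(m.group);
    }
  }
  return Array.from(pending).sort();
}
```

A group with no pending matches never enters the set, so its sub-page is skipped on that run.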

Database schema

tournaments       year PK, host, winner, runner_up, third_place, fourth_place,
                  teams_count, matches_count, total_goals, avg_goals_per_game

teams             id, name UNIQUE, iso2, fifa_code, continent, confederation

stadiums          id, tournament_year FK, name, city, country_code,
                  capacity, timezone, coordinates

matches           id, tournament_year FK, round, group_name, date, time_local,
                  stadium_id FK, team1_id FK, team2_id FK,
                  score_ft_home, score_ft_away,
                  score_ht_home, score_ht_away,
                  score_et_home, score_et_away,
                  score_p_home,  score_p_away,
                  is_quali_playoff

goals             id, match_id FK, team_id FK, player_name,
                  minute, minute_offset, is_penalty, is_own_goal

group_standings   tournament_year FK, group_name, team_id FK,
                  pos, played, won, drawn, lost,
                  goals_for, goals_against, goal_diff, pts

squads            id, tournament_year FK, team_id FK, player_name,
                  shirt_number, position, date_of_birth

Local development

Prerequisites: Node.js 22+, pnpm 10+, Docker

# 1. Clone and install
git clone <repo-url> worldcup
cd worldcup
pnpm install

# 2. Start the database
docker compose -f docker-compose.dev.yml up -d

# 3. Seed historical data (1930–2022) from committed JSON files
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm seed

# 4. Sync 2026 data from Wikipedia
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm sync

# 5. Start the dev server
DATABASE_URL="postgres://wc:wc@localhost:5432/worldcup" pnpm dev

Open http://localhost:3000.

To stop the database: docker compose -f docker-compose.dev.yml down

If you need to re-scrape historical data (e.g. after a Wikipedia article correction):

pnpm scrape 2002              # re-scrape a single year
git add data/2002/ && git commit -m "chore: refresh 2002 scraped data"

Environment variables

Variable              Required    Description
DATABASE_URL          Yes         PostgreSQL connection string
NEXT_PUBLIC_SITE_URL  Production  Public base URL, e.g. https://worldcup.example.com (used for sitemap and OG metadata)
DB_PASSWORD           Production  Password for the wc DB user (used by docker-compose.yml)
TRAEFIK_ENABLED       Production  Set to true to activate Traefik router labels
TRAEFIK_HOST          Production  Public hostname, e.g. worldcup.example.com
NETWORK_NAME          Production  Name of the external Docker network Traefik is attached to
UMAMI_ID              Optional    Umami analytics site ID
UMAMI_SRC             Optional    Umami analytics script URL

Copy .env.example to .env and fill in the values before deploying.

Deployment (Coolify + Traefik)

The app is designed for self-hosted deployment via Coolify behind a Traefik reverse proxy.

1. Configure environment

In Coolify's environment variable editor set:

DB_PASSWORD=<strong-random-password>
DATABASE_URL=postgres://wc:<DB_PASSWORD>@db:5432/worldcup
NEXT_PUBLIC_SITE_URL=https://worldcup.yourdomain.com
TRAEFIK_ENABLED=true
TRAEFIK_HOST=worldcup.yourdomain.com
NETWORK_NAME=<your-traefik-network-name>

2. Deploy

Coolify builds the Docker image via docker compose up and attaches the container to the Traefik network automatically. TLS certificates are issued by the cert resolver named "resolver" configured in Traefik.

3. Initial data load

After the first deployment, seed historical data and then sync 2026:

# In Coolify's terminal for the app container:
pnpm seed    # loads 1930–2022 from committed JSON files
pnpm sync    # fetches 2026 from Wikipedia

4. Scheduled sync (live updates)

In Coolify → your service → Scheduled Tasks, add:

Field      Value
Command    pnpm sync
Schedule   */10 * * * *
Container  app

This re-syncs 2026 from Wikipedia every 10 minutes. New match results appear within 10 minutes of the final whistle.

Project structure

worldcup/
├── app/
│   ├── layout.tsx                      # Root layout: nav, fonts, Apollo provider, global metadata
│   ├── robots.ts                       # robots.txt (Next.js convention)
│   ├── sitemap.ts                      # sitemap.xml — dynamic, rendered at request time
│   ├── page.tsx                        # Home — server wrapper (exports metadata)
│   ├── client.tsx                      # Home — Apollo/interactive client component
│   ├── groups/
│   │   ├── page.tsx                    # Groups — server wrapper
│   │   └── client.tsx                  # Groups — client component
│   ├── stats/page.tsx + client.tsx
│   ├── history/page.tsx + client.tsx
│   ├── search/page.tsx + client.tsx
│   ├── tournaments/[year]/
│   │   ├── page.tsx                    # generateMetadata fetches tournament from DB
│   │   └── client.tsx                  # Tournament detail, group standings, bracket
│   ├── teams/[slug]/page.tsx + client.tsx
│   ├── players/[name]/page.tsx + client.tsx
│   └── api/graphql/route.ts            # GraphQL Yoga endpoint
├── components/
│   ├── apollo-provider.tsx             # Apollo Client provider wrapper
│   ├── nav.tsx                         # Top navigation bar
│   ├── team-flag.tsx                   # flag-icons wrapper component
│   ├── match-card.tsx                  # Match result / fixture card
│   └── live-badge.tsx                  # Pulsing LIVE indicator
├── lib/
│   ├── db/
│   │   ├── schema.ts                   # Drizzle table definitions
│   │   └── index.ts                    # DB connection singleton
│   ├── graphql/
│   │   ├── schema.ts                   # GraphQL SDL
│   │   ├── resolvers/index.ts          # All resolvers
│   │   ├── hooks.ts                    # Apollo v4 useQuery wrapper
│   │   └── client.ts                   # Apollo Client factory
│   ├── wiki-scraper.ts                 # Wikipedia HTML parser (cheerio), rate-limit retry
│   └── iso-codes.ts                    # Team name → ISO2 country code map
├── scripts/
│   ├── scrape-wikipedia.ts             # Developer-only: scrape Wikipedia → data/{year}/
│   ├── seed.ts                         # Initial DB load from data/{year}/ JSON files
│   └── sync.ts                         # Scheduled: sync 2026 live data from Wikipedia
├── data/
│   ├── 1930/ … 2022/                   # Committed Wikipedia scrape output (per-year JSON)
│   └── {year}/
│       ├── worldcup.json               # Matches + goals
│       ├── worldcup.meta.json          # Tournament metadata
│       ├── worldcup.stadiums.json      # Stadiums
│       ├── worldcup.groups.json        # Group compositions
│       └── worldcup.squads.json        # Squad rosters (where available)
├── docker-compose.yml                  # Production (Traefik + external network)
├── docker-compose.dev.yml              # Local dev (DB only, port 5432 exposed)
├── Dockerfile                          # Multi-stage pnpm build
├── .env.example                        # Environment variable template
├── next.config.ts                      # standalone output, serverExternalPackages
├── drizzle.config.ts                   # Drizzle Kit config
└── tsconfig.json

Architecture notes

Live match detection — A match is considered live when its date equals today and the current time falls between 5 minutes before kick-off and 125 minutes after it. Kick-off times are stored as "HH:MM UTC±N" strings; the resolver computes the UTC timestamp at query time using PostgreSQL interval arithmetic. Apollo's pollInterval: 60_000 re-queries liveMatches and recentMatches every minute.

UTC kickoff ordering — Both upcomingMatches (ascending) and recentMatches (descending) sort by computed UTC kickoff time using a CASE expression that parses the time_local string and subtracts the UTC offset as an interval. This keeps ordering correct across time zones: fixtures are ranked by their actual kick-off instant, not by local clock time or database ID.
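The project performs this arithmetic in SQL; the sketch below is only a TypeScript illustration of the same idea, covering both the live window and the ordering. The Fixture type and the function names are hypothetical, the date is assumed to be stored as "YYYY-MM-DD", and the check uses the time window alone rather than also comparing the calendar date.

```typescript
// Hypothetical fixture shape; timeLocal follows the "HH:MM UTC±N" format described above.
type Fixture = { id: number; date: string; timeLocal: string };

// Converts a local kick-off ("18:00 UTC-4" on "2026-06-16") to a UTC epoch in ms.
function kickoffUtcMs(date: string, timeLocal: string): number {
  // Accepts ASCII "-" as well as the Unicode minus sign (U+2212).
  const m = /^(\d{1,2}):(\d{2}) UTC([+\-\u2212])(\d{1,2})(?::(\d{2}))?$/.exec(timeLocal.trim());
  if (!m) throw new Error(`Unrecognised kick-off time: ${timeLocal}`);
  const parts = date.split("-");
  const sign = m[3] === "+" ? 1 : -1;
  const offsetMin = sign * (Number(m[4]) * 60 + Number(m[5] ?? 0));
  const localAsUtc = Date.UTC(
    Number(parts[0]), Number(parts[1]) - 1, Number(parts[2]),
    Number(m[1]), Number(m[2]),
  );
  // Local time = UTC + offset, so UTC = local - offset.
  return localAsUtc - offsetMin * 60_000;
}

// Live window: from 5 minutes before kick-off to 125 minutes after (inclusive).
function isLive(f: Fixture, nowMs: number): boolean {
  const kickoff = kickoffUtcMs(f.date, f.timeLocal);
  return nowMs >= kickoff - 5 * 60_000 && nowMs <= kickoff + 125 * 60_000;
}

// Orders fixtures by their real kick-off instant, not by local clock time or id.
function sortByKickoff(fixtures: Fixture[], direction: "asc" | "desc"): Fixture[] {
  const factor = direction === "asc" ? 1 : -1;
  return fixtures.slice().sort(
    (a, b) => factor * (kickoffUtcMs(a.date, a.timeLocal) - kickoffUtcMs(b.date, b.timeLocal)),
  );
}
```

For example, an 18:00 UTC-4 kick-off (22:00 UTC) sorts ahead of a 17:00 UTC-7 kick-off on the same date (00:00 UTC the next day), even though the second has the earlier local clock time.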

Server/client split — All pages use a server wrapper page.tsx that exports metadata (or generateMetadata) and a client.tsx that contains the Apollo query and interactive rendering. This lets Next.js generate accurate <title>, OpenGraph, and Twitter card tags for each route without requiring server-side data fetching in client components.
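As a rough illustration of the server half of this pattern, a tournament page's generateMetadata might be shaped like the sketch below. To stay self-contained it declares a local PageMetadata type in place of Next.js's Metadata type and stubs the database lookup; findTournament, buildTournamentMetadata, and the title wording are all hypothetical. In a real page.tsx the function would be exported and the file would render the client.tsx component.

```typescript
// Local stand-in for the subset of Next.js's Metadata type used here (assumption).
type PageMetadata = {
  title: string;
  description: string;
  openGraph: { url: string };
};

type TournamentSummary = { year: number; host: string };

// Hypothetical lookup; the real page is described as fetching the tournament from the DB.
async function findTournament(year: number): Promise<TournamentSummary | null> {
  const known: TournamentSummary[] = [{ year: 1966, host: "England" }];
  return known.find((t) => t.year === year) ?? null;
}

// Pure builder: keeps the metadata logic separate from data access.
function buildTournamentMetadata(year: string, t: TournamentSummary | null): PageMetadata {
  const title = t ? `${t.year} World Cup (${t.host})` : `World Cup ${year}`;
  return {
    title,
    description: `Results, standings and scorers for the ${year} World Cup.`,
    // Relative URL; Next.js resolves it against metadataBase.
    openGraph: { url: `/tournaments/${year}` },
  };
}

// In page.tsx this would be `export async function generateMetadata(...)`.
// Recent Next.js versions pass route params as a Promise.
async function generateMetadata(
  { params }: { params: Promise<{ year: string }> },
): Promise<PageMetadata> {
  const { year } = await params;
  return buildTournamentMetadata(year, await findTournament(Number(year)));
}
```

Because the metadata is produced on the server, the client component can stay focused on the Apollo query and rendering.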

NEXT_PUBLIC_SITE_URL — The public hostname is read from this environment variable in sitemap.ts, robots.ts, and layout.tsx (metadataBase). All per-page openGraph.url values use relative paths (/groups, /tournaments/2026, etc.) which Next.js resolves against metadataBase automatically. The sitemap is marked export const dynamic = 'force-dynamic' so it runs at request time when the database is reachable, not at build time.
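The effect of resolving a relative openGraph.url against metadataBase can be illustrated with the standard URL constructor. This is only a sketch of the resolution rule, not Next.js's internal code, and resolveAgainstBase is a hypothetical helper name.

```typescript
// Resolves a relative path against the public site URL using WHATWG URL rules.
// siteUrl would come from NEXT_PUBLIC_SITE_URL in the real app (assumption).
function resolveAgainstBase(path: string, siteUrl: string): string {
  return new URL(path, siteUrl).href;
}

// Example values taken from the environment variable table.
const groupsUrl = resolveAgainstBase("/groups", "https://worldcup.example.com");
const tournamentUrl = resolveAgainstBase("/tournaments/2026", "https://worldcup.example.com");
```

Here groupsUrl is "https://worldcup.example.com/groups", so changing the deployment hostname only requires changing the environment variable.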

Apollo Client v4 — This project uses Apollo Client 4 which moved hooks to @apollo/client/react and core utilities to @apollo/client/core. A thin wrapper in lib/graphql/hooks.ts re-exports useQuery typed as Record<string, any> to avoid the v4 TData = {} default breaking all field accesses.

Standalone Docker output — next.config.ts sets output: 'standalone' which produces a self-contained server.js. The scripts/, lib/, and data/ directories are copied separately into the runner stage so pnpm seed and pnpm sync work inside the container without needing a full Node/TypeScript toolchain reinstall.

Group standings — Standings are computed live from match results via a SQL GROUP BY query in the groupStandings resolver. After each sync, 0-row standing entries are inserted for all teams in all 2026 groups, ensuring every group appears in the UI even before its first match is played.
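The resolver does this aggregation in SQL; the following TypeScript sketch shows the same calculation in memory, including the zero rows for groups that have not played yet. All type and function names are hypothetical, and the tie-break order (points, goal difference, goals scored, name) is a simplifying assumption rather than the official FIFA rules or the project's actual ordering.

```typescript
type Result = {
  group: string; home: string; away: string;
  homeGoals: number | null; awayGoals: number | null; // null = not played yet
};
type Row = {
  group: string; team: string;
  played: number; won: number; drawn: number; lost: number;
  goalsFor: number; goalsAgainst: number; goalDiff: number; pts: number;
};

// Adds one played match to a team's row (3 points for a win, 1 for a draw).
function applyResult(row: Row | undefined, scored: number, conceded: number): void {
  if (!row) return; // team not listed in the group composition
  row.played += 1;
  row.goalsFor += scored;
  row.goalsAgainst += conceded;
  row.goalDiff = row.goalsFor - row.goalsAgainst;
  if (scored > conceded) { row.won += 1; row.pts += 3; }
  else if (scored === conceded) { row.drawn += 1; row.pts += 1; }
  else { row.lost += 1; }
}

function computeStandings(groups: Record<string, string[]>, results: Result[]): Row[] {
  const rows = new Map<string, Row>();
  // Seed a zero row for every team so each group appears before its first match.
  for (const group of Object.keys(groups)) {
    for (const team of groups[group] ?? []) {
      rows.set(`${group}|${team}`, {
        group, team, played: 0, won: 0, drawn: 0, lost: 0,
        goalsFor: 0, goalsAgainst: 0, goalDiff: 0, pts: 0,
      });
    }
  }
  for (const r of results) {
    if (r.homeGoals === null || r.awayGoals === null) continue;
    applyResult(rows.get(`${r.group}|${r.home}`), r.homeGoals, r.awayGoals);
    applyResult(rows.get(`${r.group}|${r.away}`), r.awayGoals, r.homeGoals);
  }
  return Array.from(rows.values()).sort((a, b) =>
    a.group.localeCompare(b.group) ||
    b.pts - a.pts || b.goalDiff - a.goalDiff || b.goalsFor - a.goalsFor ||
    a.team.localeCompare(b.team));
}
```

Seeding the map before applying results is what guarantees a row per team even when the results list is empty for a group.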

Wikipedia scraper rate limits — The MediaWiki API occasionally returns a plain-text "You are making too many requests to the API" response instead of JSON. The scraper detects this by reading the response as text first, then parses JSON only if the body does not start with that phrase. On a rate-limit response (or HTTP 429), it waits 30 seconds before retrying. Subsequent retries use a linearly increasing back-off of 15 s × attempt number, up to 6 attempts per page.
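A minimal sketch of this detect-then-retry loop is shown below. It is not the code in lib/wiki-scraper.ts: the function names are hypothetical, the HTTP call and the sleep are injected so the logic can run without network access, and the way the 30-second wait combines with the 15 s × attempt delay (here, whichever is larger) is one possible reading of the description.

```typescript
const RATE_LIMIT_PHRASE = "You are making too many requests";
const MAX_ATTEMPTS = 6;

type RawResponse = { status: number; text: string };
type Fetcher = (url: string) => Promise<RawResponse>;
type Sleeper = (ms: number) => Promise<void>;

// Rate-limited if the server says 429 or answers with the plain-text notice.
function isRateLimited(res: RawResponse): boolean {
  return res.status === 429 || res.text.trimStart().startsWith(RATE_LIMIT_PHRASE);
}

// Delay after failed attempt `attempt` (1-based).
// Assumption: at least 30 s, growing as 15 s x attempt number.
function retryDelayMs(attempt: number): number {
  return Math.max(30_000, 15_000 * attempt);
}

// Reads the body as text first and parses JSON only when it is not a rate-limit notice.
async function fetchJsonWithRetry(url: string, fetcher: Fetcher, sleep: Sleeper): Promise<unknown> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const res = await fetcher(url);
    if (!isRateLimited(res)) {
      return JSON.parse(res.text);
    }
    if (attempt < MAX_ATTEMPTS) {
      await sleep(retryDelayMs(attempt));
    }
  }
  throw new Error(`Rate limited after ${MAX_ATTEMPTS} attempts: ${url}`);
}
```

Injecting the fetcher and sleeper keeps the retry policy testable in isolation; in a real scraper they would wrap fetch and setTimeout.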

GraphQL API

The GraphQL playground is available at /api/graphql in development.

Key queries:

# Live matches right now
{ liveMatches { id date time team1 { name } team2 { name } scoreFt isLive } }

# All-time top scorers
{ topScorers(limit: 10) { playerName goals penalties team { name iso2 } } }

# 2026 group standings
{ groupStandings(year: 2026) { groupName pos team { name iso2 } played won drawn lost goalsFor goalsAgainst pts } }

# Tournament detail
{ tournament(year: 2022) { year host winner totalGoals avgGoalsPerGame } }

# Team stats
{ team(slug: "brazil") { name stats { appearances wins losses titles goalsFor } } }

# Full-text search
{ search(query: "Ronaldo") { teams { name } players { playerName goals } } }

# Hat-tricks in World Cup history
{ hatTricks { playerName goals year round team { name } opponent { name } } }

# Global stats
{ tournamentStats { totalTournaments totalMatches totalGoals avgGoalsPerGame } }
https://worldcup.pivoine.art