
Docker & NVIDIA Container Toolkit Setup

Day 5: Docker Configuration on GPU Server

This guide sets up Docker with GPU support on your RunPod server.


Step 1: Install Docker

# SSH into GPU server
ssh gpu-pivoine

# Download and run Docker install script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Verify installation
docker --version
docker compose version

Expected output (your versions may be newer):

Docker version 24.0.7, build afdd53b
Docker Compose version v2.23.0

Manual Install (Alternative)

# Add Docker's official GPG key
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker
systemctl enable docker
systemctl start docker

Step 2: Install NVIDIA Container Toolkit

This enables Docker containers to use the GPU.

# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install toolkit
apt-get update
apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker

# Restart Docker
systemctl restart docker

Step 3: Test GPU Access in Docker

Test 1: Basic CUDA Container

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Expected output: the same nvidia-smi table you see on the host, showing your RTX 4090.

Test 2: PyTorch Container

docker run --rm --runtime=nvidia --gpus all \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0))"

Expected output:

CUDA: True Device: NVIDIA GeForce RTX 4090

Test 3: Multi-GPU Query (if you have multiple GPUs)

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 \
  bash -c "echo 'GPU Count:' && nvidia-smi --list-gpus"

Step 4: Configure Docker Compose with GPU Support

Setting nvidia as the default runtime in daemon.json lets Compose services use the GPU without declaring runtime: nvidia individually.

Create daemon.json (this replaces the file nvidia-ctk wrote in Step 2; the runtimes entry below matches its output)

cat > /etc/docker/daemon.json << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Restart Docker
systemctl restart docker
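A malformed daemon.json will prevent the Docker daemon from starting at all, so it is worth syntax-checking the JSON around the restart. A minimal sketch using python3's stdlib json.tool, shown against a temp copy of the config rather than the live file:

```shell
# Write the candidate config to a temp file and validate it with json.tool
# before copying it into /etc/docker (a sketch; adapt paths as needed).
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
{
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  },
  "default-runtime": "nvidia"
}
EOF
python3 -m json.tool "$tmp" > /dev/null && echo "daemon.json candidate: valid JSON"
# then: cp "$tmp" /etc/docker/daemon.json && systemctl restart docker
```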

Step 5: Create GPU Project Structure

cd /workspace

# Create directory structure
mkdir -p gpu-stack/{vllm,comfyui,training,jupyter,monitoring}
cd gpu-stack

# Create .env file
cat > .env << 'EOF'
# GPU Stack Environment Variables

# Timezone
TIMEZONE=Europe/Berlin

# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face (optional, for private models)
HF_TOKEN=

# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

# Weights & Biases (optional, for training logging)
WANDB_API_KEY=
EOF

chmod 600 .env
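As a quick sanity check that the file parses as shell assignments, you can source it and assemble the Postgres URL the services will use. A sketch against throwaway sample values in a temp dir, so it never touches your real .env:

```shell
# Demonstrates how services consume the .env: source it and build the
# Postgres connection string. Sample values only - not your real file.
d=$(mktemp -d)
cat > "$d/.env" << 'EOF'
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_NAME=openwebui
EOF
set -a; . "$d/.env"; set +a
echo "postgresql://$DB_USER@$DB_HOST:$DB_PORT/$DB_NAME"
```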

Step 6: Test Full Stack (Quick Smoke Test)

Let's deploy a minimal vLLM container to verify everything works:

cd /workspace/gpu-stack

# Create test compose file
cat > test-compose.yaml << 'EOF'
services:
  test-vllm:
    image: vllm/vllm-openai:latest
    container_name: test_vllm
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    command:
      - --model
      - facebook/opt-125m  # Tiny model for testing
      - --host
      - 0.0.0.0
      - --port
      - "8000"
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF

# Start test
docker compose -f test-compose.yaml up -d

# Give the model time to download (first start may take longer than 30s)
sleep 30

# Check logs
docker compose -f test-compose.yaml logs

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 10
  }'

Expected output: a JSON response containing the generated text.
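If you want to pull just the generated text out of that JSON, python3's stdlib is enough. A sketch run against a canned sample response (the real vLLM response carries more fields, but the text lives at choices[0].text either way):

```shell
# Extract choices[0].text from an OpenAI-style completion response.
# The response below is a canned sample, not real server output.
resp='{"id":"cmpl-1","object":"text_completion","choices":[{"text":" Alice."}]}'
echo "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```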

Clean up test:

docker compose -f test-compose.yaml down

Step 7: Install Additional Tools

# Python tools
apt install -y python3-pip python3-venv

# Monitoring tools
apt install -y htop nvtop iotop

# Network tools
apt install -y iperf3 tcpdump

# Development tools
apt install -y build-essential

# Git LFS (for large model files)
apt install -y git-lfs
git lfs install

Step 8: Configure Automatic Updates (Optional)

# Install unattended-upgrades
apt install -y unattended-upgrades

# Configure
dpkg-reconfigure -plow unattended-upgrades

# Enable automatic security updates
cat > /etc/apt/apt.conf.d/50unattended-upgrades << 'EOF'
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
EOF

Troubleshooting

Docker can't access GPU

Problem: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution:

# Verify NVIDIA runtime is configured
docker info | grep -i runtime

# Should show nvidia in runtimes list
# If not, reinstall nvidia-container-toolkit

# Check daemon.json
cat /etc/docker/daemon.json

# Restart Docker
systemctl restart docker

Permission denied on docker commands

Solution:

# Add your user to docker group (if not root)
usermod -aG docker $USER

# Log out and back in (or run `newgrp docker`) for the group change to apply

# Or always use sudo
sudo docker ...

Out of disk space

Check usage:

df -h
du -sh /var/lib/docker
docker system df

Clean up:

# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup
docker system prune -a --volumes

Verification Checklist

Before deploying the full stack:

  • Docker installed and running
  • docker --version shows 24.x or newer
  • docker compose version works
  • NVIDIA Container Toolkit installed
  • docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi works
  • PyTorch container can see GPU
  • Test vLLM deployment successful
  • /workspace directory structure created
  • .env file configured with VPN IPs
  • Additional tools installed (nvtop, htop, etc.)
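The checklist above can be partly automated. A rough sketch; a FAIL just means "look closer", and the paths and tool names are the ones assumed by this guide:

```shell
# Run each checklist item and report PASS/FAIL; failures are prompts to
# investigate, not hard errors.
check() {
  name=$1; shift
  if "$@" > /dev/null 2>&1; then echo "PASS: $name"; else echo "FAIL: $name"; fi
}
check "docker installed"   command -v docker
check "compose plugin"     docker compose version
check "nvtop installed"    command -v nvtop
check "gpu-stack .env"     test -f /workspace/gpu-stack/.env
```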

Performance Monitoring Commands

GPU Monitoring:

# Real-time GPU stats
watch -n 1 nvidia-smi

# Or with nvtop (prettier)
nvtop

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
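The csv query can be turned into a single utilization percentage with awk. A sketch shown against a canned sample line, since the machine you edit this on may not have nvidia-smi:

```shell
# Convert 'memory.used, memory.total' csv output into a percentage.
# The line below is a canned sample of nvidia-smi's csv output.
line="8192 MiB, 24564 MiB"
echo "$line" | awk -F', ' '{ sub(/ MiB/,"",$1); sub(/ MiB/,"",$2);
                             printf "GPU memory: %.1f%%\n", 100*$1/$2 }'
```

On a live server, pipe `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` into the same awk program.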

Docker Stats:

# Container resource usage
docker stats

# Specific container
docker stats vllm --no-stream

System Resources:

# Overall system
htop

# I/O stats
iotop

# Network
iftop

Next: Deploy Production Stack

Now you're ready to deploy the full GPU stack with vLLM, ComfyUI, and training tools.

Proceed to: Deploying the production docker-compose.yaml

Save your progress:

cat >> /workspace/SERVER_INFO.md << 'EOF'

## Docker Configuration
- Docker Version: [docker --version]
- NVIDIA Runtime: Enabled
- GPU Access in Containers: ✓
- Test vLLM Deployment: Successful
- Directory: /workspace/gpu-stack

## Tools Installed
- nvtop: GPU monitoring
- htop: System monitoring
- Docker Compose: v2.x
- Git LFS: Large file support
EOF