# Docker & NVIDIA Container Toolkit Setup

## Day 5: Docker Configuration on GPU Server

This guide sets up Docker with GPU support on your RunPod server.

---

## Step 1: Install Docker

### Quick Install (Recommended)

```bash
# SSH into GPU server
ssh gpu-pivoine

# Download and run Docker install script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Verify installation
docker --version
docker compose version
```

Expected output:

```
Docker version 24.0.7, build afdd53b
Docker Compose version v2.23.0
```

### Manual Install (Alternative)

```bash
# Add Docker's official GPG key
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker
systemctl enable docker
systemctl start docker
```

---

## Step 2: Install NVIDIA Container Toolkit

This enables Docker containers to use the GPU.

```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install toolkit
apt-get update
apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker

# Restart Docker
systemctl restart docker
```

---

## Step 3: Test GPU Access in Docker

### Test 1: Basic CUDA Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Expected output: the same `nvidia-smi` table you see on the host, showing your RTX 4090.

### Test 2: PyTorch Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0))"
```

Expected output:

```
CUDA: True Device: NVIDIA GeForce RTX 4090
```

### Test 3: Multi-GPU Query (if you have multiple GPUs)

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 \
  bash -c "echo 'GPU Count:' && nvidia-smi --list-gpus"
```

---

## Step 4: Configure Docker Compose with GPU Support

Docker Compose needs to know about the NVIDIA runtime.
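Before writing a `daemon.json` by hand, you can confirm that `nvidia-ctk runtime configure` from Step 2 actually registered the runtime with the daemon. A minimal sketch (the `has_nvidia_runtime` helper is illustrative, not part of any tool; in practice you would feed it live `docker info` output as shown in the comment):

```bash
# has_nvidia_runtime: exit 0 if a space-separated runtime list mentions "nvidia"
has_nvidia_runtime() {
  echo "$1" | grep -qw nvidia
}

# Live query (requires a running daemon):
#   runtimes=$(docker info --format '{{range $name, $r := .Runtimes}}{{$name}} {{end}}')
# Sample output from a configured host:
runtimes="io.containerd.runc.v2 nvidia runc"

if has_nvidia_runtime "$runtimes"; then
  echo "nvidia runtime registered"
else
  echo "nvidia runtime missing - re-run: nvidia-ctk runtime configure --runtime=docker"
fi
```

For the sample list above this prints `nvidia runtime registered`; on a host where configuration failed, the `else` branch tells you which command to re-run.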
### Create daemon.json

```bash
cat > /etc/docker/daemon.json << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Restart Docker
systemctl restart docker
```

---

## Step 5: Create GPU Project Structure

```bash
cd /workspace

# Create directory structure
mkdir -p gpu-stack/{vllm,comfyui,training,jupyter,monitoring}
cd gpu-stack

# Create .env file
cat > .env << 'EOF'
# GPU Stack Environment Variables

# Timezone
TIMEZONE=Europe/Berlin

# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face (optional, for private models)
HF_TOKEN=

# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

# Weights & Biases (optional, for training logging)
WANDB_API_KEY=
EOF

chmod 600 .env
```

---

## Step 6: Test Full Stack (Quick Smoke Test)

Let's deploy a minimal vLLM container to verify everything works:

```bash
cd /workspace/gpu-stack

# Create test compose file
cat > test-compose.yaml << 'EOF'
services:
  test-vllm:
    image: vllm/vllm-openai:latest
    container_name: test_vllm
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    command:
      - --model
      - facebook/opt-125m  # Tiny model for testing
      - --host
      - 0.0.0.0
      - --port
      - "8000"
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF

# Start test
docker compose -f test-compose.yaml up -d

# Wait 30 seconds for model download
sleep 30

# Check logs
docker compose -f test-compose.yaml logs

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 10
  }'
```

Expected output: a JSON response with generated text.
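The fixed `sleep 30` in the smoke test is a guess; the first run also has to download the model, which can take longer. A polling loop is more reliable (a sketch; `wait_for_url` is a hypothetical helper, and it assumes the vLLM OpenAI-compatible server answers on `/health` at the mapped port):

```bash
# wait_for_url URL [TRIES]: poll with curl every 2 s until the URL answers,
# giving up after TRIES attempts (default 60, i.e. about two minutes)
wait_for_url() {
  url=$1
  tries=${2:-60}
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS "$url" > /dev/null 2>&1; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep 2
    i=$((i + 1))
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# In place of "sleep 30" above:
#   wait_for_url http://localhost:8000/health
```

Because the function returns non-zero on timeout, you can chain it: `wait_for_url http://localhost:8000/health && curl http://localhost:8000/v1/completions ...` only fires the test request once the server is actually up.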
**Clean up test:**

```bash
docker compose -f test-compose.yaml down
```

---

## Step 7: Install Additional Tools

```bash
# Python tools
apt install -y python3-pip python3-venv

# Monitoring tools
apt install -y htop nvtop iotop

# Network tools
apt install -y iperf3 tcpdump

# Development tools
apt install -y build-essential

# Git LFS (for large model files)
apt install -y git-lfs
git lfs install
```

---

## Step 8: Configure Automatic Updates (Optional)

```bash
# Install unattended-upgrades
apt install -y unattended-upgrades

# Configure
dpkg-reconfigure -plow unattended-upgrades

# Enable automatic security updates
cat > /etc/apt/apt.conf.d/50unattended-upgrades << 'EOF'
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
EOF
```

---

## Troubleshooting

### Docker can't access GPU

**Problem:** `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Verify NVIDIA runtime is configured
docker info | grep -i runtime
# Should show nvidia in runtimes list

# If not, reinstall nvidia-container-toolkit

# Check daemon.json
cat /etc/docker/daemon.json

# Restart Docker
systemctl restart docker
```

### Permission denied on docker commands

**Solution:**

```bash
# Add your user to the docker group (if not root),
# then log out and back in for the change to take effect
usermod -aG docker $USER

# Or always use sudo
sudo docker ...
```

### Out of disk space

**Check usage:**

```bash
df -h
du -sh /var/lib/docker
docker system df
```

**Clean up:**

```bash
# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup
docker system prune -a --volumes
```

---

## Verification Checklist

Before deploying the full stack:

- [ ] Docker installed and running
- [ ] `docker --version` shows 24.x or newer
- [ ] `docker compose version` works
- [ ] NVIDIA Container Toolkit installed
- [ ] `docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi` works
- [ ] PyTorch container can see GPU
- [ ] Test vLLM deployment successful
- [ ] /workspace directory structure created
- [ ] .env file configured with VPN IPs
- [ ] Additional tools installed (nvtop, htop, etc.)

---

## Performance Monitoring Commands

**GPU Monitoring:**

```bash
# Real-time GPU stats
watch -n 1 nvidia-smi

# Or with nvtop (prettier)
nvtop

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

**Docker Stats:**

```bash
# Container resource usage
docker stats

# Specific container
docker stats vllm --no-stream
```

**System Resources:**

```bash
# Overall system
htop

# I/O stats
iotop

# Network
iftop
```

---

## Next: Deploy Production Stack

Now you're ready to deploy the full GPU stack with vLLM, ComfyUI, and training tools.

**Proceed to:** Deploying the production docker-compose.yaml

**Save your progress:**

```bash
cat >> /workspace/SERVER_INFO.md << 'EOF'
## Docker Configuration
- Docker Version: [docker --version]
- NVIDIA Runtime: Enabled
- GPU Access in Containers: ✓
- Test vLLM Deployment: Successful
- Directory: /workspace/gpu-stack

## Tools Installed
- nvtop: GPU monitoring
- htop: System monitoring
- Docker Compose: v2.x
- Git LFS: Large file support
EOF
```
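The bracketed placeholders like `[docker --version]` have to be filled in by hand. A variant that captures the live version strings instead (a sketch; it writes to a temporary file first so you can review the section before appending it to /workspace/SERVER_INFO.md):

```bash
# Capture version strings, falling back to "not installed" if a tool is missing
docker_ver=$(docker --version 2>/dev/null || echo "not installed")
compose_ver=$(docker compose version 2>/dev/null || echo "not installed")

# Build the section in a temp file (unquoted EOF so the variables expand)
info_file=$(mktemp)
cat > "$info_file" << EOF
## Docker Configuration
- Docker Version: ${docker_ver}
- Docker Compose: ${compose_ver}
- Directory: /workspace/gpu-stack
EOF

cat "$info_file"
# Review the output, then: cat "$info_file" >> /workspace/SERVER_INFO.md
```

Unlike the heredoc above, this one is deliberately unquoted (`<< EOF`, not `<< 'EOF'`) so that `${docker_ver}` and `${compose_ver}` are substituted when the file is written.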