# Docker & NVIDIA Container Toolkit Setup

## Day 5: Docker Configuration on GPU Server

This guide sets up Docker with GPU support on your RunPod server.

---

## Step 1: Install Docker

### Quick Install (Recommended)

```bash
# SSH into the GPU server
ssh gpu-pivoine

# Download and run Docker's convenience install script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Verify the installation
docker --version
docker compose version
```

Expected output (exact versions will vary):

```
Docker version 24.0.7, build afdd53b
Docker Compose version v2.23.0
```
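
If you want to check the reported version against the 24.x floor in a script rather than by eye, a small parse helper works (a sketch; the `version_line` string below is a stand-in for real `docker --version` output):

```shell
# Extract the major version from `docker --version` output.
# The sample line is a stand-in for "$(docker --version)".
version_line='Docker version 24.0.7, build afdd53b'
major=$(printf '%s' "$version_line" | sed -E 's/^Docker version ([0-9]+)\..*/\1/')
if [ "$major" -ge 24 ]; then
  echo "Docker major version $major: OK"
else
  echo "Docker major version $major: too old, expected 24+"
fi
```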

### Manual Install (Alternative)

```bash
# Add Docker's official GPG key
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker
systemctl enable docker
systemctl start docker
```

---

## Step 2: Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit is what lets Docker containers access the GPU.

```bash
# Add the NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
apt-get update
apt-get install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker

# Restart Docker
systemctl restart docker
```

---

## Step 3: Test GPU Access in Docker

### Test 1: Basic CUDA Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Expected output: the same `nvidia-smi` report you see on the host, showing your RTX 4090.

### Test 2: PyTorch Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0))"
```

Expected output:

```
CUDA: True Device: NVIDIA GeForce RTX 4090
```

### Test 3: Multi-GPU Query (if you have multiple GPUs)

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 \
  bash -c "echo 'GPU Count:' && nvidia-smi --list-gpus"
```
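
If a script needs that listing as a number, you can count the `GPU N:` lines (a sketch; the `gpu_list` string stands in for real `nvidia-smi --list-gpus` output on a single-4090 box):

```shell
# Count GPUs by counting "GPU N:" lines. The sample listing is a
# stand-in for "$(nvidia-smi --list-gpus)".
gpu_list='GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-00000000-0000-0000-0000-000000000000)'
count=$(printf '%s\n' "$gpu_list" | grep -c '^GPU ')
echo "GPU count: $count"
```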

---

## Step 4: Configure Docker Compose with GPU Support

Docker Compose needs to know about the NVIDIA runtime.

### Create daemon.json

Note that `nvidia-ctk runtime configure` in Step 2 already wrote a minimal `/etc/docker/daemon.json`; the file below replaces it, additionally making nvidia the default runtime and enabling log rotation.

```bash
cat > /etc/docker/daemon.json << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Restart Docker
systemctl restart docker
```
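
A malformed `daemon.json` stops the Docker daemon from starting at all, so it is worth validating the file before the restart. A small helper (a sketch; it relies only on `python3 -m json.tool`, which ships with Python's standard library):

```shell
# Validate a JSON file before restarting Docker; a syntax error in
# daemon.json prevents the daemon from coming back up.
check_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "$1: valid JSON"
  else
    echo "$1: INVALID JSON -- fix before restarting Docker"
  fi
}

check_json /etc/docker/daemon.json
```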

---

## Step 5: Create GPU Project Structure

```bash
cd /workspace

# Create the directory structure
mkdir -p gpu-stack/{vllm,comfyui,training,jupyter,monitoring}
cd gpu-stack

# Create the .env file
cat > .env << 'EOF'
# GPU Stack Environment Variables

# Timezone
TIMEZONE=Europe/Berlin

# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face (optional, for private models)
HF_TOKEN=

# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

# Weights & Biases (optional, for training logging)
WANDB_API_KEY=
EOF

chmod 600 .env
```
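
To confirm the file loads cleanly, you can source it with `set -a` so every entry is exported (a sketch; the stand-in file below mirrors just the VPN entries from the real `.env`):

```shell
# Source an env file and fail fast if a required variable is missing.
# /tmp/env-check is a stand-in for the .env created above.
cat > /tmp/env-check << 'EOF'
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2
EOF

set -a             # export everything sourced below
. /tmp/env-check
set +a
echo "VPS_IP=${VPS_IP:?missing} GPU_IP=${GPU_IP:?missing}"
```

The `${VAR:?missing}` expansions abort with an error if a required entry was never set, which is cheaper to catch here than inside a container.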

---

## Step 6: Test Full Stack (Quick Smoke Test)

Let's deploy a minimal vLLM container to verify everything works:

```bash
cd /workspace/gpu-stack

# Create a test compose file
cat > test-compose.yaml << 'EOF'
services:
  test-vllm:
    image: vllm/vllm-openai:latest
    container_name: test_vllm
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    command:
      - --model
      - facebook/opt-125m   # tiny model for testing
      - --host
      - 0.0.0.0
      - --port
      - "8000"
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF

# Start the test
docker compose -f test-compose.yaml up -d

# Give it ~30 seconds for the model download (the first pull can take longer)
sleep 30

# Check logs
docker compose -f test-compose.yaml logs

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 10
  }'
```

Expected output: a JSON response containing the generated text.
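
To pull just the generated text out of that response, pipe it through a one-line JSON parse (a sketch; `RESPONSE` below is a stand-in for what the `curl` above actually returns):

```shell
# Extract choices[0].text from an OpenAI-style completion response.
# RESPONSE is a stand-in for the JSON returned by the curl above.
RESPONSE='{"choices": [{"text": " Alice and I work on GPUs."}]}'
printf '%s' "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```

In a real run you would capture the `curl` output into `RESPONSE` (or pipe it directly) instead of using the sample string.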

**Clean up the test:**

```bash
docker compose -f test-compose.yaml down
```

---

## Step 7: Install Additional Tools

```bash
# Python tools
apt install -y python3-pip python3-venv

# Monitoring tools
apt install -y htop nvtop iotop

# Network tools (iftop is used in the monitoring section below)
apt install -y iperf3 tcpdump iftop

# Development tools
apt install -y build-essential

# Git LFS (for large model files)
apt install -y git-lfs
git lfs install
```

---

## Step 8: Configure Automatic Updates (Optional)

```bash
# Install unattended-upgrades
apt install -y unattended-upgrades

# Configure interactively
dpkg-reconfigure -plow unattended-upgrades

# Enable automatic security updates
cat > /etc/apt/apt.conf.d/50unattended-upgrades << 'EOF'
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
EOF
```

---

## Troubleshooting

### Docker can't access the GPU

**Problem:** `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Verify the NVIDIA runtime is configured
docker info | grep -i runtime

# It should list nvidia among the runtimes.
# If not, reinstall nvidia-container-toolkit.

# Check daemon.json
cat /etc/docker/daemon.json

# Restart Docker
systemctl restart docker
```

### Permission denied on docker commands

**Solution:**

```bash
# Add your user to the docker group (if not root),
# then log out and back in for the change to take effect
usermod -aG docker $USER

# Or always use sudo
sudo docker ...
```

### Out of disk space

**Check usage:**

```bash
df -h
du -sh /var/lib/docker
docker system df
```

**Clean up:**

```bash
# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup (removes all unused images, containers, networks, and volumes)
docker system prune -a --volumes
```

---

## Verification Checklist

Before deploying the full stack:

- [ ] Docker installed and running
- [ ] `docker --version` shows 24.x or newer
- [ ] `docker compose version` works
- [ ] NVIDIA Container Toolkit installed
- [ ] `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi` works
- [ ] PyTorch container can see the GPU
- [ ] Test vLLM deployment successful
- [ ] `/workspace` directory structure created
- [ ] `.env` file configured with VPN IPs
- [ ] Additional tools installed (nvtop, htop, etc.)

---

## Performance Monitoring Commands

**GPU Monitoring:**

```bash
# Real-time GPU stats
watch -n 1 nvidia-smi

# Or with nvtop (prettier)
nvtop

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
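
For a lightweight record over time, the memory query above can be appended to a CSV with a timestamp (a sketch; `sample_gpu_mem` takes the MiB value as an argument so the real `nvidia-smi` query can be slotted in):

```shell
# Append one timestamped GPU-memory sample to a CSV log.
# In practice the argument would come from:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
sample_gpu_mem() {
  printf '%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1"
}

sample_gpu_mem 8123 >> /tmp/gpu-mem.csv
tail -n 1 /tmp/gpu-mem.csv
```

Run it from cron (or a `while true; do ...; sleep 5; done` loop) to build a usage history you can plot later.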

**Docker Stats:**

```bash
# Container resource usage
docker stats

# A specific container
docker stats vllm --no-stream
```

**System Resources:**

```bash
# Overall system
htop

# I/O stats
iotop

# Network
iftop
```

---

## Next: Deploy Production Stack

Now you're ready to deploy the full GPU stack with vLLM, ComfyUI, and training tools.

**Proceed to:** Deploying the production docker-compose.yaml

**Save your progress:**

```bash
cat >> /workspace/SERVER_INFO.md << 'EOF'

## Docker Configuration
- Docker Version: [docker --version]
- NVIDIA Runtime: Enabled
- GPU Access in Containers: ✓
- Test vLLM Deployment: Successful
- Directory: /workspace/gpu-stack

## Tools Installed
- nvtop: GPU monitoring
- htop: System monitoring
- Docker Compose: v2.x
- Git LFS: Large file support
EOF
```