# Docker & NVIDIA Container Toolkit Setup

## Day 5: Docker Configuration on GPU Server

This guide sets up Docker with GPU support on your RunPod server.

---

## Step 1: Install Docker

### Quick Install (Recommended)

```bash
# SSH into the GPU server
ssh gpu-pivoine

# Download and run Docker's convenience install script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Verify the installation
docker --version
docker compose version
```

Expected output (exact versions will vary):

```
Docker version 24.0.7, build afdd53b
Docker Compose version v2.23.0
```
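
If you want to check the reported version against the 24.x floor in a script rather than by eye, a small parse helper works (a sketch; the `version_line` string below is a stand-in for real `docker --version` output):

```shell
# Extract the major version from `docker --version` output.
# The sample line is a stand-in for "$(docker --version)".
version_line='Docker version 24.0.7, build afdd53b'
major=$(printf '%s' "$version_line" | sed -E 's/^Docker version ([0-9]+)\..*/\1/')
if [ "$major" -ge 24 ]; then
  echo "Docker major version $major: OK"
else
  echo "Docker major version $major: too old, expected 24+"
fi
```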

### Manual Install (Alternative)

```bash
# Add Docker's official GPG key
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker
systemctl enable docker
systemctl start docker
```

---

## Step 2: Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit is what lets Docker containers access the GPU.

```bash
# Add the NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
apt-get update
apt-get install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker

# Restart Docker
systemctl restart docker
```

---

## Step 3: Test GPU Access in Docker

### Test 1: Basic CUDA Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Expected output: the same `nvidia-smi` report you see on the host, showing your RTX 4090.

### Test 2: PyTorch Container

```bash
docker run --rm --runtime=nvidia --gpus all \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0))"
```

Expected output:

```
CUDA: True Device: NVIDIA GeForce RTX 4090
```

### Test 3: Multi-GPU Query (if you have multiple GPUs)

```bash
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 \
  bash -c "echo 'GPU Count:' && nvidia-smi --list-gpus"
```
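
If a script needs that listing as a number, you can count the `GPU N:` lines (a sketch; the `gpu_list` string stands in for real `nvidia-smi --list-gpus` output on a single-4090 box):

```shell
# Count GPUs by counting "GPU N:" lines. The sample listing is a
# stand-in for "$(nvidia-smi --list-gpus)".
gpu_list='GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-00000000-0000-0000-0000-000000000000)'
count=$(printf '%s\n' "$gpu_list" | grep -c '^GPU ')
echo "GPU count: $count"
```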

---

## Step 4: Configure Docker Compose with GPU Support

Docker Compose needs to know about the NVIDIA runtime.

### Create daemon.json

Note that `nvidia-ctk runtime configure` in Step 2 already wrote a minimal `/etc/docker/daemon.json`; the file below replaces it, additionally making nvidia the default runtime and enabling log rotation.

```bash
cat > /etc/docker/daemon.json << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Restart Docker
systemctl restart docker
```
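
A malformed `daemon.json` stops the Docker daemon from starting at all, so it is worth validating the file before the restart. A small helper (a sketch; it relies only on `python3 -m json.tool`, which ships with Python's standard library):

```shell
# Validate a JSON file before restarting Docker; a syntax error in
# daemon.json prevents the daemon from coming back up.
check_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "$1: valid JSON"
  else
    echo "$1: INVALID JSON -- fix before restarting Docker"
  fi
}

check_json /etc/docker/daemon.json
```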

---

## Step 5: Create GPU Project Structure

```bash
cd /workspace

# Create the directory structure
mkdir -p gpu-stack/{vllm,comfyui,training,jupyter,monitoring}
cd gpu-stack

# Create the .env file
cat > .env << 'EOF'
# GPU Stack Environment Variables

# Timezone
TIMEZONE=Europe/Berlin

# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face (optional, for private models)
HF_TOKEN=

# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

# Weights & Biases (optional, for training logging)
WANDB_API_KEY=
EOF

chmod 600 .env
```
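
To confirm the file loads cleanly, you can source it with `set -a` so every entry is exported (a sketch; the stand-in file below mirrors just the VPN entries from the real `.env`):

```shell
# Source an env file and fail fast if a required variable is missing.
# /tmp/env-check is a stand-in for the .env created above.
cat > /tmp/env-check << 'EOF'
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2
EOF

set -a             # export everything sourced below
. /tmp/env-check
set +a
echo "VPS_IP=${VPS_IP:?missing} GPU_IP=${GPU_IP:?missing}"
```

The `${VAR:?missing}` expansions abort with an error if a required entry was never set, which is cheaper to catch here than inside a container.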

---

## Step 6: Test Full Stack (Quick Smoke Test)

Let's deploy a minimal vLLM container to verify everything works:

```bash
cd /workspace/gpu-stack

# Create a test compose file
cat > test-compose.yaml << 'EOF'
services:
  test-vllm:
    image: vllm/vllm-openai:latest
    container_name: test_vllm
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    command:
      - --model
      - facebook/opt-125m   # tiny model for testing
      - --host
      - 0.0.0.0
      - --port
      - "8000"
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF

# Start the test
docker compose -f test-compose.yaml up -d

# Give it ~30 seconds for the model download (the first pull can take longer)
sleep 30

# Check logs
docker compose -f test-compose.yaml logs

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 10
  }'
```

Expected output: a JSON response containing the generated text.
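
To pull just the generated text out of that response, pipe it through a one-line JSON parse (a sketch; `RESPONSE` below is a stand-in for what the `curl` above actually returns):

```shell
# Extract choices[0].text from an OpenAI-style completion response.
# RESPONSE is a stand-in for the JSON returned by the curl above.
RESPONSE='{"choices": [{"text": " Alice and I work on GPUs."}]}'
printf '%s' "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```

In a real run you would capture the `curl` output into `RESPONSE` (or pipe it directly) instead of using the sample string.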

**Clean up the test:**

```bash
docker compose -f test-compose.yaml down
```

---

## Step 7: Install Additional Tools

```bash
# Python tools
apt install -y python3-pip python3-venv

# Monitoring tools
apt install -y htop nvtop iotop

# Network tools (iftop is used in the monitoring section below)
apt install -y iperf3 tcpdump iftop

# Development tools
apt install -y build-essential

# Git LFS (for large model files)
apt install -y git-lfs
git lfs install
```

---

## Step 8: Configure Automatic Updates (Optional)

```bash
# Install unattended-upgrades
apt install -y unattended-upgrades

# Configure interactively
dpkg-reconfigure -plow unattended-upgrades

# Enable automatic security updates
cat > /etc/apt/apt.conf.d/50unattended-upgrades << 'EOF'
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
EOF
```

---

## Troubleshooting

### Docker can't access the GPU

**Problem:** `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Verify the NVIDIA runtime is configured
docker info | grep -i runtime

# It should list nvidia among the runtimes.
# If not, reinstall nvidia-container-toolkit.

# Check daemon.json
cat /etc/docker/daemon.json

# Restart Docker
systemctl restart docker
```

### Permission denied on docker commands

**Solution:**

```bash
# Add your user to the docker group (if not root),
# then log out and back in for the change to take effect
usermod -aG docker $USER

# Or always use sudo
sudo docker ...
```

### Out of disk space

**Check usage:**

```bash
df -h
du -sh /var/lib/docker
docker system df
```

**Clean up:**

```bash
# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup (removes all unused images, containers, networks, and volumes)
docker system prune -a --volumes
```

---

## Verification Checklist

Before deploying the full stack:

- [ ] Docker installed and running
- [ ] `docker --version` shows 24.x or newer
- [ ] `docker compose version` works
- [ ] NVIDIA Container Toolkit installed
- [ ] `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi` works
- [ ] PyTorch container can see the GPU
- [ ] Test vLLM deployment successful
- [ ] `/workspace` directory structure created
- [ ] `.env` file configured with VPN IPs
- [ ] Additional tools installed (nvtop, htop, etc.)

---

## Performance Monitoring Commands

**GPU Monitoring:**

```bash
# Real-time GPU stats
watch -n 1 nvidia-smi

# Or with nvtop (prettier)
nvtop

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
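
For a lightweight record over time, the memory query above can be appended to a CSV with a timestamp (a sketch; `sample_gpu_mem` takes the MiB value as an argument so the real `nvidia-smi` query can be slotted in):

```shell
# Append one timestamped GPU-memory sample to a CSV log.
# In practice the argument would come from:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
sample_gpu_mem() {
  printf '%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1"
}

sample_gpu_mem 8123 >> /tmp/gpu-mem.csv
tail -n 1 /tmp/gpu-mem.csv
```

Run it from cron (or a `while true; do ...; sleep 5; done` loop) to build a usage history you can plot later.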

**Docker Stats:**

```bash
# Container resource usage
docker stats

# A specific container
docker stats vllm --no-stream
```

**System Resources:**

```bash
# Overall system
htop

# I/O stats
iotop

# Network
iftop
```

---

## Next: Deploy Production Stack

Now you're ready to deploy the full GPU stack with vLLM, ComfyUI, and training tools.

**Proceed to:** Deploying the production docker-compose.yaml

**Save your progress:**

```bash
cat >> /workspace/SERVER_INFO.md << 'EOF'

## Docker Configuration
- Docker Version: [docker --version]
- NVIDIA Runtime: Enabled
- GPU Access in Containers: ✓
- Test vLLM Deployment: Successful
- Directory: /workspace/gpu-stack

## Tools Installed
- nvtop: GPU monitoring
- htop: System monitoring
- Docker Compose: v2.x
- Git LFS: Large file support
EOF
```