feat: add RunPod Docker template with automated build workflow
- Add Dockerfile with minimal setup (supervisor, tailscale)
- Add start.sh bootstrap script for container initialization
- Add Gitea workflow for automated Docker image builds
- Add comprehensive RUNPOD_TEMPLATE.md documentation
- Add bootstrap-venvs.sh for Python venv health checks

This enables deployment of the AI orchestrator on RunPod using:

- Minimal Docker image (~2-3GB) for fast deployment
- Network volume for models and data persistence (~80-200GB)
- Automated builds on push to main or version tags
- Full Tailscale VPN integration
- Supervisor process management

# RunPod Template Setup Guide

This guide explains how to deploy the AI Orchestrator (ComfyUI + vLLM) on RunPod using a custom Docker template and network volume.

## Architecture Overview

The deployment uses a **two-tier strategy**:

1. **Docker Image** (software layer) - Contains system packages, Supervisor, Tailscale
2. **Network Volume** (data layer) - Contains models, ComfyUI installation, venvs, configuration

This approach allows fast pod deployment (~2-3 minutes) while keeping all large files (models, ~80-200GB) on a persistent network volume.
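
For orientation, the split looks roughly like this on a running pod. The paths below are the ones used later in this guide; the exact contents of your volume may differ:

```bash
# Software layer (baked into the image): /start.sh, supervisor, tailscale
# Data layer (network volume mounted at /workspace):
ls /workspace
# ai/                  - this repository (.env, playbook.yml, scripts/)
# ComfyUI/             - ComfyUI install and custom_nodes/
# huggingface_cache/   - model cache (HF_HOME)
# logs/                - supervisord and per-service logs
# supervisord.conf     - Supervisor configuration
```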

## Prerequisites

- RunPod account with credits
- Docker Hub account (for hosting the template image)
- HuggingFace account with API token (for model downloads; a quick token check follows this list)
- Tailscale account with auth key (optional, for VPN access)
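
The HuggingFace token is the one credential worth verifying before you spend any pod time. A quick check, assuming `HF_TOKEN` is exported in your shell:

```bash
# Returns your account details as JSON if the token is valid, an error otherwise
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```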

## Step 1: Build and Push Docker Image

### Option A: Automated Build (Recommended)

The repository includes a Gitea workflow that automatically builds and pushes the Docker image when you push to the `main` branch or create a version tag.

1. **Configure Gitea Secrets:**
   - Go to your Gitea repository → Settings → Secrets
   - Add `DOCKER_USERNAME` = your Docker Hub username
   - Add `DOCKER_PASSWORD` = your Docker Hub password or access token

2. **Trigger Build:**

   ```bash
   # Push to main branch
   git push origin main

   # Or create a version tag
   git tag v1.0.0
   git push origin v1.0.0
   ```

3. **Monitor Build:**
   - Go to the Actions tab in Gitea
   - Wait for the build to complete (~5-10 minutes)
   - Note the Docker image name: `your-username/runpod-ai-orchestrator:latest` (you can verify the push as shown below)
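
To confirm the workflow actually published the image, pull it back from Docker Hub and inspect it:

```bash
# Pull and inspect the image the workflow just pushed
docker pull your-username/runpod-ai-orchestrator:latest
docker image inspect --format '{{.Size}} bytes' your-username/runpod-ai-orchestrator:latest
```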

### Option B: Manual Build

If you prefer to build manually:

```bash
# From the repository root
cd /path/to/runpod

# Build the image
docker build -t your-username/runpod-ai-orchestrator:latest .

# Push to Docker Hub
docker login
docker push your-username/runpod-ai-orchestrator:latest
```
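
If you also want the image available under a version tag (matching the tag-triggered builds in Option A), tag and push it explicitly:

```bash
# Publish the same image under a version tag (v1.0.0 is an example)
docker tag your-username/runpod-ai-orchestrator:latest your-username/runpod-ai-orchestrator:v1.0.0
docker push your-username/runpod-ai-orchestrator:v1.0.0
```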

## Step 2: Create Network Volume

Network volumes persist your models and data across pod restarts and rebuilds.

1. **Go to RunPod Dashboard → Storage → Network Volumes**

2. **Click "New Network Volume"**

3. **Configure:**
   - **Name**: `ai-orchestrator-models`
   - **Size**: `200GB` (adjust based on your needs)
     - Essential models only: ~80GB
     - All models: ~137-200GB
   - **Datacenter**: Choose the one closest to you (the volume is tied to its datacenter)

4. **Click "Create Volume"**

5. **Note the Volume ID** (e.g., `vol-abc123def456`) for pod deployment

### Storage Requirements

| Configuration | Size | Models Included |
|---------------|------|-----------------|
| Essential | ~80GB | FLUX Schnell, 1-2 SDXL checkpoints, MusicGen Medium |
| Complete | ~137GB | All image/video/audio models from the playbook |
| Full + vLLM | ~200GB | Complete + Qwen 2.5 7B + Llama 3.1 8B |

## Step 3: Create RunPod Template

1. **Go to RunPod Dashboard → Templates**

2. **Click "New Template"**

3. **Configure Template Settings:**

   **Container Configuration:**
   - **Template Name**: `AI Orchestrator (ComfyUI + vLLM)`
   - **Template Type**: Docker
   - **Container Image**: `your-username/runpod-ai-orchestrator:latest`
   - **Container Disk**: `50GB` (for system and temp files)
   - **Docker Command**: Leave empty (uses the default `/start.sh`)

   **Volume Configuration:**
   - **Volume Mount Path**: `/workspace`
   - **Attach to Network Volume**: Select your volume ID from Step 2

   **Port Configuration:**
   - **Expose HTTP Ports**: `8188, 9000, 9001`
     - `8188` - ComfyUI web interface
     - `9000` - Model orchestrator API
     - `9001` - Supervisor web UI
   - **Expose TCP Ports**: `22` (SSH access)

   **Environment Variables:**

   ```
   HF_TOKEN=your_huggingface_token_here
   TAILSCALE_AUTHKEY=tskey-auth-your_tailscale_authkey_here
   SUPERVISOR_BACKEND_HOST=localhost
   SUPERVISOR_BACKEND_PORT=9001
   ```

   **Advanced Settings:**
   - **Start Jupyter**: No
   - **Start SSH**: Yes (handled by the base image)

4. **Click "Save Template"**

## Step 4: First Deployment (Initial Setup)

The first time you deploy, you need to set up the network volume with models and configuration.

### 4.1 Deploy Pod

1. **Go to RunPod Dashboard → Pods**
2. **Click "Deploy"** or "GPU Pods"
3. **Select your custom template**: `AI Orchestrator (ComfyUI + vLLM)`
4. **Configure GPU:**
   - **GPU Type**: RTX 4090 (24GB VRAM) or higher
   - **Network Volume**: Select your volume from Step 2
   - **On-Demand vs Spot**: Choose based on budget
5. **Click "Deploy"**

### 4.2 SSH into Pod

```bash
# Get the pod SSH command from the RunPod dashboard
ssh root@<pod-ip> -p <port> -i ~/.ssh/id_ed25519

# Or use the RunPod web terminal
```

### 4.3 Initial Setup on Network Volume

```bash
# 1. Clone the repository to /workspace/ai
cd /workspace
git clone https://github.com/your-username/runpod.git ai
cd ai

# 2. Create a .env file with your credentials
cp .env.example .env
nano .env

# Edit and add:
# HF_TOKEN=your_huggingface_token
# TAILSCALE_AUTHKEY=tskey-auth-your_key
# GPU_TAILSCALE_IP=<will be set automatically>

# 3. Download essential models (this takes 30-60 minutes)
ansible-playbook playbook.yml --tags comfyui-essential

# OR download all models (1-2 hours)
ansible-playbook playbook.yml --tags comfyui-models-all

# 4. Link models to ComfyUI
bash scripts/link-comfyui-models.sh

# OR, if arty is available
arty run models/link-comfyui

# 5. Install ComfyUI custom node dependencies
cd /workspace/ComfyUI/custom_nodes/ComfyUI-Manager
pip install -r requirements.txt
cd /workspace/ai

# 6. Restart the container to apply all changes
exit
# Then, in the RunPod dashboard: Stop pod → Start pod
```

### 4.4 Verify Services

After the restart, SSH back in and check:

```bash
# Check supervisor status
supervisorctl -c /workspace/supervisord.conf status

# Expected output:
# comfyui    RUNNING   pid 123, uptime 0:01:00
# (orchestrator is disabled by default - enable it for vLLM)

# Test ComfyUI
curl -I http://localhost:8188

# Test the Supervisor web UI
curl -I http://localhost:9001
```

## Step 5: Subsequent Deployments

After initial setup, deploying new pods is quick (2-3 minutes):

1. **Deploy a pod** with the same template + network volume
2. **Wait for startup** (~1-2 minutes for services to start)
3. **Access services:**
   - ComfyUI: `http://<pod-ip>:8188`
   - Supervisor: `http://<pod-ip>:9001`

**All models, configuration, and data persist on the network volume!**
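
If you script these redeployments, a small readiness loop removes the guesswork about when ComfyUI is back (a minimal sketch; substitute your real pod IP for the placeholder):

```bash
# Wait up to ~5 minutes for ComfyUI to answer after a pod starts
POD_URL="http://<pod-ip>:8188"   # placeholder: replace with the real pod IP
for i in $(seq 1 60); do
  if curl -sf -o /dev/null "$POD_URL"; then
    echo "ComfyUI is up"
    break
  fi
  sleep 5
done
```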

## Step 6: Access Services

### Via Direct IP (HTTP)

Get the pod IP and ports from the RunPod dashboard:

```
ComfyUI:           http://<pod-ip>:8188
Supervisor UI:     http://<pod-ip>:9001
Orchestrator API:  http://<pod-ip>:9000
SSH:               ssh root@<pod-ip> -p <port>
```

### Via Tailscale VPN (Recommended)

If you configured `TAILSCALE_AUTHKEY`, the pod automatically joins your Tailscale network:

1. **Get the Tailscale IP:**

   ```bash
   ssh root@<pod-ip> -p <port>
   tailscale ip -4
   # Example output: 100.114.60.40
   ```

2. **Access via Tailscale:**

   ```
   ComfyUI:       http://<tailscale-ip>:8188
   Supervisor:    http://<tailscale-ip>:9001
   Orchestrator:  http://<tailscale-ip>:9000
   SSH:           ssh root@<tailscale-ip>
   ```

3. **Update the LiteLLM config** on your VPS with the Tailscale IP (a quick reachability check follows below)
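
Before editing the LiteLLM config, confirm the pod is reachable from the VPS over the tailnet. This assumes the VPS is on the same tailnet and uses the example IP from step 1:

```bash
# Run these on the VPS; 100.114.60.40 is the example Tailscale IP above
tailscale ping 100.114.60.40
curl -I http://100.114.60.40:8188   # ComfyUI answers when the pod is up
```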

## Service Management

### Start/Stop Services

```bash
# Start all services
supervisorctl -c /workspace/supervisord.conf start all

# Stop all services
supervisorctl -c /workspace/supervisord.conf stop all

# Restart a specific service
supervisorctl -c /workspace/supervisord.conf restart comfyui

# View status
supervisorctl -c /workspace/supervisord.conf status
```

### Enable vLLM Models (Text Generation)

By default, only ComfyUI runs (to save VRAM). To enable vLLM (a mode-switch helper sketch follows this section):

1. **Stop ComfyUI** (frees up VRAM):

   ```bash
   supervisorctl -c /workspace/supervisord.conf stop comfyui
   ```

2. **Start the orchestrator** (manages vLLM models):

   ```bash
   supervisorctl -c /workspace/supervisord.conf start orchestrator
   ```

3. **Test text generation:**

   ```bash
   curl -X POST http://localhost:9000/v1/chat/completions \
     -H 'Content-Type: application/json' \
     -d '{"model":"qwen-2.5-7b","messages":[{"role":"user","content":"Hello"}]}'
   ```

### Switch Back to ComfyUI

```bash
# Stop the orchestrator (stops all vLLM models)
supervisorctl -c /workspace/supervisord.conf stop orchestrator

# Start ComfyUI
supervisorctl -c /workspace/supervisord.conf start comfyui
```
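
Since the two modes are mutually exclusive on a 24GB card, a tiny helper makes switching less error-prone. A sketch, assuming the service names and config path used above:

```bash
#!/bin/bash
# switch-mode.sh - toggle between ComfyUI and vLLM (hypothetical helper)
set -u
CONF=/workspace/supervisord.conf

case "${1:-}" in
  comfyui)
    supervisorctl -c "$CONF" stop orchestrator
    supervisorctl -c "$CONF" start comfyui
    ;;
  vllm)
    supervisorctl -c "$CONF" stop comfyui
    supervisorctl -c "$CONF" start orchestrator
    ;;
  *)
    echo "Usage: $0 {comfyui|vllm}" >&2
    exit 1
    ;;
esac

supervisorctl -c "$CONF" status
```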

## Updating the Template

When you make changes to code or configuration:

### Update Docker Image

```bash
# 1. Make changes to Dockerfile or start.sh
# 2. Push to the repository
git add .
git commit -m "Update template configuration"
git push origin main

# 3. The Gitea workflow auto-builds the new image

# 4. Terminate the old pod and deploy a new one with the updated image
```

### Update Network Volume Data

```bash
# SSH into the running pod
ssh root@<pod-ip> -p <port>

# Update the repository
cd /workspace/ai
git pull

# Re-run Ansible if needed
ansible-playbook playbook.yml --tags <specific-tag>

# Restart services
supervisorctl -c /workspace/supervisord.conf restart all
```

## Troubleshooting

### Pod fails to start

**Check logs:**

```bash
# Via SSH
cat /workspace/logs/supervisord.log
cat /workspace/logs/comfyui.err.log

# Via the RunPod web terminal
tail -f /workspace/logs/*.log
```

**Common issues:**
- Missing `.env` file → Create `/workspace/ai/.env` with the required vars (see the check below)
- Supervisor config not found → Ensure `/workspace/ai/supervisord.conf` exists
- Port conflicts → Check whether services are already running
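
The first two issues can be ruled out in one pass (paths as referenced in the list above; `ss` ships with most modern images):

```bash
# Quick existence checks for the usual suspects
for f in /workspace/ai/.env /workspace/ai/supervisord.conf; do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done

# Port conflicts: see what is already listening on the service ports
ss -tlnp | grep -E ':(8188|9000|9001)' || echo "ports 8188/9000/9001 are free"
```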

### Tailscale not connecting

**Check Tailscale status:**

```bash
tailscale status
tailscale ip -4
```

**Common issues:**
- Missing or invalid `TAILSCALE_AUTHKEY` in `.env`
- Auth key expired → Generate a new key in the Tailscale admin console (then re-authenticate as shown below)
- Firewall blocking → RunPod should allow Tailscale by default
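
With a fresh key in hand, you can re-join the tailnet from inside the pod without restarting it (`--authkey` is a standard `tailscale up` flag):

```bash
# Re-authenticate with a new auth key (replace the placeholder)
tailscale up --authkey "tskey-auth-your_new_key_here"
tailscale ip -4
```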

### Services not starting

**Check Supervisor:**

```bash
supervisorctl -c /workspace/supervisord.conf status
supervisorctl -c /workspace/supervisord.conf tail -f comfyui
```

**Common issues:**
- Broken venv → Re-run `scripts/bootstrap-venvs.sh`
- Models not downloaded → Run the Ansible playbook again
- Python version mismatch → Rebuild the venvs

### Out of VRAM

**Check GPU memory:**

```bash
nvidia-smi
```

**RTX 4090 (24GB) capacity:**
- ComfyUI (FLUX Schnell): ~23GB (can't run alongside vLLM)
- vLLM (Qwen 2.5 7B): ~14GB
- vLLM (Llama 3.1 8B): ~17GB

**Solution:** Only run one service at a time (see the Service Management section)
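
For a single at-a-glance number instead of the full table, `nvidia-smi` supports CSV queries:

```bash
# Used vs. total VRAM in MiB
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```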

### Network volume full

**Check disk usage:**

```bash
df -h /workspace
du -sh /workspace/*
```

**Clean up:**

```bash
# Remove the old HuggingFace cache
rm -rf /workspace/huggingface_cache

# Re-download essential models only
cd /workspace/ai
ansible-playbook playbook.yml --tags comfyui-essential
```

## Cost Optimization

### Spot vs On-Demand

- **Spot instances**: ~70% cheaper, but can be interrupted at any time
- **On-Demand**: More expensive, guaranteed availability

**Recommendation:** Use spot for development, on-demand for production

### Network Volume Pricing

- First 1TB: $0.07/GB/month
- Beyond 1TB: $0.05/GB/month

**200GB volume cost:** 200GB × $0.07/GB/month = $14/month

### Pod Auto-Stop

Configure auto-stop in the RunPod pod settings to save costs when idle:
- Stop after 15 minutes idle
- Stop after 1 hour idle
- Manual stop only

## Advanced Configuration

### Custom Environment Variables

Add these to the template or pod environment variables:

```bash
# Model cache locations
HF_HOME=/workspace/huggingface_cache
TRANSFORMERS_CACHE=/workspace/huggingface_cache

# ComfyUI settings
COMFYUI_PORT=8188
COMFYUI_LISTEN=0.0.0.0

# Orchestrator settings
ORCHESTRATOR_PORT=9000

# GPU settings
CUDA_VISIBLE_DEVICES=0
```

### Multiple Network Volumes

You can attach multiple network volumes for organization:

1. **Models volume** - `/workspace/models` (read-only, shared)
2. **Data volume** - `/workspace/data` (read-write, per-project)

### Custom Startup Script

Override the `/start.sh` behavior by creating `/workspace/custom-start.sh`:

```bash
#!/bin/bash
# Custom startup commands

# Source the default startup
source /start.sh

# Add your custom commands here
echo "Running custom initialization..."
```

## References

- [RunPod Documentation](https://docs.runpod.io/)
- [RunPod Templates Overview](https://docs.runpod.io/pods/templates/overview)
- [Network Volumes Guide](https://docs.runpod.io/storage/network-volumes)
- [ComfyUI Documentation](https://github.com/comfyanonymous/ComfyUI)
- [Supervisor Documentation](http://supervisord.org/)
- [Tailscale Documentation](https://tailscale.com/kb/)

## Support

For issues or questions:
- Check the troubleshooting section above
- Review the log files under `/workspace/logs/`
- Check the RunPod community forums
- Open an issue in the project repository