feat: add RunPod Docker template with automated build workflow
- Add Dockerfile with minimal setup (supervisor, tailscale)
- Add start.sh bootstrap script for container initialization
- Add Gitea workflow for automated Docker image builds
- Add comprehensive RUNPOD_TEMPLATE.md documentation
- Add bootstrap-venvs.sh for Python venv health checks

This enables deployment of the AI orchestrator on RunPod using:

- Minimal Docker image (~2-3GB) for fast deployment
- Network volume for models and data persistence (~80-200GB)
- Automated builds on push to main or version tags
- Full Tailscale VPN integration
- Supervisor process management

# RunPod Template Setup Guide

This guide explains how to deploy the AI Orchestrator (ComfyUI + vLLM) on RunPod using a custom Docker template and network volume.

## Architecture Overview

The deployment uses a **two-tier strategy**:

1. **Docker Image** (software layer) - Contains system packages, Supervisor, Tailscale
2. **Network Volume** (data layer) - Contains models, ComfyUI installation, venvs, configuration

This approach allows fast pod deployment (~2-3 minutes) while keeping all large files (models, ~80-200GB) on a persistent network volume.
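
For orientation, the split looks roughly like this on a running pod. The paths below are the ones used later in this guide; the exact contents of your volume may differ:

```bash
# Software layer (baked into the image): /start.sh, supervisor, tailscale
# Data layer (network volume mounted at /workspace):
ls /workspace
# ai/                  - this repository (.env, playbook.yml, scripts/)
# ComfyUI/             - ComfyUI install and custom_nodes/
# huggingface_cache/   - model cache (HF_HOME)
# logs/                - supervisord and per-service logs
# supervisord.conf     - Supervisor configuration
```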

## Prerequisites

- RunPod account with credits
- Docker Hub account (for hosting the template image)
- HuggingFace account with API token (for model downloads; a quick token check follows this list)
- Tailscale account with auth key (optional, for VPN access)
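
The HuggingFace token is the one credential worth verifying before you spend any pod time. A quick check, assuming `HF_TOKEN` is exported in your shell:

```bash
# Returns your account details as JSON if the token is valid, an error otherwise
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```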

## Step 1: Build and Push Docker Image

### Option A: Automated Build (Recommended)

The repository includes a Gitea workflow that automatically builds and pushes the Docker image when you push to the `main` branch or create a version tag.

1. **Configure Gitea Secrets:**
   - Go to your Gitea repository → Settings → Secrets
   - Add `DOCKER_USERNAME` = your Docker Hub username
   - Add `DOCKER_PASSWORD` = your Docker Hub password or access token

2. **Trigger Build:**

   ```bash
   # Push to main branch
   git push origin main

   # Or create a version tag
   git tag v1.0.0
   git push origin v1.0.0
   ```

3. **Monitor Build:**
   - Go to the Actions tab in Gitea
   - Wait for the build to complete (~5-10 minutes)
   - Note the Docker image name: `your-username/runpod-ai-orchestrator:latest` (you can verify the push as shown below)
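
To confirm the workflow actually published the image, pull it back from Docker Hub and inspect it:

```bash
# Pull and inspect the image the workflow just pushed
docker pull your-username/runpod-ai-orchestrator:latest
docker image inspect --format '{{.Size}} bytes' your-username/runpod-ai-orchestrator:latest
```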

### Option B: Manual Build

If you prefer to build manually:

```bash
# From the repository root
cd /path/to/runpod

# Build the image
docker build -t your-username/runpod-ai-orchestrator:latest .

# Push to Docker Hub
docker login
docker push your-username/runpod-ai-orchestrator:latest
```
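
If you also want the image available under a version tag (matching the tag-triggered builds in Option A), tag and push it explicitly:

```bash
# Publish the same image under a version tag (v1.0.0 is an example)
docker tag your-username/runpod-ai-orchestrator:latest your-username/runpod-ai-orchestrator:v1.0.0
docker push your-username/runpod-ai-orchestrator:v1.0.0
```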

## Step 2: Create Network Volume

Network volumes persist your models and data across pod restarts and rebuilds.

1. **Go to RunPod Dashboard → Storage → Network Volumes**

2. **Click "New Network Volume"**

3. **Configure:**
   - **Name**: `ai-orchestrator-models`
   - **Size**: `200GB` (adjust based on your needs)
     - Essential models only: ~80GB
     - All models: ~137-200GB
   - **Datacenter**: Choose the one closest to you (the volume is tied to its datacenter)

4. **Click "Create Volume"**

5. **Note the Volume ID** (e.g., `vol-abc123def456`) for pod deployment

### Storage Requirements

| Configuration | Size | Models Included |
|---------------|------|-----------------|
| Essential | ~80GB | FLUX Schnell, 1-2 SDXL checkpoints, MusicGen Medium |
| Complete | ~137GB | All image/video/audio models from the playbook |
| Full + vLLM | ~200GB | Complete + Qwen 2.5 7B + Llama 3.1 8B |

## Step 3: Create RunPod Template

1. **Go to RunPod Dashboard → Templates**

2. **Click "New Template"**

3. **Configure Template Settings:**

   **Container Configuration:**
   - **Template Name**: `AI Orchestrator (ComfyUI + vLLM)`
   - **Template Type**: Docker
   - **Container Image**: `your-username/runpod-ai-orchestrator:latest`
   - **Container Disk**: `50GB` (for system and temp files)
   - **Docker Command**: Leave empty (uses the default `/start.sh`)

   **Volume Configuration:**
   - **Volume Mount Path**: `/workspace`
   - **Attach to Network Volume**: Select your volume ID from Step 2

   **Port Configuration:**
   - **Expose HTTP Ports**: `8188, 9000, 9001`
     - `8188` - ComfyUI web interface
     - `9000` - Model orchestrator API
     - `9001` - Supervisor web UI
   - **Expose TCP Ports**: `22` (SSH access)

   **Environment Variables:**

   ```
   HF_TOKEN=your_huggingface_token_here
   TAILSCALE_AUTHKEY=tskey-auth-your_tailscale_authkey_here
   SUPERVISOR_BACKEND_HOST=localhost
   SUPERVISOR_BACKEND_PORT=9001
   ```

   **Advanced Settings:**
   - **Start Jupyter**: No
   - **Start SSH**: Yes (handled by the base image)

4. **Click "Save Template"**

## Step 4: First Deployment (Initial Setup)

The first time you deploy, you need to set up the network volume with models and configuration.

### 4.1 Deploy Pod

1. **Go to RunPod Dashboard → Pods**
2. **Click "Deploy"** or "GPU Pods"
3. **Select your custom template**: `AI Orchestrator (ComfyUI + vLLM)`
4. **Configure GPU:**
   - **GPU Type**: RTX 4090 (24GB VRAM) or higher
   - **Network Volume**: Select your volume from Step 2
   - **On-Demand vs Spot**: Choose based on budget
5. **Click "Deploy"**

### 4.2 SSH into Pod

```bash
# Get the pod SSH command from the RunPod dashboard
ssh root@<pod-ip> -p <port> -i ~/.ssh/id_ed25519

# Or use the RunPod web terminal
```

### 4.3 Initial Setup on Network Volume

```bash
# 1. Clone the repository to /workspace/ai
cd /workspace
git clone https://github.com/your-username/runpod.git ai
cd ai

# 2. Create a .env file with your credentials
cp .env.example .env
nano .env

# Edit and add:
# HF_TOKEN=your_huggingface_token
# TAILSCALE_AUTHKEY=tskey-auth-your_key
# GPU_TAILSCALE_IP=<will be set automatically>

# 3. Download essential models (this takes 30-60 minutes)
ansible-playbook playbook.yml --tags comfyui-essential

# OR download all models (1-2 hours)
ansible-playbook playbook.yml --tags comfyui-models-all

# 4. Link models to ComfyUI
bash scripts/link-comfyui-models.sh

# OR, if arty is available
arty run models/link-comfyui

# 5. Install ComfyUI custom node dependencies
cd /workspace/ComfyUI/custom_nodes/ComfyUI-Manager
pip install -r requirements.txt
cd /workspace/ai

# 6. Restart the container to apply all changes
exit
# Then, in the RunPod dashboard: Stop pod → Start pod
```

### 4.4 Verify Services

After the restart, SSH back in and check:

```bash
# Check supervisor status
supervisorctl -c /workspace/supervisord.conf status

# Expected output:
# comfyui    RUNNING   pid 123, uptime 0:01:00
# (orchestrator is disabled by default - enable it for vLLM)

# Test ComfyUI
curl -I http://localhost:8188

# Test the Supervisor web UI
curl -I http://localhost:9001
```

## Step 5: Subsequent Deployments

After initial setup, deploying new pods is quick (2-3 minutes):

1. **Deploy a pod** with the same template + network volume
2. **Wait for startup** (~1-2 minutes for services to start)
3. **Access services:**
   - ComfyUI: `http://<pod-ip>:8188`
   - Supervisor: `http://<pod-ip>:9001`

**All models, configuration, and data persist on the network volume!**
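
If you script these redeployments, a small readiness loop removes the guesswork about when ComfyUI is back (a minimal sketch; substitute your real pod IP for the placeholder):

```bash
# Wait up to ~5 minutes for ComfyUI to answer after a pod starts
POD_URL="http://<pod-ip>:8188"   # placeholder: replace with the real pod IP
for i in $(seq 1 60); do
  if curl -sf -o /dev/null "$POD_URL"; then
    echo "ComfyUI is up"
    break
  fi
  sleep 5
done
```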

## Step 6: Access Services

### Via Direct IP (HTTP)

Get the pod IP and ports from the RunPod dashboard:

```
ComfyUI:           http://<pod-ip>:8188
Supervisor UI:     http://<pod-ip>:9001
Orchestrator API:  http://<pod-ip>:9000
SSH:               ssh root@<pod-ip> -p <port>
```

### Via Tailscale VPN (Recommended)

If you configured `TAILSCALE_AUTHKEY`, the pod automatically joins your Tailscale network:

1. **Get the Tailscale IP:**

   ```bash
   ssh root@<pod-ip> -p <port>
   tailscale ip -4
   # Example output: 100.114.60.40
   ```

2. **Access via Tailscale:**

   ```
   ComfyUI:       http://<tailscale-ip>:8188
   Supervisor:    http://<tailscale-ip>:9001
   Orchestrator:  http://<tailscale-ip>:9000
   SSH:           ssh root@<tailscale-ip>
   ```

3. **Update the LiteLLM config** on your VPS with the Tailscale IP (a quick reachability check follows below)
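
Before editing the LiteLLM config, confirm the pod is reachable from the VPS over the tailnet. This assumes the VPS is on the same tailnet and uses the example IP from step 1:

```bash
# Run these on the VPS; 100.114.60.40 is the example Tailscale IP above
tailscale ping 100.114.60.40
curl -I http://100.114.60.40:8188   # ComfyUI answers when the pod is up
```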

## Service Management

### Start/Stop Services

```bash
# Start all services
supervisorctl -c /workspace/supervisord.conf start all

# Stop all services
supervisorctl -c /workspace/supervisord.conf stop all

# Restart a specific service
supervisorctl -c /workspace/supervisord.conf restart comfyui

# View status
supervisorctl -c /workspace/supervisord.conf status
```

### Enable vLLM Models (Text Generation)

By default, only ComfyUI runs (to save VRAM). To enable vLLM (a mode-switch helper sketch follows this section):

1. **Stop ComfyUI** (frees up VRAM):

   ```bash
   supervisorctl -c /workspace/supervisord.conf stop comfyui
   ```

2. **Start the orchestrator** (manages vLLM models):

   ```bash
   supervisorctl -c /workspace/supervisord.conf start orchestrator
   ```

3. **Test text generation:**

   ```bash
   curl -X POST http://localhost:9000/v1/chat/completions \
     -H 'Content-Type: application/json' \
     -d '{"model":"qwen-2.5-7b","messages":[{"role":"user","content":"Hello"}]}'
   ```

### Switch Back to ComfyUI

```bash
# Stop the orchestrator (stops all vLLM models)
supervisorctl -c /workspace/supervisord.conf stop orchestrator

# Start ComfyUI
supervisorctl -c /workspace/supervisord.conf start comfyui
```
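
Since the two modes are mutually exclusive on a 24GB card, a tiny helper makes switching less error-prone. A sketch, assuming the service names and config path used above:

```bash
#!/bin/bash
# switch-mode.sh - toggle between ComfyUI and vLLM (hypothetical helper)
set -u
CONF=/workspace/supervisord.conf

case "${1:-}" in
  comfyui)
    supervisorctl -c "$CONF" stop orchestrator
    supervisorctl -c "$CONF" start comfyui
    ;;
  vllm)
    supervisorctl -c "$CONF" stop comfyui
    supervisorctl -c "$CONF" start orchestrator
    ;;
  *)
    echo "Usage: $0 {comfyui|vllm}" >&2
    exit 1
    ;;
esac

supervisorctl -c "$CONF" status
```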

## Updating the Template

When you make changes to code or configuration:

### Update Docker Image

```bash
# 1. Make changes to Dockerfile or start.sh
# 2. Push to the repository
git add .
git commit -m "Update template configuration"
git push origin main

# 3. The Gitea workflow auto-builds the new image

# 4. Terminate the old pod and deploy a new one with the updated image
```

### Update Network Volume Data

```bash
# SSH into the running pod
ssh root@<pod-ip> -p <port>

# Update the repository
cd /workspace/ai
git pull

# Re-run Ansible if needed
ansible-playbook playbook.yml --tags <specific-tag>

# Restart services
supervisorctl -c /workspace/supervisord.conf restart all
```

## Troubleshooting

### Pod fails to start

**Check logs:**

```bash
# Via SSH
cat /workspace/logs/supervisord.log
cat /workspace/logs/comfyui.err.log

# Via the RunPod web terminal
tail -f /workspace/logs/*.log
```

**Common issues:**
- Missing `.env` file → Create `/workspace/ai/.env` with the required vars (see the check below)
- Supervisor config not found → Ensure `/workspace/ai/supervisord.conf` exists
- Port conflicts → Check whether services are already running
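
The first two issues can be ruled out in one pass (paths as referenced in the list above; `ss` ships with most modern images):

```bash
# Quick existence checks for the usual suspects
for f in /workspace/ai/.env /workspace/ai/supervisord.conf; do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done

# Port conflicts: see what is already listening on the service ports
ss -tlnp | grep -E ':(8188|9000|9001)' || echo "ports 8188/9000/9001 are free"
```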

### Tailscale not connecting

**Check Tailscale status:**

```bash
tailscale status
tailscale ip -4
```

**Common issues:**
- Missing or invalid `TAILSCALE_AUTHKEY` in `.env`
- Auth key expired → Generate a new key in the Tailscale admin console (then re-authenticate as shown below)
- Firewall blocking → RunPod should allow Tailscale by default
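
With a fresh key in hand, you can re-join the tailnet from inside the pod without restarting it (`--authkey` is a standard `tailscale up` flag):

```bash
# Re-authenticate with a new auth key (replace the placeholder)
tailscale up --authkey "tskey-auth-your_new_key_here"
tailscale ip -4
```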

### Services not starting

**Check Supervisor:**

```bash
supervisorctl -c /workspace/supervisord.conf status
supervisorctl -c /workspace/supervisord.conf tail -f comfyui
```

**Common issues:**
- Broken venv → Re-run `scripts/bootstrap-venvs.sh`
- Models not downloaded → Run the Ansible playbook again
- Python version mismatch → Rebuild the venvs

### Out of VRAM

**Check GPU memory:**

```bash
nvidia-smi
```

**RTX 4090 (24GB) capacity:**
- ComfyUI (FLUX Schnell): ~23GB (can't run alongside vLLM)
- vLLM (Qwen 2.5 7B): ~14GB
- vLLM (Llama 3.1 8B): ~17GB

**Solution:** Only run one service at a time (see the Service Management section)
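
For a single at-a-glance number instead of the full table, `nvidia-smi` supports CSV queries:

```bash
# Used vs. total VRAM in MiB
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```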

### Network volume full

**Check disk usage:**

```bash
df -h /workspace
du -sh /workspace/*
```

**Clean up:**

```bash
# Remove the old HuggingFace cache
rm -rf /workspace/huggingface_cache

# Re-download essential models only
cd /workspace/ai
ansible-playbook playbook.yml --tags comfyui-essential
```

## Cost Optimization

### Spot vs On-Demand

- **Spot instances**: ~70% cheaper, but can be interrupted at any time
- **On-Demand**: More expensive, guaranteed availability

**Recommendation:** Use spot for development, on-demand for production

### Network Volume Pricing

- First 1TB: $0.07/GB/month
- Beyond 1TB: $0.05/GB/month

**200GB volume cost:** 200GB × $0.07/GB/month = $14/month

### Pod Auto-Stop

Configure auto-stop in the RunPod pod settings to save costs when idle:
- Stop after 15 minutes idle
- Stop after 1 hour idle
- Manual stop only

## Advanced Configuration

### Custom Environment Variables

Add these to the template or pod environment variables:

```bash
# Model cache locations
HF_HOME=/workspace/huggingface_cache
TRANSFORMERS_CACHE=/workspace/huggingface_cache

# ComfyUI settings
COMFYUI_PORT=8188
COMFYUI_LISTEN=0.0.0.0

# Orchestrator settings
ORCHESTRATOR_PORT=9000

# GPU settings
CUDA_VISIBLE_DEVICES=0
```

### Multiple Network Volumes

You can attach multiple network volumes for organization:

1. **Models volume** - `/workspace/models` (read-only, shared)
2. **Data volume** - `/workspace/data` (read-write, per-project)

### Custom Startup Script

Override the `/start.sh` behavior by creating `/workspace/custom-start.sh`:

```bash
#!/bin/bash
# Custom startup commands

# Source the default startup
source /start.sh

# Add your custom commands here
echo "Running custom initialization..."
```

## References

- [RunPod Documentation](https://docs.runpod.io/)
- [RunPod Templates Overview](https://docs.runpod.io/pods/templates/overview)
- [Network Volumes Guide](https://docs.runpod.io/storage/network-volumes)
- [ComfyUI Documentation](https://github.com/comfyanonymous/ComfyUI)
- [Supervisor Documentation](http://supervisord.org/)
- [Tailscale Documentation](https://tailscale.com/kb/)

## Support

For issues or questions:
- Check the troubleshooting section above
- Review the log files under `/workspace/logs/`
- Check the RunPod community forums
- Open an issue in the project repository