RunPod Template Setup Guide

This guide explains how to deploy the AI Orchestrator (ComfyUI + vLLM) on RunPod using a custom Docker template and network volume.

Architecture Overview

The deployment uses a two-tier strategy:

  1. Docker Image (software layer) - Contains system packages, Supervisor, Tailscale
  2. Network Volume (data layer) - Contains models, ComfyUI installation, venvs, configuration

This approach allows fast pod deployment (~2-3 minutes) while keeping all large files (models, ~80-200GB) on a persistent network volume.
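
The split implies a network-volume layout roughly like the following (a sketch; exact contents depend on which playbook tags you run, and the paths match those used later in this guide):

/workspace/
├── ai/                   # This repository: playbook, scripts, supervisord config
├── ComfyUI/              # ComfyUI installation and custom_nodes
├── models/               # Downloaded model weights
├── huggingface_cache/    # HF_HOME download cache
├── logs/                 # Supervisor and service logs
└── supervisord.conf      # Config referenced by the supervisorctl commands below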

Prerequisites

  • RunPod account with credits
  • Gitea instance with the container registry enabled (for hosting the template image)
  • HuggingFace account with API token (for model downloads)
  • Tailscale account with auth key (optional, for VPN access)

Step 1: Build and Push Docker Image

Option A: Automated Build via Gitea Workflow (Recommended)

The repository includes a Gitea workflow that automatically builds and pushes the Docker image to your Gitea container registry whenever you push to the main branch or create a version tag.

  1. Configure Gitea Secret:

    • Go to your Gitea repository → Settings → Secrets
    • Add REGISTRY_TOKEN = your Gitea access token with registry permissions
    • (The workflow automatically uses your Gitea username via gitea.actor)
  2. Trigger Build:

    # Push to main branch
    git push origin main
    
    # Or create a version tag
    git tag v1.0.0
    git push origin v1.0.0
    
  3. Monitor Build:

    • Go to Actions tab in Gitea
    • Wait for build to complete (~5-10 minutes)
    • Note the Docker image name: dev.pivoine.art/valknar/runpod-ai-orchestrator:latest
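
Once the build succeeds, you can optionally verify that the image pulls cleanly from the registry (requires your Gitea credentials):

docker login dev.pivoine.art
docker pull dev.pivoine.art/valknar/runpod-ai-orchestrator:latest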

Option B: Manual Build

If you prefer to build manually:

# From the repository root
cd /path/to/runpod

# Build the image
docker build -t dev.pivoine.art/valknar/runpod-ai-orchestrator:latest .

# Login to your Gitea registry
docker login dev.pivoine.art

# Push to Gitea registry
docker push dev.pivoine.art/valknar/runpod-ai-orchestrator:latest

Step 2: Create Network Volume

Network volumes persist your models and data across pod restarts and rebuilds.

  1. Go to RunPod Dashboard → Storage → Network Volumes

  2. Click "New Network Volume"

  3. Configure:

    • Name: ai-orchestrator-models
    • Size: 200GB (adjust based on your needs)
      • Essential models only: ~80GB
      • All models: ~137-200GB
    • Datacenter: Choose one close to you (the volume is tied to its datacenter; pods must deploy in the same datacenter to attach it)
  4. Click "Create Volume"

  5. Note the Volume ID (e.g., vol-abc123def456) for pod deployment

Storage Requirements

Configuration   Size      Models Included
Essential       ~80GB     FLUX Schnell, 1-2 SDXL checkpoints, MusicGen Medium
Complete        ~137GB    All image/video/audio models from playbook
Full + vLLM     ~200GB    Complete + Qwen 2.5 7B + Llama 3.1 8B

Step 3: Create RunPod Template

  1. Go to RunPod Dashboard → Templates

  2. Click "New Template"

  3. Configure Template Settings:

    Container Configuration:

    • Template Name: AI Orchestrator (ComfyUI + vLLM)
    • Template Type: Docker
    • Container Image: dev.pivoine.art/valknar/runpod-ai-orchestrator:latest
    • Container Disk: 50GB (for system and temp files)
    • Docker Command: Leave empty (uses default /start.sh)

    Volume Configuration:

    • Volume Mount Path: /workspace
    • Attach to Network Volume: Select your volume ID from Step 2

    Port Configuration:

    • Expose HTTP Ports: 8188, 9000, 9001
      • 8188 - ComfyUI web interface
      • 9000 - Model orchestrator API
      • 9001 - Supervisor web UI
    • Expose TCP Ports: 22 (SSH access)

    Environment Variables:

    HF_TOKEN=your_huggingface_token_here
    TAILSCALE_AUTHKEY=tskey-auth-your_tailscale_authkey_here
    SUPERVISOR_BACKEND_HOST=localhost
    SUPERVISOR_BACKEND_PORT=9001
    

    Advanced Settings:

    • Start Jupyter: No
    • Start SSH: Yes (handled by base image)
  4. Click "Save Template"

Step 4: First Deployment (Initial Setup)

The first time you deploy, you need to set up the network volume with models and configuration.

4.1 Deploy Pod

  1. Go to RunPod Dashboard → Pods
  2. Click "Deploy" or "GPU Pods"
  3. Select your custom template: AI Orchestrator (ComfyUI + vLLM)
  4. Configure GPU:
    • GPU Type: RTX 4090 (24GB VRAM) or higher
    • Network Volume: Select your volume from Step 2
    • On-Demand vs Spot: Choose based on budget
  5. Click "Deploy"

4.2 SSH into Pod

# Get pod SSH command from RunPod dashboard
ssh root@<pod-ip> -p <port> -i ~/.ssh/id_ed25519

# Or use RunPod web terminal

4.3 Initial Setup on Network Volume

# 1. Clone the repository to /workspace/ai (use your own Gitea repo URL)
cd /workspace
git clone https://dev.pivoine.art/valknar/runpod.git ai
cd ai

# 2. Create .env file with your credentials
cp .env.example .env
nano .env

# Edit and add:
# HF_TOKEN=your_huggingface_token
# TAILSCALE_AUTHKEY=tskey-auth-your_key
# GPU_TAILSCALE_IP=<will be set automatically>

# 3. Download essential models (this takes 30-60 minutes)
ansible-playbook playbook.yml --tags comfyui-essential

# OR download all models (1-2 hours)
ansible-playbook playbook.yml --tags comfyui-models-all

# 4. Link models to ComfyUI
bash scripts/link-comfyui-models.sh

# OR if arty is available
arty run models/link-comfyui

# 5. Install ComfyUI custom nodes dependencies
cd /workspace/ComfyUI/custom_nodes/ComfyUI-Manager
pip install -r requirements.txt
cd /workspace/ai

# 6. Restart the container to apply all changes
exit
# Go to RunPod dashboard → Stop pod → Start pod

4.4 Verify Services

After restart, SSH back in and check:

# Check supervisor status
supervisorctl -c /workspace/supervisord.conf status

# Expected output:
# comfyui                          RUNNING   pid 123, uptime 0:01:00
# (orchestrator is disabled by default - enable for vLLM)

# Test ComfyUI
curl -I http://localhost:8188

# Test Supervisor web UI
curl -I http://localhost:9001

Step 5: Subsequent Deployments

After initial setup, deploying new pods is quick (2-3 minutes):

  1. Deploy pod with same template + network volume
  2. Wait for startup (~1-2 minutes for services to start)
  3. Access services:
    • ComfyUI: http://<pod-ip>:8188
    • Supervisor: http://<pod-ip>:9001

All models, configuration, and data persist on the network volume!
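
As a quick smoke test from your local machine, you can probe the exposed ports (a minimal sketch; substitute the pod IP and mapped ports from the RunPod dashboard):

#!/bin/bash
# Probe ComfyUI (8188) and the Supervisor UI (9001); add 9000 if the orchestrator is enabled.
# Note: curl -f also treats auth errors (e.g., 401) as failures.
POD_IP=<pod-ip>
for port in 8188 9001; do
  if curl -sf -o /dev/null "http://${POD_IP}:${port}"; then
    echo "port ${port}: OK"
  else
    echo "port ${port}: not responding yet"
  fi
done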

Step 6: Access Services

Via Direct IP (HTTP)

Get pod IP and ports from RunPod dashboard:

ComfyUI:           http://<pod-ip>:8188
Supervisor UI:     http://<pod-ip>:9001
Orchestrator API:  http://<pod-ip>:9000
SSH:               ssh root@<pod-ip> -p <port>

Via Tailscale (VPN)

If you configured TAILSCALE_AUTHKEY, the pod automatically joins your Tailscale network:

  1. Get Tailscale IP:

    ssh root@<pod-ip> -p <port>
    tailscale ip -4
    # Example output: 100.114.60.40
    
  2. Access via Tailscale:

    ComfyUI:      http://<tailscale-ip>:8188
    Supervisor:   http://<tailscale-ip>:9001
    Orchestrator: http://<tailscale-ip>:9000
    SSH:          ssh root@<tailscale-ip>
    
  3. Update LiteLLM config on your VPS with the Tailscale IP

Service Management

Start/Stop Services

# Start all services
supervisorctl -c /workspace/supervisord.conf start all

# Stop all services
supervisorctl -c /workspace/supervisord.conf stop all

# Restart specific service
supervisorctl -c /workspace/supervisord.conf restart comfyui

# View status
supervisorctl -c /workspace/supervisord.conf status

Enable vLLM Models (Text Generation)

By default, only ComfyUI runs (to save VRAM). To enable vLLM:

  1. Stop ComfyUI (frees up VRAM):

    supervisorctl -c /workspace/supervisord.conf stop comfyui
    
  2. Start orchestrator (manages vLLM models):

    supervisorctl -c /workspace/supervisord.conf start orchestrator
    
  3. Test text generation:

    curl -X POST http://localhost:9000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"qwen-2.5-7b","messages":[{"role":"user","content":"Hello"}]}'
    

Switch Back to ComfyUI

# Stop orchestrator (stops all vLLM models)
supervisorctl -c /workspace/supervisord.conf stop orchestrator

# Start ComfyUI
supervisorctl -c /workspace/supervisord.conf start comfyui
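
If you toggle between the two modes often, a small wrapper (a hypothetical helper, not part of the repository) keeps both sequences in one place:

#!/bin/bash
# switch-mode.sh - toggle between ComfyUI and the vLLM orchestrator (hypothetical helper)
# Usage: ./switch-mode.sh comfyui | ./switch-mode.sh vllm
CONF=/workspace/supervisord.conf
case "$1" in
  comfyui)
    supervisorctl -c "$CONF" stop orchestrator
    supervisorctl -c "$CONF" start comfyui
    ;;
  vllm)
    supervisorctl -c "$CONF" stop comfyui
    supervisorctl -c "$CONF" start orchestrator
    ;;
  *)
    echo "usage: $0 {comfyui|vllm}" >&2
    exit 1
    ;;
esac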

Updating the Template

When you make changes to code or configuration:

Update Docker Image

# 1. Make changes to Dockerfile or start.sh
# 2. Push to repository
git add .
git commit -m "Update template configuration"
git push origin main

# 3. Gitea workflow auto-builds new image

# 4. Terminate old pod and deploy new one with updated image

Update Network Volume Data

# SSH into running pod
ssh root@<pod-ip> -p <port>

# Update repository
cd /workspace/ai
git pull

# Re-run Ansible if needed
ansible-playbook playbook.yml --tags <specific-tag>

# Restart services
supervisorctl -c /workspace/supervisord.conf restart all

Troubleshooting

Pod fails to start

Check logs:

# Via SSH
cat /workspace/logs/supervisord.log
cat /workspace/logs/comfyui.err.log

# Via RunPod web terminal
tail -f /workspace/logs/*.log

Common issues:

  • Missing .env file → Create /workspace/ai/.env with required vars
  • Supervisor config not found → Ensure /workspace/supervisord.conf exists (the repository's copy lives at /workspace/ai/supervisord.conf)
  • Port conflicts → Check if services are already running
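
These checks can be scripted in a few lines (paths as used throughout this guide):

test -f /workspace/ai/.env || echo "missing /workspace/ai/.env"
test -f /workspace/supervisord.conf || echo "missing /workspace/supervisord.conf"
ss -tlnp | grep -E ':(8188|9000|9001)'   # shows which service ports are already bound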

Tailscale not connecting

Check Tailscale status:

tailscale status
tailscale ip -4

Common issues:

  • Missing or invalid TAILSCALE_AUTHKEY in .env
  • Auth key expired → Generate new key in Tailscale admin
  • Firewall blocking → RunPod should allow Tailscale by default
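
If the key expired, re-authenticate from inside the pod (assumes the tailscaled daemon is already running):

tailscale up --authkey tskey-auth-<new-key>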

Services not starting

Check Supervisor:

supervisorctl -c /workspace/supervisord.conf status
supervisorctl -c /workspace/supervisord.conf tail -f comfyui

Common issues:

  • venv broken → Re-run scripts/bootstrap-venvs.sh
  • Models not downloaded → Run Ansible playbook again
  • Python version mismatch → Rebuild venvs

Out of VRAM

Check GPU memory:

nvidia-smi
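
For a compact view of just the memory figures:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv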

RTX 4090 (24GB) capacity:

  • ComfyUI (FLUX Schnell): ~23GB (cannot run alongside vLLM on a 24GB GPU)
  • vLLM (Qwen 2.5 7B): ~14GB
  • vLLM (Llama 3.1 8B): ~17GB

Solution: Only run one service at a time (see Service Management section)

Network volume full

Check disk usage:

df -h /workspace
du -sh /workspace/*

Clean up:

# Remove old HuggingFace cache
rm -rf /workspace/huggingface_cache

# Re-download essential models only
cd /workspace/ai
ansible-playbook playbook.yml --tags comfyui-essential

Cost Optimization

Spot vs On-Demand

  • Spot instances: ~70% cheaper, can be interrupted
  • On-Demand: More expensive, guaranteed availability

Recommendation: Use spot for development, on-demand for production

Network Volume Pricing

  • First 1TB: $0.07/GB/month
  • Beyond 1TB: $0.05/GB/month

200GB volume cost: 200GB × $0.07/GB/month = $14/month

Pod Auto-Stop

Configure auto-stop in RunPod pod settings to save costs when idle:

  • Stop after 15 minutes idle
  • Stop after 1 hour idle
  • Manual stop only

Advanced Configuration

Custom Environment Variables

Add to template or pod environment variables:

# Model cache locations
HF_HOME=/workspace/huggingface_cache
TRANSFORMERS_CACHE=/workspace/huggingface_cache

# ComfyUI settings
COMFYUI_PORT=8188
COMFYUI_LISTEN=0.0.0.0

# Orchestrator settings
ORCHESTRATOR_PORT=9000

# GPU settings
CUDA_VISIBLE_DEVICES=0
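
To sanity-check these variables outside RunPod, you can run the same image locally (a sketch; assumes a local NVIDIA GPU, the NVIDIA container runtime, and a scratch directory standing in for the network volume):

docker run --rm --gpus all \
  -e HF_HOME=/workspace/huggingface_cache \
  -e COMFYUI_PORT=8188 \
  -e COMFYUI_LISTEN=0.0.0.0 \
  -v /path/to/local/workspace:/workspace \
  -p 8188:8188 \
  dev.pivoine.art/valknar/runpod-ai-orchestrator:latest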

Multiple Network Volumes

You can attach multiple network volumes for organization:

  1. Models volume - /workspace/models (read-only, shared)
  2. Data volume - /workspace/data (read-write, per-project)

Custom Startup Script

Override the default /start.sh behavior by creating /workspace/custom-start.sh (the example below re-sources the default startup and then extends it):

#!/bin/bash
# Custom startup commands

# Source default startup
source /start.sh

# Add your custom commands here
echo "Running custom initialization..."

Support

For issues or questions:

  • Check troubleshooting section above
  • Review /workspace/logs/ files
  • Check RunPod community forums
  • Open issue in project repository