AI Model Orchestrator for RunPod
A lightweight, process-based AI model orchestrator designed for single-GPU environments. Automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware.
Overview
This orchestrator solves a common problem: serving multiple large AI models from a single GPU. It does so by:
- Loading models sequentially (one at a time) to fit within GPU memory constraints
- Automatically switching models based on incoming API requests
- Providing OpenAI-compatible endpoints for seamless integration
- Supporting streaming responses for real-time text generation
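For example, sending a chat request and then an image request to the same orchestrator makes it stop the text model and start the image model transparently between the two calls (endpoints and model names are documented below):
# First request loads the text model...
curl -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
# ...the next request makes the orchestrator swap to the image model automatically
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"model": "flux-schnell", "prompt": "A lighthouse at dawn", "size": "1024x1024"}'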
Perfect for:
- RunPod RTX 4090 instances (24GB VRAM)
- Cost-effective GPU servers
- Self-hosted AI infrastructure
- Integration with LiteLLM, Open WebUI, and other AI proxies
Architecture
Client Request → LiteLLM Proxy (optional) → Orchestrator (port 9000) → Model Service (port 8000+)
                                                  ↓
                                       Model Registry (models.yaml)
                                                  ↓
                                       Sequential Loading (stops current model, starts the new one)
Key Components
- Orchestrator (orchestrator_subprocess.py): FastAPI proxy managing model lifecycle via Python subprocesses
- Model Registry (models.yaml): YAML configuration defining available models and their properties
- Model Services: individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
- Management Scripts: Bash utilities for starting, stopping, and downloading models
Why Process-Based?
Unlike the Docker-based version (orchestrator.py), the subprocess implementation:
- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
- Starts models faster (direct Python process spawning)
- Simplifies debugging (single process tree)
- Reduces overhead (no container management layer)
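Conceptually, a switch is little more than the following shell sketch: stop the current service, spawn the new one as a child process, and poll it until healthy. The real logic (including health URLs and the timeouts from models.yaml) lives in orchestrator_subprocess.py; the paths below are illustrative.
# Hypothetical sketch of a model switch
pkill -f "models/vllm/server.py" || true                      # stop the currently loaded model service
nohup python3 models/vllm/server.py > /tmp/vllm.log 2>&1 &    # start the requested service in the background
until curl -sf http://localhost:8000/health > /dev/null; do   # wait until it reports healthy
  sleep 2
done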
Available Models
Text Generation (vLLM)
- qwen-2.5-7b: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
- Fast text generation with streaming support
- Endpoint: /v1/chat/completions
Image Generation (OpenedAI-Images)
- flux-schnell: Flux.1 Schnell (14GB VRAM, ~1min startup)
- High-quality images in 4-5 seconds
- Endpoint: /v1/images/generations
Music Generation (AudioCraft)
- musicgen-medium: MusicGen Medium (11GB VRAM, ~45s startup)
- Text-to-music generation (60-90s for 30s audio)
- Endpoint: /v1/audio/generations
Installation
Prerequisites
- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- Ubuntu/Debian-based system
- HuggingFace account with API token
Quick Start (RunPod)
1. Clone the repository:
cd /workspace
git clone https://github.com/yourusername/runpod.git ai
cd ai
2. Configure environment:
cp .env.example .env
# Edit .env and set:
# - HF_TOKEN=your_huggingface_token
# - PORT=9000 (orchestrator port)
3. Run setup (installs dependencies via Ansible):
ansible-playbook playbook.yml
4. Start all services:
bash scripts/start-all.sh
5. Verify the orchestrator is running:
curl http://localhost:9000/health
Manual Installation
If you prefer step-by-step setup:
# Install system dependencies and download models via Ansible
# Use specific tags for selective installation:
# - base: Base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: Essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm
# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py
Usage
Starting Services
# Start orchestrator + all models
bash scripts/start-all.sh
# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py
Services start in the background and automatically load models when requested.
Stopping Services
# Stop all services
bash scripts/stop-all.sh
# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py
Testing the API
Text generation (streaming):
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Count to 5"}],
"max_tokens": 50,
"stream": true
}'
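To see only the generated text rather than raw SSE chunks, you can post-process the stream (a sketch that assumes jq is installed; the chunks are standard OpenAI delta objects):
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' \
  | grep --line-buffered '^data: ' \
  | grep --line-buffered -v '\[DONE\]' \
  | sed -u 's/^data: //' \
  | jq -rj '.choices[0].delta.content // empty'; echo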
Image generation:
curl -X POST http://localhost:9000/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{
"model": "flux-schnell",
"prompt": "A serene mountain landscape at sunset",
"size": "1024x1024"
}'
Music generation:
curl -X POST http://localhost:9000/v1/audio/generations \
-H 'Content-Type: application/json' \
-d '{
"model": "musicgen-medium",
"prompt": "Upbeat electronic dance music",
"duration": 30
}'
List available models:
curl http://localhost:9000/v1/models
Health check:
curl http://localhost:9000/health
Integration with LiteLLM
The orchestrator is designed to work behind LiteLLM for unified API management:
LiteLLM config (litellm-config.yaml):
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600
Key configuration notes:
- Use the hosted_vllm/openai/ prefix for vLLM models served through the orchestrator
- Set stream: true to enable token-by-token streaming
- Use the Tailscale IP for secure VPN access from remote services
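Once LiteLLM is running on the VPS, you can verify the full path with a request to the proxy. This sketch assumes LiteLLM's default port 4000 and that a master or virtual key is configured; adjust to your setup:
curl -s -N -X POST http://localhost:4000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-your-litellm-key' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello through LiteLLM"}],
    "stream": true
  }'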
Configuration
Model Registry (model-orchestrator/models.yaml)
Models are defined in YAML with these properties:
qwen-2.5-7b:
  type: text                            # text, image, or audio
  framework: vllm                       # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                           # GPU memory requirement
  startup_time_seconds: 120             # Max time to wait for health check
  endpoint: /v1/chat/completions        # API endpoint path
  description: "Human-readable description"
Adding New Models
1. Add the model definition to models.yaml:
llama-3.1-8b:
  type: text
  framework: vllm
  service_script: models/vllm/server_llama.py
  port: 8004
  vram_gb: 17
  startup_time_seconds: 120
  endpoint: /v1/chat/completions
  description: "Llama 3.1 8B Instruct"
2. Create the model service script (models/vllm/server_llama.py):
import os
import subprocess
import sys

# Launch vLLM's OpenAI-compatible server for Llama 3.1 8B on the configured port
model = "meta-llama/Llama-3.1-8B-Instruct"
port = os.getenv("PORT", "8004")

subprocess.run(
    [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
     "--model", model, "--port", port],
    check=True,
)
3. Model weights are downloaded automatically by the playbook when using the appropriate tags (vllm, comfyui, etc.)
4. Restart the orchestrator:
bash scripts/stop-all.sh && bash scripts/start-all.sh
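After the restart, the first request naming the new model should make the orchestrator stop the current service and start the Llama one; expect a delay of up to startup_time_seconds on that first call:
curl -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "Hi"}]}'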
Deployment Options
Option 1: RunPod (Current Setup)
Designed for RunPod's containerized environment:
- Services run as background processes in /workspace/ai/
- Managed by start-all.sh and stop-all.sh
- No systemd (RunPod containers don't support it)
Option 2: Dedicated VPS with systemd
For production VPS deployment:
1. Install the systemd service:
sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable ai-orchestrator
sudo systemctl start ai-orchestrator
2. Check logs:
sudo journalctl -u ai-orchestrator -f
Option 3: Tailscale VPN Integration
Connect RunPod GPU to your VPS infrastructure:
1. On the RunPod instance:
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Connect to the VPN
tailscale up --advertise-tags=tag:gpu
2. On the VPS (configure LiteLLM):
# Use the Tailscale IP in litellm-config.yaml
api_base: http://100.121.199.88:9000/v1
3. Benefits:
- Secure encrypted tunnel (no public exposure)
- Low latency direct connection
- Access GPU from any device on Tailscale network
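Before pointing LiteLLM at the GPU, you can confirm connectivity over the tailnet (replace the IP with the address reported on the RunPod instance):
# On the RunPod instance: print its Tailscale IPv4 address
tailscale ip -4
# On the VPS: reach the orchestrator over the tailnet
curl http://100.121.199.88:9000/health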
API Endpoints
Orchestrator Management
- GET /health - Health check with current model status
- GET /v1/models - OpenAI-compatible models list
- POST /switch - Manually switch to a specific model
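The /switch endpoint can be used to warm a model before traffic arrives. Its exact request schema is defined in orchestrator_subprocess.py; assuming it accepts a JSON body naming the target model, a call might look like:
# Hypothetical body; check orchestrator_subprocess.py for the exact schema
curl -X POST http://localhost:9000/switch \
  -H 'Content-Type: application/json' \
  -d '{"model": "flux-schnell"}'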
Model Endpoints (Proxied)
The orchestrator automatically routes requests based on endpoint and model name:
- POST /v1/chat/completions - Text generation (text models)
- POST /v1/images/generations - Image generation (image models)
- POST /v1/audio/generations - Music generation (audio models)
Streaming Support:
- Set "stream": true in the request body
- Returns a Server-Sent Events (SSE) stream
- Works through the LiteLLM proxy (with correct configuration)
Troubleshooting
Streaming Returns Empty Deltas
Symptom: LiteLLM shows streaming chunks but content is empty:
{"choices":[{"delta":{"content":""},"index":0}]}
Solution: Use correct model path format in LiteLLM:
# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b
# ❌ Wrong
model: openai/qwen-2.5-7b
Model Fails to Start
Check logs:
# RunPod
ps aux | grep orchestrator
ps aux | grep vllm
# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50
Common issues:
- Out of GPU memory: check VRAM usage with nvidia-smi
- Missing model weights: run ansible-playbook playbook.yml --tags vllm or --tags comfyui-essential
- Port conflicts: check whether the port is already in use with lsof -i :9000
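For example, to inspect VRAM usage and clear a port conflict:
# Show current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Find and stop whatever is holding port 9000 (make sure it is not the orchestrator you want)
lsof -i :9000
kill "$(lsof -t -i :9000)"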
Orchestrator Not Responding
Test model directly (bypass orchestrator):
# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py
# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
HuggingFace Download Errors
Symptom: "Repository not found" or "Authentication required"
Solution:
# Check HF_TOKEN is set
echo $HF_TOKEN
# Set token in .env file
echo "HF_TOKEN=your_token_here" >> .env
# Re-run Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm
Project Structure
runpod/
├── model-orchestrator/
│ ├── orchestrator_subprocess.py # Main orchestrator (process-based)
│ ├── orchestrator.py # Alternative Docker-based version
│ └── models.yaml # Model registry configuration
├── models/
│ ├── vllm/
│ │ ├── server.py # vLLM text generation service
│ │ └── requirements.txt
│ └── comfyui/ # ComfyUI for image/video/audio generation
│ ├── start.sh # ComfyUI startup script
│ └── models/ # ComfyUI models directory
├── scripts/
│ ├── start-all.sh # Start orchestrator + models
│ └── stop-all.sh # Stop all services
├── systemd/
│ └── ai-orchestrator.service # systemd service file (for VPS)
├── playbook.yml # Ansible playbook for system setup
├── inventory.yml # Ansible inventory
├── .env.example # Environment variables template
├── COMFYUI_MODELS.md # ComfyUI models usage guide
└── README.md # This file
Performance Notes
- Model switching time: 30-120 seconds (depends on model size)
- Text generation: ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
- Image generation: 4-5 seconds per image (Flux Schnell)
- Music generation: 60-90 seconds for 30s audio (MusicGen Medium)
Future Enhancements
- Parallel model loading (for multi-GPU setups)
- Model preloading based on usage patterns
- Prometheus metrics exporter
- Web dashboard for model management
- Support for quantized models (lower VRAM requirements)
- Add Whisper (speech-to-text)
- Add XTTS-v2 (text-to-speech with voice cloning)
License
MIT License - See LICENSE file for details.
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Test your changes on RunPod
- Submit a pull request
Support
- Issues: https://github.com/yourusername/runpod/issues
- Discussions: https://github.com/yourusername/runpod/discussions
- Documentation: This README + inline code comments
Acknowledgments
Built with:
- vLLM - Fast LLM inference
- OpenedAI-Images - OpenAI-compatible image generation
- AudioCraft - Music generation
- FastAPI - Modern Python web framework
- LiteLLM - Universal LLM proxy