AI Model Orchestrator for RunPod
A lightweight, process-based AI model orchestrator designed for single-GPU environments. Automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware.
Overview
This orchestrator solves a common problem: how to run multiple large AI models on a single GPU. It does so by:
- Loading models sequentially (one at a time) to fit within GPU memory constraints
- Automatically switching models based on incoming API requests
- Providing OpenAI-compatible endpoints for seamless integration
- Supporting streaming responses for real-time text generation
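Because the endpoints are OpenAI-compatible, existing OpenAI clients can point at the orchestrator directly. As a minimal sketch (assuming the orchestrator is reachable on localhost:9000 and accepts any placeholder API key):

from openai import OpenAI

# Point the standard OpenAI client at the orchestrator instead of api.openai.com
client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50,
)
print(response.choices[0].message.content)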
Perfect for:
- RunPod RTX 4090 instances (24GB VRAM)
- Cost-effective GPU servers
- Self-hosted AI infrastructure
- Integration with LiteLLM, Open WebUI, and other AI proxies
Architecture
Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                   (Optional)            ↓
                                  Model Registry
                                  (models.yaml)
                                         ↓
                                Sequential Loading
                           (stops current, starts new)
Key Components
- Orchestrator (orchestrator_subprocess.py): FastAPI proxy managing model lifecycle via Python subprocesses
- Model Registry (models.yaml): YAML configuration defining available models and their properties
- Model Services: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
- Management Scripts: Bash utilities for starting, stopping, and downloading models
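The sequential-loading behaviour can be sketched in a few lines of Python. This is an illustrative simplification, not the actual orchestrator_subprocess.py code; the function and variable names here are hypothetical:

import os
import subprocess
import time

import requests

current = None  # (name, Popen handle) of the model service currently loaded

def switch_model(name, cfg):
    """Stop the running model service, start the requested one, wait for health."""
    global current
    if current and current[0] == name:
        return                      # requested model is already loaded
    if current:
        current[1].terminate()      # free GPU memory before loading the next model
        current[1].wait()
    proc = subprocess.Popen(
        ["python3", cfg["service_script"]],
        env={**os.environ, "PORT": str(cfg["port"])},
    )
    deadline = time.time() + cfg["startup_time_seconds"]
    while time.time() < deadline:   # poll until the service answers its health check
        try:
            if requests.get(f"http://localhost:{cfg['port']}/health", timeout=2).ok:
                break
        except requests.RequestException:
            time.sleep(2)
    current = (name, proc)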
Why Process-Based?
Unlike the Docker-based version (orchestrator.py), the subprocess implementation:
- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
- Starts models faster (direct Python process spawning)
- Simplifies debugging (single process tree)
- Reduces overhead (no container management layer)
Available Models
Text Generation (vLLM)
- qwen-2.5-7b: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
- Fast text generation with streaming support
- Endpoint: /v1/chat/completions
Image Generation (OpenedAI-Images)
- flux-schnell: Flux.1 Schnell (14GB VRAM, ~1min startup)
- High-quality images in 4-5 seconds
- Endpoint: /v1/images/generations
Music Generation (AudioCraft)
- musicgen-medium: MusicGen Medium (11GB VRAM, ~45s startup)
- Text-to-music generation (60-90s for 30s audio)
- Endpoint: /v1/audio/generations
Installation
Prerequisites
- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- Ubuntu/Debian-based system
- HuggingFace account with API token
Quick Start (RunPod)
1. Clone the repository:

   cd /workspace
   git clone https://github.com/yourusername/runpod.git ai
   cd ai

2. Configure environment:

   cp .env.example .env
   # Edit .env and set:
   # - HF_TOKEN=your_huggingface_token
   # - PORT=9000 (orchestrator port)

3. Run setup (installs dependencies via Ansible):

   ansible-playbook playbook.yml

4. Start all services:

   bash scripts/start-all.sh

5. Verify the orchestrator is running:

   curl http://localhost:9000/health
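To script the readiness check instead of curling by hand, a small poller like the following works (a sketch; the exact /health response body is whatever orchestrator_subprocess.py returns):

import time
import requests

# Poll the orchestrator's health endpoint until it answers, or give up after ~2.5 minutes
for _ in range(30):
    try:
        r = requests.get("http://localhost:9000/health", timeout=2)
        if r.ok:
            print("orchestrator is up:", r.text)
            break
    except requests.RequestException:
        pass
    time.sleep(5)
else:
    raise SystemExit("orchestrator did not become healthy in time")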
Manual Installation
If you prefer step-by-step setup:
# Install system dependencies and download models via Ansible
# Use specific tags for selective installation:
# - base: Base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: Essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm
# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py
Usage
Starting Services
# Start orchestrator + all models
bash scripts/start-all.sh
# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py
Services start in the background and automatically load models when requested.
Stopping Services
# Stop all services
bash scripts/stop-all.sh
# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py
Testing the API
Text generation (streaming):
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Count to 5"}],
"max_tokens": 50,
"stream": true
}'
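The same streaming request from Python, reading the Server-Sent Events line by line (a sketch using the requests library; the chunks follow the OpenAI chat-completions delta format shown above):

import json
import requests

payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": True,
}
# stream=True keeps the connection open so tokens arrive as they are generated
with requests.post("http://localhost:9000/v1/chat/completions",
                   json=payload, stream=True, timeout=600) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)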
Image generation:
curl -X POST http://localhost:9000/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{
"model": "flux-schnell",
"prompt": "A serene mountain landscape at sunset",
"size": "1024x1024"
}'
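The equivalent request from Python, saving the result to disk. This assumes the service returns the OpenAI-style response shape, where each item in data carries either a b64_json payload or a url; adjust if your backend differs:

import base64
import requests

resp = requests.post(
    "http://localhost:9000/v1/images/generations",
    json={
        "model": "flux-schnell",
        "prompt": "A serene mountain landscape at sunset",
        "size": "1024x1024",
    },
    timeout=300,
)
resp.raise_for_status()
item = resp.json()["data"][0]
# Handle both common OpenAI-style payloads: inline base64 or a downloadable URL
image_bytes = (base64.b64decode(item["b64_json"]) if "b64_json" in item
               else requests.get(item["url"], timeout=60).content)
with open("landscape.png", "wb") as f:
    f.write(image_bytes)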
Music generation:
curl -X POST http://localhost:9000/v1/audio/generations \
-H 'Content-Type: application/json' \
-d '{
"model": "musicgen-medium",
"prompt": "Upbeat electronic dance music",
"duration": 30
}'
List available models:
curl http://localhost:9000/v1/models
Health check:
curl http://localhost:9000/health
Integration with LiteLLM
The orchestrator is designed to work behind LiteLLM for unified API management:
LiteLLM config (litellm-config.yaml):
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600
Key configuration notes:
- Use the hosted_vllm/openai/ prefix for vLLM models routed through the orchestrator
- Set stream: true to enable token-by-token streaming
- Use the Tailscale IP for secure VPN access from remote services
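Once LiteLLM is running with this entry, clients talk to the proxy rather than the orchestrator. A sketch with the openai Python client; the proxy URL, port, and key are assumptions about your LiteLLM deployment:

from openai import OpenAI

# Point the client at the LiteLLM proxy (URL/port/key are deployment-specific assumptions)
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-anything")
stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)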
Configuration
Model Registry (model-orchestrator/models.yaml)
Models are defined in YAML with these properties:
qwen-2.5-7b:
  type: text                       # text, image, or audio
  framework: vllm                  # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                      # GPU memory requirement
  startup_time_seconds: 120        # Max time to wait for health check
  endpoint: /v1/chat/completions   # API endpoint path
  description: "Human-readable description"
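For illustration, a script can read this registry with PyYAML, assuming the file is a flat mapping of model names to properties as shown above (the actual parsing inside orchestrator_subprocess.py may differ):

import yaml  # pip install pyyaml

# Load the registry and print each model's key properties (illustrative sketch)
with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)
for name, cfg in registry.items():
    print(f"{name}: type={cfg['type']}, port={cfg['port']}, "
          f"vram={cfg['vram_gb']}GB, endpoint={cfg['endpoint']}")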
Adding New Models
1. Add the model definition to models.yaml:

   llama-3.1-8b:
     type: text
     framework: vllm
     service_script: models/vllm/server_llama.py
     port: 8004
     vram_gb: 17
     startup_time_seconds: 120
     endpoint: /v1/chat/completions
     description: "Llama 3.1 8B Instruct"

2. Create the model service script (models/vllm/server_llama.py). The script wraps vLLM's OpenAI-compatible API server on the port assigned in models.yaml:

   import os
   import subprocess

   # Launch vLLM's OpenAI-compatible server for the new model on the configured port
   model = "meta-llama/Llama-3.1-8B-Instruct"
   port = os.getenv("PORT", "8004")
   subprocess.run([
       "python3", "-m", "vllm.entrypoints.openai.api_server",
       "--model", model,
       "--port", port,
   ])

3. Download the weights: models are downloaded automatically by the playbook when using the appropriate tags (vllm, comfyui, etc.)

4. Restart the orchestrator:

   bash scripts/stop-all.sh && bash scripts/start-all.sh
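After the restart, you can confirm the new entry is exposed through the orchestrator's model list (a quick sketch; assumes the standard OpenAI-style response with a data array):

import requests

models = requests.get("http://localhost:9000/v1/models", timeout=10).json()
print([m["id"] for m in models.get("data", [])])  # should now include "llama-3.1-8b"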
Deployment Options
Option 1: RunPod (Current Setup)
Designed for RunPod's containerized environment:
- Services run as background processes in /workspace/ai/
- Managed by start-all.sh and stop-all.sh
- No systemd (RunPod containers don't support it)
Option 2: Dedicated VPS with systemd
For production VPS deployment:
1. Install the systemd service:

   sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable ai-orchestrator
   sudo systemctl start ai-orchestrator

2. Check logs:

   sudo journalctl -u ai-orchestrator -f
Option 3: Tailscale VPN Integration
Connect RunPod GPU to your VPS infrastructure:
1. On the RunPod instance:

   # Install Tailscale
   curl -fsSL https://tailscale.com/install.sh | sh
   # Connect to the VPN
   tailscale up --advertise-tags=tag:gpu

2. On the VPS (configure LiteLLM):

   # Use the Tailscale IP in litellm-config.yaml
   api_base: http://100.121.199.88:9000/v1

3. Benefits:
- Secure encrypted tunnel (no public exposure)
- Low latency direct connection
- Access GPU from any device on Tailscale network
API Endpoints
Orchestrator Management
- GET /health - Health check with current model status
- GET /v1/models - OpenAI-compatible models list
- POST /switch - Manually switch to a specific model
Model Endpoints (Proxied)
The orchestrator automatically routes requests based on endpoint and model name:
- POST /v1/chat/completions - Text generation (text models)
- POST /v1/images/generations - Image generation (image models)
- POST /v1/audio/generations - Music generation (audio models)
Streaming Support:
- Set "stream": true in the request body
- Returns a Server-Sent Events (SSE) stream
- Works through LiteLLM proxy (with correct configuration)
Troubleshooting
Streaming Returns Empty Deltas
Symptom: LiteLLM shows streaming chunks but content is empty:
{"choices":[{"delta":{"content":""},"index":0}]}
Solution: Use correct model path format in LiteLLM:
# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b
# ❌ Wrong
model: openai/qwen-2.5-7b
Model Fails to Start
Check logs:
# RunPod
ps aux | grep orchestrator
ps aux | grep vllm
# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50
Common issues:
- Out of GPU memory: Check VRAM usage with nvidia-smi
- Missing model weights: Run ansible-playbook playbook.yml --tags vllm or --tags comfyui-essential
- Port conflicts: Check if the port is already in use with lsof -i :9000
Orchestrator Not Responding
Test model directly (bypass orchestrator):
# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py
# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
HuggingFace Download Errors
Symptom: "Repository not found" or "Authentication required"
Solution:
# Check HF_TOKEN is set
echo $HF_TOKEN
# Set token in .env file
echo "HF_TOKEN=your_token_here" >> .env
# Re-run Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm
Project Structure
runpod/
├── model-orchestrator/
│ ├── orchestrator_subprocess.py # Main orchestrator (process-based)
│ ├── orchestrator.py # Alternative Docker-based version
│ └── models.yaml # Model registry configuration
├── models/
│ ├── vllm/
│ │ ├── server.py # vLLM text generation service
│ │ └── requirements.txt
│ └── comfyui/ # ComfyUI for image/video/audio generation
│ ├── start.sh # ComfyUI startup script
│ └── models/ # ComfyUI models directory
├── scripts/
│ ├── start-all.sh # Start orchestrator + models
│ └── stop-all.sh # Stop all services
├── systemd/
│ └── ai-orchestrator.service # systemd service file (for VPS)
├── playbook.yml # Ansible playbook for system setup
├── inventory.yml # Ansible inventory
├── .env.example # Environment variables template
├── COMFYUI_MODELS.md # ComfyUI models usage guide
└── README.md # This file
Performance Notes
- Model switching time: 30-120 seconds (depends on model size)
- Text generation: ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
- Image generation: 4-5 seconds per image (Flux Schnell)
- Music generation: 60-90 seconds for 30s audio (MusicGen Medium)
Future Enhancements
- Parallel model loading (for multi-GPU setups)
- Model preloading based on usage patterns
- Prometheus metrics exporter
- Web dashboard for model management
- Support for quantized models (lower VRAM requirements)
- Add Whisper (speech-to-text)
- Add XTTS-v2 (text-to-speech with voice cloning)
License
MIT License - See LICENSE file for details.
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Test your changes on RunPod
- Submit a pull request
Support
- Issues: https://github.com/yourusername/runpod/issues
- Discussions: https://github.com/yourusername/runpod/discussions
- Documentation: This README + inline code comments
Acknowledgments
Built with:
- vLLM - Fast LLM inference
- OpenedAI-Images - OpenAI-compatible image generation
- AudioCraft - Music generation
- FastAPI - Modern Python web framework
- LiteLLM - Universal LLM proxy