AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware.

Overview

This orchestrator solves a common problem: running multiple large AI models on a single GPU. It does this by:

  • Loading models sequentially (one at a time) to fit within GPU memory constraints
  • Automatically switching models based on incoming API requests
  • Providing OpenAI-compatible endpoints for seamless integration
  • Supporting streaming responses for real-time text generation

Perfect for:

  • RunPod RTX 4090 instances (24GB VRAM)
  • Cost-effective GPU servers
  • Self-hosted AI infrastructure
  • Integration with LiteLLM, Open WebUI, and other AI proxies

Architecture

Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)       ↓
                              Model Registry
                              (models.yaml)
                                   ↓
                            Sequential Loading
                         (stops current, starts new)

Key Components

  • Orchestrator (orchestrator_subprocess.py): FastAPI proxy managing model lifecycle via Python subprocesses
  • Model Registry (models.yaml): YAML configuration defining available models and their properties
  • Model Services: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
  • Management Scripts: Bash utilities for starting, stopping, and downloading models
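
To illustrate the sequential-loading flow, here is a simplified sketch of the switching logic: stop the current subprocess, spawn the next model's service script, and poll its health check within the startup budget. This is not the actual code in orchestrator_subprocess.py; the class and method names are hypothetical, and it assumes each model service exposes a /health endpoint (implied by the registry's startup_time_seconds field) and that model names are top-level keys in models.yaml.

import os
import subprocess
import time

import requests
import yaml


class ModelSwitcher:
    """Sketch of sequential model loading: at most one model service runs at a time."""

    def __init__(self, registry_path="model-orchestrator/models.yaml"):
        # Assumes model names are top-level keys, as in the models.yaml example below
        with open(registry_path) as f:
            self.registry = yaml.safe_load(f)
        self.current_name = None
        self.current_proc = None

    def switch_to(self, name):
        if name == self.current_name:
            return  # requested model is already loaded
        spec = self.registry[name]

        # Stop the running model service first to free GPU memory
        if self.current_proc is not None:
            self.current_proc.terminate()
            self.current_proc.wait()

        # Start the new model's service script as a plain Python subprocess
        self.current_proc = subprocess.Popen(
            ["python3", spec["service_script"]],
            env={**os.environ, "PORT": str(spec["port"])},
        )
        self.current_name = name

        # Wait for the service's health check within its startup budget
        deadline = time.time() + spec["startup_time_seconds"]
        while time.time() < deadline:
            try:
                requests.get(f"http://localhost:{spec['port']}/health", timeout=2)
                return
            except requests.RequestException:
                time.sleep(2)
        raise RuntimeError(f"{name} did not become healthy within the startup budget")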

Why Process-Based?

Unlike the Docker-based version (orchestrator.py), the subprocess implementation:

  • Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
  • Starts models faster (direct Python process spawning)
  • Simplifies debugging (single process tree)
  • Reduces overhead (no container management layer)

Available Models

Text Generation (vLLM)

  • qwen-2.5-7b: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
    • Fast text generation with streaming support
    • Endpoint: /v1/chat/completions

Image Generation (OpenedAI-Images)

  • flux-schnell: Flux.1 Schnell (14GB VRAM, ~1min startup)
    • High-quality images in 4-5 seconds
    • Endpoint: /v1/images/generations

Music Generation (AudioCraft)

  • musicgen-medium: MusicGen Medium (11GB VRAM, ~45s startup)
    • Text-to-music generation (60-90s for 30s audio)
    • Endpoint: /v1/audio/generations

Installation

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (tested on RTX 4090)
  • Ubuntu/Debian-based system
  • HuggingFace account with API token

Quick Start (RunPod)

  1. Clone the repository:

    cd /workspace
    git clone https://github.com/yourusername/runpod.git ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env and set:
    # - HF_TOKEN=your_huggingface_token
    # - PORT=9000 (orchestrator port)
    
  3. Run setup (installs dependencies via Ansible):

    ansible-playbook playbook.yml
    
  4. Start all services:

    bash scripts/start-all.sh
    
  5. Verify orchestrator is running:

    curl http://localhost:9000/health
    

Manual Installation

If you prefer step-by-step setup:

# Install system dependencies and download models via Ansible
# Use specific tags for selective installation:
# - base: Base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: Essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm

# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py

Usage

Starting Services

# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py

Services start in the background and automatically load models when requested.

Stopping Services

# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py

Testing the API

Text generation (streaming):

curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

Image generation:

curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'

Music generation:

curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'

List available models:

curl http://localhost:9000/v1/models

Health check:

curl http://localhost:9000/health

Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management:

LiteLLM config (litellm-config.yaml):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600

Key configuration notes:

  • Use hosted_vllm/openai/ prefix for vLLM models via orchestrator
  • Set stream: true to enable token-by-token streaming
  • Use Tailscale IP for secure VPN access from remote services
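
Once LiteLLM is running with this entry, the standard openai Python client can reach the model through the proxy. A minimal sketch; the base URL and port 4000 are assumptions about your LiteLLM deployment, and the api_key value depends on how the proxy is configured:

from openai import OpenAI

# Point the client at the LiteLLM proxy, not at the orchestrator directly
client = OpenAI(base_url="http://localhost:4000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,  # token-by-token streaming, as enabled in the LiteLLM config
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)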

Configuration

Model Registry (model-orchestrator/models.yaml)

Models are defined in YAML with these properties:

qwen-2.5-7b:
  type: text                          # text, image, or audio
  framework: vllm                     # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                         # GPU memory requirement
  startup_time_seconds: 120           # Max time to wait for health check
  endpoint: /v1/chat/completions      # API endpoint path
  description: "Human-readable description"

Adding New Models

  1. Add model definition to models.yaml:

    llama-3.1-8b:
      type: text
      framework: vllm
      service_script: models/vllm/server_llama.py
      port: 8004
      vram_gb: 17
      startup_time_seconds: 120
      endpoint: /v1/chat/completions
      description: "Llama 3.1 8B Instruct"
    
  2. Create model service script (models/vllm/server_llama.py):

    import os
    import subprocess
    import sys

    # Launch vLLM's OpenAI-compatible API server as a child process
    model = "meta-llama/Llama-3.1-8B-Instruct"
    port = os.getenv("PORT", "8004")
    subprocess.run([
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", port,
    ])
    
  3. Models are downloaded automatically by the playbook when using the appropriate tags (vllm, comfyui, etc.)

  4. Restart orchestrator:

    bash scripts/stop-all.sh && bash scripts/start-all.sh
    

Deployment Options

Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:

  • Services run as background processes in /workspace/ai/
  • Managed by start-all.sh and stop-all.sh
  • No systemd (RunPod containers don't support it)

Option 2: Dedicated VPS with systemd

For production VPS deployment:

  1. Install systemd service:

    sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable ai-orchestrator
    sudo systemctl start ai-orchestrator
    
  2. Check logs:

    sudo journalctl -u ai-orchestrator -f
    

Option 3: Tailscale VPN Integration

Connect RunPod GPU to your VPS infrastructure:

  1. On RunPod instance:

    # Install Tailscale
    curl -fsSL https://tailscale.com/install.sh | sh
    
    # Connect to VPN
    tailscale up --advertise-tags=tag:gpu
    
  2. On VPS (configure LiteLLM):

    # Use Tailscale IP in litellm-config.yaml
    api_base: http://100.121.199.88:9000/v1
    
  3. Benefits:

    • Secure encrypted tunnel (no public exposure)
    • Low latency direct connection
    • Access GPU from any device on Tailscale network

API Endpoints

Orchestrator Management

  • GET /health - Health check with current model status
  • GET /v1/models - OpenAI-compatible models list
  • POST /switch - Manually switch to a specific model
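
The request body for /switch is not documented here; assuming it takes the target model name in a JSON field called model (an assumption, check orchestrator_subprocess.py for the actual schema), a manual switch might look like:

import requests

# Assumed payload shape; verify against the orchestrator's /switch handler
resp = requests.post(
    "http://localhost:9000/switch",
    json={"model": "flux-schnell"},
    timeout=300,  # switching can take 30-120 seconds
)
print(resp.status_code, resp.text)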

Model Endpoints (Proxied)

The orchestrator automatically routes requests based on endpoint and model name:

  • POST /v1/chat/completions - Text generation (text models)
  • POST /v1/images/generations - Image generation (image models)
  • POST /v1/audio/generations - Music generation (audio models)

Streaming Support:

  • Set "stream": true in request body
  • Returns Server-Sent Events (SSE) stream
  • Works through LiteLLM proxy (with correct configuration)
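
When "stream": true is set, the response arrives as Server-Sent Events (data: lines terminated by data: [DONE]). A minimal Python consumer using the requests library (the library choice is an assumption about your client stack):

import json

import requests

payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": True,
}

with requests.post("http://localhost:9000/v1/chat/completions",
                   json=payload, stream=True, timeout=600) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk carries a partial token in choices[0].delta.content
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)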

Troubleshooting

Streaming Returns Empty Deltas

Symptom: LiteLLM shows streaming chunks but content is empty:

{"choices":[{"delta":{"content":""},"index":0}]}

Solution: Use correct model path format in LiteLLM:

# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b

Model Fails to Start

Check logs:

# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50

Common issues:

  • Out of GPU memory: Check VRAM usage with nvidia-smi
  • Missing model weights: Run ansible-playbook playbook.yml --tags vllm or --tags comfyui-essential
  • Port conflicts: Check if port is already in use with lsof -i :9000

Orchestrator Not Responding

Test model directly (bypass orchestrator):

# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'

HuggingFace Download Errors

Symptom: "Repository not found" or "Authentication required"

Solution:

# Check HF_TOKEN is set
echo $HF_TOKEN

# Set token in .env file
echo "HF_TOKEN=your_token_here" >> .env

# Re-run Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (process-based)
│   ├── orchestrator.py             # Alternative Docker-based version
│   └── models.yaml                 # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/                    # ComfyUI for image/video/audio generation
│       ├── start.sh                # ComfyUI startup script
│       └── models/                 # ComfyUI models directory
├── scripts/
│   ├── start-all.sh                # Start orchestrator + models
│   └── stop-all.sh                 # Stop all services
├── systemd/
│   └── ai-orchestrator.service     # systemd service file (for VPS)
├── playbook.yml                    # Ansible playbook for system setup
├── inventory.yml                   # Ansible inventory
├── .env.example                    # Environment variables template
├── COMFYUI_MODELS.md               # ComfyUI models usage guide
└── README.md                       # This file

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
  • Image generation: 4-5 seconds per image (Flux Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Future Enhancements

  • Parallel model loading (for multi-GPU setups)
  • Model preloading based on usage patterns
  • Prometheus metrics exporter
  • Web dashboard for model management
  • Support for quantized models (lower VRAM requirements)
  • Add Whisper (speech-to-text)
  • Add XTTS-v2 (text-to-speech with voice cloning)

License

MIT License - See LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Test your changes on RunPod
  4. Submit a pull request

Support

Acknowledgments

Built with:

  • FastAPI (orchestrator API and request proxying)
  • vLLM (text generation)
  • OpenedAI-Images and ComfyUI (image generation)
  • AudioCraft / MusicGen (music generation)
  • Ansible (system setup and model downloads)
  • LiteLLM and Tailscale (remote integration)
