AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on incoming API requests, maximizing GPU utilization on resource-constrained hardware.

Overview

This orchestrator solves a common problem: serving multiple large AI models from a single GPU. It does so by:

  • Loading models sequentially (one at a time) to fit within GPU memory constraints
  • Automatically switching models based on incoming API requests
  • Providing OpenAI-compatible endpoints for seamless integration
  • Supporting streaming responses for real-time text generation

Perfect for:

  • RunPod RTX 4090 instances (24GB VRAM)
  • Cost-effective GPU servers
  • Self-hosted AI infrastructure
  • Integration with LiteLLM, Open WebUI, and other AI proxies

Architecture

Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)       ↓
                              Model Registry
                              (models.yaml)
                                   ↓
                            Sequential Loading
                         (stops current, starts new)

Key Components

  • Orchestrator (orchestrator_subprocess.py): FastAPI proxy managing model lifecycle via Python subprocesses
  • Model Registry (models.yaml): YAML configuration defining available models and their properties
  • Model Services: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
  • Management Scripts: Bash utilities for starting, stopping, and downloading models

Why Process-Based?

Unlike the Docker-based version (orchestrator.py), the subprocess implementation:

  • Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
  • Starts models faster (direct Python process spawning)
  • Simplifies debugging (single process tree)
  • Reduces overhead (no container management layer)
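
The core of this approach fits in a few lines. The sketch below is illustrative only (it is not the actual orchestrator_subprocess.py; the /health path, SIGTERM handling, and timeouts are assumptions):

import os
import signal
import subprocess
import sys
import time
import urllib.request

current_proc = None  # the model service currently holding the GPU, if any


def switch_model(script: str, port: int, startup_timeout: int = 120) -> subprocess.Popen:
    """Stop whichever model service is running, then start and health-check a new one."""
    global current_proc
    if current_proc is not None and current_proc.poll() is None:
        current_proc.send_signal(signal.SIGTERM)  # let the old service release GPU memory
        current_proc.wait(timeout=60)

    env = {**os.environ, "PORT": str(port)}
    current_proc = subprocess.Popen([sys.executable, script], env=env)

    # Poll the service's health endpoint until it is ready (or give up).
    deadline = time.time() + startup_timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=2)
            return current_proc  # ready: the orchestrator can now proxy requests to it
        except OSError:
            time.sleep(2)
    raise RuntimeError(f"model service on port {port} did not become healthy in time")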

Available Models

Text Generation (vLLM)

  • qwen-2.5-7b: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
    • Fast text generation with streaming support
    • Endpoint: /v1/chat/completions

Image Generation (OpenedAI-Images)

  • flux-schnell: Flux.1 Schnell (14GB VRAM, ~1min startup)
    • High-quality images in 4-5 seconds
    • Endpoint: /v1/images/generations

Music Generation (AudioCraft)

  • musicgen-medium: MusicGen Medium (11GB VRAM, ~45s startup)
    • Text-to-music generation (60-90s for 30s audio)
    • Endpoint: /v1/audio/generations

Installation

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (tested on RTX 4090)
  • Ubuntu/Debian-based system
  • HuggingFace account with API token

Quick Start (RunPod)

  1. Clone the repository:

    cd /workspace
    git clone https://github.com/yourusername/runpod.git ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env and set:
    # - HF_TOKEN=your_huggingface_token
    # - PORT=9000 (orchestrator port)
    
  3. Run setup (installs dependencies via Ansible):

    ansible-playbook playbook.yml
    
  4. Start all services:

    bash scripts/start-all.sh
    
  5. Verify orchestrator is running:

    curl http://localhost:9000/health
    

Manual Installation

If you prefer step-by-step setup:

# Install system dependencies and download models via Ansible
# Use specific tags for selective installation:
# - base: Base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: Essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm

# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py

Usage

Starting Services

# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py

Services start in the background and automatically load models when requested.

Stopping Services

# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py

Testing the API

Text generation (streaming):

curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

Image generation:

curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'

Music generation:

curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'

List available models:

curl http://localhost:9000/v1/models

Health check:

curl http://localhost:9000/health
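
Because the endpoints are OpenAI-compatible, the official openai Python client can also be pointed at the orchestrator. A minimal sketch (assuming the openai>=1.0 package is installed; the dummy API key assumes the orchestrator does not enforce authentication):

from openai import OpenAI

# Point the client at the orchestrator instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)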

Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management:

LiteLLM config (litellm-config.yaml):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600

Key configuration notes:

  • Use the hosted_vllm/openai/ prefix for vLLM models served through the orchestrator
  • Set stream: true to enable token-by-token streaming
  • Use Tailscale IP for secure VPN access from remote services
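
Once LiteLLM sits in front of the orchestrator, clients talk to the proxy using the model_name from the config, not the hosted_vllm/... path. A sketch (the port 4000 and the API key are assumptions; use whatever your LiteLLM proxy is configured with):

from openai import OpenAI

# Talk to the LiteLLM proxy rather than the orchestrator directly.
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-litellm-key")

resp = client.chat.completions.create(
    model="qwen-2.5-7b",  # the model_name from litellm-config.yaml
    messages=[{"role": "user", "content": "Hello through LiteLLM"}],
)
print(resp.choices[0].message.content)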

Configuration

Model Registry (model-orchestrator/models.yaml)

Models are defined in YAML with these properties:

qwen-2.5-7b:
  type: text                          # text, image, or audio
  framework: vllm                     # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                         # GPU memory requirement
  startup_time_seconds: 120           # Max time to wait for health check
  endpoint: /v1/chat/completions      # API endpoint path
  description: "Human-readable description"

Adding New Models

  1. Add model definition to models.yaml:

    llama-3.1-8b:
      type: text
      framework: vllm
      service_script: models/vllm/server_llama.py
      port: 8004
      vram_gb: 17
      startup_time_seconds: 120
      endpoint: /v1/chat/completions
      description: "Llama 3.1 8B Instruct"
    
  2. Create model service script (models/vllm/server_llama.py):

    import os
    import subprocess
    import sys
    
    # Start vLLM's OpenAI-compatible API server as a child process,
    # passing the model and port via its standard CLI flags.
    model = "meta-llama/Llama-3.1-8B-Instruct"
    port = os.getenv("PORT", "8004")
    subprocess.run(
        [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
         "--model", model, "--port", port],
        check=True,
    )
    
  3. Models are downloaded automatically by the playbook when using the appropriate tags (vllm, comfyui, etc.)

  4. Restart orchestrator:

    bash scripts/stop-all.sh && bash scripts/start-all.sh
    

Deployment Options

Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:

  • Services run as background processes in /workspace/ai/
  • Managed by start-all.sh and stop-all.sh
  • No systemd (RunPod containers don't support it)

Option 2: Dedicated VPS with systemd

For production VPS deployment:

  1. Install systemd service:

    sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable ai-orchestrator
    sudo systemctl start ai-orchestrator
    
  2. Check logs:

    sudo journalctl -u ai-orchestrator -f
    

Option 3: Tailscale VPN Integration

Connect RunPod GPU to your VPS infrastructure:

  1. On RunPod instance:

    # Install Tailscale
    curl -fsSL https://tailscale.com/install.sh | sh
    
    # Connect to VPN
    tailscale up --advertise-tags=tag:gpu
    
  2. On VPS (configure LiteLLM):

    # Use Tailscale IP in litellm-config.yaml
    api_base: http://100.121.199.88:9000/v1
    
  3. Benefits:

    • Secure encrypted tunnel (no public exposure)
    • Low-latency direct connection
    • Access the GPU from any device on your Tailscale network

API Endpoints

Orchestrator Management

  • GET /health - Health check with current model status
  • GET /v1/models - OpenAI-compatible models list
  • POST /switch - Manually switch to a specific model
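
These management endpoints can be exercised from Python as well. A sketch using requests (the /switch request body shown is an assumption; check orchestrator_subprocess.py for the exact schema):

import requests

BASE = "http://localhost:9000"

print(requests.get(f"{BASE}/health").json())     # current model and status
print(requests.get(f"{BASE}/v1/models").json())  # OpenAI-style model list

# Manually switch models; the JSON body here is a guess at the expected schema.
resp = requests.post(f"{BASE}/switch", json={"model": "flux-schnell"})
print(resp.status_code, resp.text)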

Model Endpoints (Proxied)

The orchestrator automatically routes requests based on endpoint and model name:

  • POST /v1/chat/completions - Text generation (text models)
  • POST /v1/images/generations - Image generation (image models)
  • POST /v1/audio/generations - Music generation (audio models)

Streaming Support:

  • Set "stream": true in request body
  • Returns Server-Sent Events (SSE) stream
  • Works through LiteLLM proxy (with correct configuration)
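
For clients that speak raw HTTP, the stream can be consumed line by line. A sketch using requests (it assumes the usual OpenAI-style "data: ... [DONE]" SSE framing):

import json
import requests

payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": True,
}
with requests.post("http://localhost:9000/v1/chat/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)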

Troubleshooting

Streaming Returns Empty Deltas

Symptom: LiteLLM shows streaming chunks but content is empty:

{"choices":[{"delta":{"content":""},"index":0}]}

Solution: Use correct model path format in LiteLLM:

# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b

Model Fails to Start

Check processes and logs:

# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50

Common issues:

  • Out of GPU memory: Check VRAM usage with nvidia-smi
  • Missing model weights: Run ansible-playbook playbook.yml --tags vllm or --tags comfyui-essential
  • Port conflicts: Check if port is already in use with lsof -i :9000

Orchestrator Not Responding

Test model directly (bypass orchestrator):

# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'

HuggingFace Download Errors

Symptom: "Repository not found" or "Authentication required"

Solution:

# Check HF_TOKEN is set
echo $HF_TOKEN

# Set token in .env file
echo "HF_TOKEN=your_token_here" >> .env

# Re-run Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (process-based)
│   ├── orchestrator.py             # Alternative Docker-based version
│   └── models.yaml                 # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/                    # ComfyUI for image/video/audio generation
│       ├── start.sh                # ComfyUI startup script
│       └── models/                 # ComfyUI models directory
├── scripts/
│   ├── start-all.sh                # Start orchestrator + models
│   └── stop-all.sh                 # Stop all services
├── systemd/
│   └── ai-orchestrator.service     # systemd service file (for VPS)
├── playbook.yml                    # Ansible playbook for system setup
├── inventory.yml                   # Ansible inventory
├── .env.example                    # Environment variables template
├── COMFYUI_MODELS.md               # ComfyUI models usage guide
└── README.md                       # This file

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
  • Image generation: 4-5 seconds per image (Flux Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Future Enhancements

  • Parallel model loading (for multi-GPU setups)
  • Model preloading based on usage patterns
  • Prometheus metrics exporter
  • Web dashboard for model management
  • Support for quantized models (lower VRAM requirements)
  • Add Whisper (speech-to-text)
  • Add XTTS-v2 (text-to-speech with voice cloning)

License

MIT License - See LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Test your changes on RunPod
  4. Submit a pull request

Support

Acknowledgments

Built with:
