AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on incoming API requests, maximizing GPU utilization on resource-constrained hardware.

Overview

This orchestrator solves a common problem: serving multiple large AI models from a single GPU. It does so by:

  • Loading models sequentially (one at a time) to fit within GPU memory constraints
  • Automatically switching models based on incoming API requests
  • Providing OpenAI-compatible endpoints for seamless integration
  • Supporting streaming responses for real-time text generation

Perfect for:

  • RunPod RTX 4090 instances (24GB VRAM)
  • Cost-effective GPU servers
  • Self-hosted AI infrastructure
  • Integration with LiteLLM, Open WebUI, and other AI proxies

Architecture

Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)       ↓
                              Model Registry
                              (models.yaml)
                                   ↓
                            Sequential Loading
                         (stops current, starts new)

Key Components

  • Orchestrator (orchestrator_subprocess.py): FastAPI proxy managing model lifecycle via Python subprocesses
  • Model Registry (models.yaml): YAML configuration defining available models and their properties
  • Model Services: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
  • Management Scripts: Bash utilities for starting, stopping, and downloading models

Why Process-Based?

Unlike the Docker-based version (orchestrator.py), the subprocess implementation:

  • Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
  • Starts models faster (direct Python process spawning)
  • Simplifies debugging (single process tree)
  • Reduces overhead (no container management layer)
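
The core of this approach fits in a few lines. The sketch below is illustrative only (it is not the actual orchestrator_subprocess.py; the /health path, SIGTERM handling, and timeouts are assumptions):

import os
import signal
import subprocess
import sys
import time
import urllib.request

current_proc = None  # the model service currently holding the GPU, if any


def switch_model(script: str, port: int, startup_timeout: int = 120) -> subprocess.Popen:
    """Stop whichever model service is running, then start and health-check a new one."""
    global current_proc
    if current_proc is not None and current_proc.poll() is None:
        current_proc.send_signal(signal.SIGTERM)  # let the old service release GPU memory
        current_proc.wait(timeout=60)

    env = {**os.environ, "PORT": str(port)}
    current_proc = subprocess.Popen([sys.executable, script], env=env)

    # Poll the service's health endpoint until it is ready (or give up).
    deadline = time.time() + startup_timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=2)
            return current_proc  # ready: the orchestrator can now proxy requests to it
        except OSError:
            time.sleep(2)
    raise RuntimeError(f"model service on port {port} did not become healthy in time")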

Available Models

Text Generation (vLLM)

  • qwen-2.5-7b: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
    • Fast text generation with streaming support
    • Endpoint: /v1/chat/completions

Image Generation (OpenedAI-Images)

  • flux-schnell: Flux.1 Schnell (14GB VRAM, ~1min startup)
    • High-quality images in 4-5 seconds
    • Endpoint: /v1/images/generations

Music Generation (AudioCraft)

  • musicgen-medium: MusicGen Medium (11GB VRAM, ~45s startup)
    • Text-to-music generation (60-90s for 30s audio)
    • Endpoint: /v1/audio/generations

Installation

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (tested on RTX 4090)
  • Ubuntu/Debian-based system
  • HuggingFace account with API token

Quick Start (RunPod)

  1. Clone the repository:

    cd /workspace
    git clone https://github.com/yourusername/runpod.git ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env and set:
    # - HF_TOKEN=your_huggingface_token
    # - PORT=9000 (orchestrator port)
    
  3. Run setup (installs dependencies via Ansible):

    ansible-playbook playbook.yml
    
  4. Start all services:

    bash scripts/start-all.sh
    
  5. Verify orchestrator is running:

    curl http://localhost:9000/health
    

Manual Installation

If you prefer step-by-step setup:

# Install system dependencies and download models via Ansible
# Use specific tags for selective installation:
# - base: Base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: Essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm

# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py

Usage

Starting Services

# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py

Services start in the background and automatically load models when requested.

Stopping Services

# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py

Testing the API

Text generation (streaming):

curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

Image generation:

curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'

Music generation:

curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'

List available models:

curl http://localhost:9000/v1/models

Health check:

curl http://localhost:9000/health
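
Because the endpoints are OpenAI-compatible, the official openai Python client can also be pointed at the orchestrator. A minimal sketch (assuming the openai>=1.0 package is installed; the dummy API key assumes the orchestrator does not enforce authentication):

from openai import OpenAI

# Point the client at the orchestrator instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)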

Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management:

LiteLLM config (litellm-config.yaml):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600

Key configuration notes:

  • Use the hosted_vllm/openai/ prefix for vLLM models served through the orchestrator
  • Set stream: true to enable token-by-token streaming
  • Use Tailscale IP for secure VPN access from remote services
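
Once LiteLLM sits in front of the orchestrator, clients talk to the proxy using the model_name from the config, not the hosted_vllm/... path. A sketch (the port 4000 and the API key are assumptions; use whatever your LiteLLM proxy is configured with):

from openai import OpenAI

# Talk to the LiteLLM proxy rather than the orchestrator directly.
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-litellm-key")

resp = client.chat.completions.create(
    model="qwen-2.5-7b",  # the model_name from litellm-config.yaml
    messages=[{"role": "user", "content": "Hello through LiteLLM"}],
)
print(resp.choices[0].message.content)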

Configuration

Model Registry (model-orchestrator/models.yaml)

Models are defined in YAML with these properties:

qwen-2.5-7b:
  type: text                          # text, image, or audio
  framework: vllm                     # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                         # GPU memory requirement
  startup_time_seconds: 120           # Max time to wait for health check
  endpoint: /v1/chat/completions      # API endpoint path
  description: "Human-readable description"

Adding New Models

  1. Add model definition to models.yaml:

    llama-3.1-8b:
      type: text
      framework: vllm
      service_script: models/vllm/server_llama.py
      port: 8004
      vram_gb: 17
      startup_time_seconds: 120
      endpoint: /v1/chat/completions
      description: "Llama 3.1 8B Instruct"
    
  2. Create model service script (models/vllm/server_llama.py):

    import os
    import subprocess
    import sys
    
    # Start vLLM's OpenAI-compatible API server as a child process,
    # passing the model and port via its standard CLI flags.
    model = "meta-llama/Llama-3.1-8B-Instruct"
    port = os.getenv("PORT", "8004")
    subprocess.run(
        [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
         "--model", model, "--port", port],
        check=True,
    )
    
  3. Models are downloaded automatically by the playbook when using the appropriate tags (vllm, comfyui, etc.)

  4. Restart orchestrator:

    bash scripts/stop-all.sh && bash scripts/start-all.sh
    

Deployment Options

Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:

  • Services run as background processes in /workspace/ai/
  • Managed by start-all.sh and stop-all.sh
  • No systemd (RunPod containers don't support it)

Option 2: Dedicated VPS with systemd

For production VPS deployment:

  1. Install systemd service:

    sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable ai-orchestrator
    sudo systemctl start ai-orchestrator
    
  2. Check logs:

    sudo journalctl -u ai-orchestrator -f
    

Option 3: Tailscale VPN Integration

Connect RunPod GPU to your VPS infrastructure:

  1. On RunPod instance:

    # Install Tailscale
    curl -fsSL https://tailscale.com/install.sh | sh
    
    # Connect to VPN
    tailscale up --advertise-tags=tag:gpu
    
  2. On VPS (configure LiteLLM):

    # Use Tailscale IP in litellm-config.yaml
    api_base: http://100.121.199.88:9000/v1
    
  3. Benefits:

    • Secure encrypted tunnel (no public exposure)
    • Low-latency direct connection
    • Access the GPU from any device on your Tailscale network

API Endpoints

Orchestrator Management

  • GET /health - Health check with current model status
  • GET /v1/models - OpenAI-compatible models list
  • POST /switch - Manually switch to a specific model
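
These management endpoints can be exercised from Python as well. A sketch using requests (the /switch request body shown is an assumption; check orchestrator_subprocess.py for the exact schema):

import requests

BASE = "http://localhost:9000"

print(requests.get(f"{BASE}/health").json())     # current model and status
print(requests.get(f"{BASE}/v1/models").json())  # OpenAI-style model list

# Manually switch models; the JSON body here is a guess at the expected schema.
resp = requests.post(f"{BASE}/switch", json={"model": "flux-schnell"})
print(resp.status_code, resp.text)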

Model Endpoints (Proxied)

The orchestrator automatically routes requests based on endpoint and model name:

  • POST /v1/chat/completions - Text generation (text models)
  • POST /v1/images/generations - Image generation (image models)
  • POST /v1/audio/generations - Music generation (audio models)

Streaming Support:

  • Set "stream": true in request body
  • Returns Server-Sent Events (SSE) stream
  • Works through LiteLLM proxy (with correct configuration)
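
For clients that speak raw HTTP, the stream can be consumed line by line. A sketch using requests (it assumes the usual OpenAI-style "data: ... [DONE]" SSE framing):

import json
import requests

payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": True,
}
with requests.post("http://localhost:9000/v1/chat/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)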

Troubleshooting

Streaming Returns Empty Deltas

Symptom: LiteLLM shows streaming chunks but content is empty:

{"choices":[{"delta":{"content":""},"index":0}]}

Solution: Use correct model path format in LiteLLM:

# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b

Model Fails to Start

Check processes and logs:

# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50

Common issues:

  • Out of GPU memory: Check VRAM usage with nvidia-smi
  • Missing model weights: Run ansible-playbook playbook.yml --tags vllm or --tags comfyui-essential
  • Port conflicts: Check if port is already in use with lsof -i :9000

Orchestrator Not Responding

Test model directly (bypass orchestrator):

# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'

HuggingFace Download Errors

Symptom: "Repository not found" or "Authentication required"

Solution:

# Check HF_TOKEN is set
echo $HF_TOKEN

# Set token in .env file
echo "HF_TOKEN=your_token_here" >> .env

# Re-run Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (process-based)
│   ├── orchestrator.py             # Alternative Docker-based version
│   └── models.yaml                 # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/                    # ComfyUI for image/video/audio generation
│       ├── start.sh                # ComfyUI startup script
│       └── models/                 # ComfyUI models directory
├── scripts/
│   ├── start-all.sh                # Start orchestrator + models
│   └── stop-all.sh                 # Stop all services
├── systemd/
│   └── ai-orchestrator.service     # systemd service file (for VPS)
├── playbook.yml                    # Ansible playbook for system setup
├── inventory.yml                   # Ansible inventory
├── .env.example                    # Environment variables template
├── COMFYUI_MODELS.md               # ComfyUI models usage guide
└── README.md                       # This file

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
  • Image generation: 4-5 seconds per image (Flux Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Future Enhancements

  • Parallel model loading (for multi-GPU setups)
  • Model preloading based on usage patterns
  • Prometheus metrics exporter
  • Web dashboard for model management
  • Support for quantized models (lower VRAM requirements)
  • Add Whisper (speech-to-text)
  • Add XTTS-v2 (text-to-speech with voice cloning)

License

MIT License - See LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Test your changes on RunPod
  4. Submit a pull request

Support

Acknowledgments

Built with:
