# AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on incoming API requests, maximizing GPU utilization on resource-constrained hardware.

## Overview

This orchestrator addresses a common problem, **running multiple large AI models on a single GPU**, by:

- Loading models **sequentially** (one at a time) to fit within GPU memory constraints
- **Automatically switching** models based on incoming API requests
- Providing **OpenAI-compatible endpoints** for seamless integration
- Supporting streaming responses for real-time text generation

**Perfect for:**

- RunPod RTX 4090 instances (24GB VRAM)
- Cost-effective GPU servers
- Self-hosted AI infrastructure
- Integration with LiteLLM, Open WebUI, and other AI proxies

## Architecture

```
Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)              ↓
                                Model Registry (models.yaml)
                                          ↓
                                Sequential Loading (stops current, starts new)
```

### Key Components

- **Orchestrator** (`orchestrator_subprocess.py`): FastAPI proxy managing the model lifecycle via Python subprocesses
- **Model Registry** (`models.yaml`): YAML configuration defining available models and their properties
- **Model Services**: individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
- **Management Scripts**: Bash utilities for starting, stopping, and downloading models

### Why Process-Based?

Unlike the Docker-based version (`orchestrator.py`), the subprocess implementation:

- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
- Starts models faster (direct Python process spawning)
- Simplifies debugging (single process tree)
- Reduces overhead (no container management layer)

## Available Models

### Text Generation (vLLM)

- **qwen-2.5-7b**: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
  - Fast text generation with streaming support
  - Endpoint: `/v1/chat/completions`

### Image Generation (OpenedAI-Images)

- **flux-schnell**: Flux.1 Schnell (14GB VRAM, ~1min startup)
  - High-quality images in 4-5 seconds
  - Endpoint: `/v1/images/generations`

### Music Generation (AudioCraft)

- **musicgen-medium**: MusicGen Medium (11GB VRAM, ~45s startup)
  - Text-to-music generation (60-90s for 30s of audio)
  - Endpoint: `/v1/audio/generations`

## Installation

### Prerequisites

- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- Ubuntu/Debian-based system
- HuggingFace account with API token

### Quick Start (RunPod)

1. **Clone the repository:**

   ```bash
   cd /workspace
   git clone https://github.com/yourusername/runpod.git ai
   cd ai
   ```

2. **Configure environment:**

   ```bash
   cp .env.example .env
   # Edit .env and set:
   # - HF_TOKEN=your_huggingface_token
   # - PORT=9000 (orchestrator port)
   ```

3. **Run setup (installs dependencies via Ansible):**

   ```bash
   ansible-playbook playbook.yml
   ```

4. **Start all services:**

   ```bash
   bash scripts/start-all.sh
   ```

5. **Verify the orchestrator is running:**

   ```bash
   curl http://localhost:9000/health
   ```
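For scripted or automated setups, the same health check can be polled until the orchestrator is ready before sending the first request. A minimal sketch (the helper name, the 180-second budget, and the use of the `requests` package are assumptions; port 9000 is the default from `.env`):

```python
# wait_for_orchestrator.py -- poll GET /health until the orchestrator answers.
import sys
import time

import requests

URL = "http://localhost:9000/health"
DEADLINE = time.time() + 180  # allow a few minutes for a cold start

while time.time() < DEADLINE:
    try:
        if requests.get(URL, timeout=5).status_code == 200:
            print("orchestrator is healthy")
            sys.exit(0)
    except requests.RequestException:
        pass  # not reachable yet, keep polling
    time.sleep(3)

sys.exit("orchestrator did not become healthy within 180s")
```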
### Manual Installation

If you prefer a step-by-step setup:

```bash
# Install system dependencies and download models via Ansible.
# Use specific tags for selective installation:
# - base: base system packages
# - python: Python runtime via pyenv
# - vllm: vLLM dependencies and Qwen model
# - comfyui: ComfyUI installation
# - comfyui-essential: essential ComfyUI models only
ansible-playbook playbook.yml --tags base,python,vllm

# Start the orchestrator
python3 model-orchestrator/orchestrator_subprocess.py
```

## Usage

### Starting Services

```bash
# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on demand)
python3 model-orchestrator/orchestrator_subprocess.py
```

Services start in the background and automatically load models when requested.

### Stopping Services

```bash
# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py
```

### Testing the API

**Text generation (streaming):**

```bash
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'
```

**Image generation:**

```bash
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
```

**Music generation:**

```bash
curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'
```

**List available models:**

```bash
curl http://localhost:9000/v1/models
```

**Health check:**

```bash
curl http://localhost:9000/health
```

## Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management.

**LiteLLM config (`litellm-config.yaml`):**

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600
```

**Key configuration notes:**

- Use the `hosted_vllm/openai/` prefix for vLLM models served through the orchestrator
- Set `stream: true` to enable token-by-token streaming
- Use the Tailscale IP for secure VPN access from remote services

## Configuration

### Model Registry (`model-orchestrator/models.yaml`)

Models are defined in YAML with these properties:

```yaml
qwen-2.5-7b:
  type: text                       # text, image, or audio
  framework: vllm                  # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                      # GPU memory requirement
  startup_time_seconds: 120        # Max time to wait for health check
  endpoint: /v1/chat/completions   # API endpoint path
  description: "Human-readable description"
```

### Adding New Models

1. **Add a model definition to `models.yaml`:**

   ```yaml
   llama-3.1-8b:
     type: text
     framework: vllm
     service_script: models/vllm/server_llama.py
     port: 8004
     vram_gb: 17
     startup_time_seconds: 120
     endpoint: /v1/chat/completions
     description: "Llama 3.1 8B Instruct"
   ```

2. **Create a model service script** (`models/vllm/server_llama.py`) that launches vLLM's OpenAI-compatible server, for example:

   ```python
   import os
   import sys

   # Hand off to vLLM's OpenAI-compatible API server for the new model.
   model = "meta-llama/Llama-3.1-8B-Instruct"
   port = os.getenv("PORT", "8004")
   os.execvp(sys.executable, [
       sys.executable, "-m", "vllm.entrypoints.openai.api_server",
       "--model", model,
       "--port", port,
   ])
   ```

3. **Model weights are downloaded automatically by the playbook** when using the appropriate tags (`vllm`, `comfyui`, etc.).

4. **Restart the orchestrator:**

   ```bash
   bash scripts/stop-all.sh && bash scripts/start-all.sh
   ```
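Once the orchestrator is back up, the new model can be exercised end to end with any OpenAI-compatible client. A minimal check using the `openai` Python package (the local base URL, the dummy key, and the 300-second timeout are assumptions for an unauthenticated single-GPU setup):

```python
# verify_new_model.py -- send one request to the newly registered model through
# the orchestrator; the first call triggers a model switch, so expect a delay.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="llama-3.1-8b",   # name defined in models.yaml above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
    timeout=300,            # leave room for the switch + model startup
)
print(response.choices[0].message.content)
```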
## Deployment Options

### Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:

- Services run as background processes in `/workspace/ai/`
- Managed by `start-all.sh` and `stop-all.sh`
- No systemd (RunPod containers don't support it)

### Option 2: Dedicated VPS with systemd

For production VPS deployment:

1. **Install the systemd service:**

   ```bash
   sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable ai-orchestrator
   sudo systemctl start ai-orchestrator
   ```

2. **Check logs:**

   ```bash
   sudo journalctl -u ai-orchestrator -f
   ```

### Option 3: Tailscale VPN Integration

Connect the RunPod GPU to your VPS infrastructure:

1. **On the RunPod instance:**

   ```bash
   # Install Tailscale
   curl -fsSL https://tailscale.com/install.sh | sh

   # Connect to the VPN
   tailscale up --advertise-tags=tag:gpu
   ```

2. **On the VPS (configure LiteLLM):**

   ```yaml
   # Use the Tailscale IP in litellm-config.yaml
   api_base: http://100.121.199.88:9000/v1
   ```

3. **Benefits:**

   - Secure encrypted tunnel (no public exposure)
   - Low-latency direct connection
   - Access the GPU from any device on the Tailscale network

## API Endpoints

### Orchestrator Management

- `GET /health` - Health check with current model status
- `GET /v1/models` - OpenAI-compatible models list
- `POST /switch` - Manually switch to a specific model

### Model Endpoints (Proxied)

The orchestrator automatically routes requests based on endpoint and model name:

- `POST /v1/chat/completions` - Text generation (text models)
- `POST /v1/images/generations` - Image generation (image models)
- `POST /v1/audio/generations` - Music generation (audio models)

**Streaming support:**

- Set `"stream": true` in the request body
- Returns a Server-Sent Events (SSE) stream
- Works through the LiteLLM proxy (with the correct configuration)

## Troubleshooting

### Streaming Returns Empty Deltas

**Symptom:** LiteLLM shows streaming chunks but the content is empty:

```json
{"choices":[{"delta":{"content":""},"index":0}]}
```

**Solution:** Use the correct model path format in LiteLLM:

```yaml
# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b
```

### Model Fails to Start

**Check processes and logs:**

```bash
# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50
```

**Common issues:**

- Out of GPU memory: check VRAM usage with `nvidia-smi`
- Missing model weights: run `ansible-playbook playbook.yml --tags vllm` or `--tags comfyui-essential`
- Port conflicts: check whether the port is already in use with `lsof -i :9000`

### Orchestrator Not Responding

**Test the model directly (bypassing the orchestrator):**

```bash
# Start the vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
```

### HuggingFace Download Errors

**Symptom:** "Repository not found" or "Authentication required"

**Solution:**

```bash
# Check that HF_TOKEN is set
echo $HF_TOKEN

# Set the token in the .env file
echo "HF_TOKEN=your_token_here" >> .env

# Re-run the Ansible playbook to download models
ansible-playbook playbook.yml --tags vllm
```
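If the playbook download keeps failing, the weights can also be prefetched manually into the HuggingFace cache with `huggingface_hub`. A sketch (the repo id is an assumption for qwen-2.5-7b; adjust it to whatever model the failing service actually loads):

```python
# prefetch_weights.py -- manually download model weights so the model service
# can start without hitting the Hub again.
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",   # assumed repo for qwen-2.5-7b
    token=os.getenv("HF_TOKEN"),          # same token as in .env
)
```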
## Project Structure

```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py    # Main orchestrator (process-based)
│   ├── orchestrator.py               # Alternative Docker-based version
│   └── models.yaml                   # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py                 # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/                      # ComfyUI for image/video/audio generation
│       ├── start.sh                  # ComfyUI startup script
│       └── models/                   # ComfyUI models directory
├── scripts/
│   ├── start-all.sh                  # Start orchestrator + models
│   └── stop-all.sh                   # Stop all services
├── systemd/
│   └── ai-orchestrator.service       # systemd service file (for VPS)
├── playbook.yml                      # Ansible playbook for system setup
├── inventory.yml                     # Ansible inventory
├── .env.example                      # Environment variables template
├── COMFYUI_MODELS.md                 # ComfyUI models usage guide
└── README.md                         # This file
```

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
- **Image generation:** 4-5 seconds per image (Flux Schnell)
- **Music generation:** 60-90 seconds for 30s of audio (MusicGen Medium)

## Future Enhancements

- [ ] Parallel model loading (for multi-GPU setups)
- [ ] Model preloading based on usage patterns
- [ ] Prometheus metrics exporter
- [ ] Web dashboard for model management
- [ ] Support for quantized models (lower VRAM requirements)
- [ ] Add Whisper (speech-to-text)
- [ ] Add XTTS-v2 (text-to-speech with voice cloning)

## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Test your changes on RunPod
4. Submit a pull request

## Support

- **Issues:** https://github.com/yourusername/runpod/issues
- **Discussions:** https://github.com/yourusername/runpod/discussions
- **Documentation:** this README + inline code comments

## Acknowledgments

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Fast LLM inference
- [OpenedAI-Images](https://github.com/matatonic/openedai-images) - OpenAI-compatible image generation
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - Music generation
- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework
- [LiteLLM](https://github.com/BerriAI/litellm) - Universal LLM proxy