From 538609da3e19281cde33c06f14aae41221e7068c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?= Date: Fri, 21 Nov 2025 19:47:43 +0100 Subject: [PATCH] docs: add comprehensive README for AI model orchestrator - Architecture overview (process-based orchestrator) - Installation and setup instructions for RunPod - Available models (text, image, music generation) - API endpoints and usage examples - Integration guide for LiteLLM proxy - Troubleshooting section with streaming fixes - Deployment options (RunPod, VPS, Tailscale VPN) - Adding new models workflow - Project structure documentation Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- README.md | 472 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 472 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..ad92044 --- /dev/null +++ b/README.md @@ -0,0 +1,472 @@ +# AI Model Orchestrator for RunPod + +A lightweight, process-based AI model orchestrator designed for single-GPU environments. Automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware. + +## Overview + +This orchestrator solves a common problem: **running multiple large AI models on a single GPU** by: +- Loading models **sequentially** (one at a time) to fit within GPU memory constraints +- **Automatically switching** models based on incoming API requests +- Providing **OpenAI-compatible endpoints** for seamless integration +- Supporting streaming responses for real-time text generation + +**Perfect for:** +- RunPod RTX 4090 instances (24GB VRAM) +- Cost-effective GPU servers +- Self-hosted AI infrastructure +- Integration with LiteLLM, Open WebUI, and other AI proxies + +## Architecture + +``` +Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+) + (Optional) ↓ + Model Registry + (models.yaml) + ↓ + Sequential Loading + (stops current, starts new) +``` + +### Key Components + +- **Orchestrator** (`orchestrator_subprocess.py`): FastAPI proxy managing model lifecycle via Python subprocesses +- **Model Registry** (`models.yaml`): YAML configuration defining available models and their properties +- **Model Services**: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models +- **Management Scripts**: Bash utilities for starting, stopping, and downloading models + +### Why Process-Based? 
+ +Unlike the Docker-based version (`orchestrator.py`), the subprocess implementation: +- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity) +- Starts models faster (direct Python process spawning) +- Simplifies debugging (single process tree) +- Reduces overhead (no container management layer) + +## Available Models + +### Text Generation (vLLM) +- **qwen-2.5-7b**: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup) + - Fast text generation with streaming support + - Endpoint: `/v1/chat/completions` + +### Image Generation (OpenedAI-Images) +- **flux-schnell**: Flux.1 Schnell (14GB VRAM, ~1min startup) + - High-quality images in 4-5 seconds + - Endpoint: `/v1/images/generations` + +### Music Generation (AudioCraft) +- **musicgen-medium**: MusicGen Medium (11GB VRAM, ~45s startup) + - Text-to-music generation (60-90s for 30s audio) + - Endpoint: `/v1/audio/generations` + +## Installation + +### Prerequisites +- Python 3.10+ +- CUDA-capable GPU (tested on RTX 4090) +- Ubuntu/Debian-based system +- HuggingFace account with API token + +### Quick Start (RunPod) + +1. **Clone the repository:** + ```bash + cd /workspace + git clone https://github.com/yourusername/runpod.git ai + cd ai + ``` + +2. **Configure environment:** + ```bash + cp .env.example .env + # Edit .env and set: + # - HF_TOKEN=your_huggingface_token + # - PORT=9000 (orchestrator port) + ``` + +3. **Run setup (installs dependencies, downloads models):** + ```bash + bash scripts/install.sh + ``` + +4. **Start all services:** + ```bash + bash scripts/start-all.sh + ``` + +5. **Verify orchestrator is running:** + ```bash + curl http://localhost:9000/health + ``` + +### Manual Installation + +If you prefer step-by-step setup: + +```bash +# Install system dependencies via Ansible +ansible-playbook playbook.yml --tags base,python,dependencies + +# Download model weights +bash scripts/download-models.sh + +# Start orchestrator +python3 model-orchestrator/orchestrator_subprocess.py +``` + +## Usage + +### Starting Services + +```bash +# Start orchestrator + all models +bash scripts/start-all.sh + +# Start orchestrator only (models start on-demand) +python3 model-orchestrator/orchestrator_subprocess.py +``` + +Services start in the background and automatically load models when requested. 
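+
+Because only one model fits on the GPU at a time, a request that names a model other than the active one makes the orchestrator stop the current service and start the requested one, waiting up to that model's `startup_time_seconds` for its health check. Below is a minimal sketch of what this looks like from the client side (the exact JSON shape returned by `/health` is whatever `orchestrator_subprocess.py` emits):
+
+```bash
+# Check which model (if any) is currently active
+curl -s http://localhost:9000/health
+
+# A request that names a different model triggers a switch; expect roughly the
+# model's startup time (~2 min for qwen-2.5-7b) before the first tokens arrive
+curl -s -X POST http://localhost:9000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
+
+# The health check should now report qwen-2.5-7b as the active model
+curl -s http://localhost:9000/health
+```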
+
+### Stopping Services
+
+```bash
+# Stop all services
+bash scripts/stop-all.sh
+
+# Stop orchestrator only (leaves models running)
+pkill -f orchestrator_subprocess.py
+```
+
+### Testing the API
+
+**Text generation (streaming):**
+```bash
+curl -s -N -X POST http://localhost:9000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "qwen-2.5-7b",
+    "messages": [{"role": "user", "content": "Count to 5"}],
+    "max_tokens": 50,
+    "stream": true
+  }'
+```
+
+**Image generation:**
+```bash
+curl -X POST http://localhost:9000/v1/images/generations \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "flux-schnell",
+    "prompt": "A serene mountain landscape at sunset",
+    "size": "1024x1024"
+  }'
+```
+
+**Music generation:**
+```bash
+curl -X POST http://localhost:9000/v1/audio/generations \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "musicgen-medium",
+    "prompt": "Upbeat electronic dance music",
+    "duration": 30
+  }'
+```
+
+**List available models:**
+```bash
+curl http://localhost:9000/v1/models
+```
+
+**Health check:**
+```bash
+curl http://localhost:9000/health
+```
+
+## Integration with LiteLLM
+
+The orchestrator is designed to work behind LiteLLM for unified API management:
+
+**LiteLLM config (`litellm-config.yaml`):**
+```yaml
+model_list:
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: hosted_vllm/openai/qwen-2.5-7b
+      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
+      api_key: dummy
+      stream: true
+      timeout: 600
+```
+
+**Key configuration notes:**
+- Use the `hosted_vllm/openai/` prefix for vLLM models served via the orchestrator
+- Set `stream: true` to enable token-by-token streaming
+- Use the Tailscale IP for secure VPN access from remote services
+
+## Configuration
+
+### Model Registry (`model-orchestrator/models.yaml`)
+
+Models are defined in YAML with these properties:
+
+```yaml
+qwen-2.5-7b:
+  type: text                      # text, image, or audio
+  framework: vllm                 # vllm, openedai-images, audiocraft
+  service_script: models/vllm/server.py
+  port: 8000
+  vram_gb: 14                     # GPU memory requirement
+  startup_time_seconds: 120       # Max time to wait for health check
+  endpoint: /v1/chat/completions  # API endpoint path
+  description: "Human-readable description"
+```
+
+### Adding New Models
+
+1. **Add model definition to `models.yaml`:**
+   ```yaml
+   llama-3.1-8b:
+     type: text
+     framework: vllm
+     service_script: models/vllm/server_llama.py
+     port: 8004
+     vram_gb: 17
+     startup_time_seconds: 120
+     endpoint: /v1/chat/completions
+     description: "Llama 3.1 8B Instruct"
+   ```
+
+2. **Create model service script** (`models/vllm/server_llama.py`):
+   ```python
+   import os
+   import subprocess
+   import sys
+
+   # Launch vLLM's OpenAI-compatible API server (the orchestrator sets PORT)
+   model = "meta-llama/Llama-3.1-8B-Instruct"
+   port = os.getenv("PORT", "8004")
+
+   subprocess.run([
+       sys.executable, "-m", "vllm.entrypoints.openai.api_server",
+       "--model", model,
+       "--port", port,
+   ], check=True)
+   ```
+
+3. **Download model weights:**
+   ```bash
+   bash scripts/download-models.sh
+   ```
+
+4. **Restart orchestrator:**
+   ```bash
+   bash scripts/stop-all.sh && bash scripts/start-all.sh
+   ```
+
+## Deployment Options
+
+### Option 1: RunPod (Current Setup)
+
+Designed for RunPod's containerized environment:
+- Services run as background processes in `/workspace/ai/`
+- Managed by `start-all.sh` and `stop-all.sh`
+- No systemd (RunPod containers don't support it)
+
+### Option 2: Dedicated VPS with systemd
+
+For production VPS deployment:
+
+1. 
**Install systemd service:** + ```bash + sudo cp systemd/ai-orchestrator.service /etc/systemd/system/ + sudo systemctl daemon-reload + sudo systemctl enable ai-orchestrator + sudo systemctl start ai-orchestrator + ``` + +2. **Check logs:** + ```bash + sudo journalctl -u ai-orchestrator -f + ``` + +### Option 3: Tailscale VPN Integration + +Connect RunPod GPU to your VPS infrastructure: + +1. **On RunPod instance:** + ```bash + # Install Tailscale + curl -fsSL https://tailscale.com/install.sh | sh + + # Connect to VPN + tailscale up --advertise-tags=tag:gpu + ``` + +2. **On VPS (configure LiteLLM):** + ```yaml + # Use Tailscale IP in litellm-config.yaml + api_base: http://100.121.199.88:9000/v1 + ``` + +3. **Benefits:** + - Secure encrypted tunnel (no public exposure) + - Low latency direct connection + - Access GPU from any device on Tailscale network + +## API Endpoints + +### Orchestrator Management + +- `GET /health` - Health check with current model status +- `GET /v1/models` - OpenAI-compatible models list +- `POST /switch` - Manually switch to a specific model + +### Model Endpoints (Proxied) + +The orchestrator automatically routes requests based on endpoint and model name: + +- `POST /v1/chat/completions` - Text generation (text models) +- `POST /v1/images/generations` - Image generation (image models) +- `POST /v1/audio/generations` - Music generation (audio models) + +**Streaming Support:** +- Set `"stream": true` in request body +- Returns Server-Sent Events (SSE) stream +- Works through LiteLLM proxy (with correct configuration) + +## Troubleshooting + +### Streaming Returns Empty Deltas + +**Symptom:** LiteLLM shows streaming chunks but content is empty: +```json +{"choices":[{"delta":{"content":""},"index":0}]} +``` + +**Solution:** Use correct model path format in LiteLLM: +```yaml +# ✅ Correct +model: hosted_vllm/openai/qwen-2.5-7b + +# ❌ Wrong +model: openai/qwen-2.5-7b +``` + +### Model Fails to Start + +**Check logs:** +```bash +# RunPod +ps aux | grep orchestrator +ps aux | grep vllm + +# VPS with systemd +sudo journalctl -u ai-orchestrator -n 50 +``` + +**Common issues:** +- Out of GPU memory: Check VRAM usage with `nvidia-smi` +- Missing model weights: Run `bash scripts/download-models.sh` +- Port conflicts: Check if port is already in use with `lsof -i :9000` + +### Orchestrator Not Responding + +**Test model directly (bypass orchestrator):** +```bash +# Start vLLM service directly +cd /workspace/ai +python3 models/vllm/server.py + +# Test on port 8000 +curl -X POST http://localhost:8000/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}' +``` + +### HuggingFace Download Errors + +**Symptom:** "Repository not found" or "Authentication required" + +**Solution:** +```bash +# Check HF_TOKEN is set +echo $HF_TOKEN + +# Set token manually +export HF_TOKEN=your_token_here + +# Re-run download +bash scripts/download-models.sh +``` + +## Project Structure + +``` +runpod/ +├── model-orchestrator/ +│ ├── orchestrator_subprocess.py # Main orchestrator (process-based) +│ ├── orchestrator.py # Alternative Docker-based version +│ └── models.yaml # Model registry configuration +├── models/ +│ ├── vllm/ +│ │ ├── server.py # vLLM text generation service +│ │ └── requirements.txt +│ ├── flux/ +│ │ ├── server.py # Flux image generation service +│ │ └── requirements.txt +│ └── musicgen/ +│ ├── server.py # MusicGen audio generation service +│ └── requirements.txt +├── scripts/ +│ ├── install.sh # 
Setup script (Ansible + downloads) +│ ├── start-all.sh # Start orchestrator + models +│ ├── stop-all.sh # Stop all services +│ └── download-models.sh # Download model weights from HuggingFace +├── systemd/ +│ └── ai-orchestrator.service # systemd service file (for VPS) +├── playbook.yml # Ansible playbook for system setup +├── inventory.yml # Ansible inventory +├── .env.example # Environment variables template +└── README.md # This file +``` + +## Performance Notes + +- **Model switching time:** 30-120 seconds (depends on model size) +- **Text generation:** ~20-40 tokens/second (qwen-2.5-7b on RTX 4090) +- **Image generation:** 4-5 seconds per image (Flux Schnell) +- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium) + +## Future Enhancements + +- [ ] Parallel model loading (for multi-GPU setups) +- [ ] Model preloading based on usage patterns +- [ ] Prometheus metrics exporter +- [ ] Web dashboard for model management +- [ ] Support for quantized models (lower VRAM requirements) +- [ ] Add Whisper (speech-to-text) +- [ ] Add XTTS-v2 (text-to-speech with voice cloning) + +## License + +MIT License - See LICENSE file for details. + +## Contributing + +Contributions welcome! Please: +1. Fork the repository +2. Create a feature branch +3. Test your changes on RunPod +4. Submit a pull request + +## Support + +- **Issues:** https://github.com/yourusername/runpod/issues +- **Discussions:** https://github.com/yourusername/runpod/discussions +- **Documentation:** This README + inline code comments + +## Acknowledgments + +Built with: +- [vLLM](https://github.com/vllm-project/vllm) - Fast LLM inference +- [OpenedAI-Images](https://github.com/matatonic/openedai-images) - OpenAI-compatible image generation +- [AudioCraft](https://github.com/facebookresearch/audiocraft) - Music generation +- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework +- [LiteLLM](https://github.com/BerriAI/litellm) - Universal LLM proxy