From 538609da3e19281cde33c06f14aae41221e7068c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?= Date: Fri, 21 Nov 2025 19:47:43 +0100 Subject: [PATCH] docs: add comprehensive README for AI model orchestrator - Architecture overview (process-based orchestrator) - Installation and setup instructions for RunPod - Available models (text, image, music generation) - API endpoints and usage examples - Integration guide for LiteLLM proxy - Troubleshooting section with streaming fixes - Deployment options (RunPod, VPS, Tailscale VPN) - Adding new models workflow - Project structure documentation Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- README.md | 472 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 472 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..ad92044 --- /dev/null +++ b/README.md @@ -0,0 +1,472 @@ +# AI Model Orchestrator for RunPod + +A lightweight, process-based AI model orchestrator designed for single-GPU environments. Automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware. + +## Overview + +This orchestrator solves a common problem: **running multiple large AI models on a single GPU** by: +- Loading models **sequentially** (one at a time) to fit within GPU memory constraints +- **Automatically switching** models based on incoming API requests +- Providing **OpenAI-compatible endpoints** for seamless integration +- Supporting streaming responses for real-time text generation + +**Perfect for:** +- RunPod RTX 4090 instances (24GB VRAM) +- Cost-effective GPU servers +- Self-hosted AI infrastructure +- Integration with LiteLLM, Open WebUI, and other AI proxies + +## Architecture + +``` +Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+) + (Optional) ↓ + Model Registry + (models.yaml) + ↓ + Sequential Loading + (stops current, starts new) +``` + +### Key Components + +- **Orchestrator** (`orchestrator_subprocess.py`): FastAPI proxy managing model lifecycle via Python subprocesses +- **Model Registry** (`models.yaml`): YAML configuration defining available models and their properties +- **Model Services**: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models +- **Management Scripts**: Bash utilities for starting, stopping, and downloading models + +### Why Process-Based? 
+ +Unlike the Docker-based version (`orchestrator.py`), the subprocess implementation: +- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity) +- Starts models faster (direct Python process spawning) +- Simplifies debugging (single process tree) +- Reduces overhead (no container management layer) + +## Available Models + +### Text Generation (vLLM) +- **qwen-2.5-7b**: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup) + - Fast text generation with streaming support + - Endpoint: `/v1/chat/completions` + +### Image Generation (OpenedAI-Images) +- **flux-schnell**: Flux.1 Schnell (14GB VRAM, ~1min startup) + - High-quality images in 4-5 seconds + - Endpoint: `/v1/images/generations` + +### Music Generation (AudioCraft) +- **musicgen-medium**: MusicGen Medium (11GB VRAM, ~45s startup) + - Text-to-music generation (60-90s for 30s audio) + - Endpoint: `/v1/audio/generations` + +## Installation + +### Prerequisites +- Python 3.10+ +- CUDA-capable GPU (tested on RTX 4090) +- Ubuntu/Debian-based system +- HuggingFace account with API token + +### Quick Start (RunPod) + +1. **Clone the repository:** + ```bash + cd /workspace + git clone https://github.com/yourusername/runpod.git ai + cd ai + ``` + +2. **Configure environment:** + ```bash + cp .env.example .env + # Edit .env and set: + # - HF_TOKEN=your_huggingface_token + # - PORT=9000 (orchestrator port) + ``` + +3. **Run setup (installs dependencies, downloads models):** + ```bash + bash scripts/install.sh + ``` + +4. **Start all services:** + ```bash + bash scripts/start-all.sh + ``` + +5. **Verify orchestrator is running:** + ```bash + curl http://localhost:9000/health + ``` + +### Manual Installation + +If you prefer step-by-step setup: + +```bash +# Install system dependencies via Ansible +ansible-playbook playbook.yml --tags base,python,dependencies + +# Download model weights +bash scripts/download-models.sh + +# Start orchestrator +python3 model-orchestrator/orchestrator_subprocess.py +``` + +## Usage + +### Starting Services + +```bash +# Start orchestrator + all models +bash scripts/start-all.sh + +# Start orchestrator only (models start on-demand) +python3 model-orchestrator/orchestrator_subprocess.py +``` + +Services start in the background and automatically load models when requested. 
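+
+Because only one model fits on the GPU at a time, a request that names a model other than the active one makes the orchestrator stop the current service and start the requested one, waiting up to that model's `startup_time_seconds` for its health check. Below is a minimal sketch of what this looks like from the client side (the exact JSON shape returned by `/health` is whatever `orchestrator_subprocess.py` emits):
+
+```bash
+# Check which model (if any) is currently active
+curl -s http://localhost:9000/health
+
+# A request that names a different model triggers a switch; expect roughly the
+# model's startup time (~2 min for qwen-2.5-7b) before the first tokens arrive
+curl -s -X POST http://localhost:9000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
+
+# The health check should now report qwen-2.5-7b as the active model
+curl -s http://localhost:9000/health
+```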
+
+### Stopping Services
+
+```bash
+# Stop all services
+bash scripts/stop-all.sh
+
+# Stop orchestrator only (leaves models running)
+pkill -f orchestrator_subprocess.py
+```
+
+### Testing the API
+
+**Text generation (streaming):**
+```bash
+curl -s -N -X POST http://localhost:9000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "qwen-2.5-7b",
+    "messages": [{"role": "user", "content": "Count to 5"}],
+    "max_tokens": 50,
+    "stream": true
+  }'
+```
+
+**Image generation:**
+```bash
+curl -X POST http://localhost:9000/v1/images/generations \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "flux-schnell",
+    "prompt": "A serene mountain landscape at sunset",
+    "size": "1024x1024"
+  }'
+```
+
+**Music generation:**
+```bash
+curl -X POST http://localhost:9000/v1/audio/generations \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "musicgen-medium",
+    "prompt": "Upbeat electronic dance music",
+    "duration": 30
+  }'
+```
+
+**List available models:**
+```bash
+curl http://localhost:9000/v1/models
+```
+
+**Health check:**
+```bash
+curl http://localhost:9000/health
+```
+
+## Integration with LiteLLM
+
+The orchestrator is designed to work behind LiteLLM for unified API management:
+
+**LiteLLM config (`litellm-config.yaml`):**
+```yaml
+model_list:
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: hosted_vllm/openai/qwen-2.5-7b
+      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
+      api_key: dummy
+      stream: true
+      timeout: 600
+```
+
+**Key configuration notes:**
+- Use the `hosted_vllm/openai/` prefix for vLLM models served via the orchestrator
+- Set `stream: true` to enable token-by-token streaming
+- Use the Tailscale IP for secure VPN access from remote services
+
+## Configuration
+
+### Model Registry (`model-orchestrator/models.yaml`)
+
+Models are defined in YAML with these properties:
+
+```yaml
+qwen-2.5-7b:
+  type: text                      # text, image, or audio
+  framework: vllm                 # vllm, openedai-images, audiocraft
+  service_script: models/vllm/server.py
+  port: 8000
+  vram_gb: 14                     # GPU memory requirement
+  startup_time_seconds: 120       # Max time to wait for health check
+  endpoint: /v1/chat/completions  # API endpoint path
+  description: "Human-readable description"
+```
+
+### Adding New Models
+
+1. **Add model definition to `models.yaml`:**
+   ```yaml
+   llama-3.1-8b:
+     type: text
+     framework: vllm
+     service_script: models/vllm/server_llama.py
+     port: 8004
+     vram_gb: 17
+     startup_time_seconds: 120
+     endpoint: /v1/chat/completions
+     description: "Llama 3.1 8B Instruct"
+   ```
+
+2. **Create model service script** (`models/vllm/server_llama.py`):
+   ```python
+   import os
+   import subprocess
+   import sys
+
+   # Launch vLLM's OpenAI-compatible API server (the orchestrator sets PORT)
+   model = "meta-llama/Llama-3.1-8B-Instruct"
+   port = os.getenv("PORT", "8004")
+
+   subprocess.run([
+       sys.executable, "-m", "vllm.entrypoints.openai.api_server",
+       "--model", model,
+       "--port", port,
+   ], check=True)
+   ```
+
+3. **Download model weights:**
+   ```bash
+   bash scripts/download-models.sh
+   ```
+
+4. **Restart orchestrator:**
+   ```bash
+   bash scripts/stop-all.sh && bash scripts/start-all.sh
+   ```
+
+## Deployment Options
+
+### Option 1: RunPod (Current Setup)
+
+Designed for RunPod's containerized environment:
+- Services run as background processes in `/workspace/ai/`
+- Managed by `start-all.sh` and `stop-all.sh`
+- No systemd (RunPod containers don't support it)
+
+### Option 2: Dedicated VPS with systemd
+
+For production VPS deployment:
+
+1. 
**Install systemd service:** + ```bash + sudo cp systemd/ai-orchestrator.service /etc/systemd/system/ + sudo systemctl daemon-reload + sudo systemctl enable ai-orchestrator + sudo systemctl start ai-orchestrator + ``` + +2. **Check logs:** + ```bash + sudo journalctl -u ai-orchestrator -f + ``` + +### Option 3: Tailscale VPN Integration + +Connect RunPod GPU to your VPS infrastructure: + +1. **On RunPod instance:** + ```bash + # Install Tailscale + curl -fsSL https://tailscale.com/install.sh | sh + + # Connect to VPN + tailscale up --advertise-tags=tag:gpu + ``` + +2. **On VPS (configure LiteLLM):** + ```yaml + # Use Tailscale IP in litellm-config.yaml + api_base: http://100.121.199.88:9000/v1 + ``` + +3. **Benefits:** + - Secure encrypted tunnel (no public exposure) + - Low latency direct connection + - Access GPU from any device on Tailscale network + +## API Endpoints + +### Orchestrator Management + +- `GET /health` - Health check with current model status +- `GET /v1/models` - OpenAI-compatible models list +- `POST /switch` - Manually switch to a specific model + +### Model Endpoints (Proxied) + +The orchestrator automatically routes requests based on endpoint and model name: + +- `POST /v1/chat/completions` - Text generation (text models) +- `POST /v1/images/generations` - Image generation (image models) +- `POST /v1/audio/generations` - Music generation (audio models) + +**Streaming Support:** +- Set `"stream": true` in request body +- Returns Server-Sent Events (SSE) stream +- Works through LiteLLM proxy (with correct configuration) + +## Troubleshooting + +### Streaming Returns Empty Deltas + +**Symptom:** LiteLLM shows streaming chunks but content is empty: +```json +{"choices":[{"delta":{"content":""},"index":0}]} +``` + +**Solution:** Use correct model path format in LiteLLM: +```yaml +# ✅ Correct +model: hosted_vllm/openai/qwen-2.5-7b + +# ❌ Wrong +model: openai/qwen-2.5-7b +``` + +### Model Fails to Start + +**Check logs:** +```bash +# RunPod +ps aux | grep orchestrator +ps aux | grep vllm + +# VPS with systemd +sudo journalctl -u ai-orchestrator -n 50 +``` + +**Common issues:** +- Out of GPU memory: Check VRAM usage with `nvidia-smi` +- Missing model weights: Run `bash scripts/download-models.sh` +- Port conflicts: Check if port is already in use with `lsof -i :9000` + +### Orchestrator Not Responding + +**Test model directly (bypass orchestrator):** +```bash +# Start vLLM service directly +cd /workspace/ai +python3 models/vllm/server.py + +# Test on port 8000 +curl -X POST http://localhost:8000/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}' +``` + +### HuggingFace Download Errors + +**Symptom:** "Repository not found" or "Authentication required" + +**Solution:** +```bash +# Check HF_TOKEN is set +echo $HF_TOKEN + +# Set token manually +export HF_TOKEN=your_token_here + +# Re-run download +bash scripts/download-models.sh +``` + +## Project Structure + +``` +runpod/ +├── model-orchestrator/ +│ ├── orchestrator_subprocess.py # Main orchestrator (process-based) +│ ├── orchestrator.py # Alternative Docker-based version +│ └── models.yaml # Model registry configuration +├── models/ +│ ├── vllm/ +│ │ ├── server.py # vLLM text generation service +│ │ └── requirements.txt +│ ├── flux/ +│ │ ├── server.py # Flux image generation service +│ │ └── requirements.txt +│ └── musicgen/ +│ ├── server.py # MusicGen audio generation service +│ └── requirements.txt +├── scripts/ +│ ├── install.sh # 
Setup script (Ansible + downloads) +│ ├── start-all.sh # Start orchestrator + models +│ ├── stop-all.sh # Stop all services +│ └── download-models.sh # Download model weights from HuggingFace +├── systemd/ +│ └── ai-orchestrator.service # systemd service file (for VPS) +├── playbook.yml # Ansible playbook for system setup +├── inventory.yml # Ansible inventory +├── .env.example # Environment variables template +└── README.md # This file +``` + +## Performance Notes + +- **Model switching time:** 30-120 seconds (depends on model size) +- **Text generation:** ~20-40 tokens/second (qwen-2.5-7b on RTX 4090) +- **Image generation:** 4-5 seconds per image (Flux Schnell) +- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium) + +## Future Enhancements + +- [ ] Parallel model loading (for multi-GPU setups) +- [ ] Model preloading based on usage patterns +- [ ] Prometheus metrics exporter +- [ ] Web dashboard for model management +- [ ] Support for quantized models (lower VRAM requirements) +- [ ] Add Whisper (speech-to-text) +- [ ] Add XTTS-v2 (text-to-speech with voice cloning) + +## License + +MIT License - See LICENSE file for details. + +## Contributing + +Contributions welcome! Please: +1. Fork the repository +2. Create a feature branch +3. Test your changes on RunPod +4. Submit a pull request + +## Support + +- **Issues:** https://github.com/yourusername/runpod/issues +- **Discussions:** https://github.com/yourusername/runpod/discussions +- **Documentation:** This README + inline code comments + +## Acknowledgments + +Built with: +- [vLLM](https://github.com/vllm-project/vllm) - Fast LLM inference +- [OpenedAI-Images](https://github.com/matatonic/openedai-images) - OpenAI-compatible image generation +- [AudioCraft](https://github.com/facebookresearch/audiocraft) - Music generation +- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework +- [LiteLLM](https://github.com/BerriAI/litellm) - Universal LLM proxy