docs: add comprehensive README for AI model orchestrator
- Architecture overview (process-based orchestrator)
- Installation and setup instructions for RunPod
- Available models (text, image, music generation)
- API endpoints and usage examples
- Integration guide for LiteLLM proxy
- Troubleshooting section with streaming fixes
- Deployment options (RunPod, VPS, Tailscale VPN)
- Adding new models workflow
- Project structure documentation

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on incoming API requests, maximizing GPU utilization on resource-constrained hardware.

## Overview

This orchestrator solves a common problem: **running multiple large AI models on a single GPU**. It does this by:
- Loading models **sequentially** (one at a time) to fit within GPU memory constraints
- **Automatically switching** models based on incoming API requests
- Providing **OpenAI-compatible endpoints** for seamless integration
- Supporting streaming responses for real-time text generation

**Perfect for:**
- RunPod RTX 4090 instances (24GB VRAM)
- Cost-effective GPU servers
- Self-hosted AI infrastructure
- Integration with LiteLLM, Open WebUI, and other AI proxies

## Architecture

```
Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)              ↓
                                   Model Registry
                                   (models.yaml)
                                          ↓
                                  Sequential Loading
                            (stops current, starts new)
```

### Key Components

- **Orchestrator** (`orchestrator_subprocess.py`): FastAPI proxy managing model lifecycle via Python subprocesses
- **Model Registry** (`models.yaml`): YAML configuration defining available models and their properties
- **Model Services**: Individual Python servers (vLLM, OpenedAI-Images, AudioCraft) running specific models
- **Management Scripts**: Bash utilities for starting, stopping, and downloading models

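The switching flow is easiest to see in code. The sketch below is a minimal, simplified illustration of what a subprocess-based orchestrator does on a model switch; it is not the actual `orchestrator_subprocess.py`, and the health-check endpoint and helper names are assumptions.

```python
# Minimal sketch of subprocess-based model switching (illustrative only;
# the real logic lives in orchestrator_subprocess.py).
import os
import subprocess
import time

import requests
import yaml

with open("model-orchestrator/models.yaml") as f:
    REGISTRY = yaml.safe_load(f)  # model name -> properties (port, script, ...)

current = {"name": None, "proc": None}

def switch_model(name: str) -> None:
    """Stop the running model service (if any) and start the requested one."""
    cfg = REGISTRY[name]
    if current["name"] == name:
        return  # already loaded

    # Stop the current service so its VRAM is freed before the next one loads.
    if current["proc"] is not None:
        current["proc"].terminate()
        current["proc"].wait()

    # Start the new service as a plain Python subprocess.
    env = {**os.environ, "PORT": str(cfg["port"])}
    proc = subprocess.Popen(["python3", cfg["service_script"]], env=env)

    # Poll the service (a /health endpoint is assumed) until it answers,
    # up to startup_time_seconds from the registry.
    deadline = time.time() + cfg["startup_time_seconds"]
    while time.time() < deadline:
        try:
            requests.get(f"http://localhost:{cfg['port']}/health", timeout=2)
            break
        except requests.RequestException:
            time.sleep(2)

    current.update(name=name, proc=proc)
```
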
### Why Process-Based?

Unlike the Docker-based version (`orchestrator.py`), the subprocess implementation:
- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
- Starts models faster (direct Python process spawning)
- Simplifies debugging (single process tree)
- Reduces overhead (no container management layer)

## Available Models

### Text Generation (vLLM)
- **qwen-2.5-7b**: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
  - Fast text generation with streaming support
  - Endpoint: `/v1/chat/completions`

### Image Generation (OpenedAI-Images)
- **flux-schnell**: Flux.1 Schnell (14GB VRAM, ~1min startup)
  - High-quality images in 4-5 seconds
  - Endpoint: `/v1/images/generations`

### Music Generation (AudioCraft)
- **musicgen-medium**: MusicGen Medium (11GB VRAM, ~45s startup)
  - Text-to-music generation (60-90s for 30s audio)
  - Endpoint: `/v1/audio/generations`

## Installation

### Prerequisites
- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- Ubuntu/Debian-based system
- HuggingFace account with API token

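A quick way to sanity-check these prerequisites on a fresh pod (purely illustrative; this script is not part of the repo):

```python
# Quick prerequisite check: Python version, GPU visibility, and HF token.
import os
import shutil
import subprocess
import sys

assert sys.version_info >= (3, 10), "Python 3.10+ is required"

if shutil.which("nvidia-smi"):
    # Show GPU name and total VRAM to confirm a CUDA-capable card is attached.
    subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"])
else:
    print("nvidia-smi not found: is a CUDA-capable GPU attached?")

print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
```
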
### Quick Start (RunPod)

1. **Clone the repository:**
   ```bash
   cd /workspace
   git clone https://github.com/yourusername/runpod.git ai
   cd ai
   ```

2. **Configure environment:**
   ```bash
   cp .env.example .env
   # Edit .env and set:
   # - HF_TOKEN=your_huggingface_token
   # - PORT=9000 (orchestrator port)
   ```

3. **Run setup (installs dependencies, downloads models):**
   ```bash
   bash scripts/install.sh
   ```

4. **Start all services:**
   ```bash
   bash scripts/start-all.sh
   ```

5. **Verify the orchestrator is running:**
   ```bash
   curl http://localhost:9000/health
   ```

### Manual Installation

If you prefer step-by-step setup:

```bash
# Install system dependencies via Ansible
ansible-playbook playbook.yml --tags base,python,dependencies

# Download model weights
bash scripts/download-models.sh

# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py
```

## Usage

### Starting Services

```bash
# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py
```

Services start in the background and automatically load models when requested.

### Stopping Services

```bash
# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py
```

### Testing the API

**Text generation (streaming):**
```bash
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'
```

**Image generation:**
```bash
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
```

**Music generation:**
```bash
curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'
```

**List available models:**
```bash
curl http://localhost:9000/v1/models
```

**Health check:**
```bash
curl http://localhost:9000/health
```

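Because the chat endpoint is OpenAI-compatible, the same request can be made from the official `openai` Python SDK by pointing it at the orchestrator. A minimal sketch, assuming the v1.x `openai` package is installed:

```python
# Stream a chat completion from the orchestrator with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; print tokens as they arrive.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
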
## Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management:

**LiteLLM config (`litellm-config.yaml`):**
```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600
```

**Key configuration notes:**
- Use the `hosted_vllm/openai/` prefix for vLLM models served via the orchestrator
- Set `stream: true` to enable token-by-token streaming
- Use the Tailscale IP for secure VPN access from remote services

## Configuration

### Model Registry (`model-orchestrator/models.yaml`)

Models are defined in YAML with these properties:

```yaml
qwen-2.5-7b:
  type: text                      # text, image, or audio
  framework: vllm                 # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                     # GPU memory requirement
  startup_time_seconds: 120       # Max time to wait for health check
  endpoint: /v1/chat/completions  # API endpoint path
  description: "Human-readable description"
```

### Adding New Models

1. **Add model definition to `models.yaml`:**
   ```yaml
   llama-3.1-8b:
     type: text
     framework: vllm
     service_script: models/vllm/server_llama.py
     port: 8004
     vram_gb: 17
     startup_time_seconds: 120
     endpoint: /v1/chat/completions
     description: "Llama 3.1 8B Instruct"
   ```

2. **Create model service script** (`models/vllm/server_llama.py`), for example by launching vLLM's OpenAI-compatible API server:
   ```python
   import os
   import sys

   # Launch vLLM's OpenAI-compatible server for the new model on the
   # port declared in models.yaml (overridable via the PORT env var).
   model = "meta-llama/Llama-3.1-8B-Instruct"
   port = os.getenv("PORT", "8004")

   os.execvp(sys.executable, [
       sys.executable, "-m", "vllm.entrypoints.openai.api_server",
       "--model", model,
       "--port", port,
   ])
   ```

3. **Download model weights:**
   ```bash
   bash scripts/download-models.sh
   ```

4. **Restart orchestrator:**
   ```bash
   bash scripts/stop-all.sh && bash scripts/start-all.sh
   ```

## Deployment Options

### Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:
- Services run as background processes in `/workspace/ai/`
- Managed by `start-all.sh` and `stop-all.sh`
- No systemd (RunPod containers don't support it)

### Option 2: Dedicated VPS with systemd

For production VPS deployment:

1. **Install systemd service:**
   ```bash
   sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable ai-orchestrator
   sudo systemctl start ai-orchestrator
   ```

2. **Check logs:**
   ```bash
   sudo journalctl -u ai-orchestrator -f
   ```

### Option 3: Tailscale VPN Integration

Connect the RunPod GPU to your VPS infrastructure:

1. **On the RunPod instance:**
   ```bash
   # Install Tailscale
   curl -fsSL https://tailscale.com/install.sh | sh

   # Connect to VPN
   tailscale up --advertise-tags=tag:gpu
   ```

2. **On the VPS (configure LiteLLM):**
   ```yaml
   # Use Tailscale IP in litellm-config.yaml
   api_base: http://100.121.199.88:9000/v1
   ```

3. **Benefits:**
   - Secure encrypted tunnel (no public exposure)
   - Low-latency direct connection
   - Access the GPU from any device on your Tailscale network

## API Endpoints

### Orchestrator Management

- `GET /health` - Health check with current model status
- `GET /v1/models` - OpenAI-compatible models list
- `POST /switch` - Manually switch to a specific model

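For example, a model can be pre-warmed before a batch of image requests by calling `/switch` directly. The snippet below is a hypothetical sketch: the exact request schema is defined in `orchestrator_subprocess.py`, and the JSON field name used here is an assumption.

```python
# Hypothetical sketch: pre-warm a model via the orchestrator's /switch endpoint.
# The payload field name ("model") is assumed; check orchestrator_subprocess.py.
import requests

resp = requests.post(
    "http://localhost:9000/switch",
    json={"model": "flux-schnell"},
    timeout=300,  # switching can take 30-120 seconds while weights load
)
resp.raise_for_status()
print(resp.json())
```
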
### Model Endpoints (Proxied)

The orchestrator automatically routes requests based on endpoint and model name:

- `POST /v1/chat/completions` - Text generation (text models)
- `POST /v1/images/generations` - Image generation (image models)
- `POST /v1/audio/generations` - Music generation (audio models)

**Streaming Support:**
- Set `"stream": true` in the request body
- Returns a Server-Sent Events (SSE) stream
- Works through the LiteLLM proxy (with correct configuration)

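For clients that consume the SSE stream directly (without an OpenAI SDK), each event arrives as a `data:` line containing a JSON chunk, terminated by `data: [DONE]`. A minimal sketch using `requests`:

```python
# Read the SSE stream from the chat endpoint and print content deltas.
import json
import requests

payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": True,
}
with requests.post(
    "http://localhost:9000/v1/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank separators
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```
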
## Troubleshooting

### Streaming Returns Empty Deltas

**Symptom:** LiteLLM shows streaming chunks but content is empty:
```json
{"choices":[{"delta":{"content":""},"index":0}]}
```

**Solution:** Use the correct model path format in LiteLLM:
```yaml
# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b
```

### Model Fails to Start

**Check logs:**
```bash
# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50
```

**Common issues:**
- Out of GPU memory: Check VRAM usage with `nvidia-smi`
- Missing model weights: Run `bash scripts/download-models.sh`
- Port conflicts: Check if the port is already in use with `lsof -i :9000`

### Orchestrator Not Responding

**Test model directly (bypass orchestrator):**
```bash
# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
```

### HuggingFace Download Errors

**Symptom:** "Repository not found" or "Authentication required"

**Solution:**
```bash
# Check HF_TOKEN is set
echo $HF_TOKEN

# Set token manually
export HF_TOKEN=your_token_here

# Re-run download
bash scripts/download-models.sh
```

## Project Structure

```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py   # Main orchestrator (process-based)
│   ├── orchestrator.py              # Alternative Docker-based version
│   └── models.yaml                  # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py                # vLLM text generation service
│   │   └── requirements.txt
│   ├── flux/
│   │   ├── server.py                # Flux image generation service
│   │   └── requirements.txt
│   └── musicgen/
│       ├── server.py                # MusicGen audio generation service
│       └── requirements.txt
├── scripts/
│   ├── install.sh                   # Setup script (Ansible + downloads)
│   ├── start-all.sh                 # Start orchestrator + models
│   ├── stop-all.sh                  # Stop all services
│   └── download-models.sh           # Download model weights from HuggingFace
├── systemd/
│   └── ai-orchestrator.service      # systemd service file (for VPS)
├── playbook.yml                     # Ansible playbook for system setup
├── inventory.yml                    # Ansible inventory
├── .env.example                     # Environment variables template
└── README.md                        # This file
```

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
- **Image generation:** 4-5 seconds per image (Flux Schnell)
- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium)

## Future Enhancements

- [ ] Parallel model loading (for multi-GPU setups)
- [ ] Model preloading based on usage patterns
- [ ] Prometheus metrics exporter
- [ ] Web dashboard for model management
- [ ] Support for quantized models (lower VRAM requirements)
- [ ] Add Whisper (speech-to-text)
- [ ] Add XTTS-v2 (text-to-speech with voice cloning)

## License

MIT License - See LICENSE file for details.

## Contributing

Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Test your changes on RunPod
4. Submit a pull request

## Support

- **Issues:** https://github.com/yourusername/runpod/issues
- **Discussions:** https://github.com/yourusername/runpod/discussions
- **Documentation:** This README + inline code comments

## Acknowledgments

Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Fast LLM inference
- [OpenedAI-Images](https://github.com/matatonic/openedai-images) - OpenAI-compatible image generation
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - Music generation
- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework
- [LiteLLM](https://github.com/BerriAI/litellm) - Universal LLM proxy