# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types, with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages the model lifecycle.

### Components

```
┌──────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                  │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ LiteLLM Proxy (Port 4000)                                │    │
│  │ Routes to: Claude API + GPU Orchestrator                 │    │
│  └──────────────────────────┬───────────────────────────────┘    │
└─────────────────────────────┼────────────────────────────────────┘
                              │ Tailscale VPN
┌─────────────────────────────┼────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                    │
│  ┌──────────────────────────▼───────────────────────────────┐    │
│  │ Orchestrator (Port 9000)                                 │    │
│  │ Manages sequential model loading based on request type   │    │
│  └─────┬─────────────────┬─────────────────┬────────────────┘    │
│        │                 │                 │                     │
│  ┌─────▼────────┐  ┌─────▼────────┐  ┌─────▼────────┐            │
│  │ vLLM         │  │ Flux.1       │  │ MusicGen     │            │
│  │ Qwen 2.5 7B  │  │ Schnell      │  │ Medium       │            │
│  │ Port: 8001   │  │ Port: 8002   │  │ Port: 8003   │            │
│  │ VRAM: 14GB   │  │ VRAM: 14GB   │  │ VRAM: 11GB   │            │
│  └──────────────┘  └──────────────┘  └──────────────┘            │
│                                                                   │
│ Only ONE model active at a time (sequential loading)             │
└──────────────────────────────────────────────────────────────────┘
```

### Features

✅ **Automatic Model Switching** - Orchestrator detects the request type and loads the appropriate model (see the sketch below)

✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools

✅ **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)

✅ **Easy Model Addition** - Add new models by editing a YAML config

✅ **Centralized Routing** - LiteLLM proxy provides a unified API for all models

✅ **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time
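
The last two points are what the switching loop boils down to: map the request path to a model service, stop whatever is currently loaded, start the target, and wait for its health check. A minimal sketch of that logic — illustrative only, not the actual `orchestrator.py`; profile names other than `text` are assumptions:

```python
# Minimal sketch of the orchestrator's switching idea (hypothetical, not the real orchestrator.py).
import subprocess
import time
import urllib.request

COMPOSE = ["docker", "compose", "-f", "docker-compose.gpu.yaml"]

# Request path -> model service (services/ports match the diagram above;
# the "image"/"audio" profile names are assumptions).
MODELS = {
    "/v1/chat/completions":   {"service": "vllm-qwen", "profile": "text",  "port": 8001},
    "/v1/images/generations": {"service": "flux",      "profile": "image", "port": 8002},
    "/v1/audio/generations":  {"service": "musicgen",  "profile": "audio", "port": 8003},
}

current = None  # only one model is allowed to be active at a time


def wait_healthy(port: int, timeout: int = 300) -> None:
    """Poll the model's /health endpoint until it answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=2)
            return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"model on port {port} never became healthy")


def ensure_model_for(path: str) -> dict:
    """Stop the active model if it differs, then start the target and wait for it."""
    global current
    target = MODELS[path]
    if current and current["service"] != target["service"]:
        subprocess.run(COMPOSE + ["stop", current["service"]], check=True)
        current = None
    if current is None:
        subprocess.run(COMPOSE + ["--profile", target["profile"], "up", "-d", target["service"]], check=True)
        wait_healthy(target["port"])
        current = target
    return target
```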

## Supported Model Types

### Text Generation

- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
  - VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation

- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
  - VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E-compatible API

### Music Generation

- **MusicGen Medium** (facebook/musicgen-medium)
  - VRAM: 11GB | Speed: 60-90 sec for 30s of audio | Custom audio API

## Quick Start

### 1. Prerequisites

On the RunPod GPU server:

- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env

# Edit .env and add your HF_TOKEN
```

### 3. Deploy to RunPod

```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine

# Navigate to project
cd /workspace/ai/

# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Orchestrator will automatically manage model services as needed
```

### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```

### 5. Update VPS LiteLLM

```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
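
Once LiteLLM has the new routes, clients can talk to the proxy on the VPS instead of the GPU host directly. A short sketch, assuming the orchestrator's models are registered in `litellm-config.yaml` under the same names used above (port 4000 is from the diagram; the API key handling depends on your LiteLLM setup):

```python
# Go through the LiteLLM proxy on the VPS rather than the GPU host directly.
# Assumes the models are registered in litellm-config.yaml under these names.
import openai

client = openai.OpenAI(
    base_url="http://100.102.217.79:4000/v1",  # LiteLLM proxy (port 4000, see diagram)
    api_key="sk-anything",                     # depends on your LiteLLM key configuration
)

# The proxy decides whether the request goes to the Claude API or the GPU orchestrator.
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Ping through the proxy"}],
)
print(response.choices[0].message.content)
```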

## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type a message and send
3. The orchestrator loads vLLM automatically

**Image Generation:**
1. Select model: `flux-schnell`
2. Enter an image prompt
3. The orchestrator switches to Flux.1

**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. The orchestrator switches to MusicGen

### Via API (Direct)

```python
import openai

# Configure client to use the orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
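
What you do with the responses depends on how each backend returns its payload, which this README does not pin down. A sketch under the assumption that the image endpoint follows the DALL-E-style `b64_json`/`url` convention and the audio endpoint returns raw audio bytes:

```python
# Assumptions: images come back DALL-E-style (b64_json or url); audio comes back as raw bytes.
import base64

datum = image.data[0]
if getattr(datum, "b64_json", None):
    with open("sunset.png", "wb") as f:
        f.write(base64.b64decode(datum.b64_json))
else:
    print("Image URL:", datum.url)

# `music` is the requests.Response from the audio endpoint above.
with open("track.wav", "wb") as f:
    f.write(music.content)
```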

## Adding New Models

### Step 1: Update `models.yaml`

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                    # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
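
Before restarting the orchestrator, it can be worth sanity-checking the new entry. A small sketch, assuming the `models:` layout above, PyYAML installed, and the 24GB card from the architecture section:

```python
# Quick sanity check for models.yaml entries (assumes the layout shown above; requires PyYAML).
import yaml

GPU_VRAM_GB = 24  # RTX 4090

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)["models"]

for name, spec in registry.items():
    missing = [k for k in ("type", "docker_service", "port", "vram_gb", "endpoint") if k not in spec]
    if missing:
        print(f"{name}: missing keys {missing}")
    elif spec["vram_gb"] > GPU_VRAM_GB:
        print(f"{name}: needs {spec['vram_gb']}GB VRAM, more than the {GPU_VRAM_GB}GB card")
    else:
        print(f"{name}: ok ({spec['vram_gb']}GB, port {spec['port']})")
```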

### Step 2: Add Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```

### Step 3: Restart Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```

**That's it!** The orchestrator automatically detects the new model.

## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to a specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
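
The same two endpoints are easy to wrap when switching needs to happen from a script rather than the shell. A small sketch using the `/switch` and `/health` routes shown above — the JSON field name comes from the `jq` query, and the assumption is that `/health` reports the new model once the switch completes:

```python
# Thin wrapper around the orchestrator's /switch and /health endpoints.
import time
import requests

ORCHESTRATOR = "http://100.100.108.13:9000"


def current_model() -> str | None:
    """Return the currently loaded model, as reported by /health."""
    return requests.get(f"{ORCHESTRATOR}/health", timeout=10).json().get("current_model")


def switch_model(name: str, timeout_s: int = 180) -> None:
    """Ask the orchestrator to switch, then wait until /health reports the new model."""
    requests.post(f"{ORCHESTRATOR}/switch", json={"model": name}, timeout=10).raise_for_status()
    deadline = time.time() + timeout_s
    while current_model() != name:
        if time.time() > deadline:
            raise TimeoutError(f"switch to {name} did not complete in {timeout_s}s")
        time.sleep(5)  # switching takes 30-120 seconds


if __name__ == "__main__":
    switch_model("flux-schnell")
    print("active model:", current_model())
```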

### Model Services

```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
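
For a rough live view during testing, the same `/health` payload can be polled from a loop. A sketch that reuses the `current_model` and `model_info.vram_gb` fields queried with `jq` above:

```python
# Poll the orchestrator's /health endpoint and print model changes.
import time
import requests

HEALTH_URL = "http://100.100.108.13:9000/health"

last = None
while True:
    info = requests.get(HEALTH_URL, timeout=10).json()
    model = info.get("current_model")
    vram = (info.get("model_info") or {}).get("vram_gb")
    if model != last:
        print(f"{time.strftime('%H:%M:%S')}  active model: {model} ({vram}GB VRAM)")
        last = model
    time.sleep(30)
```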

### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if the model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test the model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
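
When it is unclear which layer is failing, probing each service's health port from the GPU host narrows it down quickly. A sketch using the ports from the architecture diagram — only the currently loaded model is expected to answer:

```python
# Probe the orchestrator and each model port on the GPU host (ports from the diagram above).
import urllib.error
import urllib.request

PORTS = {9000: "orchestrator", 8001: "vLLM (Qwen)", 8002: "Flux.1", 8003: "MusicGen"}

for port, name in PORTS.items():
    url = f"http://localhost:{port}/health"
    try:
        urllib.request.urlopen(url, timeout=3)
        print(f"{name:15s} port {port}: up")
    except (urllib.error.URLError, OSError):
        print(f"{name:15s} port {port}: no response (expected for unloaded models)")
```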

### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart Docker daemon
sudo systemctl restart docker

# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```

### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  huggingface-cli download facebook/musicgen-medium
```
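
If the container route is awkward, the same pre-caching can be done from any Python environment on the GPU host with `huggingface_hub`. A sketch assuming the package is installed (`pip install huggingface_hub`) and that the services read the hub cache under the `HF_HOME` directory mounted above:

```python
# Pre-download model weights into the shared HF cache so first startup skips the download.
import os

from huggingface_hub import snapshot_download

# Hub cache under HF_HOME=/workspace/huggingface_cache (directory layout is an assumption).
CACHE_DIR = "/workspace/huggingface_cache/hub"

for repo in ("Qwen/Qwen2.5-7B-Instruct", "black-forest-labs/FLUX.1-schnell", "facebook/musicgen-medium"):
    path = snapshot_download(repo_id=repo, cache_dir=CACHE_DIR, token=os.environ.get("HF_TOKEN"))
    print(f"{repo} cached at {path}")
```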

## File Structure

```
ai/
├── docker-compose.gpu.yaml      # Main orchestration file
├── .env.example                 # Environment template
├── README.md                    # This file
│
├── model-orchestrator/          # Central orchestrator service
│   ├── orchestrator.py          # FastAPI app managing models
│   ├── models.yaml              # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                        # Text generation (vLLM)
│   ├── server.py                # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                        # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json          # Flux configuration
│
├── musicgen/                    # Music generation (MusicGen)
│   ├── server.py                # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml          # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md        # Deployment history and notes
```

## Cost Analysis

### Current Setup (Single GPU)

- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)

### Alternative: Multi-GPU (All Models Always On)

- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
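
The monthly figures above follow directly from the hourly rates (assuming a 30-day month):

```python
# How the monthly cost figures above are derived (30-day month assumed).
single_gpu_hourly, multi_gpu_hourly = 0.50, 0.75

always_on_single = single_gpu_hourly * 24 * 30   # ~$360/month
business_hours   = single_gpu_hourly * 8 * 30    # ~$120/month (8h/day)
always_on_multi  = multi_gpu_hourly * 24 * 30    # ~$540/month

print(f"single GPU 24/7:   ${always_on_single:.0f}")
print(f"single GPU 8h/day: ${business_hours:.0f}")
print(f"multi GPU 24/7:    ${always_on_multi:.0f} (+${always_on_multi - always_on_single:.0f} vs single)")
```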

### Recommendation

Stick with single-GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.

## Performance Expectations

| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |

**Model Switching**: 30-120 seconds (unload current + load new)

## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on a private Tailscale network
- No public exposure (only via the VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)

## Future Enhancements

1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop the GPU pod during idle periods

## Support

For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review the GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). For commercial use, train your own models with the MIT-licensed code.