docs: migrate multi-modal AI orchestration to dedicated runpod repository

Multi-modal AI stack (text/image/music generation) has been moved to:
Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git

Updated ai/README.md to document:
- VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL)
- Reference to new runpod repository for GPU infrastructure
- Clear separation between VPS and GPU deployments
- Integration architecture via Tailscale VPN
Commit: abcebd1d9b
Parent: 3ed3e68271
Date:   2025-11-21 14:36:36 +01:00


@@ -1,467 +1,170 @@
# AI Infrastructure

This directory contains AI-related configurations for the VPS deployment.

## Multi-Modal GPU Infrastructure (Migrated)

**The multi-modal AI orchestration stack (text, image, and music generation) has been moved to a dedicated repository:**

**Repository**: https://dev.pivoine.art/valknar/runpod

The RunPod repository contains:

- Model orchestrator for automatic switching between text, image, and music models
- vLLM + Qwen 2.5 7B (text generation)
- Flux.1 Schnell (image generation)
- MusicGen Medium (music generation)
- RunPod template creation scripts
- Complete deployment documentation

This separation allows for independent management of:

- **VPS Services** (this repo): Open WebUI, Crawl4AI, AI database
- **GPU Services** (runpod repo): Model inference, orchestration, RunPod templates

# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types, with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages the model lifecycle.

### Components

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
```
### Features

- **Automatic Model Switching** - Orchestrator detects request type and loads the appropriate model
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing a YAML config
- **Centralized Routing** - LiteLLM proxy provides a unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time

## VPS AI Services (ai/compose.yaml)

This compose stack manages the VPS-side AI infrastructure that integrates with the GPU server:

### Services

#### ai_postgres

Dedicated PostgreSQL 16 instance with the pgvector extension for AI workloads:

- Vector similarity search support
- Isolated from the core database for performance
- Used by Open WebUI for RAG and embeddings

#### webui (Open WebUI)

ChatGPT-like interface exposed at `ai.pivoine.art:8080`:

- Claude API integration via Anthropic
- RAG support with document upload
- Vector storage via pgvector
- Web search capability
- SMTP email via IONOS
- User signup enabled

#### crawl4ai

Internal web scraping service for LLM content preparation:

- API on port 11235 (not exposed publicly)
- Optimized for AI/RAG workflows
- Integration with Open WebUI and n8n

## Integration with GPU Server

The VPS AI services connect to the GPU server via Tailscale VPN:

- **VPS Tailscale IP**: 100.102.217.79
- **GPU Tailscale IP**: 100.100.108.13

**LiteLLM Proxy** (port 4000 on VPS) routes requests:

- Claude API for chat completions
- GPU orchestrator for self-hosted models (text, image, music)

See `../litellm-config.yaml` for the routing configuration.

## Environment Variables

Required in `.env`:

```bash
# AI Database
AI_DB_PASSWORD=<password>

# Open WebUI
AI_WEBUI_SECRET_KEY=<secret>

# Claude API
ANTHROPIC_API_KEY=<api_key>

# Email (IONOS SMTP)
ADMIN_EMAIL=<email>
SMTP_HOST=smtp.ionos.com
SMTP_PORT=587
SMTP_USER=<smtp_user>
SMTP_PASSWORD=<smtp_password>
```

## Backup Configuration

AI services are backed up daily via Restic:

- **ai_postgres_data**: 3 AM (7 daily, 4 weekly, 6 monthly, 2 yearly)
- **ai_webui_data**: 3 AM (same retention)
- **ai_crawl4ai_data**: 3 AM (same retention)

Repository: `/mnt/hidrive/users/valknar/Backup`

## Supported Model Types

### Text Generation

- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation

- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation

- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API

## Quick Start

### 1. Prerequisites

On the RunPod GPU server:

- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
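For reference, the only secret the GPU-side stack needs at this step is the HuggingFace token; an illustrative `.env` (placeholder value):

```bash
# ai/.env on the RunPod GPU server (illustrative)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # used by the model services to download weights
```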
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "flux-schnell",
"prompt": "a cute cat",
"size": "1024x1024"
}'
# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{
"model": "musicgen-medium",
"prompt": "upbeat electronic dance music",
"duration": 30
}'
```
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
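For reference, the routing entries in `litellm-config.yaml` would look roughly like the following sketch (model names and the Claude model ID are illustrative; the GPU-hosted models are reached as OpenAI-compatible endpoints on the orchestrator):

```yaml
model_list:
  # Claude via the Anthropic API
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Self-hosted models behind the GPU orchestrator (OpenAI-compatible)
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy

  - model_name: flux-schnell
    litellm_params:
      model: openai/flux-schnell
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
```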
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically
**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1
**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai
# Configure client to use orchestrator
client = openai.OpenAI(
base_url="http://100.100.108.13:9000/v1",
api_key="dummy" # Not used but required
)
# Text generation
response = client.chat.completions.create(
model="qwen-2.5-7b",
messages=[{"role": "user", "content": "Write a haiku"}]
)
# Image generation
image = client.images.generate(
model="flux-schnell",
prompt="a sunset over mountains",
size="1024x1024"
)
# Music generation (custom endpoint)
import requests
music = requests.post(
"http://100.100.108.13:9000/v1/audio/generations",
json={
"model": "musicgen-medium",
"prompt": "calm piano music",
"duration": 30
}
)
```
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                     # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
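For context, the switching mechanism the orchestrator implements is essentially "stop whatever is loaded, then start the compose service registered for the requested model". A simplified sketch of that idea (illustrative only, not the actual `orchestrator.py`; it assumes the `models.yaml` fields shown above):

```python
import subprocess
import yaml  # PyYAML

COMPOSE_FILE = "docker-compose.gpu.yaml"

with open("model-orchestrator/models.yaml") as f:
    MODELS = yaml.safe_load(f)["models"]

current_model = None  # only one model may occupy the GPU at a time


def switch_to(model_name: str) -> None:
    """Stop the active model service, then start the one registered for model_name."""
    global current_model
    if model_name == current_model:
        return  # already loaded, nothing to do

    if current_model is not None:
        subprocess.run(
            ["docker", "compose", "-f", COMPOSE_FILE, "stop",
             MODELS[current_model]["docker_service"]],
            check=True,
        )

    spec = MODELS[model_name]
    subprocess.run(
        ["docker", "compose", "-f", COMPOSE_FILE,
         "--profile", spec["type"], "up", "-d", spec["docker_service"]],
        check=True,
    )
    current_model = model_name


# A request for "flux-schnell" would therefore trigger:
# switch_to("flux-schnell")
```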
## Management Commands

```bash
# Start AI stack
pnpm arty up ai_postgres webui crawl4ai

# View logs
docker logs -f ai_webui
docker logs -f ai_postgres
docker logs -f ai_crawl4ai

# Check Open WebUI
curl http://ai.pivoine.art:8080/health

# Restart AI services
pnpm arty restart ai_postgres webui crawl4ai
```

## GPU Server Management

For GPU server operations (model orchestration, template creation, etc.):

```bash
# Clone the dedicated repository
git clone ssh://git@dev.pivoine.art:2222/valknar/runpod.git

# See runpod repository for:
# - Model orchestration setup
# - RunPod template creation
# - GPU deployment guides
```

## Documentation

### VPS AI Services

- [GPU_DEPLOYMENT_LOG.md](GPU_DEPLOYMENT_LOG.md) - VPS AI deployment history

### GPU Server (Separate Repository)

- [runpod/README.md](https://dev.pivoine.art/valknar/runpod) - Main GPU documentation
- [runpod/DEPLOYMENT.md](https://dev.pivoine.art/valknar/runpod) - Deployment guide
- [runpod/RUNPOD_TEMPLATE.md](https://dev.pivoine.art/valknar/runpod) - Template creation

## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```

### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
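To watch just the memory figures (handy for confirming that only one model is resident), `nvidia-smi` can be queried directly:

```bash
ssh gpu-pivoine "nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader"
# Expect roughly 11-14 GB used while a single model is loaded (see the VRAM figures above)
```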
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
# Note: the CUDA runtime image does not ship huggingface-cli, so install it first
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3-pip && \
           pip install -U 'huggingface_hub[cli]' && \
           huggingface-cli download facebook/musicgen-medium"
```
## File Structure
```
ai/
├── docker-compose.gpu.yaml        # Main orchestration file
├── .env.example                   # Environment template
├── README.md                      # This file
│
├── model-orchestrator/            # Central orchestrator service
│   ├── orchestrator.py            # FastAPI app managing models
│   ├── models.yaml                # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                          # Text generation (vLLM)
│   ├── server.py                  # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                          # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json            # Flux configuration
│
├── musicgen/                      # Music generation (MusicGen)
│   ├── server.py                  # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml            # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md          # Deployment history and notes
```

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ LiteLLM Proxy (Port 4000)                                 │   │
│ │ Routes to: Claude API + GPU Orchestrator                  │   │
│ └───────┬───────────────────────────────────────────────────┘   │
│         │                                                       │
│ ┌───────▼─────────┐  ┌──────────────┐  ┌─────────────────┐      │
│ │ Open WebUI      │  │ Crawl4AI     │  │ AI PostgreSQL   │      │
│ │ Port: 8080      │  │ Port: 11235  │  │ + pgvector      │      │
│ └─────────────────┘  └──────────────┘  └─────────────────┘      │
└──────────────────────────────┬──────────────────────────────────┘
                               │ Tailscale VPN
┌──────────────────────────────┼──────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                   │
│ ┌────────────────────────────▼─────────────────────────────┐    │
│ │ Orchestrator (Port 9000)                                  │    │
│ │ Manages sequential model loading                          │    │
│ └─────┬──────────────┬──────────────────┬──────────────────┘    │
│       │              │                  │                       │
│ ┌─────▼──────┐  ┌────▼────────┐  ┌──────▼───────┐               │
│ │ vLLM       │  │ Flux.1      │  │ MusicGen     │               │
│ │ Qwen 2.5 7B│  │ Schnell     │  │ Medium       │               │
│ │ Port: 8001 │  │ Port: 8002  │  │ Port: 8003   │               │
│ └────────────┘  └─────────────┘  └──────────────┘               │
└─────────────────────────────────────────────────────────────────┘
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
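The monthly figures above follow directly from the hourly rates (assuming a 30-day month); a quick sanity check:

```python
spot_rate = 0.50        # $/hour, single RTX 4090 spot instance
multi_gpu_rate = 0.75   # $/hour, 2x RTX 4090

always_on    = spot_rate * 24 * 30       # ~$360/month, running 24/7
business_hrs = spot_rate * 8 * 30        # ~$120/month, 8 hours/day
multi_gpu    = multi_gpu_rate * 24 * 30  # ~$540/month, running 24/7

print(always_on, business_hrs, multi_gpu)  # 360.0 120.0 540.0
```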
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`); see the mount sketch below
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
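The Docker socket requirement translates into a volume mount on the orchestrator service; presumably something along these lines in `docker-compose.gpu.yaml` (illustrative):

```yaml
services:
  orchestrator:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # lets the orchestrator start/stop model containers
```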
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support

For issues or questions:

- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review the GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: The pre-trained MusicGen weights are non-commercial (CC-BY-NC); for commercial use, train your own weights with the MIT-licensed code.

## Support

For issues:

- **VPS AI services**: Check logs via `docker logs`
- **GPU server**: See the runpod repository documentation
- **LiteLLM routing**: Review `../litellm-config.yaml`