From abcebd1d9b291e24561ed0a6a249a4691c04bfc5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?= Date: Fri, 21 Nov 2025 14:36:36 +0100 Subject: [PATCH] docs: migrate multi-modal AI orchestration to dedicated runpod repository Multi-modal AI stack (text/image/music generation) has been moved to: Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git Updated ai/README.md to document: - VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL) - Reference to new runpod repository for GPU infrastructure - Clear separation between VPS and GPU deployments - Integration architecture via Tailscale VPN --- ai/README.md | 565 ++++++++++++--------------------------------------- 1 file changed, 134 insertions(+), 431 deletions(-) diff --git a/ai/README.md b/ai/README.md index 04736da..4f386a9 100644 --- a/ai/README.md +++ b/ai/README.md @@ -1,467 +1,170 @@ -# Multi-Modal AI Orchestration System +# AI Infrastructure -**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.** +This directory contains AI-related configurations for the VPS deployment. -## Architecture Overview +## Multi-Modal GPU Infrastructure (Migrated) -This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle. +**The multi-modal AI orchestration stack (text, image, music generation) has been moved to a dedicated repository:** -### Components +**Repository**: https://dev.pivoine.art/valknar/runpod -``` -┌─────────────────────────────────────────────────────────────────┐ -│ VPS (Tailscale: 100.102.217.79) │ -│ ┌───────────────────────────────────────────────────────────┐ │ -│ │ LiteLLM Proxy (Port 4000) │ │ -│ │ Routes to: Claude API + GPU Orchestrator │ │ -│ └────────────────────┬──────────────────────────────────────┘ │ -└───────────────────────┼─────────────────────────────────────────┘ - │ Tailscale VPN -┌───────────────────────┼─────────────────────────────────────────┐ -│ RunPod GPU Server (Tailscale: 100.100.108.13) │ -│ ┌────────────────────▼──────────────────────────────────────┐ │ -│ │ Orchestrator (Port 9000) │ │ -│ │ Manages sequential model loading based on request type │ │ -│ └─────┬──────────────┬──────────────────┬──────────────────┘ │ -│ │ │ │ │ -│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │ -│ │vLLM │ │Flux.1 │ │MusicGen │ │ -│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │ -│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │ -│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │ -│ └────────────┘ └─────────────┘ └──────────────┘ │ -│ │ -│ Only ONE model active at a time (sequential loading) │ -└─────────────────────────────────────────────────────────────────┘ -``` +The RunPod repository contains: +- Model orchestrator for automatic switching between text, image, and music models +- vLLM + Qwen 2.5 7B (text generation) +- Flux.1 Schnell (image generation) +- MusicGen Medium (music generation) +- RunPod template creation scripts +- Complete deployment documentation -### Features +This separation allows for independent management of: +- **VPS Services** (this repo): Open WebUI, Crawl4AI, AI database +- **GPU Services** (runpod repo): Model inference, orchestration, RunPod templates -✅ **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model -✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools -✅ **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr 
vs ~$0.75/hr for multi-GPU) -✅ **Easy Model Addition** - Add new models by editing YAML config -✅ **Centralized Routing** - LiteLLM proxy provides unified API for all models -✅ **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time +## VPS AI Services (ai/compose.yaml) -## Supported Model Types +This compose stack manages the VPS-side AI infrastructure that integrates with the GPU server: -### Text Generation -- **Qwen 2.5 7B Instruct** (facebook/Qwen2.5-7B-Instruct) -- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API +### Services -### Image Generation -- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell) -- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API +#### ai_postgres +Dedicated PostgreSQL 16 instance with pgvector extension for AI workloads: +- Vector similarity search support +- Isolated from core database for performance +- Used by Open WebUI for RAG and embeddings -### Music Generation -- **MusicGen Medium** (facebook/musicgen-medium) -- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API +#### webui (Open WebUI) +ChatGPT-like interface exposed at `ai.pivoine.art:8080`: +- Claude API integration via Anthropic +- RAG support with document upload +- Vector storage via pgvector +- Web search capability +- SMTP email via IONOS +- User signup enabled -## Quick Start +#### crawl4ai +Internal web scraping service for LLM content preparation: +- API on port 11235 (not exposed publicly) +- Optimized for AI/RAG workflows +- Integration with Open WebUI and n8n -### 1. Prerequisites +## Integration with GPU Server +The VPS AI services connect to the GPU server via Tailscale VPN: +- **VPS Tailscale IP**: 100.102.217.79 +- **GPU Tailscale IP**: 100.100.108.13 + +**LiteLLM Proxy** (port 4000 on VPS) routes requests: +- Claude API for chat completions +- GPU orchestrator for self-hosted models (text, image, music) + +See `../litellm-config.yaml` for routing configuration. + +## Environment Variables + +Required in `.env`: ```bash -# On RunPod GPU server -- RunPod RTX 4090 instance (24GB VRAM) -- Docker & Docker Compose installed -- Tailscale VPN configured -- HuggingFace token (for model downloads) +# AI Database +AI_DB_PASSWORD= + +# Open WebUI +AI_WEBUI_SECRET_KEY= + +# Claude API +ANTHROPIC_API_KEY= + +# Email (IONOS SMTP) +ADMIN_EMAIL= +SMTP_HOST=smtp.ionos.com +SMTP_PORT=587 +SMTP_USER= +SMTP_PASSWORD= ``` -### 2. Clone & Configure +## Backup Configuration -```bash -# On local machine -cd ai/ +AI services are backed up daily via Restic: +- **ai_postgres_data**: 3 AM (7 daily, 4 weekly, 6 monthly, 2 yearly) +- **ai_webui_data**: 3 AM (same retention) +- **ai_crawl4ai_data**: 3 AM (same retention) -# Create environment file -cp .env.example .env -# Edit .env and add your HF_TOKEN -``` - -### 3. Deploy to RunPod - -```bash -# Copy all files to RunPod GPU server -scp -r ai/* gpu-pivoine:/workspace/ai/ - -# SSH to GPU server -ssh gpu-pivoine - -# Navigate to project -cd /workspace/ai/ - -# Start orchestrator (always running) -docker compose -f docker-compose.gpu.yaml up -d orchestrator - -# Orchestrator will automatically manage model services as needed -``` - -### 4. 
Test Deployment - -```bash -# Check orchestrator health -curl http://100.100.108.13:9000/health - -# Test text generation (auto-loads vLLM) -curl http://100.100.108.13:9000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "qwen-2.5-7b", - "messages": [{"role": "user", "content": "Hello!"}] - }' - -# Test image generation (auto-switches to Flux) -curl http://100.100.108.13:9000/v1/images/generations \ - -H "Content-Type: application/json" \ - -d '{ - "model": "flux-schnell", - "prompt": "a cute cat", - "size": "1024x1024" - }' - -# Test music generation (auto-switches to MusicGen) -curl http://100.100.108.13:9000/v1/audio/generations \ - -H "Content-Type: application/json" \ - -d '{ - "model": "musicgen-medium", - "prompt": "upbeat electronic dance music", - "duration": 30 - }' -``` - -### 5. Update VPS LiteLLM - -```bash -# On VPS, restart LiteLLM to pick up new config -ssh vps -cd ~/Projects/docker-compose -arty restart litellm -``` - -## Usage Examples - -### Via Open WebUI (https://ai.pivoine.art) - -**Text Generation:** -1. Select model: `qwen-2.5-7b` -2. Type message and send -3. Orchestrator loads vLLM automatically - -**Image Generation:** -1. Select model: `flux-schnell` -2. Enter image prompt -3. Orchestrator switches to Flux.1 - -**Music Generation:** -1. Select model: `musicgen-medium` -2. Describe the music you want -3. Orchestrator switches to MusicGen - -### Via API (Direct) - -```python -import openai - -# Configure client to use orchestrator -client = openai.OpenAI( - base_url="http://100.100.108.13:9000/v1", - api_key="dummy" # Not used but required -) - -# Text generation -response = client.chat.completions.create( - model="qwen-2.5-7b", - messages=[{"role": "user", "content": "Write a haiku"}] -) - -# Image generation -image = client.images.generate( - model="flux-schnell", - prompt="a sunset over mountains", - size="1024x1024" -) - -# Music generation (custom endpoint) -import requests -music = requests.post( - "http://100.100.108.13:9000/v1/audio/generations", - json={ - "model": "musicgen-medium", - "prompt": "calm piano music", - "duration": 30 - } -) -``` - -## Adding New Models - -### Step 1: Update `models.yaml` - -```yaml -# Add to ai/model-orchestrator/models.yaml -models: - llama-3.1-8b: # New model - type: text - framework: vllm - docker_service: vllm-llama - port: 8004 - vram_gb: 17 - startup_time_seconds: 120 - endpoint: /v1/chat/completions - description: "Llama 3.1 8B Instruct - Meta's latest model" -``` - -### Step 2: Add Docker Service - -```yaml -# Add to ai/docker-compose.gpu.yaml -services: - vllm-llama: - build: ./vllm - container_name: ai_vllm-llama_1 - command: > - vllm serve meta-llama/Llama-3.1-8B-Instruct - --port 8000 --dtype bfloat16 - ports: - - "8004:8000" - environment: - - HF_TOKEN=${HF_TOKEN} - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: 1 - capabilities: [gpu] - profiles: ["text"] - restart: "no" -``` - -### Step 3: Restart Orchestrator - -```bash -ssh gpu-pivoine -cd /workspace/ai/ -docker compose -f docker-compose.gpu.yaml restart orchestrator -``` - -**That's it!** The orchestrator automatically detects the new model. 
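+
+A quick way to confirm the nightly jobs are producing snapshots (a sketch; assumes `RESTIC_REPOSITORY` points at the backup repository listed below and `RESTIC_PASSWORD` is set, matching however the scheduled backup job is configured):
+
+```bash
+# Show the most recent snapshots recorded in the backup repository
+restic snapshots --latest 3
+```
+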
+Repository: `/mnt/hidrive/users/valknar/Backup` ## Management Commands -### Orchestrator +```bash +# Start AI stack +pnpm arty up ai_postgres webui crawl4ai + +# View logs +docker logs -f ai_webui +docker logs -f ai_postgres +docker logs -f ai_crawl4ai + +# Check Open WebUI +curl http://ai.pivoine.art:8080/health + +# Restart AI services +pnpm arty restart ai_postgres webui crawl4ai +``` + +## GPU Server Management + +For GPU server operations (model orchestration, template creation, etc.): ```bash -# Start orchestrator -docker compose -f docker-compose.gpu.yaml up -d orchestrator +# Clone the dedicated repository +git clone ssh://git@dev.pivoine.art:2222/valknar/runpod.git -# View orchestrator logs -docker logs -f ai_orchestrator - -# Restart orchestrator -docker compose -f docker-compose.gpu.yaml restart orchestrator - -# Check active model -curl http://100.100.108.13:9000/health - -# List all models -curl http://100.100.108.13:9000/models +# See runpod repository for: +# - Model orchestration setup +# - RunPod template creation +# - GPU deployment guides ``` -### Manual Model Control +## Documentation -```bash -# Manually switch to specific model -curl -X POST http://100.100.108.13:9000/switch \ - -H "Content-Type: application/json" \ - -d '{"model": "flux-schnell"}' +### VPS AI Services +- [GPU_DEPLOYMENT_LOG.md](GPU_DEPLOYMENT_LOG.md) - VPS AI deployment history -# Check which model is running -curl http://100.100.108.13:9000/health | jq '.current_model' -``` +### GPU Server (Separate Repository) +- [runpod/README.md](https://dev.pivoine.art/valknar/runpod) - Main GPU documentation +- [runpod/DEPLOYMENT.md](https://dev.pivoine.art/valknar/runpod) - Deployment guide +- [runpod/RUNPOD_TEMPLATE.md](https://dev.pivoine.art/valknar/runpod) - Template creation -### Model Services - -```bash -# Manually start a specific model (bypassing orchestrator) -docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen - -# Stop a model -docker compose -f docker-compose.gpu.yaml stop vllm-qwen - -# View model logs -docker logs -f ai_vllm-qwen_1 -docker logs -f ai_flux_1 -docker logs -f ai_musicgen_1 -``` - -## Monitoring - -### GPU Usage - -```bash -ssh gpu-pivoine "nvidia-smi" -``` - -### Model Status - -```bash -# Which model is active? 
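+
+Before digging into the runpod repository, a minimal connectivity check from the VPS (a sketch; assumes the orchestrator is running and exposes the health endpoint shown in the architecture below):
+
+```bash
+# From the VPS: the orchestrator listens on the GPU server's Tailscale IP
+curl -s http://100.100.108.13:9000/health
+```
+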
-curl http://100.100.108.13:9000/health - -# Model memory usage -curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}' -``` - -### Performance - -```bash -# Orchestrator logs (model switching) -docker logs -f ai_orchestrator - -# Model-specific logs -docker logs -f ai_vllm-qwen_1 -docker logs -f ai_flux_1 -docker logs -f ai_musicgen_1 -``` - -## Troubleshooting - -### Model Won't Load - -```bash -# Check orchestrator logs -docker logs ai_orchestrator - -# Check if model service exists -docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen" - -# Manually test model service -docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen -curl http://localhost:8001/health -``` - -### Orchestrator Can't Connect - -```bash -# Check Docker socket permissions -ls -l /var/run/docker.sock - -# Restart Docker daemon -sudo systemctl restart docker - -# Rebuild orchestrator -docker compose -f docker-compose.gpu.yaml build orchestrator -docker compose -f docker-compose.gpu.yaml up -d orchestrator -``` - -### Model Switching Too Slow - -```bash -# Check model startup times in models.yaml -# Adjust startup_time_seconds if needed - -# Pre-download models to /workspace cache -docker run --rm -it --gpus all \ - -v /workspace/huggingface_cache:/cache \ - -e HF_HOME=/cache \ - nvidia/cuda:12.4.0-runtime-ubuntu22.04 \ - huggingface-cli download facebook/musicgen-medium -``` - -## File Structure +## Architecture Overview ``` -ai/ -├── docker-compose.gpu.yaml # Main orchestration file -├── .env.example # Environment template -├── README.md # This file -│ -├── model-orchestrator/ # Central orchestrator service -│ ├── orchestrator.py # FastAPI app managing models -│ ├── models.yaml # Model registry (EDIT TO ADD MODELS) -│ ├── Dockerfile -│ └── requirements.txt -│ -├── vllm/ # Text generation (vLLM) -│ ├── server.py # Qwen 2.5 7B server -│ ├── Dockerfile -│ └── requirements.txt -│ -├── flux/ # Image generation (Flux.1 Schnell) -│ └── config/ -│ └── config.json # Flux configuration -│ -├── musicgen/ # Music generation (MusicGen) -│ ├── server.py # MusicGen API server -│ ├── Dockerfile -│ └── requirements.txt -│ -├── litellm-config.yaml # LiteLLM proxy configuration -└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes +┌─────────────────────────────────────────────────────────────────┐ +│ VPS (Tailscale: 100.102.217.79) │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ LiteLLM Proxy (Port 4000) │ │ +│ │ Routes to: Claude API + GPU Orchestrator │ │ +│ └───────┬───────────────────────────────────────────────────┘ │ +│ │ │ +│ ┌───────▼─────────┐ ┌──────────────┐ ┌─────────────────┐ │ +│ │ Open WebUI │ │ Crawl4AI │ │ AI PostgreSQL │ │ +│ │ Port: 8080 │ │ Port: 11235 │ │ + pgvector │ │ +│ └─────────────────┘ └──────────────┘ └─────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + │ Tailscale VPN +┌──────────────────────────────┼──────────────────────────────────┐ +│ RunPod GPU Server (Tailscale: 100.100.108.13) │ +│ ┌───────────────────────────▼──────────────────────────────┐ │ +│ │ Orchestrator (Port 9000) │ │ +│ │ Manages sequential model loading │ │ +│ └─────┬──────────────┬──────────────────┬──────────────────┘ │ +│ │ │ │ │ +│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │ +│ │vLLM │ │Flux.1 │ │MusicGen │ │ +│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │ +│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │ +│ └────────────┘ └─────────────┘ └──────────────┘ │ 
+└─────────────────────────────────────────────────────────────────┘ ``` -## Cost Analysis - -### Current Setup (Single GPU) -- **Provider**: RunPod Spot Instance -- **GPU**: RTX 4090 24GB -- **Cost**: ~$0.50/hour -- **Monthly**: ~$360 (if running 24/7) -- **Optimized**: ~$120 (8 hours/day during business hours) - -### Alternative: Multi-GPU (All Models Always On) -- **GPUs**: 2× RTX 4090 -- **Cost**: ~$0.75/hour -- **Monthly**: ~$540 (if running 24/7) -- **Trade-off**: No switching latency, +$180/month - -### Recommendation -Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases. - -## Performance Expectations - -| Model | VRAM | Startup Time | Generation Speed | -|-------|------|--------------|------------------| -| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec | -| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image | -| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio | - -**Model Switching**: 30-120 seconds (unload current + load new) - -## Security Notes - -- Orchestrator requires Docker socket access (`/var/run/docker.sock`) -- All services run on private Tailscale network -- No public exposure (only via VPS LiteLLM proxy) -- HuggingFace token stored in `.env` (not committed to git) - -## Future Enhancements - -1. ⏹️ Add Llama 3.1 8B for alternative text generation -2. ⏹️ Add Whisper Large v3 for speech-to-text -3. ⏹️ Add XTTS v2 for text-to-speech -4. ⏹️ Implement model preloading/caching for faster switching -5. ⏹️ Add usage metrics and cost tracking -6. ⏹️ Auto-stop GPU pod during idle periods - ## Support -For issues or questions: -- Check orchestrator logs: `docker logs ai_orchestrator` -- View model-specific logs: `docker logs ai__1` -- Test direct model access: `curl http://localhost:/health` -- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md` - -## License - -Built with: -- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0 -- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights) -- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0 -- [LiteLLM](https://github.com/BerriAI/litellm) - MIT - -**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code. +For issues: +- **VPS AI services**: Check logs via `docker logs` +- **GPU server**: See runpod repository documentation +- **LiteLLM routing**: Review `../litellm-config.yaml`
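+
+For the LiteLLM routing mentioned above, a quick sanity check is to call the proxy's OpenAI-compatible endpoints directly from the VPS (a sketch; the model name and the `$LITELLM_MASTER_KEY` variable are illustrative and must match what `../litellm-config.yaml` and the proxy actually define):
+
+```bash
+# List the models the proxy currently routes
+curl -s http://localhost:4000/v1/models \
+  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
+
+# Minimal chat completion; LiteLLM forwards it to Claude or the GPU orchestrator
+curl -s http://localhost:4000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
+  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "ping"}]}'
+```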