docs: migrate multi-modal AI orchestration to dedicated runpod repository

Multi-modal AI stack (text/image/music generation) has been moved to:
Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git

Updated ai/README.md to document:
- VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL)
- Reference to new runpod repository for GPU infrastructure
- Clear separation between VPS and GPU deployments
- Integration architecture via Tailscale VPN
2025-11-21 14:36:36 +01:00
parent 3ed3e68271
commit abcebd1d9b
1 changed file with 134 additions and 431 deletions
--- a/ai/README.md
+++ b/ai/README.md
@@ -1,467 +1,170 @@
-# Multi-Modal AI Orchestration System
+# AI Infrastructure

-**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**
+This directory contains AI-related configurations for the VPS deployment.

-## Architecture Overview
+## Multi-Modal GPU Infrastructure (Migrated)

-This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
+**The multi-modal AI orchestration stack (text, image, music generation) has been moved to a dedicated repository:**

-### Components
+**Repository**: https://dev.pivoine.art/valknar/runpod

-```
-┌─────────────────────────────────────────────────────────────────┐
-│                        VPS (Tailscale: 100.102.217.79)          │
-│  ┌───────────────────────────────────────────────────────────┐  │
-│  │ LiteLLM Proxy (Port 4000)                                 │  │
-│  │ Routes to: Claude API + GPU Orchestrator                  │  │
-│  └────────────────────┬──────────────────────────────────────┘  │
-└───────────────────────┼─────────────────────────────────────────┘
-                        │ Tailscale VPN
-┌───────────────────────┼─────────────────────────────────────────┐
-│           RunPod GPU Server (Tailscale: 100.100.108.13)         │
-│  ┌────────────────────▼──────────────────────────────────────┐  │
-│  │ Orchestrator (Port 9000)                                  │  │
-│  │ Manages sequential model loading based on request type   │  │
-│  └─────┬──────────────┬──────────────────┬──────────────────┘  │
-│        │              │                  │                      │
-│  ┌─────▼──────┐ ┌────▼────────┐  ┌──────▼───────┐             │
-│  │vLLM        │ │Flux.1       │  │MusicGen      │             │
-│  │Qwen 2.5 7B │ │Schnell      │  │Medium        │             │
-│  │Port: 8001  │ │Port: 8002   │  │Port: 8003    │             │
-│  │VRAM: 14GB  │ │VRAM: 14GB   │  │VRAM: 11GB    │             │
-│  └────────────┘ └─────────────┘  └──────────────┘             │
-│                                                                 │
-│  Only ONE model active at a time (sequential loading)          │
-└─────────────────────────────────────────────────────────────────┘
-```
+The RunPod repository contains:
+- Model orchestrator for automatic switching between text, image, and music models
+- vLLM + Qwen 2.5 7B (text generation)
+- Flux.1 Schnell (image generation)
+- MusicGen Medium (music generation)
+- RunPod template creation scripts
+- Complete deployment documentation

-### Features
+This separation allows for independent management of:
+- **VPS Services** (this repo): Open WebUI, Crawl4AI, AI database
+- **GPU Services** (runpod repo): Model inference, orchestration, RunPod templates

-✅ **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
-✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
-✅ **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
-✅ **Easy Model Addition** - Add new models by editing YAML config
-✅ **Centralized Routing** - LiteLLM proxy provides unified API for all models
-✅ **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
+## VPS AI Services (ai/compose.yaml)

-## Supported Model Types
+This compose stack manages the VPS-side AI infrastructure that integrates with the GPU server:

-### Text Generation
- **Qwen 2.5 7B Instruct** (facebook/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
+### Services

-### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
+#### ai_postgres
+Dedicated PostgreSQL 16 instance with pgvector extension for AI workloads:
+- Vector similarity search support
+- Isolated from core database for performance
+- Used by Open WebUI for RAG and embeddings

-### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
+#### webui (Open WebUI)
+ChatGPT-like interface exposed at `ai.pivoine.art:8080`:
+- Claude API integration via Anthropic
+- RAG support with document upload
+- Vector storage via pgvector
+- Web search capability
+- SMTP email via IONOS
+- User signup enabled

-## Quick Start
+#### crawl4ai
+Internal web scraping service for LLM content preparation:
+- API on port 11235 (not exposed publicly)
+- Optimized for AI/RAG workflows
+- Integration with Open WebUI and n8n

-### 1. Prerequisites
+## Integration with GPU Server

+The VPS AI services connect to the GPU server via Tailscale VPN:
+- **VPS Tailscale IP**: 100.102.217.79
+- **GPU Tailscale IP**: 100.100.108.13
+
+**LiteLLM Proxy** (port 4000 on VPS) routes requests:
+- Claude API for chat completions
+- GPU orchestrator for self-hosted models (text, image, music)
+
+See `../litellm-config.yaml` for routing configuration.
+
+## Environment Variables
+
+Required in `.env`:
 ```bash
-# On RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
+# AI Database
+AI_DB_PASSWORD=<password>
+
+# Open WebUI
+AI_WEBUI_SECRET_KEY=<secret>
+
+# Claude API
+ANTHROPIC_API_KEY=<api_key>
+
+# Email (IONOS SMTP)
+ADMIN_EMAIL=<email>
+SMTP_HOST=smtp.ionos.com
+SMTP_PORT=587
+SMTP_USER=<smtp_user>
+SMTP_PASSWORD=<smtp_password>
 ```

-### 2. Clone & Configure
+## Backup Configuration

-```bash
-# On local machine
-cd ai/
+AI services are backed up daily via Restic:
+- **ai_postgres_data**: 3 AM (7 daily, 4 weekly, 6 monthly, 2 yearly)
+- **ai_webui_data**: 3 AM (same retention)
+- **ai_crawl4ai_data**: 3 AM (same retention)

-# Create environment file
-cp .env.example .env
-# Edit .env and add your HF_TOKEN
-```
-
-### 3. Deploy to RunPod
-
-```bash
-# Copy all files to RunPod GPU server
-scp -r ai/* gpu-pivoine:/workspace/ai/
-
-# SSH to GPU server
-ssh gpu-pivoine
-
-# Navigate to project
-cd /workspace/ai/
-
-# Start orchestrator (always running)
-docker compose -f docker-compose.gpu.yaml up -d orchestrator
-
-# Orchestrator will automatically manage model services as needed
-```
-
-### 4. Test Deployment
-
-```bash
-# Check orchestrator health
-curl http://100.100.108.13:9000/health
-
-# Test text generation (auto-loads vLLM)
-curl http://100.100.108.13:9000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "qwen-2.5-7b",
-    "messages": [{"role": "user", "content": "Hello!"}]
-  }'
-
-# Test image generation (auto-switches to Flux)
-curl http://100.100.108.13:9000/v1/images/generations \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "flux-schnell",
-    "prompt": "a cute cat",
-    "size": "1024x1024"
-  }'
-
-# Test music generation (auto-switches to MusicGen)
-curl http://100.100.108.13:9000/v1/audio/generations \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "musicgen-medium",
-    "prompt": "upbeat electronic dance music",
-    "duration": 30
-  }'
-```
-
-### 5. Update VPS LiteLLM
-
-```bash
-# On VPS, restart LiteLLM to pick up new config
-ssh vps
-cd ~/Projects/docker-compose
-arty restart litellm
-```
-
-## Usage Examples
-
-### Via Open WebUI (https://ai.pivoine.art)
-
-**Text Generation:**
-1. Select model: `qwen-2.5-7b`
-2. Type message and send
-3. Orchestrator loads vLLM automatically
-
-**Image Generation:**
-1. Select model: `flux-schnell`
-2. Enter image prompt
-3. Orchestrator switches to Flux.1
-
-**Music Generation:**
-1. Select model: `musicgen-medium`
-2. Describe the music you want
-3. Orchestrator switches to MusicGen
-
-### Via API (Direct)
-
-```python
-import openai
-
-# Configure client to use orchestrator
-client = openai.OpenAI(
-    base_url="http://100.100.108.13:9000/v1",
-    api_key="dummy"  # Not used but required
-)
-
-# Text generation
-response = client.chat.completions.create(
-    model="qwen-2.5-7b",
-    messages=[{"role": "user", "content": "Write a haiku"}]
-)
-
-# Image generation
-image = client.images.generate(
-    model="flux-schnell",
-    prompt="a sunset over mountains",
-    size="1024x1024"
-)
-
-# Music generation (custom endpoint)
-import requests
-music = requests.post(
-    "http://100.100.108.13:9000/v1/audio/generations",
-    json={
-        "model": "musicgen-medium",
-        "prompt": "calm piano music",
-        "duration": 30
-    }
-)
-```
-
-## Adding New Models
-
-### Step 1: Update `models.yaml`
-
-```yaml
-# Add to ai/model-orchestrator/models.yaml
-models:
-  llama-3.1-8b:  # New model
-    type: text
-    framework: vllm
-    docker_service: vllm-llama
-    port: 8004
-    vram_gb: 17
-    startup_time_seconds: 120
-    endpoint: /v1/chat/completions
-    description: "Llama 3.1 8B Instruct - Meta's latest model"
-```
-
-### Step 2: Add Docker Service
-
-```yaml
-# Add to ai/docker-compose.gpu.yaml
-services:
-  vllm-llama:
-    build: ./vllm
-    container_name: ai_vllm-llama_1
-    command: >
-      vllm serve meta-llama/Llama-3.1-8B-Instruct
-      --port 8000 --dtype bfloat16
-    ports:
-      - "8004:8000"
-    environment:
-      - HF_TOKEN=${HF_TOKEN}
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
-    profiles: ["text"]
-    restart: "no"
-```
-
-### Step 3: Restart Orchestrator
-
-```bash
-ssh gpu-pivoine
-cd /workspace/ai/
-docker compose -f docker-compose.gpu.yaml restart orchestrator
-```
-
-**That's it!** The orchestrator automatically detects the new model.
+Repository: `/mnt/hidrive/users/valknar/Backup`

 ## Management Commands

-### Orchestrator
+```bash
+# Start AI stack
+pnpm arty up ai_postgres webui crawl4ai
+
+# View logs
+docker logs -f ai_webui
+docker logs -f ai_postgres
+docker logs -f ai_crawl4ai
+
+# Check Open WebUI
+curl http://ai.pivoine.art:8080/health
+
+# Restart AI services
+pnpm arty restart ai_postgres webui crawl4ai
+```
+
+## GPU Server Management
+
+For GPU server operations (model orchestration, template creation, etc.):

 ```bash
-# Start orchestrator
-docker compose -f docker-compose.gpu.yaml up -d orchestrator
+# Clone the dedicated repository
+git clone ssh://git@dev.pivoine.art:2222/valknar/runpod.git

-# View orchestrator logs
-docker logs -f ai_orchestrator
-
-# Restart orchestrator
-docker compose -f docker-compose.gpu.yaml restart orchestrator
-
-# Check active model
-curl http://100.100.108.13:9000/health
-
-# List all models
-curl http://100.100.108.13:9000/models
+# See runpod repository for:
+# - Model orchestration setup
+# - RunPod template creation
+# - GPU deployment guides
 ```

-### Manual Model Control
+## Documentation

-```bash
-# Manually switch to specific model
-curl -X POST http://100.100.108.13:9000/switch \
-  -H "Content-Type: application/json" \
-  -d '{"model": "flux-schnell"}'
+### VPS AI Services
+- [GPU_DEPLOYMENT_LOG.md](GPU_DEPLOYMENT_LOG.md) - VPS AI deployment history

-# Check which model is running
-curl http://100.100.108.13:9000/health | jq '.current_model'
-```
+### GPU Server (Separate Repository)
+- [runpod/README.md](https://dev.pivoine.art/valknar/runpod) - Main GPU documentation
+- [runpod/DEPLOYMENT.md](https://dev.pivoine.art/valknar/runpod) - Deployment guide
+- [runpod/RUNPOD_TEMPLATE.md](https://dev.pivoine.art/valknar/runpod) - Template creation

-### Model Services
-
-```bash
-# Manually start a specific model (bypassing orchestrator)
-docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
-
-# Stop a model
-docker compose -f docker-compose.gpu.yaml stop vllm-qwen
-
-# View model logs
-docker logs -f ai_vllm-qwen_1
-docker logs -f ai_flux_1
-docker logs -f ai_musicgen_1
-```
-
-## Monitoring
-
-### GPU Usage
-
-```bash
-ssh gpu-pivoine "nvidia-smi"
-```
-
-### Model Status
-
-```bash
-# Which model is active?
-curl http://100.100.108.13:9000/health
-
-# Model memory usage
-curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
-```
-
-### Performance
-
-```bash
-# Orchestrator logs (model switching)
-docker logs -f ai_orchestrator
-
-# Model-specific logs
-docker logs -f ai_vllm-qwen_1
-docker logs -f ai_flux_1
-docker logs -f ai_musicgen_1
-```
-
-## Troubleshooting
-
-### Model Won't Load
-
-```bash
-# Check orchestrator logs
-docker logs ai_orchestrator
-
-# Check if model service exists
-docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
-
-# Manually test model service
-docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
-curl http://localhost:8001/health
-```
-
-### Orchestrator Can't Connect
-
-```bash
-# Check Docker socket permissions
-ls -l /var/run/docker.sock
-
-# Restart Docker daemon
-sudo systemctl restart docker
-
-# Rebuild orchestrator
-docker compose -f docker-compose.gpu.yaml build orchestrator
-docker compose -f docker-compose.gpu.yaml up -d orchestrator
-```
-
-### Model Switching Too Slow
-
-```bash
-# Check model startup times in models.yaml
-# Adjust startup_time_seconds if needed
-
-# Pre-download models to /workspace cache
-docker run --rm -it --gpus all \
-  -v /workspace/huggingface_cache:/cache \
-  -e HF_HOME=/cache \
-  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
-  huggingface-cli download facebook/musicgen-medium
-```
-
-## File Structure
+## Architecture Overview

 ```
-ai/
-├── docker-compose.gpu.yaml    # Main orchestration file
-├── .env.example               # Environment template
-├── README.md                  # This file
-│
-├── model-orchestrator/        # Central orchestrator service
-│   ├── orchestrator.py        # FastAPI app managing models
-│   ├── models.yaml            # Model registry (EDIT TO ADD MODELS)
-│   ├── Dockerfile
-│   └── requirements.txt
-│
-├── vllm/                      # Text generation (vLLM)
-│   ├── server.py              # Qwen 2.5 7B server
-│   ├── Dockerfile
-│   └── requirements.txt
-│
-├── flux/                      # Image generation (Flux.1 Schnell)
-│   └── config/
-│       └── config.json        # Flux configuration
-│
-├── musicgen/                  # Music generation (MusicGen)
-│   ├── server.py              # MusicGen API server
-│   ├── Dockerfile
-│   └── requirements.txt
-│
-├── litellm-config.yaml        # LiteLLM proxy configuration
-└── GPU_DEPLOYMENT_LOG.md      # Deployment history and notes
+┌─────────────────────────────────────────────────────────────────┐
+│                     VPS (Tailscale: 100.102.217.79)             │
+│  ┌───────────────────────────────────────────────────────────┐  │
+│  │ LiteLLM Proxy (Port 4000)                                 │  │
+│  │ Routes to: Claude API + GPU Orchestrator                  │  │
+│  └───────┬───────────────────────────────────────────────────┘  │
+│          │                                                       │
+│  ┌───────▼─────────┐  ┌──────────────┐  ┌─────────────────┐   │
+│  │ Open WebUI      │  │ Crawl4AI     │  │ AI PostgreSQL   │   │
+│  │ Port: 8080      │  │ Port: 11235  │  │ + pgvector      │   │
+│  └─────────────────┘  └──────────────┘  └─────────────────┘   │
+└─────────────────────────────────────────────────────────────────┘
+                               │ Tailscale VPN
+┌──────────────────────────────┼──────────────────────────────────┐
+│              RunPod GPU Server (Tailscale: 100.100.108.13)      │
+│  ┌───────────────────────────▼──────────────────────────────┐   │
+│  │ Orchestrator (Port 9000)                                  │   │
+│  │ Manages sequential model loading                          │   │
+│  └─────┬──────────────┬──────────────────┬──────────────────┘   │
+│        │              │                  │                       │
+│  ┌─────▼──────┐ ┌────▼────────┐  ┌──────▼───────┐              │
+│  │vLLM        │ │Flux.1       │  │MusicGen      │              │
+│  │Qwen 2.5 7B │ │Schnell      │  │Medium        │              │
+│  │Port: 8001  │ │Port: 8002   │  │Port: 8003    │              │
+│  └────────────┘ └─────────────┘  └──────────────┘              │
+└─────────────────────────────────────────────────────────────────┘
 ```

-## Cost Analysis
-
-### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
-
-### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
-
-### Recommendation
-Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
-
-## Performance Expectations
-
-| Model | VRAM | Startup Time | Generation Speed |
-|-------|------|--------------|------------------|
-| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
-| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
-| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
-
-**Model Switching**: 30-120 seconds (unload current + load new)
-
-## Security Notes
-
- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
-
-## Future Enhancements
-
-1. ⏹️ Add Llama 3.1 8B for alternative text generation
-2. ⏹️ Add Whisper Large v3 for speech-to-text
-3. ⏹️ Add XTTS v2 for text-to-speech
-4. ⏹️ Implement model preloading/caching for faster switching
-5. ⏹️ Add usage metrics and cost tracking
-6. ⏹️ Auto-stop GPU pod during idle periods
-
 ## Support

-For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
-
-## License
-
-Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
-
-**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.
+For issues:
+- **VPS AI services**: Check logs via `docker logs`
+- **GPU server**: See runpod repository documentation
+- **LiteLLM routing**: Review `../litellm-config.yaml`
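
The new README lists eight variables as required in `.env` for the VPS AI stack. A minimal sketch of a pre-flight check for them might look like the following. The function names `parse_env` and `missing_vars` are hypothetical helpers, not part of this repository, and the parser assumes simple `KEY=VALUE` lines rather than the full syntax Docker Compose accepts.

```python
# Hypothetical pre-flight check for the variables the README lists under
# "Environment Variables". Not part of the repository; a sketch only.
REQUIRED_VARS = (
    "AI_DB_PASSWORD",
    "AI_WEBUI_SECRET_KEY",
    "ANTHROPIC_API_KEY",
    "ADMIN_EMAIL",
    "SMTP_HOST",
    "SMTP_PORT",
    "SMTP_USER",
    "SMTP_PASSWORD",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any, so KEY="x" and KEY=x compare equal.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(text, required=REQUIRED_VARS):
    """Return the required names that are absent or empty, in listed order."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]
```

Calling `missing_vars(open(".env").read())` before starting the stack returns an empty list when every variable has a value. The check only tests for presence, so a placeholder such as `<password>` left in the file still counts as set.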