docs: migrate multi-modal AI orchestration to dedicated runpod repository

Multi-modal AI stack (text/image/music generation) has been moved to:
Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git

Updated ai/README.md to document:
- VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL)
- Reference to new runpod repository for GPU infrastructure
- Clear separation between VPS and GPU deployments
- Integration architecture via Tailscale VPN
Commit: abcebd1d9b
Parent: 3ed3e68271
Date:   2025-11-21 14:36:36 +01:00


@@ -1,467 +1,170 @@
# AI Infrastructure

This directory contains AI-related configurations for the VPS deployment.

## Multi-Modal GPU Infrastructure (Migrated)

**The multi-modal AI orchestration stack (text, image, and music generation) has been moved to a dedicated repository:**

**Repository**: https://dev.pivoine.art/valknar/runpod

The RunPod repository contains:

- Model orchestrator for automatic switching between text, image, and music models
- vLLM + Qwen 2.5 7B (text generation)
- Flux.1 Schnell (image generation)
- MusicGen Medium (music generation)
- RunPod template creation scripts
- Complete deployment documentation

This separation allows for independent management of:

- **VPS Services** (this repo): Open WebUI, Crawl4AI, AI database
- **GPU Services** (runpod repo): Model inference, orchestration, RunPod templates

# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types, with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages the model lifecycle.

### Components

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
```
### Features

- **Automatic Model Switching** - Orchestrator detects request type and loads the appropriate model
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing a YAML config
- **Centralized Routing** - LiteLLM proxy provides a unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time

## VPS AI Services (ai/compose.yaml)

This compose stack manages the VPS-side AI infrastructure that integrates with the GPU server:

### Services

#### ai_postgres

Dedicated PostgreSQL 16 instance with the pgvector extension for AI workloads:

- Vector similarity search support
- Isolated from the core database for performance
- Used by Open WebUI for RAG and embeddings

#### webui (Open WebUI)

ChatGPT-like interface exposed at `ai.pivoine.art:8080`:

- Claude API integration via Anthropic
- RAG support with document upload
- Vector storage via pgvector
- Web search capability
- SMTP email via IONOS
- User signup enabled

#### crawl4ai

Internal web scraping service for LLM content preparation:

- API on port 11235 (not exposed publicly)
- Optimized for AI/RAG workflows
- Integration with Open WebUI and n8n

## Integration with GPU Server

The VPS AI services connect to the GPU server via Tailscale VPN:

- **VPS Tailscale IP**: 100.102.217.79
- **GPU Tailscale IP**: 100.100.108.13

**LiteLLM Proxy** (port 4000 on VPS) routes requests:

- Claude API for chat completions
- GPU orchestrator for self-hosted models (text, image, music)

See `../litellm-config.yaml` for the routing configuration.

## Environment Variables

Required in `.env`:

```bash
# AI Database
AI_DB_PASSWORD=<password>

# Open WebUI
AI_WEBUI_SECRET_KEY=<secret>

# Claude API
ANTHROPIC_API_KEY=<api_key>

# Email (IONOS SMTP)
ADMIN_EMAIL=<email>
SMTP_HOST=smtp.ionos.com
SMTP_PORT=587
SMTP_USER=<smtp_user>
SMTP_PASSWORD=<smtp_password>
```

## Backup Configuration

AI services are backed up daily via Restic:

- **ai_postgres_data**: 3 AM (7 daily, 4 weekly, 6 monthly, 2 yearly)
- **ai_webui_data**: 3 AM (same retention)
- **ai_crawl4ai_data**: 3 AM (same retention)

Repository: `/mnt/hidrive/users/valknar/Backup`

## Supported Model Types

### Text Generation

- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation

- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation

- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API

## Quick Start

### 1. Prerequisites

On the RunPod GPU server:

- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
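For reference, the only secret the GPU-side stack needs at this step is the HuggingFace token; an illustrative `.env` (placeholder value):

```bash
# ai/.env on the RunPod GPU server (illustrative)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # used by the model services to download weights
```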
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "flux-schnell",
"prompt": "a cute cat",
"size": "1024x1024"
}'
# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{
"model": "musicgen-medium",
"prompt": "upbeat electronic dance music",
"duration": 30
}'
```
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
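For reference, the routing entries in `litellm-config.yaml` would look roughly like the following sketch (model names and the Claude model ID are illustrative; the GPU-hosted models are reached as OpenAI-compatible endpoints on the orchestrator):

```yaml
model_list:
  # Claude via the Anthropic API
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Self-hosted models behind the GPU orchestrator (OpenAI-compatible)
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy

  - model_name: flux-schnell
    litellm_params:
      model: openai/flux-schnell
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
```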
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically
**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1
**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai
# Configure client to use orchestrator
client = openai.OpenAI(
base_url="http://100.100.108.13:9000/v1",
api_key="dummy" # Not used but required
)
# Text generation
response = client.chat.completions.create(
model="qwen-2.5-7b",
messages=[{"role": "user", "content": "Write a haiku"}]
)
# Image generation
image = client.images.generate(
model="flux-schnell",
prompt="a sunset over mountains",
size="1024x1024"
)
# Music generation (custom endpoint)
import requests
music = requests.post(
"http://100.100.108.13:9000/v1/audio/generations",
json={
"model": "musicgen-medium",
"prompt": "calm piano music",
"duration": 30
}
)
```
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                     # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
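For context, the switching mechanism the orchestrator implements is essentially "stop whatever is loaded, then start the compose service registered for the requested model". A simplified sketch of that idea (illustrative only, not the actual `orchestrator.py`; it assumes the `models.yaml` fields shown above):

```python
import subprocess
import yaml  # PyYAML

COMPOSE_FILE = "docker-compose.gpu.yaml"

with open("model-orchestrator/models.yaml") as f:
    MODELS = yaml.safe_load(f)["models"]

current_model = None  # only one model may occupy the GPU at a time


def switch_to(model_name: str) -> None:
    """Stop the active model service, then start the one registered for model_name."""
    global current_model
    if model_name == current_model:
        return  # already loaded, nothing to do

    if current_model is not None:
        subprocess.run(
            ["docker", "compose", "-f", COMPOSE_FILE, "stop",
             MODELS[current_model]["docker_service"]],
            check=True,
        )

    spec = MODELS[model_name]
    subprocess.run(
        ["docker", "compose", "-f", COMPOSE_FILE,
         "--profile", spec["type"], "up", "-d", spec["docker_service"]],
        check=True,
    )
    current_model = model_name


# A request for "flux-schnell" would therefore trigger:
# switch_to("flux-schnell")
```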
## Management Commands

```bash
# Start AI stack
pnpm arty up ai_postgres webui crawl4ai

# View logs
docker logs -f ai_webui
docker logs -f ai_postgres
docker logs -f ai_crawl4ai

# Check Open WebUI
curl http://ai.pivoine.art:8080/health

# Restart AI services
pnpm arty restart ai_postgres webui crawl4ai
```

## GPU Server Management

For GPU server operations (model orchestration, template creation, etc.):

```bash
# Clone the dedicated repository
git clone ssh://git@dev.pivoine.art:2222/valknar/runpod.git

# See runpod repository for:
# - Model orchestration setup
# - RunPod template creation
# - GPU deployment guides
```

## Documentation

### VPS AI Services

- [GPU_DEPLOYMENT_LOG.md](GPU_DEPLOYMENT_LOG.md) - VPS AI deployment history

### GPU Server (Separate Repository)

- [runpod/README.md](https://dev.pivoine.art/valknar/runpod) - Main GPU documentation
- [runpod/DEPLOYMENT.md](https://dev.pivoine.art/valknar/runpod) - Deployment guide
- [runpod/RUNPOD_TEMPLATE.md](https://dev.pivoine.art/valknar/runpod) - Template creation

## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```

### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
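To watch just the memory figures (handy for confirming that only one model is resident), `nvidia-smi` can be queried directly:

```bash
ssh gpu-pivoine "nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader"
# Expect roughly 11-14 GB used while a single model is loaded (see the VRAM figures above)
```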
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
# Note: the CUDA runtime image does not ship huggingface-cli, so install it first
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3-pip && \
           pip install -U 'huggingface_hub[cli]' && \
           huggingface-cli download facebook/musicgen-medium"
```
## File Structure
```
ai/
├── docker-compose.gpu.yaml        # Main orchestration file
├── .env.example                   # Environment template
├── README.md                      # This file
│
├── model-orchestrator/            # Central orchestrator service
│   ├── orchestrator.py            # FastAPI app managing models
│   ├── models.yaml                # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                          # Text generation (vLLM)
│   ├── server.py                  # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                          # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json            # Flux configuration
│
├── musicgen/                      # Music generation (MusicGen)
│   ├── server.py                  # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml            # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md          # Deployment history and notes
```

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ LiteLLM Proxy (Port 4000)                                 │   │
│ │ Routes to: Claude API + GPU Orchestrator                  │   │
│ └───────┬───────────────────────────────────────────────────┘   │
│         │                                                       │
│ ┌───────▼─────────┐  ┌──────────────┐  ┌─────────────────┐      │
│ │ Open WebUI      │  │ Crawl4AI     │  │ AI PostgreSQL   │      │
│ │ Port: 8080      │  │ Port: 11235  │  │ + pgvector      │      │
│ └─────────────────┘  └──────────────┘  └─────────────────┘      │
└──────────────────────────────┬──────────────────────────────────┘
                               │ Tailscale VPN
┌──────────────────────────────┼──────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                   │
│ ┌────────────────────────────▼─────────────────────────────┐    │
│ │ Orchestrator (Port 9000)                                  │    │
│ │ Manages sequential model loading                          │    │
│ └─────┬──────────────┬──────────────────┬──────────────────┘    │
│       │              │                  │                       │
│ ┌─────▼──────┐  ┌────▼────────┐  ┌──────▼───────┐               │
│ │ vLLM       │  │ Flux.1      │  │ MusicGen     │               │
│ │ Qwen 2.5 7B│  │ Schnell     │  │ Medium       │               │
│ │ Port: 8001 │  │ Port: 8002  │  │ Port: 8003   │               │
│ └────────────┘  └─────────────┘  └──────────────┘               │
└─────────────────────────────────────────────────────────────────┘
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
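The monthly figures above follow directly from the hourly rates (assuming a 30-day month); a quick sanity check:

```python
spot_rate = 0.50        # $/hour, single RTX 4090 spot instance
multi_gpu_rate = 0.75   # $/hour, 2x RTX 4090

always_on    = spot_rate * 24 * 30       # ~$360/month, running 24/7
business_hrs = spot_rate * 8 * 30        # ~$120/month, 8 hours/day
multi_gpu    = multi_gpu_rate * 24 * 30  # ~$540/month, running 24/7

print(always_on, business_hrs, multi_gpu)  # 360.0 120.0 540.0
```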
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`); see the mount sketch below
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
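The Docker socket requirement translates into a volume mount on the orchestrator service; presumably something along these lines in `docker-compose.gpu.yaml` (illustrative):

```yaml
services:
  orchestrator:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # lets the orchestrator start/stop model containers
```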
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support

For issues or questions:

- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review the GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: The pre-trained MusicGen weights are non-commercial (CC-BY-NC); for commercial use, train your own weights with the MIT-licensed code.

## Support

For issues:

- **VPS AI services**: Check logs via `docker logs`
- **GPU server**: See the runpod repository documentation
- **LiteLLM routing**: Review `../litellm-config.yaml`