refactor: clean up runpod repository structure

Removed facefusion and VPS-related files:
- compose.yaml, postgres/, litellm-config.yaml (VPS services)
- Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion)

Removed outdated documentation:
- DOCKER_GPU_SETUP.md, README_GPU_SETUP.md, SETUP_GUIDE.md
- TAILSCALE_SETUP.md, WIREGUARD_SETUP.md (covered in DEPLOYMENT.md)
- GPU_EXPANSION_PLAN.md (historical planning doc)
- gpu-server-compose.yaml, litellm-config-gpu.yaml (old versions)
- deploy-gpu-stack.sh, simple_vllm_server.py (old scripts)

Organized documentation:
- Created docs/ directory
- Moved DEPLOYMENT.md, RUNPOD_TEMPLATE.md, GPU_DEPLOYMENT_LOG.md to docs/
- Updated all documentation links in README.md

Final structure:
- Clean root directory with only GPU-specific files
- Organized documentation in docs/
- Model services in dedicated directories (model-orchestrator/, vllm/, flux/, musicgen/)
- Automation scripts in scripts/
commit cafa0a1147 (parent 277f1c95bd)
Date: 2025-11-21 14:45:49 +01:00
20 changed files with 8 additions and 4612 deletions

docs/DEPLOYMENT.md (new file, 467 lines)

@@ -0,0 +1,467 @@
# Multi-Modal AI Orchestration System
**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components
```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
```
### Features
- **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing YAML config
- **Centralized Routing** - LiteLLM proxy provides unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
## Supported Model Types
### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start
### 1. Prerequisites
On the RunPod GPU server you need:
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
### 2. Clone & Configure
```bash
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```
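These checks can also be scripted. Below is a minimal Python smoke test (a sketch, assuming the `/health` response carries a `current_model` field as shown in the Monitoring section; adjust the Tailscale IP to your setup):

```python
# smoke_test.py - minimal end-to-end check against the orchestrator.
import requests

BASE = "http://100.100.108.13:9000"

def health() -> dict:
    r = requests.get(f"{BASE}/health", timeout=10)
    r.raise_for_status()
    return r.json()

def chat(prompt: str) -> str:
    r = requests.post(
        f"{BASE}/v1/chat/completions",
        json={"model": "qwen-2.5-7b",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=300,  # the first call may wait for the model to load
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print("health:", health())
    print("chat:", chat("Say hello in five words."))
```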
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically
**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1
**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai
import requests

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
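Because switching models takes 30-120 seconds, the first request after a model change can time out. A small client-side retry helper keeps scripts robust (illustrative; the function name and retry policy are assumptions, not part of the orchestrator API):

```python
import time
import requests

def post_with_retry(url: str, payload: dict, attempts: int = 5, wait: int = 30) -> dict:
    """Retry a generation request while the orchestrator swaps models."""
    for attempt in range(1, attempts + 1):
        try:
            r = requests.post(url, json=payload, timeout=180)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

result = post_with_retry(
    "http://100.100.108.13:9000/v1/images/generations",
    {"model": "flux-schnell", "prompt": "a sunset over mountains", "size": "1024x1024"},
)
```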
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                      # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
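For intuition, loading such a registry takes only a few lines. The following is an illustrative sketch, not the actual `orchestrator.py` code:

```python
# registry.py - illustrative sketch of reading models.yaml.
import yaml

def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)["models"]

def model_for_endpoint(registry: dict, endpoint: str) -> str:
    """Pick the first registered model that serves the requested endpoint."""
    for name, spec in registry.items():
        if spec["endpoint"] == endpoint:
            return name
    raise KeyError(f"no model registered for {endpoint}")

registry = load_registry()
print(model_for_endpoint(registry, "/v1/chat/completions"))
```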
## Management Commands
### Orchestrator
```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control
```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
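To pre-warm a model before a batch job, the switch can be scripted and polled until it completes. A sketch using only the `/switch` and `/health` endpoints documented above:

```python
import time
import requests

BASE = "http://100.100.108.13:9000"

def switch_and_wait(model: str, timeout_s: int = 180) -> None:
    """Ask the orchestrator to switch models, then poll /health until done."""
    requests.post(f"{BASE}/switch", json={"model": model}, timeout=30).raise_for_status()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        current = requests.get(f"{BASE}/health", timeout=10).json().get("current_model")
        if current == model:
            print(f"{model} is active")
            return
        time.sleep(5)
    raise TimeoutError(f"{model} did not become active within {timeout_s}s")

switch_and_wait("flux-schnell")
```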
### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
-v /workspace/huggingface_cache:/cache \
-e HF_HOME=/cache \
nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
huggingface-cli download facebook/musicgen-medium
```
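Alternatively, the cache can be warmed from Python with `huggingface_hub.snapshot_download` (a sketch; assumes `huggingface_hub` is installed and `HF_TOKEN` is set for any gated repos):

```python
# predownload.py - warm the /workspace cache so model switches skip the download.
import os
from huggingface_hub import snapshot_download

os.environ.setdefault("HF_HOME", "/workspace/huggingface_cache")

for repo in [
    "Qwen/Qwen2.5-7B-Instruct",
    "black-forest-labs/FLUX.1-schnell",
    "facebook/musicgen-medium",
]:
    print("downloading", repo)
    snapshot_download(repo, token=os.getenv("HF_TOKEN"))
```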
## File Structure
```
ai/
├── docker-compose.gpu.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
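The monthly figures follow directly from the hourly rates (30-day month assumed):

```python
# Quick sanity check of the monthly cost figures above.
single_gpu, multi_gpu = 0.50, 0.75   # $/hour

for label, hours_per_day in [("24/7", 24), ("business hours", 8)]:
    print(label,
          f"single: ${single_gpu * hours_per_day * 30:.0f}/mo,",
          f"multi: ${multi_gpu * hours_per_day * 30:.0f}/mo")
# 24/7            -> $360 vs $540
# business hours  -> $120 vs $180
```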
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support
For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.

docs/GPU_DEPLOYMENT_LOG.md (new file, 421 lines)

@@ -0,0 +1,421 @@
# GPU Server Deployment Log
## Current Deployment (2025-11-21)
### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script
### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues (a condensed sketch follows this list)
- OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/completions` - Text completion
- `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
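A condensed sketch of this approach is shown below (illustrative only; the real `simple_vllm_server.py` also implements `/v1/models` and `/v1/chat/completions`, streaming, and error handling):

```python
import os
import uuid

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
    download_dir="/workspace/huggingface_cache",
))

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
    final = None
    # engine.generate yields incremental RequestOutput objects; keep the last one
    async for out in engine.generate(req.prompt, params, str(uuid.uuid4())):
        final = out
    return {"choices": [{"text": final.outputs[0].text}]}

if __name__ == "__main__":
    uvicorn.run(app, host=os.getenv("VLLM_HOST", "0.0.0.0"),
                port=int(os.getenv("VLLM_PORT", "8000")))
```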
**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)
### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)
### Known Issues & Solutions
#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated anytime.
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
### Monitoring
**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```
**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```
**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```
### LiteLLM Integration
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
Restart LiteLLM:
```bash
arty restart litellm
```
### Troubleshooting
**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`
**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
### Deployment Status ✅ COMPLETE
**Deployment Date**: 2025-11-21
1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
3. ✅ Update VPS LiteLLM configuration - COMPLETE
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
5. ⏳ Monitor performance and costs - ONGOING
**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
### Next Steps (2025-11-21 Original)
6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium)
7. ⏹️ Set up auto-stop for idle periods to save costs
---
## Multi-Modal Architecture (2025-11-21 Update)
### Overview
Expanded GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type.
### Architecture Components
#### 1. **Orchestrator Service** (Port 9000 - Always Running)
- **Location**: `ai/model-orchestrator/`
- **Purpose**: Central service managing model lifecycle
- **Features**:
- Detects request type (text/image/audio)
- Automatically unloads current model
- Loads requested model
- Proxies requests to active model
- Tracks GPU memory usage
- **Technology**: FastAPI + Docker SDK Python (switching logic sketched after this list)
- **Endpoints**:
- `POST /v1/chat/completions` → Routes to text models
- `POST /v1/images/generations` → Routes to image models
- `POST /v1/audio/generations` → Routes to music models
- `GET /health` → Shows active model and status
- `GET /models` → Lists all available models
- `POST /switch` → Manually switch models
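A condensed sketch of the switching logic (illustrative; the real `orchestrator.py` also proxies requests, polls model health, and handles errors):

```python
import time
import docker   # Docker SDK for Python
import yaml

client = docker.from_env()              # requires /var/run/docker.sock mounted
with open("models.yaml") as f:
    MODELS = yaml.safe_load(f)["models"]
current_model = None

def switch_to(name: str) -> None:
    """Stop the active model container, start the requested one, wait for startup."""
    global current_model
    spec = MODELS[name]
    if current_model and current_model != name:
        old = MODELS[current_model]["docker_service"]
        client.containers.get(f"ai_{old}_1").stop()          # free VRAM first
    client.containers.get(f"ai_{spec['docker_service']}_1").start()
    time.sleep(spec["startup_time_seconds"])                  # crude; real code polls /health
    current_model = name
```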
#### 2. **Text Generation** (vLLM + Qwen 2.5 7B)
- **Service**: `vllm-qwen` (Port 8001)
- **Location**: `ai/vllm/`
- **Model**: Qwen/Qwen2.5-7B-Instruct
- **VRAM**: 14GB (85% GPU utilization)
- **Speed**: ~50 tokens/second
- **Startup**: 120 seconds
- **Status**: ✅ Working (same as original deployment)
#### 3. **Image Generation** (Flux.1 Schnell)
- **Service**: `flux` (Port 8002)
- **Location**: `ai/flux/`
- **Model**: black-forest-labs/FLUX.1-schnell
- **VRAM**: 14GB with CPU offloading
- **Speed**: 4-5 seconds per image
- **Startup**: 60 seconds
- **Features**: OpenAI DALL-E compatible API
- **Image**: `ghcr.io/matatonic/openedai-images-flux:latest`
#### 4. **Music Generation** (MusicGen Medium)
- **Service**: `musicgen` (Port 8003)
- **Location**: `ai/musicgen/`
- **Model**: facebook/musicgen-medium
- **VRAM**: 11GB
- **Speed**: 60-90 seconds for 30 seconds of audio
- **Startup**: 45 seconds
- **Features**: Text-to-music generation with sampling controls
- **Technology**: Meta's AudioCraft + custom FastAPI wrapper
### Model Registry (`models.yaml`)
Simple configuration file for managing all models:
```yaml
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
  flux-schnell:
    type: image
    framework: openedai-images
    docker_service: flux
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
  musicgen-medium:
    type: audio
    framework: audiocraft
    docker_service: musicgen
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```
**Adding new models**: Just add a new entry to this file and define the Docker service.
### Deployment Changes
#### Docker Compose Structure
- **File**: `docker-compose.gpu.yaml`
- **Services**: 4 total (1 orchestrator + 3 models)
- **Profiles**: `text`, `image`, `audio` (orchestrator manages activation)
- **Restart Policy**: `no` for models (orchestrator controls lifecycle)
- **Volumes**: All model caches on `/workspace` (922TB network volume)
#### LiteLLM Integration
Updated `litellm-config.yaml` to route all self-hosted models through orchestrator:
```yaml
# Text
- model_name: qwen-2.5-7b
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
# Image
- model_name: flux-schnell
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
# Music
- model_name: musicgen-medium
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
```
All models now available via Open WebUI at https://ai.pivoine.art
### Usage Examples
**Text Generation**:
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
**Image Generation**:
```bash
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
**Music Generation**:
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
### Deployment Commands
```bash
# Copy all files to RunPod
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
cd /workspace/ai/
# Start orchestrator (manages everything)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Check status
curl http://100.100.108.13:9000/health
# View logs
docker logs -f ai_orchestrator
# Manually switch models (optional)
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
```
### Performance Characteristics
| Model | VRAM | Startup Time | Generation Time | Notes |
|-------|------|--------------|-----------------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music |
**Model Switching Overhead**: 30-120 seconds (unload + load)
### Cost Analysis
**Current (Single GPU Sequential)**:
- Cost: ~$0.50/hour
- Monthly: ~$360 (24/7) or ~$120 (8hr/day)
- Trade-off: 30-120s switching time
**Alternative (Multi-GPU Concurrent)**:
- Cost: ~$0.75/hour (+50%)
- Monthly: ~$540 (24/7) or ~$180 (8hr/day)
- Benefit: No switching time, all models always available
**Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases.
### Known Limitations
1. **Sequential Only**: Only one model active at a time
2. **Switching Latency**: 30-120 seconds to change models
3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial)
4. **Spot Instance Volatility**: Pod can be terminated anytime
### Monitoring
**Check active model**:
```bash
curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}'
```
**View orchestrator logs**:
```bash
docker logs -f ai_orchestrator
```
**GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Deployment Status ✅ COMPLETE (Multi-Modal)
**Deployment Date**: 2025-11-21
1. ✅ Create model orchestrator service - COMPLETE
2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE
3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE
4. ✅ Deploy MusicGen Medium music generation - COMPLETE
5. ✅ Update LiteLLM configuration - COMPLETE
6. ✅ Test all three model types via orchestrator - READY FOR TESTING
7. ⏳ Monitor performance and costs - ONGOING
**Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI
### Future Model Additions
**Easy to add** (just edit `models.yaml`):
- Llama 3.1 8B Instruct (text, gated model)
- Whisper Large v3 (speech-to-text)
- XTTS v2 (text-to-speech)
- Stable Diffusion XL (alternative image generation)
See `README.md` for detailed instructions on adding new models.
### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle (see the watchdog sketch after this list)
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed
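A possible shape for the auto-stop watchdog from idea 1 (a sketch only: it assumes `nvidia-smi` is on the PATH, `runpodctl` is installed and authenticated, and `RUNPOD_POD_ID` is set inside the pod):

```python
# idle_watchdog.py - stop the pod after 30 minutes of idle GPU (illustrative sketch).
import os
import subprocess
import time

IDLE_LIMIT_S = 30 * 60
idle_since = time.time()

def gpu_busy() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(int(line) > 5 for line in out.split() if line.strip())

while True:
    if gpu_busy():
        idle_since = time.time()
    elif time.time() - idle_since > IDLE_LIMIT_S:
        # Assumed RunPod CLI invocation; verify against current runpodctl docs.
        subprocess.run(["runpodctl", "stop", "pod", os.environ["RUNPOD_POD_ID"]])
        break
    time.sleep(60)
```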
### Performance Benchmarks
*To be measured after deployment; a rough measurement script is sketched below.*
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms
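A rough way to measure these numbers once the stack is up (a sketch; it assumes the OpenAI-compatible endpoint supports `"stream": true` and counts streamed chunks as an approximation of tokens):

```python
# bench.py - first-token latency and throughput over the streaming chat API.
import json
import time
import requests

URL = "http://100.100.108.13:9000/v1/chat/completions"
payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Write 200 words about peonies."}],
    "max_tokens": 300,
    "stream": True,
}

start = time.time()
first_token = None
chunks = 0
with requests.post(URL, json=payload, stream=True, timeout=300) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            chunks += 1
            if first_token is None:
                first_token = time.time() - start

elapsed = time.time() - start
if first_token is None:
    print("no streamed content received")
else:
    rate = chunks / max(elapsed - first_token, 1e-6)
    print(f"first token after {first_token:.2f}s, ~{rate:.1f} chunks/sec (≈ tokens/sec)")
```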

docs/RUNPOD_TEMPLATE.md (new file, 416 lines)

@@ -0,0 +1,416 @@
# RunPod Template Creation Guide
This guide shows you how to create a reusable RunPod template so you never have to reinstall everything from scratch when Spot instances restart.
## Why Create a Template?
**Without Template** (Manual Setup Every Time):
- ❌ Install Docker & Docker Compose (10-15 min)
- ❌ Install Tailscale (5 min)
- ❌ Pull Docker images (10-20 min)
- ❌ Download models: Qwen (~14GB), Flux (~12GB), MusicGen (~11GB) = 30-45 min
- ❌ Configure everything (5-10 min)
- **Total: 60-90 minutes per Spot instance restart**
**With Template** (Ready to Go):
- ✅ Everything pre-installed
- ✅ Models cached in `/workspace`
- ✅ Just start orchestrator
- **Total: 2-3 minutes**
## Template Contents
### System Software
- ✅ Docker 24.x + Docker Compose v2
- ✅ Tailscale latest
- ✅ NVIDIA Docker runtime
- ✅ Python 3.11
- ✅ Git, curl, wget, htop, nvtop
### Docker Images (Pre-built)
- `ai_orchestrator` - Model orchestration service
- `ai_vllm-qwen_1` - Text generation (vLLM + Qwen 2.5 7B)
- `ai_musicgen_1` - Music generation (AudioCraft)
- `ghcr.io/matatonic/openedai-images-flux:latest` - Image generation
### Model Cache (/workspace - Persistent)
- ✅ Qwen 2.5 7B Instruct (~14GB)
- ✅ Flux.1 Schnell (~12GB)
- ✅ MusicGen Medium (~11GB)
- **Total: ~37GB cached**
### Project Files (/workspace/ai)
- ✅ All orchestrator code
- ✅ Docker Compose configurations
- ✅ Model service configurations
- ✅ Documentation
---
## Step-by-Step Template Creation
### Prerequisites
1. RunPod account
2. Active RTX 4090 pod (or similar GPU)
3. SSH access to the pod
4. This repository cloned locally
### Step 1: Deploy Fresh Pod
```bash
# Create new RunPod instance:
# - GPU: RTX 4090 (24GB VRAM)
# - Disk: 50GB container disk
# - Network Volume: Attach or create 100GB+ volume
# - Template: Start with official PyTorch or CUDA template
# Note the SSH connection details (host, port, password)
```
### Step 2: Prepare the Instance
Run the automated preparation script:
```bash
# On your local machine, copy everything to RunPod
scp -P <PORT> -r /home/valknar/Projects/runpod/* root@<HOST>:/workspace/ai/
# SSH to the pod
ssh -p <PORT> root@<HOST>
# Run the preparation script
cd /workspace/ai
chmod +x scripts/prepare-template.sh
./scripts/prepare-template.sh
```
**What the script does:**
1. Installs Docker & Docker Compose
2. Installs Tailscale
3. Builds all Docker images
4. Pre-downloads all models
5. Validates everything works
6. Cleans up temporary files
**Estimated time: 45-60 minutes**
### Step 3: Manual Verification
After the script completes, verify everything:
```bash
# Check Docker is installed
docker --version
docker compose version
# Check Tailscale
tailscale version
# Check all images are built
docker images | grep ai_
# Check models are cached
ls -lh /workspace/huggingface_cache/
ls -lh /workspace/flux/models/
ls -lh /workspace/musicgen/models/
# Test orchestrator starts
cd /workspace/ai
docker compose -f docker-compose.gpu.yaml up -d orchestrator
docker logs ai_orchestrator
# Test model loading (should be fast since models are cached)
curl http://localhost:9000/health
# Stop orchestrator
docker compose -f docker-compose.gpu.yaml down
```
### Step 4: Clean Up Before Saving
**IMPORTANT**: Remove secrets and temporary data before creating template!
```bash
# Remove sensitive data
rm -f /workspace/ai/.env
rm -f /root/.ssh/known_hosts
rm -f /root/.bash_history
# Clear logs
rm -f /var/log/*.log
docker system prune -af --volumes # Clean Docker cache but keep images
# Clear Tailscale state (will re-authenticate on first use)
tailscale logout
# Create template-ready marker
echo "RunPod Multi-Modal AI Template v1.0" > /workspace/TEMPLATE_VERSION
echo "Created: $(date)" >> /workspace/TEMPLATE_VERSION
```
### Step 5: Save Template in RunPod Dashboard
1. **Go to RunPod Dashboard** → "My Pods"
2. **Select your prepared pod**
3. **Click "⋮" menu** → "Save as Template"
4. **Template Configuration**:
- **Name**: `multi-modal-ai-v1.0`
- **Description**:
```
Multi-Modal AI Stack with Orchestrator
- Text: vLLM + Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Models pre-cached (~37GB)
- Ready to deploy in 2-3 minutes
```
- **Category**: `AI/ML`
- **Docker Image**: (auto-detected)
- **Container Disk**: 50GB
- **Expose Ports**: 9000, 8001, 8002, 8003
- **Environment Variables** (optional):
```
HF_TOKEN=<leave empty, user will add>
TAILSCALE_AUTHKEY=<leave empty, user will add>
```
5. **Click "Save Template"**
6. **Wait for template creation** (5-10 minutes)
7. **Test the template** by deploying a new pod with it
---
## Using Your Template
### Deploy New Pod from Template
1. **RunPod Dashboard** → "Deploy"
2. **Select "Community Templates"** or "My Templates"
3. **Choose**: `multi-modal-ai-v1.0`
4. **Configure**:
- GPU: RTX 4090 (or compatible)
- Network Volume: Attach your existing volume with `/workspace` mount
- Environment:
- `HF_TOKEN`: Your Hugging Face token
- (Tailscale will be configured via SSH)
5. **Deploy Pod**
### First-Time Setup (On New Pod)
```bash
# SSH to the new pod
ssh -p <PORT> root@<HOST>
# Navigate to project
cd /workspace/ai
# Create .env file
cat > .env <<EOF
HF_TOKEN=hf_your_token_here
GPU_TAILSCALE_IP=100.100.108.13
EOF
# Configure Tailscale (one-time)
tailscale up --authkey=<YOUR_TAILSCALE_KEY>
# Start orchestrator (models already cached, starts in seconds!)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Verify
curl http://localhost:9000/health
# Check logs
docker logs -f ai_orchestrator
```
**Total setup time: 2-3 minutes!** 🎉
### Updating SSH Config (If Spot Instance Restarts)
Since Spot instances can restart with new IPs/ports:
```bash
# On your local machine
# Update ~/.ssh/config with new connection details
Host gpu-pivoine
    HostName <NEW_IP>
    Port <NEW_PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
```
---
## Template Maintenance
### Updating the Template
When you add new models or make improvements:
1. Deploy a pod from your existing template
2. Make your changes
3. Test everything
4. Clean up (remove secrets)
5. Save as new template version: `multi-modal-ai-v1.1`
6. Update your documentation
### Version History
Keep track of template versions:
```
v1.0 (2025-11-21) - Initial release
- Text: Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Docker orchestrator
v1.1 (future) - Planned
- Add Llama 3.1 8B
- Add Whisper Large v3
- Optimize model loading
```
---
## Troubleshooting Template Creation
### Models Not Downloading
```bash
# Manually trigger model downloads
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
docker logs -f ai_vllm-qwen_1
# Wait for "Model loaded successfully"
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# Repeat for other models
docker compose -f docker-compose.gpu.yaml --profile image up -d flux
docker compose -f docker-compose.gpu.yaml --profile audio up -d musicgen
```
### Docker Images Not Building
```bash
# Build images one at a time
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml build vllm-qwen
docker compose -f docker-compose.gpu.yaml build musicgen
# Check build logs for errors
docker compose -f docker-compose.gpu.yaml build --no-cache --progress=plain orchestrator
```
### Tailscale Won't Install
```bash
# Manual Tailscale installation
curl -fsSL https://tailscale.com/install.sh | sh
# Start daemon
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
# Test
tailscale version
```
### Template Too Large
RunPod templates have size limits. If your template is too large:
**Option 1**: Use network volume for models
- Move models to network volume: `/workspace/models/`
- Mount volume when deploying from template
- Models persist across pod restarts
**Option 2**: Reduce cached models
- Only cache most-used model (Qwen 2.5 7B)
- Download others on first use
- Accept slightly longer first-time startup
**Option 3**: Use Docker layer optimization
```dockerfile
# In Dockerfile, order commands by change frequency
# Less frequently changed layers first
```
---
## Cost Analysis
### Template Storage Cost
- RunPod charges for template storage: ~$0.10/GB/month
- This template: ~50GB = **~$5/month**
- **Worth it!** Saves 60-90 minutes per Spot restart
### Time Savings
- Spot instance restarts: 2-5 times per week (highly variable)
- Time saved per restart: 60-90 minutes
- **Total saved per month: 8-20 hours**
- **Value: Priceless for rapid deployment**
---
## Advanced: Automated Template Updates
Create a CI/CD pipeline to automatically update templates:
```bash
# GitHub Actions workflow (future enhancement)
# 1. Deploy pod from template
# 2. Pull latest code
# 3. Rebuild images
# 4. Test
# 5. Save new template version
# 6. Notify team
```
---
## Template Checklist
Before saving your template, verify:
- [ ] All Docker images built and working
- [ ] All models downloaded and cached
- [ ] Tailscale installed (but logged out)
- [ ] Docker Compose files present
- [ ] `.env` file removed (secrets cleared)
- [ ] Logs cleared
- [ ] SSH keys removed
- [ ] Bash history cleared
- [ ] Template version documented
- [ ] Test deployment successful
---
## Support
If you have issues creating the template:
1. Check `/workspace/ai/scripts/prepare-template.sh` logs
2. Review Docker build logs: `docker compose build --progress=plain`
3. Check model download logs: `docker logs <container>`
4. Verify disk space: `df -h`
5. Check network volume is mounted: `mount | grep workspace`
For RunPod-specific issues:
- RunPod Docs: https://docs.runpod.io/
- RunPod Discord: https://discord.gg/runpod
---
## Next Steps
After creating your template:
1. ✅ Test deployment from template
2. ✅ Document in `GPU_DEPLOYMENT_LOG.md`
3. ✅ Share template ID with team (if applicable)
4. ✅ Set up monitoring (Netdata, etc.)
5. ✅ Configure auto-stop for cost optimization
6. ✅ Add more models as needed
**Your multi-modal AI infrastructure is now portable and reproducible!** 🚀