refactor: clean up runpod repository structure

Removed facefusion and VPS-related files:
- compose.yaml, postgres/, litellm-config.yaml (VPS services)
- Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion)

Removed outdated documentation:
- DOCKER_GPU_SETUP.md, README_GPU_SETUP.md, SETUP_GUIDE.md
- TAILSCALE_SETUP.md, WIREGUARD_SETUP.md (covered in DEPLOYMENT.md)
- GPU_EXPANSION_PLAN.md (historical planning doc)
- gpu-server-compose.yaml, litellm-config-gpu.yaml (old versions)
- deploy-gpu-stack.sh, simple_vllm_server.py (old scripts)

Organized documentation:
- Created docs/ directory
- Moved DEPLOYMENT.md, RUNPOD_TEMPLATE.md, GPU_DEPLOYMENT_LOG.md to docs/
- Updated all documentation links in README.md

Final structure:
- Clean root directory with only GPU-specific files
- Organized documentation in docs/
- Model services in dedicated directories (model-orchestrator/, vllm/, flux/, musicgen/)
- Automation scripts in scripts/
commit cafa0a1147 (parent 277f1c95bd)
Date: 2025-11-21 14:45:49 +01:00
20 changed files with 8 additions and 4612 deletions

docs/DEPLOYMENT.md (new file, 467 lines)

@@ -0,0 +1,467 @@
# Multi-Modal AI Orchestration System
**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components
```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
```
### Features
- **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing YAML config
- **Centralized Routing** - LiteLLM proxy provides unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
## Supported Model Types
### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start
### 1. Prerequisites
On the RunPod GPU server you need:
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
### 2. Clone & Configure
```bash
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```
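These checks can also be scripted. Below is a minimal Python smoke test (a sketch, assuming the `/health` response carries a `current_model` field as shown in the Monitoring section; adjust the Tailscale IP to your setup):

```python
# smoke_test.py - minimal end-to-end check against the orchestrator.
import requests

BASE = "http://100.100.108.13:9000"

def health() -> dict:
    r = requests.get(f"{BASE}/health", timeout=10)
    r.raise_for_status()
    return r.json()

def chat(prompt: str) -> str:
    r = requests.post(
        f"{BASE}/v1/chat/completions",
        json={"model": "qwen-2.5-7b",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=300,  # the first call may wait for the model to load
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print("health:", health())
    print("chat:", chat("Say hello in five words."))
```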
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically
**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1
**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai
import requests

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
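Because switching models takes 30-120 seconds, the first request after a model change can time out. A small client-side retry helper keeps scripts robust (illustrative; the function name and retry policy are assumptions, not part of the orchestrator API):

```python
import time
import requests

def post_with_retry(url: str, payload: dict, attempts: int = 5, wait: int = 30) -> dict:
    """Retry a generation request while the orchestrator swaps models."""
    for attempt in range(1, attempts + 1):
        try:
            r = requests.post(url, json=payload, timeout=180)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

result = post_with_retry(
    "http://100.100.108.13:9000/v1/images/generations",
    {"model": "flux-schnell", "prompt": "a sunset over mountains", "size": "1024x1024"},
)
```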
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                      # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
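For intuition, loading such a registry takes only a few lines. The following is an illustrative sketch, not the actual `orchestrator.py` code:

```python
# registry.py - illustrative sketch of reading models.yaml.
import yaml

def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)["models"]

def model_for_endpoint(registry: dict, endpoint: str) -> str:
    """Pick the first registered model that serves the requested endpoint."""
    for name, spec in registry.items():
        if spec["endpoint"] == endpoint:
            return name
    raise KeyError(f"no model registered for {endpoint}")

registry = load_registry()
print(model_for_endpoint(registry, "/v1/chat/completions"))
```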
## Management Commands
### Orchestrator
```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control
```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
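To pre-warm a model before a batch job, the switch can be scripted and polled until it completes. A sketch using only the `/switch` and `/health` endpoints documented above:

```python
import time
import requests

BASE = "http://100.100.108.13:9000"

def switch_and_wait(model: str, timeout_s: int = 180) -> None:
    """Ask the orchestrator to switch models, then poll /health until done."""
    requests.post(f"{BASE}/switch", json={"model": model}, timeout=30).raise_for_status()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        current = requests.get(f"{BASE}/health", timeout=10).json().get("current_model")
        if current == model:
            print(f"{model} is active")
            return
        time.sleep(5)
    raise TimeoutError(f"{model} did not become active within {timeout_s}s")

switch_and_wait("flux-schnell")
```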
### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
-v /workspace/huggingface_cache:/cache \
-e HF_HOME=/cache \
nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
huggingface-cli download facebook/musicgen-medium
```
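Alternatively, the cache can be warmed from Python with `huggingface_hub.snapshot_download` (a sketch; assumes `huggingface_hub` is installed and `HF_TOKEN` is set for any gated repos):

```python
# predownload.py - warm the /workspace cache so model switches skip the download.
import os
from huggingface_hub import snapshot_download

os.environ.setdefault("HF_HOME", "/workspace/huggingface_cache")

for repo in [
    "Qwen/Qwen2.5-7B-Instruct",
    "black-forest-labs/FLUX.1-schnell",
    "facebook/musicgen-medium",
]:
    print("downloading", repo)
    snapshot_download(repo, token=os.getenv("HF_TOKEN"))
```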
## File Structure
```
ai/
├── docker-compose.gpu.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
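The monthly figures follow directly from the hourly rates (30-day month assumed):

```python
# Quick sanity check of the monthly cost figures above.
single_gpu, multi_gpu = 0.50, 0.75   # $/hour

for label, hours_per_day in [("24/7", 24), ("business hours", 8)]:
    print(label,
          f"single: ${single_gpu * hours_per_day * 30:.0f}/mo,",
          f"multi: ${multi_gpu * hours_per_day * 30:.0f}/mo")
# 24/7            -> $360 vs $540
# business hours  -> $120 vs $180
```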
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support
For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.

docs/GPU_DEPLOYMENT_LOG.md (new file, 421 lines)

@@ -0,0 +1,421 @@
# GPU Server Deployment Log
## Current Deployment (2025-11-21)
### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script
### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues (a condensed sketch follows this list)
- OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/completions` - Text completion
- `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
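A condensed sketch of this approach is shown below (illustrative only; the real `simple_vllm_server.py` also implements `/v1/models` and `/v1/chat/completions`, streaming, and error handling):

```python
import os
import uuid

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
    download_dir="/workspace/huggingface_cache",
))

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
    final = None
    # engine.generate yields incremental RequestOutput objects; keep the last one
    async for out in engine.generate(req.prompt, params, str(uuid.uuid4())):
        final = out
    return {"choices": [{"text": final.outputs[0].text}]}

if __name__ == "__main__":
    uvicorn.run(app, host=os.getenv("VLLM_HOST", "0.0.0.0"),
                port=int(os.getenv("VLLM_PORT", "8000")))
```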
**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)
### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)
### Known Issues & Solutions
#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated anytime.
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
### Monitoring
**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```
**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```
**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```
### LiteLLM Integration
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
Restart LiteLLM:
```bash
arty restart litellm
```
### Troubleshooting
**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`
**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
### Deployment Status ✅ COMPLETE
**Deployment Date**: 2025-11-21
1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
3. ✅ Update VPS LiteLLM configuration - COMPLETE
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
5. ⏳ Monitor performance and costs - ONGOING
**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
### Next Steps (2025-11-21 Original)
6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium)
7. ⏹️ Set up auto-stop for idle periods to save costs
---
## Multi-Modal Architecture (2025-11-21 Update)
### Overview
Expanded GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type.
### Architecture Components
#### 1. **Orchestrator Service** (Port 9000 - Always Running)
- **Location**: `ai/model-orchestrator/`
- **Purpose**: Central service managing model lifecycle
- **Features**:
- Detects request type (text/image/audio)
- Automatically unloads current model
- Loads requested model
- Proxies requests to active model
- Tracks GPU memory usage
- **Technology**: FastAPI + Docker SDK Python (switching logic sketched after this list)
- **Endpoints**:
- `POST /v1/chat/completions` → Routes to text models
- `POST /v1/images/generations` → Routes to image models
- `POST /v1/audio/generations` → Routes to music models
- `GET /health` → Shows active model and status
- `GET /models` → Lists all available models
- `POST /switch` → Manually switch models
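A condensed sketch of the switching logic (illustrative; the real `orchestrator.py` also proxies requests, polls model health, and handles errors):

```python
import time
import docker   # Docker SDK for Python
import yaml

client = docker.from_env()              # requires /var/run/docker.sock mounted
with open("models.yaml") as f:
    MODELS = yaml.safe_load(f)["models"]
current_model = None

def switch_to(name: str) -> None:
    """Stop the active model container, start the requested one, wait for startup."""
    global current_model
    spec = MODELS[name]
    if current_model and current_model != name:
        old = MODELS[current_model]["docker_service"]
        client.containers.get(f"ai_{old}_1").stop()          # free VRAM first
    client.containers.get(f"ai_{spec['docker_service']}_1").start()
    time.sleep(spec["startup_time_seconds"])                  # crude; real code polls /health
    current_model = name
```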
#### 2. **Text Generation** (vLLM + Qwen 2.5 7B)
- **Service**: `vllm-qwen` (Port 8001)
- **Location**: `ai/vllm/`
- **Model**: Qwen/Qwen2.5-7B-Instruct
- **VRAM**: 14GB (85% GPU utilization)
- **Speed**: ~50 tokens/second
- **Startup**: 120 seconds
- **Status**: ✅ Working (same as original deployment)
#### 3. **Image Generation** (Flux.1 Schnell)
- **Service**: `flux` (Port 8002)
- **Location**: `ai/flux/`
- **Model**: black-forest-labs/FLUX.1-schnell
- **VRAM**: 14GB with CPU offloading
- **Speed**: 4-5 seconds per image
- **Startup**: 60 seconds
- **Features**: OpenAI DALL-E compatible API
- **Image**: `ghcr.io/matatonic/openedai-images-flux:latest`
#### 4. **Music Generation** (MusicGen Medium)
- **Service**: `musicgen` (Port 8003)
- **Location**: `ai/musicgen/`
- **Model**: facebook/musicgen-medium
- **VRAM**: 11GB
- **Speed**: 60-90 seconds for 30 seconds of audio
- **Startup**: 45 seconds
- **Features**: Text-to-music generation with sampling controls
- **Technology**: Meta's AudioCraft + custom FastAPI wrapper
### Model Registry (`models.yaml`)
Simple configuration file for managing all models:
```yaml
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
  flux-schnell:
    type: image
    framework: openedai-images
    docker_service: flux
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
  musicgen-medium:
    type: audio
    framework: audiocraft
    docker_service: musicgen
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```
**Adding new models**: Just add a new entry to this file and define the Docker service.
### Deployment Changes
#### Docker Compose Structure
- **File**: `docker-compose.gpu.yaml`
- **Services**: 4 total (1 orchestrator + 3 models)
- **Profiles**: `text`, `image`, `audio` (orchestrator manages activation)
- **Restart Policy**: `no` for models (orchestrator controls lifecycle)
- **Volumes**: All model caches on `/workspace` (922TB network volume)
#### LiteLLM Integration
Updated `litellm-config.yaml` to route all self-hosted models through orchestrator:
```yaml
# Text
- model_name: qwen-2.5-7b
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
# Image
- model_name: flux-schnell
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
# Music
- model_name: musicgen-medium
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
```
All models now available via Open WebUI at https://ai.pivoine.art
### Usage Examples
**Text Generation**:
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
**Image Generation**:
```bash
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
**Music Generation**:
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
### Deployment Commands
```bash
# Copy all files to RunPod
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
cd /workspace/ai/
# Start orchestrator (manages everything)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Check status
curl http://100.100.108.13:9000/health
# View logs
docker logs -f ai_orchestrator
# Manually switch models (optional)
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
```
### Performance Characteristics
| Model | VRAM | Startup Time | Generation Time | Notes |
|-------|------|--------------|-----------------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music |
**Model Switching Overhead**: 30-120 seconds (unload + load)
### Cost Analysis
**Current (Single GPU Sequential)**:
- Cost: ~$0.50/hour
- Monthly: ~$360 (24/7) or ~$120 (8hr/day)
- Trade-off: 30-120s switching time
**Alternative (Multi-GPU Concurrent)**:
- Cost: ~$0.75/hour (+50%)
- Monthly: ~$540 (24/7) or ~$180 (8hr/day)
- Benefit: No switching time, all models always available
**Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases.
### Known Limitations
1. **Sequential Only**: Only one model active at a time
2. **Switching Latency**: 30-120 seconds to change models
3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial)
4. **Spot Instance Volatility**: Pod can be terminated anytime
### Monitoring
**Check active model**:
```bash
curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}'
```
**View orchestrator logs**:
```bash
docker logs -f ai_orchestrator
```
**GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Deployment Status ✅ COMPLETE (Multi-Modal)
**Deployment Date**: 2025-11-21
1. ✅ Create model orchestrator service - COMPLETE
2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE
3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE
4. ✅ Deploy MusicGen Medium music generation - COMPLETE
5. ✅ Update LiteLLM configuration - COMPLETE
6. ✅ Test all three model types via orchestrator - READY FOR TESTING
7. ⏳ Monitor performance and costs - ONGOING
**Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI
### Future Model Additions
**Easy to add** (just edit `models.yaml`):
- Llama 3.1 8B Instruct (text, gated model)
- Whisper Large v3 (speech-to-text)
- XTTS v2 (text-to-speech)
- Stable Diffusion XL (alternative image generation)
See `README.md` for detailed instructions on adding new models.
### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle (see the watchdog sketch after this list)
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed
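A possible shape for the auto-stop watchdog from idea 1 (a sketch only: it assumes `nvidia-smi` is on the PATH, `runpodctl` is installed and authenticated, and `RUNPOD_POD_ID` is set inside the pod):

```python
# idle_watchdog.py - stop the pod after 30 minutes of idle GPU (illustrative sketch).
import os
import subprocess
import time

IDLE_LIMIT_S = 30 * 60
idle_since = time.time()

def gpu_busy() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(int(line) > 5 for line in out.split() if line.strip())

while True:
    if gpu_busy():
        idle_since = time.time()
    elif time.time() - idle_since > IDLE_LIMIT_S:
        # Assumed RunPod CLI invocation; verify against current runpodctl docs.
        subprocess.run(["runpodctl", "stop", "pod", os.environ["RUNPOD_POD_ID"]])
        break
    time.sleep(60)
```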
### Performance Benchmarks
*To be measured after deployment; a rough measurement script is sketched below.*
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms
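A rough way to measure these numbers once the stack is up (a sketch; it assumes the OpenAI-compatible endpoint supports `"stream": true` and counts streamed chunks as an approximation of tokens):

```python
# bench.py - first-token latency and throughput over the streaming chat API.
import json
import time
import requests

URL = "http://100.100.108.13:9000/v1/chat/completions"
payload = {
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Write 200 words about peonies."}],
    "max_tokens": 300,
    "stream": True,
}

start = time.time()
first_token = None
chunks = 0
with requests.post(URL, json=payload, stream=True, timeout=300) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            chunks += 1
            if first_token is None:
                first_token = time.time() - start

elapsed = time.time() - start
if first_token is None:
    print("no streamed content received")
else:
    rate = chunks / max(elapsed - first_token, 1e-6)
    print(f"first token after {first_token:.2f}s, ~{rate:.1f} chunks/sec (≈ tokens/sec)")
```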

docs/RUNPOD_TEMPLATE.md (new file, 416 lines)

@@ -0,0 +1,416 @@
# RunPod Template Creation Guide
This guide shows you how to create a reusable RunPod template so you never have to reinstall everything from scratch when Spot instances restart.
## Why Create a Template?
**Without Template** (Manual Setup Every Time):
- ❌ Install Docker & Docker Compose (10-15 min)
- ❌ Install Tailscale (5 min)
- ❌ Pull Docker images (10-20 min)
- ❌ Download models: Qwen (~14GB), Flux (~12GB), MusicGen (~11GB) = 30-45 min
- ❌ Configure everything (5-10 min)
- **Total: 60-90 minutes per Spot instance restart**
**With Template** (Ready to Go):
- ✅ Everything pre-installed
- ✅ Models cached in `/workspace`
- ✅ Just start orchestrator
- **Total: 2-3 minutes**
## Template Contents
### System Software
- ✅ Docker 24.x + Docker Compose v2
- ✅ Tailscale latest
- ✅ NVIDIA Docker runtime
- ✅ Python 3.11
- ✅ Git, curl, wget, htop, nvtop
### Docker Images (Pre-built)
- `ai_orchestrator` - Model orchestration service
- `ai_vllm-qwen_1` - Text generation (vLLM + Qwen 2.5 7B)
- `ai_musicgen_1` - Music generation (AudioCraft)
- `ghcr.io/matatonic/openedai-images-flux:latest` - Image generation
### Model Cache (/workspace - Persistent)
- ✅ Qwen 2.5 7B Instruct (~14GB)
- ✅ Flux.1 Schnell (~12GB)
- ✅ MusicGen Medium (~11GB)
- **Total: ~37GB cached**
### Project Files (/workspace/ai)
- ✅ All orchestrator code
- ✅ Docker Compose configurations
- ✅ Model service configurations
- ✅ Documentation
---
## Step-by-Step Template Creation
### Prerequisites
1. RunPod account
2. Active RTX 4090 pod (or similar GPU)
3. SSH access to the pod
4. This repository cloned locally
### Step 1: Deploy Fresh Pod
```bash
# Create new RunPod instance:
# - GPU: RTX 4090 (24GB VRAM)
# - Disk: 50GB container disk
# - Network Volume: Attach or create 100GB+ volume
# - Template: Start with official PyTorch or CUDA template
# Note the SSH connection details (host, port, password)
```
### Step 2: Prepare the Instance
Run the automated preparation script:
```bash
# On your local machine, copy everything to RunPod
scp -P <PORT> -r /home/valknar/Projects/runpod/* root@<HOST>:/workspace/ai/
# SSH to the pod
ssh -p <PORT> root@<HOST>
# Run the preparation script
cd /workspace/ai
chmod +x scripts/prepare-template.sh
./scripts/prepare-template.sh
```
**What the script does:**
1. Installs Docker & Docker Compose
2. Installs Tailscale
3. Builds all Docker images
4. Pre-downloads all models
5. Validates everything works
6. Cleans up temporary files
**Estimated time: 45-60 minutes**
### Step 3: Manual Verification
After the script completes, verify everything:
```bash
# Check Docker is installed
docker --version
docker compose version
# Check Tailscale
tailscale version
# Check all images are built
docker images | grep ai_
# Check models are cached
ls -lh /workspace/huggingface_cache/
ls -lh /workspace/flux/models/
ls -lh /workspace/musicgen/models/
# Test orchestrator starts
cd /workspace/ai
docker compose -f docker-compose.gpu.yaml up -d orchestrator
docker logs ai_orchestrator
# Test model loading (should be fast since models are cached)
curl http://localhost:9000/health
# Stop orchestrator
docker compose -f docker-compose.gpu.yaml down
```
### Step 4: Clean Up Before Saving
**IMPORTANT**: Remove secrets and temporary data before creating template!
```bash
# Remove sensitive data
rm -f /workspace/ai/.env
rm -f /root/.ssh/known_hosts
rm -f /root/.bash_history
# Clear logs
rm -f /var/log/*.log
docker system prune -af --volumes # Clean Docker cache but keep images
# Clear Tailscale state (will re-authenticate on first use)
tailscale logout
# Create template-ready marker
echo "RunPod Multi-Modal AI Template v1.0" > /workspace/TEMPLATE_VERSION
echo "Created: $(date)" >> /workspace/TEMPLATE_VERSION
```
### Step 5: Save Template in RunPod Dashboard
1. **Go to RunPod Dashboard** → "My Pods"
2. **Select your prepared pod**
3. **Click "⋮" menu** → "Save as Template"
4. **Template Configuration**:
- **Name**: `multi-modal-ai-v1.0`
- **Description**:
```
Multi-Modal AI Stack with Orchestrator
- Text: vLLM + Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Models pre-cached (~37GB)
- Ready to deploy in 2-3 minutes
```
- **Category**: `AI/ML`
- **Docker Image**: (auto-detected)
- **Container Disk**: 50GB
- **Expose Ports**: 9000, 8001, 8002, 8003
- **Environment Variables** (optional):
```
HF_TOKEN=<leave empty, user will add>
TAILSCALE_AUTHKEY=<leave empty, user will add>
```
5. **Click "Save Template"**
6. **Wait for template creation** (5-10 minutes)
7. **Test the template** by deploying a new pod with it
---
## Using Your Template
### Deploy New Pod from Template
1. **RunPod Dashboard** → "Deploy"
2. **Select "Community Templates"** or "My Templates"
3. **Choose**: `multi-modal-ai-v1.0`
4. **Configure**:
- GPU: RTX 4090 (or compatible)
- Network Volume: Attach your existing volume with `/workspace` mount
- Environment:
- `HF_TOKEN`: Your Hugging Face token
- (Tailscale will be configured via SSH)
5. **Deploy Pod**
### First-Time Setup (On New Pod)
```bash
# SSH to the new pod
ssh -p <PORT> root@<HOST>
# Navigate to project
cd /workspace/ai
# Create .env file
cat > .env <<EOF
HF_TOKEN=hf_your_token_here
GPU_TAILSCALE_IP=100.100.108.13
EOF
# Configure Tailscale (one-time)
tailscale up --authkey=<YOUR_TAILSCALE_KEY>
# Start orchestrator (models already cached, starts in seconds!)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Verify
curl http://localhost:9000/health
# Check logs
docker logs -f ai_orchestrator
```
**Total setup time: 2-3 minutes!** 🎉
### Updating SSH Config (If Spot Instance Restarts)
Since Spot instances can restart with new IPs/ports:
```bash
# On your local machine
# Update ~/.ssh/config with new connection details
Host gpu-pivoine
    HostName <NEW_IP>
    Port <NEW_PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
```
---
## Template Maintenance
### Updating the Template
When you add new models or make improvements:
1. Deploy a pod from your existing template
2. Make your changes
3. Test everything
4. Clean up (remove secrets)
5. Save as new template version: `multi-modal-ai-v1.1`
6. Update your documentation
### Version History
Keep track of template versions:
```
v1.0 (2025-11-21) - Initial release
- Text: Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Docker orchestrator
v1.1 (future) - Planned
- Add Llama 3.1 8B
- Add Whisper Large v3
- Optimize model loading
```
---
## Troubleshooting Template Creation
### Models Not Downloading
```bash
# Manually trigger model downloads
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
docker logs -f ai_vllm-qwen_1
# Wait for "Model loaded successfully"
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# Repeat for other models
docker compose -f docker-compose.gpu.yaml --profile image up -d flux
docker compose -f docker-compose.gpu.yaml --profile audio up -d musicgen
```
### Docker Images Not Building
```bash
# Build images one at a time
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml build vllm-qwen
docker compose -f docker-compose.gpu.yaml build musicgen
# Check build logs for errors
docker compose -f docker-compose.gpu.yaml build --no-cache --progress=plain orchestrator
```
### Tailscale Won't Install
```bash
# Manual Tailscale installation
curl -fsSL https://tailscale.com/install.sh | sh
# Start daemon
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
# Test
tailscale version
```
### Template Too Large
RunPod templates have size limits. If your template is too large:
**Option 1**: Use network volume for models
- Move models to network volume: `/workspace/models/`
- Mount volume when deploying from template
- Models persist across pod restarts
**Option 2**: Reduce cached models
- Only cache most-used model (Qwen 2.5 7B)
- Download others on first use
- Accept slightly longer first-time startup
**Option 3**: Use Docker layer optimization
```dockerfile
# In Dockerfile, order commands by change frequency
# Less frequently changed layers first
```
---
## Cost Analysis
### Template Storage Cost
- RunPod charges for template storage: ~$0.10/GB/month
- This template: ~50GB = **~$5/month**
- **Worth it!** Saves 60-90 minutes per Spot restart
### Time Savings
- Spot instance restarts: 2-5 times per week (highly variable)
- Time saved per restart: 60-90 minutes
- **Total saved per month: 8-20 hours**
- **Value: Priceless for rapid deployment**
---
## Advanced: Automated Template Updates
Create a CI/CD pipeline to automatically update templates:
```bash
# GitHub Actions workflow (future enhancement)
# 1. Deploy pod from template
# 2. Pull latest code
# 3. Rebuild images
# 4. Test
# 5. Save new template version
# 6. Notify team
```
---
## Template Checklist
Before saving your template, verify:
- [ ] All Docker images built and working
- [ ] All models downloaded and cached
- [ ] Tailscale installed (but logged out)
- [ ] Docker Compose files present
- [ ] `.env` file removed (secrets cleared)
- [ ] Logs cleared
- [ ] SSH keys removed
- [ ] Bash history cleared
- [ ] Template version documented
- [ ] Test deployment successful
---
## Support
If you have issues creating the template:
1. Check `/workspace/ai/scripts/prepare-template.sh` logs
2. Review Docker build logs: `docker compose build --progress=plain`
3. Check model download logs: `docker logs <container>`
4. Verify disk space: `df -h`
5. Check network volume is mounted: `mount | grep workspace`
For RunPod-specific issues:
- RunPod Docs: https://docs.runpod.io/
- RunPod Discord: https://discord.gg/runpod
---
## Next Steps
After creating your template:
1. ✅ Test deployment from template
2. ✅ Document in `GPU_DEPLOYMENT_LOG.md`
3. ✅ Share template ID with team (if applicable)
4. ✅ Set up monitoring (Netdata, etc.)
5. ✅ Configure auto-stop for cost optimization
6. ✅ Add more models as needed
**Your multi-modal AI infrastructure is now portable and reproducible!** 🚀