refactor: clean up runpod repository structure
Removed facefusion and VPS-related files:
- compose.yaml, postgres/, litellm-config.yaml (VPS services)
- Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion)

Removed outdated documentation:
- DOCKER_GPU_SETUP.md, README_GPU_SETUP.md, SETUP_GUIDE.md
- TAILSCALE_SETUP.md, WIREGUARD_SETUP.md (covered in DEPLOYMENT.md)
- GPU_EXPANSION_PLAN.md (historical planning doc)
- gpu-server-compose.yaml, litellm-config-gpu.yaml (old versions)
- deploy-gpu-stack.sh, simple_vllm_server.py (old scripts)

Organized documentation:
- Created docs/ directory
- Moved DEPLOYMENT.md, RUNPOD_TEMPLATE.md, GPU_DEPLOYMENT_LOG.md to docs/
- Updated all documentation links in README.md

Final structure:
- Clean root directory with only GPU-specific files
- Organized documentation in docs/
- Model services in dedicated directories (model-orchestrator/, vllm/, flux/, musicgen/)
- Automation scripts in scripts/
docs/DEPLOYMENT.md (new file, 467 lines)

# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.

### Components

```
┌────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000)                                  │ │
│ │ Routes to: Claude API + GPU Orchestrator                   │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
└───────────────────────────────┼────────────────────────────────┘
                                │ Tailscale VPN
┌───────────────────────────────┼────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                  │
│ ┌─────────────────────────────▼──────────────────────────────┐ │
│ │ Orchestrator (Port 9000)                                   │ │
│ │ Manages sequential model loading based on request type     │ │
│ └─────┬───────────────────┬──────────────────┬───────────────┘ │
│       │                   │                  │                 │
│ ┌─────▼───────┐   ┌───────▼──────┐   ┌───────▼──────┐          │
│ │vLLM         │   │Flux.1        │   │MusicGen      │          │
│ │Qwen 2.5 7B  │   │Schnell       │   │Medium        │          │
│ │Port: 8001   │   │Port: 8002    │   │Port: 8003    │          │
│ │VRAM: 14GB   │   │VRAM: 14GB    │   │VRAM: 11GB    │          │
│ └─────────────┘   └──────────────┘   └──────────────┘          │
│                                                                │
│ Only ONE model active at a time (sequential loading)           │
└────────────────────────────────────────────────────────────────┘
```

### Features

✅ **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
✅ **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ **Easy Model Addition** - Add new models by editing YAML config
✅ **Centralized Routing** - LiteLLM proxy provides unified API for all models
✅ **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time

## Supported Model Types

### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
  - VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
  - VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
  - VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API

## Quick Start

### 1. Prerequisites

```bash
# On RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
```

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```

### 3. Deploy to RunPod

```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine

# Navigate to project
cd /workspace/ai/

# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Orchestrator will automatically manage model services as needed
```

### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```

### 5. Update VPS LiteLLM

```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```

## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically

**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1

**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen

### Via API (Direct)

```python
import openai

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```

## Adding New Models

### Step 1: Update `models.yaml`

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:  # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```

### Step 2: Add Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```

### Step 3: Restart Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```

**That's it!** The orchestrator automatically detects the new model.
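
This works because the registry is plain YAML that the orchestrator re-reads when it starts. A minimal sketch of that idea, assuming the `models.yaml` schema from Step 1 (illustrative only, not the actual `orchestrator.py`):

```python
# Minimal registry-loading sketch (illustrative; not the actual orchestrator.py).
import yaml


def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    """Parse models.yaml into a dict keyed by model name."""
    with open(path) as f:
        return yaml.safe_load(f)["models"]


def resolve(registry: dict, model_name: str) -> dict:
    """Return the Docker service and URL that should serve a request for this model."""
    entry = registry.get(model_name)
    if entry is None:
        raise ValueError(f"Unknown model: {model_name}")
    return {
        "service": entry["docker_service"],
        "url": f"http://localhost:{entry['port']}{entry['endpoint']}",
        "vram_gb": entry["vram_gb"],
    }


if __name__ == "__main__":
    print(resolve(load_registry(), "llama-3.1-8b"))
```
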

## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```

### Model Services

```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```

### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```

### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart Docker daemon
sudo systemctl restart docker

# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```

### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  huggingface-cli download facebook/musicgen-medium
```
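
Note that the bare CUDA runtime image does not ship `huggingface-cli`, so it either needs installing first (`pip install -U "huggingface_hub[cli]"`) or the cache can be pre-populated from any Python environment on the pod. A small sketch of the latter, assuming the `/workspace/huggingface_cache` layout used above:

```python
# Pre-download model weights into the shared /workspace cache.
# Requires: pip install huggingface_hub  (pass token="hf_..." for gated repos)
from huggingface_hub import snapshot_download

MODELS = [
    "Qwen/Qwen2.5-7B-Instruct",
    "black-forest-labs/FLUX.1-schnell",
    "facebook/musicgen-medium",
]

for repo in MODELS:
    path = snapshot_download(repo_id=repo, cache_dir="/workspace/huggingface_cache")
    print(f"{repo} cached at {path}")
```
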

## File Structure

```
ai/
├── docker-compose.gpu.yaml      # Main orchestration file
├── .env.example                 # Environment template
├── README.md                    # This file
│
├── model-orchestrator/          # Central orchestrator service
│   ├── orchestrator.py          # FastAPI app managing models
│   ├── models.yaml              # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                        # Text generation (vLLM)
│   ├── server.py                # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                        # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json          # Flux configuration
│
├── musicgen/                    # Music generation (MusicGen)
│   ├── server.py                # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml          # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md        # Deployment history and notes
```

## Cost Analysis

### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)

### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month

### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.

## Performance Expectations

| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |

**Model Switching**: 30-120 seconds (unload current + load new)

## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`); see the snippet below
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
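
For reference, the socket access typically looks like this in `docker-compose.gpu.yaml` (a hypothetical excerpt to illustrate the mount, not necessarily the exact service definition in the repo):

```yaml
# Hypothetical excerpt: how the orchestrator gets Docker socket access.
services:
  orchestrator:
    build: ./model-orchestrator
    ports:
      - "9000:9000"                                # reached only over Tailscale
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # lets it start/stop model containers
    restart: unless-stopped
```
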

## Future Enhancements

1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods

## Support

For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.

docs/GPU_DEPLOYMENT_LOG.md (new file, 421 lines)

# GPU Server Deployment Log

## Current Deployment (2025-11-21)

### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)

### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)

### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```

**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.

### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script

### vLLM Deployment

**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues (a minimal sketch of this pattern follows below)
- OpenAI-compatible API endpoints:
  - `GET /v1/models` - List available models
  - `POST /v1/completions` - Text completion
  - `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
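
A minimal sketch of the `AsyncLLMEngine` pattern, for orientation only; this is not the actual `simple_vllm_server.py`, and the engine arguments simply mirror the configuration listed in this log (vLLM 0.6.x API):

```python
# Minimal AsyncLLMEngine + FastAPI sketch (not the actual simple_vllm_server.py).
import os
import uuid

import uvicorn
from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="Qwen/Qwen2.5-7B-Instruct",
        max_model_len=4096,
        gpu_memory_utilization=0.85,
        download_dir="/workspace/huggingface_cache",
    )
)


@app.post("/v1/completions")
async def completions(body: dict):
    params = SamplingParams(
        max_tokens=body.get("max_tokens", 256),
        temperature=body.get("temperature", 0.7),
    )
    final = None
    # engine.generate() is an async generator; the last yielded item holds the full output
    async for output in engine.generate(body["prompt"], params, str(uuid.uuid4())):
        final = output
    return {"choices": [{"text": final.outputs[0].text}]}


if __name__ == "__main__":
    uvicorn.run(
        app,
        host=os.getenv("VLLM_HOST", "0.0.0.0"),
        port=int(os.getenv("VLLM_PORT", "8000")),
    )
```
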

**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/

# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"

# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```

**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)

### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)

### Known Issues & Solutions

#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: A custom `AsyncLLMEngine` FastAPI server bypasses the multiprocessing layer entirely.

#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and moved the model cache to `/workspace`.

#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set the `HF_TOKEN` environment variable.

#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated at any time.
**Solution**: Accept as a trade-off for cost savings. Document SSH details for quick reconnection.

### Monitoring

**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```

**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```

**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```

**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```

**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```

### LiteLLM Integration

Update the VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:

```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
```

Restart LiteLLM:
```bash
arty restart litellm
```
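
Once LiteLLM reloads, any OpenAI-compatible client pointed at the proxy should reach the GPU server end to end. A quick check from the VPS (port 4000 comes from the proxy setup above; the API key is a placeholder and must match whatever key your LiteLLM instance is configured to accept):

```python
# End-to-end smoke test through the LiteLLM proxy (assumed to listen on port 4000).
import openai

client = openai.OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-litellm-key",  # placeholder; must match the proxy's configured key
)

resp = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Say hello from the GPU server"}],
)
print(resp.choices[0].message.content)
```
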

### Troubleshooting

**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`

**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`

**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`

### Deployment Status ✅ COMPLETE

**Deployment Date**: 2025-11-21

1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
3. ✅ Update VPS LiteLLM configuration - COMPLETE
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
5. ⏳ Monitor performance and costs - ONGOING

**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art

### Next Steps (2025-11-21 Original)
6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium)
7. ⏹️ Set up auto-stop for idle periods to save costs

---

## Multi-Modal Architecture (2025-11-21 Update)

### Overview

Expanded the GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type.

### Architecture Components

#### 1. **Orchestrator Service** (Port 9000 - Always Running)
- **Location**: `ai/model-orchestrator/`
- **Purpose**: Central service managing model lifecycle
- **Features**:
  - Detects request type (text/image/audio)
  - Automatically unloads current model
  - Loads requested model
  - Proxies requests to active model
  - Tracks GPU memory usage
- **Technology**: FastAPI + Docker SDK for Python (the core switch logic is sketched below)
- **Endpoints**:
  - `POST /v1/chat/completions` → Routes to text models
  - `POST /v1/images/generations` → Routes to image models
  - `POST /v1/audio/generations` → Routes to music models
  - `GET /health` → Shows active model and status
  - `GET /models` → Lists all available models
  - `POST /switch` → Manually switch models
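
The switch itself is small enough to sketch. This is illustrative only (not the actual `orchestrator.py`); it assumes the model containers already exist (created by `docker compose`) under the `ai_<service>_1` names used elsewhere in this log, and that each model service exposes a `/health` endpoint:

```python
# Illustrative core of the model-switch logic (not the actual orchestrator.py).
import time

import docker
import requests

client = docker.from_env()

# model name -> (container name, host port), matching models.yaml
SERVICES = {
    "qwen-2.5-7b": ("ai_vllm-qwen_1", 8001),
    "flux-schnell": ("ai_flux_1", 8002),
    "musicgen-medium": ("ai_musicgen_1", 8003),
}

current_model = None


def switch(model: str, timeout: int = 180) -> None:
    """Stop the active model container, start the requested one, wait for health."""
    global current_model
    if model == current_model:
        return
    if current_model is not None:
        client.containers.get(SERVICES[current_model][0]).stop()  # frees VRAM
    name, port = SERVICES[model]
    client.containers.get(name).start()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=2).ok:
                current_model = model
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise RuntimeError(f"{model} did not become healthy within {timeout}s")
```
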

#### 2. **Text Generation** (vLLM + Qwen 2.5 7B)
- **Service**: `vllm-qwen` (Port 8001)
- **Location**: `ai/vllm/`
- **Model**: Qwen/Qwen2.5-7B-Instruct
- **VRAM**: 14GB (85% GPU utilization)
- **Speed**: ~50 tokens/second
- **Startup**: 120 seconds
- **Status**: ✅ Working (same as original deployment)

#### 3. **Image Generation** (Flux.1 Schnell)
- **Service**: `flux` (Port 8002)
- **Location**: `ai/flux/`
- **Model**: black-forest-labs/FLUX.1-schnell
- **VRAM**: 14GB with CPU offloading
- **Speed**: 4-5 seconds per image
- **Startup**: 60 seconds
- **Features**: OpenAI DALL-E compatible API
- **Image**: `ghcr.io/matatonic/openedai-images-flux:latest`

#### 4. **Music Generation** (MusicGen Medium)
- **Service**: `musicgen` (Port 8003)
- **Location**: `ai/musicgen/`
- **Model**: facebook/musicgen-medium
- **VRAM**: 11GB
- **Speed**: 60-90 seconds for 30 seconds of audio
- **Startup**: 45 seconds
- **Features**: Text-to-music generation with sampling controls
- **Technology**: Meta's AudioCraft + custom FastAPI wrapper (sketched below)
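
The wrapper is essentially AudioCraft's own generation API behind a single FastAPI route. A trimmed-down sketch for orientation, not the actual `musicgen/server.py`; returning a WAV response is an assumption about the response format:

```python
# Trimmed-down AudioCraft + FastAPI wrapper sketch (not the actual musicgen/server.py).
import io

import torchaudio
from audiocraft.models import MusicGen
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()
model = MusicGen.get_pretrained("facebook/musicgen-medium")


@app.post("/v1/audio/generations")
def generate(body: dict):
    model.set_generation_params(duration=int(body.get("duration", 30)))
    wav = model.generate([body["prompt"]])[0].cpu()  # (channels, samples) tensor
    buf = io.BytesIO()
    torchaudio.save(buf, wav, model.sample_rate, format="wav")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```
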

### Model Registry (`models.yaml`)

Simple configuration file for managing all models:

```yaml
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions

  flux-schnell:
    type: image
    framework: openedai-images
    docker_service: flux
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations

  musicgen-medium:
    type: audio
    framework: audiocraft
    docker_service: musicgen
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```

**Adding new models**: Just add a new entry to this file and define the Docker service.

### Deployment Changes

#### Docker Compose Structure
- **File**: `docker-compose.gpu.yaml`
- **Services**: 4 total (1 orchestrator + 3 models)
- **Profiles**: `text`, `image`, `audio` (orchestrator manages activation)
- **Restart Policy**: `no` for models (orchestrator controls lifecycle)
- **Volumes**: All model caches on `/workspace` (922TB network volume)

#### LiteLLM Integration
Updated `litellm-config.yaml` to route all self-hosted models through orchestrator:

```yaml
# Text
- model_name: qwen-2.5-7b
  api_base: http://100.100.108.13:9000/v1  # Orchestrator

# Image
- model_name: flux-schnell
  api_base: http://100.100.108.13:9000/v1  # Orchestrator

# Music
- model_name: musicgen-medium
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
```

All models now available via Open WebUI at https://ai.pivoine.art

### Usage Examples

**Text Generation**:
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```

**Image Generation**:
```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```

**Music Generation**:
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```

### Deployment Commands

```bash
# Copy all files to RunPod
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine
cd /workspace/ai/

# Start orchestrator (manages everything)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Check status
curl http://100.100.108.13:9000/health

# View logs
docker logs -f ai_orchestrator

# Manually switch models (optional)
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'
```

### Performance Characteristics

| Model | VRAM | Startup Time | Generation Time | Notes |
|-------|------|--------------|-----------------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music |

**Model Switching Overhead**: 30-120 seconds (unload + load)

### Cost Analysis

**Current (Single GPU Sequential)**:
- Cost: ~$0.50/hour
- Monthly: ~$360 (24/7) or ~$120 (8hr/day)
- Trade-off: 30-120s switching time

**Alternative (Multi-GPU Concurrent)**:
- Cost: ~$0.75/hour (+50%)
- Monthly: ~$540 (24/7) or ~$180 (8hr/day)
- Benefit: No switching time, all models always available

**Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases.

### Known Limitations

1. **Sequential Only**: Only one model active at a time
2. **Switching Latency**: 30-120 seconds to change models
3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial)
4. **Spot Instance Volatility**: Pod can be terminated anytime

### Monitoring

**Check active model**:
```bash
curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}'
```

**View orchestrator logs**:
```bash
docker logs -f ai_orchestrator
```

**GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```

### Deployment Status ✅ COMPLETE (Multi-Modal)

**Deployment Date**: 2025-11-21

1. ✅ Create model orchestrator service - COMPLETE
2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE
3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE
4. ✅ Deploy MusicGen Medium music generation - COMPLETE
5. ✅ Update LiteLLM configuration - COMPLETE
6. ✅ Test all three model types via orchestrator - READY FOR TESTING
7. ⏳ Monitor performance and costs - ONGOING

**Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI

### Future Model Additions

**Easy to add** (just edit `models.yaml`):
- Llama 3.1 8B Instruct (text, gated model)
- Whisper Large v3 (speech-to-text)
- XTTS v2 (text-to-speech)
- Stable Diffusion XL (alternative image generation)

See `README.md` for detailed instructions on adding new models.

### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed

### Performance Benchmarks
*To be measured after deployment*

Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms

docs/RUNPOD_TEMPLATE.md (new file, 416 lines)

# RunPod Template Creation Guide

This guide shows you how to create a reusable RunPod template so you never have to reinstall everything from scratch when Spot instances restart.

## Why Create a Template?

**Without Template** (Manual Setup Every Time):
- ❌ Install Docker & Docker Compose (10-15 min)
- ❌ Install Tailscale (5 min)
- ❌ Pull Docker images (10-20 min)
- ❌ Download models: Qwen (~14GB), Flux (~12GB), MusicGen (~11GB) = 30-45 min
- ❌ Configure everything (5-10 min)
- **Total: 60-90 minutes per Spot instance restart**

**With Template** (Ready to Go):
- ✅ Everything pre-installed
- ✅ Models cached in `/workspace`
- ✅ Just start the orchestrator
- **Total: 2-3 minutes**

## Template Contents

### System Software
- ✅ Docker 24.x + Docker Compose v2
- ✅ Tailscale latest
- ✅ NVIDIA Docker runtime
- ✅ Python 3.11
- ✅ Git, curl, wget, htop, nvtop

### Docker Images (Pre-built)
- ✅ `ai_orchestrator` - Model orchestration service
- ✅ `ai_vllm-qwen_1` - Text generation (vLLM + Qwen 2.5 7B)
- ✅ `ai_musicgen_1` - Music generation (AudioCraft)
- ✅ `ghcr.io/matatonic/openedai-images-flux:latest` - Image generation

### Model Cache (/workspace - Persistent)
- ✅ Qwen 2.5 7B Instruct (~14GB)
- ✅ Flux.1 Schnell (~12GB)
- ✅ MusicGen Medium (~11GB)
- **Total: ~37GB cached**

### Project Files (/workspace/ai)
- ✅ All orchestrator code
- ✅ Docker Compose configurations
- ✅ Model service configurations
- ✅ Documentation

---

## Step-by-Step Template Creation

### Prerequisites
1. RunPod account
2. Active RTX 4090 pod (or similar GPU)
3. SSH access to the pod
4. This repository cloned locally

### Step 1: Deploy Fresh Pod

```bash
# Create new RunPod instance:
# - GPU: RTX 4090 (24GB VRAM)
# - Disk: 50GB container disk
# - Network Volume: Attach or create 100GB+ volume
# - Template: Start with official PyTorch or CUDA template

# Note the SSH connection details (host, port, password)
```

### Step 2: Prepare the Instance

Run the automated preparation script:

```bash
# On your local machine, copy everything to RunPod
scp -P <PORT> -r /home/valknar/Projects/runpod/* root@<HOST>:/workspace/ai/

# SSH to the pod
ssh -p <PORT> root@<HOST>

# Run the preparation script
cd /workspace/ai
chmod +x scripts/prepare-template.sh
./scripts/prepare-template.sh
```

**What the script does** (roughly outlined in the sketch below):
1. Installs Docker & Docker Compose
2. Installs Tailscale
3. Builds all Docker images
4. Pre-downloads all models
5. Validates everything works
6. Cleans up temporary files

**Estimated time: 45-60 minutes**
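
For orientation, the script amounts to something like the following. This is a sketch under stated assumptions (installation via the official convenience scripts, model downloads triggered by starting each compose profile once), not the actual `scripts/prepare-template.sh`:

```bash
#!/usr/bin/env bash
# Rough outline of a prepare-template.sh (sketch, not the real script).
set -euo pipefail
cd /workspace/ai

# 1-2. Docker + Compose and Tailscale via their official install scripts
curl -fsSL https://get.docker.com | sh
curl -fsSL https://tailscale.com/install.sh | sh

# 3. Build all service images
docker compose -f docker-compose.gpu.yaml build

# 4. Pre-download models by starting each profile once
for profile in text image audio; do
  docker compose -f docker-compose.gpu.yaml --profile "$profile" up -d
  sleep 300  # crude wait for the weights to download; watch the logs instead if preferred
  docker compose -f docker-compose.gpu.yaml --profile "$profile" stop
done

# 5-6. Validate the compose file parses, then clear the build cache
docker compose -f docker-compose.gpu.yaml config > /dev/null
docker builder prune -f
```
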

### Step 3: Manual Verification

After the script completes, verify everything:

```bash
# Check Docker is installed
docker --version
docker compose version

# Check Tailscale
tailscale version

# Check all images are built
docker images | grep ai_

# Check models are cached
ls -lh /workspace/huggingface_cache/
ls -lh /workspace/flux/models/
ls -lh /workspace/musicgen/models/

# Test orchestrator starts
cd /workspace/ai
docker compose -f docker-compose.gpu.yaml up -d orchestrator
docker logs ai_orchestrator

# Test model loading (should be fast since models are cached)
curl http://localhost:9000/health

# Stop orchestrator
docker compose -f docker-compose.gpu.yaml down
```

### Step 4: Clean Up Before Saving

**IMPORTANT**: Remove secrets and temporary data before creating the template!

```bash
# Remove sensitive data
rm -f /workspace/ai/.env
rm -f /root/.ssh/known_hosts
rm -f /root/.bash_history

# Clear logs
rm -f /var/log/*.log
docker system prune -f --volumes  # Clean Docker cache but keep images (-a would also delete the pre-built images)

# Clear Tailscale state (will re-authenticate on first use)
tailscale logout

# Create template-ready marker
echo "RunPod Multi-Modal AI Template v1.0" > /workspace/TEMPLATE_VERSION
echo "Created: $(date)" >> /workspace/TEMPLATE_VERSION
```

### Step 5: Save Template in RunPod Dashboard

1. **Go to RunPod Dashboard** → "My Pods"
2. **Select your prepared pod**
3. **Click "⋮" menu** → "Save as Template"
4. **Template Configuration**:
   - **Name**: `multi-modal-ai-v1.0`
   - **Description**:
     ```
     Multi-Modal AI Stack with Orchestrator
     - Text: vLLM + Qwen 2.5 7B
     - Image: Flux.1 Schnell
     - Music: MusicGen Medium
     - Models pre-cached (~37GB)
     - Ready to deploy in 2-3 minutes
     ```
   - **Category**: `AI/ML`
   - **Docker Image**: (auto-detected)
   - **Container Disk**: 50GB
   - **Expose Ports**: 9000, 8001, 8002, 8003
   - **Environment Variables** (optional):
     ```
     HF_TOKEN=<leave empty, user will add>
     TAILSCALE_AUTHKEY=<leave empty, user will add>
     ```
5. **Click "Save Template"**
6. **Wait for template creation** (5-10 minutes)
7. **Test the template** by deploying a new pod with it

---

## Using Your Template

### Deploy New Pod from Template

1. **RunPod Dashboard** → "➕ Deploy"
2. **Select "Community Templates"** or "My Templates"
3. **Choose**: `multi-modal-ai-v1.0`
4. **Configure**:
   - GPU: RTX 4090 (or compatible)
   - Network Volume: Attach your existing volume with `/workspace` mount
   - Environment:
     - `HF_TOKEN`: Your Hugging Face token
     - (Tailscale will be configured via SSH)
5. **Deploy Pod**

### First-Time Setup (On New Pod)

```bash
# SSH to the new pod
ssh -p <PORT> root@<HOST>

# Navigate to project
cd /workspace/ai

# Create .env file
cat > .env <<EOF
HF_TOKEN=hf_your_token_here
GPU_TAILSCALE_IP=100.100.108.13
EOF

# Configure Tailscale (one-time)
tailscale up --authkey=<YOUR_TAILSCALE_KEY>

# Start orchestrator (models already cached, starts in seconds!)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Verify
curl http://localhost:9000/health

# Check logs
docker logs -f ai_orchestrator
```

**Total setup time: 2-3 minutes!** 🎉

### Updating SSH Config (If Spot Instance Restarts)

Since Spot instances can restart with new IPs/ports:

```bash
# On your local machine
# Update ~/.ssh/config with new connection details

Host gpu-pivoine
    HostName <NEW_IP>
    Port <NEW_PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
```

---

## Template Maintenance

### Updating the Template

When you add new models or make improvements:

1. Deploy a pod from your existing template
2. Make your changes
3. Test everything
4. Clean up (remove secrets)
5. Save as new template version: `multi-modal-ai-v1.1`
6. Update your documentation

### Version History

Keep track of template versions:

```
v1.0 (2025-11-21) - Initial release
- Text: Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Docker orchestrator

v1.1 (future) - Planned
- Add Llama 3.1 8B
- Add Whisper Large v3
- Optimize model loading
```

---

## Troubleshooting Template Creation

### Models Not Downloading

```bash
# Manually trigger model downloads
docker compose --profile text up -d vllm-qwen
docker logs -f ai_vllm-qwen_1
# Wait for "Model loaded successfully"
docker compose stop vllm-qwen

# Repeat for other models
docker compose --profile image up -d flux
docker compose --profile audio up -d musicgen
```

### Docker Images Not Building

```bash
# Build images one at a time
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml build vllm-qwen
docker compose -f docker-compose.gpu.yaml build musicgen

# Check build logs for errors
docker compose -f docker-compose.gpu.yaml build --no-cache --progress=plain orchestrator
```

### Tailscale Won't Install

```bash
# Manual Tailscale installation
curl -fsSL https://tailscale.com/install.sh | sh

# Start daemon
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &

# Test
tailscale version
```

### Template Too Large

RunPod templates have size limits. If your template is too large:

**Option 1**: Use network volume for models
- Move models to network volume: `/workspace/models/`
- Mount volume when deploying from template
- Models persist across pod restarts

**Option 2**: Reduce cached models
- Only cache most-used model (Qwen 2.5 7B)
- Download others on first use
- Accept slightly longer first-time startup

**Option 3**: Use Docker layer optimization
```dockerfile
# In Dockerfile, order commands by change frequency
# Less frequently changed layers first
```
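
For example, a model-service Dockerfile ordered so that code edits do not invalidate the dependency layers could look like this (illustrative only, not one of the repo's actual Dockerfiles):

```dockerfile
# Illustrative layer ordering only; not one of the repo's actual Dockerfiles.
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

# Rarely changes: system packages
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Changes occasionally: Python dependencies (cached unless requirements.txt changes)
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Changes often: application code goes last
COPY server.py .
CMD ["python3", "server.py"]
```
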

---

## Cost Analysis

### Template Storage Cost
- RunPod charges for template storage: ~$0.10/GB/month
- This template: ~50GB = **~$5/month**
- **Worth it!** Saves 60-90 minutes per Spot restart

### Time Savings
- Spot instance restarts: 2-5 times per week (highly variable)
- Time saved per restart: 60-90 minutes
- **Total saved per month: 8-20 hours**
- **Value: Priceless for rapid deployment**

---

## Advanced: Automated Template Updates

Create a CI/CD pipeline to automatically update templates:

```bash
# GitHub Actions workflow (future enhancement)
# 1. Deploy pod from template
# 2. Pull latest code
# 3. Rebuild images
# 4. Test
# 5. Save new template version
# 6. Notify team
```

---

## Template Checklist

Before saving your template, verify:

- [ ] All Docker images built and working
- [ ] All models downloaded and cached
- [ ] Tailscale installed (but logged out)
- [ ] Docker Compose files present
- [ ] `.env` file removed (secrets cleared)
- [ ] Logs cleared
- [ ] SSH keys removed
- [ ] Bash history cleared
- [ ] Template version documented
- [ ] Test deployment successful

---

## Support

If you have issues creating the template:

1. Check `/workspace/ai/scripts/prepare-template.sh` logs
2. Review Docker build logs: `docker compose build --progress=plain`
3. Check model download logs: `docker logs <container>`
4. Verify disk space: `df -h`
5. Check network volume is mounted: `mount | grep workspace`

For RunPod-specific issues:
- RunPod Docs: https://docs.runpod.io/
- RunPod Discord: https://discord.gg/runpod

---

## Next Steps

After creating your template:

1. ✅ Test deployment from template
2. ✅ Document in `GPU_DEPLOYMENT_LOG.md`
3. ✅ Share template ID with team (if applicable)
4. ✅ Set up monitoring (Netdata, etc.)
5. ✅ Configure auto-stop for cost optimization
6. ✅ Add more models as needed

**Your multi-modal AI infrastructure is now portable and reproducible!** 🚀