Removed facefusion and VPS-related files: - compose.yaml, postgres/, litellm-config.yaml (VPS services) - Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion) Removed outdated documentation: - DOCKER_GPU_SETUP.md, README_GPU_SETUP.md, SETUP_GUIDE.md - TAILSCALE_SETUP.md, WIREGUARD_SETUP.md (covered in DEPLOYMENT.md) - GPU_EXPANSION_PLAN.md (historical planning doc) - gpu-server-compose.yaml, litellm-config-gpu.yaml (old versions) - deploy-gpu-stack.sh, simple_vllm_server.py (old scripts) Organized documentation: - Created docs/ directory - Moved DEPLOYMENT.md, RUNPOD_TEMPLATE.md, GPU_DEPLOYMENT_LOG.md to docs/ - Updated all documentation links in README.md Final structure: - Clean root directory with only GPU-specific files - Organized documentation in docs/ - Model services in dedicated directories (model-orchestrator/, vllm/, flux/, musicgen/) - Automation scripts in scripts/
422 lines
12 KiB
Markdown
422 lines
12 KiB
Markdown
# GPU Server Deployment Log
|
|
|
|
## Current Deployment (2025-11-21)
|
|
|
|
### Infrastructure
|
|
- **Provider**: RunPod (Spot Instance)
|
|
- **GPU**: NVIDIA RTX 4090 24GB
|
|
- **Disk**: 50GB local SSD (expanded from 20GB)
|
|
- **Network Volume**: 922TB at `/workspace`
|
|
- **Region**: Europe
|
|
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
|
|
|
|
### Network Configuration
|
|
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
|
|
- **GPU Server Tailscale IP**: 100.100.108.13
|
|
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
|
|
|
|
### SSH Access
|
|
```
|
|
Host gpu-pivoine
|
|
HostName 213.173.102.232
|
|
Port 29695
|
|
User root
|
|
IdentityFile ~/.ssh/id_ed25519
|
|
```
|
|
|
|
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
|
|
|
|
### Software Stack
|
|
- **Python**: 3.11.10
|
|
- **vLLM**: 0.6.4.post1 (installed with pip)
|
|
- **PyTorch**: 2.5.1 with CUDA 12.4
|
|
- **Tailscale**: Installed via official script
|
|
|
|
### vLLM Deployment
|
|
|
|
**Custom Server**: `ai/simple_vllm_server.py`
|
|
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
|
|
- OpenAI-compatible API endpoints:
|
|
- `GET /v1/models` - List available models
|
|
- `POST /v1/completions` - Text completion
|
|
- `POST /v1/chat/completions` - Chat completion
|
|
- Default model: Qwen/Qwen2.5-7B-Instruct
|
|
- Cache directory: `/workspace/huggingface_cache`
|
|
|
|
**Deployment Command**:
|
|
```bash
|
|
# Copy server script to GPU server
|
|
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
|
|
|
|
# Start server
|
|
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
|
|
|
|
# Check status
|
|
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
|
|
```
|
|
|
|
**Server Configuration** (environment variables):
|
|
- `VLLM_HOST`: 0.0.0.0 (default)
|
|
- `VLLM_PORT`: 8000 (default)
|
|
|
|
### Model Configuration
|
|
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
|
|
- **Context Length**: 4096 tokens
|
|
- **GPU Memory**: 85% utilization
|
|
- **Tensor Parallel**: 1 (single GPU)
|
|
|
|
### Known Issues & Solutions
|
|
|
|
#### Issue 1: vLLM Multiprocessing Errors
|
|
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
|
|
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
|
|
|
|
#### Issue 2: Disk Space (Solved)
|
|
**Problem**: Original 20GB disk filled up with Hugging Face cache.
|
|
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
|
|
|
|
#### Issue 3: Gated Models
|
|
**Problem**: Llama models require Hugging Face authentication.
|
|
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
|
|
|
|
#### Issue 4: Spot Instance Volatility
|
|
**Problem**: RunPod Spot instances can be terminated anytime.
|
|
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
|
|
|
|
### Monitoring
|
|
|
|
**Check vLLM logs**:
|
|
```bash
|
|
ssh gpu-pivoine "tail -f /workspace/vllm.log"
|
|
```
|
|
|
|
**Check GPU usage**:
|
|
```bash
|
|
ssh gpu-pivoine "nvidia-smi"
|
|
```
|
|
|
|
**Check Tailscale status**:
|
|
```bash
|
|
ssh gpu-pivoine "tailscale status"
|
|
```
|
|
|
|
**Test API locally (on GPU server)**:
|
|
```bash
|
|
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
|
|
```
|
|
|
|
**Test API via Tailscale (from VPS)**:
|
|
```bash
|
|
curl http://100.100.108.13:8000/v1/models
|
|
```
|
|
|
|
### LiteLLM Integration
|
|
|
|
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
|
|
|
|
```yaml
|
|
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
|
|
- model_name: qwen-2.5-7b
|
|
litellm_params:
|
|
model: openai/qwen-2.5-7b
|
|
api_base: http://100.100.108.13:8000/v1 # Tailscale IP
|
|
api_key: dummy
|
|
rpm: 1000
|
|
tpm: 100000
|
|
```
|
|
|
|
Restart LiteLLM:
|
|
```bash
|
|
arty restart litellm
|
|
```
|
|
|
|
### Troubleshooting
|
|
|
|
**Server not responding**:
|
|
1. Check if process is running: `pgrep -f simple_vllm_server`
|
|
2. Check logs: `tail -100 /workspace/vllm.log`
|
|
3. Check GPU availability: `nvidia-smi`
|
|
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
|
|
|
|
**Tailscale not connected**:
|
|
1. Check status: `tailscale status`
|
|
2. Check daemon: `ps aux | grep tailscaled`
|
|
3. Restart: `tailscale down && tailscale up`
|
|
|
|
**Model download failing**:
|
|
1. Check disk space: `df -h`
|
|
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
|
|
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
|
|
|
|
### Deployment Status ✅ COMPLETE
|
|
|
|
**Deployment Date**: 2025-11-21
|
|
|
|
1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
|
|
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
|
|
3. ✅ Update VPS LiteLLM configuration - COMPLETE
|
|
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
|
|
5. ⏳ Monitor performance and costs - ONGOING
|
|
|
|
**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
|
|
|
|
### Next Steps (2025-11-21 Original)
|
|
6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium)
|
|
7. ⏹️ Set up auto-stop for idle periods to save costs
|
|
|
|
---
|
|
|
|
## Multi-Modal Architecture (2025-11-21 Update)
|
|
|
|
### Overview
|
|
|
|
Expanded GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type.
|
|
|
|
### Architecture Components
|
|
|
|
#### 1. **Orchestrator Service** (Port 9000 - Always Running)
|
|
- **Location**: `ai/model-orchestrator/`
|
|
- **Purpose**: Central service managing model lifecycle
|
|
- **Features**:
|
|
- Detects request type (text/image/audio)
|
|
- Automatically unloads current model
|
|
- Loads requested model
|
|
- Proxies requests to active model
|
|
- Tracks GPU memory usage
|
|
- **Technology**: FastAPI + Docker SDK Python
|
|
- **Endpoints**:
|
|
- `POST /v1/chat/completions` → Routes to text models
|
|
- `POST /v1/images/generations` → Routes to image models
|
|
- `POST /v1/audio/generations` → Routes to music models
|
|
- `GET /health` → Shows active model and status
|
|
- `GET /models` → Lists all available models
|
|
- `POST /switch` → Manually switch models
|
|
|
|
#### 2. **Text Generation** (vLLM + Qwen 2.5 7B)
|
|
- **Service**: `vllm-qwen` (Port 8001)
|
|
- **Location**: `ai/vllm/`
|
|
- **Model**: Qwen/Qwen2.5-7B-Instruct
|
|
- **VRAM**: 14GB (85% GPU utilization)
|
|
- **Speed**: ~50 tokens/second
|
|
- **Startup**: 120 seconds
|
|
- **Status**: ✅ Working (same as original deployment)
|
|
|
|
#### 3. **Image Generation** (Flux.1 Schnell)
|
|
- **Service**: `flux` (Port 8002)
|
|
- **Location**: `ai/flux/`
|
|
- **Model**: black-forest-labs/FLUX.1-schnell
|
|
- **VRAM**: 14GB with CPU offloading
|
|
- **Speed**: 4-5 seconds per image
|
|
- **Startup**: 60 seconds
|
|
- **Features**: OpenAI DALL-E compatible API
|
|
- **Image**: `ghcr.io/matatonic/openedai-images-flux:latest`
|
|
|
|
#### 4. **Music Generation** (MusicGen Medium)
|
|
- **Service**: `musicgen` (Port 8003)
|
|
- **Location**: `ai/musicgen/`
|
|
- **Model**: facebook/musicgen-medium
|
|
- **VRAM**: 11GB
|
|
- **Speed**: 60-90 seconds for 30 seconds of audio
|
|
- **Startup**: 45 seconds
|
|
- **Features**: Text-to-music generation with sampling controls
|
|
- **Technology**: Meta's AudioCraft + custom FastAPI wrapper
|
|
|
|
### Model Registry (`models.yaml`)
|
|
|
|
Simple configuration file for managing all models:
|
|
|
|
```yaml
|
|
models:
|
|
qwen-2.5-7b:
|
|
type: text
|
|
framework: vllm
|
|
docker_service: vllm-qwen
|
|
port: 8001
|
|
vram_gb: 14
|
|
startup_time_seconds: 120
|
|
endpoint: /v1/chat/completions
|
|
|
|
flux-schnell:
|
|
type: image
|
|
framework: openedai-images
|
|
docker_service: flux
|
|
port: 8002
|
|
vram_gb: 14
|
|
startup_time_seconds: 60
|
|
endpoint: /v1/images/generations
|
|
|
|
musicgen-medium:
|
|
type: audio
|
|
framework: audiocraft
|
|
docker_service: musicgen
|
|
port: 8003
|
|
vram_gb: 11
|
|
startup_time_seconds: 45
|
|
endpoint: /v1/audio/generations
|
|
```
|
|
|
|
**Adding new models**: Just add a new entry to this file and define the Docker service.
|
|
|
|
### Deployment Changes
|
|
|
|
#### Docker Compose Structure
|
|
- **File**: `docker-compose.gpu.yaml`
|
|
- **Services**: 4 total (1 orchestrator + 3 models)
|
|
- **Profiles**: `text`, `image`, `audio` (orchestrator manages activation)
|
|
- **Restart Policy**: `no` for models (orchestrator controls lifecycle)
|
|
- **Volumes**: All model caches on `/workspace` (922TB network volume)
|
|
|
|
#### LiteLLM Integration
|
|
Updated `litellm-config.yaml` to route all self-hosted models through orchestrator:
|
|
|
|
```yaml
|
|
# Text
|
|
- model_name: qwen-2.5-7b
|
|
api_base: http://100.100.108.13:9000/v1 # Orchestrator
|
|
|
|
# Image
|
|
- model_name: flux-schnell
|
|
api_base: http://100.100.108.13:9000/v1 # Orchestrator
|
|
|
|
# Music
|
|
- model_name: musicgen-medium
|
|
api_base: http://100.100.108.13:9000/v1 # Orchestrator
|
|
```
|
|
|
|
All models now available via Open WebUI at https://ai.pivoine.art
|
|
|
|
### Usage Examples
|
|
|
|
**Text Generation**:
|
|
```bash
|
|
curl http://100.100.108.13:9000/v1/chat/completions \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
|
|
```
|
|
|
|
**Image Generation**:
|
|
```bash
|
|
curl http://100.100.108.13:9000/v1/images/generations \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
|
|
```
|
|
|
|
**Music Generation**:
|
|
```bash
|
|
curl http://100.100.108.13:9000/v1/audio/generations \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
|
|
```
|
|
|
|
### Deployment Commands
|
|
|
|
```bash
|
|
# Copy all files to RunPod
|
|
scp -r ai/* gpu-pivoine:/workspace/ai/
|
|
|
|
# SSH to GPU server
|
|
ssh gpu-pivoine
|
|
cd /workspace/ai/
|
|
|
|
# Start orchestrator (manages everything)
|
|
docker compose -f docker-compose.gpu.yaml up -d orchestrator
|
|
|
|
# Check status
|
|
curl http://100.100.108.13:9000/health
|
|
|
|
# View logs
|
|
docker logs -f ai_orchestrator
|
|
|
|
# Manually switch models (optional)
|
|
curl -X POST http://100.100.108.13:9000/switch \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"model": "flux-schnell"}'
|
|
```
|
|
|
|
### Performance Characteristics
|
|
|
|
| Model | VRAM | Startup Time | Generation Time | Notes |
|
|
|-------|------|--------------|-----------------|-------|
|
|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation |
|
|
| Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images |
|
|
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music |
|
|
|
|
**Model Switching Overhead**: 30-120 seconds (unload + load)
|
|
|
|
### Cost Analysis
|
|
|
|
**Current (Single GPU Sequential)**:
|
|
- Cost: ~$0.50/hour
|
|
- Monthly: ~$360 (24/7) or ~$120 (8hr/day)
|
|
- Trade-off: 30-120s switching time
|
|
|
|
**Alternative (Multi-GPU Concurrent)**:
|
|
- Cost: ~$0.75/hour (+50%)
|
|
- Monthly: ~$540 (24/7) or ~$180 (8hr/day)
|
|
- Benefit: No switching time, all models always available
|
|
|
|
**Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases.
|
|
|
|
### Known Limitations
|
|
|
|
1. **Sequential Only**: Only one model active at a time
|
|
2. **Switching Latency**: 30-120 seconds to change models
|
|
3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial)
|
|
4. **Spot Instance Volatility**: Pod can be terminated anytime
|
|
|
|
### Monitoring
|
|
|
|
**Check active model**:
|
|
```bash
|
|
curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}'
|
|
```
|
|
|
|
**View orchestrator logs**:
|
|
```bash
|
|
docker logs -f ai_orchestrator
|
|
```
|
|
|
|
**GPU usage**:
|
|
```bash
|
|
ssh gpu-pivoine "nvidia-smi"
|
|
```
|
|
|
|
### Deployment Status ✅ COMPLETE (Multi-Modal)
|
|
|
|
**Deployment Date**: 2025-11-21
|
|
|
|
1. ✅ Create model orchestrator service - COMPLETE
|
|
2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE
|
|
3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE
|
|
4. ✅ Deploy MusicGen Medium music generation - COMPLETE
|
|
5. ✅ Update LiteLLM configuration - COMPLETE
|
|
6. ✅ Test all three model types via orchestrator - READY FOR TESTING
|
|
7. ⏳ Monitor performance and costs - ONGOING
|
|
|
|
**Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI
|
|
|
|
### Future Model Additions
|
|
|
|
**Easy to add** (just edit `models.yaml`):
|
|
- Llama 3.1 8B Instruct (text, gated model)
|
|
- Whisper Large v3 (speech-to-text)
|
|
- XTTS v2 (text-to-speech)
|
|
- Stable Diffusion XL (alternative image generation)
|
|
|
|
See `README.md` for detailed instructions on adding new models.
|
|
|
|
### Cost Optimization Ideas
|
|
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
|
|
2. **Spot Instances**: Already using Spot for 50% cost reduction
|
|
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
|
|
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
|
|
5. **Pay-as-you-go**: Manually start/stop pod as needed
|
|
|
|
### Performance Benchmarks
|
|
*To be measured after deployment*
|
|
|
|
Expected (based on RTX 4090):
|
|
- Qwen 2.5 7B: 50-80 tokens/second
|
|
- Context processing: ~2-3 seconds for 1000 tokens
|
|
- First token latency: ~200-300ms
|