# GPU Server Deployment Log ## Current Deployment (2025-11-21) ### Infrastructure - **Provider**: RunPod (Spot Instance) - **GPU**: NVIDIA RTX 4090 24GB - **Disk**: 50GB local SSD (expanded from 20GB) - **Network Volume**: 922TB at `/workspace` - **Region**: Europe - **Cost**: ~$0.50/hour (~$360/month if running 24/7) ### Network Configuration - **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions) - **GPU Server Tailscale IP**: 100.100.108.13 - **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS) ### SSH Access ``` Host gpu-pivoine HostName 213.173.102.232 Port 29695 User root IdentityFile ~/.ssh/id_ed25519 ``` **Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly. ### Software Stack - **Python**: 3.11.10 - **vLLM**: 0.6.4.post1 (installed with pip) - **PyTorch**: 2.5.1 with CUDA 12.4 - **Tailscale**: Installed via official script ### vLLM Deployment **Custom Server**: `ai/simple_vllm_server.py` - Uses `AsyncLLMEngine` directly to bypass multiprocessing issues - OpenAI-compatible API endpoints: - `GET /v1/models` - List available models - `POST /v1/completions` - Text completion - `POST /v1/chat/completions` - Chat completion - Default model: Qwen/Qwen2.5-7B-Instruct - Cache directory: `/workspace/huggingface_cache` **Deployment Command**: ```bash # Copy server script to GPU server scp ai/simple_vllm_server.py gpu-pivoine:/workspace/ # Start server ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &" # Check status ssh gpu-pivoine "curl http://localhost:8000/v1/models" ``` **Server Configuration** (environment variables): - `VLLM_HOST`: 0.0.0.0 (default) - `VLLM_PORT`: 8000 (default) ### Model Configuration - **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required) - **Context Length**: 4096 tokens - **GPU Memory**: 85% utilization - **Tensor Parallel**: 1 (single GPU) ### Known Issues & Solutions #### Issue 1: vLLM Multiprocessing Errors **Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod. **Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely. #### Issue 2: Disk Space (Solved) **Problem**: Original 20GB disk filled up with Hugging Face cache. **Solution**: Expanded to 50GB and use `/workspace` for model cache. #### Issue 3: Gated Models **Problem**: Llama models require Hugging Face authentication. **Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable. #### Issue 4: Spot Instance Volatility **Problem**: RunPod Spot instances can be terminated anytime. **Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection. ### Monitoring **Check vLLM logs**: ```bash ssh gpu-pivoine "tail -f /workspace/vllm.log" ``` **Check GPU usage**: ```bash ssh gpu-pivoine "nvidia-smi" ``` **Check Tailscale status**: ```bash ssh gpu-pivoine "tailscale status" ``` **Test API locally (on GPU server)**: ```bash ssh gpu-pivoine "curl http://localhost:8000/v1/models" ``` **Test API via Tailscale (from VPS)**: ```bash curl http://100.100.108.13:8000/v1/models ``` ### LiteLLM Integration Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`: ```yaml # Replace old WireGuard IP (10.8.0.2) with Tailscale IP - model_name: qwen-2.5-7b litellm_params: model: openai/qwen-2.5-7b api_base: http://100.100.108.13:8000/v1 # Tailscale IP api_key: dummy rpm: 1000 tpm: 100000 ``` Restart LiteLLM: ```bash arty restart litellm ``` ### Troubleshooting **Server not responding**: 1. Check if process is running: `pgrep -f simple_vllm_server` 2. Check logs: `tail -100 /workspace/vllm.log` 3. Check GPU availability: `nvidia-smi` 4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &` **Tailscale not connected**: 1. Check status: `tailscale status` 2. Check daemon: `ps aux | grep tailscaled` 3. Restart: `tailscale down && tailscale up` **Model download failing**: 1. Check disk space: `df -h` 2. Check cache directory: `ls -lah /workspace/huggingface_cache` 3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*` ### Deployment Status ✅ COMPLETE **Deployment Date**: 2025-11-21 1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE 2. ✅ Test API endpoints locally and via Tailscale - COMPLETE 3. ✅ Update VPS LiteLLM configuration - COMPLETE 4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE 5. ⏳ Monitor performance and costs - ONGOING **Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art ### Next Steps (2025-11-21 Original) 6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium) 7. ⏹️ Set up auto-stop for idle periods to save costs --- ## Multi-Modal Architecture (2025-11-21 Update) ### Overview Expanded GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type. ### Architecture Components #### 1. **Orchestrator Service** (Port 9000 - Always Running) - **Location**: `ai/model-orchestrator/` - **Purpose**: Central service managing model lifecycle - **Features**: - Detects request type (text/image/audio) - Automatically unloads current model - Loads requested model - Proxies requests to active model - Tracks GPU memory usage - **Technology**: FastAPI + Docker SDK Python - **Endpoints**: - `POST /v1/chat/completions` → Routes to text models - `POST /v1/images/generations` → Routes to image models - `POST /v1/audio/generations` → Routes to music models - `GET /health` → Shows active model and status - `GET /models` → Lists all available models - `POST /switch` → Manually switch models #### 2. **Text Generation** (vLLM + Qwen 2.5 7B) - **Service**: `vllm-qwen` (Port 8001) - **Location**: `ai/vllm/` - **Model**: Qwen/Qwen2.5-7B-Instruct - **VRAM**: 14GB (85% GPU utilization) - **Speed**: ~50 tokens/second - **Startup**: 120 seconds - **Status**: ✅ Working (same as original deployment) #### 3. **Image Generation** (Flux.1 Schnell) - **Service**: `flux` (Port 8002) - **Location**: `ai/flux/` - **Model**: black-forest-labs/FLUX.1-schnell - **VRAM**: 14GB with CPU offloading - **Speed**: 4-5 seconds per image - **Startup**: 60 seconds - **Features**: OpenAI DALL-E compatible API - **Image**: `ghcr.io/matatonic/openedai-images-flux:latest` #### 4. **Music Generation** (MusicGen Medium) - **Service**: `musicgen` (Port 8003) - **Location**: `ai/musicgen/` - **Model**: facebook/musicgen-medium - **VRAM**: 11GB - **Speed**: 60-90 seconds for 30 seconds of audio - **Startup**: 45 seconds - **Features**: Text-to-music generation with sampling controls - **Technology**: Meta's AudioCraft + custom FastAPI wrapper ### Model Registry (`models.yaml`) Simple configuration file for managing all models: ```yaml models: qwen-2.5-7b: type: text framework: vllm docker_service: vllm-qwen port: 8001 vram_gb: 14 startup_time_seconds: 120 endpoint: /v1/chat/completions flux-schnell: type: image framework: openedai-images docker_service: flux port: 8002 vram_gb: 14 startup_time_seconds: 60 endpoint: /v1/images/generations musicgen-medium: type: audio framework: audiocraft docker_service: musicgen port: 8003 vram_gb: 11 startup_time_seconds: 45 endpoint: /v1/audio/generations ``` **Adding new models**: Just add a new entry to this file and define the Docker service. ### Deployment Changes #### Docker Compose Structure - **File**: `docker-compose.gpu.yaml` - **Services**: 4 total (1 orchestrator + 3 models) - **Profiles**: `text`, `image`, `audio` (orchestrator manages activation) - **Restart Policy**: `no` for models (orchestrator controls lifecycle) - **Volumes**: All model caches on `/workspace` (922TB network volume) #### LiteLLM Integration Updated `litellm-config.yaml` to route all self-hosted models through orchestrator: ```yaml # Text - model_name: qwen-2.5-7b api_base: http://100.100.108.13:9000/v1 # Orchestrator # Image - model_name: flux-schnell api_base: http://100.100.108.13:9000/v1 # Orchestrator # Music - model_name: musicgen-medium api_base: http://100.100.108.13:9000/v1 # Orchestrator ``` All models now available via Open WebUI at https://ai.pivoine.art ### Usage Examples **Text Generation**: ```bash curl http://100.100.108.13:9000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}' ``` **Image Generation**: ```bash curl http://100.100.108.13:9000/v1/images/generations \ -H "Content-Type: application/json" \ -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}' ``` **Music Generation**: ```bash curl http://100.100.108.13:9000/v1/audio/generations \ -H "Content-Type: application/json" \ -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}' ``` ### Deployment Commands ```bash # Copy all files to RunPod scp -r ai/* gpu-pivoine:/workspace/ai/ # SSH to GPU server ssh gpu-pivoine cd /workspace/ai/ # Start orchestrator (manages everything) docker compose -f docker-compose.gpu.yaml up -d orchestrator # Check status curl http://100.100.108.13:9000/health # View logs docker logs -f ai_orchestrator # Manually switch models (optional) curl -X POST http://100.100.108.13:9000/switch \ -H "Content-Type: application/json" \ -d '{"model": "flux-schnell"}' ``` ### Performance Characteristics | Model | VRAM | Startup Time | Generation Time | Notes | |-------|------|--------------|-----------------|-------| | Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation | | Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images | | MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music | **Model Switching Overhead**: 30-120 seconds (unload + load) ### Cost Analysis **Current (Single GPU Sequential)**: - Cost: ~$0.50/hour - Monthly: ~$360 (24/7) or ~$120 (8hr/day) - Trade-off: 30-120s switching time **Alternative (Multi-GPU Concurrent)**: - Cost: ~$0.75/hour (+50%) - Monthly: ~$540 (24/7) or ~$180 (8hr/day) - Benefit: No switching time, all models always available **Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases. ### Known Limitations 1. **Sequential Only**: Only one model active at a time 2. **Switching Latency**: 30-120 seconds to change models 3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial) 4. **Spot Instance Volatility**: Pod can be terminated anytime ### Monitoring **Check active model**: ```bash curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}' ``` **View orchestrator logs**: ```bash docker logs -f ai_orchestrator ``` **GPU usage**: ```bash ssh gpu-pivoine "nvidia-smi" ``` ### Deployment Status ✅ COMPLETE (Multi-Modal) **Deployment Date**: 2025-11-21 1. ✅ Create model orchestrator service - COMPLETE 2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE 3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE 4. ✅ Deploy MusicGen Medium music generation - COMPLETE 5. ✅ Update LiteLLM configuration - COMPLETE 6. ✅ Test all three model types via orchestrator - READY FOR TESTING 7. ⏳ Monitor performance and costs - ONGOING **Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI ### Future Model Additions **Easy to add** (just edit `models.yaml`): - Llama 3.1 8B Instruct (text, gated model) - Whisper Large v3 (speech-to-text) - XTTS v2 (text-to-speech) - Stable Diffusion XL (alternative image generation) See `README.md` for detailed instructions on adding new models. ### Cost Optimization Ideas 1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle 2. **Spot Instances**: Already using Spot for 50% cost reduction 3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month) 4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads 5. **Pay-as-you-go**: Manually start/stop pod as needed ### Performance Benchmarks *To be measured after deployment* Expected (based on RTX 4090): - Qwen 2.5 7B: 50-80 tokens/second - Context processing: ~2-3 seconds for 1000 tokens - First token latency: ~200-300ms