# GPU Server Deployment Log

## Current Deployment (2025-11-21)

### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)

### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on the VPS)

### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```

**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update the SSH config accordingly.

### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: installed via the official script

### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues (a hedged sketch of this pattern follows the Known Issues & Solutions section below)
- OpenAI-compatible API endpoints:
  - `GET /v1/models` - list available models
  - `POST /v1/completions` - text completion
  - `POST /v1/chat/completions` - chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`

**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/

# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"

# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```

**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)

### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)

### Known Issues & Solutions

#### Issue 1: vLLM Multiprocessing Errors
**Problem**: The default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: The custom `AsyncLLMEngine` FastAPI server bypasses the multiprocessing layer entirely.

#### Issue 2: Disk Space (Solved)
**Problem**: The original 20GB disk filled up with the Hugging Face cache.
**Solution**: Expanded the disk to 50GB and moved the model cache to `/workspace`.

#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set the `HF_TOKEN` environment variable.

#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated at any time.
**Solution**: Accept this as a trade-off for cost savings. Document SSH details for quick reconnection.
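For reference, the following is a minimal sketch of the `AsyncLLMEngine` + FastAPI pattern the custom server relies on. It is **not** the contents of `ai/simple_vllm_server.py`: the endpoint set is trimmed to `/v1/models` and `/v1/completions`, the response shapes are simplified, and the file name is illustrative. It assumes vLLM 0.6.4.post1 as listed above.

```python
# minimal_vllm_sketch.py - illustrative sketch, not the actual ai/simple_vllm_server.py
import os
import time

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

MODEL = "Qwen/Qwen2.5-7B-Instruct"
os.environ.setdefault("HF_HOME", "/workspace/huggingface_cache")

# Build the engine in-process: no separate engine process, which is the point
# of this workaround on RunPod (Issue 1 above).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model=MODEL,
        max_model_len=4096,           # matches the context length above
        gpu_memory_utilization=0.85,  # matches the 85% setting above
        tensor_parallel_size=1,
    )
)

app = FastAPI()


class CompletionRequest(BaseModel):
    model: str = MODEL
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@app.get("/v1/models")
async def list_models():
    # Minimal OpenAI-style model listing.
    return {"object": "list", "data": [{"id": MODEL, "object": "model"}]}


@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
    request_id = random_uuid()

    # Consume the async generator; the last yielded item holds the full output.
    final = None
    async for output in engine.generate(req.prompt, params, request_id):
        final = output

    choice = final.outputs[0]
    return {
        "id": f"cmpl-{request_id}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": MODEL,
        "choices": [{"index": 0, "text": choice.text,
                     "finish_reason": choice.finish_reason}],
    }


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host=os.getenv("VLLM_HOST", "0.0.0.0"),
                port=int(os.getenv("VLLM_PORT", "8000")))
```

Keeping the engine in the same process as FastAPI is the whole workaround for Issue 1: there is no separate API-server process to coordinate with over ZMQ.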
### Monitoring

**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```

**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```

**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```

**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```

**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```

### LiteLLM Integration

Update the VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:

```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
```

Restart LiteLLM:
```bash
arty restart litellm
```

### Troubleshooting

**Server not responding**:
1. Check if the process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart the server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`

**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`

**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`

### Deployment Status ✅ COMPLETE

**Deployment Date**: 2025-11-21

1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
3. ✅ Update VPS LiteLLM configuration - COMPLETE
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
5. ⏳ Monitor performance and costs - ONGOING

**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art

### Next Steps
6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
7. ⏹️ Set up auto-stop for idle periods to save costs

### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
2. **Spot Instances**: Already using Spot for ~50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day ≈ $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop the pod as needed

### Performance Benchmarks

*To be measured after deployment.*

Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First-token latency: ~200-300ms
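To turn the expected numbers above into measured ones, a quick throughput check can be run from the VPS over Tailscale. This is a rough sketch, not a calibrated benchmark: `BASE_URL`, the prompt, and the fallback to `max_tokens` when no `usage` field is returned are assumptions, and it only reports end-to-end tokens per second rather than first-token latency (which would require streaming support on the server).

```python
# bench_vllm.py - rough throughput check against the vLLM completions endpoint.
# Assumes the Tailscale IP and port documented above; adjust BASE_URL as needed.
import time

import requests

BASE_URL = "http://100.100.108.13:8000/v1"  # Tailscale IP of the GPU server
MODEL = "Qwen/Qwen2.5-7B-Instruct"
MAX_TOKENS = 256

payload = {
    "model": MODEL,
    "prompt": "Explain what vLLM's PagedAttention does, in three sentences.",
    "max_tokens": MAX_TOKENS,
    "temperature": 0.7,
}

start = time.monotonic()
resp = requests.post(f"{BASE_URL}/completions", json=payload, timeout=120)
elapsed = time.monotonic() - start
resp.raise_for_status()
body = resp.json()

# Prefer the reported completion token count; otherwise assume the request
# ran to max_tokens (an overestimate if generation stopped early).
completion_tokens = body.get("usage", {}).get("completion_tokens", MAX_TOKENS)

print(f"wall time:         {elapsed:.2f}s")
print(f"completion tokens: {completion_tokens}")
print(f"approx throughput: {completion_tokens / elapsed:.1f} tokens/s")
print("sample output:", body["choices"][0]["text"][:200].strip())
```

Run it a few times and discard the first result, so the numbers are not skewed by model warm-up.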