GPU Server Deployment Log
Current Deployment (2025-11-21)
Infrastructure
- Provider: RunPod (Spot Instance)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD (expanded from 20GB)
- Network Volume: 922TB at /workspace
- Region: Europe
- Cost: $0.50/hour ($360/month if running 24/7)
Network Configuration
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- GPU Server Tailscale IP: 100.100.108.13
- VPS Tailscale IP: (get with tailscale ip -4 on the VPS)
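For reference, bringing Tailscale up on a fresh pod looks roughly like the commands below. The auth key is a placeholder from the Tailscale admin console, and running tailscaled manually in the background (with userspace networking if the container does not expose /dev/net/tun) is an assumption about the pod setup, not a record of the exact steps used.
# Install Tailscale via the official script and join the tailnet
# (TS_AUTHKEY is a placeholder generated in the Tailscale admin console)
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking > /var/log/tailscaled.log 2>&1 &
tailscale up --authkey "$TS_AUTHKEY" --hostname gpu-pivoine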
SSH Access
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
Note: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
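Until the config is updated, a one-off connection with the new values from the RunPod console works as a stopgap (the IP and port below are placeholders):
# One-off SSH using the new connection details shown in the RunPod console
ssh -i ~/.ssh/id_ed25519 -p <new-port> root@<new-ip>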
Software Stack
- Python: 3.11.10
- vLLM: 0.6.4.post1 (installed with pip)
- PyTorch: 2.5.1 with CUDA 12.4
- Tailscale: Installed via official script
vLLM Deployment
Custom Server: ai/simple_vllm_server.py
- Uses AsyncLLMEngine directly to bypass multiprocessing issues
- OpenAI-compatible API endpoints:
  - GET /v1/models - List available models
  - POST /v1/completions - Text completion
  - POST /v1/chat/completions - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: /workspace/huggingface_cache
Deployment Command:
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
Server Configuration (environment variables):
- VLLM_HOST: 0.0.0.0 (default)
- VLLM_PORT: 8000 (default)
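Assuming the custom server reads these variables as described above, a non-default bind can be set at launch time, for example:
# Example: start the server on port 8001 instead of the default 8000
ssh gpu-pivoine "cd /workspace && VLLM_HOST=0.0.0.0 VLLM_PORT=8001 nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"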
Model Configuration
- Model: Qwen/Qwen2.5-7B-Instruct (no auth required)
- Context Length: 4096 tokens
- GPU Memory: 85% utilization
- Tensor Parallel: 1 (single GPU)
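These settings map onto standard vLLM engine arguments. For reference only, if the stock OpenAI-compatible server did not hit the multiprocessing issue described below, the equivalent launch would look roughly like:
# Equivalent settings with the stock vLLM OpenAI server (reference only; not used here)
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1 \
  --download-dir /workspace/huggingface_cache \
  --host 0.0.0.0 --port 8000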
Known Issues & Solutions
Issue 1: vLLM Multiprocessing Errors
Problem: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
Solution: Custom AsyncLLMEngine FastAPI server bypasses multiprocessing layer entirely.
Issue 2: Disk Space (Solved)
Problem: Original 20GB disk filled up with Hugging Face cache.
Solution: Expanded to 50GB and use /workspace for model cache.
Issue 3: Gated Models
Problem: Llama models require Hugging Face authentication.
Solution: Use Qwen 2.5 7B Instruct (no auth required) or set HF_TOKEN environment variable.
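Assuming the cache location and token are picked up through the standard Hugging Face environment variables (not confirmed for the custom server), both the disk and authentication issues can be handled before launch:
# Keep model downloads on the network volume and authenticate for gated models
# (HF_TOKEN value is a placeholder; only needed for gated models like Llama)
export HF_HOME=/workspace/huggingface_cache
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxx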
Issue 4: Spot Instance Volatility
Problem: RunPod Spot instances can be terminated at any time.
Solution: Accept this as a trade-off for the cost savings. Document SSH details for quick reconnection.
Monitoring
Check vLLM logs:
ssh gpu-pivoine "tail -f /workspace/vllm.log"
Check GPU usage:
ssh gpu-pivoine "nvidia-smi"
Check Tailscale status:
ssh gpu-pivoine "tailscale status"
Test API locally (on GPU server):
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
Test API via Tailscale (from VPS):
curl http://100.100.108.13:8000/v1/models
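Beyond listing models, a chat-completion round trip exercises the full inference path. The model id should match whatever GET /v1/models returns; Qwen/Qwen2.5-7B-Instruct is assumed here, and the prompt is arbitrary:
# End-to-end inference test from the VPS over Tailscale
curl http://100.100.108.13:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'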
LiteLLM Integration
Update VPS LiteLLM config at ai/litellm-config-gpu.yaml:
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
Restart LiteLLM:
arty restart litellm
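After the restart, the same request can be sent through the LiteLLM proxy to confirm routing end to end. The proxy URL and key below are placeholders for the VPS-specific values (LiteLLM listens on port 4000 by default):
# Verify qwen-2.5-7b is reachable through LiteLLM (URL and key are placeholders)
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "ping"}]}'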
Troubleshooting
Server not responding:
- Check if the process is running: pgrep -f simple_vllm_server
- Check logs: tail -100 /workspace/vllm.log
- Check GPU availability: nvidia-smi
- Restart the server: pkill -f simple_vllm_server && cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &
Tailscale not connected:
- Check status: tailscale status
- Check daemon: ps aux | grep tailscaled
- Restart: tailscale down && tailscale up
Model download failing:
- Check disk space: df -h
- Check cache directory: ls -lah /workspace/huggingface_cache
- Clear cache if needed: rm -rf /workspace/huggingface_cache/*
Deployment Status ✅ COMPLETE
Deployment Date: 2025-11-21
- ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
- ✅ Test API endpoints locally and via Tailscale - COMPLETE
- ✅ Update VPS LiteLLM configuration - COMPLETE
- ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
- ⏳ Monitor performance and costs - ONGOING
Model Available: qwen-2.5-7b visible in Open WebUI at https://ai.pivoine.art
Next Steps
- ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
- ⏹️ Set up auto-stop for idle periods to save costs
Cost Optimization Ideas
- Auto-stop: Configure RunPod to auto-stop after 30 minutes idle (see the sketch after this list)
- Spot Instances: Already using Spot for 50% cost reduction
- Scheduled Operation: Run only during business hours (8 hours/day = $120/month)
- Smaller Models: Use Mistral 7B or quantized models for lighter workloads
- Pay-as-you-go: Manually start/stop pod as needed
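A minimal sketch of the auto-stop idea, assuming runpodctl is available on the pod and RUNPOD_POD_ID is set in the environment; the exact stop command should be verified against the runpodctl documentation:
# Hypothetical idle watchdog: stop the pod after 30 minutes of near-zero GPU utilization
IDLE_LIMIT=1800   # seconds of idle time before stopping
IDLE=0
while true; do
  UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  if [ "$UTIL" -lt 5 ]; then IDLE=$((IDLE + 60)); else IDLE=0; fi
  if [ "$IDLE" -ge "$IDLE_LIMIT" ]; then
    runpodctl stop pod "$RUNPOD_POD_ID"   # assumed command; check runpodctl docs
    break
  fi
  sleep 60
done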
Performance Benchmarks
To be measured after deployment
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms
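A rough way to fill in these numbers once the server is up, assuming the custom server returns the standard OpenAI usage block and that jq and bc are installed on the VPS (the figure includes prompt processing, so it will read slightly low):
# Rough tokens/second measurement against the Tailscale endpoint
START=$(date +%s.%N)
RESP=$(curl -s http://100.100.108.13:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Explain what a VPN is.", "max_tokens": 256}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "$TOKENS completion tokens, $(echo "scale=1; $TOKENS / ($END - $START)" | bc) tokens/s"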