
GPU Server Deployment Log

Current Deployment (2025-11-21)

Infrastructure

  • Provider: RunPod (Spot Instance)
  • GPU: NVIDIA RTX 4090 24GB
  • Disk: 50GB local SSD (expanded from 20GB)
  • Network Volume: 922TB at /workspace
  • Region: Europe
  • Cost: $0.50/hour ($360/month if running 24/7)

Network Configuration

  • VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
  • GPU Server Tailscale IP: 100.100.108.13
  • VPS Tailscale IP: (get with tailscale ip -4 on VPS)
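
To confirm connectivity over the tailnet, run a quick check from the VPS (assuming Tailscale is already up on both ends):

# On the VPS: show its Tailscale IPv4 and verify it can reach the GPU server
tailscale ip -4
tailscale ping 100.100.108.13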

SSH Access

Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519

Note: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
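
If the pod comes back with new connection details, they can be looked up with RunPod's CLI; a minimal sketch, assuming runpodctl is installed locally and authenticated with a RunPod API key:

# List pods and their current status/connection info, then update the SSH config above
runpodctl get pod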

Software Stack

  • Python: 3.11.10
  • vLLM: 0.6.4.post1 (installed with pip)
  • PyTorch: 2.5.1 with CUDA 12.4
  • Tailscale: Installed via official script
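
For reference, a minimal re-install sketch for a fresh pod, with the version pinned to match the stack above (the Tailscale one-liner is their official install script; tailscaled may need to be started manually inside the pod):

# Install the pinned vLLM release (should pull in the matching PyTorch/CUDA build)
pip install vllm==0.6.4.post1

# Install Tailscale and join the tailnet
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up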

vLLM Deployment

Custom Server: ai/simple_vllm_server.py

  • Uses AsyncLLMEngine directly to bypass multiprocessing issues
  • OpenAI-compatible API endpoints:
    • GET /v1/models - List available models
    • POST /v1/completions - Text completion
    • POST /v1/chat/completions - Chat completion
  • Default model: Qwen/Qwen2.5-7B-Instruct
  • Cache directory: /workspace/huggingface_cache

Deployment Command:

# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/

# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"

# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"

Server Configuration (environment variables):

  • VLLM_HOST: 0.0.0.0 (default)
  • VLLM_PORT: 8000 (default)
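
These can be set inline when starting the server; the values shown here are the defaults:

# Example: start the server with the host/port variables set explicitly
ssh gpu-pivoine "cd /workspace && VLLM_HOST=0.0.0.0 VLLM_PORT=8000 nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"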

Model Configuration

  • Model: Qwen/Qwen2.5-7B-Instruct (no auth required)
  • Context Length: 4096 tokens
  • GPU Memory: 85% utilization
  • Tensor Parallel: 1 (single GPU)

Known Issues & Solutions

Issue 1: vLLM Multiprocessing Errors

Problem: The default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.

Solution: The custom AsyncLLMEngine FastAPI server bypasses the multiprocessing layer entirely.

Issue 2: Disk Space (Solved)

Problem: The original 20GB disk filled up with the Hugging Face cache.

Solution: Expanded the disk to 50GB and moved the model cache to /workspace.

Issue 3: Gated Models

Problem: Llama models require Hugging Face authentication.

Solution: Use Qwen 2.5 7B Instruct (no auth required), or set the HF_TOKEN environment variable as in the example below.
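
A minimal sketch for a gated model, with the token value as a placeholder (HF_TOKEN is read by the Hugging Face libraries at download time):

# Start the server with a Hugging Face token exported (hf_xxxxxxxx is a placeholder)
ssh gpu-pivoine "cd /workspace && HF_TOKEN=hf_xxxxxxxx nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"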

Issue 4: Spot Instance Volatility

Problem: RunPod Spot instances can be terminated at any time.

Solution: Accepted as a trade-off for the cost savings; SSH details are documented above for quick reconnection.

Monitoring

Check vLLM logs:

ssh gpu-pivoine "tail -f /workspace/vllm.log"

Check GPU usage:

ssh gpu-pivoine "nvidia-smi"

Check Tailscale status:

ssh gpu-pivoine "tailscale status"

Test API locally (on GPU server):

ssh gpu-pivoine "curl http://localhost:8000/v1/models"

Test API via Tailscale (from VPS):

curl http://100.100.108.13:8000/v1/models
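
A chat-completion smoke test from the VPS over the same route (message content is arbitrary; the model name is assumed to match the server default):

# Chat completion via Tailscale from the VPS
curl -s http://100.100.108.13:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'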

LiteLLM Integration

Update VPS LiteLLM config at ai/litellm-config-gpu.yaml:

# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000

Restart LiteLLM:

arty restart litellm
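
To verify that LiteLLM picked up the new backend, list its models from the VPS; a sketch assuming LiteLLM listens on its default port 4000 and a master key is configured:

# qwen-2.5-7b should appear in the model list
curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY"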

Troubleshooting

Server not responding:

  1. Check if process is running: pgrep -f simple_vllm_server
  2. Check logs: tail -100 /workspace/vllm.log
  3. Check GPU availability: nvidia-smi
  4. Restart server: pkill -f simple_vllm_server && cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &

Tailscale not connected:

  1. Check status: tailscale status
  2. Check daemon: ps aux | grep tailscaled
  3. Restart: tailscale down && tailscale up

Model download failing:

  1. Check disk space: df -h
  2. Check cache directory: ls -lah /workspace/huggingface_cache
  3. Clear cache if needed: rm -rf /workspace/huggingface_cache/*

Deployment Status: COMPLETE

Deployment Date: 2025-11-21

  1. Deploy vLLM with Qwen 2.5 7B - COMPLETE
  2. Test API endpoints locally and via Tailscale - COMPLETE
  3. Update VPS LiteLLM configuration - COMPLETE
  4. Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
  5. Monitor performance and costs - ONGOING

Model Available: qwen-2.5-7b visible in Open WebUI at https://ai.pivoine.art

Next Steps

  1. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
  2. ⏹️ Set up auto-stop for idle periods to save costs

Cost Optimization Ideas

  1. Auto-stop: Configure RunPod to auto-stop after 30 minutes idle (a watchdog sketch follows this list)
  2. Spot Instances: Already using Spot for 50% cost reduction
  3. Scheduled Operation: Run only during business hours (8 hours/day = $120/month)
  4. Smaller Models: Use Mistral 7B or quantized models for lighter workloads
  5. Pay-as-you-go: Manually start/stop pod as needed
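
Following idea 1, a minimal idle-watchdog sketch that could run on the GPU server; the pod ID is a placeholder, the thresholds are arbitrary, and runpodctl is assumed to be installed and authenticated with a RunPod API key:

# Stop the pod after ~30 minutes of 0% GPU utilization
POD_ID="your-pod-id"        # placeholder
CHECK_INTERVAL=60           # seconds between checks
IDLE_LIMIT=30               # consecutive idle checks before stopping (30 x 60s = 30 min)
idle_count=0

while true; do
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  if [ "$util" -eq 0 ]; then
    idle_count=$((idle_count + 1))
  else
    idle_count=0
  fi
  if [ "$idle_count" -ge "$IDLE_LIMIT" ]; then
    runpodctl stop pod "$POD_ID"
    break
  fi
  sleep "$CHECK_INTERVAL"
done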

Performance Benchmarks

To be measured after deployment; a rough measurement sketch follows the expected figures below.

Expected (based on RTX 4090):

  • Qwen 2.5 7B: 50-80 tokens/second
  • Context processing: ~2-3 seconds for 1000 tokens
  • First token latency: ~200-300ms
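
A rough way to measure throughput once the server is up, assuming the custom server mirrors the OpenAI response format (including a usage block) and that jq and bc are available on the VPS:

# Request 256 tokens from the VPS and divide completion tokens by wall-clock time
start=$(date +%s.%N)
resp=$(curl -s http://100.100.108.13:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Write a short story about a lighthouse.", "max_tokens": 256}')
end=$(date +%s.%N)
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "tokens/sec: $(echo "$tokens / ($end - $start)" | bc -l)"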