
GPU Server Deployment Log

Current Deployment (2025-11-21)

Infrastructure

  • Provider: RunPod (Spot Instance)
  • GPU: NVIDIA RTX 4090 24GB
  • Disk: 50GB local SSD (expanded from 20GB)
  • Network Volume: 922TB at /workspace
  • Region: Europe
  • Cost: $0.50/hour ($360/month if running 24/7)

Network Configuration

  • VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
  • GPU Server Tailscale IP: 100.100.108.13
  • VPS Tailscale IP: (get with tailscale ip -4 on VPS)
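
To confirm connectivity over the tailnet, run a quick check from the VPS (assuming Tailscale is already up on both ends):

# On the VPS: show its Tailscale IPv4 and verify it can reach the GPU server
tailscale ip -4
tailscale ping 100.100.108.13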

SSH Access

Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519

Note: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
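
If the pod comes back with new connection details, they can be looked up with RunPod's CLI; a minimal sketch, assuming runpodctl is installed locally and authenticated with a RunPod API key:

# List pods and their current status/connection info, then update the SSH config above
runpodctl get pod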

Software Stack

  • Python: 3.11.10
  • vLLM: 0.6.4.post1 (installed with pip)
  • PyTorch: 2.5.1 with CUDA 12.4
  • Tailscale: Installed via official script
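
For reference, a minimal re-install sketch for a fresh pod, with the version pinned to match the stack above (the Tailscale one-liner is their official install script; tailscaled may need to be started manually inside the pod):

# Install the pinned vLLM release (should pull in the matching PyTorch/CUDA build)
pip install vllm==0.6.4.post1

# Install Tailscale and join the tailnet
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up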

vLLM Deployment

Custom Server: ai/simple_vllm_server.py

  • Uses AsyncLLMEngine directly to bypass multiprocessing issues
  • OpenAI-compatible API endpoints:
    • GET /v1/models - List available models
    • POST /v1/completions - Text completion
    • POST /v1/chat/completions - Chat completion
  • Default model: Qwen/Qwen2.5-7B-Instruct
  • Cache directory: /workspace/huggingface_cache

Deployment Command:

# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/

# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"

# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"

Server Configuration (environment variables):

  • VLLM_HOST: 0.0.0.0 (default)
  • VLLM_PORT: 8000 (default)
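
These can be set inline when starting the server; the values shown here are the defaults:

# Example: start the server with the host/port variables set explicitly
ssh gpu-pivoine "cd /workspace && VLLM_HOST=0.0.0.0 VLLM_PORT=8000 nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"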

Model Configuration

  • Model: Qwen/Qwen2.5-7B-Instruct (no auth required)
  • Context Length: 4096 tokens
  • GPU Memory: 85% utilization
  • Tensor Parallel: 1 (single GPU)

Known Issues & Solutions

Issue 1: vLLM Multiprocessing Errors

Problem: The default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.

Solution: The custom AsyncLLMEngine FastAPI server bypasses the multiprocessing layer entirely.

Issue 2: Disk Space (Solved)

Problem: The original 20GB disk filled up with the Hugging Face cache.

Solution: Expanded the disk to 50GB and moved the model cache to /workspace.

Issue 3: Gated Models

Problem: Llama models require Hugging Face authentication.

Solution: Use Qwen 2.5 7B Instruct (no auth required), or set the HF_TOKEN environment variable as in the example below.
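
A minimal sketch for a gated model, with the token value as a placeholder (HF_TOKEN is read by the Hugging Face libraries at download time):

# Start the server with a Hugging Face token exported (hf_xxxxxxxx is a placeholder)
ssh gpu-pivoine "cd /workspace && HF_TOKEN=hf_xxxxxxxx nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"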

Issue 4: Spot Instance Volatility

Problem: RunPod Spot instances can be terminated at any time.

Solution: Accepted as a trade-off for the cost savings; SSH details are documented above for quick reconnection.

Monitoring

Check vLLM logs:

ssh gpu-pivoine "tail -f /workspace/vllm.log"

Check GPU usage:

ssh gpu-pivoine "nvidia-smi"

Check Tailscale status:

ssh gpu-pivoine "tailscale status"

Test API locally (on GPU server):

ssh gpu-pivoine "curl http://localhost:8000/v1/models"

Test API via Tailscale (from VPS):

curl http://100.100.108.13:8000/v1/models
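
A chat-completion smoke test from the VPS over the same route (message content is arbitrary; the model name is assumed to match the server default):

# Chat completion via Tailscale from the VPS
curl -s http://100.100.108.13:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'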

LiteLLM Integration

Update VPS LiteLLM config at ai/litellm-config-gpu.yaml:

# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000

Restart LiteLLM:

arty restart litellm
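
To verify that LiteLLM picked up the new backend, list its models from the VPS; a sketch assuming LiteLLM listens on its default port 4000 and a master key is configured:

# qwen-2.5-7b should appear in the model list
curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY"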

Troubleshooting

Server not responding:

  1. Check if process is running: pgrep -f simple_vllm_server
  2. Check logs: tail -100 /workspace/vllm.log
  3. Check GPU availability: nvidia-smi
  4. Restart server: pkill -f simple_vllm_server && cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &

Tailscale not connected:

  1. Check status: tailscale status
  2. Check daemon: ps aux | grep tailscaled
  3. Restart: tailscale down && tailscale up

Model download failing:

  1. Check disk space: df -h
  2. Check cache directory: ls -lah /workspace/huggingface_cache
  3. Clear cache if needed: rm -rf /workspace/huggingface_cache/*

Deployment Status: COMPLETE

Deployment Date: 2025-11-21

  1. Deploy vLLM with Qwen 2.5 7B - COMPLETE
  2. Test API endpoints locally and via Tailscale - COMPLETE
  3. Update VPS LiteLLM configuration - COMPLETE
  4. Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
  5. Monitor performance and costs - ONGOING

Model Available: qwen-2.5-7b visible in Open WebUI at https://ai.pivoine.art

Next Steps

  1. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
  2. ⏹️ Set up auto-stop for idle periods to save costs

Cost Optimization Ideas

  1. Auto-stop: Configure RunPod to auto-stop after 30 minutes idle (a watchdog sketch follows this list)
  2. Spot Instances: Already using Spot for 50% cost reduction
  3. Scheduled Operation: Run only during business hours (8 hours/day = $120/month)
  4. Smaller Models: Use Mistral 7B or quantized models for lighter workloads
  5. Pay-as-you-go: Manually start/stop pod as needed
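
Following idea 1, a minimal idle-watchdog sketch that could run on the GPU server; the pod ID is a placeholder, the thresholds are arbitrary, and runpodctl is assumed to be installed and authenticated with a RunPod API key:

# Stop the pod after ~30 minutes of 0% GPU utilization
POD_ID="your-pod-id"        # placeholder
CHECK_INTERVAL=60           # seconds between checks
IDLE_LIMIT=30               # consecutive idle checks before stopping (30 x 60s = 30 min)
idle_count=0

while true; do
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  if [ "$util" -eq 0 ]; then
    idle_count=$((idle_count + 1))
  else
    idle_count=0
  fi
  if [ "$idle_count" -ge "$IDLE_LIMIT" ]; then
    runpodctl stop pod "$POD_ID"
    break
  fi
  sleep "$CHECK_INTERVAL"
done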

Performance Benchmarks

To be measured after deployment; a rough measurement sketch follows the expected figures below.

Expected (based on RTX 4090):

  • Qwen 2.5 7B: 50-80 tokens/second
  • Context processing: ~2-3 seconds for 1000 tokens
  • First token latency: ~200-300ms
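
A rough way to measure throughput once the server is up, assuming the custom server mirrors the OpenAI response format (including a usage block) and that jq and bc are available on the VPS:

# Request 256 tokens from the VPS and divide completion tokens by wall-clock time
start=$(date +%s.%N)
resp=$(curl -s http://100.100.108.13:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Write a short story about a lighthouse.", "max_tokens": 256}')
end=$(date +%s.%N)
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "tokens/sec: $(echo "$tokens / ($end - $start)" | bc -l)"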