feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model

This commit finalizes the GPU infrastructure deployment on RunPod: - Added qwen-2.5-7b model to LiteLLM configuration - Self-hosted on RunPod RTX 4090 GPU server - Connected via Tailscale VPN (100.100.108.13:8000) - OpenAI-compatible API endpoint - Rate limits: 1000 RPM, 100k TPM - Marked GPU deployment as COMPLETE in deployment log - vLLM 0.6.4.post1 with custom AsyncLLMEngine server - Qwen/Qwen2.5-7B-Instruct model (14.25 GB) - 85% GPU memory utilization, 4096 context length - Successfully integrated with Open WebUI at ai.pivoine.art Infrastructure: - Provider: RunPod Spot Instance (~$0.50/hr) - GPU: NVIDIA RTX 4090 24GB - Disk: 50GB local SSD + 922TB network volume - VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions) Model now visible and accessible in Open WebUI for end users. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 13:18:17 +01:00
parent 8de88d96ac
commit bb3dabcba7
2 changed files with 21 additions and 5 deletions
--- a/ai/litellm-config.yaml
+++ b/ai/litellm-config.yaml
@@ -24,6 +24,15 @@ model_list:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

+  # Self-hosted model on GPU server via Tailscale VPN
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: openai/qwen-2.5-7b
+      api_base: http://100.100.108.13:8000/v1
+      api_key: dummy
+      rpm: 1000
+      tpm: 100000
+
 litellm_settings:
  drop_params: true
  set_verbose: false  # Disable verbose logging for better performance