From bb3dabcba73504c054222beaae5190237d34a59a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?=
Date: Fri, 21 Nov 2025 13:18:17 +0100
Subject: [PATCH] feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit finalizes the GPU infrastructure deployment on RunPod:

- Added qwen-2.5-7b model to LiteLLM configuration
  - Self-hosted on RunPod RTX 4090 GPU server
  - Connected via Tailscale VPN (100.100.108.13:8000)
  - OpenAI-compatible API endpoint
  - Rate limits: 1000 RPM, 100k TPM
- Marked GPU deployment as COMPLETE in deployment log
  - vLLM 0.6.4.post1 with custom AsyncLLMEngine server
  - Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
  - 85% GPU memory utilization, 4096 context length
  - Successfully integrated with Open WebUI at ai.pivoine.art

Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922TB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)

Model now visible and accessible in Open WebUI for end users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 ai/GPU_DEPLOYMENT_LOG.md | 17 ++++++++++++-----
 ai/litellm-config.yaml   |  9 +++++++++
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/ai/GPU_DEPLOYMENT_LOG.md b/ai/GPU_DEPLOYMENT_LOG.md
index 428d4a4..cefd826 100644
--- a/ai/GPU_DEPLOYMENT_LOG.md
+++ b/ai/GPU_DEPLOYMENT_LOG.md
@@ -148,12 +148,19 @@ arty restart litellm
 2. Check cache directory: `ls -lah /workspace/huggingface_cache`
 3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
 
+### Deployment Status ✅ COMPLETE
+
+**Deployment Date**: 2025-11-21
+
+1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
+2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
+3. ✅ Update VPS LiteLLM configuration - COMPLETE
+4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
+5. ⏳ Monitor performance and costs - ONGOING
+
+**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
+
 ### Next Steps
-1. ✅ Deploy vLLM with Qwen 2.5 7B
-2. ⏳ Test API endpoints locally and via Tailscale
-3. ⏳ Update VPS LiteLLM configuration
-4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
-5. ℹ️ Monitor performance and costs
 6. ℹ️ Consider adding more models (Mistral, DeepSeek Coder)
 7. ℹ️ Set up auto-stop for idle periods to save costs
 
diff --git a/ai/litellm-config.yaml b/ai/litellm-config.yaml
index 06b3e3c..ba94370 100644
--- a/ai/litellm-config.yaml
+++ b/ai/litellm-config.yaml
@@ -24,6 +24,15 @@ model_list:
       model: anthropic/claude-3-haiku-20240307
       api_key: os.environ/ANTHROPIC_API_KEY
 
+  # Self-hosted model on GPU server via Tailscale VPN
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: openai/qwen-2.5-7b
+      api_base: http://100.100.108.13:8000/v1
+      api_key: dummy
+      rpm: 1000
+      tpm: 100000
+
 litellm_settings:
   drop_params: true
   set_verbose: false # Disable verbose logging for better performance
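
Editor's note (not part of the patch): the LiteLLM entry above forwards requests to an OpenAI-compatible vLLM endpoint, so the Tailscale route can be smoke-tested directly with curl. The URL, placeholder key, and model name below are taken from `ai/litellm-config.yaml` in this commit; whether the custom server registers the model under exactly `qwen-2.5-7b` is an assumption, so treat this as a sketch.

```bash
# Sketch of a smoke test against the self-hosted vLLM endpoint over Tailscale.
# api_base and api_key come from litellm-config.yaml; adjust the model name if
# the custom AsyncLLMEngine server exposes it under a different alias.

# 1. List the models the server actually exposes.
curl -s http://100.100.108.13:8000/v1/models \
  -H "Authorization: Bearer dummy"

# 2. Request a short chat completion from qwen-2.5-7b.
curl -s http://100.100.108.13:8000/v1/chat/completions \
  -H "Authorization: Bearer dummy" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen-2.5-7b",
        "messages": [{"role": "user", "content": "Reply with one word: pong"}],
        "max_tokens": 16
      }'
```

End users reach the model through Open WebUI → LiteLLM rather than hitting the GPU box directly; the direct calls above are only useful for verifying that the Tailscale route and the vLLM server are up.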
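For reference, a stock vLLM 0.6.x launch matching the parameters named in this commit (Qwen/Qwen2.5-7B-Instruct, 85% GPU memory utilization, 4096 context length, cache on `/workspace/huggingface_cache`) might look like the sketch below. This is illustrative only: the deployment log states the actual server is a custom AsyncLLMEngine wrapper, so its startup command will differ, and the `--served-model-name` alias is an assumption chosen to match what LiteLLM sends.

```bash
# Illustrative only: stock OpenAI-compatible vLLM server with the parameters
# listed in this commit; the real deployment uses a custom AsyncLLMEngine server.
HF_HOME=/workspace/huggingface_cache \
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen-2.5-7b \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8000
```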