From bb3dabcba73504c054222beaae5190237d34a59a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?=
Date: Fri, 21 Nov 2025 13:18:17 +0100
Subject: [PATCH] feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit finalizes the GPU infrastructure deployment on RunPod:

- Added qwen-2.5-7b model to LiteLLM configuration
  - Self-hosted on RunPod RTX 4090 GPU server
  - Connected via Tailscale VPN (100.100.108.13:8000)
  - OpenAI-compatible API endpoint
  - Rate limits: 1000 RPM, 100k TPM
- Marked GPU deployment as COMPLETE in deployment log
  - vLLM 0.6.4.post1 with custom AsyncLLMEngine server
  - Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
  - 85% GPU memory utilization, 4096 context length
  - Successfully integrated with Open WebUI at ai.pivoine.art

Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922TB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)

Model now visible and accessible in Open WebUI for end users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 ai/GPU_DEPLOYMENT_LOG.md | 17 ++++++++++++-----
 ai/litellm-config.yaml   |  9 +++++++++
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/ai/GPU_DEPLOYMENT_LOG.md b/ai/GPU_DEPLOYMENT_LOG.md
index 428d4a4..cefd826 100644
--- a/ai/GPU_DEPLOYMENT_LOG.md
+++ b/ai/GPU_DEPLOYMENT_LOG.md
@@ -148,12 +148,19 @@ arty restart litellm
 2. Check cache directory: `ls -lah /workspace/huggingface_cache`
 3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
 
+### Deployment Status ✅ COMPLETE
+
+**Deployment Date**: 2025-11-21
+
+1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
+2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
+3. ✅ Update VPS LiteLLM configuration - COMPLETE
+4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
+5. ⏳ Monitor performance and costs - ONGOING
+
+**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
+
 ### Next Steps
-1. ✅ Deploy vLLM with Qwen 2.5 7B
-2. ⏳ Test API endpoints locally and via Tailscale
-3. ⏳ Update VPS LiteLLM configuration
-4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
-5. ℹ️ Monitor performance and costs
 6. ℹ️ Consider adding more models (Mistral, DeepSeek Coder)
 7. ℹ️ Set up auto-stop for idle periods to save costs
 
diff --git a/ai/litellm-config.yaml b/ai/litellm-config.yaml
index 06b3e3c..ba94370 100644
--- a/ai/litellm-config.yaml
+++ b/ai/litellm-config.yaml
@@ -24,6 +24,15 @@ model_list:
       model: anthropic/claude-3-haiku-20240307
       api_key: os.environ/ANTHROPIC_API_KEY
 
+  # Self-hosted model on GPU server via Tailscale VPN
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: openai/qwen-2.5-7b
+      api_base: http://100.100.108.13:8000/v1
+      api_key: dummy
+      rpm: 1000
+      tpm: 100000
+
 litellm_settings:
   drop_params: true
   set_verbose: false # Disable verbose logging for better performance
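
Editor's note (not part of the patch): the LiteLLM entry above forwards requests to an OpenAI-compatible vLLM endpoint, so the Tailscale route can be smoke-tested directly with curl. The URL, placeholder key, and model name below are taken from `ai/litellm-config.yaml` in this commit; whether the custom server registers the model under exactly `qwen-2.5-7b` is an assumption, so treat this as a sketch.

```bash
# Sketch of a smoke test against the self-hosted vLLM endpoint over Tailscale.
# api_base and api_key come from litellm-config.yaml; adjust the model name if
# the custom AsyncLLMEngine server exposes it under a different alias.

# 1. List the models the server actually exposes.
curl -s http://100.100.108.13:8000/v1/models \
  -H "Authorization: Bearer dummy"

# 2. Request a short chat completion from qwen-2.5-7b.
curl -s http://100.100.108.13:8000/v1/chat/completions \
  -H "Authorization: Bearer dummy" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen-2.5-7b",
        "messages": [{"role": "user", "content": "Reply with one word: pong"}],
        "max_tokens": 16
      }'
```

End users reach the model through Open WebUI → LiteLLM rather than hitting the GPU box directly; the direct calls above are only useful for verifying that the Tailscale route and the vLLM server are up.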
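For reference, a stock vLLM 0.6.x launch matching the parameters named in this commit (Qwen/Qwen2.5-7B-Instruct, 85% GPU memory utilization, 4096 context length, cache on `/workspace/huggingface_cache`) might look like the sketch below. This is illustrative only: the deployment log states the actual server is a custom AsyncLLMEngine wrapper, so its startup command will differ, and the `--served-model-name` alias is an assumption chosen to match what LiteLLM sends.

```bash
# Illustrative only: stock OpenAI-compatible vLLM server with the parameters
# listed in this commit; the real deployment uses a custom AsyncLLMEngine server.
HF_HOME=/workspace/huggingface_cache \
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen-2.5-7b \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8000
```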