feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model
This commit finalizes the GPU infrastructure deployment on RunPod: - Added qwen-2.5-7b model to LiteLLM configuration - Self-hosted on RunPod RTX 4090 GPU server - Connected via Tailscale VPN (100.100.108.13:8000) - OpenAI-compatible API endpoint - Rate limits: 1000 RPM, 100k TPM - Marked GPU deployment as COMPLETE in deployment log - vLLM 0.6.4.post1 with custom AsyncLLMEngine server - Qwen/Qwen2.5-7B-Instruct model (14.25 GB) - 85% GPU memory utilization, 4096 context length - Successfully integrated with Open WebUI at ai.pivoine.art Infrastructure: - Provider: RunPod Spot Instance (~$0.50/hr) - GPU: NVIDIA RTX 4090 24GB - Disk: 50GB local SSD + 922TB network volume - VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions) Model now visible and accessible in Open WebUI for end users. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -148,12 +148,19 @@ arty restart litellm
|
||||
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
|
||||
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
|
||||
|
||||
### Deployment Status ✅ COMPLETE
|
||||
|
||||
**Deployment Date**: 2025-11-21
|
||||
|
||||
1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
|
||||
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
|
||||
3. ✅ Update VPS LiteLLM configuration - COMPLETE
|
||||
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
|
||||
5. ⏳ Monitor performance and costs - ONGOING
|
||||
|
||||
**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
|
||||
|
||||
### Next Steps
|
||||
1. ✅ Deploy vLLM with Qwen 2.5 7B
|
||||
2. ⏳ Test API endpoints locally and via Tailscale
|
||||
3. ⏳ Update VPS LiteLLM configuration
|
||||
4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
|
||||
5. ⏹️ Monitor performance and costs
|
||||
6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
|
||||
7. ⏹️ Set up auto-stop for idle periods to save costs
|
||||
|
||||
|
||||
@@ -24,6 +24,15 @@ model_list:
|
||||
model: anthropic/claude-3-haiku-20240307
|
||||
api_key: os.environ/ANTHROPIC_API_KEY
|
||||
|
||||
# Self-hosted model on GPU server via Tailscale VPN
|
||||
- model_name: qwen-2.5-7b
|
||||
litellm_params:
|
||||
model: openai/qwen-2.5-7b
|
||||
api_base: http://100.100.108.13:8000/v1
|
||||
api_key: dummy
|
||||
rpm: 1000
|
||||
tpm: 100000
|
||||
|
||||
litellm_settings:
|
||||
drop_params: true
|
||||
set_verbose: false # Disable verbose logging for better performance
|
||||
|
||||
Reference in New Issue
Block a user