diff --git a/ai/GPU_DEPLOYMENT_LOG.md b/ai/GPU_DEPLOYMENT_LOG.md
index 428d4a4..cefd826 100644
--- a/ai/GPU_DEPLOYMENT_LOG.md
+++ b/ai/GPU_DEPLOYMENT_LOG.md
@@ -148,12 +148,19 @@ arty restart litellm
 2. Check cache directory: `ls -lah /workspace/huggingface_cache`
 3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
 
+### Deployment Status ✅ COMPLETE
+
+**Deployment Date**: 2025-11-21
+
+1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
+2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
+3. ✅ Update VPS LiteLLM configuration - COMPLETE
+4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
+5. ⏳ Monitor performance and costs - ONGOING
+
+**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
+
 ### Next Steps
 
-1. ✅ Deploy vLLM with Qwen 2.5 7B
-2. ⏳ Test API endpoints locally and via Tailscale
-3. ⏳ Update VPS LiteLLM configuration
-4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
-5. ⏹️ Monitor performance and costs
 6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
 7. ⏹️ Set up auto-stop for idle periods to save costs

diff --git a/ai/litellm-config.yaml b/ai/litellm-config.yaml
index 06b3e3c..ba94370 100644
--- a/ai/litellm-config.yaml
+++ b/ai/litellm-config.yaml
@@ -24,6 +24,15 @@ model_list:
       model: anthropic/claude-3-haiku-20240307
       api_key: os.environ/ANTHROPIC_API_KEY
 
+  # Self-hosted model on GPU server via Tailscale VPN
+  - model_name: qwen-2.5-7b
+    litellm_params:
+      model: openai/qwen-2.5-7b
+      api_base: http://100.100.108.13:8000/v1
+      api_key: dummy
+      rpm: 1000
+      tpm: 100000
+
 litellm_settings:
   drop_params: true
   set_verbose: false # Disable verbose logging for better performance
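
The new `qwen-2.5-7b` entry points LiteLLM's OpenAI-compatible provider at the vLLM server over the Tailscale address from the config. As a quick sanity check of that route, the sketch below queries the vLLM endpoint directly with the official `openai` Python client; the prompt, the client usage, and the listing of served models are illustrative additions not taken from the log, and `"dummy"` simply mirrors the placeholder key in the config (vLLM accepts any key when none is enforced).

```python
# Minimal smoke test (assumption: the vLLM server from litellm-config.yaml is
# reachable at http://100.100.108.13:8000/v1 over Tailscale and serves the
# model name "qwen-2.5-7b").
from openai import OpenAI

# Point the OpenAI client straight at vLLM's OpenAI-compatible API.
vllm = OpenAI(base_url="http://100.100.108.13:8000/v1", api_key="dummy")

# List the models the server actually exposes; the name must match the
# "openai/<name>" suffix used in the LiteLLM config.
for model in vllm.models.list():
    print(model.id)

# Send one short chat completion to confirm generation works end to end.
resp = vllm.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```

If this succeeds locally but the model is not visible in Open WebUI, the remaining hop to check is the VPS LiteLLM proxy (restart it after the config change, e.g. `arty restart litellm` as noted in the log).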