This commit finalizes the GPU infrastructure deployment on RunPod:
- Added qwen-2.5-7b model to LiteLLM configuration
- Self-hosted on RunPod RTX 4090 GPU server
- Connected via Tailscale VPN (100.100.108.13:8000)
- OpenAI-compatible API endpoint
- Rate limits: 1000 RPM, 100k TPM
- Marked GPU deployment as COMPLETE in deployment log
- vLLM 0.6.4.post1 with custom AsyncLLMEngine server
- Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
- 85% GPU memory utilization, 4096 context length
- Successfully integrated with Open WebUI at ai.pivoine.art
Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922GB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
The model is now visible and accessible in Open WebUI for end users; the proxy entry looks roughly like the sketch below.
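
A minimal sketch of the LiteLLM model_list entry for this deployment (key names follow LiteLLM's config schema; the openai/ model prefix and the dummy api_key are assumptions):

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/Qwen/Qwen2.5-7B-Instruct    # vLLM exposes an OpenAI-compatible API
      api_base: http://100.100.108.13:8000/v1   # RunPod GPU server over Tailscale
      api_key: dummy                            # assumed: the vLLM endpoint requires no real key
      rpm: 1000
      tpm: 100000
```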
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Removed the supports_prompt_caching parameter that was causing 400 errors.
Prompt caching is enabled automatically by Anthropic when the client
sends cache_control blocks in its messages - no proxy config is needed.
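
For illustration, a hedged sketch of a request body that opts into caching (the system prompt text is a placeholder):

```yaml
messages:
  - role: system
    content:
      - type: text
        text: "Long, reusable system prompt..."   # placeholder
        cache_control:
          type: ephemeral    # Anthropic caches this block automatically
  - role: user
    content: "Hello"
```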
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added supports_prompt_caching: true to all Claude models:
- claude-sonnet-4
- claude-sonnet-4.5
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku
This enables Anthropic's prompt caching feature across all models,
significantly reducing latency and costs for repeated requests
with the same system prompts.
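
Each entry follows the same pattern; one model as a sketch (the flag sits under model_info in LiteLLM's config schema):

```yaml
- model_name: claude-sonnet-4.5
  litellm_params:
    model: anthropic/claude-sonnet-4-5-20250929
    api_key: os.environ/ANTHROPIC_API_KEY
  model_info:
    supports_prompt_caching: true
```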
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Configure LiteLLM to use the existing Redis from the core stack for caching:
- Enabled cache with Redis backend
- Set TTL to 1 hour for cached responses
- Uses core_redis container on default port
This improves performance by serving repeated identical requests from Redis instead of re-calling the upstream API.
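
A sketch of the cache block, assuming LiteLLM's documented cache_params schema and the core_redis hostname from the core stack:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: core_redis   # existing Redis container from the core stack
    port: 6379         # default Redis port
    ttl: 3600          # cached responses expire after 1 hour
```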
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Disabled the cache setting, which requires a working Redis configuration.
Prompt caching at the Anthropic API level remains enabled via the
supports_prompt_caching setting.
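
The change amounts to a one-line toggle (sketch):

```yaml
litellm_settings:
  cache: false   # proxy-level Redis cache off; Anthropic-side prompt caching is unaffected
```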
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Reduce database logging overhead and enable prompt caching:
- Disabled verbose logging (set_verbose: false)
- Disabled spend tracking logs to reduce DB writes
- Disabled tag tracking and daily spend logs
- Removed success/failure callbacks
- Enabled prompt caching for claude-sonnet-4.5
- Set log level to ERROR only
- Removed --detailed_debug flag from command
This should significantly improve response times by eliminating
unnecessary database writes for every request.
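
A partial sketch of the settings involved; exact flag names vary by LiteLLM version, and the ERROR log level is typically set via the LITELLM_LOG environment variable rather than in this file:

```yaml
litellm_settings:
  set_verbose: false
  # success_callback / failure_callback lists removed entirely
general_settings:
  disable_spend_logs: true   # skip the per-request spend row written to the DB
```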
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
According to the LiteLLM docs, drop_params only drops OpenAI parameters.
Since prompt_cache_key is an Anthropic-specific parameter, we need
to use additional_drop_params to explicitly drop it.
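
A sketch of the resulting model entry (model ID as configured elsewhere in this stack):

```yaml
- model_name: claude-sonnet-4.5
  litellm_params:
    model: anthropic/claude-sonnet-4-5-20250929
    additional_drop_params: ["prompt_cache_key"]   # stripped before the request reaches Anthropic
```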
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Explicitly set drop_params and supports_prompt_caching=false for
claude-sonnet-4.5 model to prevent prompt_cache_key parameter from
being sent to Anthropic API.
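
Roughly, the per-model overrides look like this (drop_params lives under litellm_params, supports_prompt_caching under model_info):

```yaml
- model_name: claude-sonnet-4.5
  litellm_params:
    model: anthropic/claude-sonnet-4-5-20250929
    drop_params: true
  model_info:
    supports_prompt_caching: false
```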
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add router_settings and default_litellm_params to ensure unsupported
parameters like prompt_cache_key are properly dropped when using the
Codex CLI with the LiteLLM proxy.
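
A sketch, under the assumption that default_litellm_params nests under router_settings in the deployed LiteLLM version:

```yaml
router_settings:
  default_litellm_params:
    drop_params: true   # drop unsupported params on every routed request
```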
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added disable_responses_id_security setting to allow Codex CLI to access
the /responses endpoint without 401 errors. This removes the encryption
requirement on response IDs while maintaining API key authentication.
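
Sketch only; the placement under general_settings is an assumption:

```yaml
general_settings:
  disable_responses_id_security: true   # assumed placement; lets Codex CLI hit /responses without 401s
```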
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changed API key reference from ${ANTHROPIC_API_KEY} to
os.environ/ANTHROPIC_API_KEY to match LiteLLM's documented syntax.
The os.environ/ prefix tells LiteLLM to use os.getenv() to retrieve
the environment variable at runtime, which is the correct way to
reference environment variables in LiteLLM config files.
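
The corrected entry, per LiteLLM's documented syntax:

```yaml
model_list:
  - model_name: claude-sonnet-4.5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY   # resolved with os.getenv() at runtime
```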
Reference: https://docs.litellm.ai/docs/proxy/deploy
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added LiteLLM as an OpenAI-compatible proxy for Anthropic's API to
enable Claude models in Open WebUI.
**New Service: litellm**
- Image: ghcr.io/berriai/litellm:main-latest
- Internal proxy on port 4000
- Converts Anthropic API to OpenAI-compatible format
- Health check with 30s intervals
- Not exposed via Traefik (internal only)
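
A hedged sketch of the compose service (mount path and health-check endpoint are assumptions):

```yaml
litellm:
  image: ghcr.io/berriai/litellm:main-latest
  command: ["--config", "/app/config.yaml", "--port", "4000"]
  volumes:
    - ./litellm-config.yaml:/app/config.yaml:ro   # assumed mount path
  environment:
    - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]  # assumed endpoint
    interval: 30s
  # no Traefik labels: reachable only on the internal compose network
```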
**LiteLLM Configuration (litellm-config.yaml)**
- Claude Sonnet 4 (claude-sonnet-4-20250514)
- Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)
- Claude 3 Opus (claude-3-opus-20240229)
- Claude 3 Haiku (claude-3-haiku-20240307)
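
Each model maps an alias to an Anthropic model ID; one entry as a sketch, with the rest following the same shape (shown with the os.environ/ key syntax that another commit in this log settles on):

```yaml
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  # ...repeat for claude-sonnet-4.5, claude-3-5-sonnet, claude-3-opus, claude-3-haiku
```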
**Open WebUI Configuration Updates**
- Changed OPENAI_API_BASE_URLS to point to the LiteLLM proxy
- URL: http://litellm:4000/v1
- Added litellm as dependency for webui service
- Dummy API key for proxy authentication
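
A sketch of the webui service changes (variable names per Open WebUI's docs; the dummy key value is a placeholder):

```yaml
webui:
  environment:
    - OPENAI_API_BASE_URLS=http://litellm:4000/v1
    - OPENAI_API_KEYS=sk-dummy   # placeholder; the proxy accepts any key here
  depends_on:
    - litellm
```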
**Why LiteLLM?**
Anthropic's API uses a different endpoint structure and different
authentication headers than OpenAI's. LiteLLM acts as a translation
layer, allowing Open WebUI to use Claude models through its
OpenAI-compatible interface.
**Available Models in Open WebUI**
- claude-sonnet-4 (latest Claude Sonnet 4)
- claude-sonnet-4.5 (Claude Sonnet 4.5)
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>