- Replace GPU_TAILSCALE_IP interpolation with GPU_VLLM_API_URL
- LiteLLM requires the full URL in api_base when using the os.environ/ syntax
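A minimal sketch of the resulting entry, assuming GPU_VLLM_API_URL holds the complete base URL (e.g. http://<tailscale-ip>:9000) and that the model name shown is the one in question:

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: os.environ/GPU_VLLM_API_URL   # os.environ/ must reference a full URL, not an IP fragment
      api_key: dummy                          # the self-hosted endpoint does not check keys
```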
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace COMFYUI_BACKEND_HOST and SUPERVISOR_BACKEND_HOST with GPU_TAILSCALE_IP
- Update LiteLLM config to use os.environ/GPU_TAILSCALE_IP for vLLM models
- Add GPU_TAILSCALE_IP env var to LiteLLM service
- Configure qwen-2.5-7b and llama-3.1-8b to route through orchestrator
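A hedged sketch of the env wiring on the LiteLLM service (service name and surrounding compose structure are assumptions):

```yaml
services:
  litellm:
    environment:
      - GPU_TAILSCALE_IP=${GPU_TAILSCALE_IP}   # Tailscale address of the RunPod GPU host
```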
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Set drop_params: false in litellm_settings
- Set modify_params: false in litellm_settings
- Set drop_params: false in default_litellm_params
- Commented out LITELLM_DROP_PARAMS env var
- Removed --drop_params command flag
These settings were stripping critical streaming parameters, causing vLLM streaming responses to collapse into empty deltas.
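A minimal sketch of the resulting settings, assuming both blocks sit at the top level of litellm-config.yaml:

```yaml
litellm_settings:
  drop_params: false     # stop silently stripping request parameters
  modify_params: false   # pass client payloads through unchanged

default_litellm_params:
  drop_params: false
```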
- Reverted WebUI's direct connection to the orchestrator
- Added stream: true parameter to the qwen-2.5-7b model config
- Keep LiteLLM as the single proxy for all models
- Changed back from hosted_vllm/qwen-2.5-7b to openai/qwen-2.5-7b
- Removed /v1 suffix from api_base (LiteLLM adds it automatically)
- Added supports_system_messages: false for vLLM compatibility
- Update api_base URLs from 100.100.108.13 to 100.121.199.88 (RunPod Tailscale IP)
- All self-hosted models (qwen-2.5-7b, flux-schnell, musicgen-medium) now route through Tailscale VPN
- Tested and verified connectivity between VPS and RunPod GPU orchestrator
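Put together, the self-hosted text model entry would look roughly like this (a sketch; the orchestrator port and the placement of supports_system_messages are assumptions):

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b                # back to the openai/ provider prefix
      api_base: http://100.121.199.88:9000     # no /v1 suffix - LiteLLM appends it
      api_key: dummy
      stream: true
      supports_system_messages: false          # vLLM compatibility; exact placement may differ
```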
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implemented a cost-optimized AI infrastructure running on a single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.
## New Components
**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models (illustrated after this list)
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations
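The registry schema itself is not shown in this log; the following models.yaml entry is purely hypothetical and only illustrates the idea (all field names are assumptions):

```yaml
# Hypothetical models.yaml entry - field names are illustrative, not the actual schema
models:
  qwen-2.5-7b:
    type: text                      # request type the orchestrator matches on
    service: vllm                   # compose service the orchestrator starts via the Docker SDK
    port: 8001                      # where requests are proxied once the model is up
    endpoint: /v1/chat/completions
    vram_gb: 14
    startup_seconds: 120
```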
**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support
**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux
**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft
## Architecture
```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                        │
                 ┌──────────────────────┼──────────────────────┐
                 ↓                      ↓                      ↓
            vLLM (8001)            Flux (8002)          MusicGen (8003)

                [Only ONE active at a time - sequential loading]
```
## Configuration Files
- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide
## Updated Files
- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture
## Features
✅ Automatic model switching (30-120s latency)
✅ Cost-optimized single GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
✅ Easy model addition via YAML configuration
✅ OpenAI-compatible APIs for all model types
✅ Centralized routing through LiteLLM proxy
✅ GPU memory safety (only one model loaded at a time)
## Usage
Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```
Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"qwen-2.5-7b","messages":[...]}'
# Image
curl http://100.100.108.13:9000/v1/images/generations -H 'Content-Type: application/json' -d '{"model":"flux-schnell","prompt":"..."}'
# Music
curl http://100.100.108.13:9000/v1/audio/generations -H 'Content-Type: application/json' -d '{"model":"musicgen-medium","prompt":"..."}'
```
All models available via Open WebUI at https://ai.pivoine.art
## Adding New Models
1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator
That's it! The orchestrator automatically detects and manages the new model.
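For step 2, a hedged sketch of a new service stub (the image name, port, and GPU reservation are placeholders, not the repo's actual values):

```yaml
# docker-compose.gpu.yaml - hypothetical stub for a new model service
services:
  my-new-model:
    image: example/my-new-model:latest   # placeholder image
    ports:
      - "8004:8000"                      # next free port after 8001-8003
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: "no"                        # the orchestrator starts/stops it on demand
```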
## Performance
| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |
Model switching overhead: 30-120 seconds
## License Notes
- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit finalizes the GPU infrastructure deployment on RunPod:
- Added qwen-2.5-7b model to LiteLLM configuration
- Self-hosted on RunPod RTX 4090 GPU server
- Connected via Tailscale VPN (100.100.108.13:8000)
- OpenAI-compatible API endpoint
- Rate limits: 1000 RPM, 100k TPM
- Marked GPU deployment as COMPLETE in deployment log
- vLLM 0.6.4.post1 with custom AsyncLLMEngine server
- Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
- 85% GPU memory utilization, 4096 context length
- Successfully integrated with Open WebUI at ai.pivoine.art
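The qwen-2.5-7b entry described above corresponds roughly to the following LiteLLM config (a sketch; rpm/tpm are LiteLLM's per-model rate-limit keys). On the server side, the 85% GPU memory utilization and 4096-token context length map to vLLM's standard --gpu-memory-utilization and --max-model-len options, though the custom AsyncLLMEngine server here may wire them differently.

```yaml
# Sketch only - direct vLLM endpoint over Tailscale, prior to the orchestrator
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.100.108.13:8000   # RunPod host via Tailscale
      api_key: dummy                         # vLLM does not validate the key
      rpm: 1000                              # requests per minute
      tpm: 100000                            # tokens per minute
```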
Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922TB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
Model now visible and accessible in Open WebUI for end users.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Removed supports_prompt_caching parameter that was causing 400 errors.
Prompt caching is automatically enabled by Anthropic when the client
sends cache_control blocks in messages - no config needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added supports_prompt_caching: true to all Claude models:
- claude-sonnet-4
- claude-sonnet-4.5
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku
This enables Anthropic's prompt caching feature across all models,
significantly reducing latency and costs for repeated requests
with the same system prompts.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Configure LiteLLM to use existing Redis from core stack for caching:
- Enabled cache with Redis backend
- Set TTL to 1 hour for cached responses
- Uses core_redis container on default port
This will improve performance by caching API responses.
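A minimal sketch of the cache block, assuming the core Redis container is reachable as core_redis on the default port:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: core_redis   # existing Redis from the core stack
    port: 6379         # default Redis port
    ttl: 3600          # cache responses for 1 hour
```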
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Disabled cache setting that requires Redis configuration.
Prompt caching at the Anthropic API level is still enabled
via supports_prompt_caching setting.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Reduce database logging overhead and enable prompt caching:
- Disabled verbose logging (set_verbose: false)
- Disabled spend tracking logs to reduce DB writes
- Disabled tag tracking and daily spend logs
- Removed success/failure callbacks
- Enabled prompt caching for claude-sonnet-4.5
- Set log level to ERROR only
- Removed --detailed_debug flag from command
This should significantly improve response times by eliminating
unnecessary database writes for every request.
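A partial sketch of the quieter configuration; set_verbose comes straight from this change, while the remaining key names and the env toggle are assumptions about how the other items are spelled:

```yaml
# litellm-config.yaml (partial sketch)
litellm_settings:
  set_verbose: false          # named explicitly in this change
general_settings:
  disable_spend_logs: true    # assumed spelling of the "disable spend tracking logs" toggle

# docker-compose environment (assumed env var for ERROR-only logging):
#   LITELLM_LOG=ERROR
```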
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
According to the LiteLLM docs, drop_params only drops OpenAI parameters. Since prompt_cache_key is an Anthropic-specific parameter, we need to use additional_drop_params to explicitly drop it.
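In config terms this is a per-model additional_drop_params list; a sketch for the affected model (the anthropic/ provider prefix is an assumption about this config):

```yaml
model_list:
  - model_name: claude-sonnet-4.5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY
      additional_drop_params: ["prompt_cache_key"]   # explicitly drop the non-OpenAI param
```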
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Explicitly set drop_params and supports_prompt_caching=false for
claude-sonnet-4.5 model to prevent prompt_cache_key parameter from
being sent to Anthropic API.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add router_settings and default_litellm_params to ensure unsupported parameters like prompt_cache_key are properly dropped when using Codex with the LiteLLM proxy.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added disable_responses_id_security setting to allow Codex CLI to access
the /responses endpoint without 401 errors. This removes the encryption
requirement on response IDs while maintaining API key authentication.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changed API key reference from ${ANTHROPIC_API_KEY} to
os.environ/ANTHROPIC_API_KEY to match LiteLLM's documented syntax.
The os.environ/ prefix tells LiteLLM to use os.getenv() to retrieve
the environment variable at runtime, which is the correct way to
reference environment variables in LiteLLM config files.
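That is, the key reference now looks like this (a single model shown for illustration; the provider prefix is an assumption):

```yaml
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY   # was ${ANTHROPIC_API_KEY}; resolved via os.getenv() at runtime
```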
Reference: https://docs.litellm.ai/docs/proxy/deploy
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added LiteLLM as an OpenAI-compatible proxy for Anthropic's API to
enable Claude models in Open WebUI.
**New Service: litellm**
- Image: ghcr.io/berriai/litellm:main-latest
- Internal proxy on port 4000
- Converts Anthropic API to OpenAI-compatible format
- Health check with 30s intervals
- Not exposed via Traefik (internal only)
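A hedged compose sketch of the service (the healthcheck command, volume path, and startup flags are assumptions, not the repo's actual values):

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:4000/health/liveliness"]
      interval: 30s
    # no Traefik labels: internal only, reachable as http://litellm:4000
```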
**LiteLLM Configuration (litellm-config.yaml)**
- Claude Sonnet 4 (claude-sonnet-4-20250514)
- Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)
- Claude 3 Opus (claude-3-opus-20240229)
- Claude 3 Haiku (claude-3-haiku-20240307)
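A sketch of the corresponding model_list (model IDs as listed above; the anthropic/ provider prefix is an assumption, and the api_key wiring is omitted here):

```yaml
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
  - model_name: claude-sonnet-4.5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
  # claude-3-opus and claude-3-haiku follow the same pattern
```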
**Open WebUI Configuration Updates**
- Changed OPENAI_API_BASE_URLS to point to LiteLLM proxy
- URL: http://litellm:4000/v1
- Added litellm as dependency for webui service
- Dummy API key for proxy authentication
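The Open WebUI side is plain environment configuration; a sketch assuming the standard OPENAI_API_BASE_URLS / OPENAI_API_KEYS variables and a placeholder key:

```yaml
services:
  webui:
    environment:
      - OPENAI_API_BASE_URLS=http://litellm:4000/v1   # route OpenAI-compatible calls to LiteLLM
      - OPENAI_API_KEYS=sk-dummy                      # placeholder - LiteLLM is internal only
    depends_on:
      - litellm
```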
**Why LiteLLM?**
Anthropic's API uses a different endpoint structure and authentication headers compared to OpenAI's. LiteLLM acts as a translation layer,
allowing Open WebUI to use Claude models through its OpenAI-compatible
interface.
**Available Models in Open WebUI**
- claude-sonnet-4 (latest Claude Sonnet 4)
- claude-sonnet-4.5 (Claude Sonnet 4.5)
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>