Multi-modal AI stack (text/image/music generation) has been moved to:
Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git
Updated ai/README.md to document:
- VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL)
- Reference to new runpod repository for GPU infrastructure
- Clear separation between VPS and GPU deployments
- Integration architecture via Tailscale VPN
Implemented a cost-optimized AI infrastructure running on single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.
## New Components
**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations
**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support
**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux
**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft
## Architecture
```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
↓
┌───────────────┼───────────────┐
vLLM (8001) Flux (8002) MusicGen (8003)
[Only ONE active at a time - sequential loading]
```
## Configuration Files
- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide
## Updated Files
- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture
## Features
✅ Automatic model switching (30-120s latency)
✅ Cost-optimized single GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
✅ Easy model addition via YAML configuration
✅ OpenAI-compatible APIs for all model types
✅ Centralized routing through LiteLLM proxy
✅ GPU memory safety (only one model loaded at time)
## Usage
Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```
Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -d '{"model":"qwen-2.5-7b","messages":[...]}'
# Image
curl http://100.100.108.13:9000/v1/images/generations -d '{"model":"flux-schnell","prompt":"..."}'
# Music
curl http://100.100.108.13:9000/v1/audio/generations -d '{"model":"musicgen-medium","prompt":"..."}'
```
All models available via Open WebUI at https://ai.pivoine.art
## Adding New Models
1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator
That's it! The orchestrator automatically detects and manages the new model.
## Performance
| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |
Model switching overhead: 30-120 seconds
## License Notes
- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit finalizes the GPU infrastructure deployment on RunPod:
- Added qwen-2.5-7b model to LiteLLM configuration
- Self-hosted on RunPod RTX 4090 GPU server
- Connected via Tailscale VPN (100.100.108.13:8000)
- OpenAI-compatible API endpoint
- Rate limits: 1000 RPM, 100k TPM
- Marked GPU deployment as COMPLETE in deployment log
- vLLM 0.6.4.post1 with custom AsyncLLMEngine server
- Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
- 85% GPU memory utilization, 4096 context length
- Successfully integrated with Open WebUI at ai.pivoine.art
Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922TB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
Model now visible and accessible in Open WebUI for end users.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Removed supports_prompt_caching parameter that was causing 400 errors.
Prompt caching is automatically enabled by Anthropic when the client
sends cache_control blocks in messages - no config needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added supports_prompt_caching: true to all Claude models:
- claude-sonnet-4
- claude-sonnet-4.5
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku
This enables Anthropic's prompt caching feature across all models,
significantly reducing latency and costs for repeated requests
with the same system prompts.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Configure LiteLLM to use existing Redis from core stack for caching:
- Enabled cache with Redis backend
- Set TTL to 1 hour for cached responses
- Uses core_redis container on default port
This will improve performance by caching API responses.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Disabled cache setting that requires Redis configuration.
Prompt caching at the Anthropic API level is still enabled
via supports_prompt_caching setting.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Reduce database logging overhead and enable prompt caching:
- Disabled verbose logging (set_verbose: false)
- Disabled spend tracking logs to reduce DB writes
- Disabled tag tracking and daily spend logs
- Removed success/failure callbacks
- Enabled prompt caching for claude-sonnet-4.5
- Set log level to ERROR only
- Removed --detailed_debug flag from command
This should significantly improve response times by eliminating
unnecessary database writes for every request.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Replace HTTP Basic Auth with Authelia ForwardAuth for consistent
authentication across infrastructure:
- Asciinema Admin (admin.asciinema.dev.pivoine.art): Removed Basic Auth,
added Authelia protection
- FaceFusion (facefusion.ai.pivoine.art): Removed Basic Auth, added
Authelia protection
Updated Authelia access control to include both services with one_factor
policy.
All services now use Authelia for authentication, eliminating the need
to manage separate Basic Auth credentials.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add Mailpit service to NET stack with web UI at mailpit.pivoine.art
- Configure Mailpit to relay all emails through IONOS SMTP
- Migrate all 11+ services to use Mailpit instead of direct IONOS SMTP:
* SEXY: Directus API
* UTIL: Joplin, Mattermost, Vaultwarden, Tandoor, Linkwarden
* DEV: Gitea, n8n, Asciinema
* AI: Open WebUI
* NET: Netdata (via msmtp)
- Centralize SMTP credentials in mailpit-relay.yaml
- Simplify service configs (no auth/TLS for internal SMTP)
- Enable email monitoring via Mailpit web UI with Basic Auth
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace DISABLE_SWAGGER_UI with NO_DOCS and NO_REDOC
- Following official LiteLLM documentation for disabling API docs
- Disables both Swagger UI and Redoc documentation interfaces
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add --disable_swagger flag to LiteLLM command
- Improves security by hiding API documentation interface
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Watchtower was trying to pull updates from Docker Hub for facefusion-patched:3.5.0-cpu
which only exists locally, causing spam errors. Disabled Watchtower monitoring for this
container since it's a custom-built image with NSFW filter patches.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fixed line number and function names to match actual source
- Added validation to ensure patch was applied
- Updated patch file with correct context
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This patch file documents the exact change made to content_analyser.py
for disabling the NSFW content filter in Facefusion 3.5.0.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Added echo statements to track script execution
- Added -v flag to rm to show deleted files
- Confirmed deletion is working correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Previous approach caused infinite download loop. Now waits for models
to download, then deletes NSFW models once, allowing Gradio to start.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added Facefusion face swapping service to the AI stack:
**Configuration:**
- URL: https://facefusion.ai.pivoine.art
- Image: facefusion/facefusion:3.5.0-cpu
- Port: 7865
- Container: ai_facefusion
- Volume: ai_facefusion_data
- HTTP Basic Auth protection
- CPU execution mode (GPU when available)
**Changes:**
- Added facefusion service to ai/compose.yaml
- Added AI_FACEFUSION_* env vars to arty.yml
- Created ai_facefusion_data volume
- Removed old standalone facefusion stack
- Removed ai/README-export.md and ai/webui-export.py
Facefusion will run on CPU until GPU server is available.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
According to litellm docs, drop_params only drops OpenAI parameters.
Since prompt_cache_key is an Anthropic-specific parameter, we need
to use additional_drop_params to explicitly drop it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Explicitly set drop_params and supports_prompt_caching=false for
claude-sonnet-4.5 model to prevent prompt_cache_key parameter from
being sent to Anthropic API.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add router_settings and default_litellm_params to ensure unsupported
parameters like prompt_cache_key are properly dropped when using codex
with the litellm proxy.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added disable_responses_id_security setting to allow Codex CLI to access
the /responses endpoint without 401 errors. This removes the encryption
requirement on response IDs while maintaining API key authentication.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Created database initialization script following the core stack pattern.
The script automatically creates required databases on first initialization:
- openwebui: Open WebUI application database
- litellm: LiteLLM proxy database for API key management and tracking
Changes:
- Created ai/postgres/init/01-init-databases.sh
- Mounted init directory in ai_postgres service
- Added automatic privilege grants to AI_DB_USER
Note: Init script only runs on first database creation when volume is empty.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
LiteLLM now uses the ai_postgres database instance with a dedicated
'litellm' database for API key management, usage tracking, and rate limiting.
Changes:
- Set DATABASE_URL to postgresql://ai:password@ai_postgres:5432/litellm
- Added depends_on ai_postgres to ensure DB starts first
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>