Commit Graph

90 Commits

66d8c82e47 Remove Flux and MusicGen models from LiteLLM config
ComfyUI now handles Flux image generation directly.
MusicGen is not being used and has been removed.
2025-11-21 21:11:29 +01:00
904f7d3c2e feat(ai): add ComfyUI proxy service with Authelia SSO
- Add ComfyUI service to AI stack using nginx:alpine as reverse proxy
- Proxy to RunPod ComfyUI via Tailscale (100.121.199.88:8188)
- Configure Traefik routing for comfy.ai.pivoine.art
- Enable Authelia SSO middleware (net-authelia)
- Support WebSocket connections for real-time updates
- Set appropriate timeouts for image generation (300s)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 20:56:20 +01:00
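
A minimal sketch of the service the commit above describes, assuming hypothetical names for the service and the mounted nginx config; the WebSocket upgrade headers and the 300s timeout would live in the mounted nginx file:

```yaml
# Hypothetical compose service; only the upstream address, hostname, and
# middleware name are taken from the commit message.
services:
  comfyui:
    image: nginx:alpine
    volumes:
      # nginx config proxying to 100.121.199.88:8188 with
      # proxy_read_timeout 300s and WebSocket Upgrade/Connection headers
      - ./comfyui.conf:/etc/nginx/conf.d/default.conf:ro
    labels:
      - traefik.enable=true
      - traefik.http.routers.comfyui.rule=Host(`comfy.ai.pivoine.art`)
      - traefik.http.routers.comfyui.middlewares=net-authelia@docker
      - traefik.http.services.comfyui.loadbalancer.server.port=80
```
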
9a964cff3c feat: add Flux image generation function for Open WebUI
- Add flux_image_gen.py manifold function for Flux.1 Schnell
- Auto-mount functions via Docker volume (./functions:/app/backend/data/functions:ro)
- Add comprehensive setup guide in FLUX_SETUP.md
- Update CLAUDE.md with Flux integration documentation
- Infrastructure as code approach - no manual import needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 20:20:33 +01:00
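
The volume mount named in the commit above, in compose form (the service name is an assumption):

```yaml
services:
  open-webui:
    volumes:
      # flux_image_gen.py is picked up from here on startup; no manual
      # import through the WebUI admin panel is needed
      - ./functions:/app/backend/data/functions:ro
```
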
0999e5d29f feat: re-enable Redis caching in LiteLLM now that streaming is fixed 2025-11-21 19:40:57 +01:00
ec903c16c2 fix: use hosted_vllm/openai/ prefix for vLLM model via orchestrator 2025-11-21 19:18:33 +01:00
155016da97 debug: enable DEBUG logging for LiteLLM to troubleshoot streaming 2025-11-21 19:10:00 +01:00
c81f312e9e fix: use correct vLLM model ID from /v1/models endpoint 2025-11-21 19:06:56 +01:00
fe0cf487ee fix: use correct vLLM model name with hosted_vllm prefix 2025-11-21 19:02:44 +01:00
81d4058c5d revert: back to openai prefix for vLLM OpenAI-compatible endpoint 2025-11-21 18:57:10 +01:00
4a575bc0da fix: use hosted_vllm prefix instead of openai for vLLM streaming compatibility 2025-11-21 18:54:40 +01:00
01a345979b fix: disable drop_params to preserve streaming metadata in LiteLLM
- Set drop_params: false in litellm_settings
- Set modify_params: false in litellm_settings
- Set drop_params: false in default_litellm_params
- Commented out LITELLM_DROP_PARAMS env var
- Removed --drop_params command flag

These settings were stripping critical streaming parameters, causing
vLLM streaming responses to collapse into empty deltas.
2025-11-21 18:46:33 +01:00
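
The resulting LiteLLM settings, as a sketch assembled from the bullet points in the commit above:

```yaml
litellm_settings:
  drop_params: false     # keep streaming-related params intact
  modify_params: false
default_litellm_params:
  drop_params: false
# LITELLM_DROP_PARAMS env var commented out; --drop_params flag removed
```
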
c58b5d36ba revert: remove direct WebUI connection, focus on fixing LiteLLM streaming
- Reverted direct orchestrator connection to WebUI
- Added stream: true parameter to qwen-2.5-7b model config
- Keep LiteLLM as single proxy for all models
2025-11-21 18:42:46 +01:00
62fcf832da feat: add direct RunPod orchestrator connection to WebUI for streaming bypass
- Configure WebUI with both LiteLLM and direct orchestrator API base URLs
- This bypasses LiteLLM's streaming issues for the qwen-2.5-7b model
- WebUI will now show models from both endpoints
- Allows testing if LiteLLM is the bottleneck for streaming

Related to streaming fix in RunPod models/vllm/server.py
2025-11-21 18:38:31 +01:00
dfde1df72f fix: add /v1 suffix to vLLM api_base for proper endpoint routing 2025-11-21 18:00:53 +01:00
42a68bc0b5 fix: revert to openai prefix, remove /v1 suffix from api_base
- Changed back from hosted_vllm/qwen-2.5-7b to openai/qwen-2.5-7b
- Removed /v1 suffix from api_base (LiteLLM adds it automatically)
- Added supports_system_messages: false for vLLM compatibility
2025-11-21 17:55:10 +01:00
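
A sketch of the model entry this revert produces; the api_base host and port are assumptions based on the Tailscale migration earlier in the log:

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000  # no /v1 suffix; LiteLLM appends it
      supports_system_messages: false       # vLLM compatibility, per the commit
```
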
699c8537b0 fix: use LiteLLM vLLM pass-through for qwen model
- Changed model from openai/qwen-2.5-7b to hosted_vllm/qwen-2.5-7b
- Implements proper vLLM integration per LiteLLM docs
- Fixes streaming response forwarding issue
2025-11-21 17:52:34 +01:00
ed4d537499 Enable verbose logging in LiteLLM for streaming debug 2025-11-21 17:43:34 +01:00
103bbbad51 debug: enable INFO logging in LiteLLM for troubleshooting
Enable detailed logging to debug qwen model requests from WebUI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 17:13:38 +01:00
92a7436716 fix(ai): add 600s timeout for qwen model requests via Tailscale 2025-11-21 17:06:01 +01:00
6aea9d018e feat(ai): disable Ollama API in WebUI, use LiteLLM only 2025-11-21 16:57:20 +01:00
e2e0927291 feat: update LiteLLM to use RunPod GPU via Tailscale
- Update api_base URLs from 100.100.108.13 to 100.121.199.88 (RunPod Tailscale IP)
- All self-hosted models (qwen-2.5-7b, flux-schnell, musicgen-medium) now route through Tailscale VPN
- Tested and verified connectivity between VPS and RunPod GPU orchestrator

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 16:42:27 +01:00
a5ed2be933 docs: remove outdated ai/README.md
Removed outdated AI infrastructure README that referenced GPU services.
VPS AI services (Open WebUI, Crawl4AI, facefusion) are documented in compose.yaml comments.
GPU infrastructure docs are now in dedicated runpod repository.
2025-11-21 14:42:23 +01:00
d5e37dbd3f cleanup: remove GPU/RunPod files from docker-compose repository
Removed GPU orchestration files migrated to dedicated runpod repository:
- Model orchestrator, vLLM, Flux, MusicGen services
- GPU Docker Compose files and configs
- GPU deployment scripts and documentation

Kept VPS AI services and facefusion:
- compose.yaml (VPS AI + facefusion)
- litellm-config.yaml (VPS LiteLLM)
- postgres/ (VPS PostgreSQL init)
- Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion)
- README.md (updated with runpod reference)

GPU infrastructure now maintained at: ssh://git@dev.pivoine.art:2222/valknar/runpod.git
2025-11-21 14:41:10 +01:00
abcebd1d9b docs: migrate multi-modal AI orchestration to dedicated runpod repository
Multi-modal AI stack (text/image/music generation) has been moved to:
Repository: ssh://git@dev.pivoine.art:2222/valknar/runpod.git

Updated ai/README.md to document:
- VPS AI services (Open WebUI, Crawl4AI, AI PostgreSQL)
- Reference to new runpod repository for GPU infrastructure
- Clear separation between VPS and GPU deployments
- Integration architecture via Tailscale VPN
2025-11-21 14:36:36 +01:00
3ed3e68271 feat(ai): add multi-modal orchestration system for text, image, and music generation
Implemented a cost-optimized AI infrastructure running on a single RTX 4090 GPU
with automatic model switching based on request type. This enables text, image,
and music generation on the same hardware via sequential model loading.

## New Components

**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations

**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support

**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux

**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft

## Architecture

```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                       ↓
                       ┌───────────────┼───────────────┐
                  vLLM (8001)    Flux (8002)    MusicGen (8003)
                   [Only ONE active at a time - sequential loading]
```

## Configuration Files

- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide

## Updated Files

- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture

## Features

- Automatic model switching (30-120s latency)
- Cost-optimized single-GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
- Easy model addition via YAML configuration
- OpenAI-compatible APIs for all model types
- Centralized routing through LiteLLM proxy
- GPU memory safety (only one model loaded at a time)

## Usage

Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```

Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -d '{"model":"qwen-2.5-7b","messages":[...]}'

# Image
curl http://100.100.108.13:9000/v1/images/generations -d '{"model":"flux-schnell","prompt":"..."}'

# Music
curl http://100.100.108.13:9000/v1/audio/generations -d '{"model":"musicgen-medium","prompt":"..."}'
```

All models available via Open WebUI at https://ai.pivoine.art

## Adding New Models

1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator

That's it! The orchestrator automatically detects and manages the new model.
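
A hypothetical models.yaml entry to illustrate the workflow; the field names are assumptions about the orchestrator's schema, not a documented format:

```yaml
# models.yaml (sketch) -- registering a hypothetical whisper service
models:
  whisper-large:
    service: whisper            # compose service in docker-compose.gpu.yaml
    port: 8004                  # where the orchestrator proxies requests
    endpoint: /v1/audio/transcriptions
    vram_gb: 10                 # used to enforce one-model-at-a-time loading
```
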

## Performance

| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |

Model switching overhead: 30-120 seconds

## License Notes

- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 14:12:13 +01:00
bb3dabcba7 feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model
This commit finalizes the GPU infrastructure deployment on RunPod:

- Added qwen-2.5-7b model to LiteLLM configuration
  - Self-hosted on RunPod RTX 4090 GPU server
  - Connected via Tailscale VPN (100.100.108.13:8000)
  - OpenAI-compatible API endpoint
  - Rate limits: 1000 RPM, 100k TPM

- Marked GPU deployment as COMPLETE in deployment log
  - vLLM 0.6.4.post1 with custom AsyncLLMEngine server
  - Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
  - 85% GPU memory utilization, 4096 context length
  - Successfully integrated with Open WebUI at ai.pivoine.art

Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922GB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)

Model now visible and accessible in Open WebUI for end users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 13:18:17 +01:00
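
A sketch of the LiteLLM entry the commit above adds, using only values named in the message (the /v1 suffix is an assumption that later commits revisit):

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.100.108.13:8000/v1  # RunPod via Tailscale
      rpm: 1000      # requests per minute
      tpm: 100000    # tokens per minute
```
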
8de88d96ac docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.)
- Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml)
- Add GPU_DEPLOYMENT_LOG.md with current infrastructure details
- Add GPU_EXPANSION_PLAN.md with complete provider comparison
- Add deploy-gpu-stack.sh automation script

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:57:06 +01:00
c0b1308ffe feat(ai): add GPU server deployment with vLLM and Tailscale
- Add simple_vllm_server.py: Custom AsyncLLMEngine FastAPI server
  - Bypasses multiprocessing issues on RunPod
  - OpenAI-compatible API (/v1/models, /v1/completions, /v1/chat/completions)
  - Uses Qwen 2.5 7B Instruct model

- Add comprehensive setup guides:
  - SETUP_GUIDE.md: RunPod account and GPU server setup
  - TAILSCALE_SETUP.md: VPN configuration (replaces WireGuard)
  - DOCKER_GPU_SETUP.md: Docker + NVIDIA Container Toolkit
  - README_GPU_SETUP.md: Main documentation hub

- Add deployment configurations:
  - litellm-config-gpu.yaml: LiteLLM config with GPU endpoints
  - gpu-server-compose.yaml: Docker Compose for GPU services
  - deploy-gpu-stack.sh: Automated deployment script

- Add GPU_DEPLOYMENT_LOG.md: Current deployment documentation
  - Network: Tailscale IP 100.100.108.13
  - Infrastructure: RunPod RTX 4090, 50GB disk
  - Known issues and troubleshooting guide

- Add GPU_EXPANSION_PLAN.md: 70-page comprehensive expansion plan

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:56:57 +01:00
8622f9dfa0 fix: remove drop_params from individual model configs
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 18:53:44 +01:00
0146d1f043 fix: remove invalid supports_prompt_caching parameter
Removed supports_prompt_caching parameter that was causing 400 errors.
Prompt caching is automatically enabled by Anthropic when the client
sends cache_control blocks in messages - no config needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 16:09:17 +01:00
d26310afb7 feat: enable prompt caching for all Claude models
Added supports_prompt_caching: true to all Claude models:
- claude-sonnet-4
- claude-sonnet-4.5
- claude-3-5-sonnet
- claude-3-opus
- claude-3-haiku

This enables Anthropic's prompt caching feature across all models,
significantly reducing latency and costs for repeated requests
with the same system prompts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 16:07:29 +01:00
2014a82efb feat: enable Redis caching for LiteLLM
Configure LiteLLM to use existing Redis from core stack for caching:
- Enabled cache with Redis backend
- Set TTL to 1 hour for cached responses
- Uses core_redis container on default port

This will improve performance by caching API responses.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 16:05:14 +01:00
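
The cache configuration this implies, as a sketch (hostname and TTL from the commit message; key names follow LiteLLM's cache_params convention):

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: core_redis
    port: 6379
    ttl: 3600   # 1 hour
```
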
5cec1415ad fix: disable LiteLLM cache to avoid Redis requirement
Disabled cache setting that requires Redis configuration.
Prompt caching at the Anthropic API level is still enabled
via supports_prompt_caching setting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 16:04:39 +01:00
8a18ae753d perf: optimize LiteLLM for better performance
Reduce database logging overhead and enable prompt caching:

- Disabled verbose logging (set_verbose: false)
- Disabled spend tracking logs to reduce DB writes
- Disabled tag tracking and daily spend logs
- Removed success/failure callbacks
- Enabled prompt caching for claude-sonnet-4.5
- Set log level to ERROR only
- Removed --detailed_debug flag from command

This should significantly improve response times by eliminating
unnecessary database writes for every request.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 16:03:19 +01:00
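
A sketch of the settings described in the commit above; the exact key names should be verified against the deployed LiteLLM version:

```yaml
litellm_settings:
  set_verbose: false
  success_callback: []       # callbacks removed
  failure_callback: []
general_settings:
  disable_spend_logs: true   # skip per-request spend writes to the DB
# plus: log level set to ERROR, --detailed_debug removed from the command
```
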
ffbcecc09d feat: replace Basic Auth with Authelia
Replace HTTP Basic Auth with Authelia ForwardAuth for consistent
authentication across the infrastructure:

- Asciinema Admin (admin.asciinema.dev.pivoine.art): Removed Basic Auth,
  added Authelia protection
- FaceFusion (facefusion.ai.pivoine.art): Removed Basic Auth, added
  Authelia protection

Updated Authelia access control to include both services with one_factor
policy.

All services now use Authelia for authentication, eliminating the need
to manage separate Basic Auth credentials.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-15 21:54:27 +01:00
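
In compose/Authelia terms, the change above looks roughly like this; label and router names are assumptions beyond what the commit states:

```yaml
# Per-service Traefik labels: Basic Auth middleware replaced by Authelia
services:
  facefusion:
    labels:
      - traefik.http.routers.facefusion.rule=Host(`facefusion.ai.pivoine.art`)
      - traefik.http.routers.facefusion.middlewares=net-authelia@docker

# Authelia access control (configuration.yml)
access_control:
  rules:
    - domain: admin.asciinema.dev.pivoine.art
      policy: one_factor
    - domain: facefusion.ai.pivoine.art
      policy: one_factor
```
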
51267cc674 feat: add Mailpit SMTP relay and migrate all services
- Add Mailpit service to NET stack with web UI at mailpit.pivoine.art
- Configure Mailpit to relay all emails through IONOS SMTP
- Migrate all 11+ services to use Mailpit instead of direct IONOS SMTP:
  * SEXY: Directus API
  * UTIL: Joplin, Mattermost, Vaultwarden, Tandoor, Linkwarden
  * DEV: Gitea, n8n, Asciinema
  * AI: Open WebUI
  * NET: Netdata (via msmtp)
- Centralize SMTP credentials in mailpit-relay.yaml
- Simplify service configs (no auth/TLS for internal SMTP)
- Enable email monitoring via Mailpit web UI with Basic Auth

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-15 18:34:38 +01:00
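
A sketch of the centralized relay file named in the commit above; the key names are assumptions modeled on Mailpit's SMTP relay config format, and the IONOS values are placeholders:

```yaml
# mailpit-relay.yaml (sketch)
host: smtp.ionos.de               # placeholder IONOS SMTP host
port: 587
starttls: true
auth: login                       # assumption; Mailpit also supports plain
username: smtp-user@pivoine.art   # placeholder
password: "<ionos-smtp-password>" # placeholder
# Services then point at mailpit:1025 with no auth/TLS, per the commit
```
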
709dcd8882 fix: use correct NO_DOCS and NO_REDOC environment variables
- Replace DISABLE_SWAGGER_UI with NO_DOCS and NO_REDOC
- Following official LiteLLM documentation for disabling API docs
- Disables both Swagger UI and Redoc documentation interfaces

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 02:17:40 +01:00
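
In compose form (the service name is an assumption):

```yaml
services:
  litellm:
    environment:
      NO_DOCS: "True"    # disables Swagger UI
      NO_REDOC: "True"   # disables Redoc
```
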
b66e28d874 fix: use DISABLE_SWAGGER_UI environment variable instead of invalid flag
- Remove invalid --disable_swagger command flag
- Add DISABLE_SWAGGER_UI=true environment variable
- Fixes LiteLLM startup error

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 02:15:31 +01:00
f1ff42f452 feat: disable Swagger UI in LiteLLM proxy
- Add --disable_swagger flag to LiteLLM command
- Improves security by hiding API documentation interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 02:14:43 +01:00
2934caa9ed fix: disable Watchtower for Facefusion custom local image
Watchtower was trying to pull updates from Docker Hub for facefusion-patched:3.5.0-cpu,
which only exists locally, spamming the logs with errors. Disabled Watchtower monitoring
for this container since it's a custom-built image with NSFW filter patches.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 08:30:51 +01:00
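
The standard Watchtower opt-out label, which is what this change amounts to:

```yaml
services:
  facefusion:
    image: facefusion-patched:3.5.0-cpu   # local build, not on Docker Hub
    labels:
      - com.centurylinklabs.watchtower.enable=false
```
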
f71b150263 feat: add tty flag for Gradio to start properly 2025-11-13 06:18:58 +01:00
0b43299ffd fix: update content_analyser hash check in core.py for patched version 2025-11-13 06:16:14 +01:00
95099a443e feat: build custom Facefusion image with NSFW filter patch baked in 2025-11-13 06:05:42 +01:00
8f406f62c1 fix: add command with -u flag to start Facefusion 2025-11-13 06:01:09 +01:00
c2d25dde59 fix: remove entrypoint override to use default Facefusion startup 2025-11-13 05:59:05 +01:00
3c56f05286 fix: add Gradio environment variables and remove conflicting command 2025-11-13 05:52:13 +01:00
65865b7bb8 fix: add listen and port flags to start Gradio server properly 2025-11-13 05:51:24 +01:00
539f689269 fix: use run.py to start Gradio server 2025-11-13 05:50:37 +01:00
025118a25e fix: use simple run command without extra flags 2025-11-13 05:47:32 +01:00
72fd26f8ea fix: use headless-run command to start Gradio server 2025-11-13 05:46:20 +01:00
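
A sketch reconstructing the working service this series of fixes converges on; only tty, the -u flag, and the listen/port idea are named in the commits — the Gradio variables and the command path are assumptions:

```yaml
services:
  facefusion:
    image: facefusion-patched:3.5.0-cpu
    tty: true                             # Gradio needs a TTY to start
    command: python -u facefusion.py run  # -u for unbuffered output; path assumed
    environment:
      GRADIO_SERVER_NAME: "0.0.0.0"       # listen on all interfaces
      GRADIO_SERVER_PORT: "7860"
```
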