# AI Infrastructure
This directory contains AI-related configurations for the VPS deployment.
## Multi-Modal GPU Infrastructure (Migrated)
**The multi-modal AI orchestration stack (text, image, music generation) has been moved to a dedicated repository:**
**Repository**: https://dev.pivoine.art/valknar/runpod
The RunPod repository contains:
- Model orchestrator for automatic switching between text, image, and music models
- vLLM + Qwen 2.5 7B (text generation)
- Flux.1 Schnell (image generation)
- MusicGen Medium (music generation)
- RunPod template creation scripts
- Complete deployment documentation
This separation allows for independent management of:
- **VPS Services** (this repo): Open WebUI, Crawl4AI, AI database
- **GPU Services** (runpod repo): Model inference, orchestration, RunPod templates
## VPS AI Services (ai/compose.yaml)
This compose stack manages the VPS-side AI infrastructure that integrates with the GPU server:
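
The individual services are described below; first, a minimal sketch of bringing the whole stack up from the repository root (the compose file path and project layout are assumptions to adjust):

```bash
# Start the VPS-side AI services (ai_postgres, webui, crawl4ai)
docker compose -f ai/compose.yaml up -d

# Confirm the services are up and healthy
docker compose -f ai/compose.yaml ps
```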
### Services
#### ai_postgres
Dedicated PostgreSQL 16 instance with pgvector extension for AI workloads:
- Vector similarity search support
- Isolated from core database for performance
- Used by Open WebUI for RAG and embeddings
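
A quick way to confirm pgvector is active inside the container — the service, database, and user names below are placeholders, so substitute the values from the compose environment:

```bash
# Check the extension and run a throwaway similarity query (<-> is pgvector's distance operator)
docker compose -f ai/compose.yaml exec ai_postgres \
  psql -U openwebui -d openwebui \
  -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';" \
  -c "SELECT '[1,2,3]'::vector <-> '[2,2,2]'::vector AS euclidean_distance;"
```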
#### webui (Open WebUI)
ChatGPT-like interface exposed at `ai.pivoine.art:8080`:
- Claude API integration via Anthropic
- RAG support with document upload
- Vector storage via pgvector
- Web search capability
- SMTP email via IONOS
- User signup enabled
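
These features are wired up through Open WebUI environment variables in the compose file. A rough sketch of the relevant settings — the variable names follow Open WebUI's documented configuration, but hostnames, credentials, and the exact keys for the deployed version should be verified:

```bash
# Illustrative environment for the webui service (values are placeholders)
DATABASE_URL="postgresql://openwebui:***@ai_postgres:5432/openwebui"   # chat history + RAG storage
VECTOR_DB="pgvector"                                                   # use ai_postgres for embeddings
OPENAI_API_BASE_URL="http://litellm:4000/v1"                           # models served via the LiteLLM proxy
ENABLE_SIGNUP="true"                                                   # user signup enabled
```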
#### crawl4ai
Internal web scraping service for LLM content preparation:
- API on port 11235 (not exposed publicly)
- Optimized for AI/RAG workflows
- Integration with Open WebUI and n8n
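
Because the API is only reachable on the internal Docker network, other services call it by service name. A hedged example — the `/crawl` endpoint and payload shape follow recent Crawl4AI Docker releases and may differ for the pinned version:

```bash
# From another container on the same network (e.g. n8n or webui)
curl -s http://crawl4ai:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}'
```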
## Integration with GPU Server
The VPS AI services connect to the GPU server via Tailscale VPN:
- **VPS Tailscale IP**: 100.102.217.79
- **GPU Tailscale IP**: 100.100.108.13
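
A quick connectivity check from the VPS, assuming the Tailscale client runs on the host and the GPU orchestrator listens on port 9000 (the `/health` route is an assumption about the orchestrator's FastAPI service):

```bash
# Verify the VPN path to the GPU server
tailscale ping 100.100.108.13

# Reach the GPU orchestrator over the tailnet
curl -s http://100.100.108.13:9000/health
```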
**LiteLLM Proxy** (port 4000 on VPS) routes requests:
- Claude API for chat completions
- GPU orchestrator for self-hosted models (text, image, music)
See `../litellm-config.yaml` for routing configuration.
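
As a quick sanity check that the proxy has picked up the routes, you can list the models it exposes. This is a minimal sketch; the port (LiteLLM's default 4000) and the `LITELLM_MASTER_KEY` variable are assumptions and may differ in this deployment:

```bash
# List the models LiteLLM currently routes (assumes the proxy listens on its
# default port 4000 and that a master key is configured; adjust as needed)
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
```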
## Environment Variables
Required in `.env`:
```bash
# AI Database
AI_DB_PASSWORD=<password>
# Open WebUI
AI_WEBUI_SECRET_KEY=<secret>
# Claude API
ANTHROPIC_API_KEY=<api_key>
# Email (IONOS SMTP)
ADMIN_EMAIL=<email>
SMTP_HOST=smtp.ionos.com
SMTP_PORT=587
SMTP_USER=<smtp_user>
SMTP_PASSWORD=<smtp_password>
```
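
The password and secret values should be long random strings. A minimal sketch for generating them with `openssl` (any equivalent random generator works):

```bash
# Generate random values for the secret fields in .env
openssl rand -hex 32   # e.g. for AI_DB_PASSWORD
openssl rand -hex 32   # e.g. for AI_WEBUI_SECRET_KEY
```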
## Backup Configuration
AI services are backed up daily via Restic:
- **ai_postgres_data**: 3 AM (7 daily, 4 weekly, 6 monthly, 2 yearly)
- **ai_webui_data**: 3 AM (same retention)
- **ai_crawl4ai_data**: 3 AM (same retention)
Repository: `/mnt/hidrive/users/valknar/Backup`
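
To verify that backups are actually landing in the repository, you can list recent snapshots with Restic. A minimal sketch, assuming the repository password is available via `RESTIC_PASSWORD` (or a password file) and that the backup jobs tag snapshots with the volume names; adjust the tags to match the actual tagging scheme:

```bash
# List the most recent snapshots for the AI volumes
export RESTIC_REPOSITORY=/mnt/hidrive/users/valknar/Backup
restic snapshots --latest 3 --tag ai_postgres_data
restic snapshots --latest 3 --tag ai_webui_data
```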
## Management Commands
```bash
# Start AI stack
pnpm arty up ai_postgres webui crawl4ai
# View logs
docker logs -f ai_webui
docker logs -f ai_postgres
docker logs -f ai_crawl4ai
# Check Open WebUI
curl http://ai.pivoine.art:8080/health
# Restart AI services
pnpm arty restart ai_postgres webui crawl4ai
```
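
A quick way to confirm that all AI containers are up and healthy is to filter `docker ps` by the stack's name prefix (a sketch; the `ai_` prefix matches the container names used above):

```bash
# Show status and health of the AI containers
docker ps --filter "name=ai_" --format "table {{.Names}}\t{{.Status}}"
```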
## GPU Server Management
For GPU server operations (model orchestration, template creation, etc.):
```bash
# Clone the dedicated repository
git clone ssh://git@dev.pivoine.art:2222/valknar/runpod.git
# See runpod repository for:
# - Model orchestration setup
# - RunPod template creation
# - GPU deployment guides
```
## Documentation
### VPS AI Services
- [GPU_DEPLOYMENT_LOG.md](GPU_DEPLOYMENT_LOG.md) - VPS AI deployment history
### GPU Server (Separate Repository)
- [runpod/README.md](https://dev.pivoine.art/valknar/runpod) - Main GPU documentation
- [runpod/DEPLOYMENT.md](https://dev.pivoine.art/valknar/runpod) - Deployment guide
- [runpod/RUNPOD_TEMPLATE.md](https://dev.pivoine.art/valknar/runpod) - Template creation
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └───────┬───────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────▼─────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Open WebUI │ │ Crawl4AI │ │ AI PostgreSQL │ │
│ │ Port: 8080 │ │ Port: 11235 │ │ + pgvector │ │
│ └─────────────────┘ └──────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ Tailscale VPN
┌──────────────────────────────┼──────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌───────────────────────────▼──────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
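To confirm this path end to end, you can exercise it from the VPS: a request to the LiteLLM proxy on port 4000 should be forwarded over Tailscale to the orchestrator on port 9000, which starts the matching backend. The snippet below is a minimal sketch, not the canonical procedure; the `LITELLM_MASTER_KEY` variable and the exact model alias are assumptions and may differ from what `../litellm-config.yaml` actually defines.

```bash
# Sketch: verify VPS -> GPU routing through LiteLLM.
# Assumes LiteLLM listens on localhost:4000 and exposes a model alias
# "qwen-2.5-7b"; adjust the key and alias to match litellm-config.yaml.

# 1. List the models LiteLLM currently routes.
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

# 2. Send a chat completion through LiteLLM; it should proxy to the
#    orchestrator on 100.100.108.13:9000, which loads vLLM if needed
#    (expect extra latency while a cold model starts).
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "ping"}]}'
```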
## Support
For issues:
- **VPS AI services**: Check logs via `docker logs`
- **GPU server**: See runpod repository documentation
- **LiteLLM routing**: Review `../litellm-config.yaml`
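As a starting point for VPS-side troubleshooting, the commands below are a minimal sketch; the container names (`litellm`, `open-webui`) are assumptions based on the services in the diagram above and may differ from the names in the compose file, so confirm them with `docker ps` first.

```bash
# Sketch: inspect VPS AI service logs. Container names are assumptions;
# check the real names before tailing.
docker ps --format '{{.Names}}\t{{.Status}}' | grep -Ei 'litellm|webui|crawl'

# Tail recent logs for the proxy and the UI.
docker logs --tail 100 litellm
docker logs --tail 100 open-webui

# Follow LiteLLM logs live while reproducing a failing request.
docker logs -f litellm
```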