docker-compose/ai/model-orchestrator/models.yaml
Sebastian Krüger 3ed3e68271 feat(ai): add multi-modal orchestration system for text, image, and music generation
Implemented a cost-optimized AI infrastructure running on a single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.

## New Components

**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management (sketched below)
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations
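
A minimal sketch of that Docker SDK integration, assuming each model's compose service has a matching container name and the orchestrator tracks which model is currently loaded. The names `load_registry` and `switch_model` are illustrative, not the actual implementation:

```python
import time

import docker
import yaml

docker_client = docker.from_env()

def load_registry(path="models.yaml"):
    # models.yaml is the registry shown at the bottom of this page
    with open(path) as f:
        return yaml.safe_load(f)

def switch_model(requested, current, registry):
    """Stop the active model's container, start the requested one, wait for startup."""
    if requested == current:
        return current  # already loaded, nothing to do
    if current is not None:
        docker_client.containers.get(registry["models"][current]["docker_service"]).stop()
    target = registry["models"][requested]
    docker_client.containers.get(target["docker_service"]).start()
    # Crude wait for weights to load; a health check would be better (see config sketch below).
    time.sleep(target["startup_time_seconds"])
    return requested
```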

**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support

**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux

**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft

## Architecture

```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                       ↓
                       ┌───────────────┼───────────────┐
                  vLLM (8001)    Flux (8002)    MusicGen (8003)
                   [Only ONE active at a time - sequential loading]
```
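
The routing step in the diagram boils down to a lookup-and-forward: the orchestrator resolves the requested model to its backend port and endpoint from the registry, makes sure that model is loaded, then proxies the payload. A hedged sketch; the `backend_url` and `proxy` helpers are hypothetical:

```python
import requests

def backend_url(model, registry):
    # Port and endpoint come straight from the models.yaml entry
    entry = registry["models"][model]
    return f"http://localhost:{entry['port']}{entry['endpoint']}"

def proxy(payload, registry):
    # switch_model(...) from the lifecycle sketch would run here before forwarding
    resp = requests.post(backend_url(payload["model"], registry), json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()
```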

## Configuration Files

- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide

## Updated Files

- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture

## Features

- Automatic model switching (30-120s latency)
- Cost-optimized single-GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
- Easy model addition via YAML configuration
- OpenAI-compatible APIs for all model types
- Centralized routing through LiteLLM proxy
- GPU memory safety (only one model loaded at a time; see the sketch below)
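
As an illustration of the memory-safety rule, a guard like the following could sit in front of every load, driven by the `config` block of models.yaml. The function name and exact checks are assumptions, not the shipped code:

```python
def can_load(requested, loaded, registry):
    """Return True if the requested model fits within the single-GPU budget."""
    cfg = registry["config"]
    if not cfg["allow_concurrent_loading"] and loaded:
        return False  # sequential mode: the active model must be unloaded first
    needed = registry["models"][requested]["vram_gb"]
    used = sum(registry["models"][m]["vram_gb"] for m in loaded)
    return used + needed <= cfg["gpu_memory_total_gb"]
```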

## Usage

Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```

Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -d '{"model":"qwen-2.5-7b","messages":[...]}'

# Image
curl http://100.100.108.13:9000/v1/images/generations -d '{"model":"flux-schnell","prompt":"..."}'

# Music
curl http://100.100.108.13:9000/v1/audio/generations -d '{"model":"musicgen-medium","prompt":"..."}'
```
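
The same endpoints can be exercised from Python. Because the APIs are OpenAI-compatible, the official `openai` client can be pointed at the orchestrator for text and images; the music endpoint is not part of the standard client, so it is called with plain HTTP here. The API key value is a placeholder, and the first call to a cold model may block for the 30-120 s switching window:

```python
import requests
from openai import OpenAI

client = OpenAI(base_url="http://100.100.108.13:9000/v1", api_key="placeholder")

# Text (first request may wait out the model switch)
chat = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Image
image = client.images.generate(model="flux-schnell", prompt="a peony in watercolor")

# Music - /v1/audio/generations is not covered by the OpenAI client
audio = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={"model": "musicgen-medium", "prompt": "calm piano with soft strings"},
    timeout=600,
)
```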

All models are available via Open WebUI at https://ai.pivoine.art

## Adding New Models

1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator

That's it! The orchestrator automatically detects and manages the new model.
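
Before restarting the orchestrator, a small pre-flight check can catch typos in the new entry. This sketch only assumes the entry carries the same fields as the registry shown below; the validation script itself is not part of the repository:

```python
import yaml

REQUIRED = {"type", "framework", "docker_service", "port", "vram_gb",
            "startup_time_seconds", "endpoint", "description"}

def validate_registry(path="models.yaml"):
    with open(path) as f:
        registry = yaml.safe_load(f)
    ports = {}
    for name, entry in registry["models"].items():
        missing = REQUIRED - entry.keys()
        assert not missing, f"{name} is missing fields: {missing}"
        assert entry["port"] not in ports, f"{name} reuses port {entry['port']}"
        ports[entry["port"]] = name

validate_registry()
```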

## Performance

| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |

Model switching overhead: 30-120 seconds

## License Notes

- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 14:12:13 +01:00


```yaml
# Model Registry for AI Orchestrator
# Add new models by appending to this file

models:
  # Text Generation Models
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Qwen 2.5 7B Instruct - Fast text generation, no authentication required"

  # Image Generation Models
  flux-schnell:
    type: image
    framework: openedai-images
    docker_service: flux
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
    description: "Flux.1 Schnell - Fast high-quality image generation (4-5 sec/image)"

  # Music Generation Models
  musicgen-medium:
    type: audio
    framework: audiocraft
    docker_service: musicgen
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
    description: "MusicGen Medium - Text-to-music generation (60-90 sec for 30s audio)"

  # Example: Add more models easily by uncommenting and customizing below

  # Future Text Models:
  # llama-3.1-8b:
  #   type: text
  #   framework: vllm
  #   docker_service: vllm-llama
  #   port: 8004
  #   vram_gb: 17
  #   startup_time_seconds: 120
  #   endpoint: /v1/chat/completions
  #   description: "Llama 3.1 8B Instruct - Meta's latest model"

  # Future Image Models:
  # sdxl:
  #   type: image
  #   framework: openedai-images
  #   docker_service: sdxl
  #   port: 8005
  #   vram_gb: 10
  #   startup_time_seconds: 45
  #   endpoint: /v1/images/generations
  #   description: "Stable Diffusion XL - High quality image generation"

  # Future Audio Models:
  # whisper-large:
  #   type: audio
  #   framework: faster-whisper
  #   docker_service: whisper
  #   port: 8006
  #   vram_gb: 3
  #   startup_time_seconds: 30
  #   endpoint: /v1/audio/transcriptions
  #   description: "Whisper Large v3 - Speech-to-text transcription"
  #
  # xtts-v2:
  #   type: audio
  #   framework: openedai-speech
  #   docker_service: tts
  #   port: 8007
  #   vram_gb: 3
  #   startup_time_seconds: 30
  #   endpoint: /v1/audio/speech
  #   description: "XTTS v2 - High-quality text-to-speech with voice cloning"

# Configuration
config:
  gpu_memory_total_gb: 24
  allow_concurrent_loading: false  # Sequential loading only
  model_switch_timeout_seconds: 300  # 5 minutes max for model switching
  health_check_interval_seconds: 10
  default_model: qwen-2.5-7b
```
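
The `config` values at the bottom suggest how a switch is bounded: poll the new backend at `health_check_interval_seconds` intervals and give up after `model_switch_timeout_seconds`. A sketch; the `/health` path is an assumption, not something the registry specifies:

```python
import time

import requests

def wait_until_healthy(port, cfg):
    """Poll the backend until it responds or the switch timeout expires."""
    deadline = time.monotonic() + cfg["model_switch_timeout_seconds"]
    while time.monotonic() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=5).ok:
                return
        except requests.ConnectionError:
            pass  # backend not up yet, keep polling
        time.sleep(cfg["health_check_interval_seconds"])
    raise TimeoutError(f"backend on port {port} did not become healthy in time")
```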