Implemented a cost-optimized AI infrastructure running on a single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.
## New Components
**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations
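
A minimal sketch of the switching flow described above (illustrative assumptions only: the endpoint-to-service map is hard-coded here instead of loaded from models.yaml, container names follow the compose file, and the health-check wait is omitted; the actual orchestrator.py may differ):

```python
# Illustrative sketch of the orchestrator's core loop, NOT the actual orchestrator.py.
# Assumes container names from docker-compose.gpu.yaml (ai_<service>_1).
import docker
import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
docker_client = docker.from_env()

# Request path -> (docker service, host port); mirrors models.yaml
ROUTES = {
    "/v1/chat/completions": ("vllm-qwen", 8001),
    "/v1/images/generations": ("flux", 8002),
    "/v1/audio/generations": ("musicgen", 8003),
}
current_service: str | None = None


def switch_to(service: str) -> None:
    """Stop the active model container, then start the requested one (GPU memory safety)."""
    global current_service
    if current_service == service:
        return
    if current_service is not None:
        docker_client.containers.get(f"ai_{current_service}_1").stop()
    docker_client.containers.get(f"ai_{service}_1").start()
    current_service = service  # real code would also poll the model's /health here


@app.post("/v1/{path:path}")
async def proxy(path: str, request: Request):
    full_path = f"/v1/{path}"
    if full_path not in ROUTES:
        raise HTTPException(status_code=404, detail="unknown endpoint")
    service, port = ROUTES[full_path]
    switch_to(service)
    async with httpx.AsyncClient(timeout=300) as client:
        upstream = await client.post(
            f"http://localhost:{port}{full_path}",
            content=await request.body(),
            headers={"Content-Type": "application/json"},
        )
    return upstream.json()
```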
**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support
**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux
**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft
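
For reference, a stripped-down sketch of what an AudioCraft-backed wrapper can look like (illustrative only; the actual musicgen/server.py and its response schema may differ, and the base64 WAV payload here is an assumption):

```python
# Illustrative MusicGen wrapper sketch, not the actual musicgen/server.py.
# Requires: pip install audiocraft fastapi uvicorn
import base64
import io

import torchaudio
from audiocraft.models import MusicGen
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = MusicGen.get_pretrained("facebook/musicgen-medium")  # ~11GB VRAM once loaded

class AudioRequest(BaseModel):
    model: str = "musicgen-medium"
    prompt: str
    duration: int = 30  # seconds of audio to generate

@app.post("/v1/audio/generations")
def generate_audio(req: AudioRequest):
    model.set_generation_params(duration=req.duration)
    wav = model.generate([req.prompt])  # tensor of shape (batch, channels, samples)
    buf = io.BytesIO()
    torchaudio.save(buf, wav[0].cpu(), model.sample_rate, format="wav")
    # Response schema is an assumption: base64-encoded WAV in a JSON body
    return {"audio": base64.b64encode(buf.getvalue()).decode(), "format": "wav"}
```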
## Architecture
```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                         │
                     ┌───────────────────┼───────────────────┐
                vLLM (8001)         Flux (8002)        MusicGen (8003)
        [Only ONE active at a time - sequential loading]
```
## Configuration Files
- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide
## Updated Files
- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture
## Features
✅ Automatic model switching (30-120s latency)
✅ Cost-optimized single GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
✅ Easy model addition via YAML configuration
✅ OpenAI-compatible APIs for all model types
✅ Centralized routing through LiteLLM proxy
✅ GPU memory safety (only one model loaded at time)
## Usage
Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```
Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -d '{"model":"qwen-2.5-7b","messages":[...]}'
# Image
curl http://100.100.108.13:9000/v1/images/generations -d '{"model":"flux-schnell","prompt":"..."}'
# Music
curl http://100.100.108.13:9000/v1/audio/generations -d '{"model":"musicgen-medium","prompt":"..."}'
```
All models available via Open WebUI at https://ai.pivoine.art
## Adding New Models
1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator
That's it! The orchestrator automatically detects and manages the new model.
## Performance
| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |
Model switching overhead: 30-120 seconds
## License Notes
- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
# Multi-Modal AI Orchestration System
Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ LiteLLM Proxy (Port 4000)                                │   │
│  │ Routes to: Claude API + GPU Orchestrator                 │   │
│  └────────────────────────────┬─────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │ Tailscale VPN
┌───────────────────────────────┼─────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                   │
│  ┌────────────────────────────▼─────────────────────────────┐   │
│  │ Orchestrator (Port 9000)                                 │   │
│  │ Manages sequential model loading based on request type   │   │
│  └───────┬───────────────────┬───────────────────┬──────────┘   │
│          │                   │                   │              │
│  ┌───────▼────────┐  ┌───────▼────────┐  ┌───────▼────────┐     │
│  │ vLLM           │  │ Flux.1         │  │ MusicGen       │     │
│  │ Qwen 2.5 7B    │  │ Schnell        │  │ Medium         │     │
│  │ Port: 8001     │  │ Port: 8002     │  │ Port: 8003     │     │
│  │ VRAM: 14GB     │  │ VRAM: 14GB     │  │ VRAM: 11GB     │     │
│  └────────────────┘  └────────────────┘  └────────────────┘     │
│                                                                  │
│  Only ONE model active at a time (sequential loading)            │
└──────────────────────────────────────────────────────────────────┘
```
## Features

✅ **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
✅ **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ **Easy Model Addition** - Add new models by editing YAML config
✅ **Centralized Routing** - LiteLLM proxy provides unified API for all models
✅ **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
## Supported Model Types

### Text Generation
- Qwen 2.5 7B Instruct (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: ~50 tok/sec | OpenAI-compatible chat API

### Image Generation
- Flux.1 Schnell (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation
- MusicGen Medium (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start

### 1. Prerequisites

On the RunPod GPU server:
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env

# Edit .env and add your HF_TOKEN
```
### 3. Deploy to RunPod

```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine

# Navigate to project
cd /workspace/ai/

# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```
### 5. Update VPS LiteLLM

```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
- Select model: `qwen-2.5-7b`
- Type message and send
- Orchestrator loads vLLM automatically

**Image Generation:**
- Select model: `flux-schnell`
- Enter image prompt
- Orchestrator switches to Flux.1

**Music Generation:**
- Select model: `musicgen-medium`
- Describe the music you want
- Orchestrator switches to MusicGen
### Via API (Direct)

```python
import openai

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests

music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
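
Because a cold switch can take 30-120 seconds, long-running clients may want to trigger the switch up front and wait for it using the documented /switch and /health endpoints. A minimal sketch (response fields other than `current_model` are assumptions):

```python
# Minimal sketch: pre-switch the GPU to the model you need, then poll /health
# until the orchestrator reports it as active.
import time

import requests

BASE = "http://100.100.108.13:9000"

def ensure_model(model: str, timeout: int = 180) -> None:
    """Ask the orchestrator to switch models, then wait until the switch is done."""
    requests.post(f"{BASE}/switch", json={"model": model}, timeout=30)
    deadline = time.time() + timeout
    while time.time() < deadline:
        health = requests.get(f"{BASE}/health", timeout=10).json()
        if health.get("current_model") == model:
            return
        time.sleep(5)
    raise TimeoutError(f"{model} did not become active within {timeout}s")

ensure_model("flux-schnell")  # subsequent image requests skip the switching delay
```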
## Adding New Models

### Step 1: Update models.yaml

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:  # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
That's it! The orchestrator automatically detects the new model.
## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control

```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
### Model Services

```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart Docker daemon
sudo systemctl restart docker

# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to /workspace cache
# (the CUDA runtime image ships without Python, so install huggingface-cli first)
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  bash -c "apt-get update -q && apt-get install -yq python3-pip && \
           pip3 install -q 'huggingface_hub[cli]' && \
           huggingface-cli download facebook/musicgen-medium"
```
## File Structure

```
ai/
├── docker-compose.gpu.yaml      # Main orchestration file
├── .env.example                 # Environment template
├── README.md                    # This file
│
├── model-orchestrator/          # Central orchestrator service
│   ├── orchestrator.py          # FastAPI app managing models
│   ├── models.yaml              # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                        # Text generation (vLLM)
│   ├── server.py                # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                        # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json          # Flux configuration
│
├── musicgen/                    # Music generation (MusicGen)
│   ├── server.py                # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml          # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md        # Deployment history and notes
```
## Cost Analysis

### Current Setup (Single GPU)
- Provider: RunPod Spot Instance
- GPU: RTX 4090 24GB
- Cost: ~$0.50/hour
- Monthly: ~$360 (if running 24/7)
- Optimized: ~$120 (8 hours/day during business hours)

### Alternative: Multi-GPU (All Models Always On)
- GPUs: 2× RTX 4090
- Cost: ~$0.75/hour
- Monthly: ~$540 (if running 24/7)
- Trade-off: No switching latency, +$180/month

### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
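
The monthly figures above follow directly from the hourly rates; a quick sanity check:

```python
# Reproduces the monthly cost figures above from the approximate hourly spot prices.
single_gpu_hourly, dual_gpu_hourly = 0.50, 0.75
hours_24_7, hours_business = 24 * 30, 8 * 30  # 720 h vs 240 h per month

print(f"Single GPU, 24/7:    ~${single_gpu_hourly * hours_24_7:.0f}/month")      # ~$360
print(f"Single GPU, 8 h/day: ~${single_gpu_hourly * hours_business:.0f}/month")  # ~$120
print(f"2x GPU, 24/7:        ~${dual_gpu_hourly * hours_24_7:.0f}/month")        # ~$540
```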
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
Model Switching: 30-120 seconds (unload current + load new)
## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
## Future Enhancements
- ⏹️ Add Llama 3.1 8B for alternative text generation
- ⏹️ Add Whisper Large v3 for speech-to-text
- ⏹️ Add XTTS v2 for text-to-speech
- ⏹️ Implement model preloading/caching for faster switching
- ⏹️ Add usage metrics and cost tracking
- ⏹️ Auto-stop GPU pod during idle periods
## Support

For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- vLLM - Apache 2.0
- AudioCraft - MIT (code), CC-BY-NC (weights)
- Flux.1 - Apache 2.0
- LiteLLM - MIT
Note: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.