Implemented a cost-optimized AI infrastructure running on a single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.
## New Components
**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations
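
A minimal sketch of the switching flow described above (illustrative assumptions only: the endpoint-to-service map is hard-coded here instead of loaded from models.yaml, container names follow the compose file, and the health-check wait is omitted; the actual orchestrator.py may differ):

```python
# Illustrative sketch of the orchestrator's core loop, NOT the actual orchestrator.py.
# Assumes container names from docker-compose.gpu.yaml (ai_<service>_1).
import docker
import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
docker_client = docker.from_env()

# Request path -> (docker service, host port); mirrors models.yaml
ROUTES = {
    "/v1/chat/completions": ("vllm-qwen", 8001),
    "/v1/images/generations": ("flux", 8002),
    "/v1/audio/generations": ("musicgen", 8003),
}
current_service: str | None = None


def switch_to(service: str) -> None:
    """Stop the active model container, then start the requested one (GPU memory safety)."""
    global current_service
    if current_service == service:
        return
    if current_service is not None:
        docker_client.containers.get(f"ai_{current_service}_1").stop()
    docker_client.containers.get(f"ai_{service}_1").start()
    current_service = service  # real code would also poll the model's /health here


@app.post("/v1/{path:path}")
async def proxy(path: str, request: Request):
    full_path = f"/v1/{path}"
    if full_path not in ROUTES:
        raise HTTPException(status_code=404, detail="unknown endpoint")
    service, port = ROUTES[full_path]
    switch_to(service)
    async with httpx.AsyncClient(timeout=300) as client:
        upstream = await client.post(
            f"http://localhost:{port}{full_path}",
            content=await request.body(),
            headers={"Content-Type": "application/json"},
        )
    return upstream.json()
```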
**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support
**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux
**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft
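
For reference, a stripped-down sketch of what an AudioCraft-backed wrapper can look like (illustrative only; the actual musicgen/server.py and its response schema may differ, and the base64 WAV payload here is an assumption):

```python
# Illustrative MusicGen wrapper sketch, not the actual musicgen/server.py.
# Requires: pip install audiocraft fastapi uvicorn
import base64
import io

import torchaudio
from audiocraft.models import MusicGen
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = MusicGen.get_pretrained("facebook/musicgen-medium")  # ~11GB VRAM once loaded

class AudioRequest(BaseModel):
    model: str = "musicgen-medium"
    prompt: str
    duration: int = 30  # seconds of audio to generate

@app.post("/v1/audio/generations")
def generate_audio(req: AudioRequest):
    model.set_generation_params(duration=req.duration)
    wav = model.generate([req.prompt])  # tensor of shape (batch, channels, samples)
    buf = io.BytesIO()
    torchaudio.save(buf, wav[0].cpu(), model.sample_rate, format="wav")
    # Response schema is an assumption: base64-encoded WAV in a JSON body
    return {"audio": base64.b64encode(buf.getvalue()).decode(), "format": "wav"}
```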
## Architecture
```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                         │
                     ┌───────────────────┼───────────────────┐
                vLLM (8001)         Flux (8002)        MusicGen (8003)
        [Only ONE active at a time - sequential loading]
```
## Configuration Files
- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide
## Updated Files
- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture
## Features
✅ Automatic model switching (30-120s latency)
✅ Cost-optimized single GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
✅ Easy model addition via YAML configuration
✅ OpenAI-compatible APIs for all model types
✅ Centralized routing through LiteLLM proxy
✅ GPU memory safety (only one model loaded at time)
## Usage
Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```
Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -d '{"model":"qwen-2.5-7b","messages":[...]}'
# Image
curl http://100.100.108.13:9000/v1/images/generations -d '{"model":"flux-schnell","prompt":"..."}'
# Music
curl http://100.100.108.13:9000/v1/audio/generations -d '{"model":"musicgen-medium","prompt":"..."}'
```
All models available via Open WebUI at https://ai.pivoine.art
## Adding New Models
1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator
That's it! The orchestrator automatically detects and manages the new model.
## Performance
| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |
Model switching overhead: 30-120 seconds
## License Notes
- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
# Multi-Modal AI Orchestration System
Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components

```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ LiteLLM Proxy (Port 4000)                                │   │
│  │ Routes to: Claude API + GPU Orchestrator                 │   │
│  └────────────────────────────┬─────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │ Tailscale VPN
┌───────────────────────────────┼─────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                   │
│  ┌────────────────────────────▼─────────────────────────────┐   │
│  │ Orchestrator (Port 9000)                                 │   │
│  │ Manages sequential model loading based on request type   │   │
│  └───────┬───────────────────┬───────────────────┬──────────┘   │
│          │                   │                   │              │
│  ┌───────▼────────┐  ┌───────▼────────┐  ┌───────▼────────┐     │
│  │ vLLM           │  │ Flux.1         │  │ MusicGen       │     │
│  │ Qwen 2.5 7B    │  │ Schnell        │  │ Medium         │     │
│  │ Port: 8001     │  │ Port: 8002     │  │ Port: 8003     │     │
│  │ VRAM: 14GB     │  │ VRAM: 14GB     │  │ VRAM: 11GB     │     │
│  └────────────────┘  └────────────────┘  └────────────────┘     │
│                                                                  │
│  Only ONE model active at a time (sequential loading)            │
└──────────────────────────────────────────────────────────────────┘
```
## Features

✅ **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
✅ **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ **Easy Model Addition** - Add new models by editing YAML config
✅ **Centralized Routing** - LiteLLM proxy provides unified API for all models
✅ **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
## Supported Model Types

### Text Generation
- Qwen 2.5 7B Instruct (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: ~50 tok/sec | OpenAI-compatible chat API

### Image Generation
- Flux.1 Schnell (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation
- MusicGen Medium (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start

### 1. Prerequisites

On the RunPod GPU server:
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env

# Edit .env and add your HF_TOKEN
```
### 3. Deploy to RunPod

```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine

# Navigate to project
cd /workspace/ai/

# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```
### 5. Update VPS LiteLLM

```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
- Select model: `qwen-2.5-7b`
- Type message and send
- Orchestrator loads vLLM automatically

**Image Generation:**
- Select model: `flux-schnell`
- Enter image prompt
- Orchestrator switches to Flux.1

**Music Generation:**
- Select model: `musicgen-medium`
- Describe the music you want
- Orchestrator switches to MusicGen
### Via API (Direct)

```python
import openai

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests

music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
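
Because a cold switch can take 30-120 seconds, long-running clients may want to trigger the switch up front and wait for it using the documented /switch and /health endpoints. A minimal sketch (response fields other than `current_model` are assumptions):

```python
# Minimal sketch: pre-switch the GPU to the model you need, then poll /health
# until the orchestrator reports it as active.
import time

import requests

BASE = "http://100.100.108.13:9000"

def ensure_model(model: str, timeout: int = 180) -> None:
    """Ask the orchestrator to switch models, then wait until the switch is done."""
    requests.post(f"{BASE}/switch", json={"model": model}, timeout=30)
    deadline = time.time() + timeout
    while time.time() < deadline:
        health = requests.get(f"{BASE}/health", timeout=10).json()
        if health.get("current_model") == model:
            return
        time.sleep(5)
    raise TimeoutError(f"{model} did not become active within {timeout}s")

ensure_model("flux-schnell")  # subsequent image requests skip the switching delay
```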
## Adding New Models

### Step 1: Update models.yaml

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:  # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```
That's it! The orchestrator automatically detects the new model.
## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control

```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
### Model Services

```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart Docker daemon
sudo systemctl restart docker

# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```
### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to /workspace cache
# (the CUDA runtime image ships without Python, so install huggingface-cli first)
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  bash -c "apt-get update -q && apt-get install -yq python3-pip && \
           pip3 install -q 'huggingface_hub[cli]' && \
           huggingface-cli download facebook/musicgen-medium"
```
## File Structure

```
ai/
├── docker-compose.gpu.yaml      # Main orchestration file
├── .env.example                 # Environment template
├── README.md                    # This file
│
├── model-orchestrator/          # Central orchestrator service
│   ├── orchestrator.py          # FastAPI app managing models
│   ├── models.yaml              # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                        # Text generation (vLLM)
│   ├── server.py                # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                        # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json          # Flux configuration
│
├── musicgen/                    # Music generation (MusicGen)
│   ├── server.py                # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml          # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md        # Deployment history and notes
```
## Cost Analysis

### Current Setup (Single GPU)
- Provider: RunPod Spot Instance
- GPU: RTX 4090 24GB
- Cost: ~$0.50/hour
- Monthly: ~$360 (if running 24/7)
- Optimized: ~$120 (8 hours/day during business hours)

### Alternative: Multi-GPU (All Models Always On)
- GPUs: 2× RTX 4090
- Cost: ~$0.75/hour
- Monthly: ~$540 (if running 24/7)
- Trade-off: No switching latency, +$180/month

### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
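
The monthly figures above follow directly from the hourly rates; a quick sanity check:

```python
# Reproduces the monthly cost figures above from the approximate hourly spot prices.
single_gpu_hourly, dual_gpu_hourly = 0.50, 0.75
hours_24_7, hours_business = 24 * 30, 8 * 30  # 720 h vs 240 h per month

print(f"Single GPU, 24/7:    ~${single_gpu_hourly * hours_24_7:.0f}/month")      # ~$360
print(f"Single GPU, 8 h/day: ~${single_gpu_hourly * hours_business:.0f}/month")  # ~$120
print(f"2x GPU, 24/7:        ~${dual_gpu_hourly * hours_24_7:.0f}/month")        # ~$540
```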
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
Model Switching: 30-120 seconds (unload current + load new)
## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
## Future Enhancements
- ⏹️ Add Llama 3.1 8B for alternative text generation
- ⏹️ Add Whisper Large v3 for speech-to-text
- ⏹️ Add XTTS v2 for text-to-speech
- ⏹️ Implement model preloading/caching for faster switching
- ⏹️ Add usage metrics and cost tracking
- ⏹️ Auto-stop GPU pod during idle periods
## Support

For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- vLLM - Apache 2.0
- AudioCraft - MIT (code), CC-BY-NC (weights)
- Flux.1 - Apache 2.0
- LiteLLM - MIT
Note: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.