Multi-Modal AI Orchestration System
Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.
Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
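For orientation, the sketch below shows the general shape of that switching logic. It is illustrative only (service and profile names other than `vllm-qwen`/`text` are assumptions); the actual implementation lives in `model-orchestrator/orchestrator.py`.

```python
# Simplified sketch of the switching logic (illustrative only; the real
# implementation lives in model-orchestrator/orchestrator.py).
import subprocess

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# In the real service this registry is loaded from models.yaml.
# Service/profile names other than vllm-qwen / "text" are assumptions.
MODELS = {
    "qwen-2.5-7b":     {"service": "vllm-qwen", "port": 8001, "profile": "text"},
    "flux-schnell":    {"service": "flux",      "port": 8002, "profile": "image"},
    "musicgen-medium": {"service": "musicgen",  "port": 8003, "profile": "music"},
}
current_model = None

def switch_to(model: str) -> None:
    """Stop the active model container and start the requested one."""
    global current_model
    if current_model == model:
        return
    if current_model is not None:
        subprocess.run(["docker", "compose", "-f", "docker-compose.gpu.yaml",
                        "stop", MODELS[current_model]["service"]], check=True)
    target = MODELS[model]
    subprocess.run(["docker", "compose", "-f", "docker-compose.gpu.yaml",
                    "--profile", target["profile"], "up", "-d", target["service"]],
                   check=True)
    # (the real orchestrator also waits for the model's /health before proxying)
    current_model = model

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    switch_to(body["model"])  # make sure the right model is loaded first
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(
            f"http://localhost:{MODELS[body['model']]['port']}/v1/chat/completions",
            json=body,
        )
    return resp.json()
```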
Components
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
Features
✅ Automatic Model Switching - Orchestrator detects request type and loads the appropriate model
✅ OpenAI-Compatible APIs - Works with existing OpenAI clients and tools
✅ Cost-Optimized - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ Easy Model Addition - Add new models by editing a YAML config
✅ Centralized Routing - LiteLLM proxy provides a unified API for all models
✅ GPU Memory Safe - Orchestrator ensures only one model is loaded at a time
Supported Model Types
Text Generation
- Qwen 2.5 7B Instruct (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
Image Generation
- Flux.1 Schnell (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
Music Generation
- MusicGen Medium (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
Quick Start
1. Prerequisites
# On RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
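A few quick checks on the GPU server can confirm these prerequisites:

```bash
# Optional sanity checks before deploying
nvidia-smi                # GPU visible with 24GB VRAM
docker compose version    # Docker Compose v2 available
tailscale status          # node is joined to the tailnet
```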
2. Clone & Configure
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
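A minimal `.env` might look like this (the token value is a placeholder):

```bash
# .env -- only HF_TOKEN is required for model downloads
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
```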
3. Deploy to RunPod
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
4. Test Deployment
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "flux-schnell",
"prompt": "a cute cat",
"size": "1024x1024"
}'
# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{
"model": "musicgen-medium",
"prompt": "upbeat electronic dance music",
"duration": 30
}'
5. Update VPS LiteLLM
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
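For reference, the orchestrator-backed models are registered in `litellm-config.yaml` as OpenAI-compatible backends. An illustrative excerpt (field values assumed; check the shipped file):

```yaml
# Illustrative excerpt of litellm-config.yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b                  # OpenAI-compatible backend
      api_base: http://100.100.108.13:9000/v1    # GPU orchestrator over Tailscale
      api_key: dummy
  - model_name: flux-schnell
    litellm_params:
      model: openai/flux-schnell
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
```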
Usage Examples
Via Open WebUI (https://ai.pivoine.art)
Text Generation:
- Select model: qwen-2.5-7b
- Type a message and send
- Orchestrator loads vLLM automatically
Image Generation:
- Select model: flux-schnell
- Enter an image prompt
- Orchestrator switches to Flux.1
Music Generation:
- Select model: musicgen-medium
- Describe the music you want
- Orchestrator switches to MusicGen
Via API (Direct)
import openai
# Configure client to use orchestrator
client = openai.OpenAI(
base_url="http://100.100.108.13:9000/v1",
api_key="dummy" # Not used but required
)
# Text generation
response = client.chat.completions.create(
model="qwen-2.5-7b",
messages=[{"role": "user", "content": "Write a haiku"}]
)
# Image generation
image = client.images.generate(
model="flux-schnell",
prompt="a sunset over mountains",
size="1024x1024"
)
# Music generation (custom endpoint)
import requests
music = requests.post(
"http://100.100.108.13:9000/v1/audio/generations",
json={
"model": "musicgen-medium",
"prompt": "calm piano music",
"duration": 30
}
)
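The exact response format of the audio endpoint is defined by the MusicGen server; assuming it returns raw audio bytes, the clip can be saved like this (adjust if the response is JSON with a URL or base64 payload):

```python
# Assumes the endpoint returns raw audio bytes; adjust if it returns JSON instead.
with open("output.wav", "wb") as f:
    f.write(music.content)
```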
Adding New Models
Step 1: Update models.yaml
# Add to ai/model-orchestrator/models.yaml
models:
llama-3.1-8b: # New model
type: text
framework: vllm
docker_service: vllm-llama
port: 8004
vram_gb: 17
startup_time_seconds: 120
endpoint: /v1/chat/completions
description: "Llama 3.1 8B Instruct - Meta's latest model"
Step 2: Add Docker Service
# Add to ai/docker-compose.gpu.yaml
services:
vllm-llama:
build: ./vllm
container_name: ai_vllm-llama_1
command: >
vllm serve meta-llama/Llama-3.1-8B-Instruct
--port 8000 --dtype bfloat16
ports:
- "8004:8000"
environment:
- HF_TOKEN=${HF_TOKEN}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles: ["text"]
restart: "no"
Step 3: Restart Orchestrator
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
That's it! The orchestrator automatically detects the new model.
Management Commands
Orchestrator
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
Manual Model Control
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
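The /health response includes at least the fields queried above; an illustrative payload (exact shape may differ):

```json
{
  "status": "healthy",
  "current_model": "flux-schnell",
  "model_info": {
    "vram_gb": 14,
    "port": 8002
  }
}
```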
Model Services
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
Monitoring
GPU Usage
ssh gpu-pivoine "nvidia-smi"
Model Status
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
Performance
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
Troubleshooting
Model Won't Load
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
Orchestrator Can't Connect
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
Model Switching Too Slow
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
docker run --rm -it \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  python:3.11-slim \
  bash -c "pip install -q 'huggingface_hub[cli]' && huggingface-cli download facebook/musicgen-medium"
File Structure
ai/
├── docker-compose.gpu.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
│
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
│
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
│
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
│
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
│
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
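For reference, the default entries in model-orchestrator/models.yaml presumably look like the following, reconstructed from the ports, VRAM, startup times, and endpoints documented above (framework and service names for Flux and MusicGen are assumptions):

```yaml
# Illustrative reconstruction of the default model registry
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
  flux-schnell:
    type: image
    framework: diffusers        # assumed
    docker_service: flux        # assumed from container name ai_flux_1
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
  musicgen-medium:
    type: music
    framework: audiocraft       # assumed
    docker_service: musicgen    # assumed from container name ai_musicgen_1
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```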
Cost Analysis
Current Setup (Single GPU)
- Provider: RunPod Spot Instance
- GPU: RTX 4090 24GB
- Cost: ~$0.50/hour
- Monthly: ~$360 (if running 24/7)
- Optimized: ~$120 (8 hours/day during business hours)
Alternative: Multi-GPU (All Models Always On)
- GPUs: 2× RTX 4090
- Cost: ~$0.75/hour
- Monthly: ~$540 (if running 24/7)
- Trade-off: No switching latency, +$180/month
Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
Model Switching: 30-120 seconds (unload current + load new)
Security Notes
- Orchestrator requires Docker socket access (/var/run/docker.sock)
- All services run on the private Tailscale network
- No public exposure (only via the VPS LiteLLM proxy)
- HuggingFace token stored in .env (not committed to git)
Future Enhancements
- ⏹️ Add Llama 3.1 8B for alternative text generation
- ⏹️ Add Whisper Large v3 for speech-to-text
- ⏹️ Add XTTS v2 for text-to-speech
- ⏹️ Implement model preloading/caching for faster switching
- ⏹️ Add usage metrics and cost tracking
- ⏹️ Auto-stop GPU pod during idle periods
Support
For issues or questions:
- Check orchestrator logs: docker logs ai_orchestrator
- View model-specific logs: docker logs ai_<service>_1
- Test direct model access: curl http://localhost:<port>/health
- Review the GPU deployment log: GPU_DEPLOYMENT_LOG.md
License
Built with:
- vLLM - Apache 2.0
- AudioCraft - MIT (code), CC-BY-NC (weights)
- Flux.1 - Apache 2.0
- LiteLLM - MIT
Note: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.