RunPod Multi-Modal AI Stack
Cost-optimized GPU deployment for text, image, and music generation on RunPod RTX 4090.
This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.
Features
- Text Generation: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- Image Generation: Flux.1 Schnell (~4-5 seconds per image)
- Music Generation: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- Automatic Model Switching: Intelligent orchestrator manages sequential model loading
- OpenAI-Compatible APIs: Works with existing AI tools and clients (see the example below)
- Easy Model Addition: Edit `model-orchestrator/models.yaml` to add new models
- Template Support: Create reusable templates for 2-3 minute deployments (vs. 60-90 minutes from scratch)
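Because the endpoints are OpenAI-compatible, existing clients work unchanged. A minimal sketch using the official `openai` Python SDK, assuming the orchestrator address used throughout this README and a placeholder API key (the README does not state whether a key is enforced):

```python
# Minimal example: talk to the orchestrator with the official OpenAI Python SDK.
# The api_key value is a placeholder; the orchestrator may not check it.
from openai import OpenAI

client = OpenAI(
    base_url="http://100.100.108.13:9000/v1",  # orchestrator, not api.openai.com
    api_key="not-needed",                      # placeholder credential
)

response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```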
Quick Start
Option 1: Deploy from Template (Recommended)
If you've already created a RunPod template:
- Deploy a pod from the template in the RunPod dashboard
- SSH to the pod
- Create a `.env` file with your credentials
- Start the orchestrator: `docker compose -f docker-compose.gpu.yaml up -d orchestrator`
See: RUNPOD_TEMPLATE.md for template usage instructions.
Option 2: Fresh Deployment
For first-time setup on a new RunPod instance:
- Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
- SSH to the GPU server: `ssh gpu-server`
- Run the preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`
See: DEPLOYMENT.md for detailed deployment guide.
Architecture
```
VPS (LiteLLM Proxy)
        ↓ Tailscale VPN
GPU Server (Orchestrator, Port 9000)
├── vLLM (Qwen 2.5 7B) - Port 8001
├── Flux.1 Schnell - Port 8002
└── MusicGen Medium - Port 8003
```
All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time, which keeps the workload on a single GPU (~$0.50/hr instead of ~$0.75/hr for a multi-GPU configuration).
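The orchestrator's source is not reproduced in this README, but the swap-on-demand idea can be sketched as follows. This is an illustrative simplification, not the actual `model-orchestrator` code; the compose service names and the per-service `/health` endpoints are assumptions:

```python
# Illustrative sketch of sequential model loading on one GPU: stop the active
# compose service, start the requested one, wait until it is healthy.
# Service names and the per-service /health endpoint are assumptions here,
# not the real orchestrator implementation.
import subprocess
import time

import requests

SERVICES = {  # model -> (compose service, port), mirroring the diagram above
    "qwen-2.5-7b": ("vllm", 8001),
    "flux-schnell": ("flux", 8002),
    "musicgen-medium": ("musicgen", 8003),
}
active_model = None


def compose(*args: str) -> None:
    subprocess.run(
        ["docker", "compose", "-f", "docker-compose.gpu.yaml", *args], check=True
    )


def ensure_loaded(model: str, timeout: int = 300) -> None:
    """Make `model` the single active model, freeing VRAM from the previous one."""
    global active_model
    if model == active_model:
        return
    if active_model is not None:
        compose("stop", SERVICES[active_model][0])  # release the 24 GB of VRAM
    service, port = SERVICES[model]
    compose("up", "-d", service)
    deadline = time.time() + timeout
    while time.time() < deadline:  # poll until the backend answers
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=2).ok:
                active_model = model
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError(f"{model} did not become healthy within {timeout}s")
```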
Cost Analysis
RunPod RTX 4090 Spot Instance:
- Hourly: ~$0.50
- Monthly (24/7): ~$360
- Monthly (8hr/day): ~$120
Template Benefits:
- Without Template: 60-90 minutes setup per Spot restart
- With Template: 2-3 minutes deployment time
- Spot Restart Frequency: 2-5 times per week (variable)
Documentation
Primary Docs
- DEPLOYMENT.md - Complete deployment and usage guide
- RUNPOD_TEMPLATE.md - Template creation and usage
- GPU_DEPLOYMENT_LOG.md - Deployment history and technical notes
Setup Guides (Historical)
- DOCKER_GPU_SETUP.md - Docker configuration for GPU support
- TAILSCALE_SETUP.md - Tailscale VPN setup
- WIREGUARD_SETUP.md - WireGuard VPN (deprecated; use Tailscale)
- SETUP_GUIDE.md - General setup instructions
Architecture Components
- `model-orchestrator/` - FastAPI orchestrator managing the model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts
Creating a RunPod Template
Why create a template?
- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration
How to create:
- Run `scripts/prepare-template.sh` on a fresh RunPod instance
- Wait 45-60 minutes for models to download and images to build
- Save the pod as a template in the RunPod dashboard
- Name it `multi-modal-ai-v1.0`
See: RUNPOD_TEMPLATE.md for step-by-step guide.
Adding New Models
To add a new model, edit `model-orchestrator/models.yaml`:
```yaml
models:
  llama-3.1-8b:                    # new model entry
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```
Then add the Docker service to docker-compose.gpu.yaml and restart the orchestrator.
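Once the orchestrator is back up, a quick smoke test confirms it can route to the new entry. A sketch in Python, using the hypothetical `llama-3.1-8b` entry above:

```python
# Smoke test for a newly added model; the model name matches the
# hypothetical models.yaml entry above.
import requests

resp = requests.post(
    "http://100.100.108.13:9000/v1/chat/completions",
    json={
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=300,  # the first request may wait for the model to load
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```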
See: DEPLOYMENT.md for complete instructions.
Usage Examples
Text Generation
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Image Generation
```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
Music Generation
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
Infrastructure
- Provider: RunPod (Spot instance)
- GPU: NVIDIA RTX 4090, 24 GB VRAM
- Region: Europe
- Network: Tailscale VPN (100.100.108.13)
- Storage: 922 GB network volume mounted at /workspace
Monitoring
```bash
# Check the active model
curl http://100.100.108.13:9000/health

# View orchestrator logs
docker logs -f ai_orchestrator

# GPU usage
nvidia-smi
```
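For unattended Spot instances, a small poller can watch the health endpoint; a sketch assuming `/health` returns JSON (the exact schema is not documented in this README):

```python
# Tiny watchdog: poll the orchestrator health endpoint once a minute.
# The JSON payload returned by /health is an assumption.
import time

import requests

while True:
    try:
        health = requests.get("http://100.100.108.13:9000/health", timeout=5).json()
        print(f"orchestrator up: {health}")
    except requests.RequestException as err:
        print(f"orchestrator unreachable: {err}")
    time.sleep(60)
```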
Support
For issues:
- Check orchestrator logs: `docker logs ai_orchestrator`
- Review DEPLOYMENT.md
- Check GPU_DEPLOYMENT_LOG.md for deployment history
License
Built with:
- vLLM - Apache 2.0
- AudioCraft - MIT (code), CC-BY-NC (weights)
- Flux.1 - Apache 2.0
- LiteLLM - MIT
Note: MusicGen pre-trained weights are non-commercial (CC-BY-NC).