# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages the model lifecycle.

### Components

```
┌─────────────────────────────────────────────────────────────────┐
│                 VPS (Tailscale: 100.102.217.79)                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 LiteLLM Proxy (Port 4000)                 │  │
│  │         Routes to: Claude API + GPU Orchestrator          │  │
│  └────────────────────┬──────────────────────────────────────┘  │
└───────────────────────┼─────────────────────────────────────────┘
                        │ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│          RunPod GPU Server (Tailscale: 100.100.108.13)          │
│  ┌────────────────────▼──────────────────────────────────────┐  │
│  │                 Orchestrator (Port 9000)                  │  │
│  │  Manages sequential model loading based on request type   │  │
│  └─────┬──────────────┬──────────────────┬───────────────────┘  │
│        │              │                  │                      │
│  ┌─────▼──────┐  ┌────▼────────┐  ┌──────▼───────┐              │
│  │vLLM        │  │Flux.1       │  │MusicGen      │              │
│  │Qwen 2.5 7B │  │Schnell      │  │Medium        │              │
│  │Port: 8001  │  │Port: 8002   │  │Port: 8003    │              │
│  │VRAM: 14GB  │  │VRAM: 14GB   │  │VRAM: 11GB    │              │
│  └────────────┘  └─────────────┘  └──────────────┘              │
│                                                                 │
│  Only ONE model active at a time (sequential loading)           │
└─────────────────────────────────────────────────────────────────┘
```

### Features

✅ **Automatic Model Switching** - Orchestrator detects the request type and loads the appropriate model
✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
✅ **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ **Easy Model Addition** - Add new models by editing a YAML config
✅ **Centralized Routing** - LiteLLM proxy provides a unified API for all models
✅ **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time

## Supported Model Types

### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
  - VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
  - VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API

### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
  - VRAM: 11GB | Speed: 60-90 sec for 30s of audio | Custom audio API

## Quick Start

### 1. Prerequisites

```bash
# On the RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
```

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```

### 3. Deploy to RunPod

```bash
# Copy all files to the RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to the GPU server
ssh gpu-pivoine

# Navigate to the project
cd /workspace/ai/

# Start the orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# The orchestrator will automatically manage model services as needed
```
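Before running the tests in the next step, it can help to wait until the orchestrator actually reports healthy, since the container may take a moment to come up. A minimal sketch using plain `requests` against the `/health` endpoint and Tailscale address used throughout this README (the exact shape of the health payload is an assumption):

```python
import time

import requests

ORCHESTRATOR = "http://100.100.108.13:9000"  # Tailscale address of the GPU server


def wait_for_orchestrator(timeout_s: int = 300) -> None:
    """Poll the orchestrator's /health endpoint until it responds, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(f"{ORCHESTRATOR}/health", timeout=5)
            if resp.ok:
                print("Orchestrator is up:", resp.json())
                return
        except requests.RequestException:
            pass  # Still starting; retry below
        time.sleep(5)
    raise TimeoutError("Orchestrator did not become healthy in time")


if __name__ == "__main__":
    wait_for_orchestrator()
```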
### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```

### 5. Update VPS LiteLLM

```bash
# On the VPS, restart LiteLLM to pick up the new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```

## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type a message and send
3. Orchestrator loads vLLM automatically

**Image Generation:**
1. Select model: `flux-schnell`
2. Enter an image prompt
3. Orchestrator switches to Flux.1

**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen

### Via API (Direct)

```python
import openai
import requests

# Configure the client to use the orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```

## Adding New Models

### Step 1: Update `models.yaml`

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                      # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```

### Step 2: Add a Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000
      --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```

### Step 3: Restart the Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```

**That's it!** The orchestrator automatically detects the new model.
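After the restart you can verify the new entry from any machine on the Tailscale network using the orchestrator's `/models`, `/switch`, and `/health` endpoints (all shown under Management Commands below). A short sketch; the JSON response shapes are assumptions, and `llama-3.1-8b` is the hypothetical model from the example above:

```python
import requests

ORCHESTRATOR = "http://100.100.108.13:9000"
NEW_MODEL = "llama-3.1-8b"  # hypothetical model added in models.yaml above

# List the registered models and make sure the new entry shows up.
models = requests.get(f"{ORCHESTRATOR}/models", timeout=10).json()
print(models)

# Ask the orchestrator to load it. This unloads the currently active model
# first, so expect the documented 30-120 s switching delay.
resp = requests.post(f"{ORCHESTRATOR}/switch", json={"model": NEW_MODEL}, timeout=300)
resp.raise_for_status()

# Confirm which model is now active.
print(requests.get(f"{ORCHESTRATOR}/health", timeout=10).json())
```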
## Management Commands

### Orchestrator

```bash
# Start the orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart the orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check the active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to a specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```

### Model Services

```bash
# Manually start a specific model (bypassing the orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```

### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if the model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test the model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```

### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart the Docker daemon
sudo systemctl restart docker

# Rebuild the orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```

### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to the /workspace cache
# (huggingface-cli is not part of the base CUDA image; install
#  huggingface_hub[cli] in the container, or use an image that ships it)
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  huggingface-cli download facebook/musicgen-medium
```

## File Structure

```
ai/
├── docker-compose.gpu.yaml        # Main orchestration file
├── .env.example                   # Environment template
├── README.md                      # This file
│
├── model-orchestrator/            # Central orchestrator service
│   ├── orchestrator.py            # FastAPI app managing models
│   ├── models.yaml                # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                          # Text generation (vLLM)
│   ├── server.py                  # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                          # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json            # Flux configuration
│
├── musicgen/                      # Music generation (MusicGen)
│   ├── server.py                  # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml            # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md          # Deployment history and notes
```

## Cost Analysis

### Current Setup (Single GPU)

- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
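The monthly figures above are plain hourly-rate arithmetic (assuming a 30-day month); a tiny sketch for checking other usage schedules against the same spot prices:

```python
def monthly_cost(rate_per_hour: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Estimated monthly GPU cost for a given hourly rate and daily usage."""
    return rate_per_hour * hours_per_day * days


print(monthly_cost(0.50))      # ~$360/month: single RTX 4090, running 24/7
print(monthly_cost(0.50, 8))   # ~$120/month: single GPU, 8 hours/day
print(monthly_cost(0.75))      # ~$540/month: the multi-GPU alternative below
```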
### Alternative: Multi-GPU (All Models Always On)

- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month

### Recommendation

Stick with single-GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.

## Performance Expectations

| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |

**Model Switching**: 30-120 seconds (unload current + load new)

## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on the private Tailscale network
- No public exposure (only via the VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)

## Future Enhancements

1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop the GPU pod during idle periods

## Support

For issues or questions:

- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<model>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review the GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.