# RunPod Multi-Modal AI Stack

**Cost-optimized GPU deployment for text, image, and music generation on a RunPod RTX 4090.**

This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.

## Features

- **Text Generation**: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- **Image Generation**: Flux.1 Schnell (~4-5 seconds per image)
- **Music Generation**: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- **Automatic Model Switching**: Intelligent orchestrator manages sequential model loading
- **OpenAI-Compatible APIs**: Works with existing AI tools and clients
- **Easy Model Addition**: Edit `model-orchestrator/models.yaml` to add new models
- **Template Support**: Create reusable templates for 2-3 minute deployments (vs 60-90 minutes)

## Quick Start

### Option 1: Deploy from Template (Recommended)

If you've already created a RunPod template:

1. Deploy a pod from the template in the RunPod dashboard
2. SSH into the pod
3. Create a `.env` file with your credentials
4. Start the orchestrator: `docker compose -f docker-compose.gpu.yaml up -d orchestrator`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for template usage instructions.

### Option 2: Fresh Deployment

For first-time setup on a new RunPod instance:

1. Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
2. SSH into the GPU server: `ssh gpu-server`
3. Run the preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for the detailed deployment guide.

## Architecture

```
VPS (LiteLLM Proxy)
        ↓ Tailscale VPN
GPU Server (Orchestrator, Port 9000)
├── vLLM (Qwen 2.5 7B)  - Port 8001
├── Flux.1 Schnell      - Port 8002
└── MusicGen Medium     - Port 8003
```

All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time, which keeps the stack on a single GPU (~$0.50/hr instead of ~$0.75/hr for a multi-GPU setup).

## Cost Analysis

**RunPod RTX 4090 Spot Instance**:

- **Hourly**: ~$0.50
- **Monthly (24/7)**: ~$360
- **Monthly (8 hr/day)**: ~$120

**Template Benefits**:

- **Without template**: 60-90 minutes of setup per Spot restart
- **With template**: 2-3 minutes of deployment time
- **Spot restart frequency**: 2-5 times per week (variable)

## Documentation

- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Complete deployment and usage guide
- **[docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md)** - Template creation and usage
- **[docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md)** - Deployment history and technical notes

### Architecture Components

- `model-orchestrator/` - FastAPI orchestrator managing the model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts

## Creating a RunPod Template

**Why create a template?**

- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration

**How to create:**

1. Run `scripts/prepare-template.sh` on a fresh RunPod instance
2. Wait 45-60 minutes for models to download and images to build
3. Save the pod as a template in the RunPod dashboard (see the sanity check below first)
4. Name it `multi-modal-ai-v1.0`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for a step-by-step guide.
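Before saving the pod as a template, it helps to confirm that the model cache and images are actually in place. A minimal sanity check might look like the following sketch (the `/workspace/models` cache path and the image name patterns are assumptions; adjust them to match your layout):

```bash
# Cached model weights -- roughly 37GB total is expected across all models
# (the exact cache path is an assumption; adjust to your layout)
du -sh /workspace/models

# Pre-built Docker images for the orchestrator and model services
docker images | grep -E 'orchestrator|vllm|flux|musicgen'

# All services defined in the compose file should resolve cleanly
docker compose -f docker-compose.gpu.yaml config --services
```

If all three checks look right, the pod is ready to be saved as a template.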
## Adding New Models

Adding a model is a configuration change. Edit `model-orchestrator/models.yaml`:

```yaml
models:
  llama-3.1-8b:            # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```

Then add the matching Docker service to `docker-compose.gpu.yaml` and restart the orchestrator.

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#adding-new-models) for complete instructions.

## Usage Examples

### Text Generation

```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

### Image Generation

```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```

### Music Generation

```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```

## Infrastructure

- **Provider**: RunPod (Spot instance)
- **GPU**: NVIDIA RTX 4090, 24GB VRAM
- **Region**: Europe
- **Network**: Tailscale VPN (100.100.108.13)
- **Storage**: 922GB network volume at `/workspace`

## Monitoring

```bash
# Check the active model
curl http://100.100.108.13:9000/health

# View orchestrator logs
docker logs -f ai_orchestrator

# GPU usage
nvidia-smi
```

## Support

For issues:

1. Check the orchestrator logs: `docker logs ai_orchestrator`
2. Review [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#troubleshooting)
3. Check [docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md) for deployment history

## License

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: The MusicGen pre-trained weights are licensed for non-commercial use only (CC-BY-NC).