Multi-Modal AI Orchestration System
Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.
Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
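For orientation, the sketch below shows the general shape of that switching logic. It is illustrative only (service and profile names other than `vllm-qwen`/`text` are assumptions); the actual implementation lives in `model-orchestrator/orchestrator.py`.

```python
# Simplified sketch of the switching logic (illustrative only; the real
# implementation lives in model-orchestrator/orchestrator.py).
import subprocess

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# In the real service this registry is loaded from models.yaml.
# Service/profile names other than vllm-qwen / "text" are assumptions.
MODELS = {
    "qwen-2.5-7b":     {"service": "vllm-qwen", "port": 8001, "profile": "text"},
    "flux-schnell":    {"service": "flux",      "port": 8002, "profile": "image"},
    "musicgen-medium": {"service": "musicgen",  "port": 8003, "profile": "music"},
}
current_model = None

def switch_to(model: str) -> None:
    """Stop the active model container and start the requested one."""
    global current_model
    if current_model == model:
        return
    if current_model is not None:
        subprocess.run(["docker", "compose", "-f", "docker-compose.gpu.yaml",
                        "stop", MODELS[current_model]["service"]], check=True)
    target = MODELS[model]
    subprocess.run(["docker", "compose", "-f", "docker-compose.gpu.yaml",
                    "--profile", target["profile"], "up", "-d", target["service"]],
                   check=True)
    # (the real orchestrator also waits for the model's /health before proxying)
    current_model = model

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    switch_to(body["model"])  # make sure the right model is loaded first
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(
            f"http://localhost:{MODELS[body['model']]['port']}/v1/chat/completions",
            json=body,
        )
    return resp.json()
```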
Components
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
Features
✅ Automatic Model Switching - Orchestrator detects request type and loads the appropriate model
✅ OpenAI-Compatible APIs - Works with existing OpenAI clients and tools
✅ Cost-Optimized - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
✅ Easy Model Addition - Add new models by editing a YAML config
✅ Centralized Routing - LiteLLM proxy provides a unified API for all models
✅ GPU Memory Safe - Orchestrator ensures only one model is loaded at a time
Supported Model Types
Text Generation
- Qwen 2.5 7B Instruct (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
Image Generation
- Flux.1 Schnell (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
Music Generation
- MusicGen Medium (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
Quick Start
1. Prerequisites
# On RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
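A few quick checks on the GPU server can confirm these prerequisites:

```bash
# Optional sanity checks before deploying
nvidia-smi                # GPU visible with 24GB VRAM
docker compose version    # Docker Compose v2 available
tailscale status          # node is joined to the tailnet
```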
2. Clone & Configure
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
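A minimal `.env` might look like this (the token value is a placeholder):

```bash
# .env -- only HF_TOKEN is required for model downloads
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
```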
3. Deploy to RunPod
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
4. Test Deployment
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "flux-schnell",
"prompt": "a cute cat",
"size": "1024x1024"
}'
# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{
"model": "musicgen-medium",
"prompt": "upbeat electronic dance music",
"duration": 30
}'
5. Update VPS LiteLLM
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
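For reference, the orchestrator-backed models are registered in `litellm-config.yaml` as OpenAI-compatible backends. An illustrative excerpt (field values assumed; check the shipped file):

```yaml
# Illustrative excerpt of litellm-config.yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b                  # OpenAI-compatible backend
      api_base: http://100.100.108.13:9000/v1    # GPU orchestrator over Tailscale
      api_key: dummy
  - model_name: flux-schnell
    litellm_params:
      model: openai/flux-schnell
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
```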
Usage Examples
Via Open WebUI (https://ai.pivoine.art)
Text Generation:
- Select model: qwen-2.5-7b
- Type a message and send
- Orchestrator loads vLLM automatically
Image Generation:
- Select model: flux-schnell
- Enter an image prompt
- Orchestrator switches to Flux.1
Music Generation:
- Select model: musicgen-medium
- Describe the music you want
- Orchestrator switches to MusicGen
Via API (Direct)
import openai
# Configure client to use orchestrator
client = openai.OpenAI(
base_url="http://100.100.108.13:9000/v1",
api_key="dummy" # Not used but required
)
# Text generation
response = client.chat.completions.create(
model="qwen-2.5-7b",
messages=[{"role": "user", "content": "Write a haiku"}]
)
# Image generation
image = client.images.generate(
model="flux-schnell",
prompt="a sunset over mountains",
size="1024x1024"
)
# Music generation (custom endpoint)
import requests
music = requests.post(
"http://100.100.108.13:9000/v1/audio/generations",
json={
"model": "musicgen-medium",
"prompt": "calm piano music",
"duration": 30
}
)
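The exact response format of the audio endpoint is defined by the MusicGen server; assuming it returns raw audio bytes, the clip can be saved like this (adjust if the response is JSON with a URL or base64 payload):

```python
# Assumes the endpoint returns raw audio bytes; adjust if it returns JSON instead.
with open("output.wav", "wb") as f:
    f.write(music.content)
```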
Adding New Models
Step 1: Update models.yaml
# Add to ai/model-orchestrator/models.yaml
models:
llama-3.1-8b: # New model
type: text
framework: vllm
docker_service: vllm-llama
port: 8004
vram_gb: 17
startup_time_seconds: 120
endpoint: /v1/chat/completions
description: "Llama 3.1 8B Instruct - Meta's latest model"
Step 2: Add Docker Service
# Add to ai/docker-compose.gpu.yaml
services:
vllm-llama:
build: ./vllm
container_name: ai_vllm-llama_1
command: >
vllm serve meta-llama/Llama-3.1-8B-Instruct
--port 8000 --dtype bfloat16
ports:
- "8004:8000"
environment:
- HF_TOKEN=${HF_TOKEN}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles: ["text"]
restart: "no"
Step 3: Restart Orchestrator
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
That's it! The orchestrator automatically detects the new model.
Management Commands
Orchestrator
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
Manual Model Control
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
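The /health response includes at least the fields queried above; an illustrative payload (exact shape may differ):

```json
{
  "status": "healthy",
  "current_model": "flux-schnell",
  "model_info": {
    "vram_gb": 14,
    "port": 8002
  }
}
```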
Model Services
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
Monitoring
GPU Usage
ssh gpu-pivoine "nvidia-smi"
Model Status
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
Performance
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
Troubleshooting
Model Won't Load
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
Orchestrator Can't Connect
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
Model Switching Too Slow
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
docker run --rm -it \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  python:3.11-slim \
  bash -c "pip install -q 'huggingface_hub[cli]' && huggingface-cli download facebook/musicgen-medium"
File Structure
ai/
├── docker-compose.gpu.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
│
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
│
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
│
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
│
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
│
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
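For reference, the default entries in model-orchestrator/models.yaml presumably look like the following, reconstructed from the ports, VRAM, startup times, and endpoints documented above (framework and service names for Flux and MusicGen are assumptions):

```yaml
# Illustrative reconstruction of the default model registry
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
  flux-schnell:
    type: image
    framework: diffusers        # assumed
    docker_service: flux        # assumed from container name ai_flux_1
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
  musicgen-medium:
    type: music
    framework: audiocraft       # assumed
    docker_service: musicgen    # assumed from container name ai_musicgen_1
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```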
Cost Analysis
Current Setup (Single GPU)
- Provider: RunPod Spot Instance
- GPU: RTX 4090 24GB
- Cost: ~$0.50/hour
- Monthly: ~$360 (if running 24/7)
- Optimized: ~$120 (8 hours/day during business hours)
Alternative: Multi-GPU (All Models Always On)
- GPUs: 2× RTX 4090
- Cost: ~$0.75/hour
- Monthly: ~$540 (if running 24/7)
- Trade-off: No switching latency, +$180/month
Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
Model Switching: 30-120 seconds (unload current + load new)
Security Notes
- Orchestrator requires Docker socket access (/var/run/docker.sock)
- All services run on the private Tailscale network
- No public exposure (only via the VPS LiteLLM proxy)
- HuggingFace token stored in .env (not committed to git)
Future Enhancements
- ⏹️ Add Llama 3.1 8B for alternative text generation
- ⏹️ Add Whisper Large v3 for speech-to-text
- ⏹️ Add XTTS v2 for text-to-speech
- ⏹️ Implement model preloading/caching for faster switching
- ⏹️ Add usage metrics and cost tracking
- ⏹️ Auto-stop GPU pod during idle periods
Support
For issues or questions:
- Check orchestrator logs: docker logs ai_orchestrator
- View model-specific logs: docker logs ai_<service>_1
- Test direct model access: curl http://localhost:<port>/health
- Review the GPU deployment log: GPU_DEPLOYMENT_LOG.md
License
Built with:
- vLLM - Apache 2.0
- AudioCraft - MIT (code), CC-BY-NC (weights)
- Flux.1 - Apache 2.0
- LiteLLM - MIT
Note: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.