# Multi-Modal AI Orchestration System
**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components
```
┌──────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ LiteLLM Proxy (Port 4000)                                  │  │
│  │ Routes to: Claude API + GPU Orchestrator                   │  │
│  └────────────────────┬───────────────────────────────────────┘  │
└───────────────────────┼──────────────────────────────────────────┘
                        │ Tailscale VPN
┌───────────────────────┼──────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                    │
│  ┌────────────────────▼───────────────────────────────────────┐  │
│  │ Orchestrator (Port 9000)                                   │  │
│  │ Manages sequential model loading based on request type     │  │
│  └─────┬──────────────┬──────────────────┬────────────────────┘  │
│        │              │                  │                       │
│  ┌─────▼──────┐  ┌────▼────────┐  ┌──────▼───────┐               │
│  │vLLM        │  │Flux.1       │  │MusicGen      │               │
│  │Qwen 2.5 7B │  │Schnell      │  │Medium        │               │
│  │Port: 8001  │  │Port: 8002   │  │Port: 8003    │               │
│  │VRAM: 14GB  │  │VRAM: 14GB   │  │VRAM: 11GB    │               │
│  └────────────┘  └─────────────┘  └──────────────┘               │
│                                                                  │
│ Only ONE model active at a time (sequential loading)             │
└──────────────────────────────────────────────────────────────────┘
```
### Features
- **Automatic Model Switching** - Orchestrator detects the request type and loads the appropriate model (see the sketch below)
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing a YAML config
- **Centralized Routing** - LiteLLM proxy provides a unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time
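Conceptually, a model switch boils down to stopping the active compose service and starting the one mapped to the requested model. A highly simplified sketch of that flow (the real logic lives in `model-orchestrator/orchestrator.py` and reads `models.yaml`; the profile and service names for Flux and MusicGen are assumptions here):

```python
import subprocess

# Model name -> (compose profile, service). The "text" profile and the
# vllm-qwen service appear later in this guide; the other two are assumed.
MODELS = {
    "qwen-2.5-7b":     ("text",  "vllm-qwen"),
    "flux-schnell":    ("image", "flux"),
    "musicgen-medium": ("music", "musicgen"),
}

current_model = None

def switch_to(model: str) -> None:
    """Stop the currently loaded model service, then start the requested one."""
    global current_model
    if model == current_model:
        return  # already loaded, nothing to do
    if current_model is not None:
        _, service = MODELS[current_model]
        subprocess.run(["docker", "compose", "stop", service], check=True)
    profile, service = MODELS[model]
    subprocess.run(
        ["docker", "compose", "--profile", profile, "up", "-d", service],
        check=True,
    )
    current_model = model
```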
## Supported Model Types
### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
  - VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
  - VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
  - VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start
### 1. Prerequisites
On the RunPod GPU server you need (a quick verification sketch follows this list):

- A RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- A HuggingFace token (for model downloads)
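One way to confirm these prerequisites on the GPU server (standard commands, shown as a sketch):

```bash
# Verify GPU, Docker Compose, and Tailscale are in place
nvidia-smi --query-gpu=name,memory.total --format=csv   # expect an RTX 4090 with ~24GB
docker compose version                                   # Compose v2 plugin
tailscale ip -4                                          # should print this node's Tailscale address
```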
### 2. Clone & Configure
```bash
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
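A minimal `.env` sketch - this guide only references `HF_TOKEN`; check `.env.example` for any additional variables:

```bash
# .env (do not commit; the token below is a placeholder)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```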
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f compose.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
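For reference, the GPU models are exposed to LiteLLM as OpenAI-compatible endpoints that point at the orchestrator. A hypothetical excerpt of `litellm-config.yaml` (verify against the actual file in this repo):

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: openai/qwen-2.5-7b
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
  - model_name: flux-schnell
    litellm_params:
      model: openai/flux-schnell
      api_base: http://100.100.108.13:9000/v1
      api_key: dummy
```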
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**

1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically

**Image Generation:**

1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1

**Music Generation:**

1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai

# Configure client to use orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests

music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                    # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/compose.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f compose.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
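To confirm the orchestrator picked it up, list the registered models (the same `/models` endpoint appears in the management commands below):

```bash
curl http://100.100.108.13:9000/models | jq
```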
## Management Commands
### Orchestrator
```bash
# Start orchestrator
docker compose -f compose.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f compose.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control
```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f compose.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f compose.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f compose.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f compose.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f compose.yaml build orchestrator
docker compose -f compose.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
# (note: the plain CUDA runtime image does not ship huggingface-cli;
#  use an image that includes it, or install huggingface_hub first)
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  huggingface-cli download facebook/musicgen-medium
```
## File Structure
```
ai/
├── compose.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
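The monthly figures above are simply hourly rate × hours per day × ~30 days; a quick sanity check in Python:

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Approximate monthly GPU cost in USD."""
    return hourly_rate * hours_per_day * days

print(monthly_cost(0.50, 24))  # ~360 USD: single RTX 4090, 24/7
print(monthly_cost(0.50, 8))   # ~120 USD: single GPU, business hours only
print(monthly_cost(0.75, 24))  # ~540 USD: 2x RTX 4090, 24/7
```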
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`) - see the compose excerpt after this list
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
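For reference, the Docker socket access mentioned above is just a bind mount on the orchestrator service. A hypothetical excerpt of that service in `compose.yaml` (the port, build path, and container name follow this guide; the rest should be verified against the actual file):

```yaml
services:
  orchestrator:
    build: ./model-orchestrator
    container_name: ai_orchestrator
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # lets the orchestrator start/stop model services
    restart: unless-stopped
```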
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support
For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.