# Multi-Modal AI Orchestration System

**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**

## Architecture Overview

This system provides a unified API for multiple AI model types, with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages the model lifecycle.

### Components

```
┌──────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79)                                  │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │ LiteLLM Proxy (Port 4000)                                │    │
│  │ Routes to: Claude API + GPU Orchestrator                 │    │
│  └──────────────────────────┬───────────────────────────────┘    │
└─────────────────────────────┼────────────────────────────────────┘
                              │ Tailscale VPN
┌─────────────────────────────┼────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13)                    │
│  ┌──────────────────────────▼───────────────────────────────┐    │
│  │ Orchestrator (Port 9000)                                 │    │
│  │ Manages sequential model loading based on request type   │    │
│  └─────┬─────────────────┬─────────────────┬────────────────┘    │
│        │                 │                 │                     │
│  ┌─────▼────────┐  ┌─────▼────────┐  ┌─────▼────────┐            │
│  │ vLLM         │  │ Flux.1       │  │ MusicGen     │            │
│  │ Qwen 2.5 7B  │  │ Schnell      │  │ Medium       │            │
│  │ Port: 8001   │  │ Port: 8002   │  │ Port: 8003   │            │
│  │ VRAM: 14GB   │  │ VRAM: 14GB   │  │ VRAM: 11GB   │            │
│  └──────────────┘  └──────────────┘  └──────────────┘            │
│                                                                   │
│ Only ONE model active at a time (sequential loading)             │
└──────────────────────────────────────────────────────────────────┘
```

### Features

✅ **Automatic Model Switching** - Orchestrator detects the request type and loads the appropriate model (see the sketch below)

✅ **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools

✅ **Cost-Optimized** - Sequential loading on a single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)

✅ **Easy Model Addition** - Add new models by editing a YAML config

✅ **Centralized Routing** - LiteLLM proxy provides a unified API for all models

✅ **GPU Memory Safe** - Orchestrator ensures only one model is loaded at a time
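
The last two points are what the switching loop boils down to: map the request path to a model service, stop whatever is currently loaded, start the target, and wait for its health check. A minimal sketch of that logic — illustrative only, not the actual `orchestrator.py`; profile names other than `text` are assumptions:

```python
# Minimal sketch of the orchestrator's switching idea (hypothetical, not the real orchestrator.py).
import subprocess
import time
import urllib.request

COMPOSE = ["docker", "compose", "-f", "docker-compose.gpu.yaml"]

# Request path -> model service (services/ports match the diagram above;
# the "image"/"audio" profile names are assumptions).
MODELS = {
    "/v1/chat/completions":   {"service": "vllm-qwen", "profile": "text",  "port": 8001},
    "/v1/images/generations": {"service": "flux",      "profile": "image", "port": 8002},
    "/v1/audio/generations":  {"service": "musicgen",  "profile": "audio", "port": 8003},
}

current = None  # only one model is allowed to be active at a time


def wait_healthy(port: int, timeout: int = 300) -> None:
    """Poll the model's /health endpoint until it answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=2)
            return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"model on port {port} never became healthy")


def ensure_model_for(path: str) -> dict:
    """Stop the active model if it differs, then start the target and wait for it."""
    global current
    target = MODELS[path]
    if current and current["service"] != target["service"]:
        subprocess.run(COMPOSE + ["stop", current["service"]], check=True)
        current = None
    if current is None:
        subprocess.run(COMPOSE + ["--profile", target["profile"], "up", "-d", target["service"]], check=True)
        wait_healthy(target["port"])
        current = target
    return target
```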

## Supported Model Types

### Text Generation

- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
  - VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API

### Image Generation

- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
  - VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E-compatible API

### Music Generation

- **MusicGen Medium** (facebook/musicgen-medium)
  - VRAM: 11GB | Speed: 60-90 sec for 30s of audio | Custom audio API

## Quick Start

### 1. Prerequisites

On the RunPod GPU server:

- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)

### 2. Clone & Configure

```bash
# On local machine
cd ai/

# Create environment file
cp .env.example .env

# Edit .env and add your HF_TOKEN
```

### 3. Deploy to RunPod

```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/

# SSH to GPU server
ssh gpu-pivoine

# Navigate to project
cd /workspace/ai/

# Start orchestrator (always running)
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# Orchestrator will automatically manage model services as needed
```

### 4. Test Deployment

```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health

# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-schnell",
    "prompt": "a cute cat",
    "size": "1024x1024"
  }'

# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "musicgen-medium",
    "prompt": "upbeat electronic dance music",
    "duration": 30
  }'
```

### 5. Update VPS LiteLLM

```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
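
Once LiteLLM has the new routes, clients can talk to the proxy on the VPS instead of the GPU host directly. A short sketch, assuming the orchestrator's models are registered in `litellm-config.yaml` under the same names used above (port 4000 is from the diagram; the API key handling depends on your LiteLLM setup):

```python
# Go through the LiteLLM proxy on the VPS rather than the GPU host directly.
# Assumes the models are registered in litellm-config.yaml under these names.
import openai

client = openai.OpenAI(
    base_url="http://100.102.217.79:4000/v1",  # LiteLLM proxy (port 4000, see diagram)
    api_key="sk-anything",                     # depends on your LiteLLM key configuration
)

# The proxy decides whether the request goes to the Claude API or the GPU orchestrator.
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Ping through the proxy"}],
)
print(response.choices[0].message.content)
```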

## Usage Examples

### Via Open WebUI (https://ai.pivoine.art)

**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type a message and send
3. The orchestrator loads vLLM automatically

**Image Generation:**
1. Select model: `flux-schnell`
2. Enter an image prompt
3. The orchestrator switches to Flux.1

**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. The orchestrator switches to MusicGen

### Via API (Direct)

```python
import openai

# Configure client to use the orchestrator
client = openai.OpenAI(
    base_url="http://100.100.108.13:9000/v1",
    api_key="dummy"  # Not used but required
)

# Text generation
response = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku"}]
)

# Image generation
image = client.images.generate(
    model="flux-schnell",
    prompt="a sunset over mountains",
    size="1024x1024"
)

# Music generation (custom endpoint)
import requests
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={
        "model": "musicgen-medium",
        "prompt": "calm piano music",
        "duration": 30
    }
)
```
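
What you do with the responses depends on how each backend returns its payload, which this README does not pin down. A sketch under the assumption that the image endpoint follows the DALL-E-style `b64_json`/`url` convention and the audio endpoint returns raw audio bytes:

```python
# Assumptions: images come back DALL-E-style (b64_json or url); audio comes back as raw bytes.
import base64

datum = image.data[0]
if getattr(datum, "b64_json", None):
    with open("sunset.png", "wb") as f:
        f.write(base64.b64decode(datum.b64_json))
else:
    print("Image URL:", datum.url)

# `music` is the requests.Response from the audio endpoint above.
with open("track.wav", "wb") as f:
    f.write(music.content)
```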

## Adding New Models

### Step 1: Update `models.yaml`

```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:                    # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
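
Before restarting the orchestrator, it can be worth sanity-checking the new entry. A small sketch, assuming the `models:` layout above, PyYAML installed, and the 24GB card from the architecture section:

```python
# Quick sanity check for models.yaml entries (assumes the layout shown above; requires PyYAML).
import yaml

GPU_VRAM_GB = 24  # RTX 4090

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)["models"]

for name, spec in registry.items():
    missing = [k for k in ("type", "docker_service", "port", "vram_gb", "endpoint") if k not in spec]
    if missing:
        print(f"{name}: missing keys {missing}")
    elif spec["vram_gb"] > GPU_VRAM_GB:
        print(f"{name}: needs {spec['vram_gb']}GB VRAM, more than the {GPU_VRAM_GB}GB card")
    else:
        print(f"{name}: ok ({spec['vram_gb']}GB, port {spec['port']})")
```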

### Step 2: Add Docker Service

```yaml
# Add to ai/docker-compose.gpu.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```

### Step 3: Restart Orchestrator

```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f docker-compose.gpu.yaml restart orchestrator
```

**That's it!** The orchestrator automatically detects the new model.

## Management Commands

### Orchestrator

```bash
# Start orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator

# View orchestrator logs
docker logs -f ai_orchestrator

# Restart orchestrator
docker compose -f docker-compose.gpu.yaml restart orchestrator

# Check active model
curl http://100.100.108.13:9000/health

# List all models
curl http://100.100.108.13:9000/models
```

### Manual Model Control

```bash
# Manually switch to a specific model
curl -X POST http://100.100.108.13:9000/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell"}'

# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
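
The same two endpoints are easy to wrap when switching needs to happen from a script rather than the shell. A small sketch using the `/switch` and `/health` routes shown above — the JSON field name comes from the `jq` query, and the assumption is that `/health` reports the new model once the switch completes:

```python
# Thin wrapper around the orchestrator's /switch and /health endpoints.
import time
import requests

ORCHESTRATOR = "http://100.100.108.13:9000"


def current_model() -> str | None:
    """Return the currently loaded model, as reported by /health."""
    return requests.get(f"{ORCHESTRATOR}/health", timeout=10).json().get("current_model")


def switch_model(name: str, timeout_s: int = 180) -> None:
    """Ask the orchestrator to switch, then wait until /health reports the new model."""
    requests.post(f"{ORCHESTRATOR}/switch", json={"model": name}, timeout=10).raise_for_status()
    deadline = time.time() + timeout_s
    while current_model() != name:
        if time.time() > deadline:
            raise TimeoutError(f"switch to {name} did not complete in {timeout_s}s")
        time.sleep(5)  # switching takes 30-120 seconds


if __name__ == "__main__":
    switch_model("flux-schnell")
    print("active model:", current_model())
```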

### Model Services

```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen

# Stop a model
docker compose -f docker-compose.gpu.yaml stop vllm-qwen

# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Monitoring

### GPU Usage

```bash
ssh gpu-pivoine "nvidia-smi"
```

### Model Status

```bash
# Which model is active?
curl http://100.100.108.13:9000/health

# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
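
For a rough live view during testing, the same `/health` payload can be polled from a loop. A sketch that reuses the `current_model` and `model_info.vram_gb` fields queried with `jq` above:

```python
# Poll the orchestrator's /health endpoint and print model changes.
import time
import requests

HEALTH_URL = "http://100.100.108.13:9000/health"

last = None
while True:
    info = requests.get(HEALTH_URL, timeout=10).json()
    model = info.get("current_model")
    vram = (info.get("model_info") or {}).get("vram_gb")
    if model != last:
        print(f"{time.strftime('%H:%M:%S')}  active model: {model} ({vram}GB VRAM)")
        last = model
    time.sleep(30)
```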

### Performance

```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator

# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```

## Troubleshooting

### Model Won't Load

```bash
# Check orchestrator logs
docker logs ai_orchestrator

# Check if the model service exists
docker compose -f docker-compose.gpu.yaml config | grep -A 10 "vllm-qwen"

# Manually test the model service
docker compose -f docker-compose.gpu.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
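
When it is unclear which layer is failing, probing each service's health port from the GPU host narrows it down quickly. A sketch using the ports from the architecture diagram — only the currently loaded model is expected to answer:

```python
# Probe the orchestrator and each model port on the GPU host (ports from the diagram above).
import urllib.error
import urllib.request

PORTS = {9000: "orchestrator", 8001: "vLLM (Qwen)", 8002: "Flux.1", 8003: "MusicGen"}

for port, name in PORTS.items():
    url = f"http://localhost:{port}/health"
    try:
        urllib.request.urlopen(url, timeout=3)
        print(f"{name:15s} port {port}: up")
    except (urllib.error.URLError, OSError):
        print(f"{name:15s} port {port}: no response (expected for unloaded models)")
```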

### Orchestrator Can't Connect

```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock

# Restart Docker daemon
sudo systemctl restart docker

# Rebuild orchestrator
docker compose -f docker-compose.gpu.yaml build orchestrator
docker compose -f docker-compose.gpu.yaml up -d orchestrator
```

### Model Switching Too Slow

```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed

# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
  -v /workspace/huggingface_cache:/cache \
  -e HF_HOME=/cache \
  nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
  huggingface-cli download facebook/musicgen-medium
```
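
If the container route is awkward, the same pre-caching can be done from any Python environment on the GPU host with `huggingface_hub`. A sketch assuming the package is installed (`pip install huggingface_hub`) and that the services read the hub cache under the `HF_HOME` directory mounted above:

```python
# Pre-download model weights into the shared HF cache so first startup skips the download.
import os

from huggingface_hub import snapshot_download

# Hub cache under HF_HOME=/workspace/huggingface_cache (directory layout is an assumption).
CACHE_DIR = "/workspace/huggingface_cache/hub"

for repo in ("Qwen/Qwen2.5-7B-Instruct", "black-forest-labs/FLUX.1-schnell", "facebook/musicgen-medium"):
    path = snapshot_download(repo_id=repo, cache_dir=CACHE_DIR, token=os.environ.get("HF_TOKEN"))
    print(f"{repo} cached at {path}")
```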

## File Structure

```
ai/
├── docker-compose.gpu.yaml      # Main orchestration file
├── .env.example                 # Environment template
├── README.md                    # This file
│
├── model-orchestrator/          # Central orchestrator service
│   ├── orchestrator.py          # FastAPI app managing models
│   ├── models.yaml              # Model registry (EDIT TO ADD MODELS)
│   ├── Dockerfile
│   └── requirements.txt
│
├── vllm/                        # Text generation (vLLM)
│   ├── server.py                # Qwen 2.5 7B server
│   ├── Dockerfile
│   └── requirements.txt
│
├── flux/                        # Image generation (Flux.1 Schnell)
│   └── config/
│       └── config.json          # Flux configuration
│
├── musicgen/                    # Music generation (MusicGen)
│   ├── server.py                # MusicGen API server
│   ├── Dockerfile
│   └── requirements.txt
│
├── litellm-config.yaml          # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md        # Deployment history and notes
```

## Cost Analysis

### Current Setup (Single GPU)

- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)

### Alternative: Multi-GPU (All Models Always On)

- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
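
The monthly figures above follow directly from the hourly rates (assuming a 30-day month):

```python
# How the monthly cost figures above are derived (30-day month assumed).
single_gpu_hourly, multi_gpu_hourly = 0.50, 0.75

always_on_single = single_gpu_hourly * 24 * 30   # ~$360/month
business_hours   = single_gpu_hourly * 8 * 30    # ~$120/month (8h/day)
always_on_multi  = multi_gpu_hourly * 24 * 30    # ~$540/month

print(f"single GPU 24/7:   ${always_on_single:.0f}")
print(f"single GPU 8h/day: ${business_hours:.0f}")
print(f"multi GPU 24/7:    ${always_on_multi:.0f} (+${always_on_multi - always_on_single:.0f} vs single)")
```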

### Recommendation

Stick with single-GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.

## Performance Expectations

| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |

**Model Switching**: 30-120 seconds (unload current + load new)

## Security Notes

- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on a private Tailscale network
- No public exposure (only via the VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)

## Future Enhancements

1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop the GPU pod during idle periods

## Support

For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review the GPU deployment log: `GPU_DEPLOYMENT_LOG.md`

## License

Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). For commercial use, train your own models with the MIT-licensed code.