# RunPod Multi-Modal AI Stack

**Cost-optimized GPU deployment for text, image, and music generation on a RunPod RTX 4090.**

This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.

## Features

- **Text Generation**: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- **Image Generation**: Flux.1 Schnell (~4-5 seconds per image)
- **Music Generation**: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- **Automatic Model Switching**: Intelligent orchestrator manages sequential model loading
- **OpenAI-Compatible APIs**: Works with existing AI tools and clients
- **Easy Model Addition**: Edit `model-orchestrator/models.yaml` to add new models
- **Template Support**: Create reusable templates for 2-3 minute deployments (vs 60-90 minutes)

## Quick Start

### Option 1: Deploy from Template (Recommended)

If you've already created a RunPod template:

1. Deploy a pod from the template in the RunPod dashboard
2. SSH into the pod
3. Create a `.env` file with your credentials
4. Start the orchestrator: `docker compose -f docker-compose.gpu.yaml up -d orchestrator`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for template usage instructions.

### Option 2: Fresh Deployment

For first-time setup on a new RunPod instance:

1. Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
2. SSH into the GPU server: `ssh gpu-server`
3. Run the preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for the detailed deployment guide.

## Architecture

```
VPS (LiteLLM Proxy)
        ↓ Tailscale VPN
GPU Server (Orchestrator, Port 9000)
├── vLLM (Qwen 2.5 7B)  - Port 8001
├── Flux.1 Schnell      - Port 8002
└── MusicGen Medium     - Port 8003
```

All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time, which keeps the stack on a single GPU (~$0.50/hr instead of ~$0.75/hr for a multi-GPU setup).

## Cost Analysis

**RunPod RTX 4090 Spot Instance**:

- **Hourly**: ~$0.50
- **Monthly (24/7)**: ~$360
- **Monthly (8 hr/day)**: ~$120

**Template Benefits**:

- **Without template**: 60-90 minutes of setup per Spot restart
- **With template**: 2-3 minutes of deployment time
- **Spot restart frequency**: 2-5 times per week (variable)

## Documentation

- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Complete deployment and usage guide
- **[docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md)** - Template creation and usage
- **[docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md)** - Deployment history and technical notes

### Architecture Components

- `model-orchestrator/` - FastAPI orchestrator managing the model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts

## Creating a RunPod Template

**Why create a template?**

- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration

**How to create:**

1. Run `scripts/prepare-template.sh` on a fresh RunPod instance
2. Wait 45-60 minutes for models to download and images to build
3. Save the pod as a template in the RunPod dashboard (see the sanity check below first)
4. Name it `multi-modal-ai-v1.0`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for a step-by-step guide.
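Before saving the pod as a template, it helps to confirm that the model cache and images are actually in place. A minimal sanity check might look like the following sketch (the `/workspace/models` cache path and the image name patterns are assumptions; adjust them to match your layout):

```bash
# Cached model weights -- roughly 37GB total is expected across all models
# (the exact cache path is an assumption; adjust to your layout)
du -sh /workspace/models

# Pre-built Docker images for the orchestrator and model services
docker images | grep -E 'orchestrator|vllm|flux|musicgen'

# All services defined in the compose file should resolve cleanly
docker compose -f docker-compose.gpu.yaml config --services
```

If all three checks look right, the pod is ready to be saved as a template.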
## Adding New Models

Adding a model is a configuration change. Edit `model-orchestrator/models.yaml`:

```yaml
models:
  llama-3.1-8b:            # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```

Then add the matching Docker service to `docker-compose.gpu.yaml` and restart the orchestrator.

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#adding-new-models) for complete instructions.

## Usage Examples

### Text Generation

```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

### Image Generation

```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```

### Music Generation

```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```

## Infrastructure

- **Provider**: RunPod (Spot instance)
- **GPU**: NVIDIA RTX 4090, 24GB VRAM
- **Region**: Europe
- **Network**: Tailscale VPN (100.100.108.13)
- **Storage**: 922GB network volume at `/workspace`

## Monitoring

```bash
# Check the active model
curl http://100.100.108.13:9000/health

# View orchestrator logs
docker logs -f ai_orchestrator

# GPU usage
nvidia-smi
```

## Support

For issues:

1. Check the orchestrator logs: `docker logs ai_orchestrator`
2. Review [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#troubleshooting)
3. Check [docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md) for deployment history

## License

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: The MusicGen pre-trained weights are licensed for non-commercial use only (CC-BY-NC).