# RunPod Multi-Modal AI Stack
**Cost-optimized GPU deployment for text, image, and music generation on a RunPod RTX 4090.**

This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.
## Features
- **Text Generation**: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- **Image Generation**: Flux.1 Schnell (~4-5 seconds per image)
- **Music Generation**: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- **Automatic Model Switching**: Intelligent orchestrator manages sequential model loading
- **OpenAI-Compatible APIs**: Works with existing AI tools and clients
- **Easy Model Addition**: Just edit `model-orchestrator/models.yaml` to add new models
- **Template Support**: Create reusable templates for 2-3 minute deployments (vs 60-90 minutes)
## Quick Start
### Option 1: Deploy from Template (Recommended)
If you've already created a RunPod template:
1. Deploy pod from template in RunPod dashboard
2. SSH to the pod
3. Create `.env` file with your credentials
4. Start orchestrator: `docker compose -f compose.yaml up -d orchestrator`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for template usage instructions.
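
A minimal session sketch of steps 3-4 (the `.env` keys shown are placeholders; substitute whatever credentials your services actually read, e.g. a Hugging Face token for gated model downloads):

```bash
cd /workspace/ai

# Placeholder credential names -- use the variables your services expect
cat > .env <<'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
EOF

# Start only the orchestrator; it brings model services up on demand
docker compose -f compose.yaml up -d orchestrator
```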
### Option 2: Fresh Deployment
For first-time setup on a new RunPod instance:
1. Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
2. SSH to GPU server: `ssh gpu-server`
3. Run preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed deployment guide.
## Architecture
```
VPS (LiteLLM Proxy)
        ↓ Tailscale VPN
GPU Server (Orchestrator, Port 9000)
        ├── vLLM (Qwen 2.5 7B)  - Port 8001
        ├── Flux.1 Schnell      - Port 8002
        └── MusicGen Medium     - Port 8003
```
All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time for cost optimization (~$0.50/hr vs ~$0.75/hr for multi-GPU).
## Cost Analysis
**RunPod RTX 4090 Spot Instance**:
- **Hourly**: ~$0.50
- **Monthly (24/7)**: ~$360 ($0.50 × 720 h)
- **Monthly (8 hr/day)**: ~$120 ($0.50 × 240 h)

**Template Benefits**:
- **Without Template**: 60-90 minutes of setup after every Spot restart
- **With Template**: 2-3 minutes to deploy
- **Spot Restart Frequency**: 2-5 times per week (variable)
## Documentation
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Complete deployment and usage guide
- **[docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md)** - Template creation and usage
- **[docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md)** - Deployment history and technical notes
### Architecture Components
- `model-orchestrator/` - FastAPI orchestrator managing model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts
## Creating a RunPod Template
**Why create a template?**
- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration

**How to create:**
1. Run `scripts/prepare-template.sh` on a fresh RunPod instance
2. Wait 45-60 minutes for models to download and images to build
3. Save pod as template in RunPod dashboard
4. Name the template, e.g. `multi-modal-ai-v1.0`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for step-by-step guide.
## Adding New Models
Adding models is easy! Just edit `model-orchestrator/models.yaml`:
```yaml
models:
  llama-3.1-8b:            # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```
Then add the Docker service to `compose.yaml` and restart the orchestrator.

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#adding-new-models) for complete instructions.
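
After wiring in the new service, a quick sanity check might look like this (the service name `vllm-llama` matches the YAML above; the container name `ai_orchestrator` is the one used in the Monitoring section):

```bash
# Confirm compose.yaml still parses and the new service is registered
docker compose -f compose.yaml config --services | grep vllm-llama

# Restart the orchestrator so it re-reads models.yaml
docker compose -f compose.yaml restart orchestrator
docker logs --tail 20 ai_orchestrator
```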
## Usage Examples
### Text Generation
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
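
Because the endpoint is OpenAI-compatible, the reply text can be pulled out with `jq`, assuming the standard `choices[0].message.content` response shape:

```bash
curl -s http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'
```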
### Image Generation
```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
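
If the response follows the OpenAI images schema with a base64 payload (an assumption; the service may return a URL under `.data[0].url` instead), it can be decoded straight to a file:

```bash
curl -s http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}' \
  | jq -r '.data[0].b64_json' | base64 -d > cat.png
```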
### Music Generation
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
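
Generation takes 60-90 seconds for 30 seconds of audio, so allow a generous timeout. Assuming the endpoint streams the audio bytes directly (if it returns JSON instead, extract the payload as in the examples above):

```bash
curl -s --max-time 300 http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}' \
  -o upbeat.wav
```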
## Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090, 24 GB VRAM
- **Region**: Europe
- **Network**: Tailscale VPN (100.100.108.13)
- **Storage**: 922 GB network volume at `/workspace`
## Monitoring
```bash
# Check active model
curl http://100.100.108.13:9000/health

# View orchestrator logs
docker logs -f ai_orchestrator

# GPU usage
nvidia-smi
```
## Support
For issues:
1. Check orchestrator logs: `docker logs ai_orchestrator`
2. Review [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#troubleshooting)
3. Check [docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md) for deployment history
## Licenses
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC).