# RunPod Multi-Modal AI Stack
**Cost-optimized GPU deployment for text, image, and music generation on a RunPod RTX 4090.**

This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.
## Features
- **Text Generation**: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- **Image Generation**: Flux.1 Schnell (~4-5 seconds per image)
- **Music Generation**: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- **Automatic Model Switching**: Intelligent orchestrator manages sequential model loading
- **OpenAI-Compatible APIs**: Works with existing AI tools and clients
- **Easy Model Addition**: Just edit `model-orchestrator/models.yaml` to add new models
- **Template Support**: Create reusable templates for 2-3 minute deployments (vs 60-90 minutes)
## Quick Start
### Option 1: Deploy from Template (Recommended)
If you've already created a RunPod template:
1. Deploy pod from template in RunPod dashboard
2. SSH to the pod
3. Create `.env` file with your credentials
4. Start orchestrator: `docker compose -f compose.yaml up -d orchestrator`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for template usage instructions.
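
A minimal session sketch of steps 3-4 (the `.env` keys shown are placeholders; substitute whatever credentials your services actually read, e.g. a Hugging Face token for gated model downloads):

```bash
cd /workspace/ai

# Placeholder credential names -- use the variables your services expect
cat > .env <<'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
EOF

# Start only the orchestrator; it brings model services up on demand
docker compose -f compose.yaml up -d orchestrator
```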
### Option 2: Fresh Deployment
For first-time setup on a new RunPod instance:
1. Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
2. SSH to GPU server: `ssh gpu-server`
3. Run preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed deployment guide.
## Architecture
```
VPS (LiteLLM Proxy)
        ↓ Tailscale VPN
GPU Server (Orchestrator, Port 9000)
        ├── vLLM (Qwen 2.5 7B)  - Port 8001
        ├── Flux.1 Schnell      - Port 8002
        └── MusicGen Medium     - Port 8003
```
All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time for cost optimization (~$0.50/hr vs ~$0.75/hr for multi-GPU).
## Cost Analysis
**RunPod RTX 4090 Spot Instance**:
- **Hourly**: ~$0.50
- **Monthly (24/7)**: ~$360 ($0.50 × 720 h)
- **Monthly (8 hr/day)**: ~$120 ($0.50 × 240 h)

**Template Benefits**:
- **Without Template**: 60-90 minutes of setup after every Spot restart
- **With Template**: 2-3 minutes to deploy
- **Spot Restart Frequency**: 2-5 times per week (variable)
## Documentation
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Complete deployment and usage guide
- **[docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md)** - Template creation and usage
- **[docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md)** - Deployment history and technical notes
### Architecture Components
- `model-orchestrator/` - FastAPI orchestrator managing model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts
## Creating a RunPod Template
**Why create a template?**
- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration

**How to create:**
1. Run `scripts/prepare-template.sh` on a fresh RunPod instance
2. Wait 45-60 minutes for models to download and images to build
3. Save pod as template in RunPod dashboard
4. Name the template, e.g. `multi-modal-ai-v1.0`

**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for step-by-step guide.
## Adding New Models
Adding models is easy! Just edit `model-orchestrator/models.yaml`:
```yaml
models:
  llama-3.1-8b:            # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```
Then add the Docker service to `compose.yaml` and restart the orchestrator.

**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#adding-new-models) for complete instructions.
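
After wiring in the new service, a quick sanity check might look like this (the service name `vllm-llama` matches the YAML above; the container name `ai_orchestrator` is the one used in the Monitoring section):

```bash
# Confirm compose.yaml still parses and the new service is registered
docker compose -f compose.yaml config --services | grep vllm-llama

# Restart the orchestrator so it re-reads models.yaml
docker compose -f compose.yaml restart orchestrator
docker logs --tail 20 ai_orchestrator
```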
## Usage Examples
### Text Generation
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
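
Because the endpoint is OpenAI-compatible, the reply text can be pulled out with `jq`, assuming the standard `choices[0].message.content` response shape:

```bash
curl -s http://100.100.108.13:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'
```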
### Image Generation
```bash
curl http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
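
If the response follows the OpenAI images schema with a base64 payload (an assumption; the service may return a URL under `.data[0].url` instead), it can be decoded straight to a file:

```bash
curl -s http://100.100.108.13:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}' \
  | jq -r '.data[0].b64_json' | base64 -d > cat.png
```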
### Music Generation
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
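
Generation takes 60-90 seconds for 30 seconds of audio, so allow a generous timeout. Assuming the endpoint streams the audio bytes directly (if it returns JSON instead, extract the payload as in the examples above):

```bash
curl -s --max-time 300 http://100.100.108.13:9000/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}' \
  -o upbeat.wav
```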
## Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090, 24 GB VRAM
- **Region**: Europe
- **Network**: Tailscale VPN (100.100.108.13)
- **Storage**: 922 GB network volume at `/workspace`
## Monitoring
```bash
# Check active model
curl http://100.100.108.13:9000/health

# View orchestrator logs
docker logs -f ai_orchestrator

# GPU usage
nvidia-smi
```
## Support
For issues:
1. Check orchestrator logs: `docker logs ai_orchestrator`
2. Review [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#troubleshooting)
3. Check [docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md) for deployment history
## Licenses
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT

**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC).