# AI Model Orchestrator for RunPod

A lightweight, process-based AI model orchestrator designed for single-GPU environments. It automatically switches between text, image, and audio generation models based on API requests, maximizing GPU utilization on resource-constrained hardware.

## Overview

This orchestrator solves a common problem: **running multiple large AI models on a single GPU**. It does this by:

- Loading models **sequentially** (one at a time) to fit within GPU memory constraints
- **Automatically switching** models based on incoming API requests
- Providing **OpenAI-compatible endpoints** for seamless integration
- Supporting streaming responses for real-time text generation

**Perfect for:**

- RunPod RTX 4090 instances (24GB VRAM)
- Cost-effective GPU servers
- Self-hosted AI infrastructure
- Integration with LiteLLM, Open WebUI, and other AI proxies

## Architecture

```
Client Request → LiteLLM Proxy → Orchestrator (port 9000) → Model Service (port 8000+)
                  (Optional)              ↓
                                    Model Registry
                                    (models.yaml)
                                          ↓
                                  Sequential Loading
                           (stops current, starts new)
```

### Key Components

- **Orchestrator** (`orchestrator_subprocess.py`): FastAPI proxy that manages the model lifecycle via Python subprocesses
- **Model Registry** (`models.yaml`): YAML configuration defining available models and their properties
- **Model Services**: individual Python servers (vLLM, OpenedAI-Images, AudioCraft), each running a specific model
- **Management Scripts**: Bash utilities for starting, stopping, and downloading models

### Why Process-Based?

Unlike the Docker-based version (`orchestrator.py`), the subprocess implementation:

- Works seamlessly in RunPod's containerized environment (no Docker-in-Docker complexity)
- Starts models faster (direct Python process spawning)
- Simplifies debugging (single process tree)
- Reduces overhead (no container management layer)

The sketch below illustrates the switching flow.

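The switching flow is conceptually simple. Here is a minimal sketch (illustrative only, not the actual `orchestrator_subprocess.py`; it assumes each model service answers HTTP on its configured port):

```python
import os
import subprocess
import time

import requests
import yaml

# Illustrative sketch of sequential model switching; structure and names are
# assumptions, not the real orchestrator_subprocess.py implementation.
with open("model-orchestrator/models.yaml") as f:
    MODELS = yaml.safe_load(f)

current = {"name": None, "proc": None}

def switch_model(name: str) -> None:
    """Stop the running model service, start the requested one, wait until ready."""
    if current["name"] == name:
        return  # requested model is already loaded
    if current["proc"] is not None:
        current["proc"].terminate()  # stop the old service first to free VRAM
        current["proc"].wait()
    cfg = MODELS[name]
    proc = subprocess.Popen(
        ["python3", cfg["service_script"]],
        env={**os.environ, "PORT": str(cfg["port"])},
    )
    # Poll the service until it answers or the startup budget is exhausted.
    deadline = time.time() + cfg["startup_time_seconds"]
    while time.time() < deadline:
        try:
            requests.get(f"http://localhost:{cfg['port']}/health", timeout=2)
            break
        except requests.RequestException:
            time.sleep(2)
    current["name"], current["proc"] = name, proc
```

On each incoming request, the orchestrator determines which model the request targets, runs the equivalent of `switch_model()`, and then proxies the request to that model's port.
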
## Available Models

### Text Generation (vLLM)

- **qwen-2.5-7b**: Qwen 2.5 7B Instruct (14GB VRAM, ~2min startup)
  - Fast text generation with streaming support
  - Endpoint: `/v1/chat/completions`

### Image Generation (OpenedAI-Images)

- **flux-schnell**: Flux.1 Schnell (14GB VRAM, ~1min startup)
  - High-quality images in 4-5 seconds
  - Endpoint: `/v1/images/generations`

### Music Generation (AudioCraft)

- **musicgen-medium**: MusicGen Medium (11GB VRAM, ~45s startup)
  - Text-to-music generation (60-90s for 30s audio)
  - Endpoint: `/v1/audio/generations`

## Installation

### Prerequisites

- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- Ubuntu/Debian-based system
- HuggingFace account with API token

### Quick Start (RunPod)

1. **Clone the repository:**

   ```bash
   cd /workspace
   git clone https://github.com/yourusername/runpod.git ai
   cd ai
   ```

2. **Configure environment:**

   ```bash
   cp .env.example .env
   # Edit .env and set:
   # - HF_TOKEN=your_huggingface_token
   # - PORT=9000 (orchestrator port)
   ```

3. **Run setup (installs dependencies, downloads models):**

   ```bash
   bash scripts/install.sh
   ```

4. **Start all services:**

   ```bash
   bash scripts/start-all.sh
   ```

5. **Verify orchestrator is running:**

   ```bash
   curl http://localhost:9000/health
   ```

### Manual Installation

If you prefer step-by-step setup:

```bash
# Install system dependencies via Ansible
ansible-playbook playbook.yml --tags base,python,dependencies

# Download model weights
bash scripts/download-models.sh

# Start orchestrator
python3 model-orchestrator/orchestrator_subprocess.py
```

## Usage

### Starting Services

```bash
# Start orchestrator + all models
bash scripts/start-all.sh

# Start orchestrator only (models start on-demand)
python3 model-orchestrator/orchestrator_subprocess.py
```

Services start in the background and automatically load models when requested.

### Stopping Services

```bash
# Stop all services
bash scripts/stop-all.sh

# Stop orchestrator only (leaves models running)
pkill -f orchestrator_subprocess.py
```

### Testing the API

**Text generation (streaming):**

```bash
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'
```

**Image generation:**

```bash
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
```

**Music generation:**

```bash
curl -X POST http://localhost:9000/v1/audio/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "musicgen-medium",
    "prompt": "Upbeat electronic dance music",
    "duration": 30
  }'
```

**List available models:**

```bash
curl http://localhost:9000/v1/models
```

**Health check:**

```bash
curl http://localhost:9000/health
```

## Integration with LiteLLM

The orchestrator is designed to work behind LiteLLM for unified API management.

**LiteLLM config (`litellm-config.yaml`):**

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN
      api_key: dummy
      stream: true
      timeout: 600
```

**Key configuration notes:**

- Use the `hosted_vllm/openai/` prefix for vLLM models served via the orchestrator
- Set `stream: true` to enable token-by-token streaming
- Use the Tailscale IP for secure VPN access from remote services

A sample request through the proxy is shown below.

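Once LiteLLM is running, clients call the proxy exactly as they would any OpenAI-compatible endpoint. The example below assumes LiteLLM's default port 4000 and uses a placeholder API key; substitute your own settings:

```bash
curl -s -N -X POST http://localhost:4000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-your-litellm-key' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```
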
## Configuration

### Model Registry (`model-orchestrator/models.yaml`)

Models are defined in YAML with these properties:

```yaml
qwen-2.5-7b:
  type: text                      # text, image, or audio
  framework: vllm                 # vllm, openedai-images, audiocraft
  service_script: models/vllm/server.py
  port: 8000
  vram_gb: 14                     # GPU memory requirement
  startup_time_seconds: 120       # Max time to wait for health check
  endpoint: /v1/chat/completions  # API endpoint path
  description: "Human-readable description"
```

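Before restarting the orchestrator, it can be useful to sanity-check the registry. A small illustrative snippet (the required-key list is an assumption based on the fields above, not something the project ships):

```python
import yaml

# Keys every entry is expected to define, inferred from the example above.
REQUIRED_KEYS = {
    "type", "framework", "service_script",
    "port", "vram_gb", "startup_time_seconds", "endpoint",
}

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)

for name, cfg in registry.items():
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"{name} is missing keys: {sorted(missing)}")
    print(f"{name}: {cfg['type']} on port {cfg['port']} ({cfg['vram_gb']} GB VRAM)")
```
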
### Adding New Models

1. **Add a model definition to `models.yaml`:**

   ```yaml
   llama-3.1-8b:
     type: text
     framework: vllm
     service_script: models/vllm/server_llama.py
     port: 8004
     vram_gb: 17
     startup_time_seconds: 120
     endpoint: /v1/chat/completions
     description: "Llama 3.1 8B Instruct"
   ```

2. **Create the model service script** (`models/vllm/server_llama.py`):

   ```python
   # Thin wrapper that launches vLLM's OpenAI-compatible server for this model.
   import os
   import sys

   model = "meta-llama/Llama-3.1-8B-Instruct"
   port = os.getenv("PORT", "8004")

   # Starting the server through its CLI entrypoint keeps the wrapper independent
   # of vLLM-internal function signatures.
   os.execv(sys.executable, [
       sys.executable, "-m", "vllm.entrypoints.openai.api_server",
       "--model", model,
       "--port", port,
   ])
   ```

3. **Download model weights:**

   ```bash
   bash scripts/download-models.sh
   ```

4. **Restart the orchestrator:**

   ```bash
   bash scripts/stop-all.sh && bash scripts/start-all.sh
   ```

## Deployment Options

### Option 1: RunPod (Current Setup)

Designed for RunPod's containerized environment:

- Services run as background processes in `/workspace/ai/`
- Managed by `start-all.sh` and `stop-all.sh`
- No systemd (RunPod containers don't support it)

### Option 2: Dedicated VPS with systemd

For production VPS deployment:

1. **Install systemd service:**

   ```bash
   sudo cp systemd/ai-orchestrator.service /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable ai-orchestrator
   sudo systemctl start ai-orchestrator
   ```

2. **Check logs:**

   ```bash
   sudo journalctl -u ai-orchestrator -f
   ```

### Option 3: Tailscale VPN Integration

Connect the RunPod GPU to your VPS infrastructure:

1. **On the RunPod instance:**

   ```bash
   # Install Tailscale
   curl -fsSL https://tailscale.com/install.sh | sh

   # Connect to VPN
   tailscale up --advertise-tags=tag:gpu
   ```

2. **On the VPS (configure LiteLLM):**

   ```yaml
   # Use the Tailscale IP in litellm-config.yaml
   api_base: http://100.121.199.88:9000/v1
   ```

3. **Benefits:**

   - Secure encrypted tunnel (no public exposure)
   - Low-latency direct connection
   - Access the GPU from any device on the Tailscale network

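After the pod has joined the tailnet, a quick connectivity check from the VPS looks like this (illustrative; replace the IP with your pod's Tailscale address):

```bash
# Confirm the GPU node is visible on the tailnet
tailscale status

# Hit the orchestrator's health endpoint over the VPN
curl http://100.121.199.88:9000/health
```
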
## API Endpoints

### Orchestrator Management

- `GET /health` - Health check with current model status
- `GET /v1/models` - OpenAI-compatible models list
- `POST /switch` - Manually switch to a specific model (see the example below)

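For example, to pre-load the image model before a batch of requests (the request body shown here is a hypothetical shape; check `orchestrator_subprocess.py` for the exact schema):

```bash
curl -X POST http://localhost:9000/switch \
  -H 'Content-Type: application/json' \
  -d '{"model": "flux-schnell"}'
```
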
### Model Endpoints (Proxied)

The orchestrator automatically routes requests based on the endpoint and model name:

- `POST /v1/chat/completions` - Text generation (text models)
- `POST /v1/images/generations` - Image generation (image models)
- `POST /v1/audio/generations` - Music generation (audio models)

**Streaming Support:**

- Set `"stream": true` in the request body
- Returns a Server-Sent Events (SSE) stream
- Works through the LiteLLM proxy (with correct configuration)

A minimal streaming client is sketched below.

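Because the endpoints are OpenAI-compatible, the official `openai` Python SDK can consume the stream directly from the orchestrator (a minimal sketch; the `dummy` API key mirrors the LiteLLM config above):

```python
from openai import OpenAI

# Point the SDK at the orchestrator instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)

# Print tokens as they arrive from the SSE stream.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
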
## Troubleshooting

### Streaming Returns Empty Deltas

**Symptom:** LiteLLM shows streaming chunks, but the content is empty:

```json
{"choices":[{"delta":{"content":""},"index":0}]}
```

**Solution:** Use the correct model path format in LiteLLM:

```yaml
# ✅ Correct
model: hosted_vllm/openai/qwen-2.5-7b

# ❌ Wrong
model: openai/qwen-2.5-7b
```

### Model Fails to Start

**Check processes and logs:**

```bash
# RunPod
ps aux | grep orchestrator
ps aux | grep vllm

# VPS with systemd
sudo journalctl -u ai-orchestrator -n 50
```

**Common issues:**

- Out of GPU memory: check VRAM usage with `nvidia-smi`
- Missing model weights: run `bash scripts/download-models.sh`
- Port conflicts: check whether the port is already in use with `lsof -i :9000`

### Orchestrator Not Responding

**Test model directly (bypass orchestrator):**

```bash
# Start vLLM service directly
cd /workspace/ai
python3 models/vllm/server.py

# Test on port 8000
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hi"}]}'
```

### HuggingFace Download Errors

**Symptom:** "Repository not found" or "Authentication required"

**Solution:**

```bash
# Check HF_TOKEN is set
echo $HF_TOKEN

# Set token manually
export HF_TOKEN=your_token_here

# Re-run download
bash scripts/download-models.sh
```

## Project Structure

```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py   # Main orchestrator (process-based)
│   ├── orchestrator.py              # Alternative Docker-based version
│   └── models.yaml                  # Model registry configuration
├── models/
│   ├── vllm/
│   │   ├── server.py                # vLLM text generation service
│   │   └── requirements.txt
│   ├── flux/
│   │   ├── server.py                # Flux image generation service
│   │   └── requirements.txt
│   └── musicgen/
│       ├── server.py                # MusicGen audio generation service
│       └── requirements.txt
├── scripts/
│   ├── install.sh                   # Setup script (Ansible + downloads)
│   ├── start-all.sh                 # Start orchestrator + models
│   ├── stop-all.sh                  # Stop all services
│   └── download-models.sh           # Download model weights from HuggingFace
├── systemd/
│   └── ai-orchestrator.service      # systemd service file (for VPS)
├── playbook.yml                     # Ansible playbook for system setup
├── inventory.yml                    # Ansible inventory
├── .env.example                     # Environment variables template
└── README.md                        # This file
```

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (qwen-2.5-7b on RTX 4090)
- **Image generation:** 4-5 seconds per image (Flux Schnell)
- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium)

## Future Enhancements

- [ ] Parallel model loading (for multi-GPU setups)
- [ ] Model preloading based on usage patterns
- [ ] Prometheus metrics exporter
- [ ] Web dashboard for model management
- [ ] Support for quantized models (lower VRAM requirements)
- [ ] Add Whisper (speech-to-text)
- [ ] Add XTTS-v2 (text-to-speech with voice cloning)

## License

MIT License - See LICENSE file for details.

## Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Test your changes on RunPod
4. Submit a pull request

## Support

- **Issues:** https://github.com/yourusername/runpod/issues
- **Discussions:** https://github.com/yourusername/runpod/discussions
- **Documentation:** This README + inline code comments

## Acknowledgments

Built with:

- [vLLM](https://github.com/vllm-project/vllm) - Fast LLM inference
- [OpenedAI-Images](https://github.com/matatonic/openedai-images) - OpenAI-compatible image generation
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - Music generation
- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework
- [LiteLLM](https://github.com/BerriAI/litellm) - Universal LLM proxy