# RunPod Multi-Modal AI Architecture

**Clean, extensible distributed AI infrastructure spanning VPS and GPU**

## Design Principles

1. **Distributed** - VPS (UI/proxy) + GPU (models) connected via Tailscale
2. **No Docker on GPU** - Direct Python execution for RunPod compatibility
3. **Extensible** - Adding new models requires minimal code
4. **Maintainable** - Clear structure and separation of concerns
5. **Simple** - One command to start, easy to debug
6. **OpenAI Compatible** - Works with standard AI tools (see the client example below)
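
Because the orchestrator exposes an OpenAI-compatible API on port 9000 (see Request Flow below), standard clients should work unchanged. A minimal sketch using the official `openai` Python client; the host and model name are illustrative assumptions, not values from this repo:

```python
# Hedged example: the standard OpenAI client pointed at the orchestrator.
# Port 9000 comes from the request-flow section; host and model name are
# assumptions for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",  # orchestrator endpoint (assumed host)
    api_key="not-needed",                 # assuming no auth on the Tailscale net
)

resp = client.chat.completions.create(
    model="qwen2.5-7b",  # hypothetical registry name for the vLLM text service
    messages=[{"role": "user", "content": "Hello from the VPS!"}],
)
print(resp.choices[0].message.content)
```
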
## Directory Structure

```
runpod/
├── core/                      # Core infrastructure
│   ├── base_service.py        # Abstract base class for all services
│   ├── service_manager.py     # Process lifecycle management
│   └── requirements.txt       # Core dependencies
│
├── model-orchestrator/        # Request orchestration
│   ├── orchestrator.py        # Main orchestrator (process-based)
│   ├── models.yaml            # Model registry (simple config)
│   └── requirements.txt       # Orchestrator dependencies
│
├── models/                    # Model service implementations
│   ├── vllm/                  # Text generation
│   │   ├── server.py          # vLLM service (inherits base_service)
│   │   └── requirements.txt   # vLLM dependencies
│   │
│   ├── flux/                  # Image generation
│   │   ├── server.py          # Flux service
│   │   └── requirements.txt   # Flux dependencies
│   │
│   └── musicgen/              # Music generation
│       ├── server.py          # MusicGen service
│       └── requirements.txt   # AudioCraft dependencies
│
├── scripts/                   # Deployment & management
│   ├── install.sh             # Install all dependencies
│   ├── download-models.sh     # Pre-download models
│   ├── start-all.sh           # Start orchestrator + services
│   ├── stop-all.sh            # Stop all services
│   └── prepare-template.sh    # RunPod template preparation
│
├── systemd/                   # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                      # Documentation
    ├── ADDING_MODELS.md       # Guide for adding new models
    ├── DEPLOYMENT.md          # Deployment guide
    └── RUNPOD_TEMPLATE.md     # Template creation guide
```

## Component Responsibilities

### Core (`core/`)

- **base_service.py**: Abstract base class for all model services (sketched below)
  - Health check endpoint
  - Graceful shutdown
  - Logging configuration
  - Common utilities

- **service_manager.py**: Process lifecycle management (sketched below)
  - Start/stop services
  - Health monitoring
  - Auto-restart on failure
  - Resource cleanup
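
As orientation, here is a minimal sketch of what these two classes might look like; it is not the actual code in `core/`. It assumes FastAPI + uvicorn for `BaseService` (consistent with the model-service example later in this doc); everything beyond the names `BaseService`, `initialize`, `create_app`, and `run` is an assumption.

```python
# Hedged sketch of core/base_service.py, not the actual implementation.
import abc
import logging

import uvicorn
from fastapi import FastAPI


class BaseService(abc.ABC):
    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.log = logging.getLogger(name)  # shared logging configuration
        self.app = FastAPI(title=name)

        @self.app.get("/health")            # common health check endpoint
        async def health():
            return {"status": "ok", "service": self.name}

    @abc.abstractmethod
    async def initialize(self):
        """Load model weights; runs once before serving."""

    @abc.abstractmethod
    def create_app(self):
        """Register model-specific routes on self.app."""

    def run(self):
        logging.basicConfig(level=logging.INFO)
        self.create_app()
        self.app.add_event_handler("startup", self.initialize)
        # uvicorn installs SIGINT/SIGTERM handlers, giving graceful shutdown
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)
```

And the process-management side, assuming child services are plain Python processes exposing the `/health` route above:

```python
# Hedged sketch of core/service_manager.py, not the actual implementation.
import subprocess
import sys
import time

import requests


class ServiceManager:
    def __init__(self):
        self.processes: dict[str, subprocess.Popen] = {}

    def start(self, name: str, script: str, port: int, timeout: int = 120):
        """Spawn a service and block until its health check passes."""
        self.processes[name] = subprocess.Popen([sys.executable, script])
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"http://127.0.0.1:{port}/health", timeout=2).ok:
                    return
            except requests.ConnectionError:
                pass
            time.sleep(1)
        self.stop(name)  # resource cleanup on failed startup
        raise RuntimeError(f"{name} did not become healthy within {timeout}s")

    def stop(self, name: str):
        """Terminate a service, escalating to SIGKILL if needed."""
        proc = self.processes.pop(name, None)
        if proc is not None:
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
```
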

### Orchestrator (`model-orchestrator/`)

- **orchestrator.py**: Routes requests to the appropriate model
  - Reads `models.yaml` configuration
  - Manages model switching
  - Proxies requests to services
  - OpenAI-compatible API

- **models.yaml**: Simple model registry

  ```yaml
  models:
    model-name:
      type: text|image|audio
      service_script: path/to/server.py
      port: 8001
      startup_time: 120
      endpoint: /v1/chat/completions
  ```
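
For concreteness, the routing decision implied by this registry might look like the sketch below (not the actual `orchestrator.py`; assumes PyYAML):

```python
# Hedged sketch of the endpoint-to-model lookup driven by models.yaml.
import yaml  # PyYAML


def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    with open(path) as fh:
        return yaml.safe_load(fh)["models"]


def resolve_model(registry: dict, request_path: str) -> tuple[str, dict]:
    """Return (model_name, config) for the model whose endpoint matches."""
    for name, cfg in registry.items():
        if request_path == cfg["endpoint"]:
            return name, cfg
    raise KeyError(f"no model registered for endpoint {request_path}")
```
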

### Models (`models/`)

Each model directory contains:

- **server.py**: Service implementation (inherits `BaseService`)
- **requirements.txt**: Model-specific dependencies

Services are standalone and can run independently for testing.
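
For example, a quick smoke test (assuming the `/health` route from the `BaseService` sketch above and the vLLM service's assumed port 8001):

```python
# Hypothetical smoke test: poll the health endpoint of a service started
# in another terminal with `python3 models/vllm/server.py`.
import requests

print(requests.get("http://127.0.0.1:8001/health", timeout=5).json())
```
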
### Scripts (`scripts/`)
|
|
- **install.sh**: Install Python packages for all services
|
|
- **download-models.sh**: Pre-download models to `/workspace`
|
|
- **start-all.sh**: Start orchestrator (which manages model services)
|
|
- **stop-all.sh**: Graceful shutdown of all services
|
|
- **prepare-template.sh**: RunPod template preparation
|
|
|
|

## Adding a New Model (3 steps)

### 1. Create Model Service

```python
# models/mymodel/server.py
from pydantic import BaseModel

from core.base_service import BaseService


class MyRequest(BaseModel):
    prompt: str


class MyModelService(BaseService):
    def __init__(self):
        super().__init__(name="mymodel", port=8004)

    async def initialize(self):
        """Load the model."""
        self.model = load_my_model()  # placeholder for your loading code

    def create_app(self):
        """Define FastAPI routes."""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)


if __name__ == "__main__":
    service = MyModelService()
    service.run()
```

### 2. Add to Registry

```yaml
# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate
```

### 3. Add Dependencies

```
# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0
```

That's it! The orchestrator handles everything else: it starts the service on demand, waits for its health check, and proxies requests to it.

## Request Flow

```
Client Request
    ↓
Orchestrator (port 9000)
    ↓  (determines model from endpoint)
Model Service (ports 8001-800X)
    ↓
Response
```
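
The proxy hop in the middle can be small. A hedged sketch (assumes `httpx`; the hardcoded registry stands in for the `models.yaml` lookup sketched earlier, and this is not the actual orchestrator code):

```python
# Hedged sketch of the orchestrator's proxy step; not the actual code.
import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

# In the real orchestrator this mapping comes from models.yaml
# (see load_registry/resolve_model sketched earlier).
REGISTRY = {"/v1/chat/completions": {"port": 8001}}


@app.post("/{path:path}")
async def proxy(path: str, request: Request) -> Response:
    cfg = REGISTRY[f"/{path}"]  # 404 handling omitted for brevity
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(
            f"http://127.0.0.1:{cfg['port']}/{path}",
            content=await request.body(),
            headers={"content-type": request.headers.get("content-type", "application/json")},
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )
```
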

## Startup Flow

1. Run `scripts/start-all.sh`
2. Orchestrator starts on port 9000
3. Orchestrator reads `models.yaml`
4. On first request:
   - Orchestrator starts the appropriate model service
   - Waits for health check
   - Proxies request
5. On subsequent requests:
   - If same model: direct proxy
   - If different model: stop current, start new (sketched below)
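
A hedged sketch of that switch-on-demand logic, reusing the illustrative `ServiceManager` from the core section (names are assumptions, not the actual orchestrator code):

```python
# Hypothetical single-GPU model switching; the real orchestrator may differ.
current_model: str | None = None
manager = ServiceManager()  # illustrative class sketched in the core section


def ensure_model(name: str, registry: dict) -> dict:
    """Guarantee the requested model is the one (and only one) running."""
    global current_model
    cfg = registry[name]
    if current_model == name:
        return cfg                   # same model: just proxy
    if current_model is not None:
        manager.stop(current_model)  # different model: stop current...
    manager.start(name, cfg["service_script"], cfg["port"],
                  timeout=cfg.get("startup_time", 120))
    current_model = name             # ...then start new
    return cfg
```
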
## Benefits

- **Simple**: No Docker complexity, just Python
- **Fast**: No container overhead, direct execution
- **Debuggable**: Standard Python processes, easy to inspect
- **Extensible**: Add models by creating one file + a YAML entry
- **Maintainable**: Clear structure, base classes, DRY principles
- **Portable**: Works anywhere Python runs (local, RunPod, other clouds)

## Development Workflow

```bash
# Local development
python3 models/vllm/server.py               # Test a service directly
python3 model-orchestrator/orchestrator.py  # Test the orchestrator

# RunPod deployment
./scripts/install.sh          # Install dependencies
./scripts/download-models.sh  # Pre-download models
./scripts/start-all.sh        # Start everything

# Create a template
./scripts/prepare-template.sh  # Prepare for template save
```

## Future Enhancements

- Load balancing across multiple GPUs
- Model pooling (keep multiple models loaded)
- Batch request queueing
- Metrics and monitoring
- Auto-scaling based on demand