# RunPod Multi-Modal AI Architecture

**Clean, extensible distributed AI infrastructure spanning VPS and GPU**

## Design Principles

1. **Distributed** - VPS (UI/proxy) + GPU (models) connected via Tailscale
2. **No Docker on GPU** - Direct Python for RunPod compatibility
3. **Extensible** - Adding new models requires minimal code
4. **Maintainable** - Clear structure and separation of concerns
5. **Simple** - One command to start, easy to debug
6. **OpenAI Compatible** - Works with standard AI tools

## Directory Structure

```
runpod/
├── core/                     # Core infrastructure
│   ├── base_service.py       # Abstract base class for all services
│   ├── service_manager.py    # Process lifecycle management
│   └── requirements.txt      # Core dependencies
│
├── model-orchestrator/       # Request orchestration
│   ├── orchestrator.py       # Main orchestrator (process-based)
│   ├── models.yaml           # Model registry (simple config)
│   └── requirements.txt      # Orchestrator dependencies
│
├── models/                   # Model service implementations
│   ├── vllm/                 # Text generation
│   │   ├── server.py         # vLLM service (inherits base_service)
│   │   └── requirements.txt  # vLLM dependencies
│   │
│   ├── flux/                 # Image generation
│   │   ├── server.py         # Flux service
│   │   └── requirements.txt  # Flux dependencies
│   │
│   └── musicgen/             # Music generation
│       ├── server.py         # MusicGen service
│       └── requirements.txt  # AudioCraft dependencies
│
├── scripts/                  # Deployment & management
│   ├── install.sh            # Install all dependencies
│   ├── download-models.sh    # Pre-download models
│   ├── start-all.sh          # Start orchestrator + services
│   ├── stop-all.sh           # Stop all services
│   └── prepare-template.sh   # RunPod template preparation
│
├── systemd/                  # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                     # Documentation
    ├── ADDING_MODELS.md      # Guide for adding new models
    ├── DEPLOYMENT.md         # Deployment guide
    └── RUNPOD_TEMPLATE.md    # Template creation guide
```

## Component Responsibilities

### Core (`core/`)

- **base_service.py**: Abstract base class for all model services
  - Health check endpoint
  - Graceful shutdown
  - Logging configuration
  - Common utilities
- **service_manager.py**: Process lifecycle management
  - Start/stop services
  - Health monitoring
  - Auto-restart on failure
  - Resource cleanup

### Orchestrator (`model-orchestrator/`)

- **orchestrator.py**: Routes requests to the appropriate model service
  - Reads `models.yaml` configuration
  - Manages model switching
  - Proxies requests to services
  - OpenAI-compatible API
- **models.yaml**: Simple model registry

```yaml
models:
  model-name:
    type: text|image|audio
    service_script: path/to/server.py
    port: 8001
    startup_time: 120
    endpoint: /v1/chat/completions
```

### Models (`models/`)

Each model directory contains:

- **server.py**: Service implementation (inherits `BaseService`)
- **requirements.txt**: Model-specific dependencies

Services are standalone and can be run independently for testing.

### Scripts (`scripts/`)

- **install.sh**: Install Python packages for all services
- **download-models.sh**: Pre-download models to `/workspace`
- **start-all.sh**: Start the orchestrator (which manages the model services)
- **stop-all.sh**: Graceful shutdown of all services
- **prepare-template.sh**: RunPod template preparation
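## Service Base Class

Every model service follows the same pattern: it inherits from `BaseService` in `core/base_service.py`, which supplies the health check, logging, and startup plumbing, and then registers its own routes. The real base class lives in the repository; purely as an illustration of the interface the services in this document rely on (a constructor taking `name` and `port`, an async `initialize()`, a `create_app()` hook for routes, and `run()`), a minimal sketch could look like the following. The FastAPI/uvicorn wiring shown here is an assumption, not the actual implementation, and it omits pieces such as graceful shutdown.

```python
# Illustrative sketch of core/base_service.py (not the actual implementation)
import logging
from abc import ABC, abstractmethod

import uvicorn
from fastapi import FastAPI


class BaseService(ABC):
    """Common plumbing shared by every model service."""

    def __init__(self, name: str, port: int, host: str = "0.0.0.0"):
        self.name = name
        self.port = port
        self.host = host
        self.model = None
        self.app = FastAPI(title=name)

        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(name)

        # Health check polled by the orchestrator / service manager
        @self.app.get("/health")
        async def health():
            return {"status": "ok", "service": self.name}

        # Load the model once the event loop is running
        @self.app.on_event("startup")
        async def _startup():
            self.logger.info("Initializing %s ...", self.name)
            await self.initialize()
            self.logger.info("%s ready on port %s", self.name, self.port)

        # Let the subclass register its model-specific routes
        self.create_app()

    @abstractmethod
    async def initialize(self):
        """Load model weights; called once at startup."""

    @abstractmethod
    def create_app(self):
        """Register model-specific FastAPI routes on self.app."""

    def run(self):
        uvicorn.run(self.app, host=self.host, port=self.port)
```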
## Adding a New Model (3 steps)

### 1. Create Model Service

```python
# models/mymodel/server.py
from pydantic import BaseModel

from core.base_service import BaseService


class MyRequest(BaseModel):
    prompt: str


class MyModelService(BaseService):
    def __init__(self):
        super().__init__(name="mymodel", port=8004)

    async def initialize(self):
        """Load the model (load_my_model is a placeholder for your loading code)."""
        self.model = load_my_model()

    def create_app(self):
        """Define FastAPI routes."""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)


if __name__ == "__main__":
    service = MyModelService()
    service.run()
```

### 2. Add to Registry

```yaml
# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate
```

### 3. Add Dependencies

```
# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0
```

That's it! The orchestrator handles everything else.

## Request Flow

```
Client Request
      ↓
Orchestrator (port 9000)
      ↓  (determines model from endpoint)
Model Service (port 8001-800X)
      ↓
Response
```

A concrete client call is sketched in the Example Request section at the end of this document.

## Startup Flow

1. Run `scripts/start-all.sh`
2. Orchestrator starts on port 9000
3. Orchestrator reads `models.yaml`
4. On the first request:
   - Orchestrator starts the appropriate model service
   - Waits for its health check
   - Proxies the request
5. On subsequent requests:
   - If same model: direct proxy
   - If different model: stop current, start new

## Benefits

- **Simple**: No Docker complexity, just Python
- **Fast**: No container overhead, direct execution
- **Debuggable**: Standard Python processes, easy to inspect
- **Extensible**: Add models by creating one file + a YAML entry
- **Maintainable**: Clear structure, base classes, DRY principles
- **Portable**: Works anywhere Python runs (local, RunPod, other clouds)

## Development Workflow

```bash
# Local development
python3 models/vllm/server.py               # Test a service directly
python3 model-orchestrator/orchestrator.py  # Test the orchestrator

# RunPod deployment
./scripts/install.sh           # Install dependencies
./scripts/download-models.sh   # Pre-download models
./scripts/start-all.sh         # Start everything

# Create template
./scripts/prepare-template.sh  # Prepare for template save
```

## Future Enhancements

- Load balancing across multiple GPUs
- Model pooling (keep multiple models loaded)
- Batch request queueing
- Metrics and monitoring
- Auto-scaling based on demand
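## Example Request

To make the request flow concrete, here is a minimal client call against the orchestrator's OpenAI-compatible endpoint. This is a sketch: the port (9000) and path (`/v1/chat/completions`) come from the configuration shown earlier, while `my-text-model` is a hypothetical name standing in for whatever is registered in `models.yaml`.

```python
import requests

# The first request to a model may block while the orchestrator starts the
# service and waits for its health check, so allow a generous timeout.
resp = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json={
        "model": "my-text-model",  # placeholder registry name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=300,
)
print(resp.json())
```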