# RunPod Multi-Modal AI Architecture

A clean, extensible, Python-based architecture for RunPod GPU instances.

## Design Principles
- No Docker - Direct Python execution for RunPod compatibility
- Extensible - Adding new models requires minimal code
- Maintainable - Clear structure and separation of concerns
- Simple - One command to start, easy to debug
## Directory Structure

```text
runpod/
├── core/                     # Core infrastructure
│   ├── base_service.py       # Abstract base class for all services
│   ├── service_manager.py    # Process lifecycle management
│   └── requirements.txt      # Core dependencies
│
├── model-orchestrator/       # Request orchestration
│   ├── orchestrator.py       # Main orchestrator (process-based)
│   ├── models.yaml           # Model registry (simple config)
│   └── requirements.txt      # Orchestrator dependencies
│
├── models/                   # Model service implementations
│   ├── vllm/                 # Text generation
│   │   ├── server.py         # vLLM service (inherits base_service)
│   │   └── requirements.txt  # vLLM dependencies
│   │
│   ├── flux/                 # Image generation
│   │   ├── server.py         # Flux service
│   │   └── requirements.txt  # Flux dependencies
│   │
│   └── musicgen/             # Music generation
│       ├── server.py         # MusicGen service
│       └── requirements.txt  # AudioCraft dependencies
│
├── scripts/                  # Deployment & management
│   ├── install.sh            # Install all dependencies
│   ├── download-models.sh    # Pre-download models
│   ├── start-all.sh          # Start orchestrator + services
│   ├── stop-all.sh           # Stop all services
│   └── prepare-template.sh   # RunPod template preparation
│
├── systemd/                  # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                     # Documentation
    ├── ADDING_MODELS.md      # Guide for adding new models
    ├── DEPLOYMENT.md         # Deployment guide
    └── RUNPOD_TEMPLATE.md    # Template creation guide
```
## Component Responsibilities

### Core (`core/`)

- `base_service.py`: Abstract base class for all model services
  - Health check endpoint
  - Graceful shutdown
  - Logging configuration
  - Common utilities

- `service_manager.py`: Process lifecycle management
  - Start/stop services
  - Health monitoring
  - Auto-restart on failure
  - Resource cleanup
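As a rough illustration, `base_service.py` might look something like the sketch below, assuming FastAPI/uvicorn as the web stack and a `/health` route (both are assumptions); only the `initialize()`/`create_app()` hooks and the `run()` entry point are taken from the example later in this document.

```python
# core/base_service.py -- illustrative sketch only, not the actual implementation
import logging
from abc import ABC, abstractmethod

import uvicorn
from fastapi import FastAPI


class BaseService(ABC):
    """Shared plumbing for every model service: app setup, health check, logging."""

    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.app = FastAPI(title=name)
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(name)

        # Health check endpoint polled by the service manager / orchestrator
        # (the /health path is an assumption).
        @self.app.get("/health")
        async def health():
            return {"status": "ok", "service": self.name}

    @abstractmethod
    async def initialize(self):
        """Load model weights; runs once at startup."""

    @abstractmethod
    def create_app(self):
        """Register model-specific FastAPI routes on self.app."""

    def run(self):
        # Load the model on startup, register routes, then serve.
        self.app.add_event_handler("startup", self.initialize)
        self.create_app()
        # uvicorn handles SIGINT/SIGTERM for graceful shutdown.
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)
```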
### Orchestrator (`model-orchestrator/`)

- `orchestrator.py`: Routes requests to the appropriate model
  - Reads the `models.yaml` configuration
  - Manages model switching
  - Proxies requests to services
  - OpenAI-compatible API

- `models.yaml`: Simple model registry

  ```yaml
  models:
    model-name:
      type: text|image|audio
      service_script: path/to/server.py
      port: 8001
      startup_time: 120
      endpoint: /v1/chat/completions
  ```
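To make the routing concrete, here is a sketch of how the orchestrator could load this registry and pick a model from the request path. PyYAML is assumed; `load_registry` and `resolve_model` are hypothetical helper names, not the real API.

```python
# model-orchestrator/orchestrator.py -- routing sketch (illustrative)
import yaml


def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    """Load the model registry into a plain dict keyed by model name."""
    with open(path) as f:
        return yaml.safe_load(f)["models"]


def resolve_model(registry: dict, request_path: str) -> tuple[str, dict] | None:
    """Find the model whose declared endpoint matches the incoming path."""
    for name, cfg in registry.items():
        if request_path.startswith(cfg["endpoint"]):
            return name, cfg
    return None


# Example: "/v1/chat/completions" resolves to the text model, and the
# orchestrator proxies the request to http://localhost:<port> for it.
```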
### Models (`models/`)

Each model directory contains:

- `server.py`: Service implementation (inherits `BaseService`)
- `requirements.txt`: Model-specific dependencies

Services are standalone and can run independently for testing.
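For example, a standalone service can be smoke-tested against its health endpoint (the `/health` path is an assumption; port 8001 matches the registry example above):

```python
# Smoke-test a standalone service (assumes a /health route on the service port)
import requests

resp = requests.get("http://localhost:8001/health", timeout=5)
print(resp.status_code, resp.json())
```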
### Scripts (`scripts/`)

- `install.sh`: Install Python packages for all services
- `download-models.sh`: Pre-download models to `/workspace`
- `start-all.sh`: Start the orchestrator (which manages model services)
- `stop-all.sh`: Graceful shutdown of all services
- `prepare-template.sh`: RunPod template preparation
## Adding a New Model (3 steps)

### 1. Create Model Service

```python
# models/mymodel/server.py
from core.base_service import BaseService


class MyModelService(BaseService):
    def __init__(self):
        super().__init__(
            name="mymodel",
            port=8004
        )

    async def initialize(self):
        """Load model"""
        self.model = load_my_model()

    def create_app(self):
        """Define FastAPI routes"""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)


if __name__ == "__main__":
    service = MyModelService()
    service.run()
```
### 2. Add to Registry

```yaml
# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate
```
### 3. Add Dependencies

```text
# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0
```

That's it! The orchestrator handles everything else.
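As a usage example, the new endpoint could then be exercised through the orchestrator on port 9000. The `prompt` field mirrors the `request.prompt` access in the service example above and is an assumption about the request schema:

```python
# Call the new model through the orchestrator (port 9000, per Request Flow below)
import requests

resp = requests.post(
    "http://localhost:9000/v1/mymodel/generate",
    json={"prompt": "a short test prompt"},
    timeout=300,  # the first request may wait while the model service starts
)
print(resp.json())
```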
## Request Flow

```text
Client Request
      ↓
Orchestrator (port 9000)
      ↓  (determines model from endpoint)
Model Service (port 8001-800X)
      ↓
Response
```
## Startup Flow

1. Run `scripts/start-all.sh`
2. Orchestrator starts on port 9000
3. Orchestrator reads `models.yaml`
4. On the first request:
   - Orchestrator starts the appropriate model service
   - Waits for its health check
   - Proxies the request
5. On subsequent requests:
   - Same model: direct proxy
   - Different model: stop the current service, start the new one (see the sketch below)
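A rough sketch of what that switch could look like inside the service manager, assuming one resident model at a time and the assumed `/health` route; `switch_model` is a hypothetical name, while the config keys follow the registry example above.

```python
# Illustrative model-switching logic (one model resident at a time)
import subprocess
import time

import requests


def switch_model(current: subprocess.Popen | None, cfg: dict) -> subprocess.Popen:
    """Stop the running service, start the requested one, and wait for its health check."""
    if current is not None:
        current.terminate()          # graceful shutdown first
        current.wait(timeout=30)

    proc = subprocess.Popen(["python3", cfg["service_script"]])

    deadline = time.time() + cfg.get("startup_time", 120)
    url = f"http://localhost:{cfg['port']}/health"
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).ok:
                return proc
        except requests.ConnectionError:
            pass
        time.sleep(2)

    proc.kill()
    raise RuntimeError(f"Service on port {cfg['port']} failed its health check")
```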
## Benefits
- Simple: No Docker complexity, just Python
- Fast: No container overhead, direct execution
- Debuggable: Standard Python processes, easy to inspect
- Extensible: Add models by creating one file + YAML entry
- Maintainable: Clear structure, base classes, DRY principles
- Portable: Works anywhere Python runs (local, RunPod, other clouds)
## Development Workflow

```bash
# Local development
python3 models/vllm/server.py               # Test a service directly
python3 model-orchestrator/orchestrator.py  # Test the orchestrator

# RunPod deployment
./scripts/install.sh           # Install dependencies
./scripts/download-models.sh   # Pre-download models
./scripts/start-all.sh         # Start everything

# Create template
./scripts/prepare-template.sh  # Prepare for template save
```
## Future Enhancements
- Load balancing across multiple GPUs
- Model pooling (keep multiple models loaded)
- Batch request queueing
- Metrics and monitoring
- Auto-scaling based on demand