RunPod Multi-Modal AI Architecture

Clean, extensible Python-based architecture for RunPod GPU instances

Design Principles

  1. No Docker - Direct Python execution for RunPod compatibility
  2. Extensible - Adding new models requires minimal code
  3. Maintainable - Clear structure and separation of concerns
  4. Simple - One command to start, easy to debug

Directory Structure

runpod/
├── core/                          # Core infrastructure
│   ├── base_service.py           # Abstract base class for all services
│   ├── service_manager.py        # Process lifecycle management
│   └── requirements.txt          # Core dependencies
│
├── model-orchestrator/            # Request orchestration
│   ├── orchestrator.py           # Main orchestrator (process-based)
│   ├── models.yaml               # Model registry (simple config)
│   └── requirements.txt          # Orchestrator dependencies
│
├── models/                        # Model service implementations
│   ├── vllm/                     # Text generation
│   │   ├── server.py             # vLLM service (inherits base_service)
│   │   └── requirements.txt      # vLLM dependencies
│   │
│   ├── flux/                     # Image generation
│   │   ├── server.py             # Flux service
│   │   └── requirements.txt      # Flux dependencies
│   │
│   └── musicgen/                 # Music generation
│       ├── server.py             # MusicGen service
│       └── requirements.txt      # AudioCraft dependencies
│
├── scripts/                       # Deployment & management
│   ├── install.sh                # Install all dependencies
│   ├── download-models.sh        # Pre-download models
│   ├── start-all.sh              # Start orchestrator + services
│   ├── stop-all.sh               # Stop all services
│   └── prepare-template.sh       # RunPod template preparation
│
├── systemd/                       # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                          # Documentation
    ├── ADDING_MODELS.md          # Guide for adding new models
    ├── DEPLOYMENT.md             # Deployment guide
    └── RUNPOD_TEMPLATE.md        # Template creation guide

Component Responsibilities

Core (core/)

  • base_service.py: Abstract base class for all model services

    • Health check endpoint
    • Graceful shutdown
    • Logging configuration
    • Common utilities
  • service_manager.py: Process lifecycle management

    • Start/stop services
    • Health monitoring
    • Auto-restart on failure
    • Resource cleanup
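
A minimal sketch of how base_service.py could be shaped, assuming FastAPI and uvicorn (FastAPI is implied by the service example further down); the method names mirror that example, and the real implementation may differ:

# core/base_service.py (illustrative sketch, not the final implementation)
import logging

import uvicorn
from fastapi import FastAPI

class BaseService:
    """Common plumbing shared by every model service."""

    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.app = FastAPI(title=name)
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(name)

        @self.app.get("/health")
        async def health():
            return {"status": "ok", "service": self.name}

        @self.app.on_event("startup")
        async def startup():
            await self.initialize()   # subclasses load their model here

    async def initialize(self):
        """Overridden by subclasses to load the model."""
        raise NotImplementedError

    def create_app(self):
        """Overridden by subclasses to register their routes."""
        raise NotImplementedError

    def run(self):
        self.create_app()
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)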

Orchestrator (model-orchestrator/)

  • orchestrator.py: Routes requests to appropriate model

    • Reads models.yaml configuration
    • Manages model switching
    • Proxies requests to services
    • OpenAI-compatible API
  • models.yaml: Simple model registry

    models:
      model-name:
        type: text|image|audio
        service_script: path/to/server.py
        port: 8001
        startup_time: 120
        endpoint: /v1/chat/completions
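
As a rough sketch of how the orchestrator might consume this registry (assuming PyYAML; the function names are illustrative), resolving an incoming request path to a model entry could look like:

# model-orchestrator/orchestrator.py (sketch of the registry lookup only)
import yaml

def load_registry(path="model-orchestrator/models.yaml"):
    """Parse models.yaml and return the models mapping."""
    with open(path) as f:
        return yaml.safe_load(f)["models"]

def resolve_model(registry, request_path):
    """Return (name, config) of the model whose endpoint matches the request path."""
    for name, cfg in registry.items():
        if request_path.startswith(cfg["endpoint"]):
            return name, cfg
    raise ValueError(f"No model registered for {request_path}")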
    

Models (models/)

Each model directory contains:

  • server.py: Service implementation (inherits BaseService)
  • requirements.txt: Model-specific dependencies

Services are standalone and can be run independently for testing.

Scripts (scripts/)

  • install.sh: Install Python packages for all services
  • download-models.sh: Pre-download models to /workspace
  • start-all.sh: Start orchestrator (which manages model services)
  • stop-all.sh: Graceful shutdown of all services
  • prepare-template.sh: RunPod template preparation

Adding a New Model (3 steps)

1. Create Model Service

# models/mymodel/server.py
from pydantic import BaseModel

from core.base_service import BaseService

class MyRequest(BaseModel):
    """Request body for the generate endpoint"""
    prompt: str

class MyModelService(BaseService):
    def __init__(self):
        super().__init__(
            name="mymodel",
            port=8004
        )

    async def initialize(self):
        """Load model"""
        self.model = load_my_model()  # placeholder for your model-loading code

    def create_app(self):
        """Define FastAPI routes"""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)

if __name__ == "__main__":
    service = MyModelService()
    service.run()

2. Add to Registry

# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate

3. Add Dependencies

# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0

That's it! The orchestrator handles everything else.

Request Flow

Client Request
     ↓
Orchestrator (port 9000)
     ↓ (determines model from endpoint)
Model Service (port 8001-800X)
     ↓
Response
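
Assuming the vLLM service is registered under /v1/chat/completions as in the schema above, a client call could look like this (the model name in the payload is hypothetical):

import requests

resp = requests.post(
    "http://localhost:9000/v1/chat/completions",  # orchestrator port, not the model port
    json={
        "model": "vllm",  # hypothetical name from models.yaml
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=300,  # the first request may wait for the model service to start
)
print(resp.json())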

Startup Flow

  1. Run scripts/start-all.sh
  2. Orchestrator starts on port 9000
  3. Orchestrator reads models.yaml
  4. On first request:
    • Orchestrator starts appropriate model service
    • Waits for health check
    • Proxies request
  5. On subsequent requests:
    • If same model: direct proxy
    • If different model: stop current, start new
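
A sketch of how the switching step might be implemented on top of service_manager.py, assuming each service runs as a plain subprocess and exposes the /health endpoint from the base class (all names here are illustrative):

# Illustrative model-switching logic, not the actual service_manager.py
import subprocess
import time

import requests

current = {"name": None, "proc": None}

def ensure_model(name, cfg):
    """Make sure the requested model service is running, swapping out the old one."""
    if current["name"] == name:
        return  # already running, requests can be proxied directly
    if current["proc"] is not None:
        current["proc"].terminate()        # stop the previous model service
        current["proc"].wait(timeout=30)
    proc = subprocess.Popen(["python3", cfg["service_script"]])
    wait_for_health(cfg["port"], timeout=cfg.get("startup_time", 120))
    current.update(name=name, proc=proc)

def wait_for_health(port, timeout):
    """Poll the service's /health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=2).ok:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise RuntimeError(f"Service on port {port} did not become healthy in time")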

Benefits

  • Simple: No Docker complexity, just Python
  • Fast: No container overhead, direct execution
  • Debuggable: Standard Python processes, easy to inspect
  • Extensible: Add models by creating one file + YAML entry
  • Maintainable: Clear structure, base classes, DRY principles
  • Portable: Works anywhere Python runs (local, RunPod, other cloud)

Development Workflow

# Local development
python3 models/vllm/server.py          # Test service directly
python3 model-orchestrator/orchestrator.py  # Test orchestrator

# RunPod deployment
./scripts/install.sh                    # Install dependencies
./scripts/download-models.sh            # Pre-download models
./scripts/start-all.sh                  # Start everything

# Create template
./scripts/prepare-template.sh           # Prepare for template save

Future Enhancements

  • Load balancing across multiple GPUs
  • Model pooling (keep multiple models loaded)
  • Batch request queueing
  • Metrics and monitoring
  • Auto-scaling based on demand