RunPod Multi-Modal AI Architecture

Clean, extensible distributed AI infrastructure spanning VPS and GPU

Design Principles

  1. Distributed - VPS (UI/proxy) + GPU (models) connected via Tailscale
  2. No Docker on GPU - Direct Python for RunPod compatibility
  3. Extensible - Adding new models requires minimal code
  4. Maintainable - Clear structure and separation of concerns
  5. Simple - One command to start, easy to debug
  6. OpenAI Compatible - Works with standard AI tools

Directory Structure

runpod/
├── core/                          # Core infrastructure
│   ├── base_service.py           # Abstract base class for all services
│   ├── service_manager.py        # Process lifecycle management
│   └── requirements.txt          # Core dependencies
│
├── model-orchestrator/            # Request orchestration
│   ├── orchestrator.py           # Main orchestrator (process-based)
│   ├── models.yaml               # Model registry (simple config)
│   └── requirements.txt          # Orchestrator dependencies
│
├── models/                        # Model service implementations
│   ├── vllm/                     # Text generation
│   │   ├── server.py             # vLLM service (inherits base_service)
│   │   └── requirements.txt      # vLLM dependencies
│   │
│   ├── flux/                     # Image generation
│   │   ├── server.py             # Flux service
│   │   └── requirements.txt      # Flux dependencies
│   │
│   └── musicgen/                 # Music generation
│       ├── server.py             # MusicGen service
│       └── requirements.txt      # AudioCraft dependencies
│
├── scripts/                       # Deployment & management
│   ├── install.sh                # Install all dependencies
│   ├── download-models.sh        # Pre-download models
│   ├── start-all.sh              # Start orchestrator + services
│   ├── stop-all.sh               # Stop all services
│   └── prepare-template.sh       # RunPod template preparation
│
├── systemd/                       # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                          # Documentation
    ├── ADDING_MODELS.md          # Guide for adding new models
    ├── DEPLOYMENT.md             # Deployment guide
    └── RUNPOD_TEMPLATE.md        # Template creation guide

Component Responsibilities

Core (core/)

  • base_service.py: Abstract base class for all model services (sketched below)

    • Health check endpoint
    • Graceful shutdown
    • Logging configuration
    • Common utilities
  • service_manager.py: Process lifecycle management

    • Start/stop services
    • Health monitoring
    • Auto-restart on failure
    • Resource cleanup
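
A minimal sketch of what base_service.py could look like, assuming FastAPI and uvicorn as the HTTP stack (illustrative only; the real class may differ):

# core/base_service.py (illustrative sketch, not the actual implementation)
import logging
from abc import ABC, abstractmethod

import uvicorn
from fastapi import FastAPI

class BaseService(ABC):
    """Common plumbing shared by every model service."""

    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.logger = logging.getLogger(name)
        logging.basicConfig(level=logging.INFO)
        self.app = FastAPI(title=name)

        @self.app.get("/health")
        async def health():
            # Health check endpoint polled by the orchestrator
            return {"status": "ok", "service": self.name}

        @self.app.on_event("startup")
        async def startup():
            await self.initialize()

        @self.app.on_event("shutdown")
        async def shutdown():
            self.logger.info("shutting down %s", self.name)

        self.create_app()

    @abstractmethod
    async def initialize(self):
        """Load model weights; implemented by each service."""

    @abstractmethod
    def create_app(self):
        """Register the service's FastAPI routes."""

    def run(self):
        """Blocking entry point used when the script is run directly."""
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)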

Orchestrator (model-orchestrator/)

  • orchestrator.py: Routes requests to appropriate model

    • Reads models.yaml configuration
    • Manages model switching
    • Proxies requests to services
    • OpenAI-compatible API
  • models.yaml: Simple model registry

    models:
      model-name:
        type: text|image|audio
        service_script: path/to/server.py
        port: 8001
        startup_time: 120
        endpoint: /v1/chat/completions
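
As a rough illustration, the orchestrator could parse this registry with PyYAML into simple entries (field names follow the example above; the actual loader may differ):

# model-orchestrator/orchestrator.py (excerpt, illustrative sketch)
from dataclasses import dataclass
from pathlib import Path

import yaml

@dataclass
class ModelEntry:
    name: str
    type: str
    service_script: str
    port: int
    startup_time: int
    endpoint: str

def load_registry(path: str = "models.yaml") -> dict[str, ModelEntry]:
    """Parse models.yaml into a name -> ModelEntry mapping."""
    raw = yaml.safe_load(Path(path).read_text())
    return {name: ModelEntry(name=name, **cfg) for name, cfg in raw["models"].items()}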
    

Models (models/)

Each model directory contains:

  • server.py: Service implementation (inherits BaseService)
  • requirements.txt: Model-specific dependencies

Services are standalone and can be run independently for testing.

Scripts (scripts/)

  • install.sh: Install Python packages for all services
  • download-models.sh: Pre-download models to /workspace
  • start-all.sh: Start orchestrator (which manages model services)
  • stop-all.sh: Graceful shutdown of all services
  • prepare-template.sh: RunPod template preparation

Adding a New Model (3 steps)

1. Create Model Service

# models/mymodel/server.py
from pydantic import BaseModel

from core.base_service import BaseService

class MyRequest(BaseModel):
    prompt: str

class MyModelService(BaseService):
    def __init__(self):
        super().__init__(
            name="mymodel",
            port=8004
        )

    async def initialize(self):
        """Load model"""
        self.model = load_my_model()  # placeholder: load your model here

    def create_app(self):
        """Define FastAPI routes"""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)

if __name__ == "__main__":
    service = MyModelService()
    service.run()

2. Add to Registry

# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate

3. Add Dependencies

# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0

That's it! The orchestrator handles everything else.

Request Flow

Client Request
     ↓
Orchestrator (port 9000)
     ↓ (determines model from endpoint)
Model Service (port 8001-800X)
     ↓
Response
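
For example, clients only ever talk to the orchestrator; the host name and model name below are hypothetical, and the endpoint matches the registry entry shown earlier:

# Hypothetical client call; the orchestrator picks and proxies to the right service.
import requests

resp = requests.post(
    "http://gpu-host:9000/v1/chat/completions",  # orchestrator port, not a model service
    json={
        "model": "qwen2.5-7b",  # hypothetical model name from models.yaml
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=300,  # the first request may wait for the model service to start
)
print(resp.json())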

Startup Flow

  1. Run scripts/start-all.sh
  2. Orchestrator starts on port 9000
  3. Orchestrator reads models.yaml
  4. On first request:
    • Orchestrator starts appropriate model service
    • Waits for health check
    • Proxies request
  5. On subsequent requests:
    • If same model: direct proxy
    • If different model: stop the current service, start the new one (see the sketch below)
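
A rough sketch of that lazy-start and switch logic, assuming the orchestrator keeps at most one model service alive and polls the /health endpoint provided by BaseService (names are illustrative, not the actual implementation):

# model-orchestrator/orchestrator.py (excerpt, illustrative sketch)
import subprocess
import time

import requests

class ServiceRunner:
    """Runs at most one model service at a time and waits until it is healthy."""

    def __init__(self, registry):
        self.registry = registry  # name -> ModelEntry, see load_registry() above
        self.current = None       # (name, subprocess.Popen) of the running service

    def ensure_running(self, name):
        if self.current and self.current[0] == name:
            return  # same model: nothing to do, just proxy
        if self.current:
            self.current[1].terminate()  # different model: stop the current one first
            self.current[1].wait(timeout=30)
        entry = self.registry[name]
        proc = subprocess.Popen(["python3", entry.service_script])
        self._wait_healthy(entry.port, entry.startup_time)
        self.current = (name, proc)

    def _wait_healthy(self, port, timeout):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"http://127.0.0.1:{port}/health", timeout=2).ok:
                    return
            except requests.ConnectionError:
                pass
            time.sleep(2)
        raise RuntimeError(f"service on port {port} not healthy after {timeout}s")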

Benefits

  • Simple: No Docker complexity, just Python
  • Fast: No container overhead, direct execution
  • Debuggable: Standard Python processes, easy to inspect
  • Extensible: Add a model with one service file, a registry entry, and a requirements file
  • Maintainable: Clear structure, base classes, DRY principles
  • Portable: Works anywhere Python runs (local, RunPod, other cloud)

Development Workflow

# Local development
python3 models/vllm/server.py          # Test service directly
python3 model-orchestrator/orchestrator.py  # Test orchestrator

# RunPod deployment
./scripts/install.sh                    # Install dependencies
./scripts/download-models.sh            # Pre-download models
./scripts/start-all.sh                  # Start everything

# Create template
./scripts/prepare-template.sh           # Prepare for template save

Future Enhancements

  • Load balancing across multiple GPUs
  • Model pooling (keep multiple models loaded)
  • Batch request queueing
  • Metrics and monitoring
  • Auto-scaling based on demand