RunPod Multi-Modal AI Architecture

Clean, extensible distributed AI infrastructure spanning VPS and GPU

Design Principles

  1. Distributed - VPS (UI/proxy) + GPU (models) connected via Tailscale
  2. No Docker on GPU - Direct Python for RunPod compatibility
  3. Extensible - Adding new models requires minimal code
  4. Maintainable - Clear structure and separation of concerns
  5. Simple - One command to start, easy to debug
  6. OpenAI Compatible - Works with standard AI tools

Directory Structure

runpod/
├── core/                          # Core infrastructure
│   ├── base_service.py           # Abstract base class for all services
│   ├── service_manager.py        # Process lifecycle management
│   └── requirements.txt          # Core dependencies
│
├── model-orchestrator/            # Request orchestration
│   ├── orchestrator.py           # Main orchestrator (process-based)
│   ├── models.yaml               # Model registry (simple config)
│   └── requirements.txt          # Orchestrator dependencies
│
├── models/                        # Model service implementations
│   ├── vllm/                     # Text generation
│   │   ├── server.py             # vLLM service (inherits base_service)
│   │   └── requirements.txt      # vLLM dependencies
│   │
│   ├── flux/                     # Image generation
│   │   ├── server.py             # Flux service
│   │   └── requirements.txt      # Flux dependencies
│   │
│   └── musicgen/                 # Music generation
│       ├── server.py             # MusicGen service
│       └── requirements.txt      # AudioCraft dependencies
│
├── scripts/                       # Deployment & management
│   ├── install.sh                # Install all dependencies
│   ├── download-models.sh        # Pre-download models
│   ├── start-all.sh              # Start orchestrator + services
│   ├── stop-all.sh               # Stop all services
│   └── prepare-template.sh       # RunPod template preparation
│
├── systemd/                       # Optional systemd services
│   ├── ai-orchestrator.service
│   └── install-services.sh
│
└── docs/                          # Documentation
    ├── ADDING_MODELS.md          # Guide for adding new models
    ├── DEPLOYMENT.md             # Deployment guide
    └── RUNPOD_TEMPLATE.md        # Template creation guide

Component Responsibilities

Core (core/)

  • base_service.py: Abstract base class for all model services (sketched below)

    • Health check endpoint
    • Graceful shutdown
    • Logging configuration
    • Common utilities
  • service_manager.py: Process lifecycle management

    • Start/stop services
    • Health monitoring
    • Auto-restart on failure
    • Resource cleanup
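
A minimal sketch of what base_service.py could look like, assuming FastAPI and uvicorn as the HTTP stack (illustrative only; the real class may differ):

# core/base_service.py (illustrative sketch, not the actual implementation)
import logging
from abc import ABC, abstractmethod

import uvicorn
from fastapi import FastAPI

class BaseService(ABC):
    """Common plumbing shared by every model service."""

    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.logger = logging.getLogger(name)
        logging.basicConfig(level=logging.INFO)
        self.app = FastAPI(title=name)

        @self.app.get("/health")
        async def health():
            # Health check endpoint polled by the orchestrator
            return {"status": "ok", "service": self.name}

        @self.app.on_event("startup")
        async def startup():
            await self.initialize()

        @self.app.on_event("shutdown")
        async def shutdown():
            self.logger.info("shutting down %s", self.name)

        self.create_app()

    @abstractmethod
    async def initialize(self):
        """Load model weights; implemented by each service."""

    @abstractmethod
    def create_app(self):
        """Register the service's FastAPI routes."""

    def run(self):
        """Blocking entry point used when the script is run directly."""
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)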

Orchestrator (model-orchestrator/)

  • orchestrator.py: Routes requests to appropriate model

    • Reads models.yaml configuration
    • Manages model switching
    • Proxies requests to services
    • OpenAI-compatible API
  • models.yaml: Simple model registry

    models:
      model-name:
        type: text|image|audio
        service_script: path/to/server.py
        port: 8001
        startup_time: 120
        endpoint: /v1/chat/completions
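
As a rough illustration, the orchestrator could parse this registry with PyYAML into simple entries (field names follow the example above; the actual loader may differ):

# model-orchestrator/orchestrator.py (excerpt, illustrative sketch)
from dataclasses import dataclass
from pathlib import Path

import yaml

@dataclass
class ModelEntry:
    name: str
    type: str
    service_script: str
    port: int
    startup_time: int
    endpoint: str

def load_registry(path: str = "models.yaml") -> dict[str, ModelEntry]:
    """Parse models.yaml into a name -> ModelEntry mapping."""
    raw = yaml.safe_load(Path(path).read_text())
    return {name: ModelEntry(name=name, **cfg) for name, cfg in raw["models"].items()}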
    

Models (models/)

Each model directory contains:

  • server.py: Service implementation (inherits BaseService)
  • requirements.txt: Model-specific dependencies

Services are standalone and can be run independently for testing.

Scripts (scripts/)

  • install.sh: Install Python packages for all services
  • download-models.sh: Pre-download models to /workspace
  • start-all.sh: Start orchestrator (which manages model services)
  • stop-all.sh: Graceful shutdown of all services
  • prepare-template.sh: RunPod template preparation

Adding a New Model (3 steps)

1. Create Model Service

# models/mymodel/server.py
from pydantic import BaseModel

from core.base_service import BaseService

class MyRequest(BaseModel):
    prompt: str

class MyModelService(BaseService):
    def __init__(self):
        super().__init__(
            name="mymodel",
            port=8004
        )

    async def initialize(self):
        """Load model"""
        self.model = load_my_model()  # placeholder: load your model here

    def create_app(self):
        """Define FastAPI routes"""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)

if __name__ == "__main__":
    service = MyModelService()
    service.run()

2. Add to Registry

# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate

3. Add Dependencies

# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0

That's it! The orchestrator handles everything else.

Request Flow

Client Request
     ↓
Orchestrator (port 9000)
     ↓ (determines model from endpoint)
Model Service (port 8001-800X)
     ↓
Response
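
For example, clients only ever talk to the orchestrator; the host name and model name below are hypothetical, and the endpoint matches the registry entry shown earlier:

# Hypothetical client call; the orchestrator picks and proxies to the right service.
import requests

resp = requests.post(
    "http://gpu-host:9000/v1/chat/completions",  # orchestrator port, not a model service
    json={
        "model": "qwen2.5-7b",  # hypothetical model name from models.yaml
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=300,  # the first request may wait for the model service to start
)
print(resp.json())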

Startup Flow

  1. Run scripts/start-all.sh
  2. Orchestrator starts on port 9000
  3. Orchestrator reads models.yaml
  4. On first request:
    • Orchestrator starts appropriate model service
    • Waits for health check
    • Proxies request
  5. On subsequent requests:
    • If same model: direct proxy
    • If different model: stop the current service, start the new one (see the sketch below)
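
A rough sketch of that lazy-start and switch logic, assuming the orchestrator keeps at most one model service alive and polls the /health endpoint provided by BaseService (names are illustrative, not the actual implementation):

# model-orchestrator/orchestrator.py (excerpt, illustrative sketch)
import subprocess
import time

import requests

class ServiceRunner:
    """Runs at most one model service at a time and waits until it is healthy."""

    def __init__(self, registry):
        self.registry = registry  # name -> ModelEntry, see load_registry() above
        self.current = None       # (name, subprocess.Popen) of the running service

    def ensure_running(self, name):
        if self.current and self.current[0] == name:
            return  # same model: nothing to do, just proxy
        if self.current:
            self.current[1].terminate()  # different model: stop the current one first
            self.current[1].wait(timeout=30)
        entry = self.registry[name]
        proc = subprocess.Popen(["python3", entry.service_script])
        self._wait_healthy(entry.port, entry.startup_time)
        self.current = (name, proc)

    def _wait_healthy(self, port, timeout):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"http://127.0.0.1:{port}/health", timeout=2).ok:
                    return
            except requests.ConnectionError:
                pass
            time.sleep(2)
        raise RuntimeError(f"service on port {port} not healthy after {timeout}s")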

Benefits

  • Simple: No Docker complexity, just Python
  • Fast: No container overhead, direct execution
  • Debuggable: Standard Python processes, easy to inspect
  • Extensible: Add a model with one service file, a registry entry, and a requirements file
  • Maintainable: Clear structure, base classes, DRY principles
  • Portable: Works anywhere Python runs (local, RunPod, other cloud)

Development Workflow

# Local development
python3 models/vllm/server.py          # Test service directly
python3 model-orchestrator/orchestrator.py  # Test orchestrator

# RunPod deployment
./scripts/install.sh                    # Install dependencies
./scripts/download-models.sh            # Pre-download models
./scripts/start-all.sh                  # Start everything

# Create template
./scripts/prepare-template.sh           # Prepare for template save

Future Enhancements

  • Load balancing across multiple GPUs
  • Model pooling (keep multiple models loaded)
  • Batch request queueing
  • Metrics and monitoring
  • Auto-scaling based on demand