CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.

Key Design Philosophy:

  • Sequential model loading - Only one model active at a time to fit within GPU memory constraints
  • Process-based architecture - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
  • Automatic model switching - Orchestrator detects request types and switches models on-demand
  • OpenAI-compatible APIs - Works seamlessly with LiteLLM proxy and other AI tools

Architecture

Core Components

  1. Orchestrator (model-orchestrator/orchestrator_subprocess.py)

    • FastAPI proxy server listening on port 9000
    • Manages model lifecycle via Python subprocesses
    • Routes requests to appropriate model services
    • Handles sequential model loading/unloading
  2. Model Registry (model-orchestrator/models.yaml)

    • YAML configuration defining available models
    • Specifies: type, framework, service script, port, VRAM requirements, startup time
    • Easy to extend with new models
  3. Model Services (models/*/)

    • Individual Python servers running specific AI models
    • vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
    • ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
  4. Ansible Provisioning (playbook.yml)

    • Complete infrastructure-as-code setup
    • Installs dependencies, downloads models, configures services
    • Supports selective installation via tags

Why Process-Based Instead of Docker?

The subprocess implementation (orchestrator_subprocess.py) is preferred over the Docker version (orchestrator.py) because:

  • RunPod instances run in containers - Docker-in-Docker adds complexity
  • Faster model startup (direct Python process spawning)
  • Simpler debugging (single process tree)
  • Reduced overhead (no container management layer)

Note: Always use orchestrator_subprocess.py for RunPod deployments.

Common Commands

Repository Management with Arty

This project uses Arty for repository and deployment management. See arty.yml for full configuration.

# Clone all repositories (fresh deployment)
arty sync --env prod          # Production: Essential nodes only
arty sync --env dev           # Development: All nodes including optional
arty sync --env minimal       # Minimal: Just orchestrator + ComfyUI base

# Run deployment scripts
arty run setup/full           # Show setup instructions
arty run models/link-comfyui  # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes   # Install custom node dependencies
arty run services/start       # Start orchestrator
arty run services/stop        # Stop all services

# Health checks
arty run health/orchestrator  # Check orchestrator
arty run health/comfyui      # Check ComfyUI
arty run check/gpu           # nvidia-smi
arty run check/models        # Show cache size

Initial Setup

# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies

Service Management

This project uses Supervisor for process management, providing auto-restart, centralized logging, and easy service control.

# Start all services (with Supervisor)
bash scripts/start-all.sh           # Starts supervisor daemon + services
arty run services/start             # Same via arty

# Stop all services
bash scripts/stop-all.sh            # Stops all services + supervisor
arty run services/stop              # Same via arty

# Check service status
bash scripts/status.sh              # Show all service status
arty run services/status            # Same via arty
supervisorctl status                # Direct supervisor command

# Individual service control
supervisorctl start orchestrator    # Start orchestrator
supervisorctl restart comfyui       # Restart ComfyUI
supervisorctl stop orchestrator     # Stop orchestrator
arty run services/restart-comfyui   # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui       # Follow ComfyUI logs
supervisorctl tail -f orchestrator  # Follow orchestrator logs
arty run services/logs              # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)

Supervisor Configuration:

  • Config file: /workspace/supervisord.conf
  • Log directory: /workspace/logs/
  • PID file: /workspace/supervisord.pid
  • Socket: /workspace/supervisor.sock

Services managed:

  • comfyui - ComfyUI server (port 8188, autostart enabled)
  • orchestrator - Model orchestrator (port 9000, autostart disabled)
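
Supervisor also exposes its control API over XML-RPC on the same port as the web interface, so services can be inspected or restarted programmatically. A minimal sketch, assuming the inet_http_server on port 9001 with the admin/runpod2024 credentials shown above:

# Sketch: control Supervisor-managed services over its XML-RPC API.
# Assumes the web interface on port 9001 with the credentials above.
from xmlrpc.client import ServerProxy

server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

# List all managed processes and their states (RUNNING, STOPPED, FATAL, ...)
for proc in server.supervisor.getAllProcessInfo():
    print(f"{proc['name']:15s} {proc['statename']}")

# Restart ComfyUI, equivalent to `supervisorctl restart comfyui`
server.supervisor.stopProcess("comfyui")
server.supervisor.startProcess("comfyui")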

Testing

# Health check
curl http://localhost:9000/health

# List available models
curl http://localhost:9000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

# Test image generation
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
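
The same endpoints can also be exercised from Python with any OpenAI-compatible client. A minimal sketch using the openai package (pip install openai); the api_key is just a placeholder value:

# Sketch: streaming chat completion against the orchestrator with the
# OpenAI Python client. The api_key is only a placeholder here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()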

Ansible Tags Reference

System Setup:

  • base - Base system packages
  • python - Python environment setup
  • dependencies - Install Python packages

Model Installation:

  • models - Download vLLM/Flux/MusicGen models (legacy)
  • comfyui - Install ComfyUI base
  • comfyui-essential - Quick setup (ComfyUI + essential models only, ~80GB)
  • comfyui-models-image - Image generation models (FLUX, SDXL, SD3.5)
  • comfyui-models-video - Video generation models (CogVideoX, SVD)
  • comfyui-models-audio - Audio generation models (MusicGen variants)
  • comfyui-models-support - CLIP, IP-Adapter, ControlNet models
  • comfyui-models-all - All ComfyUI models (~137GB)
  • comfyui-nodes - Install essential custom nodes

Infrastructure:

  • tailscale - Install Tailscale VPN client
  • supervisor - Install and configure Supervisor process manager
  • systemd - Configure systemd services (tagged never; not applicable on RunPod)
  • validate - Health checks (tagged never; run explicitly with --tags validate)

Adding New Models

  1. Add model definition to model-orchestrator/models.yaml:
llama-3.1-8b:
  type: text
  framework: vllm
  service_script: models/vllm/server_llama.py
  port: 8001
  vram_gb: 17
  startup_time_seconds: 120
  endpoint: /v1/chat/completions
  description: "Llama 3.1 8B Instruct"
  2. Create service script (models/vllm/server_llama.py):
import os
import sys

# Delegate to vLLM's OpenAI-compatible server CLI
# (python -m vllm.entrypoints.openai.api_server)
model = "meta-llama/Llama-3.1-8B-Instruct"
port = os.getenv("PORT", "8001")
os.execvp(sys.executable, [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", model,
    "--port", port,
])
  3. Download model (handled by the Ansible playbook, or manually via the HuggingFace CLI)

  4. Restart orchestrator:

bash scripts/stop-all.sh && bash scripts/start-all.sh
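
After the restart, you can confirm the new entry is registered by querying the model list endpoint from the Testing section. A quick sketch, assuming the usual OpenAI-style response shape ({"data": [{"id": ...}]}):

# Sketch: confirm the new model shows up in /v1/models
import requests

resp = requests.get("http://localhost:9000/v1/models", timeout=10)
resp.raise_for_status()
model_ids = [m["id"] for m in resp.json().get("data", [])]
print("llama-3.1-8b registered:", "llama-3.1-8b" in model_ids)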

Key Implementation Details

Model Switching Logic

The orchestrator automatically switches models based on:

  • Endpoint path - /v1/chat/completions → text models, /v1/images/generations → image models
  • Model name in request - Matches against model registry
  • Sequential loading - Stops current model before starting new one to conserve VRAM

See orchestrator_subprocess.py:64-100 for process management implementation.
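
Conceptually, a switch proceeds as in the sketch below. This is illustrative only, not the actual implementation; the field names mirror models.yaml, and the /health probe path is an assumption.

# Illustrative sketch of sequential model switching (not the real code;
# see the reference above). Field names mirror models.yaml.
import os
import subprocess
import time

import requests

def switch_model(current_proc, model_cfg):
    # Stop the currently running service first to free VRAM
    if current_proc is not None and current_proc.poll() is None:
        current_proc.terminate()
        current_proc.wait(timeout=60)

    # Spawn the new model service as a plain child process
    proc = subprocess.Popen(
        ["python3", model_cfg["service_script"]],
        env={**os.environ, "PORT": str(model_cfg["port"])},
    )

    # Poll until the service answers, up to startup_time_seconds
    # (the /health path is an assumption; the real probe may differ)
    deadline = time.time() + model_cfg["startup_time_seconds"]
    url = f"http://localhost:{model_cfg['port']}/health"
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).ok:
                return proc
        except requests.ConnectionError:
            pass
        time.sleep(2)
    proc.terminate()
    raise RuntimeError("model service did not become healthy in time")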

Model Registry Structure

Each model in models.yaml requires:

  • type - text, image, or audio
  • framework - vllm, openedai-images, audiocraft, comfyui
  • service_script - Relative path to Python/shell script
  • port - Service port (8000+)
  • vram_gb - GPU memory requirement
  • startup_time_seconds - Max health check timeout
  • endpoint - API endpoint path
  • description - Human-readable description
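
Before restarting the orchestrator, the registry can be sanity-checked for missing fields. A small sketch, assuming PyYAML is installed and the top-level mapping shown under Adding New Models:

# Sketch: check that every entry in models.yaml has the required fields
import yaml

REQUIRED = {"type", "framework", "service_script", "port", "vram_gb",
            "startup_time_seconds", "endpoint", "description"}

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)

for name, cfg in registry.items():
    missing = REQUIRED - set(cfg)
    if missing:
        print(f"{name}: missing {sorted(missing)}")
    else:
        print(f"{name}: ok ({cfg['type']}, port {cfg['port']}, {cfg['vram_gb']} GB)")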

Environment Variables

Set in .env file:

  • HF_TOKEN - Required - HuggingFace API token for model downloads
  • GPU_TAILSCALE_IP - Optional - Tailscale IP for VPN access

Models are cached in:

  • /workspace/huggingface_cache - HuggingFace models
  • /workspace/models - Other model files
  • /workspace/ComfyUI/models - ComfyUI model directory structure
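
Downloads are normally handled by the Ansible playbook, but a single model can be pulled into the shared cache manually with huggingface_hub. A sketch; the repo_id below is only an example:

# Sketch: one-off manual download into the shared HF cache
# (Ansible normally handles this; repo_id is only an example)
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/huggingface_cache",
    token=os.environ["HF_TOKEN"],
)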

Integration with LiteLLM

For unified API management through LiteLLM proxy:

LiteLLM configuration (litellm-config.yaml on VPS):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b  # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600

Critical: Use hosted_vllm/openai/ prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.

ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

Directory structure created:

/workspace/ComfyUI/
├── models/
│   ├── checkpoints/        # FLUX, SDXL, SD3 models
│   ├── clip_vision/        # CLIP vision models
│   ├── video_models/       # CogVideoX, SVD
│   └── audio_models/       # MusicGen
└── custom_nodes/           # Extension nodes

Essential custom nodes installed:

  • ComfyUI-Manager - Model/node management GUI
  • ComfyUI-VideoHelperSuite - Video operations
  • ComfyUI-AnimateDiff-Evolved - Video generation
  • ComfyUI_IPAdapter_plus - Style transfer
  • ComfyUI-Impact-Pack - Auto face enhancement
  • comfyui-sound-lab - Audio generation

VRAM requirements for 24GB GPU:

  • FLUX Schnell FP16: 23GB (leaves 1GB)
  • SDXL Base: 12GB
  • CogVideoX-5B: 12GB (with optimizations)
  • MusicGen Medium: 8GB
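
Before loading a heavy workflow it is worth confirming there is enough headroom. A sketch that compares free VRAM (via nvidia-smi, as used elsewhere in this doc) against a requirement from the table above:

# Sketch: check free VRAM against a model's requirement via nvidia-smi
import subprocess

free_mib = int(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free",
     "--format=csv,noheader,nounits"]).split()[0])

required_gb = 23  # e.g. FLUX Schnell FP16 from the table above
enough = free_mib / 1024 >= required_gb
print(f"free: {free_mib / 1024:.1f} GB, required: {required_gb} GB, ok: {enough}")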

See COMFYUI_MODELS.md for detailed model catalog and usage examples.

Deployment Workflow

RunPod Deployment (Current Setup)

  1. Clone repository:

    cd /workspace
    git clone <repo-url> ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env, set HF_TOKEN
    
  3. Run Ansible provisioning:

    ansible-playbook playbook.yml
    # Or selective: --tags base,python,comfyui-essential
    
  4. Start services:

    bash scripts/start-all.sh
    
  5. Verify:

    curl http://localhost:9000/health
    

Tailscale VPN Integration

To connect RunPod GPU to VPS infrastructure:

# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4  # Get IP for LiteLLM config

Benefits: Secure tunnel, no public exposure, low latency.

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (USE THIS)
│   ├── orchestrator.py             # Docker-based version (legacy)
│   ├── models.yaml                 # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── comfyui_models.yaml             # ComfyUI model configuration
└── README.md                       # User documentation

Troubleshooting

Model fails to start

  • Check VRAM: nvidia-smi
  • Verify model weights downloaded: ls -lh /workspace/huggingface_cache
  • Check port conflicts: lsof -i :9000
  • Test model directly: python3 models/vllm/server.py

Streaming returns empty deltas

  • Use correct LiteLLM model prefix: hosted_vllm/openai/model-name
  • Set stream: true in LiteLLM config
  • Verify orchestrator proxies streaming correctly

HuggingFace download errors

  • Check token: echo $HF_TOKEN
  • Set in .env: HF_TOKEN=your_token_here
  • Re-run Ansible: ansible-playbook playbook.yml --tags dependencies

Out of storage space

  • Check disk usage: df -h /workspace
  • Use essential tags: --tags comfyui-essential (~80GB vs ~137GB)
  • Clear cache: rm -rf /workspace/huggingface_cache (models must be re-downloaded afterwards)

Supervisor not running

  • Check status: bash scripts/status.sh
  • View logs: cat /workspace/logs/supervisord.log
  • Start supervisor: bash scripts/start-all.sh
  • Check for stale PID: rm -f /workspace/supervisord.pid then restart

Service won't start

  • Check supervisor status: supervisorctl status
  • View service logs: supervisorctl tail -f comfyui or supervisorctl tail -f orchestrator
  • Check error logs: cat /workspace/logs/comfyui.err.log
  • Restart service: supervisorctl restart comfyui
  • Check if port is in use: ss -tulpn | grep :8188

Orchestrator not responding

  • Check supervisor status: supervisorctl status orchestrator
  • View logs: supervisorctl tail -f orchestrator or cat /workspace/logs/orchestrator.err.log
  • Restart: supervisorctl restart orchestrator
  • Manual start for debugging: cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
  • Image generation: 4-5 seconds per image (FLUX Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Important Conventions

  • Always use orchestrator_subprocess.py - Not the Docker version
  • Sequential loading only - One model active at a time for 24GB VRAM
  • Models downloaded by Ansible - Use playbook tags, not manual downloads
  • Services run as processes - Not systemd (RunPod containers don't support it)
  • Environment managed via .env - Required: HF_TOKEN
  • Port 9000 for orchestrator - Model services use 8000+