CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.

Key Design Philosophy:

  • Sequential model loading - Only one model active at a time to fit within GPU memory constraints
  • Process-based architecture - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
  • Automatic model switching - Orchestrator detects request types and switches models on-demand
  • OpenAI-compatible APIs - Works seamlessly with LiteLLM proxy and other AI tools

Architecture

Core Components

  1. Orchestrator (model-orchestrator/orchestrator_subprocess.py)

    • FastAPI proxy server listening on port 9000
    • Manages model lifecycle via Python subprocesses
    • Routes requests to appropriate model services
    • Handles sequential model loading/unloading
  2. Model Registry (model-orchestrator/models.yaml)

    • YAML configuration defining available models
    • Specifies: type, framework, service script, port, VRAM requirements, startup time
    • Easy to extend with new models
  3. Model Services (models/*/)

    • Individual Python servers running specific AI models
    • vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
    • ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
  4. Ansible Provisioning (playbook.yml)

    • Complete infrastructure-as-code setup
    • Installs dependencies, downloads models, configures services
    • Supports selective installation via tags

Why Process-Based Instead of Docker?

The subprocess implementation (orchestrator_subprocess.py) is preferred over the Docker version (orchestrator.py) because:

  • RunPod instances run in containers - Docker-in-Docker adds complexity
  • Faster model startup (direct Python process spawning)
  • Simpler debugging (single process tree)
  • Reduced overhead (no container management layer)

Note: Always use orchestrator_subprocess.py for RunPod deployments.

Common Commands

Repository Management with Arty

This project uses Arty for repository and deployment management. See arty.yml for full configuration.

# Clone all repositories (fresh deployment)
arty sync --env prod          # Production: Essential nodes only
arty sync --env dev           # Development: All nodes including optional
arty sync --env minimal       # Minimal: Just orchestrator + ComfyUI base

# Run deployment scripts
arty run setup/full           # Show setup instructions
arty run models/link-comfyui  # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes   # Install custom node dependencies
arty run services/start       # Start orchestrator
arty run services/stop        # Stop all services

# Health checks
arty run health/orchestrator  # Check orchestrator
arty run health/comfyui      # Check ComfyUI
arty run check/gpu           # nvidia-smi
arty run check/models        # Show cache size

Initial Setup

# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies

Service Management

This project uses Supervisor for process management, providing auto-restart, centralized logging, and easy service control.

# Start all services (with Supervisor)
bash scripts/start-all.sh           # Starts supervisor daemon + services
arty run services/start             # Same via arty

# Stop all services
bash scripts/stop-all.sh            # Stops all services + supervisor
arty run services/stop              # Same via arty

# Check service status
bash scripts/status.sh              # Show all service status
arty run services/status            # Same via arty
supervisorctl status                # Direct supervisor command

# Individual service control
supervisorctl start orchestrator    # Start orchestrator
supervisorctl restart comfyui       # Restart ComfyUI
supervisorctl stop orchestrator     # Stop orchestrator
arty run services/restart-comfyui   # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui       # Follow ComfyUI logs
supervisorctl tail -f orchestrator  # Follow orchestrator logs
arty run services/logs              # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)

Supervisor Configuration:

  • Config file: /workspace/supervisord.conf
  • Log directory: /workspace/logs/
  • PID file: /workspace/supervisord.pid
  • Socket: /workspace/supervisor.sock

Services managed:

  • comfyui - ComfyUI server (port 8188, autostart enabled)
  • orchestrator - Model orchestrator (port 9000, autostart disabled)
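
Supervisor also exposes its control API over XML-RPC on the same port as the web interface, so services can be inspected or restarted programmatically. A minimal sketch, assuming the inet_http_server on port 9001 with the admin/runpod2024 credentials shown above:

# Sketch: control Supervisor-managed services over its XML-RPC API.
# Assumes the web interface on port 9001 with the credentials above.
from xmlrpc.client import ServerProxy

server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

# List all managed processes and their states (RUNNING, STOPPED, FATAL, ...)
for proc in server.supervisor.getAllProcessInfo():
    print(f"{proc['name']:15s} {proc['statename']}")

# Restart ComfyUI, equivalent to `supervisorctl restart comfyui`
server.supervisor.stopProcess("comfyui")
server.supervisor.startProcess("comfyui")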

Testing

# Health check
curl http://localhost:9000/health

# List available models
curl http://localhost:9000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

# Test image generation
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
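
The same endpoints can also be exercised from Python with any OpenAI-compatible client. A minimal sketch using the openai package (pip install openai); the api_key is just a placeholder value:

# Sketch: streaming chat completion against the orchestrator with the
# OpenAI Python client. The api_key is only a placeholder here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()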

Ansible Tags Reference

System Setup:

  • base - Base system packages
  • python - Python environment setup
  • dependencies - Install Python packages

Model Installation:

  • models - Download vLLM/Flux/MusicGen models (legacy)
  • comfyui - Install ComfyUI base
  • comfyui-essential - Quick setup (ComfyUI + essential models only, ~80GB)
  • comfyui-models-image - Image generation models (FLUX, SDXL, SD3.5)
  • comfyui-models-video - Video generation models (CogVideoX, SVD)
  • comfyui-models-audio - Audio generation models (MusicGen variants)
  • comfyui-models-support - CLIP, IP-Adapter, ControlNet models
  • comfyui-models-all - All ComfyUI models (~137GB)
  • comfyui-nodes - Install essential custom nodes

Infrastructure:

  • tailscale - Install Tailscale VPN client
  • supervisor - Install and configure Supervisor process manager
  • systemd - Configure systemd services (tagged never; not applicable on RunPod)
  • validate - Health checks (tagged never; run explicitly with --tags validate)

Adding New Models

  1. Add model definition to model-orchestrator/models.yaml:
llama-3.1-8b:
  type: text
  framework: vllm
  service_script: models/vllm/server_llama.py
  port: 8001
  vram_gb: 17
  startup_time_seconds: 120
  endpoint: /v1/chat/completions
  description: "Llama 3.1 8B Instruct"
  2. Create service script (models/vllm/server_llama.py):
import os
import sys

# Delegate to vLLM's OpenAI-compatible server CLI
# (python -m vllm.entrypoints.openai.api_server)
model = "meta-llama/Llama-3.1-8B-Instruct"
port = os.getenv("PORT", "8001")
os.execvp(sys.executable, [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", model,
    "--port", port,
])
  3. Download model (handled by the Ansible playbook, or manually via the HuggingFace CLI)

  4. Restart orchestrator:

bash scripts/stop-all.sh && bash scripts/start-all.sh
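
After the restart, you can confirm the new entry is registered by querying the model list endpoint from the Testing section. A quick sketch, assuming the usual OpenAI-style response shape ({"data": [{"id": ...}]}):

# Sketch: confirm the new model shows up in /v1/models
import requests

resp = requests.get("http://localhost:9000/v1/models", timeout=10)
resp.raise_for_status()
model_ids = [m["id"] for m in resp.json().get("data", [])]
print("llama-3.1-8b registered:", "llama-3.1-8b" in model_ids)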

Key Implementation Details

Model Switching Logic

The orchestrator automatically switches models based on:

  • Endpoint path - /v1/chat/completions → text models, /v1/images/generations → image models
  • Model name in request - Matches against model registry
  • Sequential loading - Stops current model before starting new one to conserve VRAM

See orchestrator_subprocess.py:64-100 for process management implementation.
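
Conceptually, a switch proceeds as in the sketch below. This is illustrative only, not the actual implementation; the field names mirror models.yaml, and the /health probe path is an assumption.

# Illustrative sketch of sequential model switching (not the real code;
# see the reference above). Field names mirror models.yaml.
import os
import subprocess
import time

import requests

def switch_model(current_proc, model_cfg):
    # Stop the currently running service first to free VRAM
    if current_proc is not None and current_proc.poll() is None:
        current_proc.terminate()
        current_proc.wait(timeout=60)

    # Spawn the new model service as a plain child process
    proc = subprocess.Popen(
        ["python3", model_cfg["service_script"]],
        env={**os.environ, "PORT": str(model_cfg["port"])},
    )

    # Poll until the service answers, up to startup_time_seconds
    # (the /health path is an assumption; the real probe may differ)
    deadline = time.time() + model_cfg["startup_time_seconds"]
    url = f"http://localhost:{model_cfg['port']}/health"
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).ok:
                return proc
        except requests.ConnectionError:
            pass
        time.sleep(2)
    proc.terminate()
    raise RuntimeError("model service did not become healthy in time")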

Model Registry Structure

Each model in models.yaml requires:

  • type - text, image, or audio
  • framework - vllm, openedai-images, audiocraft, comfyui
  • service_script - Relative path to Python/shell script
  • port - Service port (8000+)
  • vram_gb - GPU memory requirement
  • startup_time_seconds - Max health check timeout
  • endpoint - API endpoint path
  • description - Human-readable description
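
Before restarting the orchestrator, the registry can be sanity-checked for missing fields. A small sketch, assuming PyYAML is installed and the top-level mapping shown under Adding New Models:

# Sketch: check that every entry in models.yaml has the required fields
import yaml

REQUIRED = {"type", "framework", "service_script", "port", "vram_gb",
            "startup_time_seconds", "endpoint", "description"}

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)

for name, cfg in registry.items():
    missing = REQUIRED - set(cfg)
    if missing:
        print(f"{name}: missing {sorted(missing)}")
    else:
        print(f"{name}: ok ({cfg['type']}, port {cfg['port']}, {cfg['vram_gb']} GB)")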

Environment Variables

Set in .env file:

  • HF_TOKEN - Required - HuggingFace API token for model downloads
  • GPU_TAILSCALE_IP - Optional - Tailscale IP for VPN access

Models are cached in:

  • /workspace/huggingface_cache - HuggingFace models
  • /workspace/models - Other model files
  • /workspace/ComfyUI/models - ComfyUI model directory structure
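
Downloads are normally handled by the Ansible playbook, but a single model can be pulled into the shared cache manually with huggingface_hub. A sketch; the repo_id below is only an example:

# Sketch: one-off manual download into the shared HF cache
# (Ansible normally handles this; repo_id is only an example)
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/huggingface_cache",
    token=os.environ["HF_TOKEN"],
)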

Integration with LiteLLM

For unified API management through LiteLLM proxy:

LiteLLM configuration (litellm-config.yaml on VPS):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b  # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600

Critical: Use hosted_vllm/openai/ prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.

ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

Directory structure created:

/workspace/ComfyUI/
├── models/
│   ├── checkpoints/        # FLUX, SDXL, SD3 models
│   ├── clip_vision/        # CLIP vision models
│   ├── video_models/       # CogVideoX, SVD
│   └── audio_models/       # MusicGen
└── custom_nodes/           # Extension nodes

Essential custom nodes installed:

  • ComfyUI-Manager - Model/node management GUI
  • ComfyUI-VideoHelperSuite - Video operations
  • ComfyUI-AnimateDiff-Evolved - Video generation
  • ComfyUI_IPAdapter_plus - Style transfer
  • ComfyUI-Impact-Pack - Auto face enhancement
  • comfyui-sound-lab - Audio generation

VRAM requirements for 24GB GPU:

  • FLUX Schnell FP16: 23GB (leaves 1GB)
  • SDXL Base: 12GB
  • CogVideoX-5B: 12GB (with optimizations)
  • MusicGen Medium: 8GB
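
Before loading a heavy workflow it is worth confirming there is enough headroom. A sketch that compares free VRAM (via nvidia-smi, as used elsewhere in this doc) against a requirement from the table above:

# Sketch: check free VRAM against a model's requirement via nvidia-smi
import subprocess

free_mib = int(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free",
     "--format=csv,noheader,nounits"]).split()[0])

required_gb = 23  # e.g. FLUX Schnell FP16 from the table above
enough = free_mib / 1024 >= required_gb
print(f"free: {free_mib / 1024:.1f} GB, required: {required_gb} GB, ok: {enough}")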

See COMFYUI_MODELS.md for detailed model catalog and usage examples.

Deployment Workflow

RunPod Deployment (Current Setup)

  1. Clone repository:

    cd /workspace
    git clone <repo-url> ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env, set HF_TOKEN
    
  3. Run Ansible provisioning:

    ansible-playbook playbook.yml
    # Or selective: --tags base,python,comfyui-essential
    
  4. Start services:

    bash scripts/start-all.sh
    
  5. Verify:

    curl http://localhost:9000/health
    

Tailscale VPN Integration

To connect RunPod GPU to VPS infrastructure:

# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4  # Get IP for LiteLLM config

Benefits: Secure tunnel, no public exposure, low latency.

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (USE THIS)
│   ├── orchestrator.py             # Docker-based version (legacy)
│   ├── models.yaml                 # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── comfyui_models.yaml             # ComfyUI model configuration
└── README.md                       # User documentation

Troubleshooting

Model fails to start

  • Check VRAM: nvidia-smi
  • Verify model weights downloaded: ls -lh /workspace/huggingface_cache
  • Check port conflicts: lsof -i :9000
  • Test model directly: python3 models/vllm/server.py

Streaming returns empty deltas

  • Use correct LiteLLM model prefix: hosted_vllm/openai/model-name
  • Set stream: true in LiteLLM config
  • Verify orchestrator proxies streaming correctly

HuggingFace download errors

  • Check token: echo $HF_TOKEN
  • Set in .env: HF_TOKEN=your_token_here
  • Re-run Ansible: ansible-playbook playbook.yml --tags dependencies

Out of storage space

  • Check disk usage: df -h /workspace
  • Use essential tags: --tags comfyui-essential (~80GB vs ~137GB)
  • Clear cache: rm -rf /workspace/huggingface_cache (models must be re-downloaded afterwards)

Supervisor not running

  • Check status: bash scripts/status.sh
  • View logs: cat /workspace/logs/supervisord.log
  • Start supervisor: bash scripts/start-all.sh
  • Check for stale PID: rm -f /workspace/supervisord.pid then restart

Service won't start

  • Check supervisor status: supervisorctl status
  • View service logs: supervisorctl tail -f comfyui or supervisorctl tail -f orchestrator
  • Check error logs: cat /workspace/logs/comfyui.err.log
  • Restart service: supervisorctl restart comfyui
  • Check if port is in use: ss -tulpn | grep :8188

Orchestrator not responding

  • Check supervisor status: supervisorctl status orchestrator
  • View logs: supervisorctl tail -f orchestrator or cat /workspace/logs/orchestrator.err.log
  • Restart: supervisorctl restart orchestrator
  • Manual start for debugging: cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
  • Image generation: 4-5 seconds per image (FLUX Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Important Conventions

  • Always use orchestrator_subprocess.py - Not the Docker version
  • Sequential loading only - One model active at a time for 24GB VRAM
  • Models downloaded by Ansible - Use playbook tags, not manual downloads
  • Services run as processes - Not systemd (RunPod containers don't support it)
  • Environment managed via .env - Required: HF_TOKEN
  • Port 9000 for orchestrator - Model services use 8000+