CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.

Key Design Philosophy:

  • Direct vLLM access - Dedicated FastAPI servers for each text model (no proxy layer)
  • Process-based architecture - Managed by Supervisor for reliability
  • Simple service management - Start/stop models independently as needed
  • OpenAI-compatible APIs - Works seamlessly with LiteLLM proxy and other AI tools

Architecture

Core Components

  1. vLLM Servers (vllm/server_qwen.py, vllm/server_llama.py)

    • Dedicated FastAPI servers for each text model
    • Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
    • Direct AsyncLLMEngine access for optimal streaming performance
    • max_model_len=20000 optimized for 24GB VRAM
  2. ComfyUI (comfyui/)

    • Image/video/audio generation server on port 8188
    • FLUX, SDXL, CogVideoX, MusicGen models
    • Advanced workflow system with custom nodes
  3. Supervisor (supervisord.conf)

    • Process manager for all AI services
    • Auto-restart, centralized logging
    • Web interface on port 9001
  4. Ansible Provisioning (playbook.yml)

    • Complete infrastructure-as-code setup
    • Installs dependencies, downloads models, configures services
    • Supports selective installation via tags
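
A quick way to confirm the port layout above is to probe each component over HTTP. The following is a minimal sketch using only the Python standard library; it assumes the /health endpoints on the vLLM servers (shown in Testing below) and ComfyUI's /system_stats endpoint, and that everything runs on localhost:

#!/usr/bin/env python3
"""Probe the service ports described above (sketch; adjust host/ports to your setup)."""
import urllib.error
import urllib.request

SERVICES = {
    "vllm-qwen": "http://localhost:8000/health",
    "vllm-llama": "http://localhost:8001/health",
    "comfyui": "http://localhost:8188/system_stats",
    "supervisor": "http://localhost:9001",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name:11s} UP   (HTTP {resp.status})")
    except urllib.error.HTTPError as exc:
        # An HTTP error still means the port answered (e.g. supervisor's 401 auth prompt)
        print(f"{name:11s} UP   (HTTP {exc.code})")
    except Exception as exc:
        print(f"{name:11s} DOWN ({exc})")

Only one GPU-heavy service is expected to be up at a time (see GPU Memory Management below), so DOWN entries are normal.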

Common Commands

Repository Management with Arty

This project uses Arty for repository and deployment management. See arty.yml for full configuration.

# Clone all repositories (fresh deployment)
arty sync --env prod          # Production: Essential nodes only
arty sync --env dev           # Development: All nodes including optional
arty sync --env minimal       # Minimal: Just vLLM + ComfyUI base

# Run deployment scripts
arty run setup/full           # Show setup instructions
arty run models/link-comfyui  # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes   # Install custom node dependencies
arty run services/start       # Start supervisor
arty run services/stop        # Stop all services

# Health checks
arty run health/comfyui      # Check ComfyUI
arty run check/gpu           # nvidia-smi
arty run check/models        # Show cache size

Initial Setup

# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Alternative: essential ComfyUI setup only (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Alternative: selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies

Service Management

This project uses Supervisor for process management, providing auto-restart, centralized logging, and easy service control.

# Start all services (with Supervisor)
bash scripts/start-all.sh           # Starts supervisor daemon + services
arty run services/start             # Same via arty

# Stop all services
bash scripts/stop-all.sh            # Stops all services + supervisor
arty run services/stop              # Same via arty

# Check service status
bash scripts/status.sh              # Show all service status
arty run services/status            # Same via arty
supervisorctl status                # Direct supervisor command

# Individual service control
supervisorctl start vllm-qwen       # Start Qwen vLLM server
supervisorctl start vllm-llama      # Start Llama vLLM server
supervisorctl restart comfyui       # Restart ComfyUI
supervisorctl stop vllm-qwen        # Stop Qwen vLLM server
arty run services/restart-comfyui   # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui       # Follow ComfyUI logs
supervisorctl tail -f vllm-qwen     # Follow Qwen vLLM logs
supervisorctl tail -f vllm-llama    # Follow Llama vLLM logs
arty run services/logs              # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)

Supervisor Configuration:

  • Config file: /workspace/supervisord.conf
  • Log directory: /workspace/logs/
  • PID file: /workspace/supervisord.pid
  • Socket: /workspace/supervisor.sock

Services managed:

  • comfyui - ComfyUI server (port 8188, autostart enabled)
  • vllm-qwen - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
  • vllm-llama - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
  • webdav-sync - WebDAV sync service for ComfyUI outputs
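
Besides supervisorctl, Supervisor's XML-RPC interface (same port 9001 as the web UI) can be scripted directly. A minimal sketch, assuming the credentials listed above:

#!/usr/bin/env python3
"""List the supervisor-managed services over XML-RPC (sketch)."""
from xmlrpc.client import ServerProxy

# Credentials and port taken from the web interface section above / supervisord.conf
server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

for proc in server.supervisor.getAllProcessInfo():
    print(f"{proc['name']:12s} {proc['statename']:10s} {proc['description']}")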

GPU Memory Management and Mode Switching

VRAM Constraints (RTX 4090 - 24GB total):

The GPU has limited memory, which requires manual service switching:

Service   Model               VRAM Usage   Compatible With
ComfyUI   FLUX Schnell FP16   ~23GB        None (uses all VRAM)
ComfyUI   SDXL Base           ~12GB        Small vLLM models
vLLM      Qwen 2.5 7B         ~14GB        None (conflicts with ComfyUI)
vLLM      Llama 3.1 8B        ~17GB        None (conflicts with ComfyUI)

Mode Switching Workflow:

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

Switch to Text Generation Mode (vLLM):

# 1. Stop ComfyUI
supervisorctl stop comfyui

# 2. Start desired vLLM server
supervisorctl start vllm-qwen    # or vllm-llama

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage

Switch to Image/Video/Audio Generation Mode (ComfyUI):

# 1. Stop vLLM servers
supervisorctl stop vllm-qwen vllm-llama

# 2. Start ComfyUI
supervisorctl start comfyui

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage
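
If you switch modes often, the supervisorctl calls above can be wrapped in a small helper. A sketch (the script itself is hypothetical; service names match the supervisord.conf entries listed earlier):

#!/usr/bin/env python3
"""Switch between text (vLLM) and image/video/audio (ComfyUI) mode (sketch)."""
import subprocess
import sys

TEXT_SERVICES = ["vllm-qwen", "vllm-llama"]
IMAGE_SERVICES = ["comfyui"]

def sctl(*args: str) -> None:
    # Exit status is ignored: stopping an already-stopped service is not an error here
    subprocess.run(["supervisorctl", *args])

def switch(mode: str) -> None:
    if mode == "text":
        sctl("stop", *IMAGE_SERVICES)
        sctl("start", "vllm-qwen")          # or "vllm-llama"
    elif mode == "image":
        sctl("stop", *TEXT_SERVICES)
        sctl("start", *IMAGE_SERVICES)
    else:
        sys.exit(f"unknown mode: {mode!r} (expected 'text' or 'image')")
    sctl("status")

if __name__ == "__main__":
    switch(sys.argv[1] if len(sys.argv) > 1 else "image")

After switching, confirm with nvidia-smi that VRAM has actually been released before loading the next model.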

Access via Supervisor Web UI:

You can also switch modes using the Supervisor web interface:

  • URL: https://supervisor.ai.pivoine.art (via VPS proxy) or http://100.114.60.40:9001 (direct Tailscale)
  • Username: admin
  • Password: runpod2024
  • Click "Start" or "Stop" buttons for each service

Integration with LiteLLM:

The vLLM servers integrate with LiteLLM on the VPS for unified API access:

  • vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
  • Requests route directly to vLLM servers (ports 8000, 8001)
  • Environment variables GPU_VLLM_QWEN_URL and GPU_VLLM_LLAMA_URL configure connections
  • LiteLLM config uses os.environ/GPU_VLLM_*_URL syntax for dynamic URLs

Testing

# Health check (Qwen vLLM server)
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'
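
# Because the endpoint is OpenAI-compatible, the same streaming request works from
# Python with the OpenAI SDK (a sketch; assumes `pip install openai` and the Qwen
# server running on port 8000):

#!/usr/bin/env python3
"""Stream a chat completion from the local Qwen vLLM server (sketch)."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()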

# Check ComfyUI is responding (queueing an actual image requires a full workflow
# graph on the /prompt endpoint - see the Python sketch below)
curl http://localhost:8188/system_stats
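
To queue an actual generation, ComfyUI's POST /prompt endpoint expects a complete workflow graph. The usual approach is to export one from the ComfyUI UI via "Save (API Format)" and submit it. A minimal sketch (workflow_api.json is a placeholder filename):

#!/usr/bin/env python3
"""Queue a ComfyUI workflow exported in API format (sketch)."""
import json
import urllib.request

with open("workflow_api.json") as f:        # exported via "Save (API Format)" in the ComfyUI UI
    workflow = json.load(f)

req = urllib.request.Request(
    "http://localhost:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())             # includes a prompt_id you can look up under /history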

Ansible Tags Reference

System Setup:

  • base - Base system packages
  • python - Python environment setup
  • dependencies - Install Python packages

Model Installation:

  • models - Download vLLM/Flux/MusicGen models (legacy)
  • comfyui - Install ComfyUI base
  • comfyui-essential - Quick setup (ComfyUI + essential models only, ~80GB)
  • comfyui-models-image - Image generation models (FLUX, SDXL, SD3.5)
  • comfyui-models-video - Video generation models (CogVideoX, SVD)
  • comfyui-models-audio - Audio generation models (MusicGen variants)
  • comfyui-models-support - CLIP, IP-Adapter, ControlNet models
  • comfyui-models-all - All ComfyUI models (~137GB)
  • comfyui-nodes - Install essential custom nodes

Infrastructure:

  • tailscale - Install Tailscale VPN client
  • supervisor - Install and configure Supervisor process manager
  • systemd - Configure systemd services (use never - not for RunPod)
  • validate - Health checks (use never - run explicitly)

Adding New vLLM Models

  1. Create dedicated server script (e.g., vllm/server_mistral.py):
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs
# ... (copy structure from server_qwen.py)

model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
port = 8002  # Choose unique port
  2. Add supervisor configuration (supervisord.conf):
[program:vllm-mistral]
command=vllm/venv/bin/python vllm/server_mistral.py
directory=.
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
  3. Download model (handled by Ansible playbook or manually via HuggingFace CLI)

  4. Add to LiteLLM config (on VPS):

- model_name: mistral-7b
  litellm_params:
    model: hosted_vllm/openai/mistral-7b
    api_base: os.environ/GPU_VLLM_MISTRAL_URL
    api_key: "EMPTY"
  5. Start the service:
supervisorctl start vllm-mistral

Downloading NSFW Models from CivitAI

The project includes dedicated tooling for downloading NSFW models from CivitAI, separate from the HuggingFace downloader.

Files:

  • Configuration: models_civitai.yaml - Defines available CivitAI models
  • Downloader: artifact_civitai_download.sh - Beautiful CLI tool for downloading models

Requirements:

  • CIVITAI_API_KEY set in .env (see .env.example) - required for authenticated CivitAI downloads

Available Models (Essential):

  1. LUSTIFY v7.0 GGWP (6.31GB) - Photoreal NSFW checkpoint for women
  2. Pony Diffusion V6 XL (6.5GB) - Anime/furry NSFW with balanced content
  3. RealVisXL V5.0 (6.8GB) - Photorealistic NSFW model

Usage:

# Download all essential NSFW models (19.61GB)
artifact_civitai_download.sh

# Download only (no symlinks)
artifact_civitai_download.sh download

# Create symlinks only (models already downloaded)
artifact_civitai_download.sh link

# Custom configuration file
artifact_civitai_download.sh -c /path/to/models_civitai.yaml

# Help
artifact_civitai_download.sh --help

Directory Structure:

  • Models downloaded to: /workspace/models/civitai/ (cache)
  • Symlinks created in: /workspace/ComfyUI/models/checkpoints/

Features:

  • Beautiful purple/magenta themed CLI (vs cyan for HuggingFace)
  • Automatic retry with exponential backoff for rate limiting
  • Resume interrupted downloads
  • Progress bars and status indicators
  • Parallel download support (respects rate limits)

Configuration Example (models_civitai.yaml):

model_categories:
  nsfw_checkpoints:
    - name: lustify-v7-ggwp
      version_id: "1094291"
      model_id: "573152"
      description: "LUSTIFY v7.0 GGWP - Photoreal NSFW checkpoint"
      size_gb: 6.31
      essential: true
      type: checkpoints
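
For reference, each entry's version_id maps to a CivitAI download URL; the bundled script adds retries, rate limiting, and symlinking on top. A bare-bones sketch, assuming the standard https://civitai.com/api/download/models/<version_id> endpoint and CIVITAI_API_KEY in the environment (the output filename is illustrative):

#!/usr/bin/env python3
"""Minimal CivitAI checkpoint download (sketch; use artifact_civitai_download.sh in practice)."""
import os
import urllib.request

version_id = "1094291"                       # lustify-v7-ggwp, from models_civitai.yaml
token = os.environ["CIVITAI_API_KEY"]        # documented in .env.example
url = f"https://civitai.com/api/download/models/{version_id}?token={token}"

dest = "/workspace/models/civitai/lustify-v7-ggwp.safetensors"   # illustrative filename
os.makedirs(os.path.dirname(dest), exist_ok=True)
urllib.request.urlretrieve(url, dest)        # no retry/resume here - the bundled script handles that
print(f"Saved {dest}")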

Installation Profiles:

  • essential: 3 models (LUSTIFY, Pony, RealVisXL) - 19.61GB, 30-60 min
  • complete: All 6 models - 39.21GB, 1-2 hours
  • realistic_only: LUSTIFY + RealVisXL - 13.11GB, 20-40 min
  • anime_only: Pony + WAI-NSFW - 13.0GB, 20-40 min

Important Notes:

  • All models are SDXL-based (require ~12GB VRAM)
  • Cannot be loaded alongside FLUX models (FLUX Schnell FP16 alone needs ~23GB of the 24GB)
  • Can run with smaller models like standard SDXL (~12GB total)
  • Rate limiting: 5 second delay between downloads (configurable)

Key Implementation Details

Direct vLLM Server Architecture

Each vLLM model runs as a dedicated FastAPI server:

  • Independent process management via Supervisor
  • Direct AsyncLLMEngine access for optimal performance
  • OpenAI-compatible /v1/chat/completions endpoint
  • Streaming support with proper delta formatting

See vllm/server_qwen.py for reference implementation.
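
For orientation, the pattern looks roughly like the sketch below. This is a simplified illustration of the approach, not the actual server_qwen.py: the model id is illustrative, the chat template is skipped, and error handling and non-streaming responses are omitted.

#!/usr/bin/env python3
"""Simplified sketch of a dedicated OpenAI-compatible vLLM server (see vllm/server_qwen.py)."""
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=20000)  # sized for 24GB VRAM
)

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 256
    stream: bool = True          # this sketch always streams

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # The real server applies the model's chat template; this just concatenates messages
    prompt = "\n".join(m["content"] for m in req.messages)
    params = SamplingParams(max_tokens=req.max_tokens)
    request_id = str(uuid.uuid4())

    async def sse():
        sent = 0
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)    # emit only newly generated text as a delta
            chunk = {"model": req.model, "choices": [{"delta": {"content": delta}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)     # one dedicated port per model

Each model gets its own script and port so Supervisor can start and stop them independently.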

Environment Variables

Set in .env file:

  • HF_TOKEN - Required - HuggingFace API token for model downloads
  • GPU_TAILSCALE_IP - Optional - Tailscale IP for VPN access

Model Storage

Models are cached in:

  • /workspace/huggingface_cache - HuggingFace models
  • /workspace/models - Other model files
  • /workspace/ComfyUI/models - ComfyUI model directory structure

Integration with LiteLLM

For unified API management through LiteLLM proxy:

LiteLLM configuration (litellm-config.yaml on VPS):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b  # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600

Critical: Use hosted_vllm/openai/ prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.

ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

Directory structure created:

/workspace/ComfyUI/
├── models/
│   ├── checkpoints/        # FLUX, SDXL, SD3 models
│   ├── clip_vision/        # CLIP vision models
│   ├── video_models/       # CogVideoX, SVD
│   └── audio_models/       # MusicGen
└── custom_nodes/           # Extension nodes

Essential custom nodes installed:

  • ComfyUI-Manager - Model/node management GUI
  • ComfyUI-VideoHelperSuite - Video operations
  • ComfyUI-AnimateDiff-Evolved - Video generation
  • ComfyUI_IPAdapter_plus - Style transfer
  • ComfyUI-Impact-Pack - Auto face enhancement
  • comfyui-sound-lab - Audio generation

VRAM requirements for 24GB GPU:

  • FLUX Schnell FP16: 23GB (leaves 1GB)
  • SDXL Base: 12GB
  • CogVideoX-5B: 12GB (with optimizations)
  • MusicGen Medium: 8GB
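
Before loading a model, it can help to check how much VRAM is actually free against the numbers above (a sketch using the NVML bindings, assuming `pip install nvidia-ml-py`; nvidia-smi reports the same figures):

#!/usr/bin/env python3
"""Print used/free VRAM on the first GPU to sanity-check model choices (sketch)."""
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # single RTX 4090 assumed
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
gib = 1024 ** 3
print(f"VRAM: {mem.used / gib:.1f} GiB used, {mem.free / gib:.1f} GiB free, "
      f"{mem.total / gib:.1f} GiB total")
pynvml.nvmlShutdown()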

See COMFYUI_MODELS.md for detailed model catalog and usage examples.

Deployment Workflow

RunPod Deployment (Current Setup)

  1. Clone repository:

    cd /workspace
    git clone <repo-url> ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env, set HF_TOKEN
    
  3. Run Ansible provisioning:

    ansible-playbook playbook.yml
    # Or selective: --tags base,python,comfyui-essential
    
  4. Start services:

    bash scripts/start-all.sh
    
  5. Verify:

    curl http://localhost:9000/health
    

Tailscale VPN Integration

To connect RunPod GPU to VPS infrastructure:

# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4  # Get IP for LiteLLM config

Benefits: Secure tunnel, no public exposure, low latency.

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (USE THIS)
│   ├── orchestrator.py             # Docker-based version (legacy)
│   ├── models.yaml                 # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── models_huggingface.yaml         # HuggingFace model configuration
├── models_civitai.yaml             # CivitAI NSFW model configuration
├── artifact_civitai_download.sh    # CivitAI model downloader
└── README.md                       # User documentation

Troubleshooting

Model fails to start

  • Check VRAM: nvidia-smi
  • Verify model weights downloaded: ls -lh /workspace/huggingface_cache
  • Check port conflicts: lsof -i :9000
  • Test model directly: python3 models/vllm/server.py

Streaming returns empty deltas

  • Use correct LiteLLM model prefix: hosted_vllm/openai/model-name
  • Set stream: true in LiteLLM config
  • Verify orchestrator proxies streaming correctly

HuggingFace download errors

  • Check token: echo $HF_TOKEN
  • Set in .env: HF_TOKEN=your_token_here
  • Re-run Ansible: ansible-playbook playbook.yml --tags dependencies

Out of storage space

  • Check disk usage: df -h /workspace
  • Use essential tags: --tags comfyui-essential (~80GB vs ~137GB)
  • Clear cache: rm -rf /workspace/huggingface_cache

Supervisor not running

  • Check status: bash scripts/status.sh
  • View logs: cat /workspace/logs/supervisord.log
  • Start supervisor: bash scripts/start-all.sh
  • Check for stale PID: rm -f /workspace/supervisord.pid then restart

Service won't start

  • Check supervisor status: supervisorctl status
  • View service logs: supervisorctl tail -f comfyui or supervisorctl tail -f orchestrator
  • Check error logs: cat /workspace/logs/comfyui.err.log
  • Restart service: supervisorctl restart comfyui
  • Check if port is in use: ss -tulpn | grep :8188

Orchestrator not responding

  • Check supervisor status: supervisorctl status orchestrator
  • View logs: supervisorctl tail -f orchestrator or cat /workspace/logs/orchestrator.err.log
  • Restart: supervisorctl restart orchestrator
  • Manual start for debugging: cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
  • Image generation: 4-5 seconds per image (FLUX Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Important Conventions

  • Always use orchestrator_subprocess.py - Not the Docker version
  • Sequential loading only - One model active at a time for 24GB VRAM
  • Models downloaded by Ansible - Use playbook tags, not manual downloads
  • Services run as processes - Not systemd (RunPod containers don't support it)
  • Environment managed via .env - Required: HF_TOKEN
  • Port 9000 for orchestrator - Model services use 8000+