CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.

Key Design Philosophy:

  • Direct vLLM access - Dedicated FastAPI servers for each text model (no proxy layer)
  • Process-based architecture - Managed by Supervisor for reliability
  • Simple service management - Start/stop models independently as needed
  • OpenAI-compatible APIs - Works seamlessly with LiteLLM proxy and other AI tools

Architecture

Core Components

  1. vLLM Servers (vllm/server_qwen.py, vllm/server_llama.py)

    • Dedicated FastAPI servers for each text model
    • Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
    • Direct AsyncLLMEngine access for optimal streaming performance
    • max_model_len=20000 optimized for 24GB VRAM
  2. ComfyUI (comfyui/)

    • Image/video/audio generation server on port 8188
    • FLUX, SDXL, CogVideoX, MusicGen models
    • Advanced workflow system with custom nodes
  3. Supervisor (supervisord.conf)

    • Process manager for all AI services
    • Auto-restart, centralized logging
    • Web interface on port 9001
  4. Ansible Provisioning (playbook.yml)

    • Complete infrastructure-as-code setup
    • Installs dependencies, downloads models, configures services
    • Supports selective installation via tags
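
A quick way to confirm the port layout above is to probe each component over HTTP. The following is a minimal sketch using only the Python standard library; it assumes the /health endpoints on the vLLM servers (shown in Testing below) and ComfyUI's /system_stats endpoint, and that everything runs on localhost:

#!/usr/bin/env python3
"""Probe the service ports described above (sketch; adjust host/ports to your setup)."""
import urllib.error
import urllib.request

SERVICES = {
    "vllm-qwen": "http://localhost:8000/health",
    "vllm-llama": "http://localhost:8001/health",
    "comfyui": "http://localhost:8188/system_stats",
    "supervisor": "http://localhost:9001",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name:11s} UP   (HTTP {resp.status})")
    except urllib.error.HTTPError as exc:
        # An HTTP error still means the port answered (e.g. supervisor's 401 auth prompt)
        print(f"{name:11s} UP   (HTTP {exc.code})")
    except Exception as exc:
        print(f"{name:11s} DOWN ({exc})")

Only one GPU-heavy service is expected to be up at a time (see GPU Memory Management below), so DOWN entries are normal.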

Common Commands

Repository Management with Arty

This project uses Arty for repository and deployment management. See arty.yml for full configuration.

# Clone all repositories (fresh deployment)
arty sync --env prod          # Production: Essential nodes only
arty sync --env dev           # Development: All nodes including optional
arty sync --env minimal       # Minimal: Just vLLM + ComfyUI base

# Run deployment scripts
arty run setup/full           # Show setup instructions
arty run models/link-comfyui  # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes   # Install custom node dependencies
arty run services/start       # Start supervisor
arty run services/stop        # Stop all services

# Health checks
arty run health/comfyui      # Check ComfyUI
arty run check/gpu           # nvidia-smi
arty run check/models        # Show cache size

Initial Setup

# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Alternative: essential ComfyUI setup only (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Alternative: selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies

Service Management

This project uses Supervisor for process management, providing auto-restart, centralized logging, and easy service control.

# Start all services (with Supervisor)
bash scripts/start-all.sh           # Starts supervisor daemon + services
arty run services/start             # Same via arty

# Stop all services
bash scripts/stop-all.sh            # Stops all services + supervisor
arty run services/stop              # Same via arty

# Check service status
bash scripts/status.sh              # Show all service status
arty run services/status            # Same via arty
supervisorctl status                # Direct supervisor command

# Individual service control
supervisorctl start vllm-qwen       # Start Qwen vLLM server
supervisorctl start vllm-llama      # Start Llama vLLM server
supervisorctl restart comfyui       # Restart ComfyUI
supervisorctl stop vllm-qwen        # Stop Qwen vLLM server
arty run services/restart-comfyui   # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui       # Follow ComfyUI logs
supervisorctl tail -f vllm-qwen     # Follow Qwen vLLM logs
supervisorctl tail -f vllm-llama    # Follow Llama vLLM logs
arty run services/logs              # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)

Supervisor Configuration:

  • Config file: /workspace/supervisord.conf
  • Log directory: /workspace/logs/
  • PID file: /workspace/supervisord.pid
  • Socket: /workspace/supervisor.sock

Services managed:

  • comfyui - ComfyUI server (port 8188, autostart enabled)
  • vllm-qwen - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
  • vllm-llama - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
  • webdav-sync - WebDAV sync service for ComfyUI outputs
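
Besides supervisorctl, Supervisor's XML-RPC interface (same port 9001 as the web UI) can be scripted directly. A minimal sketch, assuming the credentials listed above:

#!/usr/bin/env python3
"""List the supervisor-managed services over XML-RPC (sketch)."""
from xmlrpc.client import ServerProxy

# Credentials and port taken from the web interface section above / supervisord.conf
server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

for proc in server.supervisor.getAllProcessInfo():
    print(f"{proc['name']:12s} {proc['statename']:10s} {proc['description']}")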

GPU Memory Management and Mode Switching

VRAM Constraints (RTX 4090 - 24GB total):

The GPU has limited memory, which requires manual service switching:

Service   Model               VRAM Usage   Compatible With
ComfyUI   FLUX Schnell FP16   ~23GB        None (uses all VRAM)
ComfyUI   SDXL Base           ~12GB        Small vLLM models
vLLM      Qwen 2.5 7B         ~14GB        None (conflicts with ComfyUI)
vLLM      Llama 3.1 8B        ~17GB        None (conflicts with ComfyUI)

Mode Switching Workflow:

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

Switch to Text Generation Mode (vLLM):

# 1. Stop ComfyUI
supervisorctl stop comfyui

# 2. Start desired vLLM server
supervisorctl start vllm-qwen    # or vllm-llama

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage

Switch to Image/Video/Audio Generation Mode (ComfyUI):

# 1. Stop vLLM servers
supervisorctl stop vllm-qwen vllm-llama

# 2. Start ComfyUI
supervisorctl start comfyui

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage
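
If you switch modes often, the supervisorctl calls above can be wrapped in a small helper. A sketch (the script itself is hypothetical; service names match the supervisord.conf entries listed earlier):

#!/usr/bin/env python3
"""Switch between text (vLLM) and image/video/audio (ComfyUI) mode (sketch)."""
import subprocess
import sys

TEXT_SERVICES = ["vllm-qwen", "vllm-llama"]
IMAGE_SERVICES = ["comfyui"]

def sctl(*args: str) -> None:
    # Exit status is ignored: stopping an already-stopped service is not an error here
    subprocess.run(["supervisorctl", *args])

def switch(mode: str) -> None:
    if mode == "text":
        sctl("stop", *IMAGE_SERVICES)
        sctl("start", "vllm-qwen")          # or "vllm-llama"
    elif mode == "image":
        sctl("stop", *TEXT_SERVICES)
        sctl("start", *IMAGE_SERVICES)
    else:
        sys.exit(f"unknown mode: {mode!r} (expected 'text' or 'image')")
    sctl("status")

if __name__ == "__main__":
    switch(sys.argv[1] if len(sys.argv) > 1 else "image")

After switching, confirm with nvidia-smi that VRAM has actually been released before loading the next model.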

Access via Supervisor Web UI:

You can also switch modes using the Supervisor web interface:

  • URL: https://supervisor.ai.pivoine.art (via VPS proxy) or http://100.114.60.40:9001 (direct Tailscale)
  • Username: admin
  • Password: runpod2024
  • Click "Start" or "Stop" buttons for each service

Integration with LiteLLM:

The vLLM servers integrate with LiteLLM on the VPS for unified API access:

  • vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
  • Requests route directly to vLLM servers (ports 8000, 8001)
  • Environment variables GPU_VLLM_QWEN_URL and GPU_VLLM_LLAMA_URL configure connections
  • LiteLLM config uses os.environ/GPU_VLLM_*_URL syntax for dynamic URLs

Testing

# Health check (Qwen vLLM server)
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'
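
# Because the endpoint is OpenAI-compatible, the same streaming request works from
# Python with the OpenAI SDK (a sketch; assumes `pip install openai` and the Qwen
# server running on port 8000):

#!/usr/bin/env python3
"""Stream a chat completion from the local Qwen vLLM server (sketch)."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()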

# Check ComfyUI is responding (queueing an actual image requires a full workflow
# graph on the /prompt endpoint - see the Python sketch below)
curl http://localhost:8188/system_stats
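
To queue an actual generation, ComfyUI's POST /prompt endpoint expects a complete workflow graph. The usual approach is to export one from the ComfyUI UI via "Save (API Format)" and submit it. A minimal sketch (workflow_api.json is a placeholder filename):

#!/usr/bin/env python3
"""Queue a ComfyUI workflow exported in API format (sketch)."""
import json
import urllib.request

with open("workflow_api.json") as f:        # exported via "Save (API Format)" in the ComfyUI UI
    workflow = json.load(f)

req = urllib.request.Request(
    "http://localhost:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())             # includes a prompt_id you can look up under /history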

Ansible Tags Reference

System Setup:

  • base - Base system packages
  • python - Python environment setup
  • dependencies - Install Python packages

Model Installation:

  • models - Download vLLM/Flux/MusicGen models (legacy)
  • comfyui - Install ComfyUI base
  • comfyui-essential - Quick setup (ComfyUI + essential models only, ~80GB)
  • comfyui-models-image - Image generation models (FLUX, SDXL, SD3.5)
  • comfyui-models-video - Video generation models (CogVideoX, SVD)
  • comfyui-models-audio - Audio generation models (MusicGen variants)
  • comfyui-models-support - CLIP, IP-Adapter, ControlNet models
  • comfyui-models-all - All ComfyUI models (~137GB)
  • comfyui-nodes - Install essential custom nodes

Infrastructure:

  • tailscale - Install Tailscale VPN client
  • supervisor - Install and configure Supervisor process manager
  • systemd - Configure systemd services (use never - not for RunPod)
  • validate - Health checks (use never - run explicitly)

Adding New vLLM Models

  1. Create dedicated server script (e.g., vllm/server_mistral.py):
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs
# ... (copy structure from server_qwen.py)

model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
port = 8002  # Choose unique port
  2. Add supervisor configuration (supervisord.conf):
[program:vllm-mistral]
command=vllm/venv/bin/python vllm/server_mistral.py
directory=.
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
  3. Download model (handled by Ansible playbook or manually via HuggingFace CLI)

  4. Add to LiteLLM config (on VPS):

- model_name: mistral-7b
  litellm_params:
    model: hosted_vllm/openai/mistral-7b
    api_base: os.environ/GPU_VLLM_MISTRAL_URL
    api_key: "EMPTY"
  5. Start the service:
supervisorctl start vllm-mistral

Downloading NSFW Models from CivitAI

The project includes dedicated tooling for downloading NSFW models from CivitAI, separate from the HuggingFace downloader.

Files:

  • Configuration: models_civitai.yaml - Defines available CivitAI models
  • Downloader: artifact_civitai_download.sh - Beautiful CLI tool for downloading models

Requirements:

  • CIVITAI_API_KEY set in .env (see .env.example) - required for authenticated CivitAI downloads

Available Models (Essential):

  1. LUSTIFY v7.0 GGWP (6.31GB) - Photoreal NSFW checkpoint for women
  2. Pony Diffusion V6 XL (6.5GB) - Anime/furry NSFW with balanced content
  3. RealVisXL V5.0 (6.8GB) - Photorealistic NSFW model

Usage:

# Download all essential NSFW models (19.61GB)
artifact_civitai_download.sh

# Download only (no symlinks)
artifact_civitai_download.sh download

# Create symlinks only (models already downloaded)
artifact_civitai_download.sh link

# Custom configuration file
artifact_civitai_download.sh -c /path/to/models_civitai.yaml

# Help
artifact_civitai_download.sh --help

Directory Structure:

  • Models downloaded to: /workspace/models/civitai/ (cache)
  • Symlinks created in: /workspace/ComfyUI/models/checkpoints/

Features:

  • Beautiful purple/magenta themed CLI (vs cyan for HuggingFace)
  • Automatic retry with exponential backoff for rate limiting
  • Resume interrupted downloads
  • Progress bars and status indicators
  • Parallel download support (respects rate limits)

Configuration Example (models_civitai.yaml):

model_categories:
  nsfw_checkpoints:
    - name: lustify-v7-ggwp
      version_id: "1094291"
      model_id: "573152"
      description: "LUSTIFY v7.0 GGWP - Photoreal NSFW checkpoint"
      size_gb: 6.31
      essential: true
      type: checkpoints
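
For reference, each entry's version_id maps to a CivitAI download URL; the bundled script adds retries, rate limiting, and symlinking on top. A bare-bones sketch, assuming the standard https://civitai.com/api/download/models/<version_id> endpoint and CIVITAI_API_KEY in the environment (the output filename is illustrative):

#!/usr/bin/env python3
"""Minimal CivitAI checkpoint download (sketch; use artifact_civitai_download.sh in practice)."""
import os
import urllib.request

version_id = "1094291"                       # lustify-v7-ggwp, from models_civitai.yaml
token = os.environ["CIVITAI_API_KEY"]        # documented in .env.example
url = f"https://civitai.com/api/download/models/{version_id}?token={token}"

dest = "/workspace/models/civitai/lustify-v7-ggwp.safetensors"   # illustrative filename
os.makedirs(os.path.dirname(dest), exist_ok=True)
urllib.request.urlretrieve(url, dest)        # no retry/resume here - the bundled script handles that
print(f"Saved {dest}")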

Installation Profiles:

  • essential: 3 models (LUSTIFY, Pony, RealVisXL) - 19.61GB, 30-60 min
  • complete: All 6 models - 39.21GB, 1-2 hours
  • realistic_only: LUSTIFY + RealVisXL - 13.11GB, 20-40 min
  • anime_only: Pony + WAI-NSFW - 13.0GB, 20-40 min

Important Notes:

  • All models are SDXL-based (require ~12GB VRAM)
  • Cannot be loaded alongside FLUX models (FLUX Schnell FP16 alone needs ~23GB of the 24GB)
  • Can run with smaller models like standard SDXL (~12GB total)
  • Rate limiting: 5 second delay between downloads (configurable)

Key Implementation Details

Direct vLLM Server Architecture

Each vLLM model runs as a dedicated FastAPI server:

  • Independent process management via Supervisor
  • Direct AsyncLLMEngine access for optimal performance
  • OpenAI-compatible /v1/chat/completions endpoint
  • Streaming support with proper delta formatting

See vllm/server_qwen.py for reference implementation.
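
For orientation, the pattern looks roughly like the sketch below. This is a simplified illustration of the approach, not the actual server_qwen.py: the model id is illustrative, the chat template is skipped, and error handling and non-streaming responses are omitted.

#!/usr/bin/env python3
"""Simplified sketch of a dedicated OpenAI-compatible vLLM server (see vllm/server_qwen.py)."""
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=20000)  # sized for 24GB VRAM
)

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 256
    stream: bool = True          # this sketch always streams

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # The real server applies the model's chat template; this just concatenates messages
    prompt = "\n".join(m["content"] for m in req.messages)
    params = SamplingParams(max_tokens=req.max_tokens)
    request_id = str(uuid.uuid4())

    async def sse():
        sent = 0
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)    # emit only newly generated text as a delta
            chunk = {"model": req.model, "choices": [{"delta": {"content": delta}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)     # one dedicated port per model

Each model gets its own script and port so Supervisor can start and stop them independently.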

Environment Variables

Set in .env file:

  • HF_TOKEN - Required - HuggingFace API token for model downloads
  • GPU_TAILSCALE_IP - Optional - Tailscale IP for VPN access

Model Storage

Models are cached in:

  • /workspace/huggingface_cache - HuggingFace models
  • /workspace/models - Other model files
  • /workspace/ComfyUI/models - ComfyUI model directory structure

Integration with LiteLLM

For unified API management through LiteLLM proxy:

LiteLLM configuration (litellm-config.yaml on VPS):

model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b  # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600

Critical: Use hosted_vllm/openai/ prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.

ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

Directory structure created:

/workspace/ComfyUI/
├── models/
│   ├── checkpoints/        # FLUX, SDXL, SD3 models
│   ├── clip_vision/        # CLIP vision models
│   ├── video_models/       # CogVideoX, SVD
│   └── audio_models/       # MusicGen
└── custom_nodes/           # Extension nodes

Essential custom nodes installed:

  • ComfyUI-Manager - Model/node management GUI
  • ComfyUI-VideoHelperSuite - Video operations
  • ComfyUI-AnimateDiff-Evolved - Video generation
  • ComfyUI_IPAdapter_plus - Style transfer
  • ComfyUI-Impact-Pack - Auto face enhancement
  • comfyui-sound-lab - Audio generation

VRAM requirements for 24GB GPU:

  • FLUX Schnell FP16: 23GB (leaves 1GB)
  • SDXL Base: 12GB
  • CogVideoX-5B: 12GB (with optimizations)
  • MusicGen Medium: 8GB
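
Before loading a model, it can help to check how much VRAM is actually free against the numbers above (a sketch using the NVML bindings, assuming `pip install nvidia-ml-py`; nvidia-smi reports the same figures):

#!/usr/bin/env python3
"""Print used/free VRAM on the first GPU to sanity-check model choices (sketch)."""
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # single RTX 4090 assumed
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
gib = 1024 ** 3
print(f"VRAM: {mem.used / gib:.1f} GiB used, {mem.free / gib:.1f} GiB free, "
      f"{mem.total / gib:.1f} GiB total")
pynvml.nvmlShutdown()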

See COMFYUI_MODELS.md for detailed model catalog and usage examples.

Deployment Workflow

RunPod Deployment (Current Setup)

  1. Clone repository:

    cd /workspace
    git clone <repo-url> ai
    cd ai
    
  2. Configure environment:

    cp .env.example .env
    # Edit .env, set HF_TOKEN
    
  3. Run Ansible provisioning:

    ansible-playbook playbook.yml
    # Or selective: --tags base,python,comfyui-essential
    
  4. Start services:

    bash scripts/start-all.sh
    
  5. Verify:

    curl http://localhost:9000/health
    

Tailscale VPN Integration

To connect RunPod GPU to VPS infrastructure:

# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4  # Get IP for LiteLLM config

Benefits: Secure tunnel, no public exposure, low latency.

Project Structure

runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (USE THIS)
│   ├── orchestrator.py             # Docker-based version (legacy)
│   ├── models.yaml                 # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── models_huggingface.yaml         # HuggingFace model configuration
├── models_civitai.yaml             # CivitAI NSFW model configuration
├── artifact_civitai_download.sh    # CivitAI model downloader
└── README.md                       # User documentation

Troubleshooting

Model fails to start

  • Check VRAM: nvidia-smi
  • Verify model weights downloaded: ls -lh /workspace/huggingface_cache
  • Check port conflicts: lsof -i :9000
  • Test model directly: python3 models/vllm/server.py

Streaming returns empty deltas

  • Use correct LiteLLM model prefix: hosted_vllm/openai/model-name
  • Set stream: true in LiteLLM config
  • Verify orchestrator proxies streaming correctly

HuggingFace download errors

  • Check token: echo $HF_TOKEN
  • Set in .env: HF_TOKEN=your_token_here
  • Re-run Ansible: ansible-playbook playbook.yml --tags dependencies

Out of storage space

  • Check disk usage: df -h /workspace
  • Use essential tags: --tags comfyui-essential (~80GB vs ~137GB)
  • Clear cache: rm -rf /workspace/huggingface_cache

Supervisor not running

  • Check status: bash scripts/status.sh
  • View logs: cat /workspace/logs/supervisord.log
  • Start supervisor: bash scripts/start-all.sh
  • Check for stale PID: rm -f /workspace/supervisord.pid then restart

Service won't start

  • Check supervisor status: supervisorctl status
  • View service logs: supervisorctl tail -f comfyui or supervisorctl tail -f orchestrator
  • Check error logs: cat /workspace/logs/comfyui.err.log
  • Restart service: supervisorctl restart comfyui
  • Check if port is in use: ss -tulpn | grep :8188

Orchestrator not responding

  • Check supervisor status: supervisorctl status orchestrator
  • View logs: supervisorctl tail -f orchestrator or cat /workspace/logs/orchestrator.err.log
  • Restart: supervisorctl restart orchestrator
  • Manual start for debugging: cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py

Performance Notes

  • Model switching time: 30-120 seconds (depends on model size)
  • Text generation: ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
  • Image generation: 4-5 seconds per image (FLUX Schnell)
  • Music generation: 60-90 seconds for 30s audio (MusicGen Medium)

Important Conventions

  • Always use orchestrator_subprocess.py - Not the Docker version
  • Sequential loading only - One model active at a time for 24GB VRAM
  • Models downloaded by Ansible - Use playbook tags, not manual downloads
  • Services run as processes - Not systemd (RunPod containers don't support it)
  • Environment managed via .env - Required: HF_TOKEN
  • Port 9000 for orchestrator - Model services use 8000+