CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.
Key Design Philosophy:
- Direct vLLM access - Dedicated FastAPI servers for each text model (no proxy layer)
- Process-based architecture - Managed by Supervisor for reliability
- Simple service management - Start/stop models independently as needed
- OpenAI-compatible APIs - Works seamlessly with LiteLLM proxy and other AI tools
Architecture
Core Components
- vLLM Servers (`vllm/server_qwen.py`, `vllm/server_llama.py`)
  - Dedicated FastAPI servers for each text model
  - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
  - Direct AsyncLLMEngine access for optimal streaming performance
  - max_model_len=20000 optimized for 24GB VRAM
- ComfyUI (`comfyui/`)
  - Image/video/audio generation server on port 8188
  - FLUX, SDXL, CogVideoX, MusicGen models
  - Advanced workflow system with custom nodes
- Supervisor (`supervisord.conf`)
  - Process manager for all AI services
  - Auto-restart, centralized logging
  - Web interface on port 9001
- Ansible Provisioning (`playbook.yml`)
  - Complete infrastructure-as-code setup
  - Installs dependencies, downloads models, configures services
  - Supports selective installation via tags
Common Commands
Repository Management with Arty
This project uses Arty for repository and deployment management. See arty.yml for full configuration.
# Clone all repositories (fresh deployment)
arty sync --env prod # Production: Essential nodes only
arty sync --env dev # Development: All nodes including optional
arty sync --env minimal # Minimal: Just vLLM + ComfyUI base
# Run deployment scripts
arty run setup/full # Show setup instructions
arty run models/link-comfyui # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes # Install custom node dependencies
arty run services/start # Start supervisor
arty run services/stop # Stop all services
# Health checks
arty run health/comfyui # Check ComfyUI
arty run check/gpu # nvidia-smi
arty run check/models # Show cache size
Initial Setup
# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod
# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token
# 3. Full deployment with Ansible
ansible-playbook playbook.yml
# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential
# 5. Link models to ComfyUI
arty run models/link-comfyui
# 6. Install custom node dependencies
arty run deps/comfyui-nodes
# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies
Service Management
This project uses Supervisor for process management, providing auto-restart, centralized logging, and easy service control.
# Start all services (with Supervisor)
bash scripts/start-all.sh # Starts supervisor daemon + services
arty run services/start # Same via arty
# Stop all services
bash scripts/stop-all.sh # Stops all services + supervisor
arty run services/stop # Same via arty
# Check service status
bash scripts/status.sh # Show all service status
arty run services/status # Same via arty
supervisorctl status # Direct supervisor command
# Individual service control
supervisorctl start vllm-qwen # Start Qwen vLLM server
supervisorctl start vllm-llama # Start Llama vLLM server
supervisorctl restart comfyui # Restart ComfyUI
supervisorctl stop vllm-qwen # Stop Qwen vLLM server
arty run services/restart-comfyui # Restart ComfyUI via arty
# View logs
supervisorctl tail -f comfyui # Follow ComfyUI logs
supervisorctl tail -f vllm-qwen # Follow Qwen vLLM logs
supervisorctl tail -f vllm-llama # Follow Llama vLLM logs
arty run services/logs # Follow ComfyUI logs via arty
# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
Supervisor Configuration:
- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`
Services managed:
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
- `webdav-sync` - WebDAV sync service for ComfyUI outputs
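For scripted checks, Supervisor also serves its XML-RPC API on the web interface port. A minimal sketch, assuming the credentials and port configured above:

```python
# Query supervisord's XML-RPC API (served at /RPC2 on the web interface port).
# Credentials and port are the ones from supervisord.conf above (admin / runpod2024, port 9001).
from xmlrpc.client import ServerProxy

proxy = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

# getAllProcessInfo() returns one dict per managed program (comfyui, vllm-qwen, ...).
for proc in proxy.supervisor.getAllProcessInfo():
    print(f"{proc['name']:<12} {proc['statename']:<10} {proc['description']}")
```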
GPU Memory Management and Mode Switching
VRAM Constraints (RTX 4090 - 24GB total):
The GPU has limited memory, which requires manual service switching:
| Service | Model | VRAM Usage | Compatible With |
|---|---|---|---|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |
Mode Switching Workflow:
Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:
Switch to Text Generation Mode (vLLM):
# 1. Stop ComfyUI
supervisorctl stop comfyui
# 2. Start desired vLLM server
supervisorctl start vllm-qwen # or vllm-llama
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
Switch to Image/Video/Audio Generation Mode (ComfyUI):
# 1. Stop vLLM servers
supervisorctl stop vllm-qwen vllm-llama
# 2. Start ComfyUI
supervisorctl start comfyui
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
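The two switch sequences above can also be wrapped in a small script. A hypothetical helper (not part of the repository) that shells out to `supervisorctl` and `nvidia-smi`:

```python
#!/usr/bin/env python3
"""Hypothetical mode-switch helper: wraps the supervisorctl sequences above."""
import subprocess
import sys

VLLM_SERVICES = ["vllm-qwen", "vllm-llama"]

def ctl(*args: str) -> None:
    # Thin wrapper around the supervisorctl CLI.
    subprocess.run(["supervisorctl", *args], check=True)

def switch(mode: str, vllm_service: str = "vllm-qwen") -> None:
    if mode == "text":                      # vLLM text generation mode
        ctl("stop", "comfyui")
        ctl("start", vllm_service)          # vllm-qwen or vllm-llama
    elif mode == "image":                   # ComfyUI image/video/audio mode
        ctl("stop", *VLLM_SERVICES)
        ctl("start", "comfyui")
    else:
        sys.exit(f"unknown mode: {mode!r} (expected 'text' or 'image')")
    ctl("status")                           # verify service states
    subprocess.run(["nvidia-smi"])          # verify VRAM usage

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: switch_mode.py text|image [vllm-service]")
    switch(*sys.argv[1:])
```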
Access via Supervisor Web UI:
You can also switch modes using the Supervisor web interface:
- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service
Integration with LiteLLM:
The vLLM servers integrate with LiteLLM on the VPS for unified API access:
- vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
- Requests route directly to vLLM servers (ports 8000, 8001)
- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure connections
- LiteLLM config uses `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs
Testing
# Health check (Qwen vLLM server)
curl http://localhost:8000/health
# List available models
curl http://localhost:8000/v1/models
# Test text generation (streaming)
curl -s -N -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Count to 5"}],
"max_tokens": 50,
"stream": true
}'
# Check ComfyUI is responding (returns system and VRAM stats)
curl http://localhost:8188/system_stats
# Image generation goes through POST /prompt with an API-format workflow JSON
# (see the Python sketch below)
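As referenced above, here is a sketch of driving image generation programmatically. It assumes a workflow exported from the ComfyUI web UI via "Save (API Format)" into a file named `workflow_api.json` (the filename is illustrative):

```python
# Queue an exported workflow against ComfyUI's /prompt endpoint, then poll /history.
# Assumes workflow_api.json was exported from the web UI via "Save (API Format)".
import json
import time
import urllib.request

COMFYUI_URL = "http://localhost:8188"

with open("workflow_api.json") as f:
    workflow = json.load(f)

req = urllib.request.Request(
    f"{COMFYUI_URL}/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
prompt_id = json.load(urllib.request.urlopen(req))["prompt_id"]
print("queued:", prompt_id)

# The prompt appears in /history once execution has finished.
while True:
    history = json.load(urllib.request.urlopen(f"{COMFYUI_URL}/history/{prompt_id}"))
    if prompt_id in history:
        print("outputs:", history[prompt_id]["outputs"])
        break
    time.sleep(2)
```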
Ansible Tags Reference
System Setup:
- `base` - Base system packages
- `python` - Python environment setup
- `dependencies` - Install Python packages
Model Installation:
- `models` - Download vLLM/Flux/MusicGen models (legacy)
- `comfyui` - Install ComfyUI base
- `comfyui-essential` - Quick setup (ComfyUI + essential models only, ~80GB)
- `comfyui-models-image` - Image generation models (FLUX, SDXL, SD3.5)
- `comfyui-models-video` - Video generation models (CogVideoX, SVD)
- `comfyui-models-audio` - Audio generation models (MusicGen variants)
- `comfyui-models-support` - CLIP, IP-Adapter, ControlNet models
- `comfyui-models-all` - All ComfyUI models (~137GB)
- `comfyui-nodes` - Install essential custom nodes
Infrastructure:
- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (use `never` - not for RunPod)
- `validate` - Health checks (use `never` - run explicitly)
Adding New vLLM Models
- Create a dedicated server script (e.g., `vllm/server_mistral.py`):
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs
# ... (copy structure from server_qwen.py)
model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
port = 8002 # Choose unique port
- Add supervisor configuration (`supervisord.conf`):
[program:vllm-mistral]
command=vllm/venv/bin/python vllm/server_mistral.py
directory=.
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
- Download the model (handled by the Ansible playbook or manually via the HuggingFace CLI)
- Add to the LiteLLM config (on the VPS):
  - model_name: mistral-7b
    litellm_params:
      model: hosted_vllm/openai/mistral-7b
      api_base: os.environ/GPU_VLLM_MISTRAL_URL
      api_key: "EMPTY"
- Start the service:
supervisorctl start vllm-mistral
Downloading NSFW Models from CivitAI
The project includes dedicated tooling for downloading NSFW models from CivitAI, separate from the HuggingFace downloader.
Files:
- Configuration: `models_civitai.yaml` - Defines available CivitAI models
- Downloader: `artifact_civitai_download.sh` - Beautiful CLI tool for downloading models
Requirements:
- CivitAI API key (set in the `.env` file as `CIVITAI_API_KEY`)
- Get your key from: https://civitai.com/user/account
Available Models (Essential):
- LUSTIFY v7.0 GGWP (6.31GB) - Photoreal NSFW checkpoint for women
- Pony Diffusion V6 XL (6.5GB) - Anime/furry NSFW with balanced content
- RealVisXL V5.0 (6.8GB) - Photorealistic NSFW model
Usage:
# Download all essential NSFW models (19.61GB)
artifact_civitai_download.sh
# Download only (no symlinks)
artifact_civitai_download.sh download
# Create symlinks only (models already downloaded)
artifact_civitai_download.sh link
# Custom configuration file
artifact_civitai_download.sh -c /path/to/models_civitai.yaml
# Help
artifact_civitai_download.sh --help
Directory Structure:
- Models downloaded to: `/workspace/models/civitai/` (cache)
- Symlinks created in: `/workspace/ComfyUI/models/checkpoints/`
Features:
- Beautiful purple/magenta themed CLI (vs cyan for HuggingFace)
- Automatic retry with exponential backoff for rate limiting
- Resume interrupted downloads
- Progress bars and status indicators
- Parallel download support (respects rate limits)
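If you need to fetch a single file outside the script, the same retry/resume idea fits in a few lines. A rough sketch against the public CivitAI download endpoint (the version ID is the LUSTIFY entry from the configuration example below; destination path and retry timing are illustrative):

```python
# Rough sketch: download one CivitAI model version with retry, backoff, and resume.
# Version ID is the lustify-v7-ggwp entry from models_civitai.yaml; paths/timing are illustrative.
import os
import time
import requests

API_KEY = os.environ["CIVITAI_API_KEY"]        # from .env
VERSION_ID = "1094291"                         # lustify-v7-ggwp
DEST = "/workspace/models/civitai/lustify-v7-ggwp.safetensors"

url = f"https://civitai.com/api/download/models/{VERSION_ID}"

for attempt in range(5):
    resume_from = os.path.getsize(DEST) if os.path.exists(DEST) else 0
    headers = {"Authorization": f"Bearer {API_KEY}"}
    if resume_from:
        headers["Range"] = f"bytes={resume_from}-"   # resume a partial download
    try:
        with requests.get(url, headers=headers, stream=True, timeout=60) as r:
            r.raise_for_status()
            # A real tool would confirm a 206 response before appending.
            with open(DEST, "ab" if resume_from else "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        break
    except requests.RequestException as exc:
        wait = 5 * 2 ** attempt                      # exponential backoff for rate limits
        print(f"attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
        time.sleep(wait)
```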
Configuration Example (models_civitai.yaml):
model_categories:
  nsfw_checkpoints:
    - name: lustify-v7-ggwp
      version_id: "1094291"
      model_id: "573152"
      description: "LUSTIFY v7.0 GGWP - Photoreal NSFW checkpoint"
      size_gb: 6.31
      essential: true
      type: checkpoints
Installation Profiles:
- `essential`: 3 models (LUSTIFY, Pony, RealVisXL) - 19.61GB, 30-60 min
- `complete`: All 6 models - 39.21GB, 1-2 hours
- `realistic_only`: LUSTIFY + RealVisXL - 13.11GB, 20-40 min
- `anime_only`: Pony + WAI-NSFW - 13.0GB, 20-40 min
Important Notes:
- All models are SDXL-based (require ~12GB VRAM)
- Cannot run simultaneously with FLUX models (FLUX alone uses ~23GB of the 24GB)
- Can run with smaller models like standard SDXL (~12GB total)
- Rate limiting: 5 second delay between downloads (configurable)
Key Implementation Details
Direct vLLM Server Architecture
Each vLLM model runs as a dedicated FastAPI server:
- Independent process management via Supervisor
- Direct AsyncLLMEngine access for optimal performance
- OpenAI-compatible `/v1/chat/completions` endpoint
- Streaming support with proper delta formatting
See vllm/server_qwen.py for reference implementation.
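A condensed sketch of that pattern follows. It is not the repository's server_qwen.py: the HF repo ID is assumed, chat-template handling is simplified, and exact vLLM APIs vary by version.

```python
# Condensed sketch of a dedicated per-model vLLM server (see vllm/server_qwen.py for the real one).
# Assumptions: HF repo id, naive prompt construction; vLLM APIs differ between versions.
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

MODEL = "Qwen/Qwen2.5-7B-Instruct"   # assumed repo id behind "qwen-2.5-7b"

engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=MODEL, max_model_len=20000))
app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(body: dict):
    # The real server applies the model's chat template; this just joins message contents.
    prompt = "\n".join(m["content"] for m in body["messages"])
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))
    request_id = str(uuid.uuid4())

    async def sse():
        sent = 0
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)      # emit only the newly generated text
            chunk = {"choices": [{"index": 0, "delta": {"content": delta}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

@app.get("/health")
def health():
    return {"status": "ok"}
```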
Environment variables, set in the `.env` file:
- `HF_TOKEN` - Required - HuggingFace API token for model downloads
- `GPU_TAILSCALE_IP` - Optional - Tailscale IP for VPN access
Models are cached in:
- `/workspace/huggingface_cache` - HuggingFace models
- `/workspace/models` - Other model files
- `/workspace/ComfyUI/models` - ComfyUI model directory structure
Integration with LiteLLM
For unified API management through LiteLLM proxy:
LiteLLM configuration (litellm-config.yaml on VPS):
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b   # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1 # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600
Critical: Use hosted_vllm/openai/ prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.
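A quick end-to-end check with the official openai Python client pointed at the LiteLLM proxy (base URL and API key are placeholders for your deployment):

```python
# Verify streamed deltas arrive non-empty through the LiteLLM proxy.
# Base URL and API key are placeholders for your LiteLLM deployment.
from openai import OpenAI

client = OpenAI(base_url="https://litellm.example.com/v1", api_key="sk-your-litellm-key")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)   # consistently empty deltas usually mean a wrong model prefix
print()
```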
ComfyUI Installation
ComfyUI provides advanced image/video/audio generation capabilities:
Directory structure created:
/workspace/ComfyUI/
├── models/
│   ├── checkpoints/     # FLUX, SDXL, SD3 models
│   ├── clip_vision/     # CLIP vision models
│   ├── video_models/    # CogVideoX, SVD
│   └── audio_models/    # MusicGen
└── custom_nodes/        # Extension nodes
Essential custom nodes installed:
- ComfyUI-Manager - Model/node management GUI
- ComfyUI-VideoHelperSuite - Video operations
- ComfyUI-AnimateDiff-Evolved - Video generation
- ComfyUI_IPAdapter_plus - Style transfer
- ComfyUI-Impact-Pack - Auto face enhancement
- comfyui-sound-lab - Audio generation
VRAM requirements for 24GB GPU:
- FLUX Schnell FP16: 23GB (leaves 1GB)
- SDXL Base: 12GB
- CogVideoX-5B: 12GB (with optimizations)
- MusicGen Medium: 8GB
See COMFYUI_MODELS.md for detailed model catalog and usage examples.
Deployment Workflow
RunPod Deployment (Current Setup)
- Clone repository:
  cd /workspace
  git clone <repo-url> ai
  cd ai
- Configure environment:
  cp .env.example .env
  # Edit .env, set HF_TOKEN
- Run Ansible provisioning:
  ansible-playbook playbook.yml
  # Or selective: --tags base,python,comfyui-essential
- Start services:
  bash scripts/start-all.sh
- Verify:
  curl http://localhost:9000/health
Tailscale VPN Integration
To connect RunPod GPU to VPS infrastructure:
# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4 # Get IP for LiteLLM config
Benefits: Secure tunnel, no public exposure, low latency.
Project Structure
runpod/
├── model-orchestrator/
│ ├── orchestrator_subprocess.py # Main orchestrator (USE THIS)
│ ├── orchestrator.py # Docker-based version (legacy)
│ ├── models.yaml # Model registry
│ └── requirements.txt
├── models/
│ ├── vllm/
│ │ ├── server.py # vLLM text generation service
│ │ └── requirements.txt
│ └── comfyui/
│ ├── start.sh # ComfyUI startup script
│ └── requirements.txt
├── scripts/
│ ├── start-all.sh # Start all services with Supervisor
│ ├── stop-all.sh # Stop all services
│ └── status.sh # Check service status
├── supervisord.conf # Supervisor process manager config
├── arty.yml # Arty repository manager config
├── playbook.yml # Ansible provisioning playbook
├── inventory.yml # Ansible inventory (localhost)
├── ansible.cfg # Ansible configuration
├── .env.example # Environment variables template
├── CLAUDE.md # This file
├── COMFYUI_MODELS.md # ComfyUI models catalog
├── MODELS_LINKED.md # Model linkage documentation
├── models_huggingface.yaml # HuggingFace model configuration
├── models_civitai.yaml # CivitAI model configuration
├── artifact_civitai_download.sh # CivitAI model downloader
└── README.md # User documentation
Troubleshooting
Model fails to start
- Check VRAM: `nvidia-smi`
- Verify model weights downloaded: `ls -lh /workspace/huggingface_cache`
- Check port conflicts: `lsof -i :9000`
- Test model directly: `python3 models/vllm/server.py`
Streaming returns empty deltas
- Use correct LiteLLM model prefix: `hosted_vllm/openai/model-name`
- Set `stream: true` in LiteLLM config
- Verify the orchestrator proxies streaming correctly
HuggingFace download errors
- Check token: `echo $HF_TOKEN`
- Set in `.env`: `HF_TOKEN=your_token_here`
- Re-run Ansible: `ansible-playbook playbook.yml --tags dependencies`
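If the Ansible task keeps failing, a model can also be fetched by hand. A minimal sketch using `huggingface_hub` (the repo ID is assumed to be the instruct variant behind qwen-2.5-7b; adjust as needed):

```python
# Minimal manual download into the project's HF cache (bypasses Ansible).
# Repo ID is an assumption; swap in whichever model failed to download.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/huggingface_cache",
    token=os.environ["HF_TOKEN"],
)
```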
Out of storage space
- Check disk usage: `df -h /workspace`
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear cache: `rm -rf /workspace/huggingface_cache`
Supervisor not running
- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Check for stale PID: `rm -f /workspace/supervisord.pid`, then restart
Service won't start
- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f orchestrator`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart service: `supervisorctl restart comfyui`
- Check if port is in use: `ss -tulpn | grep :8188`
Orchestrator not responding
- Check supervisor status: `supervisorctl status orchestrator`
- View logs: `supervisorctl tail -f orchestrator` or `cat /workspace/logs/orchestrator.err.log`
- Restart: `supervisorctl restart orchestrator`
- Manual start for debugging: `cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py`
Performance Notes
- Model switching time: 30-120 seconds (depends on model size)
- Text generation: ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
- Image generation: 4-5 seconds per image (FLUX Schnell)
- Music generation: 60-90 seconds for 30s audio (MusicGen Medium)
Important Conventions
- Always use `orchestrator_subprocess.py` - Not the Docker version
- Sequential loading only - One model active at a time for 24GB VRAM
- Models downloaded by Ansible - Use playbook tags, not manual downloads
- Services run as processes - Not systemd (RunPod containers don't support it)
- Environment managed via .env - Required: HF_TOKEN
- Port 9000 for orchestrator - Model services use 8000+