# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.

**Key Design Philosophy:**
- **Direct vLLM access** - Dedicated FastAPI servers for each text model (no proxy layer)
- **Process-based architecture** - Managed by Supervisor for reliability
- **Simple service management** - Start/stop models independently as needed
- **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools

## Architecture

### Core Components

1. **vLLM Servers** (`vllm/server_qwen.py`, `vllm/server_llama.py`)
   - Dedicated FastAPI servers for each text model
   - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
   - Direct AsyncLLMEngine access for optimal streaming performance
   - `max_model_len=20000`, chosen to fit within 24GB VRAM

2. **ComfyUI** (`comfyui/`)
   - Image/video/audio generation server on port 8188
   - FLUX, SDXL, CogVideoX, MusicGen models
   - Advanced workflow system with custom nodes

3. **Supervisor** (`supervisord.conf`)
   - Process manager for all AI services
   - Auto-restart, centralized logging
   - Web interface on port 9001

4. **Ansible Provisioning** (`playbook.yml`)
   - Complete infrastructure-as-code setup
   - Installs dependencies, downloads models, configures services
   - Supports selective installation via tags

## Common Commands

### Repository Management with Arty

This project uses Arty for repository and deployment management. See `arty.yml` for full configuration.

```bash
# Clone all repositories (fresh deployment)
arty sync --env prod          # Production: Essential nodes only
arty sync --env dev           # Development: All nodes including optional
arty sync --env minimal       # Minimal: Just vLLM + ComfyUI base

# Run deployment scripts
arty run setup/full           # Show setup instructions
arty run models/link-comfyui  # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes   # Install custom node dependencies
arty run services/start       # Start supervisor
arty run services/stop        # Stop all services

# Health checks
arty run health/comfyui      # Check ComfyUI
arty run check/gpu           # nvidia-smi
arty run check/models        # Show cache size
```

### Initial Setup

```bash
# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Alternative to step 3: essential ComfyUI setup only (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Alternative: selective installation (base system + Python environment + Python packages)
ansible-playbook playbook.yml --tags base,python,dependencies
```

### Service Management

This project uses **Supervisor** for process management, providing auto-restart, centralized logging, and easy service control.

```bash
# Start all services (with Supervisor)
bash scripts/start-all.sh           # Starts supervisor daemon + services
arty run services/start             # Same via arty

# Stop all services
bash scripts/stop-all.sh            # Stops all services + supervisor
arty run services/stop              # Same via arty

# Check service status
bash scripts/status.sh              # Show all service status
arty run services/status            # Same via arty
supervisorctl status                # Direct supervisor command

# Individual service control
supervisorctl start vllm-qwen       # Start Qwen vLLM server
supervisorctl start vllm-llama      # Start Llama vLLM server
supervisorctl restart comfyui       # Restart ComfyUI
supervisorctl stop vllm-qwen        # Stop Qwen vLLM server
arty run services/restart-comfyui   # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui       # Follow ComfyUI logs
supervisorctl tail -f vllm-qwen     # Follow Qwen vLLM logs
supervisorctl tail -f vllm-llama    # Follow Llama vLLM logs
arty run services/logs              # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
```

**Supervisor Configuration:**
- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`

**Services managed:**
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
- `webdav-sync` - WebDAV sync service for ComfyUI outputs

### GPU Memory Management and Mode Switching

**VRAM Constraints (RTX 4090 - 24GB total):**

The GPU has limited memory, which requires manual service switching:

| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |
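
The compatibility column follows from simple addition against the 24GB budget. A minimal sketch of that check, using the approximate figures from the table (the workload labels are hypothetical names chosen for this example, not identifiers from the repository):

```python
# Approximate VRAM figures (GB) copied from the table above.
VRAM_TOTAL_GB = 24
VRAM_USAGE_GB = {
    "comfyui-flux-schnell": 23,  # hypothetical label
    "comfyui-sdxl": 12,          # hypothetical label
    "vllm-qwen": 14,
    "vllm-llama": 17,
}


def fits_together(*workloads: str, total_gb: int = VRAM_TOTAL_GB) -> bool:
    """Return True if the combined VRAM estimate fits on the GPU."""
    return sum(VRAM_USAGE_GB[name] for name in workloads) <= total_gb


print(fits_together("vllm-qwen"))                  # True  (14 GB)
print(fits_together("comfyui-sdxl", "vllm-qwen"))  # False (26 GB)
print(fits_together("vllm-qwen", "vllm-llama"))    # False (31 GB)
```

The numbers are estimates, so treat a result close to the limit as "does not fit".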

**Mode Switching Workflow:**

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

**Switch to Text Generation Mode (vLLM):**
```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui

# 2. Start desired vLLM server
supervisorctl start vllm-qwen    # or vllm-llama

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage
```

**Switch to Image/Video/Audio Generation Mode (ComfyUI):**
```bash
# 1. Stop vLLM servers
supervisorctl stop vllm-qwen vllm-llama

# 2. Start ComfyUI
supervisorctl start comfyui

# 3. Verify
supervisorctl status
nvidia-smi  # Check VRAM usage
```
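
The two switching procedures above could be wrapped in a small helper. This is a sketch only: the mode names are hypothetical, and stopping the other vLLM server when switching to a text mode is an assumption based on the VRAM table (Qwen plus Llama would exceed 24GB).

```python
import subprocess

# Supervisor program names from the "Services managed" list; mode names are hypothetical.
MODES = {
    "text-qwen": {"stop": ["comfyui", "vllm-llama"], "start": ["vllm-qwen"]},
    "text-llama": {"stop": ["comfyui", "vllm-qwen"], "start": ["vllm-llama"]},
    "media": {"stop": ["vllm-qwen", "vllm-llama"], "start": ["comfyui"]},
}


def plan_switch(mode: str) -> list[list[str]]:
    """Build the supervisorctl commands for a mode: stop first, then start."""
    spec = MODES[mode]
    return [
        ["supervisorctl", "stop", *spec["stop"]],
        ["supervisorctl", "start", *spec["start"]],
        ["supervisorctl", "status"],
    ]


def switch(mode: str, dry_run: bool = True) -> None:
    """Print the plan; execute it only when dry_run is False."""
    for cmd in plan_switch(mode):
        print(" ".join(cmd))
        if not dry_run:
            # No check=True: stopping an already-stopped program or a mixed
            # status report may exit non-zero without being a real failure.
            subprocess.run(cmd, check=False)


switch("text-qwen")  # dry run: prints the three commands without executing them
```

Call `switch("media", dry_run=False)` on the GPU instance to actually run the commands, then confirm VRAM usage with `nvidia-smi`.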

**Access via Supervisor Web UI:**

You can also switch modes using the Supervisor web interface:
- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service

**Integration with LiteLLM:**

The vLLM servers integrate with LiteLLM on the VPS for unified API access:
- vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
- Requests route directly to vLLM servers (ports 8000, 8001)
- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure connections
- LiteLLM config uses `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs

### Testing

```bash
# Health check (Qwen vLLM server)
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

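# Test ComfyUI (server is up and reports device/VRAM stats)
curl http://localhost:8188/system_stats

# Queue an image generation workflow. /prompt expects {"prompt": <workflow graph>};
# workflow_request.json is a placeholder for a workflow exported in API format and wrapped that way.
curl -X POST http://localhost:8188/prompt \
  -H 'Content-Type: application/json' \
  -d @workflow_request.json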
```
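
The streaming test can also be run from Python. The sketch below uses only the standard library and assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`); the function names are illustrative.

```python
import json
import urllib.request
from typing import Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Extract the text delta from one 'data: ...' line, or None if there is none."""
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, etc.
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Stream a chat completion and return the concatenated text."""
    body = json.dumps({
        "model": "qwen-2.5-7b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    parts = []
    with urllib.request.urlopen(req, timeout=600) as resp:
        for raw in resp:  # the response body is read line by line
            delta = parse_sse_line(raw.decode("utf-8").strip())
            if delta:
                parts.append(delta)
    return "".join(parts)
```

With `vllm-qwen` running, `print(stream_chat("Count to 5"))` should print the generated text. If every parsed delta is empty, check the LiteLLM model prefix described under Troubleshooting.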

### Ansible Tags Reference

**System Setup:**
- `base` - Base system packages
- `python` - Python environment setup
- `dependencies` - Install Python packages

**Model Installation:**
- `models` - Download vLLM/Flux/MusicGen models (legacy)
- `comfyui` - Install ComfyUI base
- `comfyui-essential` - Quick setup (ComfyUI + essential models only, ~80GB)
- `comfyui-models-image` - Image generation models (FLUX, SDXL, SD3.5)
- `comfyui-models-video` - Video generation models (CogVideoX, SVD)
- `comfyui-models-audio` - Audio generation models (MusicGen variants)
- `comfyui-models-support` - CLIP, IP-Adapter, ControlNet models
- `comfyui-models-all` - All ComfyUI models (~137GB)
- `comfyui-nodes` - Install essential custom nodes

**Infrastructure:**
- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (tagged `never`, so skipped by default; not for RunPod)
- `validate` - Health checks (tagged `never`; run explicitly with `--tags validate`)

### Adding New vLLM Models

1. **Create dedicated server script** (e.g., `vllm/server_mistral.py`):

```python
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs
# ... (copy structure from server_qwen.py)

model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
port = 8002  # Choose unique port
```

2. **Add supervisor configuration** (`supervisord.conf`):

```ini
[program:vllm-mistral]
command=vllm/venv/bin/python vllm/server_mistral.py
directory=.
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
```

3. **Download model** (handled by Ansible playbook or manually via HuggingFace CLI)

4. **Add to LiteLLM config** (on VPS):

```yaml
- model_name: mistral-7b
  litellm_params:
    model: hosted_vllm/openai/mistral-7b
    api_base: os.environ/GPU_VLLM_MISTRAL_URL
    api_key: "EMPTY"
```

5. **Start the service:**

```bash
supervisorctl start vllm-mistral
```

### Downloading NSFW Models from CivitAI

The project includes dedicated tooling for downloading NSFW models from CivitAI, separate from the HuggingFace downloader.

**Files:**
- Configuration: `models_civitai.yaml` - Defines available CivitAI models
- Downloader: `artifact_civitai_download.sh` - Beautiful CLI tool for downloading models

**Requirements:**
- CivitAI API key (set in `.env` file as `CIVITAI_API_KEY`)
- Get your key from: https://civitai.com/user/account

**Available Models (Essential):**
1. **LUSTIFY v7.0 GGWP** (6.31GB) - Photoreal NSFW checkpoint for women
2. **Pony Diffusion V6 XL** (6.5GB) - Anime/furry NSFW with balanced content
3. **RealVisXL V5.0** (6.8GB) - Photorealistic NSFW model

**Usage:**

```bash
# Download all essential NSFW models (19.61GB)
artifact_civitai_download.sh

# Download only (no symlinks)
artifact_civitai_download.sh download

# Create symlinks only (models already downloaded)
artifact_civitai_download.sh link

# Custom configuration file
artifact_civitai_download.sh -c /path/to/models_civitai.yaml

# Help
artifact_civitai_download.sh --help
```

**Directory Structure:**
- Models downloaded to: `/workspace/models/civitai/` (cache)
- Symlinks created in: `/workspace/ComfyUI/models/checkpoints/`

**Features:**
- Purple/magenta themed CLI output (the HuggingFace downloader uses cyan)
- Automatic retry with exponential backoff for rate limiting
- Resume interrupted downloads
- Progress bars and status indicators
- Parallel download support (respects rate limits)
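
The retry behavior can be illustrated with a short Python sketch. This is not the downloader's implementation (that is a shell script); the base delay, cap, and function names are assumptions chosen for the example.

```python
import random
import time


def backoff_delays(retries: int, base: float = 5.0, cap: float = 120.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... limited to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def with_retry(fn, retries: int = 4, base: float = 5.0, cap: float = 120.0, sleep=time.sleep):
    """Call fn(); on OSError wait and retry, doubling the delay each time."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fn()
        except OSError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # A little random jitter avoids retrying in lockstep with other clients.
            sleep(delays[attempt] + random.uniform(0, 1))


print(backoff_delays(4))  # [5.0, 10.0, 20.0, 40.0]
```

`urllib.error.URLError` is a subclass of `OSError`, so a download function built on `urllib` can be passed to `with_retry` directly.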

**Configuration Example (`models_civitai.yaml`):**

```yaml
model_categories:
  nsfw_checkpoints:
    - name: lustify-v7-ggwp
      version_id: "1094291"
      model_id: "573152"
      description: "LUSTIFY v7.0 GGWP - Photoreal NSFW checkpoint"
      size_gb: 6.31
      essential: true
      type: checkpoints
```

**Installation Profiles:**
- `essential`: 3 models (LUSTIFY, Pony, RealVisXL) - 19.61GB, 30-60 min
- `complete`: All 6 models - 39.21GB, 1-2 hours
- `realistic_only`: LUSTIFY + RealVisXL - 13.11GB, 20-40 min
- `anime_only`: Pony + WAI-NSFW - 13.0GB, 20-40 min
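
The profile totals are the sums of the per-model sizes listed under "Available Models". A quick check for the two profiles whose members all have a listed size (the dictionary keys other than `lustify-v7-ggwp` are hypothetical names for this example):

```python
# Sizes (GB) from the "Available Models (Essential)" list.
SIZES_GB = {
    "lustify-v7-ggwp": 6.31,
    "pony-diffusion-v6-xl": 6.5,  # hypothetical key
    "realvisxl-v5": 6.8,          # hypothetical key
}

PROFILES = {
    "essential": ["lustify-v7-ggwp", "pony-diffusion-v6-xl", "realvisxl-v5"],
    "realistic_only": ["lustify-v7-ggwp", "realvisxl-v5"],
}


def profile_size_gb(profile: str) -> float:
    """Total download size of a profile, rounded to two decimals."""
    return round(sum(SIZES_GB[name] for name in PROFILES[profile]), 2)


print(profile_size_gb("essential"))       # 19.61
print(profile_size_gb("realistic_only"))  # 13.11
```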

**Important Notes:**
- All models are SDXL-based (require ~12GB VRAM)
- Cannot run alongside FLUX models (FLUX Schnell FP16 alone uses ~23GB)
- Can run alongside smaller models such as standard SDXL (~12GB)
- Rate limiting: 5 second delay between downloads (configurable)

## Key Implementation Details

### Direct vLLM Server Architecture

Each vLLM model runs as a dedicated FastAPI server:
- Independent process management via Supervisor
- Direct AsyncLLMEngine access for optimal performance
- OpenAI-compatible `/v1/chat/completions` endpoint
- Streaming support with proper delta formatting

See `vllm/server_qwen.py` for reference implementation.
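
To illustrate what "proper delta formatting" means, here is a standalone sketch that turns cumulative generation output into OpenAI-style streaming chunks. It assumes the engine yields the full text generated so far on each iteration, so each chunk must carry only the new suffix. The function name and request ID are hypothetical, and this is not the code in `server_qwen.py`.

```python
import json
import time
from typing import Iterable, Iterator


def sse_chunks(cumulative_texts: Iterable[str], model: str,
               request_id: str = "chatcmpl-demo") -> Iterator[str]:
    """Yield server-sent-event frames whose deltas contain only newly generated text."""
    created = int(time.time())

    def frame(delta: dict, finish_reason=None) -> str:
        chunk = {
            "id": request_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
        }
        return f"data: {json.dumps(chunk)}\n\n"

    yield frame({"role": "assistant"})  # first chunk announces the role
    sent = 0  # number of characters already streamed to the client
    for text in cumulative_texts:
        new_text = text[sent:]
        sent = len(text)
        if new_text:  # skip iterations that produced no new characters
            yield frame({"content": new_text})
    yield frame({}, finish_reason="stop")
    yield "data: [DONE]\n\n"


for line in sse_chunks(["1", "1, 2", "1, 2, 3"], model="qwen-2.5-7b"):
    print(line, end="")
```

In a FastAPI server, a generator like this would typically be returned through a `StreamingResponse` with media type `text/event-stream`.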

### Environment Variables

Set in the `.env` file:
- `HF_TOKEN` - **Required** - HuggingFace API token for model downloads
- `GPU_TAILSCALE_IP` - Optional - Tailscale IP for VPN access

### Model Storage

Models are cached in:
- `/workspace/huggingface_cache` - HuggingFace models
- `/workspace/models` - Other model files
- `/workspace/ComfyUI/models` - ComfyUI model directory structure

### Integration with LiteLLM

For unified API management through LiteLLM proxy:

**LiteLLM configuration (`litellm-config.yaml` on VPS):**
```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b  # Use hosted_vllm prefix!
      api_base: os.environ/GPU_VLLM_QWEN_URL  # Points at the Qwen vLLM server (port 8000) over Tailscale
      api_key: dummy
      stream: true
      timeout: 600
```

**Critical:** Use the `hosted_vllm/openai/` prefix for vLLM models so that streaming works correctly. The wrong prefix produces empty delta chunks.

### ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

**Directory structure created:**
```
/workspace/ComfyUI/
├── models/
│   ├── checkpoints/        # FLUX, SDXL, SD3 models
│   ├── clip_vision/        # CLIP vision models
│   ├── video_models/       # CogVideoX, SVD
│   └── audio_models/       # MusicGen
└── custom_nodes/           # Extension nodes
```

**Essential custom nodes installed:**
- ComfyUI-Manager - Model/node management GUI
- ComfyUI-VideoHelperSuite - Video operations
- ComfyUI-AnimateDiff-Evolved - Video generation
- ComfyUI_IPAdapter_plus - Style transfer
- ComfyUI-Impact-Pack - Auto face enhancement
- comfyui-sound-lab - Audio generation

**VRAM requirements for 24GB GPU:**
- FLUX Schnell FP16: 23GB (leaves 1GB)
- SDXL Base: 12GB
- CogVideoX-5B: 12GB (with optimizations)
- MusicGen Medium: 8GB

See `COMFYUI_MODELS.md` for detailed model catalog and usage examples.

## Deployment Workflow

### RunPod Deployment (Current Setup)

1. **Clone repository:**
   ```bash
   cd /workspace
   git clone <repo-url> ai
   cd ai
   ```

2. **Configure environment:**
   ```bash
   cp .env.example .env
   # Edit .env, set HF_TOKEN
   ```

3. **Run Ansible provisioning:**
   ```bash
   ansible-playbook playbook.yml
   # Or selective: --tags base,python,comfyui-essential
   ```

4. **Start services:**
   ```bash
   bash scripts/start-all.sh
   ```

5. **Verify:**
   ```bash
   supervisorctl status                      # comfyui should be RUNNING
   curl http://localhost:8188/system_stats   # ComfyUI responds on port 8188
   ```

### Tailscale VPN Integration

To connect RunPod GPU to VPS infrastructure:

```bash
# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4  # Get IP for LiteLLM config
```

Benefits: Secure tunnel, no public exposure, low latency.

## Project Structure

```
runpod/
├── vllm/
│   ├── server_qwen.py              # Qwen 2.5 7B vLLM server (port 8000)
│   └── server_llama.py             # Llama 3.1 8B vLLM server (port 8001)
├── comfyui/                        # ComfyUI server setup (port 8188)
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── comfyui_models.yaml             # ComfyUI model configuration
└── README.md                       # User documentation
```

## Troubleshooting

### Model fails to start
- Check VRAM: `nvidia-smi`
- Verify model weights downloaded: `ls -lh /workspace/huggingface_cache`
- Check port conflicts: `lsof -i :8000` (Qwen) or `lsof -i :8001` (Llama)
- Test model directly: `vllm/venv/bin/python vllm/server_qwen.py`

### Streaming returns empty deltas
- Use correct LiteLLM model prefix: `hosted_vllm/openai/model-name`
- Set `stream: true` in LiteLLM config
- Verify streaming directly against the vLLM server (see the streaming `curl` test under Testing)

### HuggingFace download errors
- Check token: `echo $HF_TOKEN`
- Set in .env: `HF_TOKEN=your_token_here`
- Re-run Ansible: `ansible-playbook playbook.yml --tags dependencies`

### Out of storage space
- Check disk usage: `df -h /workspace`
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear cache: `rm -rf /workspace/huggingface_cache` (deletes all cached HuggingFace models; they must be downloaded again)
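
Before starting a large download it can help to compare free space against the expected size. A minimal sketch using the standard library (the headroom value and function names are assumptions for this example):

```python
import shutil


def free_gb(path: str = "/workspace") -> float:
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def enough_space(required_gb: float, path: str = "/workspace",
                 headroom_gb: float = 10.0) -> bool:
    """True if the download plus some headroom fits on the volume."""
    return free_gb(path) >= required_gb + headroom_gb


# Example on the RunPod instance (sizes from the Ansible tags reference):
#   enough_space(80)    # comfyui-essential
#   enough_space(137)   # comfyui-models-all
```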

### Supervisor not running
- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Check for stale PID: `rm -f /workspace/supervisord.pid` then restart
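
Rather than deleting the PID file unconditionally, you can first check whether the recorded process still exists. A sketch, assuming the PID file path from the Supervisor configuration above (the function name is illustrative):

```python
import os
from pathlib import Path


def pid_is_stale(pid_file: str = "/workspace/supervisord.pid") -> bool:
    """True if the PID file exists but no process with that PID is running."""
    path = Path(pid_file)
    if not path.exists():
        return False  # nothing to clean up
    try:
        pid = int(path.read_text().strip())
    except ValueError:
        return True  # unreadable contents: treat as stale
    if pid <= 0:
        return True  # not a valid process ID
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is sent
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```

If `pid_is_stale()` returns `True`, removing the file and running `bash scripts/start-all.sh` is safe. If it returns `False` while the file exists, supervisord is probably still running.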

### Service won't start
- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f vllm-qwen`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart service: `supervisorctl restart comfyui`
- Check if port is in use: `ss -tulpn | grep :8188`
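
The port check can be run for every service at once. A small sketch using the ports documented above (the `supervisor-web` label is a hypothetical name for the Supervisor web interface):

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


SERVICE_PORTS = {
    "vllm-qwen": 8000,
    "vllm-llama": 8001,
    "comfyui": 8188,
    "supervisor-web": 9001,  # hypothetical label
}

for name, port in SERVICE_PORTS.items():
    state = "in use" if port_in_use(port) else "free"
    print(f"{name:15} port {port}: {state}")
```

A port that is "in use" while Supervisor reports the service as stopped usually points to a leftover process started outside Supervisor.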

### vLLM server not responding
- Check supervisor status: `supervisorctl status vllm-qwen`
- View logs: `supervisorctl tail -f vllm-qwen` or the matching error log in `/workspace/logs/`
- Restart: `supervisorctl restart vllm-qwen`
- Manual start for debugging: `cd /workspace/ai && vllm/venv/bin/python vllm/server_qwen.py`

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
- **Image generation:** 4-5 seconds per image (FLUX Schnell)
- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium)

## Important Conventions

- **Use the dedicated vLLM servers** - `vllm/server_qwen.py` and `vllm/server_llama.py`, with no proxy layer in front
- **Sequential loading only** - One model active at a time for 24GB VRAM
- **Models downloaded by Ansible** - Use playbook tags, not manual downloads
- **Services run as processes** - Not systemd (RunPod containers don't support it)
- **Environment managed via .env** - Required: HF_TOKEN
- **Fixed ports per service** - vLLM servers on 8000 and 8001, ComfyUI on 8188, Supervisor web interface on 9001