# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.
**Key Design Philosophy:**
- **Sequential model loading** - Only one model active at a time to fit within GPU memory constraints
- **Process-based architecture** - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
- **Automatic model switching** - Orchestrator detects request types and switches models on-demand
- **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools
## Architecture
### Core Components
1. **Orchestrator** (`model-orchestrator/orchestrator_subprocess.py`)
- FastAPI proxy server listening on port 9000
- Manages model lifecycle via Python subprocesses
- Routes requests to appropriate model services
- Handles sequential model loading/unloading
2. **Model Registry** (`model-orchestrator/models.yaml`)
- YAML configuration defining available models
- Specifies: type, framework, service script, port, VRAM requirements, startup time
- Easy to extend with new models
3. **Model Services** (`models/*/`)
- Individual Python servers running specific AI models
- vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
- ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
4. **Ansible Provisioning** (`playbook.yml`)
- Complete infrastructure-as-code setup
- Installs dependencies, downloads models, configures services
- Supports selective installation via tags
### Why Process-Based Instead of Docker?
The subprocess implementation (`orchestrator_subprocess.py`) is preferred over the Docker version (`orchestrator.py`) because:
- RunPod instances run in containers - Docker-in-Docker adds complexity
- Faster model startup (direct Python process spawning)
- Simpler debugging (single process tree)
- Reduced overhead (no container management layer)
**Note:** Always use `orchestrator_subprocess.py` for RunPod deployments.
## Common Commands
### Repository Management with Arty
This project uses Arty for repository and deployment management. See `arty.yml` for full configuration.
```bash
# Clone all repositories (fresh deployment)
arty sync --env prod # Production: Essential nodes only
arty sync --env dev # Development: All nodes including optional
arty sync --env minimal # Minimal: Just orchestrator + ComfyUI base
# Run deployment scripts
arty run setup/full # Show setup instructions
arty run models/link-comfyui # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes # Install custom node dependencies
arty run services/start # Start orchestrator
arty run services/stop # Stop all services
# Health checks
arty run health/orchestrator # Check orchestrator
arty run health/comfyui # Check ComfyUI
arty run check/gpu # nvidia-smi
arty run check/models # Show cache size
```
### Initial Setup
```bash
# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod
# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token
# 3. Full deployment with Ansible
ansible-playbook playbook.yml
# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential
# 5. Link models to ComfyUI
arty run models/link-comfyui
# 6. Install custom node dependencies
arty run deps/comfyui-nodes
# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies
```
### Service Management
This project uses **Supervisor** for process management, providing auto-restart, centralized logging, and easy service control.
```bash
# Start all services (with Supervisor)
bash scripts/start-all.sh # Starts supervisor daemon + services
arty run services/start # Same via arty
# Stop all services
bash scripts/stop-all.sh # Stops all services + supervisor
arty run services/stop # Same via arty
# Check service status
bash scripts/status.sh # Show all service status
arty run services/status # Same via arty
supervisorctl status # Direct supervisor command
# Individual service control
supervisorctl start orchestrator # Start orchestrator
supervisorctl restart comfyui # Restart ComfyUI
supervisorctl stop orchestrator # Stop orchestrator
arty run services/restart-comfyui # Restart ComfyUI via arty
# View logs
supervisorctl tail -f comfyui # Follow ComfyUI logs
supervisorctl tail -f orchestrator # Follow orchestrator logs
arty run services/logs # Follow ComfyUI logs via arty
# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
```
**Supervisor Configuration:**
- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`
**Services managed:**
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
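The web port also doubles as Supervisor's XML-RPC endpoint, which is handy for scripting status checks. A minimal sketch using only the Python standard library, assuming the credentials shown above (adjust if your `supervisord.conf` differs):
```python
from xmlrpc.client import ServerProxy

# Supervisor serves its XML-RPC API on the same port as the web UI (9001).
server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

for proc in server.supervisor.getAllProcessInfo():
    print(f"{proc['name']}: {proc['statename']}")

# Services can also be started/stopped by name, e.g.:
# server.supervisor.stopProcess("comfyui")
# server.supervisor.startProcess("orchestrator")
```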
### GPU Memory Management and Mode Switching
**VRAM Constraints (RTX 4090 - 24GB total):**
The GPU has limited memory, which requires manual service switching:
| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |
**Mode Switching Workflow:**
Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:
**Switch to Text Generation Mode (vLLM):**
```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui
# 2. Start orchestrator (manages vLLM models)
supervisorctl start orchestrator
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```
**Switch to Image/Video/Audio Generation Mode (ComfyUI):**
```bash
# 1. Stop orchestrator (stops all vLLM models)
supervisorctl stop orchestrator
# 2. Start ComfyUI
supervisorctl start comfyui
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```
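The two workflows above can be wrapped in a small helper so switching modes is a single command. A sketch (a hypothetical `switch_mode.py`, assuming `supervisorctl` is on PATH and the config path from the Supervisor section):
```python
#!/usr/bin/env python3
"""Switch the GPU between text (vLLM) and image/video/audio (ComfyUI) modes."""
import subprocess
import sys

MODES = {
    "text":  {"stop": "comfyui",      "start": "orchestrator"},
    "media": {"stop": "orchestrator", "start": "comfyui"},
}

def supervisorctl(*args: str) -> None:
    # -c pins the config file so the script works from any directory.
    subprocess.run(["supervisorctl", "-c", "/workspace/supervisord.conf", *args])

def main() -> None:
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode not in MODES:
        sys.exit(f"usage: switch_mode.py [{'|'.join(MODES)}]")
    supervisorctl("stop", MODES[mode]["stop"])    # free VRAM first
    supervisorctl("start", MODES[mode]["start"])  # then bring up the other service
    supervisorctl("status")                       # confirm the new state

if __name__ == "__main__":
    main()
```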
**Access via Supervisor Web UI:**
You can also switch modes using the Supervisor web interface:
- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service
**Integration with LiteLLM:**
The orchestrator integrates with LiteLLM on the VPS for unified API access:
- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
- Requests route through orchestrator (port 9000) which handles model loading
- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
### Testing
```bash
# Health check
curl http://localhost:9000/health
# List available models
curl http://localhost:9000/v1/models
# Test text generation (streaming)
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen-2.5-7b",
        "messages": [{"role": "user", "content": "Count to 5"}],
        "max_tokens": 50,
        "stream": true
      }'
# Test image generation
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "flux-schnell",
        "prompt": "A serene mountain landscape at sunset",
        "size": "1024x1024"
      }'
```
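The same streaming endpoint can be consumed from Python by reading the SSE lines directly. A minimal sketch using `requests` (model name and prompt mirror the curl example above):
```python
import json
import requests

# Stream a chat completion from the orchestrator and print tokens as they arrive.
resp = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json={
        "model": "qwen-2.5-7b",
        "messages": [{"role": "user", "content": "Count to 5"}],
        "max_tokens": 50,
        "stream": True,
    },
    stream=True,
    timeout=600,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()
```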
### Ansible Tags Reference
**System Setup:**
- `base` - Base system packages
- `python` - Python environment setup
- `dependencies` - Install Python packages
**Model Installation:**
- `models` - Download vLLM/Flux/MusicGen models (legacy)
- `comfyui` - Install ComfyUI base
- `comfyui-essential` - Quick setup (ComfyUI + essential models only, ~80GB)
- `comfyui-models-image` - Image generation models (FLUX, SDXL, SD3.5)
- `comfyui-models-video` - Video generation models (CogVideoX, SVD)
- `comfyui-models-audio` - Audio generation models (MusicGen variants)
- `comfyui-models-support` - CLIP, IP-Adapter, ControlNet models
- `comfyui-models-all` - All ComfyUI models (~137GB)
- `comfyui-nodes` - Install essential custom nodes
**Infrastructure:**
- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (use `never` - not for RunPod)
- `validate` - Health checks (use `never` - run explicitly)
### Adding New Models
1. **Add model definition to `model-orchestrator/models.yaml`:**
```yaml
llama-3.1-8b:
  type: text
  framework: vllm
  service_script: models/vllm/server_llama.py
  port: 8001
  vram_gb: 17
  startup_time_seconds: 120
  endpoint: /v1/chat/completions
  description: "Llama 3.1 8B Instruct"
```
2. **Create service script** (`models/vllm/server_llama.py`):
```python
import os
import sys

# Launch vLLM's OpenAI-compatible API server via its module entrypoint.
model = "meta-llama/Llama-3.1-8B-Instruct"
port = os.getenv("PORT", "8001")

os.execvp(sys.executable, [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", model,
    "--port", port,
])
```
3. **Download model** (handled by Ansible playbook or manually via HuggingFace CLI)
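   For a manual download, `huggingface_hub` can pull the weights into the shared cache. A sketch, assuming `HF_TOKEN` is set in the environment and the cache path from the Ansible setup:
```python
import os
from huggingface_hub import snapshot_download

# Download the model into the shared cache used by the vLLM services.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    cache_dir="/workspace/huggingface_cache",
    token=os.environ["HF_TOKEN"],  # gated repo: requires an accepted license
)
```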
4. **Restart orchestrator:**
```bash
bash scripts/stop-all.sh && bash scripts/start-all.sh
```
## Key Implementation Details
### Model Switching Logic
The orchestrator automatically switches models based on:
- **Endpoint path** - `/v1/chat/completions` → text models, `/v1/images/generations` → image models
- **Model name in request** - Matches against model registry
- **Sequential loading** - Stops current model before starting new one to conserve VRAM
See `orchestrator_subprocess.py:64-100` for process management implementation.
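In outline, a switch amounts to: look up the requested model in the registry, terminate the currently running service subprocess, spawn the new one, and wait for its health check before proxying. A simplified sketch of that flow (illustrative only, not the actual code at the lines referenced above; the `/health` path is an assumption):
```python
import os
import subprocess
import time

import requests

current = {"name": None, "proc": None}

def ensure_model(name: str, registry: dict) -> subprocess.Popen:
    """Sequentially swap models: stop the active service, then start `name`."""
    if current["name"] == name:
        return current["proc"]
    if current["proc"] is not None:
        current["proc"].terminate()   # free VRAM before loading the next model
        current["proc"].wait()
    spec = registry[name]
    proc = subprocess.Popen(
        ["python3", spec["service_script"]],
        env={**os.environ, "PORT": str(spec["port"])},
    )
    # Poll until the service answers or its startup budget runs out.
    deadline = time.time() + spec["startup_time_seconds"]
    while time.time() < deadline:
        try:
            requests.get(f"http://localhost:{spec['port']}/health", timeout=2)
            break
        except requests.ConnectionError:
            time.sleep(2)
    current.update(name=name, proc=proc)
    return proc
```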
### Model Registry Structure
Each model in `models.yaml` requires:
- `type` - text, image, or audio
- `framework` - vllm, openedai-images, audiocraft, comfyui
- `service_script` - Relative path to Python/shell script
- `port` - Service port (8000+)
- `vram_gb` - GPU memory requirement
- `startup_time_seconds` - Max health check timeout
- `endpoint` - API endpoint path
- `description` - Human-readable description
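When adding entries, a quick validation pass catches missing fields before the orchestrator trips over them at runtime. A sketch using PyYAML, assuming the file maps model names straight to their fields as in the snippet under "Adding New Models" (the script itself is not part of the repo):
```python
import yaml

REQUIRED = {"type", "framework", "service_script", "port", "vram_gb",
            "startup_time_seconds", "endpoint", "description"}

with open("model-orchestrator/models.yaml") as f:
    registry = yaml.safe_load(f)

for name, spec in registry.items():
    missing = REQUIRED - spec.keys()
    if missing:
        print(f"{name}: missing {sorted(missing)}")
    elif spec["type"] not in {"text", "image", "audio"}:
        print(f"{name}: unknown type {spec['type']!r}")
```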
### Environment Variables
Set in `.env` file:
- `HF_TOKEN` - **Required** - HuggingFace API token for model downloads
- `GPU_TAILSCALE_IP` - Optional - Tailscale IP for VPN access
Models are cached in:
- `/workspace/huggingface_cache` - HuggingFace models
- `/workspace/models` - Other model files
- `/workspace/ComfyUI/models` - ComfyUI model directory structure
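Since a missing or invalid `HF_TOKEN` is the most common cause of failed downloads, it is worth verifying before kicking anything off. A small sketch using `huggingface_hub` (assumed to be installed alongside the other Python dependencies):
```python
import os
from huggingface_hub import whoami

# Sanity-check that HF_TOKEN is set and valid before starting large downloads.
token = os.getenv("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set - add it to .env and export it")
print("Authenticated as:", whoami(token=token)["name"])
```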
### Integration with LiteLLM
For unified API management through LiteLLM proxy:
**LiteLLM configuration (`litellm-config.yaml` on VPS):**
```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b   # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1 # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600
```
**Critical:** Use `hosted_vllm/openai/` prefix for vLLM models to enable proper streaming support. Wrong prefix causes empty delta chunks.
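On the client side, a model proxied this way behaves like any other OpenAI endpoint. A sketch using the `openai` Python SDK pointed at the LiteLLM proxy (the base URL and API key below are placeholders for your VPS setup):
```python
from openai import OpenAI

# Point the SDK at the LiteLLM proxy; the model name matches model_list above.
client = OpenAI(base_url="https://litellm.example.com/v1", api_key="sk-your-litellm-key")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```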
### ComfyUI Installation
ComfyUI provides advanced image/video/audio generation capabilities:
**Directory structure created:**
```
/workspace/ComfyUI/
├── models/
│   ├── checkpoints/     # FLUX, SDXL, SD3 models
│   ├── clip_vision/     # CLIP vision models
│   ├── video_models/    # CogVideoX, SVD
│   ├── audio_models/    # MusicGen
│   └── custom_nodes/    # Extension nodes
```
**Essential custom nodes installed:**
- ComfyUI-Manager - Model/node management GUI
- ComfyUI-VideoHelperSuite - Video operations
- ComfyUI-AnimateDiff-Evolved - Video generation
- ComfyUI_IPAdapter_plus - Style transfer
- ComfyUI-Impact-Pack - Auto face enhancement
- comfyui-sound-lab - Audio generation
**VRAM requirements for 24GB GPU:**
- FLUX Schnell FP16: 23GB (leaves 1GB)
- SDXL Base: 12GB
- CogVideoX-5B: 12GB (with optimizations)
- MusicGen Medium: 8GB
See `COMFYUI_MODELS.md` for detailed model catalog and usage examples.
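Besides the web UI on port 8188, ComfyUI accepts workflows over its HTTP API, which is how automation typically drives it. A sketch that queues a workflow exported from the editor via "Save (API Format)" (the `workflow_api.json` filename is hypothetical):
```python
import json
import requests

# Load a workflow exported from the ComfyUI editor in API format.
with open("workflow_api.json") as f:
    workflow = json.load(f)

# Queue it; ComfyUI returns a prompt_id that can be used to poll /history.
resp = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print("queued:", resp.json()["prompt_id"])
```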
## Deployment Workflow
### RunPod Deployment (Current Setup)
1. **Clone repository:**
```bash
cd /workspace
git clone <repo-url> ai
cd ai
```
2. **Configure environment:**
```bash
cp .env.example .env
# Edit .env, set HF_TOKEN
```
3. **Run Ansible provisioning:**
```bash
ansible-playbook playbook.yml
# Or selective: --tags base,python,comfyui-essential
```
4. **Start services:**
```bash
bash scripts/start-all.sh
```
5. **Verify:**
```bash
curl http://localhost:9000/health
```
### Tailscale VPN Integration
To connect RunPod GPU to VPS infrastructure:
```bash
# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4 # Get IP for LiteLLM config
```
Benefits: Secure tunnel, no public exposure, low latency.
## Project Structure
```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py  # Main orchestrator (USE THIS)
│   ├── orchestrator.py             # Docker-based version (legacy)
│   ├── models.yaml                 # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py               # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                # Start all services with Supervisor
│   ├── stop-all.sh                 # Stop all services
│   └── status.sh                   # Check service status
├── supervisord.conf                # Supervisor process manager config
├── arty.yml                        # Arty repository manager config
├── playbook.yml                    # Ansible provisioning playbook
├── inventory.yml                   # Ansible inventory (localhost)
├── ansible.cfg                     # Ansible configuration
├── .env.example                    # Environment variables template
├── CLAUDE.md                       # This file
├── COMFYUI_MODELS.md               # ComfyUI models catalog
├── MODELS_LINKED.md                # Model linkage documentation
├── comfyui_models.yaml             # ComfyUI model configuration
└── README.md                       # User documentation
```
## Troubleshooting
### Model fails to start
- Check VRAM: `nvidia-smi`
- Verify model weights downloaded: `ls -lh /workspace/huggingface_cache`
- Check port conflicts: `lsof -i :9000`
- Test model directly: `python3 models/vllm/server.py`
### Streaming returns empty deltas
- Use correct LiteLLM model prefix: `hosted_vllm/openai/model-name`
- Set `stream: true` in LiteLLM config
- Verify orchestrator proxies streaming correctly
### HuggingFace download errors
- Check token: `echo $HF_TOKEN`
- Set in .env: `HF_TOKEN=your_token_here`
- Re-run Ansible: `ansible-playbook playbook.yml --tags dependencies`
### Out of storage space
- Check disk usage: `df -h /workspace`
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear cache: `rm -rf /workspace/huggingface_cache`
### Supervisor not running
- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Check for stale PID: `rm -f /workspace/supervisord.pid` then restart
### Service won't start
- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f orchestrator`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart service: `supervisorctl restart comfyui`
- Check if port is in use: `ss -tulpn | grep :8188`
### Orchestrator not responding
- Check supervisor status: `supervisorctl status orchestrator`
- View logs: `supervisorctl tail -f orchestrator` or `cat /workspace/logs/orchestrator.err.log`
- Restart: `supervisorctl restart orchestrator`
- Manual start for debugging: `cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py`
## Performance Notes
- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
- **Image generation:** 4-5 seconds per image (FLUX Schnell)
- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium)
## Important Conventions
- **Always use `orchestrator_subprocess.py`** - Not the Docker version
- **Sequential loading only** - One model active at a time for 24GB VRAM
- **Models downloaded by Ansible** - Use playbook tags, not manual downloads
- **Services run as processes** - Not systemd (RunPod containers don't support it)
- **Environment managed via .env** - Required: HF_TOKEN
- **Port 9000 for orchestrator** - Model services use 8000+