# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.

**Key Design Philosophy:**

- **Sequential model loading** - Only one model active at a time to fit within GPU memory constraints
- **Process-based architecture** - Uses Python subprocesses instead of Docker-in-Docker for RunPod compatibility
- **Automatic model switching** - Orchestrator detects request types and switches models on-demand
- **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools

## Architecture

### Core Components

1. **Orchestrator** (`model-orchestrator/orchestrator_subprocess.py`)
   - FastAPI proxy server listening on port 9000
   - Manages model lifecycle via Python subprocesses
   - Routes requests to appropriate model services
   - Handles sequential model loading/unloading

2. **Model Registry** (`model-orchestrator/models.yaml`)
   - YAML configuration defining available models
   - Specifies: type, framework, service script, port, VRAM requirements, startup time
   - Easy to extend with new models

3. **Model Services** (`models/*/`)
   - Individual Python servers running specific AI models
   - vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
   - ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)

4. **Ansible Provisioning** (`playbook.yml`)
   - Complete infrastructure-as-code setup
   - Installs dependencies, downloads models, configures services
   - Supports selective installation via tags

### Why Process-Based Instead of Docker?

The subprocess implementation (`orchestrator_subprocess.py`) is preferred over the Docker version (`orchestrator.py`) because:

- RunPod instances run in containers; Docker-in-Docker adds complexity
- Faster model startup (direct Python process spawning)
- Simpler debugging (single process tree)
- Reduced overhead (no container management layer)

**Note:** Always use `orchestrator_subprocess.py` for RunPod deployments.
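For illustration, the process-based design boils down to spawning and stopping model services as direct child processes. This is a minimal sketch, not the actual code in `orchestrator_subprocess.py`; the script path and port are taken from the examples later in this document:

```python
import os
import subprocess

# Spawn a model service directly as a child process (no container layer).
proc = subprocess.Popen(
    ["python3", "models/vllm/server.py"],
    env={**os.environ, "PORT": "8000"},
)

# ... proxy requests to it, then stop it before loading the next model ...
proc.terminate()
proc.wait()
```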
## Common Commands

### Repository Management with Arty

This project uses Arty for repository and deployment management. See `arty.yml` for full configuration.

```bash
# Clone all repositories (fresh deployment)
arty sync --env prod      # Production: Essential nodes only
arty sync --env dev       # Development: All nodes including optional
arty sync --env minimal   # Minimal: Just orchestrator + ComfyUI base

# Run deployment scripts
arty run setup/full            # Show setup instructions
arty run models/link-comfyui   # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes    # Install custom node dependencies
arty run services/start        # Start orchestrator
arty run services/stop         # Stop all services

# Health checks
arty run health/orchestrator   # Check orchestrator
arty run health/comfyui        # Check ComfyUI
arty run check/gpu             # nvidia-smi
arty run check/models          # Show cache size
```

### Initial Setup

```bash
# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies
```

### Service Management

This project uses **Supervisor** for process management, providing auto-restart, centralized logging, and easy service control.

```bash
# Start all services (with Supervisor)
bash scripts/start-all.sh            # Starts supervisor daemon + services
arty run services/start              # Same via arty

# Stop all services
bash scripts/stop-all.sh             # Stops all services + supervisor
arty run services/stop               # Same via arty

# Check service status
bash scripts/status.sh               # Show all service status
arty run services/status             # Same via arty
supervisorctl status                 # Direct supervisor command

# Individual service control
supervisorctl start orchestrator     # Start orchestrator
supervisorctl restart comfyui        # Restart ComfyUI
supervisorctl stop orchestrator      # Stop orchestrator
arty run services/restart-comfyui    # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui        # Follow ComfyUI logs
supervisorctl tail -f orchestrator   # Follow orchestrator logs
arty run services/logs               # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
```

**Supervisor Configuration:**

- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`

**Services managed:**

- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `orchestrator` - Model orchestrator (port 9000, autostart disabled)

### GPU Memory Management and Mode Switching

**VRAM Constraints (RTX 4090 - 24GB total):**

The GPU has limited memory, which requires manual service switching:

| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |

**Mode Switching Workflow:**

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes.

**Switch to Text Generation Mode (vLLM):**

```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui

# 2. Start orchestrator (manages vLLM models)
supervisorctl start orchestrator

# 3. Verify
supervisorctl status
nvidia-smi   # Check VRAM usage
```

**Switch to Image/Video/Audio Generation Mode (ComfyUI):**

```bash
# 1. Stop orchestrator (stops all vLLM models)
supervisorctl stop orchestrator

# 2. Start ComfyUI
supervisorctl start comfyui

# 3. Verify
supervisorctl status
nvidia-smi   # Check VRAM usage
```

**Access via Supervisor Web UI:**

You can also switch modes using the Supervisor web interface:

- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service
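The same switch can also be scripted against Supervisor's XML-RPC interface. A minimal sketch, assuming the inet_http_server on port 9001 configured above; error handling for already-stopped processes is omitted:

```python
from xmlrpc.client import ServerProxy

# Supervisor XML-RPC endpoint, using the web UI credentials from supervisord.conf.
server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

def text_mode() -> None:
    """Free ComfyUI's VRAM, then bring up the vLLM orchestrator."""
    server.supervisor.stopProcess("comfyui")
    server.supervisor.startProcess("orchestrator")

def image_mode() -> None:
    """Stop the orchestrator (and its vLLM models), then start ComfyUI."""
    server.supervisor.stopProcess("orchestrator")
    server.supervisor.startProcess("comfyui")

print(server.supervisor.getAllProcessInfo())  # inspect current service states
```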
**Integration with LiteLLM:**

The orchestrator integrates with LiteLLM on the VPS for unified API access:

- vLLM models (qwen-2.5-7b, llama-3.1-8b) are available when the orchestrator is running
- Requests route through the orchestrator (port 9000), which handles model loading
- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures the connection
- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP

### Testing

```bash
# Health check
curl http://localhost:9000/health

# List available models
curl http://localhost:9000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

# Test image generation
curl -X POST http://localhost:9000/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "size": "1024x1024"
  }'
```

### Ansible Tags Reference

**System Setup:**

- `base` - Base system packages
- `python` - Python environment setup
- `dependencies` - Install Python packages

**Model Installation:**

- `models` - Download vLLM/Flux/MusicGen models (legacy)
- `comfyui` - Install ComfyUI base
- `comfyui-essential` - Quick setup (ComfyUI + essential models only, ~80GB)
- `comfyui-models-image` - Image generation models (FLUX, SDXL, SD3.5)
- `comfyui-models-video` - Video generation models (CogVideoX, SVD)
- `comfyui-models-audio` - Audio generation models (MusicGen variants)
- `comfyui-models-support` - CLIP, IP-Adapter, ControlNet models
- `comfyui-models-all` - All ComfyUI models (~137GB)
- `comfyui-nodes` - Install essential custom nodes

**Infrastructure:**

- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (tagged `never` - not for RunPod)
- `validate` - Health checks (tagged `never` - run explicitly)

### Adding New Models

1. **Add model definition to `model-orchestrator/models.yaml`:**

   ```yaml
   llama-3.1-8b:
     type: text
     framework: vllm
     service_script: models/vllm/server_llama.py
     port: 8001
     vram_gb: 17
     startup_time_seconds: 120
     endpoint: /v1/chat/completions
     description: "Llama 3.1 8B Instruct"
   ```

2. **Create service script** (`models/vllm/server_llama.py`):

   ```python
   import os
   import subprocess
   import sys

   # Launch vLLM's OpenAI-compatible API server on the orchestrator-assigned port.
   model = "meta-llama/Llama-3.1-8B-Instruct"
   port = os.getenv("PORT", "8001")
   subprocess.run([sys.executable, "-m", "vllm.entrypoints.openai.api_server",
                   "--model", model, "--port", port], check=True)
   ```

3. **Download model** (handled by the Ansible playbook, or manually via the HuggingFace CLI)

4. **Restart orchestrator:**

   ```bash
   bash scripts/stop-all.sh && bash scripts/start-all.sh
   ```

## Key Implementation Details

### Model Switching Logic

The orchestrator automatically switches models based on:

- **Endpoint path** - `/v1/chat/completions` → text models, `/v1/images/generations` → image models
- **Model name in request** - Matches against the model registry
- **Sequential loading** - Stops the current model before starting a new one to conserve VRAM

See `orchestrator_subprocess.py:64-100` for the process management implementation.
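As a rough sketch of that flow (illustrative only; the real logic lives in `orchestrator_subprocess.py`, and the `requests`/`PyYAML` imports plus a `/health` endpoint on each service are assumptions):

```python
import os
import subprocess
import time

import requests
import yaml

with open("model-orchestrator/models.yaml") as f:
    REGISTRY = yaml.safe_load(f)

current = {"name": None, "proc": None}

def switch_model(name: str) -> None:
    """Stop the active model service, then start `name` (one model at a time)."""
    if current["name"] == name:
        return
    if current["proc"] is not None:
        # Sequential loading: free VRAM before spawning the next service.
        current["proc"].terminate()
        current["proc"].wait()
    cfg = REGISTRY[name]
    proc = subprocess.Popen(
        ["python3", cfg["service_script"]],
        env={**os.environ, "PORT": str(cfg["port"])},
    )
    # Poll the new service until it responds or the startup budget runs out.
    deadline = time.time() + cfg["startup_time_seconds"]
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{cfg['port']}/health", timeout=2).ok:
                break
        except requests.RequestException:
            pass
        time.sleep(2)
    current.update(name=name, proc=proc)
```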
### Model Registry Structure

Each model in `models.yaml` requires:

- `type` - text, image, or audio
- `framework` - vllm, openedai-images, audiocraft, comfyui
- `service_script` - Relative path to Python/shell script
- `port` - Service port (8000+)
- `vram_gb` - GPU memory requirement
- `startup_time_seconds` - Max health check timeout
- `endpoint` - API endpoint path
- `description` - Human-readable description

### Environment Variables

Set in the `.env` file:

- `HF_TOKEN` - **Required** - HuggingFace API token for model downloads
- `GPU_TAILSCALE_IP` - Optional - Tailscale IP for VPN access

Models are cached in:

- `/workspace/huggingface_cache` - HuggingFace models
- `/workspace/models` - Other model files
- `/workspace/ComfyUI/models` - ComfyUI model directory structure

### Integration with LiteLLM

For unified API management through the LiteLLM proxy:

**LiteLLM configuration (`litellm-config.yaml` on VPS):**

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b    # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600
```

**Critical:** Use the `hosted_vllm/openai/` prefix for vLLM models to enable proper streaming support. The wrong prefix causes empty delta chunks.
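To sanity-check streaming end to end, here is a short sketch using the `openai` Python client pointed directly at the orchestrator (swap `base_url` for your LiteLLM proxy to test the full chain; the `openai` package is assumed to be installed):

```python
from openai import OpenAI

# The orchestrator exposes an OpenAI-compatible API on port 9000.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # consistently empty deltas usually point to a wrong LiteLLM prefix
        print(delta, end="", flush=True)
print()
```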
### ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities.

**Directory structure created:**

```
/workspace/ComfyUI/
├── models/
│   ├── checkpoints/     # FLUX, SDXL, SD3 models
│   ├── clip_vision/     # CLIP vision models
│   ├── video_models/    # CogVideoX, SVD
│   └── audio_models/    # MusicGen
└── custom_nodes/        # Extension nodes
```

**Essential custom nodes installed:**

- ComfyUI-Manager - Model/node management GUI
- ComfyUI-VideoHelperSuite - Video operations
- ComfyUI-AnimateDiff-Evolved - Video generation
- ComfyUI_IPAdapter_plus - Style transfer
- ComfyUI-Impact-Pack - Auto face enhancement
- comfyui-sound-lab - Audio generation

**VRAM requirements for 24GB GPU:**

- FLUX Schnell FP16: 23GB (leaves 1GB)
- SDXL Base: 12GB
- CogVideoX-5B: 12GB (with optimizations)
- MusicGen Medium: 8GB

See `COMFYUI_MODELS.md` for a detailed model catalog and usage examples.

## Deployment Workflow

### RunPod Deployment (Current Setup)

1. **Clone repository:**

   ```bash
   cd /workspace
   git clone <repository-url> ai
   cd ai
   ```

2. **Configure environment:**

   ```bash
   cp .env.example .env
   # Edit .env, set HF_TOKEN
   ```

3. **Run Ansible provisioning:**

   ```bash
   ansible-playbook playbook.yml
   # Or selective: --tags base,python,comfyui-essential
   ```

4. **Start services:**

   ```bash
   bash scripts/start-all.sh
   ```

5. **Verify:**

   ```bash
   curl http://localhost:9000/health
   ```

### Tailscale VPN Integration

To connect the RunPod GPU to the VPS infrastructure:

```bash
# On RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4   # Get IP for LiteLLM config
```

Benefits: secure tunnel, no public exposure, low latency.

## Project Structure

```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py   # Main orchestrator (USE THIS)
│   ├── orchestrator.py              # Docker-based version (legacy)
│   ├── models.yaml                  # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py                # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                 # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                 # Start all services with Supervisor
│   ├── stop-all.sh                  # Stop all services
│   └── status.sh                    # Check service status
├── supervisord.conf                 # Supervisor process manager config
├── arty.yml                         # Arty repository manager config
├── playbook.yml                     # Ansible provisioning playbook
├── inventory.yml                    # Ansible inventory (localhost)
├── ansible.cfg                      # Ansible configuration
├── .env.example                     # Environment variables template
├── CLAUDE.md                        # This file
├── COMFYUI_MODELS.md                # ComfyUI models catalog
├── MODELS_LINKED.md                 # Model linkage documentation
├── comfyui_models.yaml              # ComfyUI model configuration
└── README.md                        # User documentation
```

## Troubleshooting

### Model fails to start

- Check VRAM: `nvidia-smi`
- Verify model weights downloaded: `ls -lh /workspace/huggingface_cache`
- Check port conflicts: `lsof -i :9000`
- Test model directly: `python3 models/vllm/server.py`

### Streaming returns empty deltas

- Use the correct LiteLLM model prefix: `hosted_vllm/openai/model-name`
- Set `stream: true` in the LiteLLM config
- Verify the orchestrator proxies streaming correctly

### HuggingFace download errors

- Check token: `echo $HF_TOKEN`
- Set in .env: `HF_TOKEN=your_token_here`
- Re-run Ansible: `ansible-playbook playbook.yml --tags dependencies`

### Out of storage space

- Check disk usage: `df -h /workspace`
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear cache: `rm -rf /workspace/huggingface_cache`

### Supervisor not running

- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Check for a stale PID: `rm -f /workspace/supervisord.pid`, then restart

### Service won't start

- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f orchestrator`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart service: `supervisorctl restart comfyui`
- Check if the port is in use: `ss -tulpn | grep :8188`

### Orchestrator not responding

- Check supervisor status: `supervisorctl status orchestrator`
- View logs: `supervisorctl tail -f orchestrator` or `cat /workspace/logs/orchestrator.err.log`
- Restart: `supervisorctl restart orchestrator`
- Manual start for debugging: `cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py`

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
- **Image generation:** 4-5 seconds per image (FLUX Schnell)
- **Music generation:** 60-90 seconds for 30s audio (MusicGen Medium)

## Important Conventions

- **Always use `orchestrator_subprocess.py`** - Not the Docker version
- **Sequential loading only** - One model active at a time for 24GB VRAM
- **Models downloaded by Ansible** - Use playbook tags, not manual downloads
- **Services run as processes** - Not systemd (RunPod containers don't support it)
- **Environment managed via `.env`** - Required: `HF_TOKEN`
- **Port 9000 for orchestrator** - Model services use 8000+