diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..8b17173
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,105 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+RunPod AI Orchestrator - A Docker-based deployment template for GPU instances on RunPod. Orchestrates AI services (ComfyUI, vLLM, AudioCraft) using Supervisor for process management, with optional Tailscale VPN integration.
+
+## Architecture
+
+### Service Orchestration
+- **Supervisor** manages all services via `supervisord.conf`
+- Services run in isolated Python venvs under `services//venv/`
+- Logs written to `.logs/` directory
+- Each service has its own requirements.txt
+
+### Services (managed by Supervisor)
+| Service | Port | Description | Auto-start |
+|---------|------|-------------|------------|
+| ComfyUI | 8188 | Node-based image/video/audio generation | Yes |
+| WebDAV Sync | - | Uploads ComfyUI outputs to HiDrive | Yes |
+| AudioCraft | - | Music generation | Yes |
+| vLLM Llama | 8001 | Llama 3.1 8B language model | No |
+| vLLM BGE | 8002 | BGE embedding model | No |
+
+### Model Management
+- Models defined in YAML configs: `models/models_civitai.yaml`, `models/models_huggingface.yaml`
+- Downloaded to `.cache/` and symlinked to `services/comfyui/models/`
+- Uses external scripts: `artifact_civitai_download.sh`, `artifact_huggingface_download.sh`
+
+### Repository Management (Arty)
+- `arty.yml` defines git repos to clone (ComfyUI + custom nodes)
+- Repos cloned to `services/comfyui/` and `services/comfyui/custom_nodes/`
+- Run `arty sync` to clone/update all dependencies
+
+## Common Commands
+
+### Full Setup (on RunPod)
+```bash
+arty setup  # Complete setup: deps, tailscale, services, comfyui, models, supervisor
+```
+
+### Supervisor Control
+```bash
+arty supervisor/start    # Start supervisord
+arty supervisor/stop     # Stop all services
+arty supervisor/status   # Check service status
+arty supervisor/restart  # Restart all services
+supervisorctl -c supervisord.conf status           # Direct status check
+supervisorctl -c supervisord.conf start comfyui    # Start specific service
+supervisorctl -c supervisord.conf tail -f comfyui  # Follow logs
+```
+
+### Model Management
+```bash
+arty models/download  # Download models from Civitai/HuggingFace
+arty models/link      # Symlink cached models to ComfyUI
+```
+
+### Setup Components
+```bash
+arty deps             # Clone git references
+arty setup/tailscale  # Configure Tailscale VPN
+arty setup/services   # Create venvs for all services
+arty setup/comfyui    # Install ComfyUI and custom node dependencies
+```
+
+### Docker Build (CI runs on Gitea)
+```bash
+docker build -t runpod-ai-orchestrator .
+```
+
+## Environment Variables
+
+Required in `.env` (or RunPod template):
+- `HF_TOKEN` - HuggingFace API token
+- `TAILSCALE_AUTHKEY` - Tailscale auth key (optional)
+- `CIVITAI_API_KEY` - Civitai API key for model downloads
+- `WEBDAV_URL`, `WEBDAV_USERNAME`, `WEBDAV_PASSWORD`, `WEBDAV_REMOTE_PATH` - WebDAV sync config
+- `PUBLIC_KEY` - SSH public key for RunPod access
+
+## File Structure
+
+```
+├── Dockerfile                      # Minimal base image (PyTorch + CUDA)
+├── start.sh                        # Container entrypoint
+├── supervisord.conf                # Process manager config
+├── arty.yml                        # Git repos + setup scripts
+├── models/
+│   ├── models_civitai.yaml         # Civitai model definitions
+│   └── models_huggingface.yaml     # HuggingFace model definitions
+├── services/
+│   ├── comfyui/                    # ComfyUI + custom_nodes (cloned by arty)
+│   ├── audiocraft/                 # AudioCraft Studio (cloned by arty)
+│   ├── vllm/                       # vLLM configs
+│   └── webdav-sync/                # Output sync service
+└── .gitea/workflows/               # CI/CD for Docker builds
+```
+
+## Important Notes
+
+- **Network Volume**: On RunPod, `/workspace` is the persistent network volume. The orchestrator repo is cloned to `/workspace/orchestrator`.
+- **Service Ports**: ComfyUI (8188), Supervisor Web UI (9001), vLLM Llama (8001), vLLM BGE (8002)
+- **vLLM services** are disabled by default (autostart=false) to conserve GPU memory
+- **Custom nodes** have their dependencies installed into the ComfyUI venv during setup
diff --git a/arty.yml b/arty.yml
index 490ca82..534232c 100644
--- a/arty.yml
+++ b/arty.yml
@@ -111,34 +111,33 @@ scripts:
     echo "========================================="
     echo ""
 
-    if [ -n "${TAILSCALE_AUTHKEY:-}" ]; then
-      echo " Starting Tailscale daemon..."
-      tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
-      sleep 3
-
-      echo " Connecting to Tailscale network..."
-      HOSTNAME="runpod-ai-orchestrator"
-      tailscale up --authkey="$TAILSCALE_AUTHKEY" --advertise-tags=tag:gpu --hostname="$HOSTNAME" || {
-        echo " ⚠ Tailscale connection failed, continuing without VPN"
-      }
-
-      # Get Tailscale IP if connected
-      TAILSCALE_IP=$(tailscale ip -4 2>/dev/null || echo "not connected")
-      if [ "$TAILSCALE_IP" != "not connected" ]; then
-        echo " ✓ Tailscale connected"
-        echo " Hostname: $HOSTNAME"
-        echo " IP: $TAILSCALE_IP"
-
-        # Export for other services
-        export GPU_TAILSCALE_IP="$TAILSCALE_IP"
-      else
-        echo " ⚠ Tailscale failed to obtain IP"
-      fi
-    else
-      echo " ⚠ Tailscale disabled (no TAILSCALE_AUTHKEY in env)"
-      echo " Services requiring VPN connectivity will not work"
+    if [ ! "$TAILSCALE_AUTHKEY" ]; then
+      echo " ⚠ Tailscale disabled (no TAILSCALE_AUTHKEY in env)"
+      echo " Services requiring VPN connectivity will not work"
+      exit 1
     fi
 
+    echo " Starting Tailscale daemon..."
+    tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
+    sleep 3
+
+    echo " Connecting to Tailscale network..."
+    HOSTNAME="runpod-ai-orchestrator"
+    tailscale up --authkey="$TAILSCALE_AUTHKEY" --advertise-tags=tag:gpu --hostname="$HOSTNAME" || {
+      echo " ⚠ Tailscale connection failed, continuing without VPN"
+    }
+
+    # Get Tailscale IP if connected
+    TAILSCALE_IP=$(tailscale ip -4 2>/dev/null || echo "not connected")
+    if [ "$TAILSCALE_IP" == "not connected" ]; then
+      echo " ⚠ Tailscale failed to obtain IP"
+      exit 1
+    fi
+
+    echo " ✓ Tailscale connected"
+    echo " Hostname: $HOSTNAME"
+    echo " IP: $TAILSCALE_IP"
+
   setup/services: |
     echo "========================================="
     echo " Setting up services python venvs"
@@ -241,6 +240,7 @@ scripts:
   # Supervisor Control Scripts
   #
   supervisor/start: |
+    mkdir -p .logs/
     supervisord -c supervisord.conf
 
   supervisor/stop: |
diff --git a/config_llama.yaml b/config_llama.yaml
deleted file mode 100644
index ddc68a1..0000000
--- a/config_llama.yaml
+++ /dev/null
@@ -1,4 +0,0 @@
-model: meta-llama/Llama-3.1-8B-Instruct
-host: "0.0.0.0"
-port: 8001
-uvicorn-log-level: "info"
diff --git a/runpod.yml b/runpod.yml
new file mode 100644
index 0000000..3d54cea
--- /dev/null
+++ b/runpod.yml
@@ -0,0 +1,42 @@
+# RunPod Pod Configuration
+# Used by service_runpod_control.sh
+#
+# Usage:
+#   service_runpod_control.sh create   # Create pod from this config
+#   service_runpod_control.sh get      # Show pod status
+#   service_runpod_control.sh start    # Start the pod
+#   service_runpod_control.sh stop     # Stop the pod
+#   service_runpod_control.sh remove   # Delete the pod
+
+pod:
+  # Required fields
+  name: "runpod-ai-orchestrator"
+  gpuType: "NVIDIA GeForce RTX 4090"
+  gpuCount: 1
+
+  # Template and volume IDs (from RunPod dashboard)
+  templateId: "runpod-ai-orchestrator"
+  networkVolumeId: "runpod-ai-orchestrator"
+  imageName: "dev.pivoine.art/valknar/runpod-ai-orchestrator:latest"
+  # Exposed ports
+  ports:
+    - "22/tcp"
+
+  # Optional: Resource limits
+  # containerDiskSize: 20 # GB (default: 20)
+  # volumeSize: 1 # GB (default: 1)
+  # volumePath: "/runpod" # Mount path
+  # mem: 20 # Minimum memory GB
+  # vcpu: 1 # Minimum vCPUs
+
+  # Optional: Cloud selection
+  # secureCloud: false # Use secure cloud only
+  # communityCloud: false # Use community cloud only
+
+  # Optional: Custom image (overrides template)
+  # imageName: "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04"
+
+  # Optional: Environment variables
+  # env:
+  #   - "HF_TOKEN=your_token_here"
+  #   - "CUSTOM_VAR=value"
diff --git a/start.sh b/start.sh
index a74ab6c..ef1f32d 100644
--- a/start.sh
+++ b/start.sh
@@ -25,7 +25,7 @@ if [ ! -d "$PWD/bin" ] ; then
     git clone https://dev.pivoine.art/valknar/bin.git "$PWD/bin"
     echo " ✓ bin cloned"
 else
-    cd "$PWD/bin" && git stash && git pull && git stash pop || true
+    cd "$PWD/bin" && git fetch && git reset --hard origin/main
     echo " ✓ bin updated"
     cd -
 fi
@@ -33,7 +33,7 @@ if [ ! -d "$PWD/orchestrator" ] ; then
     git clone https://dev.pivoine.art/valknar/runpod-ai-orchestrator.git "$PWD/orchestrator"
     echo " ✓ orchestrator cloned"
 else
-    cd "$PWD/orchestrator" && git stash && git pull && git stash pop || true
+    cd "$PWD/orchestrator" && git fetch && git reset --hard origin/main
     echo " ✓ orchestrator updated"
     cd -
 fi