Initial commit

All checks were successful
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s

CLAUDE.md (new file, +105)
@@ -0,0 +1,105 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

RunPod AI Orchestrator - a Docker-based deployment template for GPU instances on RunPod. It orchestrates AI services (ComfyUI, vLLM, AudioCraft) using Supervisor for process management, with optional Tailscale VPN integration.

## Architecture

### Service Orchestration

- **Supervisor** manages all services via `supervisord.conf`
- Services run in isolated Python venvs under `services/<name>/venv/`
- Logs are written to the `.logs/` directory
- Each service has its own `requirements.txt` (see the launch sketch below)
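To make the wiring concrete, here is a minimal sketch of the command shape a `[program:comfyui]` entry in `supervisord.conf` would run: the service's own venv interpreter, with output captured under `.logs/`. The entrypoint path and flags are illustrative assumptions, not taken from this repo's actual config.

```bash
# Hypothetical launch line for one supervised service; supervisord itself
# normally handles the log redirection via stdout_logfile.
services/comfyui/venv/bin/python services/comfyui/main.py \
  --listen 0.0.0.0 --port 8188 \
  >> .logs/comfyui.log 2>&1
```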
### Services (managed by Supervisor)

| Service | Port | Description | Auto-start |
|---------|------|-------------|------------|
| ComfyUI | 8188 | Node-based image/video/audio generation | Yes |
| WebDAV Sync | - | Uploads ComfyUI outputs to HiDrive | Yes |
| AudioCraft | - | Music generation | Yes |
| vLLM Llama | 8001 | Llama 3.1 8B language model | No |
| vLLM BGE | 8002 | BGE embedding model | No |
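Since only some services auto-start, probing the listed ports is a quick way to see what is actually up. A hedged sketch: that ComfyUI answers on `/` and that the vLLM servers expose the OpenAI-compatible `/v1/models` route matches upstream defaults, but is not confirmed by this commit.

```bash
# Probe the HTTP services from inside the pod; the vLLM endpoints only
# respond once those services have been started manually (autostart=false).
curl -fsS http://localhost:8188/ > /dev/null && echo "ComfyUI up"
curl -fsS http://localhost:8001/v1/models && echo "vLLM Llama up"
curl -fsS http://localhost:8002/v1/models && echo "vLLM BGE up"
```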
### Model Management

- Models are defined in YAML configs: `models/models_civitai.yaml`, `models/models_huggingface.yaml`
- Downloaded to `.cache/` and symlinked into `services/comfyui/models/` (see the symlink sketch below)
- Uses external scripts: `artifact_civitai_download.sh`, `artifact_huggingface_download.sh`
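The cache-then-symlink flow keeps large downloads on the persistent volume while ComfyUI sees them in its expected layout. A minimal sketch of what `arty models/link` presumably does per model; the model file name and the `checkpoints/` target directory are illustrative assumptions:

```bash
# Hypothetical link step for one cached model; real names come from the
# models_*.yaml definitions.
mkdir -p services/comfyui/models/checkpoints
ln -sf "$PWD/.cache/sd_xl_base_1.0.safetensors" \
  "services/comfyui/models/checkpoints/sd_xl_base_1.0.safetensors"
```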
### Repository Management (Arty)

- `arty.yml` defines git repos to clone (ComfyUI + custom nodes)
- Repos are cloned to `services/comfyui/` and `services/comfyui/custom_nodes/`
- Run `arty sync` to clone/update all dependencies

## Common Commands

### Full Setup (on RunPod)

```bash
arty setup   # Complete setup: deps, tailscale, services, comfyui, models, supervisor
```
### Supervisor Control

```bash
arty supervisor/start      # Start supervisord
arty supervisor/stop       # Stop all services
arty supervisor/status     # Check service status
arty supervisor/restart    # Restart all services

supervisorctl -c supervisord.conf status            # Direct status check
supervisorctl -c supervisord.conf start comfyui     # Start specific service
supervisorctl -c supervisord.conf tail -f comfyui   # Follow logs
```
### Model Management

```bash
arty models/download   # Download models from Civitai/HuggingFace
arty models/link       # Symlink cached models to ComfyUI
```

### Setup Components

```bash
arty deps              # Clone git references
arty setup/tailscale   # Configure Tailscale VPN
arty setup/services    # Create venvs for all services
arty setup/comfyui     # Install ComfyUI and custom node dependencies
```

### Docker Build (CI runs on Gitea)

```bash
docker build -t runpod-ai-orchestrator .
```
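For a local smoke test of the image outside RunPod, something like the following should work on a machine with the NVIDIA container toolkit; the port and env-file choices mirror the docs above and are assumptions, not part of the CI workflow.

```bash
# Hypothetical local run; on RunPod the pod template injects GPU, ports,
# and environment instead.
docker run --rm --gpus all \
  -p 8188:8188 -p 9001:9001 \
  --env-file .env \
  runpod-ai-orchestrator
```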
## Environment Variables

Set in `.env` (or the RunPod template); an example sketch follows the list:

- `HF_TOKEN` - HuggingFace API token
- `TAILSCALE_AUTHKEY` - Tailscale auth key (optional)
- `CIVITAI_API_KEY` - Civitai API key for model downloads
- `WEBDAV_URL`, `WEBDAV_USERNAME`, `WEBDAV_PASSWORD`, `WEBDAV_REMOTE_PATH` - WebDAV sync config
- `PUBLIC_KEY` - SSH public key for RunPod access
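A sketch of what a filled-in `.env` might look like; every value below is a placeholder, and the WebDAV URL format in particular is an assumption:

```bash
# Example .env (placeholders only)
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
TAILSCALE_AUTHKEY=tskey-auth-xxxxxxxx   # optional; omit to skip VPN setup
CIVITAI_API_KEY=xxxxxxxxxxxxxxxx
WEBDAV_URL=https://webdav.example.com/remote.php/dav
WEBDAV_USERNAME=me
WEBDAV_PASSWORD=changeme
WEBDAV_REMOTE_PATH=/comfyui-outputs
PUBLIC_KEY="ssh-ed25519 AAAA... user@host"
```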
## File Structure

```
├── Dockerfile                    # Minimal base image (PyTorch + CUDA)
├── start.sh                      # Container entrypoint
├── supervisord.conf              # Process manager config
├── arty.yml                      # Git repos + setup scripts
├── models/
│   ├── models_civitai.yaml       # Civitai model definitions
│   └── models_huggingface.yaml   # HuggingFace model definitions
├── services/
│   ├── comfyui/                  # ComfyUI + custom_nodes (cloned by arty)
│   ├── audiocraft/               # AudioCraft Studio (cloned by arty)
│   ├── vllm/                     # vLLM configs
│   └── webdav-sync/              # Output sync service
└── .gitea/workflows/             # CI/CD for Docker builds
```
## Important Notes

- **Network Volume**: On RunPod, `/workspace` is the persistent network volume. The orchestrator repo is cloned to `/workspace/orchestrator`.
- **Service Ports**: ComfyUI (8188), Supervisor Web UI (9001), vLLM Llama (8001), vLLM BGE (8002)
- **vLLM services** are disabled by default (`autostart=false`) to conserve GPU memory; start them on demand as sketched below
- **Custom nodes** have their dependencies installed into the ComfyUI venv during setup
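To bring a vLLM service up on demand without touching the others, something along these lines should work; the program names `vllm-llama` and `vllm-bge` are assumptions and should be checked against `supervisord.conf`:

```bash
# Start (and later stop) a vLLM service on demand; program names assumed.
supervisorctl -c supervisord.conf start vllm-llama
supervisorctl -c supervisord.conf stop vllm-llama
```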
arty.yml (52 lines changed)
@@ -111,34 +111,33 @@ scripts:
     echo "========================================="
     echo ""

-    if [ -n "${TAILSCALE_AUTHKEY:-}" ]; then
-      echo " Starting Tailscale daemon..."
-      tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
-      sleep 3
-
-      echo " Connecting to Tailscale network..."
-      HOSTNAME="runpod-ai-orchestrator"
-      tailscale up --authkey="$TAILSCALE_AUTHKEY" --advertise-tags=tag:gpu --hostname="$HOSTNAME" || {
-        echo " ⚠ Tailscale connection failed, continuing without VPN"
-      }
-
-      # Get Tailscale IP if connected
-      TAILSCALE_IP=$(tailscale ip -4 2>/dev/null || echo "not connected")
-      if [ "$TAILSCALE_IP" != "not connected" ]; then
-        echo " ✓ Tailscale connected"
-        echo " Hostname: $HOSTNAME"
-        echo " IP: $TAILSCALE_IP"
-
-        # Export for other services
-        export GPU_TAILSCALE_IP="$TAILSCALE_IP"
-      else
-        echo " ⚠ Tailscale failed to obtain IP"
-      fi
-    else
-      echo " ⚠ Tailscale disabled (no TAILSCALE_AUTHKEY in env)"
-      echo " Services requiring VPN connectivity will not work"
-    fi
+    if [ ! "$TAILSCALE_AUTHKEY" ]; then
+      echo " ⚠ Tailscale disabled (no TAILSCALE_AUTHKEY in env)"
+      echo " Services requiring VPN connectivity will not work"
+      exit 1
+    fi
+
+    echo " Starting Tailscale daemon..."
+    tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
+    sleep 3
+
+    echo " Connecting to Tailscale network..."
+    HOSTNAME="runpod-ai-orchestrator"
+    tailscale up --authkey="$TAILSCALE_AUTHKEY" --advertise-tags=tag:gpu --hostname="$HOSTNAME" || {
+      echo " ⚠ Tailscale connection failed, continuing without VPN"
+    }
+
+    # Get Tailscale IP if connected
+    TAILSCALE_IP=$(tailscale ip -4 2>/dev/null || echo "not connected")
+    if [ "$TAILSCALE_IP" == "not connected" ]; then
+      echo " ⚠ Tailscale failed to obtain IP"
+      exit 1
+    fi
+
+    echo " ✓ Tailscale connected"
+    echo " Hostname: $HOSTNAME"
+    echo " IP: $TAILSCALE_IP"

   setup/services: |
     echo "========================================="
     echo " Setting up services python venvs"
@@ -241,6 +240,7 @@ scripts:
   # Supervisor Control Scripts
   #
   supervisor/start: |
+    mkdir -p .logs/
     supervisord -c supervisord.conf

   supervisor/stop: |
@@ -1,4 +0,0 @@
-model: meta-llama/Llama-3.1-8B-Instruct
-host: "0.0.0.0"
-port: 8001
-uvicorn-log-level: "info"
runpod.yml (new file, +42)
@@ -0,0 +1,42 @@
# RunPod Pod Configuration
# Used by service_runpod_control.sh
#
# Usage:
#   service_runpod_control.sh create   # Create pod from this config
#   service_runpod_control.sh get      # Show pod status
#   service_runpod_control.sh start    # Start the pod
#   service_runpod_control.sh stop     # Stop the pod
#   service_runpod_control.sh remove   # Delete the pod

pod:
  # Required fields
  name: "runpod-ai-orchestrator"
  gpuType: "NVIDIA GeForce RTX 4090"
  gpuCount: 1

  # Template and volume IDs (from RunPod dashboard)
  templateId: "runpod-ai-orchestrator"
  networkVolumeId: "runpod-ai-orchestrator"
  imageName: "dev.pivoine.art/valknar/runpod-ai-orchestrator:latest"

  # Exposed ports
  ports:
    - "22/tcp"

  # Optional: Resource limits
  # containerDiskSize: 20   # GB (default: 20)
  # volumeSize: 1           # GB (default: 1)
  # volumePath: "/runpod"   # Mount path
  # mem: 20                 # Minimum memory GB
  # vcpu: 1                 # Minimum vCPUs

  # Optional: Cloud selection
  # secureCloud: false      # Use secure cloud only
  # communityCloud: false   # Use community cloud only

  # Optional: Custom image (overrides template)
  # imageName: "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04"

  # Optional: Environment variables
  # env:
  #   - "HF_TOKEN=your_token_here"
  #   - "CUSTOM_VAR=value"
start.sh (4 lines changed)
@@ -25,7 +25,7 @@ if [ ! -d "$PWD/bin" ] ; then
   git clone https://dev.pivoine.art/valknar/bin.git "$PWD/bin"
   echo " ✓ bin cloned"
 else
-  cd "$PWD/bin" && git stash && git pull && git stash pop || true
+  cd "$PWD/bin" && git fetch && git reset --hard origin/main
   echo " ✓ bin updated"
   cd -
 fi
@@ -33,7 +33,7 @@ if [ ! -d "$PWD/orchestrator" ] ; then
   git clone https://dev.pivoine.art/valknar/runpod-ai-orchestrator.git "$PWD/orchestrator"
   echo " ✓ orchestrator cloned"
 else
-  cd "$PWD/orchestrator" && git stash && git pull && git stash pop || true
+  cd "$PWD/orchestrator" && git fetch && git reset --hard origin/main
   echo " ✓ orchestrator updated"
   cd -
 fi