feat: add RunPod Docker template with automated build workflow

- Add Dockerfile with minimal setup (supervisor, tailscale)
- Add start.sh bootstrap script for container initialization
- Add Gitea workflow for automated Docker image builds
- Add comprehensive RUNPOD_TEMPLATE.md documentation
- Add bootstrap-venvs.sh for Python venv health checks

This enables deployment of the AI orchestrator on RunPod using:
- Minimal Docker image (~2-3GB) for fast deployment
- Network volume for models and data persistence (~80-200GB)
- Automated builds on push to main or version tags
- Full Tailscale VPN integration
- Supervisor process management
2025-11-23 21:53:56 +01:00
parent 0e3150e26c
commit 571431955d
5 changed files with 834 additions and 0 deletions
--- a/.gitea/workflows/build-docker-image.yml
+++ b/.gitea/workflows/build-docker-image.yml
@@ -0,0 +1,58 @@
+name: Build and Push RunPod Docker Image
+
+on:
+  push:
+    branches:
+      - main
+    tags:
+      - 'v*'
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Log in to Docker Hub
+        if: github.event_name != 'pull_request'
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKER_USERNAME }}
+          password: ${{ secrets.DOCKER_PASSWORD }}
+
+      - name: Extract metadata (tags, labels)
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ secrets.DOCKER_USERNAME }}/runpod-ai-orchestrator
+          tags: |
+            type=ref,event=branch
+            type=ref,event=pr
+            type=semver,pattern={{version}}
+            type=semver,pattern={{major}}.{{minor}}
+            type=semver,pattern={{major}}
+            type=sha
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build and push Docker image
+        id: build
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          file: ./Dockerfile
+          platforms: linux/amd64
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
+
+      - name: Image digest
+        run: echo "${{ steps.build.outputs.digest }}"
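The tag rules in the metadata step are hard to read off at a glance. As a rough sketch only (this is not how docker/metadata-action is implemented), the function below approximates which image tags a push to a given ref would produce. The name `expected_tags` and the default image name are illustrative assumptions. The sha-based tag, and any `latest` tag the action's default flavor may add for version tags, are deliberately left out.

```shell
#!/bin/sh
# Approximate the image tags the workflow's metadata rules yield for a git ref.
# Hypothetical helper for illustration; not part of this commit.
expected_tags() {
    local ref="$1" image="${2:-your-username/runpod-ai-orchestrator}"
    case "$ref" in
        refs/heads/*)
            # type=ref,event=branch -> tag named after the branch
            local branch="${ref#refs/heads/}"
            echo "$image:$branch"
            # type=raw,value=latest applies on the default branch (assumed: main)
            if [ "$branch" = "main" ]; then
                echo "$image:latest"
            fi
            ;;
        refs/tags/v*)
            # type=semver patterns: {{version}}, {{major}}.{{minor}}, {{major}}
            local version="${ref#refs/tags/v}"
            local major="${version%%.*}"
            local rest="${version#*.}"
            local minor="${rest%%.*}"
            echo "$image:$version"
            echo "$image:$major.$minor"
            echo "$image:$major"
            ;;
        *)
            echo "unsupported ref: $ref" >&2
            return 1
            ;;
    esac
}

expected_tags refs/tags/v1.0.0
```

Calling it with `refs/heads/main` instead lists the branch tag plus `latest`, which is the tag RUNPOD_TEMPLATE.md tells users to deploy.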
--- a/Dockerfile
+++ b/Dockerfile
@@ -0,0 +1,26 @@
+# RunPod AI Orchestrator Template
+# Minimal Docker image for ComfyUI + vLLM orchestration
+# Models and application code live on network volume at /workspace
+
+FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
+
+# Install Supervisor for process management
+RUN pip install --no-cache-dir supervisor
+
+# Install Tailscale for VPN connectivity
+RUN curl -fsSL https://tailscale.com/install.sh | sh
+
+# Install additional system utilities
+RUN apt-get update && apt-get install -y \
+    wget \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy the startup script
+COPY start.sh /start.sh
+RUN chmod +x /start.sh
+
+# Set working directory to /workspace (network volume mount point)
+WORKDIR /workspace
+
+# RunPod calls /start.sh by default
+CMD ["/start.sh"]
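Because the image only adds a few tools on top of the base image, a quick smoke check after building can confirm they ended up on `PATH`. The following is a minimal sketch, assuming it is run in a shell inside the built image; `missing_tools` is a hypothetical helper name, and the tool list is inferred from the Dockerfile above.

```shell
#!/bin/sh
# Print the names of any commands that are not available on PATH.
# Hypothetical helper for illustration; not part of this commit.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space left by the concatenation above
    echo "${missing# }"
}

# Tools the Dockerfile is expected to provide (curl is assumed to come from the base image)
missing_tools supervisord supervisorctl tailscale tailscaled wget curl
```

An empty line of output means every listed tool was found; otherwise the missing names are printed separated by spaces.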
--- a/RUNPOD_TEMPLATE.md
+++ b/RUNPOD_TEMPLATE.md
@@ -0,0 +1,501 @@
+# RunPod Template Setup Guide
+
+This guide explains how to deploy the AI Orchestrator (ComfyUI + vLLM) on RunPod using a custom Docker template and network volume.
+
+## Architecture Overview
+
+The deployment uses a **two-tier strategy**:
+
+1. **Docker Image** (software layer) - Contains system packages, Supervisor, Tailscale
+2. **Network Volume** (data layer) - Contains models, ComfyUI installation, venvs, configuration
+
+This approach allows fast pod deployment (~2-3 minutes) while keeping all large files (models, ~80-200GB) on a persistent network volume.
+
+## Prerequisites
+
+- RunPod account with credits
+- Docker Hub account (for hosting the template image)
+- HuggingFace account with API token (for model downloads)
+- Tailscale account with auth key (optional, for VPN access)
+
+## Step 1: Build and Push Docker Image
+
+### Option A: Automated Build (Recommended)
+
+The repository includes a Gitea workflow that automatically builds and pushes the Docker image when you push to the `main` branch or create a version tag.
+
+1. **Configure Gitea Secrets:**
+   - Go to your Gitea repository → Settings → Secrets
+   - Add `DOCKER_USERNAME` = your Docker Hub username
+   - Add `DOCKER_PASSWORD` = your Docker Hub password or access token
+
+2. **Trigger Build:**
+   ```bash
+   # Push to main branch
+   git push origin main
+
+   # Or create a version tag
+   git tag v1.0.0
+   git push origin v1.0.0
+   ```
+
+3. **Monitor Build:**
+   - Go to Actions tab in Gitea
+   - Wait for build to complete (~5-10 minutes)
+   - Note the Docker image name: `your-username/runpod-ai-orchestrator:latest`
+
+### Option B: Manual Build
+
+If you prefer to build manually:
+
+```bash
+# From the repository root
+cd /path/to/runpod
+
+# Build the image
+docker build -t your-username/runpod-ai-orchestrator:latest .
+
+# Push to Docker Hub
+docker login
+docker push your-username/runpod-ai-orchestrator:latest
+```
+
+## Step 2: Create Network Volume
+
+Network volumes persist your models and data across pod restarts and rebuilds.
+
+1. **Go to RunPod Dashboard → Storage → Network Volumes**
+
+2. **Click "New Network Volume"**
+
+3. **Configure:**
+   - **Name**: `ai-orchestrator-models`
+   - **Size**: `200GB` (adjust based on your needs)
+     - Essential models only: ~80GB
+     - All models: ~137-200GB
+   - **Datacenter**: Choose closest to you (volume tied to datacenter)
+
+4. **Click "Create Volume"**
+
+5. **Note the Volume ID** (e.g., `vol-abc123def456`) for pod deployment
+
+### Storage Requirements
+
+| Configuration | Size | Models Included |
+|--------------|------|-----------------|
+| Essential | ~80GB | FLUX Schnell, 1-2 SDXL checkpoints, MusicGen Medium |
+| Complete | ~137GB | All image/video/audio models from playbook |
+| Full + vLLM | ~200GB | Complete + Qwen 2.5 7B + Llama 3.1 8B |
+
+## Step 3: Create RunPod Template
+
+1. **Go to RunPod Dashboard → Templates**
+
+2. **Click "New Template"**
+
+3. **Configure Template Settings:**
+
+   **Container Configuration:**
+   - **Template Name**: `AI Orchestrator (ComfyUI + vLLM)`
+   - **Template Type**: Docker
+   - **Container Image**: `your-username/runpod-ai-orchestrator:latest`
+   - **Container Disk**: `50GB` (for system and temp files)
+   - **Docker Command**: Leave empty (uses default `/start.sh`)
+
+   **Volume Configuration:**
+   - **Volume Mount Path**: `/workspace`
+   - **Attach to Network Volume**: Select your volume ID from Step 2
+
+   **Port Configuration:**
+   - **Expose HTTP Ports**: `8188, 9000, 9001`
+     - `8188` - ComfyUI web interface
+     - `9000` - Model orchestrator API
+     - `9001` - Supervisor web UI
+   - **Expose TCP Ports**: `22` (SSH access)
+
+   **Environment Variables:**
+   ```
+   HF_TOKEN=your_huggingface_token_here
+   TAILSCALE_AUTHKEY=tskey-auth-your_tailscale_authkey_here
+   SUPERVISOR_BACKEND_HOST=localhost
+   SUPERVISOR_BACKEND_PORT=9001
+   ```
+
+   **Advanced Settings:**
+   - **Start Jupyter**: No
+   - **Start SSH**: Yes (handled by base image)
+
+4. **Click "Save Template"**
+
+## Step 4: First Deployment (Initial Setup)
+
+The first time you deploy, you need to set up the network volume with models and configuration.
+
+### 4.1 Deploy Pod
+
+1. **Go to RunPod Dashboard → Pods**
+2. **Click "Deploy"** (or "GPU Pods")
+3. **Select your custom template**: `AI Orchestrator (ComfyUI + vLLM)`
+4. **Configure GPU:**
+   - **GPU Type**: RTX 4090 (24GB VRAM) or higher
+   - **Network Volume**: Select your volume from Step 2
+   - **On-Demand vs Spot**: Choose based on budget
+5. **Click "Deploy"**
+
+### 4.2 SSH into Pod
+
+```bash
+# Get pod SSH command from RunPod dashboard
+ssh root@<pod-ip> -p <port> -i ~/.ssh/id_ed25519
+
+# Or use RunPod web terminal
+```
+
+### 4.3 Initial Setup on Network Volume
+
+```bash
+# 1. Clone the repository to /workspace/ai
+cd /workspace
+git clone https://github.com/your-username/runpod.git ai
+cd ai
+
+# 2. Create .env file with your credentials
+cp .env.example .env
+nano .env
+
+# Edit and add:
+# HF_TOKEN=your_huggingface_token
+# TAILSCALE_AUTHKEY=tskey-auth-your_key
+# GPU_TAILSCALE_IP=<will be set automatically>
+
+# 3. Download essential models (this takes 30-60 minutes)
+ansible-playbook playbook.yml --tags comfyui-essential
+
+# OR download all models (1-2 hours)
+ansible-playbook playbook.yml --tags comfyui-models-all
+
+# 4. Link models to ComfyUI
+bash scripts/link-comfyui-models.sh
+
+# OR if arty is available
+arty run models/link-comfyui
+
+# 5. Install dependencies for ComfyUI custom nodes
+cd /workspace/ComfyUI/custom_nodes/ComfyUI-Manager
+pip install -r requirements.txt
+cd /workspace/ai
+
+# 6. Restart the container to apply all changes
+exit
+# Go to RunPod dashboard → Stop pod → Start pod
+```
+
+### 4.4 Verify Services
+
+After restart, SSH back in and check:
+
+```bash
+# Check supervisor status
+supervisorctl -c /workspace/supervisord.conf status
+
+# Expected output:
+# comfyui                          RUNNING   pid 123, uptime 0:01:00
+# (orchestrator is disabled by default - enable for vLLM)
+
+# Test ComfyUI
+curl -I http://localhost:8188
+
+# Test Supervisor web UI
+curl -I http://localhost:9001
+```
+
+## Step 5: Subsequent Deployments
+
+After initial setup, deploying new pods is quick (2-3 minutes):
+
+1. **Deploy pod** with same template + network volume
+2. **Wait for startup** (~1-2 minutes for services to start)
+3. **Access services:**
+   - ComfyUI: `http://<pod-ip>:8188`
+   - Supervisor: `http://<pod-ip>:9001`
+
+**All models, configuration, and data persist on the network volume!**
+
+## Step 6: Access Services
+
+### Via Direct IP (HTTP)
+
+Get pod IP and ports from RunPod dashboard:
+
+```
+ComfyUI:           http://<pod-ip>:8188
+Supervisor UI:     http://<pod-ip>:9001
+Orchestrator API:  http://<pod-ip>:9000
+SSH:               ssh root@<pod-ip> -p <port>
+```
+
+### Via Tailscale VPN (Recommended)
+
+If you configured `TAILSCALE_AUTHKEY`, the pod automatically joins your Tailscale network:
+
+1. **Get Tailscale IP:**
+   ```bash
+   ssh root@<pod-ip> -p <port>
+   tailscale ip -4
+   # Example output: 100.114.60.40
+   ```
+
+2. **Access via Tailscale:**
+   ```
+   ComfyUI:      http://<tailscale-ip>:8188
+   Supervisor:   http://<tailscale-ip>:9001
+   Orchestrator: http://<tailscale-ip>:9000
+   SSH:          ssh root@<tailscale-ip>
+   ```
+
+3. **Update LiteLLM config** on your VPS with the Tailscale IP
+
+## Service Management
+
+### Start/Stop Services
+
+```bash
+# Start all services
+supervisorctl -c /workspace/supervisord.conf start all
+
+# Stop all services
+supervisorctl -c /workspace/supervisord.conf stop all
+
+# Restart specific service
+supervisorctl -c /workspace/supervisord.conf restart comfyui
+
+# View status
+supervisorctl -c /workspace/supervisord.conf status
+```
+
+### Enable vLLM Models (Text Generation)
+
+By default, only ComfyUI runs (to save VRAM). To enable vLLM:
+
+1. **Stop ComfyUI** (frees up VRAM):
+   ```bash
+   supervisorctl -c /workspace/supervisord.conf stop comfyui
+   ```
+
+2. **Start orchestrator** (manages vLLM models):
+   ```bash
+   supervisorctl -c /workspace/supervisord.conf start orchestrator
+   ```
+
+3. **Test text generation:**
+   ```bash
+   curl -X POST http://localhost:9000/v1/chat/completions \
+     -H 'Content-Type: application/json' \
+     -d '{"model":"qwen-2.5-7b","messages":[{"role":"user","content":"Hello"}]}'
+   ```
+
+### Switch Back to ComfyUI
+
+```bash
+# Stop orchestrator (stops all vLLM models)
+supervisorctl -c /workspace/supervisord.conf stop orchestrator
+
+# Start ComfyUI
+supervisorctl -c /workspace/supervisord.conf start comfyui
+```
+
+## Updating the Template
+
+When you make changes to code or configuration:
+
+### Update Docker Image
+
+```bash
+# 1. Make changes to Dockerfile or start.sh
+# 2. Push to repository
+git add .
+git commit -m "Update template configuration"
+git push origin main
+
+# 3. Gitea workflow auto-builds new image
+
+# 4. Terminate old pod and deploy new one with updated image
+```
+
+### Update Network Volume Data
+
+```bash
+# SSH into running pod
+ssh root@<pod-ip> -p <port>
+
+# Update repository
+cd /workspace/ai
+git pull
+
+# Re-run Ansible if needed
+ansible-playbook playbook.yml --tags <specific-tag>
+
+# Restart services
+supervisorctl -c /workspace/supervisord.conf restart all
+```
+
+## Troubleshooting
+
+### Pod fails to start
+
+**Check logs:**
+```bash
+# Via SSH
+cat /workspace/logs/supervisord.log
+cat /workspace/logs/comfyui.err.log
+
+# Via RunPod web terminal
+tail -f /workspace/logs/*.log
+```
+
+**Common issues:**
+- Missing `.env` file → Create `/workspace/ai/.env` with required vars
+- Supervisor config not found → Ensure `/workspace/ai/supervisord.conf` exists
+- Port conflicts → Check if services are already running
+
+### Tailscale not connecting
+
+**Check Tailscale status:**
+```bash
+tailscale status
+tailscale ip -4
+```
+
+**Common issues:**
+- Missing or invalid `TAILSCALE_AUTHKEY` in `.env`
+- Auth key expired → Generate new key in Tailscale admin
+- Firewall blocking → RunPod should allow Tailscale by default
+
+### Services not starting
+
+**Check Supervisor:**
+```bash
+supervisorctl -c /workspace/supervisord.conf status
+supervisorctl -c /workspace/supervisord.conf tail -f comfyui
+```
+
+**Common issues:**
+- venv broken → Re-run `scripts/bootstrap-venvs.sh`
+- Models not downloaded → Run Ansible playbook again
+- Python version mismatch → Rebuild venvs
+
+### Out of VRAM
+
+**Check GPU memory:**
+```bash
+nvidia-smi
+```
+
+**RTX 4090 (24GB) capacity:**
+- ComfyUI (FLUX Schnell): ~23GB (can't run with vLLM)
+- vLLM (Qwen 2.5 7B): ~14GB
+- vLLM (Llama 3.1 8B): ~17GB
+
+**Solution:** Only run one service at a time (see Service Management section)
+
+### Network volume full
+
+**Check disk usage:**
+```bash
+df -h /workspace
+du -sh /workspace/*
+```
+
+**Clean up:**
+```bash
+# Remove old HuggingFace cache
+rm -rf /workspace/huggingface_cache
+
+# Re-download essential models only
+cd /workspace/ai
+ansible-playbook playbook.yml --tags comfyui-essential
+```
+
+## Cost Optimization
+
+### Spot vs On-Demand
+
+- **Spot instances**: ~70% cheaper, can be interrupted
+- **On-Demand**: More expensive, guaranteed availability
+
+**Recommendation:** Use spot for development, on-demand for production
+
+### Network Volume Pricing
+
+- First 1TB: $0.07/GB/month
+- Beyond 1TB: $0.05/GB/month
+
+**200GB volume cost:** ~$14/month
+
+### Pod Auto-Stop
+
+Configure auto-stop in RunPod pod settings to save costs when idle:
+- Stop after 15 minutes idle
+- Stop after 1 hour idle
+- Manual stop only
+
+## Advanced Configuration
+
+### Custom Environment Variables
+
+Add to template or pod environment variables:
+
+```bash
+# Model cache locations
+HF_HOME=/workspace/huggingface_cache
+TRANSFORMERS_CACHE=/workspace/huggingface_cache
+
+# ComfyUI settings
+COMFYUI_PORT=8188
+COMFYUI_LISTEN=0.0.0.0
+
+# Orchestrator settings
+ORCHESTRATOR_PORT=9000
+
+# GPU settings
+CUDA_VISIBLE_DEVICES=0
+```
+
+### Multiple Network Volumes
+
+You can attach multiple network volumes for organization:
+
+1. **Models volume** - `/workspace/models` (read-only, shared)
+2. **Data volume** - `/workspace/data` (read-write, per-project)
+
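+### Custom Startup Script
+
+`/start.sh` does not look for a custom script on its own. To add your own startup steps, create `/workspace/custom-start.sh` and set it as the template's **Docker Command**:
+
+```bash
+#!/bin/bash
+# Custom startup commands
+
+# Add your custom commands here (/start.sh ends in `sleep infinity` and never returns)
+echo "Running custom initialization..."
+
+# Hand off to the default startup last
+exec /start.sh
+```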
+
+## References
+
+- [RunPod Documentation](https://docs.runpod.io/)
+- [RunPod Templates Overview](https://docs.runpod.io/pods/templates/overview)
+- [Network Volumes Guide](https://docs.runpod.io/storage/network-volumes)
+- [ComfyUI Documentation](https://github.com/comfyanonymous/ComfyUI)
+- [Supervisor Documentation](http://supervisord.org/)
+- [Tailscale Documentation](https://tailscale.com/kb/)
+
+## Support
+
+For issues or questions:
+- Check troubleshooting section above
+- Review `/workspace/logs/` files
+- Check RunPod community forums
+- Open issue in project repository
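The "Network Volume Pricing" figures in RUNPOD_TEMPLATE.md can be turned into a small calculator for sizing decisions. This is a sketch under two assumptions that the guide does not state: 1 TB is treated as 1000 GB, and the two prices are applied as marginal rates. `volume_cost_usd` is a hypothetical name, and the prices are the ones quoted in the guide, not verified against current RunPod pricing.

```shell
#!/bin/sh
# Estimate the monthly cost in USD of a network volume of the given size in GB.
# Works in integer cents to avoid floating point in the shell.
volume_cost_usd() {
    local gb="$1" cents
    if [ "$gb" -le 1000 ]; then
        cents=$((gb * 7))                       # $0.07 per GB up to 1 TB
    else
        cents=$((1000 * 7 + (gb - 1000) * 5))   # $0.05 per GB beyond 1 TB
    fi
    printf '%d.%02d\n' $((cents / 100)) $((cents % 100))
}

volume_cost_usd 200
```

For the 200GB volume recommended in Step 2 this prints `14.00`, in line with the guide's ~$14/month figure; the ~80GB essential configuration comes to `5.60`.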
--- a/scripts/bootstrap-venvs.sh
+++ b/scripts/bootstrap-venvs.sh
@@ -0,0 +1,108 @@
+#!/bin/bash
+# Virtual Environment Health Check and Bootstrap Script
+# Checks if Python venvs are compatible with current Python version
+# Rebuilds venvs if needed
+
+set -e
+
+echo "=== Python Virtual Environment Health Check ==="
+
+# Get current system Python version
+SYSTEM_PYTHON=$(python3 --version | awk '{print $2}')
+SYSTEM_PYTHON_MAJOR_MINOR=$(echo "$SYSTEM_PYTHON" | cut -d'.' -f1,2)
+
+echo "System Python: $SYSTEM_PYTHON ($SYSTEM_PYTHON_MAJOR_MINOR)"
+
+# List of venvs to check
+VENVS=(
+    "/workspace/ai/vllm/venv"
+    "/workspace/ai/webdav-sync/venv"
+    "/workspace/ComfyUI/venv"
+)
+
+REBUILD_NEEDED=0
+
+# Check each venv
+for VENV_PATH in "${VENVS[@]}"; do
+    if [ ! -d "$VENV_PATH" ]; then
+        echo "⚠ venv not found: $VENV_PATH (will be created on first service start)"
+        continue
+    fi
+
+    VENV_NAME=$(basename "$(dirname "$VENV_PATH")")
+    echo ""
+    echo "Checking venv: $VENV_NAME ($VENV_PATH)"
+
+    # Check if venv Python executable works
+    if ! "$VENV_PATH/bin/python" --version >/dev/null 2>&1; then
+        echo "  ❌ BROKEN - Python executable not working"
+        REBUILD_NEEDED=1
+        continue
+    fi
+
+    # Get venv Python version
+    VENV_PYTHON=$("$VENV_PATH/bin/python" --version 2>&1 | awk '{print $2}')
+    VENV_PYTHON_MAJOR_MINOR=$(echo "$VENV_PYTHON" | cut -d'.' -f1,2)
+
+    echo "  venv Python: $VENV_PYTHON ($VENV_PYTHON_MAJOR_MINOR)"
+
+    # Compare major.minor versions
+    if [ "$SYSTEM_PYTHON_MAJOR_MINOR" != "$VENV_PYTHON_MAJOR_MINOR" ]; then
+        echo "  ⚠ VERSION MISMATCH - System is $SYSTEM_PYTHON_MAJOR_MINOR, venv is $VENV_PYTHON_MAJOR_MINOR"
+        REBUILD_NEEDED=1
+    else
+        # Check if pip works
+        if ! "$VENV_PATH/bin/pip" --version >/dev/null 2>&1; then
+            echo "  ❌ BROKEN - pip not working"
+            REBUILD_NEEDED=1
+        else
+            echo "  ✓ HEALTHY"
+        fi
+    fi
+done
+
+# If any venv needs rebuild, warn the user
+if [ $REBUILD_NEEDED -eq 1 ]; then
+    echo ""
+    echo "========================================"
+    echo " ⚠ WARNING: Some venvs need rebuilding"
+    echo "========================================"
+    echo ""
+    echo "One or more Python virtual environments are incompatible with the current"
+    echo "Python version or are broken. This can happen when:"
+    echo "  - Docker image Python version changed"
+    echo "  - venv files were corrupted"
+    echo "  - Binary dependencies are incompatible"
+    echo ""
+    echo "RECOMMENDED ACTIONS:"
+    echo ""
+    echo "1. vLLM venv rebuild:"
+    echo "   cd /workspace/ai/vllm"
+    echo "   rm -rf venv"
+    echo "   python3 -m venv venv"
+    echo "   source venv/bin/activate"
+    echo "   pip install -r requirements.txt"
+    echo ""
+    echo "2. ComfyUI venv rebuild:"
+    echo "   cd /workspace/ComfyUI"
+    echo "   rm -rf venv"
+    echo "   python3 -m venv venv"
+    echo "   source venv/bin/activate"
+    echo "   pip install -r requirements.txt"
+    echo ""
+    echo "3. WebDAV sync venv rebuild (if used):"
+    echo "   cd /workspace/ai/webdav-sync"
+    echo "   rm -rf venv"
+    echo "   python3 -m venv venv"
+    echo "   source venv/bin/activate"
+    echo "   pip install -r requirements.txt"
+    echo ""
+    echo "Services may fail to start until venvs are rebuilt!"
+    echo "========================================"
+    echo ""
+else
+    echo ""
+    echo "✓ All virtual environments are healthy"
+fi
+
+exit 0
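The health check above compares versions by executing each venv's interpreter. A complementary approach, sketched here as an assumption rather than something this commit does, is to read the `version` key that `python3 -m venv` records in `pyvenv.cfg`, which still works when the interpreter itself is broken. The helper names are hypothetical.

```shell
#!/bin/sh
# Read the Python version a venv was created with from its pyvenv.cfg,
# without executing the venv's interpreter.
venv_cfg_version() {
    local cfg="$1/pyvenv.cfg"
    [ -f "$cfg" ] || return 1
    awk -F' *= *' '$1 == "version" { print $2; exit }' "$cfg"
}

# Reduce "3.11.9" to "3.11", as bootstrap-venvs.sh does with cut.
major_minor() {
    echo "$1" | cut -d'.' -f1,2
}

# Succeed when the venv's recorded major.minor equals that of the given version.
venv_matches() {
    local venv_version
    venv_version=$(venv_cfg_version "$1") || return 1
    [ "$(major_minor "$venv_version")" = "$(major_minor "$2")" ]
}
```

A call such as `venv_matches /workspace/ComfyUI/venv "$SYSTEM_PYTHON"` could then decide whether a rebuild warning is needed before any venv binary is run.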
--- a/start.sh
+++ b/start.sh
@@ -0,0 +1,141 @@
+#!/bin/bash
+# RunPod container startup script
+# This script initializes the container environment and starts all services
+
+set -e
+
+echo "========================================"
+echo " RunPod AI Orchestrator - Starting"
+echo "========================================"
+
+# [1/7] Start SSH server (required by RunPod)
+echo "[1/7] Starting SSH server..."
+service ssh start
+echo "  ✓ SSH server started"
+
+# [2/7] Add /workspace/bin to PATH for arty and custom scripts
+echo "[2/7] Configuring PATH..."
+export PATH="/workspace/bin:$PATH"
+echo "  ✓ PATH updated: /workspace/bin added"
+
+# [3/7] Source environment variables from network volume
+echo "[3/7] Loading environment from network volume..."
+if [ -f /workspace/ai/.env ]; then
+    set -a
+    source /workspace/ai/.env
+    set +a
+    echo "  ✓ Environment loaded from /workspace/ai/.env"
+else
+    echo "  ⚠ No .env file found at /workspace/ai/.env"
+    echo "  Some services may not function correctly without environment variables"
+fi
+
+# [4/7] Configure and start Tailscale VPN
+echo "[4/7] Configuring Tailscale VPN..."
+if [ -n "${TAILSCALE_AUTHKEY:-}" ]; then
+    echo "  Starting Tailscale daemon..."
+    tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
+    sleep 3
+
+    echo "  Connecting to Tailscale network..."
+    HOSTNAME="runpod-$(hostname)"
+    tailscale up --authkey="$TAILSCALE_AUTHKEY" --advertise-tags=tag:gpu --hostname="$HOSTNAME" || {
+        echo "  ⚠ Tailscale connection failed, continuing without VPN"
+    }
+
+    # Get Tailscale IP if connected
+    TAILSCALE_IP=$(tailscale ip -4 2>/dev/null || echo "not connected")
+    if [ "$TAILSCALE_IP" != "not connected" ]; then
+        echo "  ✓ Tailscale connected"
+        echo "    Hostname: $HOSTNAME"
+        echo "    IP: $TAILSCALE_IP"
+
+        # Export for other services
+        export GPU_TAILSCALE_IP="$TAILSCALE_IP"
+    else
+        echo "  ⚠ Tailscale failed to obtain IP"
+    fi
+else
+    echo "  ⚠ Tailscale disabled (no TAILSCALE_AUTHKEY in .env)"
+    echo "  Services requiring VPN connectivity will not work"
+fi
+
+# [5/7] Check Python virtual environments health
+echo "[5/7] Checking Python virtual environments..."
+PYTHON_VERSION=$(python3 --version)
+echo "  System Python: $PYTHON_VERSION"
+
+# Check if bootstrap script exists and run it
+if [ -f /workspace/ai/scripts/bootstrap-venvs.sh ]; then
+    echo "  Running venv health check..."
+    bash /workspace/ai/scripts/bootstrap-venvs.sh
+else
+    echo "  ⚠ No venv bootstrap script found (optional)"
+fi
+
+# [6/7] Configure Supervisor
+echo "[6/7] Configuring Supervisor process manager..."
+if [ -f /workspace/ai/supervisord.conf ]; then
+    # Supervisor expects config at /workspace/supervisord.conf (based on arty scripts)
+    if [ ! -f /workspace/supervisord.conf ]; then
+        cp /workspace/ai/supervisord.conf /workspace/supervisord.conf
+        echo "  ✓ Supervisor config copied to /workspace/supervisord.conf"
+    fi
+
+    # Create logs directory if it doesn't exist
+    mkdir -p /workspace/logs
+
+    echo "  ✓ Supervisor configured"
+else
+    echo "  ⚠ No supervisord.conf found at /workspace/ai/supervisord.conf"
+    echo "  Supervisor will not be started"
+fi
+
+# [7/7] Start Supervisor to manage services
+echo "[7/7] Starting Supervisor and managed services..."
+if [ -f /workspace/supervisord.conf ]; then
+    # Start supervisor daemon
+    supervisord -c /workspace/supervisord.conf
+    echo "  ✓ Supervisor daemon started"
+
+    # Wait a moment for services to initialize
+    sleep 3
+
+    # Display service status
+    echo ""
+    echo "Service Status:"
+    echo "---------------"
+    supervisorctl -c /workspace/supervisord.conf status || echo "  ⚠ Could not query service status"
+else
+    echo "  ⚠ Skipping Supervisor startup (no config file)"
+fi
+
+# Display connection information
+echo ""
+echo "========================================"
+echo " Container Ready"
+echo "========================================"
+echo "Services:"
+echo "  - SSH: port 22"
+echo "  - ComfyUI: http://localhost:8188"
+echo "  - Supervisor Web UI: http://localhost:9001"
+echo "  - Model Orchestrator: http://localhost:9000"
+if [ -n "${TAILSCALE_IP:-}" ] && [ "$TAILSCALE_IP" != "not connected" ]; then
+    echo "  - Tailscale IP: $TAILSCALE_IP"
+fi
+echo ""
+echo "Network Volume: /workspace"
+echo "Project Directory: /workspace/ai"
+echo "Logs: /workspace/logs"
+echo ""
+echo "To view service logs:"
+echo "  supervisorctl -c /workspace/supervisord.conf tail -f <service_name>"
+echo ""
+echo "To manage services:"
+echo "  supervisorctl -c /workspace/supervisord.conf status"
+echo "  supervisorctl -c /workspace/supervisord.conf restart <service_name>"
+echo "========================================"
+
+# Keep container running
+echo "Container is running. Press Ctrl+C to stop."
+sleep infinity
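start.sh waits a fixed `sleep 3` after launching `tailscaled` and `supervisord`. One possible alternative, shown here only as a sketch, is a small retry helper that polls a readiness command until it succeeds or the attempts run out. `wait_for` is a hypothetical name, and which command counts as a readiness probe for each daemon is left as an assumption.

```shell
#!/bin/sh
# Run a command repeatedly until it succeeds.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the last one
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for 10 1 test -d /workspace/logs` returns as soon as the directory exists and gives up after roughly ten seconds, so a slow start costs time only when it actually happens.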