feat: add Supervisor process manager for service management

- Add supervisord.conf with ComfyUI and orchestrator services
- Update Ansible playbook with supervisor installation tag
- Rewrite start-all.sh and stop-all.sh to use Supervisor
- Add status.sh script for checking service status
- Update arty.yml with supervisor commands and shortcuts
- Update CLAUDE.md with Supervisor documentation and troubleshooting
- Services now auto-restart on crashes with centralized logging

Benefits:
- Better process control than manual pkill/background jobs
- Auto-restart on service crashes
- Centralized log management in /workspace/logs/
- Web interface for monitoring (port 9001)
- Works in RunPod containers, where systemd is not available

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 09:22:16 +01:00
parent 2207d60f98
commit 664da9f4ea
7 changed files with 306 additions and 29 deletions


@@ -102,22 +102,48 @@ ansible-playbook playbook.yml --tags base,python,dependencies
### Service Management
This project uses **Supervisor** for process management, providing auto-restart, centralized logging, and easy service control.
```bash
# Start orchestrator (runs in foreground)
bash scripts/start-all.sh
# Or directly:
python3 model-orchestrator/orchestrator_subprocess.py
# Start all services (with Supervisor)
bash scripts/start-all.sh # Starts supervisor daemon + services
arty run services/start # Same via arty
# Stop all services
bash scripts/stop-all.sh
bash scripts/stop-all.sh # Stops all services + supervisor
arty run services/stop # Same via arty
# Stop orchestrator only
pkill -f orchestrator_subprocess.py
# Check service status
bash scripts/status.sh # Show all service status
arty run services/status # Same via arty
supervisorctl status # Direct supervisor command
# Stop specific model service
pkill -f "models/vllm/server.py"
# Individual service control
supervisorctl start orchestrator # Start orchestrator
supervisorctl restart comfyui # Restart ComfyUI
supervisorctl stop orchestrator # Stop orchestrator
arty run services/restart-comfyui # Restart ComfyUI via arty
# View logs
supervisorctl tail -f comfyui # Follow ComfyUI logs
supervisorctl tail -f orchestrator # Follow orchestrator logs
arty run services/logs # Follow ComfyUI logs via arty
# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
```
**Supervisor Configuration:**
- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`
**Services managed:**
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
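A minimal sketch of how a script might read one service's state from `supervisorctl status` output. It assumes the usual column layout (process name, then state, then a description), and `service_state` is a hypothetical helper that is not part of this repository. Because `supervisord.conf` also defines an `ai-services` group, names may be reported as `ai-services:comfyui`, so the helper strips an optional group prefix before matching.

```bash
#!/bin/bash
# Hypothetical helper: print the state column for one service from
# `supervisorctl status`-style text read on stdin.
service_state() {
    local name="$1"
    # Field 1 is the process name (possibly "group:name"), field 2 the state.
    awk -v name="$name" '
        { n = $1; sub(/^.*:/, "", n) }   # drop an optional "group:" prefix
        n == name { print $2; found = 1; exit }
        END { if (!found) print "UNKNOWN" }
    '
}

# Sample text in the assumed layout; a live call would look like
#   supervisorctl -c /workspace/supervisord.conf status | service_state comfyui
sample='ai-services:comfyui        RUNNING   pid 1234, uptime 0:05:12
ai-services:orchestrator   STOPPED   Not started'

printf '%s\n' "$sample" | service_state comfyui        # prints RUNNING
printf '%s\n' "$sample" | service_state orchestrator   # prints STOPPED
```

Parsing the text keeps the sketch independent of `supervisorctl` exit codes, which can be non-zero whenever a service is not running.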
### Testing
```bash
@@ -167,6 +193,7 @@ curl -X POST http://localhost:9000/v1/images/generations \
**Infrastructure:**
- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (use `never` - not for RunPod)
- `validate` - Health checks (use `never` - run explicitly)
@@ -352,8 +379,10 @@ runpod/
│ ├── start.sh # ComfyUI startup script
│ └── requirements.txt
├── scripts/
│ ├── start-all.sh # Start orchestrator
│ └── stop-all.sh # Stop all services
│ ├── start-all.sh # Start all services with Supervisor
│ ├── stop-all.sh # Stop all services
│ └── status.sh # Check service status
├── supervisord.conf # Supervisor process manager config
├── arty.yml # Arty repository manager config
├── playbook.yml # Ansible provisioning playbook
├── inventory.yml # Ansible inventory (localhost)
@@ -389,10 +418,24 @@ runpod/
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear cache: `rm -rf /workspace/huggingface_cache`
### Supervisor not running
- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Remove a stale PID file: `rm -f /workspace/supervisord.pid`, then restart
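Before deleting the PID file it can help to confirm that it is actually stale. The sketch below classifies a PID file as missing, stale, or running. `pid_file_state` is a hypothetical helper, and it uses `kill -0` rather than `ps -p`, on the assumption that the check runs as the same user that started `supervisord` (otherwise `kill -0` can fail for a live process).

```bash
#!/bin/bash
# Hypothetical helper: classify a PID file as missing, stale, or running.
pid_file_state() {
    local pid_file="$1" pid
    if [ ! -f "$pid_file" ]; then
        echo "missing"
        return
    fi
    pid="$(cat "$pid_file")"
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}

pid_file_state "${WORKSPACE_DIR:-/workspace}/supervisord.pid"
```

Only a `stale` result would justify removing the file; `running` means Supervisor is still alive and should be stopped with `bash scripts/stop-all.sh` instead.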
### Service won't start
- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f orchestrator`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart service: `supervisorctl restart comfyui`
- Check if port is in use: `ss -tulpn | grep :8188`
### Orchestrator not responding
- Check process: `ps aux | grep orchestrator`
- View logs: Check terminal output where orchestrator was started
- Restart: `bash scripts/stop-all.sh && bash scripts/start-all.sh`
- Check supervisor status: `supervisorctl status orchestrator`
- View logs: `supervisorctl tail -f orchestrator` or `cat /workspace/logs/orchestrator.err.log`
- Restart: `supervisorctl restart orchestrator`
- Manual start for debugging: `cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py`
## Performance Notes


@@ -127,10 +127,17 @@ scripts:
ln -sf /workspace/huggingface_cache/models--stabilityai--stable-video-diffusion-img2vid-xt stable-video-diffusion-img2vid-xt
echo "Models linked to ComfyUI"
# Service management
# Service management (Supervisor-based)
services/start: bash /workspace/ai/scripts/start-all.sh
services/stop: bash /workspace/ai/scripts/stop-all.sh
services/restart: bash /workspace/ai/scripts/stop-all.sh && bash /workspace/ai/scripts/start-all.sh
services/status: bash /workspace/ai/scripts/status.sh
services/logs: supervisorctl -c /workspace/supervisord.conf tail -f comfyui
services/logs-orchestrator: supervisorctl -c /workspace/supervisord.conf tail -f orchestrator
services/restart-comfyui: supervisorctl -c /workspace/supervisord.conf restart comfyui
services/restart-orchestrator: supervisorctl -c /workspace/supervisord.conf restart orchestrator
services/start-orchestrator: supervisorctl -c /workspace/supervisord.conf start orchestrator
services/stop-orchestrator: supervisorctl -c /workspace/supervisord.conf stop orchestrator
# Dependency installation
deps/comfyui-nodes: |
@@ -141,6 +148,7 @@ scripts:
# Ansible provisioning shortcuts
ansible/base: cd /workspace/ai && ansible-playbook playbook.yml --tags base,python,dependencies
ansible/supervisor: cd /workspace/ai && ansible-playbook playbook.yml --tags supervisor
ansible/vllm: cd /workspace/ai && ansible-playbook playbook.yml --tags models
ansible/comfyui: cd /workspace/ai && ansible-playbook playbook.yml --tags comfyui,comfyui-essential
ansible/comfyui-all: cd /workspace/ai && ansible-playbook playbook.yml --tags comfyui,comfyui-models-all,comfyui-nodes


@@ -26,6 +26,7 @@
# comfyui-nodes - Install essential custom nodes
# comfyui-essential - Quick setup (ComfyUI + essential models only)
# tailscale - Install and configure Tailscale
# supervisor - Install and configure Supervisor process manager
# systemd - Configure systemd services
# validate - Health checks and validation
#
@@ -755,6 +756,55 @@
Note: Authentication requires manual intervention via provided URL
#
# Supervisor Process Manager
#
- name: Install and configure Supervisor
tags: [supervisor]
block:
- name: Install Supervisor
pip:
name: supervisor
executable: pip3
become: true
- name: Create logs directory
file:
path: "{{ workspace_dir }}/logs"
state: directory
mode: '0755'
- name: Deploy supervisord configuration
copy:
src: "{{ ai_dir }}/supervisord.conf"
dest: "{{ workspace_dir }}/supervisord.conf"
mode: '0644'
- name: Display Supervisor setup instructions
debug:
msg: |
✓ Supervisor installed successfully!
Configuration: {{ workspace_dir }}/supervisord.conf
Logs: {{ workspace_dir }}/logs/
Services configured:
- comfyui: ComfyUI server (port 8188) - autostart enabled
- orchestrator: Model orchestrator (port 9000) - autostart disabled
To start Supervisor:
supervisord -c {{ workspace_dir }}/supervisord.conf
To manage services:
supervisorctl status # Check service status
supervisorctl start orchestrator # Start orchestrator
supervisorctl restart comfyui # Restart ComfyUI
supervisorctl stop all # Stop all services
supervisorctl tail -f comfyui # Follow ComfyUI logs
Web interface:
http://localhost:9001 (username: admin, password: runpod2024)
#
# Systemd Services (Optional)
#


@@ -1,15 +1,19 @@
#!/bin/bash
#
# Start AI Orchestrator
# Starts the model orchestrator which manages all AI services
# Start AI Services with Supervisor
# Starts supervisor daemon which manages ComfyUI and orchestrator
#
set -e
cd "$(dirname "$0")/.."
WORKSPACE_DIR="${WORKSPACE_DIR:-/workspace}"
SUPERVISORD_CONF="${WORKSPACE_DIR}/supervisord.conf"
AI_DIR="${WORKSPACE_DIR}/ai"
cd "${AI_DIR}"
echo "========================================="
echo " Starting AI Orchestrator"
echo " Starting AI Services with Supervisor"
echo "========================================="
echo ""
@@ -27,9 +31,47 @@ if [ -f .env ]; then
set +a
fi
# Start orchestrator
echo "Starting orchestrator on port 9000..."
python3 model-orchestrator/orchestrator_subprocess.py
# Check if supervisord is already running
if [ -f "${WORKSPACE_DIR}/supervisord.pid" ]; then
PID=$(cat "${WORKSPACE_DIR}/supervisord.pid")
if ps -p "$PID" > /dev/null 2>&1; then
echo "Supervisor is already running (PID: $PID)"
echo ""
echo "Checking service status..."
# status can exit non-zero when a service is not RUNNING; don't let set -e abort here
supervisorctl -c "${SUPERVISORD_CONF}" status || true
exit 0
else
echo "Removing stale PID file..."
rm -f "${WORKSPACE_DIR}/supervisord.pid"
fi
fi
# Start supervisord
echo "Starting Supervisor daemon..."
supervisord -c "${SUPERVISORD_CONF}"
# Wait a moment for supervisor to start
sleep 2
# Check status
echo ""
echo "Service Status:"
echo "---------------"
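# status can exit non-zero when a service is not RUNNING (orchestrator has autostart=false)
supervisorctl -c "${SUPERVISORD_CONF}" status || true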
echo ""
echo "Orchestrator stopped"
echo "========================================="
echo "Services started successfully!"
echo "========================================="
echo ""
echo "Useful commands:"
echo " supervisorctl status - Check status"
echo " supervisorctl start orchestrator - Start orchestrator"
echo " supervisorctl restart comfyui - Restart ComfyUI"
echo " supervisorctl stop all - Stop all services"
echo " supervisorctl tail -f comfyui - Follow ComfyUI logs"
echo ""
echo "Web interface: http://localhost:9001"
echo " Username: admin"
echo " Password: runpod2024"
echo ""
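The fixed `sleep 2` after launching `supervisord` is a guess at how long startup takes. One possible alternative, sketched below, is to poll for the control socket until it appears or a timeout expires. `wait_for_path` is a hypothetical helper, and the socket path is taken from the `[unix_http_server]` section of `supervisord.conf`.

```bash
#!/bin/bash
# Hypothetical helper: poll until a path exists or a timeout (in seconds) expires.
# Returns 0 once the path exists, 1 on timeout.
wait_for_path() {
    local path="$1" timeout="${2:-10}" waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use in start-all.sh, in place of `sleep 2`:
#   if ! wait_for_path "${WORKSPACE_DIR}/supervisor.sock" 10; then
#       echo "Timed out waiting for the Supervisor socket" >&2
#       exit 1
#   fi
```

The existence of the socket only shows that `supervisord` is listening; the individual services may still be in the `STARTING` state when the status table is printed.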

scripts/status.sh (new file, 47 lines)

@@ -0,0 +1,47 @@
#!/bin/bash
#
# Check AI Services Status
# Shows status of all services managed by Supervisor
#
WORKSPACE_DIR="${WORKSPACE_DIR:-/workspace}"
SUPERVISORD_CONF="${WORKSPACE_DIR}/supervisord.conf"
echo "========================================="
echo " AI Services Status"
echo "========================================="
echo ""
# Check if supervisord is running
if [ ! -f "${WORKSPACE_DIR}/supervisord.pid" ]; then
echo "❌ Supervisor is not running"
echo ""
echo "To start services, run:"
echo " bash scripts/start-all.sh"
exit 1
fi
PID=$(cat "${WORKSPACE_DIR}/supervisord.pid")
if ! ps -p "$PID" > /dev/null 2>&1; then
echo "❌ Supervisor PID file exists but process is not running"
echo ""
echo "To start services, run:"
echo " bash scripts/start-all.sh"
exit 1
fi
echo "✅ Supervisor is running (PID: $PID)"
echo ""
# Show service status
echo "Service Status:"
echo "---------------"
supervisorctl -c "${SUPERVISORD_CONF}" status
echo ""
echo "Useful commands:"
echo " supervisorctl start orchestrator - Start orchestrator"
echo " supervisorctl restart comfyui - Restart ComfyUI"
echo " supervisorctl stop all - Stop all services"
echo " supervisorctl tail -f comfyui - Follow ComfyUI logs"
echo ""


@@ -1,22 +1,49 @@
#!/bin/bash
#
# Stop AI Services
# Gracefully stops all running AI services
# Gracefully stops all services managed by Supervisor
#
set -e
WORKSPACE_DIR="${WORKSPACE_DIR:-/workspace}"
SUPERVISORD_CONF="${WORKSPACE_DIR}/supervisord.conf"
echo "========================================="
echo " Stopping AI Services"
echo "========================================="
echo ""
# Kill orchestrator and model processes
echo "Stopping orchestrator..."
pkill -f "orchestrator_subprocess.py" || echo "Orchestrator not running"
# Check if supervisord is running
if [ ! -f "${WORKSPACE_DIR}/supervisord.pid" ]; then
echo "Supervisor is not running (no PID file found)"
echo "Cleaning up any stray processes..."
pkill -f "orchestrator_subprocess.py" || echo " - Orchestrator not running"
pkill -f "ComfyUI.*main.py" || echo " - ComfyUI not running"
echo ""
echo "All services stopped"
exit 0
fi
echo "Stopping model services..."
pkill -f "models/vllm/server.py" || echo "vLLM not running"
PID=$(cat "${WORKSPACE_DIR}/supervisord.pid")
if ! ps -p "$PID" > /dev/null 2>&1; then
echo "Supervisor PID file exists but process is not running"
echo "Removing stale PID file..."
rm -f "${WORKSPACE_DIR}/supervisord.pid"
echo ""
echo "All services stopped"
exit 0
fi
# Stop all supervised services
echo "Stopping all supervised services..."
supervisorctl -c "${SUPERVISORD_CONF}" stop all
sleep 2
# Shutdown supervisord
echo "Shutting down Supervisor daemon..."
supervisorctl -c "${SUPERVISORD_CONF}" shutdown
echo ""
echo "All services stopped"

supervisord.conf (new file, 60 lines)

@@ -0,0 +1,60 @@
[supervisord]
logfile=/workspace/logs/supervisord.log
pidfile=/workspace/supervisord.pid
childlogdir=/workspace/logs
nodaemon=false
loglevel=info
[unix_http_server]
file=/workspace/supervisor.sock
chmod=0700
[supervisorctl]
serverurl=unix:///workspace/supervisor.sock
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
# Web interface for monitoring (localhost only)
[inet_http_server]
port=127.0.0.1:9001
username=admin
password=runpod2024
# ComfyUI Server
[program:comfyui]
command=bash /workspace/ai/models/comfyui/start.sh
directory=/workspace/ComfyUI
autostart=true
autorestart=true
startretries=3
stderr_logfile=/workspace/logs/comfyui.err.log
stdout_logfile=/workspace/logs/comfyui.out.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10
environment=HF_HOME="/workspace/huggingface_cache",PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
priority=100
stopwaitsecs=30
# AI Model Orchestrator
[program:orchestrator]
command=python3 model-orchestrator/orchestrator_subprocess.py
directory=/workspace/ai
autostart=false
autorestart=true
startretries=3
stderr_logfile=/workspace/logs/orchestrator.err.log
stdout_logfile=/workspace/logs/orchestrator.out.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10
environment=HF_HOME="/workspace/huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
priority=200
stopwaitsecs=30
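# Note: programs listed in a group are addressed by supervisorctl as
# ai-services:<name> (e.g. ai-services:comfyui), not by the bare program name
[group:ai-services]
programs=comfyui,orchestrator
priority=999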
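The `orchestrator` program's `environment` line references `%(ENV_HF_TOKEN)s`, which Supervisor fills in from the environment that `supervisord` itself was started with. An unset variable is typically reported as an expansion error when the config is loaded, so a preflight check before launching `supervisord` may be useful. The sketch below is one way to do that; `missing_env` is a hypothetical helper, and it relies on `printenv` so that only exported variables count as set.

```bash
#!/bin/bash
# Hypothetical preflight helper: print the names of variables that are not
# present in the exported environment (the one supervisord would inherit).
missing_env() {
    local name missing=""
    for name in "$@"; do
        # printenv NAME exits non-zero when NAME is not exported.
        if ! printenv "$name" > /dev/null; then
            missing="${missing:+$missing }$name"
        fi
    done
    echo "$missing"
}

missing="$(missing_env HF_TOKEN)"
if [ -n "$missing" ]; then
    echo "Unset variables referenced by supervisord.conf: $missing" >&2
fi
```

In `start-all.sh` such a check would belong after the `.env` file is sourced and before `supervisord -c` is called.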