From 76cf5b5e3103ffd2d65895012c79262404325ddb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?=
Date: Sun, 23 Nov 2025 16:26:59 +0100
Subject: [PATCH] docs: update CLAUDE.md to reflect direct vLLM architecture

- Remove all orchestrator references
- Update to dedicated vLLM server model
- Update service management commands
- Update LiteLLM integration details
- Update testing examples
---
 CLAUDE.md | 173 +++++++++++++++++++++++++-----------------------------
 1 file changed, 79 insertions(+), 94 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 6e45958..6b18e39 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,49 +4,39 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Overview
 
-This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.
+This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.
 
 **Key Design Philosophy:**
-- **Sequential model loading** - Only one model active at a time to fit within GPU memory constraints
-- **Process-based architecture** - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
-- **Automatic model switching** - Orchestrator detects request types and switches models on-demand
+- **Direct vLLM access** - Dedicated FastAPI servers for each text model (no proxy layer)
+- **Process-based architecture** - Managed by Supervisor for reliability
+- **Simple service management** - Start/stop models independently as needed
 - **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools
 
 ## Architecture
 
 ### Core Components
 
-1. **Orchestrator** (`model-orchestrator/orchestrator_subprocess.py`)
-   - FastAPI proxy server listening on port 9000
-   - Manages model lifecycle via Python subprocesses
-   - Routes requests to appropriate model services
-   - Handles sequential model loading/unloading
+1. **vLLM Servers** (`vllm/server_qwen.py`, `vllm/server_llama.py`)
+   - Dedicated FastAPI servers for each text model
+   - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
+   - Direct AsyncLLMEngine access for optimal streaming performance
+   - max_model_len=20000 optimized for 24GB VRAM
 
-2. **Model Registry** (`model-orchestrator/models.yaml`)
-   - YAML configuration defining available models
-   - Specifies: type, framework, service script, port, VRAM requirements, startup time
-   - Easy to extend with new models
+2. **ComfyUI** (`comfyui/`)
+   - Image/video/audio generation server on port 8188
+   - FLUX, SDXL, CogVideoX, MusicGen models
+   - Advanced workflow system with custom nodes
 
-3. **Model Services** (`models/*/`)
-   - Individual Python servers running specific AI models
-   - vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
-   - ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
+3. **Supervisor** (`supervisord.conf`)
+   - Process manager for all AI services
+   - Auto-restart, centralized logging
+   - Web interface on port 9001
 
 4. **Ansible Provisioning** (`playbook.yml`)
    - Complete infrastructure-as-code setup
    - Installs dependencies, downloads models, configures services
    - Supports selective installation via tags
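+
+For orientation, the sketch below shows the rough shape of such a dedicated server (illustrative only - it is not a copy of `server_qwen.py`; the model id, prompt handling, and sampling defaults here are assumptions):
+
+```python
+# Illustrative sketch of a dedicated vLLM server; see vllm/server_qwen.py for the real implementation
+from fastapi import FastAPI
+from fastapi.responses import StreamingResponse
+from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
+from vllm.utils import random_uuid
+
+app = FastAPI()
+engine = AsyncLLMEngine.from_engine_args(
+    AsyncEngineArgs(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=20000)  # sized for 24GB VRAM
+)
+
+@app.post("/v1/chat/completions")
+async def chat_completions(request: dict):
+    # Simplified: the real servers apply the model's chat template and emit OpenAI-style SSE deltas
+    prompt = "\n".join(m["content"] for m in request["messages"])
+    results = engine.generate(prompt, SamplingParams(max_tokens=512), request_id=random_uuid())
+
+    async def stream():
+        async for output in results:
+            yield output.outputs[0].text  # cumulative text; real servers diff this into deltas
+
+    return StreamingResponse(stream(), media_type="text/plain")
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+```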
 
-### Why Process-Based Instead of Docker?
-
-The subprocess implementation (`orchestrator_subprocess.py`) is preferred over the Docker version (`orchestrator.py`) because:
-- RunPod instances run in containers - Docker-in-Docker adds complexity
-- Faster model startup (direct Python process spawning)
-- Simpler debugging (single process tree)
-- Reduced overhead (no container management layer)
-
-**Note:** Always use `orchestrator_subprocess.py` for RunPod deployments.
-
 ## Common Commands
 
 ### Repository Management with Arty
 
@@ -57,17 +47,16 @@ This project uses Arty for repository and deployment management. See `arty.yml`
 # Clone all repositories (fresh deployment)
 arty sync --env prod # Production: Essential nodes only
 arty sync --env dev # Development: All nodes including optional
-arty sync --env minimal # Minimal: Just orchestrator + ComfyUI base
+arty sync --env minimal # Minimal: Just vLLM + ComfyUI base
 
 # Run deployment scripts
 arty run setup/full # Show setup instructions
 arty run models/link-comfyui # Link downloaded models to ComfyUI
 arty run deps/comfyui-nodes # Install custom node dependencies
-arty run services/start # Start orchestrator
+arty run services/start # Start supervisor
 arty run services/stop # Stop all services
 
 # Health checks
-arty run health/orchestrator # Check orchestrator
 arty run health/comfyui # Check ComfyUI
 arty run check/gpu # nvidia-smi
 arty run check/models # Show cache size
@@ -119,14 +108,16 @@ arty run services/status # Same via arty
 supervisorctl status # Direct supervisor command
 
 # Individual service control
-supervisorctl start orchestrator # Start orchestrator
+supervisorctl start vllm-qwen # Start Qwen vLLM server
+supervisorctl start vllm-llama # Start Llama vLLM server
 supervisorctl restart comfyui # Restart ComfyUI
-supervisorctl stop orchestrator # Stop orchestrator
+supervisorctl stop vllm-qwen # Stop Qwen vLLM server
 arty run services/restart-comfyui # Restart ComfyUI via arty
 
 # View logs
 supervisorctl tail -f comfyui # Follow ComfyUI logs
-supervisorctl tail -f orchestrator # Follow orchestrator logs
+supervisorctl tail -f vllm-qwen # Follow Qwen vLLM logs
+supervisorctl tail -f vllm-llama # Follow Llama vLLM logs
 arty run services/logs # Follow ComfyUI logs via arty
 
 # Web interface
@@ -141,7 +132,9 @@ arty run services/logs # Follow ComfyUI logs via arty
 
 **Services managed:**
 - `comfyui` - ComfyUI server (port 8188, autostart enabled)
-- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
+- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
+- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
+- `webdav-sync` - WebDAV sync service for ComfyUI outputs
 
 ### GPU Memory Management and Mode Switching
 
@@ -165,8 +158,8 @@ Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combin
 # 1. Stop ComfyUI
 supervisorctl stop comfyui
 
-# 2. Start orchestrator (manages vLLM models)
-supervisorctl start orchestrator
+# 2. Start desired vLLM server
+supervisorctl start vllm-qwen # or vllm-llama
 
 # 3. Verify
 supervisorctl status
@@ -175,8 +168,8 @@ nvidia-smi # Check VRAM usage
 **Switch to Image/Video/Audio Generation Mode (ComfyUI):**
 
 ```bash
-# 1. Stop orchestrator (stops all vLLM models)
-supervisorctl stop orchestrator
+# 1. Stop vLLM servers
+supervisorctl stop vllm-qwen vllm-llama
 
 # 2. Start ComfyUI
 supervisorctl start comfyui
@@ -196,23 +189,23 @@ You can also switch modes using the Supervisor web interface:
 
 **Integration with LiteLLM:**
 
-The orchestrator integrates with LiteLLM on the VPS for unified API access:
-- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
-- Requests route through orchestrator (port 9000) which handles model loading
-- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
-- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
+The vLLM servers integrate with LiteLLM on the VPS for unified API access:
+- vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
+- Requests route directly to vLLM servers (ports 8000, 8001)
+- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure connections
+- LiteLLM config uses `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs
 
 ### Testing
 
 ```bash
-# Health check
-curl http://localhost:9000/health
+# Health check (Qwen vLLM server)
+curl http://localhost:8000/health
 
 # List available models
-curl http://localhost:9000/v1/models
+curl http://localhost:8000/v1/models
 
 # Test text generation (streaming)
-curl -s -N -X POST http://localhost:9000/v1/chat/completions \
+curl -s -N -X POST http://localhost:8000/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
     "model": "qwen-2.5-7b",
@@ -221,12 +214,10 @@ curl -s -N -X POST http://localhost:9000/v1/chat/completions \
     "messages": [{"role": "user", "content": "Count from 1 to 5"}],
     "stream": true
   }'
 
-# Test image generation
-curl -X POST http://localhost:9000/v1/images/generations \
+# Test ComfyUI (image generation)
+# Note: ComfyUI's /prompt endpoint expects a complete workflow JSON (export one via "Save (API Format)" in the ComfyUI UI)
+curl -X POST http://localhost:8188/prompt \
   -H 'Content-Type: application/json' \
-  -d '{
-    "model": "flux-schnell",
-    "prompt": "A serene mountain landscape at sunset",
-    "size": "1024x1024"
-  }'
+  -d @workflow_api.json   # example file name - any API-format workflow export
 ```
@@ -255,65 +246,59 @@ curl -X POST http://localhost:9000/v1/images/generations \
 - `systemd` - Configure systemd services (use `never` - not for RunPod)
 - `validate` - Health checks (use `never` - run explicitly)
 
-### Adding New Models
+### Adding New vLLM Models
 
-1. **Add model definition to `model-orchestrator/models.yaml`:**
-
-```yaml
-llama-3.1-8b:
-  type: text
-  framework: vllm
-  service_script: models/vllm/server_llama.py
-  port: 8001
-  vram_gb: 17
-  startup_time_seconds: 120
-  endpoint: /v1/chat/completions
-  description: "Llama 3.1 8B Instruct"
-```
-
-2. **Create service script** (`models/vllm/server_llama.py`):
+1. **Create dedicated server script** (e.g., `vllm/server_mistral.py`):
 
 ```python
-import os
-from vllm.entrypoints.openai.api_server import run_server
+#!/usr/bin/env python3
+from vllm import AsyncLLMEngine, AsyncEngineArgs
+# ... (copy structure from server_qwen.py)
 
-model = "meta-llama/Llama-3.1-8B-Instruct"
-port = int(os.getenv("PORT", 8001))
-run_server(model=model, port=port)
+model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
+port = 8002  # Choose unique port
+```
+
+2. **Add supervisor configuration** (`supervisord.conf`):
+
+```ini
+[program:vllm-mistral]
+command=vllm/venv/bin/python vllm/server_mistral.py
+directory=.
+autostart=false
+autorestart=true
+environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
 ```
 
 3. **Download model** (handled by Ansible playbook or manually via HuggingFace CLI)
 
-4. **Restart orchestrator:**
+4. **Add to LiteLLM config** (on VPS):
+
+```yaml
+- model_name: mistral-7b
+  litellm_params:
+    model: hosted_vllm/openai/mistral-7b
+    api_base: os.environ/GPU_VLLM_MISTRAL_URL
+    api_key: "EMPTY"
+```
+
+5. **Start the service:**
 
 ```bash
-bash scripts/stop-all.sh && bash scripts/start-all.sh
+supervisorctl start vllm-mistral
 ```
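+
+The same check can be driven from Python with any OpenAI-compatible client (a sketch, assuming the `openai` package is available and the new script mirrors `server_qwen.py`'s endpoints; `mistral-7b` and port 8002 are the example values from the steps above):
+
+```python
+# Smoke test for the hypothetical mistral server added above
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY")
+print([m.id for m in client.models.list()])  # should list the new model
+
+stream = client.chat.completions.create(
+    model="mistral-7b",
+    messages=[{"role": "user", "content": "Count from 1 to 5"}],
+    stream=True,
+)
+for chunk in stream:
+    print(chunk.choices[0].delta.content or "", end="", flush=True)
+```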
 
 ## Key Implementation Details
 
-### Model Switching Logic
+### Direct vLLM Server Architecture
 
-The orchestrator automatically switches models based on:
-- **Endpoint path** - `/v1/chat/completions` → text models, `/v1/images/generations` → image models
-- **Model name in request** - Matches against model registry
-- **Sequential loading** - Stops current model before starting new one to conserve VRAM
+Each vLLM model runs as a dedicated FastAPI server:
+- Independent process management via Supervisor
+- Direct AsyncLLMEngine access for optimal performance
+- OpenAI-compatible `/v1/chat/completions` endpoint
+- Streaming support with proper delta formatting
 
-See `orchestrator_subprocess.py:64-100` for process management implementation.
-
-### Model Registry Structure
-
-Each model in `models.yaml` requires:
-- `type` - text, image, or audio
-- `framework` - vllm, openedai-images, audiocraft, comfyui
-- `service_script` - Relative path to Python/shell script
-- `port` - Service port (8000+)
-- `vram_gb` - GPU memory requirement
-- `startup_time_seconds` - Max health check timeout
-- `endpoint` - API endpoint path
-- `description` - Human-readable description
-
-### Environment Variables
+See `vllm/server_qwen.py` for the reference implementation.
+
+### Environment Variables
 
 Set in `.env` file:
 - `HF_TOKEN` - **Required** - HuggingFace API token for model downloads