From 76cf5b5e3103ffd2d65895012c79262404325ddb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebastian=20Kr=C3=BCger?=
Date: Sun, 23 Nov 2025 16:26:59 +0100
Subject: [PATCH] docs: update CLAUDE.md to reflect direct vLLM architecture

- Remove all orchestrator references
- Update to dedicated vLLM server model
- Update service management commands
- Update LiteLLM integration details
- Update testing examples
---
 CLAUDE.md | 173 +++++++++++++++++++++++++-----------------------------
 1 file changed, 79 insertions(+), 94 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 6e45958..6b18e39 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,49 +4,39 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Overview
 
-This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.
+This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.
 
 **Key Design Philosophy:**
-- **Sequential model loading** - Only one model active at a time to fit within GPU memory constraints
-- **Process-based architecture** - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
-- **Automatic model switching** - Orchestrator detects request types and switches models on-demand
+- **Direct vLLM access** - Dedicated FastAPI servers for each text model (no proxy layer)
+- **Process-based architecture** - Managed by Supervisor for reliability
+- **Simple service management** - Start/stop models independently as needed
 - **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools
 
 ## Architecture
 
 ### Core Components
 
-1. **Orchestrator** (`model-orchestrator/orchestrator_subprocess.py`)
-   - FastAPI proxy server listening on port 9000
-   - Manages model lifecycle via Python subprocesses
-   - Routes requests to appropriate model services
-   - Handles sequential model loading/unloading
+1. **vLLM Servers** (`vllm/server_qwen.py`, `vllm/server_llama.py`)
+   - Dedicated FastAPI servers for each text model
+   - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
+   - Direct AsyncLLMEngine access for optimal streaming performance
+   - max_model_len=20000 optimized for 24GB VRAM
 
-2. **Model Registry** (`model-orchestrator/models.yaml`)
-   - YAML configuration defining available models
-   - Specifies: type, framework, service script, port, VRAM requirements, startup time
-   - Easy to extend with new models
+2. **ComfyUI** (`comfyui/`)
+   - Image/video/audio generation server on port 8188
+   - FLUX, SDXL, CogVideoX, MusicGen models
+   - Advanced workflow system with custom nodes
 
-3. **Model Services** (`models/*/`)
-   - Individual Python servers running specific AI models
-   - vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
-   - ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
+3. **Supervisor** (`supervisord.conf`)
+   - Process manager for all AI services
+   - Auto-restart, centralized logging
+   - Web interface on port 9001
 
 4. **Ansible Provisioning** (`playbook.yml`)
    - Complete infrastructure-as-code setup
    - Installs dependencies, downloads models, configures services
    - Supports selective installation via tags
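+
+For orientation, the sketch below shows the rough shape of such a dedicated server (illustrative only - it is not a copy of `server_qwen.py`; the model id, prompt handling, and sampling defaults here are assumptions):
+
+```python
+# Illustrative sketch of a dedicated vLLM server; see vllm/server_qwen.py for the real implementation
+from fastapi import FastAPI
+from fastapi.responses import StreamingResponse
+from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
+from vllm.utils import random_uuid
+
+app = FastAPI()
+engine = AsyncLLMEngine.from_engine_args(
+    AsyncEngineArgs(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=20000)  # sized for 24GB VRAM
+)
+
+@app.post("/v1/chat/completions")
+async def chat_completions(request: dict):
+    # Simplified: the real servers apply the model's chat template and emit OpenAI-style SSE deltas
+    prompt = "\n".join(m["content"] for m in request["messages"])
+    results = engine.generate(prompt, SamplingParams(max_tokens=512), request_id=random_uuid())
+
+    async def stream():
+        async for output in results:
+            yield output.outputs[0].text  # cumulative text; real servers diff this into deltas
+
+    return StreamingResponse(stream(), media_type="text/plain")
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+```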
 
-### Why Process-Based Instead of Docker?
-
-The subprocess implementation (`orchestrator_subprocess.py`) is preferred over the Docker version (`orchestrator.py`) because:
-- RunPod instances run in containers - Docker-in-Docker adds complexity
-- Faster model startup (direct Python process spawning)
-- Simpler debugging (single process tree)
-- Reduced overhead (no container management layer)
-
-**Note:** Always use `orchestrator_subprocess.py` for RunPod deployments.
-
 ## Common Commands
 
 ### Repository Management with Arty
 
@@ -57,17 +47,16 @@ This project uses Arty for repository and deployment management. See `arty.yml`
 # Clone all repositories (fresh deployment)
 arty sync --env prod # Production: Essential nodes only
 arty sync --env dev # Development: All nodes including optional
-arty sync --env minimal # Minimal: Just orchestrator + ComfyUI base
+arty sync --env minimal # Minimal: Just vLLM + ComfyUI base
 
 # Run deployment scripts
 arty run setup/full # Show setup instructions
 arty run models/link-comfyui # Link downloaded models to ComfyUI
 arty run deps/comfyui-nodes # Install custom node dependencies
-arty run services/start # Start orchestrator
+arty run services/start # Start supervisor
 arty run services/stop # Stop all services
 
 # Health checks
-arty run health/orchestrator # Check orchestrator
 arty run health/comfyui # Check ComfyUI
 arty run check/gpu # nvidia-smi
 arty run check/models # Show cache size
@@ -119,14 +108,16 @@ arty run services/status # Same via arty
 supervisorctl status # Direct supervisor command
 
 # Individual service control
-supervisorctl start orchestrator # Start orchestrator
+supervisorctl start vllm-qwen # Start Qwen vLLM server
+supervisorctl start vllm-llama # Start Llama vLLM server
 supervisorctl restart comfyui # Restart ComfyUI
-supervisorctl stop orchestrator # Stop orchestrator
+supervisorctl stop vllm-qwen # Stop Qwen vLLM server
 arty run services/restart-comfyui # Restart ComfyUI via arty
 
 # View logs
 supervisorctl tail -f comfyui # Follow ComfyUI logs
-supervisorctl tail -f orchestrator # Follow orchestrator logs
+supervisorctl tail -f vllm-qwen # Follow Qwen vLLM logs
+supervisorctl tail -f vllm-llama # Follow Llama vLLM logs
 arty run services/logs # Follow ComfyUI logs via arty
 
 # Web interface
@@ -141,7 +132,9 @@ arty run services/logs # Follow ComfyUI logs via arty
 
 **Services managed:**
 - `comfyui` - ComfyUI server (port 8188, autostart enabled)
-- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
+- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
+- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
+- `webdav-sync` - WebDAV sync service for ComfyUI outputs
 
 ### GPU Memory Management and Mode Switching
 
@@ -165,8 +158,8 @@ Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combin
 # 1. Stop ComfyUI
 supervisorctl stop comfyui
 
-# 2. Start orchestrator (manages vLLM models)
-supervisorctl start orchestrator
+# 2. Start desired vLLM server
+supervisorctl start vllm-qwen # or vllm-llama
 
 # 3. Verify
 supervisorctl status
@@ -175,8 +168,8 @@ nvidia-smi # Check VRAM usage
 **Switch to Image/Video/Audio Generation Mode (ComfyUI):**
 
 ```bash
-# 1. Stop orchestrator (stops all vLLM models)
-supervisorctl stop orchestrator
+# 1. Stop vLLM servers
+supervisorctl stop vllm-qwen vllm-llama
 
 # 2. Start ComfyUI
 supervisorctl start comfyui
@@ -196,23 +189,23 @@ You can also switch modes using the Supervisor web interface:
 
 **Integration with LiteLLM:**
 
-The orchestrator integrates with LiteLLM on the VPS for unified API access:
-- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
-- Requests route through orchestrator (port 9000) which handles model loading
-- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
-- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
+The vLLM servers integrate with LiteLLM on the VPS for unified API access:
+- vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
+- Requests route directly to vLLM servers (ports 8000, 8001)
+- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure connections
+- LiteLLM config uses `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs
 
 ### Testing
 
 ```bash
-# Health check
-curl http://localhost:9000/health
+# Health check (Qwen vLLM server)
+curl http://localhost:8000/health
 
 # List available models
-curl http://localhost:9000/v1/models
+curl http://localhost:8000/v1/models
 
 # Test text generation (streaming)
-curl -s -N -X POST http://localhost:9000/v1/chat/completions \
+curl -s -N -X POST http://localhost:8000/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
     "model": "qwen-2.5-7b",
@@ -221,12 +214,10 @@ curl -s -N -X POST http://localhost:9000/v1/chat/completions \
     "messages": [{"role": "user", "content": "Count from 1 to 5"}],
     "stream": true
   }'
 
-# Test image generation
-curl -X POST http://localhost:9000/v1/images/generations \
+# Test ComfyUI (image generation)
+# Note: ComfyUI's /prompt endpoint expects a complete workflow JSON (export one via "Save (API Format)" in the ComfyUI UI)
+curl -X POST http://localhost:8188/prompt \
   -H 'Content-Type: application/json' \
-  -d '{
-    "model": "flux-schnell",
-    "prompt": "A serene mountain landscape at sunset",
-    "size": "1024x1024"
-  }'
+  -d @workflow_api.json   # example file name - any API-format workflow export
 ```
@@ -255,65 +246,59 @@ curl -X POST http://localhost:9000/v1/images/generations \
 - `systemd` - Configure systemd services (use `never` - not for RunPod)
 - `validate` - Health checks (use `never` - run explicitly)
 
-### Adding New Models
+### Adding New vLLM Models
 
-1. **Add model definition to `model-orchestrator/models.yaml`:**
-
-```yaml
-llama-3.1-8b:
-  type: text
-  framework: vllm
-  service_script: models/vllm/server_llama.py
-  port: 8001
-  vram_gb: 17
-  startup_time_seconds: 120
-  endpoint: /v1/chat/completions
-  description: "Llama 3.1 8B Instruct"
-```
-
-2. **Create service script** (`models/vllm/server_llama.py`):
+1. **Create dedicated server script** (e.g., `vllm/server_mistral.py`):
 
 ```python
-import os
-from vllm.entrypoints.openai.api_server import run_server
+#!/usr/bin/env python3
+from vllm import AsyncLLMEngine, AsyncEngineArgs
+# ... (copy structure from server_qwen.py)
 
-model = "meta-llama/Llama-3.1-8B-Instruct"
-port = int(os.getenv("PORT", 8001))
-run_server(model=model, port=port)
+model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
+port = 8002  # Choose unique port
+```
+
+2. **Add supervisor configuration** (`supervisord.conf`):
+
+```ini
+[program:vllm-mistral]
+command=vllm/venv/bin/python vllm/server_mistral.py
+directory=.
+autostart=false
+autorestart=true
+environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
 ```
 
 3. **Download model** (handled by Ansible playbook or manually via HuggingFace CLI)
 
-4. **Restart orchestrator:**
+4. **Add to LiteLLM config** (on VPS):
+
+```yaml
+- model_name: mistral-7b
+  litellm_params:
+    model: hosted_vllm/openai/mistral-7b
+    api_base: os.environ/GPU_VLLM_MISTRAL_URL
+    api_key: "EMPTY"
+```
+
+5. **Start the service:**
 
 ```bash
-bash scripts/stop-all.sh && bash scripts/start-all.sh
+supervisorctl start vllm-mistral
 ```
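+
+The same check can be driven from Python with any OpenAI-compatible client (a sketch, assuming the `openai` package is available and the new script mirrors `server_qwen.py`'s endpoints; `mistral-7b` and port 8002 are the example values from the steps above):
+
+```python
+# Smoke test for the hypothetical mistral server added above
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY")
+print([m.id for m in client.models.list()])  # should list the new model
+
+stream = client.chat.completions.create(
+    model="mistral-7b",
+    messages=[{"role": "user", "content": "Count from 1 to 5"}],
+    stream=True,
+)
+for chunk in stream:
+    print(chunk.choices[0].delta.content or "", end="", flush=True)
+```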
 
 ## Key Implementation Details
 
-### Model Switching Logic
+### Direct vLLM Server Architecture
 
-The orchestrator automatically switches models based on:
-- **Endpoint path** - `/v1/chat/completions` → text models, `/v1/images/generations` → image models
-- **Model name in request** - Matches against model registry
-- **Sequential loading** - Stops current model before starting new one to conserve VRAM
+Each vLLM model runs as a dedicated FastAPI server:
+- Independent process management via Supervisor
+- Direct AsyncLLMEngine access for optimal performance
+- OpenAI-compatible `/v1/chat/completions` endpoint
+- Streaming support with proper delta formatting
 
-See `orchestrator_subprocess.py:64-100` for process management implementation.
-
-### Model Registry Structure
-
-Each model in `models.yaml` requires:
-- `type` - text, image, or audio
-- `framework` - vllm, openedai-images, audiocraft, comfyui
-- `service_script` - Relative path to Python/shell script
-- `port` - Service port (8000+)
-- `vram_gb` - GPU memory requirement
-- `startup_time_seconds` - Max health check timeout
-- `endpoint` - API endpoint path
-- `description` - Human-readable description
-
-### Environment Variables
+See `vllm/server_qwen.py` for the reference implementation.
+
+### Environment Variables
 
 Set in `.env` file:
 - `HF_TOKEN` - **Required** - HuggingFace API token for model downloads