docs: update CLAUDE.md to reflect direct vLLM architecture
- Remove all orchestrator references
- Update to dedicated vLLM server model
- Update service management commands
- Update LiteLLM integration details
- Update testing examples
CLAUDE.md | 173 lines changed
@@ -4,49 +4,39 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Overview

-This is a lightweight, process-based AI model orchestrator designed for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It manages sequential loading of multiple large AI models on a single GPU, providing OpenAI-compatible API endpoints for text, image, and audio generation.
+This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.

 **Key Design Philosophy:**
-- **Sequential model loading** - Only one model active at a time to fit within GPU memory constraints
-- **Process-based architecture** - Uses Python subprocess instead of Docker-in-Docker for RunPod compatibility
-- **Automatic model switching** - Orchestrator detects request types and switches models on-demand
+- **Direct vLLM access** - Dedicated FastAPI servers for each text model (no proxy layer)
+- **Process-based architecture** - Managed by Supervisor for reliability
+- **Simple service management** - Start/stop models independently as needed
 - **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools

 ## Architecture

 ### Core Components

-1. **Orchestrator** (`model-orchestrator/orchestrator_subprocess.py`)
-   - FastAPI proxy server listening on port 9000
-   - Manages model lifecycle via Python subprocesses
-   - Routes requests to appropriate model services
-   - Handles sequential model loading/unloading
+1. **vLLM Servers** (`vllm/server_qwen.py`, `vllm/server_llama.py`)
+   - Dedicated FastAPI servers for each text model
+   - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
+   - Direct AsyncLLMEngine access for optimal streaming performance
+   - max_model_len=20000 optimized for 24GB VRAM

-2. **Model Registry** (`model-orchestrator/models.yaml`)
-   - YAML configuration defining available models
-   - Specifies: type, framework, service script, port, VRAM requirements, startup time
-   - Easy to extend with new models
+2. **ComfyUI** (`comfyui/`)
+   - Image/video/audio generation server on port 8188
+   - FLUX, SDXL, CogVideoX, MusicGen models
+   - Advanced workflow system with custom nodes

-3. **Model Services** (`models/*/`)
-   - Individual Python servers running specific AI models
-   - vLLM for text generation (Qwen 2.5 7B, Llama 3.1 8B)
-   - ComfyUI for image/video/audio generation (FLUX, SDXL, CogVideoX, MusicGen)
+3. **Supervisor** (`supervisord.conf`)
+   - Process manager for all AI services
+   - Auto-restart, centralized logging
+   - Web interface on port 9001

 4. **Ansible Provisioning** (`playbook.yml`)
    - Complete infrastructure-as-code setup
    - Installs dependencies, downloads models, configures services
    - Supports selective installation via tags
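The vLLM server component above takes over the role the orchestrator's proxy used to play. As a rough illustration of the engine settings those bullets describe, the setup amounts to something like the sketch below; the model id and `gpu_memory_utilization` value are assumptions, not taken from the repository.

```python
# Illustrative sketch only - not the repository's server_qwen.py.
from vllm import AsyncLLMEngine, AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed HF model id
    max_model_len=20000,               # context length sized for 24GB VRAM
    gpu_memory_utilization=0.90,       # assumed; leaves headroom for CUDA overhead
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```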
-### Why Process-Based Instead of Docker?
-
-The subprocess implementation (`orchestrator_subprocess.py`) is preferred over the Docker version (`orchestrator.py`) because:
-- RunPod instances run in containers - Docker-in-Docker adds complexity
-- Faster model startup (direct Python process spawning)
-- Simpler debugging (single process tree)
-- Reduced overhead (no container management layer)
-
-**Note:** Always use `orchestrator_subprocess.py` for RunPod deployments.
-
 ## Common Commands

 ### Repository Management with Arty
@@ -57,17 +47,16 @@ This project uses Arty for repository and deployment management. See `arty.yml`
 # Clone all repositories (fresh deployment)
 arty sync --env prod     # Production: Essential nodes only
 arty sync --env dev      # Development: All nodes including optional
-arty sync --env minimal  # Minimal: Just orchestrator + ComfyUI base
+arty sync --env minimal  # Minimal: Just vLLM + ComfyUI base

 # Run deployment scripts
 arty run setup/full            # Show setup instructions
 arty run models/link-comfyui   # Link downloaded models to ComfyUI
 arty run deps/comfyui-nodes    # Install custom node dependencies
-arty run services/start        # Start orchestrator
+arty run services/start        # Start supervisor
 arty run services/stop         # Stop all services

 # Health checks
-arty run health/orchestrator   # Check orchestrator
 arty run health/comfyui        # Check ComfyUI
 arty run check/gpu             # nvidia-smi
 arty run check/models          # Show cache size
@@ -119,14 +108,16 @@ arty run services/status # Same via arty
 supervisorctl status               # Direct supervisor command

 # Individual service control
-supervisorctl start orchestrator   # Start orchestrator
+supervisorctl start vllm-qwen      # Start Qwen vLLM server
+supervisorctl start vllm-llama     # Start Llama vLLM server
 supervisorctl restart comfyui      # Restart ComfyUI
-supervisorctl stop orchestrator    # Stop orchestrator
+supervisorctl stop vllm-qwen       # Stop Qwen vLLM server
 arty run services/restart-comfyui  # Restart ComfyUI via arty

 # View logs
 supervisorctl tail -f comfyui      # Follow ComfyUI logs
-supervisorctl tail -f orchestrator # Follow orchestrator logs
+supervisorctl tail -f vllm-qwen    # Follow Qwen vLLM logs
+supervisorctl tail -f vllm-llama   # Follow Llama vLLM logs
 arty run services/logs             # Follow ComfyUI logs via arty

 # Web interface
@@ -141,7 +132,9 @@ arty run services/logs # Follow ComfyUI logs via arty
 **Services managed:**
 - `comfyui` - ComfyUI server (port 8188, autostart enabled)
-- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
+- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
+- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
+- `webdav-sync` - WebDAV sync service for ComfyUI outputs
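Each service listed above maps to a `[program:...]` block in `supervisord.conf`. A sketch of what the `vllm-qwen` entry presumably looks like, modelled on the `[program:vllm-mistral]` example shown later in this diff (the command path is an assumption):

```ini
; Sketch only - modelled on the [program:vllm-mistral] block later in this diff.
[program:vllm-qwen]
command=vllm/venv/bin/python vllm/server_qwen.py
directory=.
; autostart stays off so vLLM and ComfyUI never compete for the 24GB of VRAM
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
```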
 ### GPU Memory Management and Mode Switching
@@ -165,8 +158,8 @@ Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combin
 # 1. Stop ComfyUI
 supervisorctl stop comfyui

-# 2. Start orchestrator (manages vLLM models)
-supervisorctl start orchestrator
+# 2. Start desired vLLM server
+supervisorctl start vllm-qwen   # or vllm-llama

 # 3. Verify
 supervisorctl status
@@ -175,8 +168,8 @@ nvidia-smi # Check VRAM usage
 **Switch to Image/Video/Audio Generation Mode (ComfyUI):**
 ```bash
-# 1. Stop orchestrator (stops all vLLM models)
-supervisorctl stop orchestrator
+# 1. Stop vLLM servers
+supervisorctl stop vllm-qwen vllm-llama

 # 2. Start ComfyUI
 supervisorctl start comfyui
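The two procedures are symmetric, so they can be wrapped in a small helper if that is more convenient. The script below does not exist in the repository; it is a sketch built from the service names in the supervisor configuration above.

```bash
#!/usr/bin/env bash
# switch-mode.sh - hypothetical helper, not part of the repository.
set -euo pipefail

case "${1:-}" in
  text)
    supervisorctl stop comfyui
    supervisorctl start vllm-qwen            # swap in vllm-llama if preferred
    ;;
  image)
    supervisorctl stop vllm-qwen vllm-llama
    supervisorctl start comfyui
    ;;
  *)
    echo "usage: $0 {text|image}" >&2
    exit 1
    ;;
esac

supervisorctl status
nvidia-smi --query-gpu=memory.used --format=csv   # confirm VRAM was released
```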
@@ -196,23 +189,23 @@ You can also switch modes using the Supervisor web interface:
 **Integration with LiteLLM:**

-The orchestrator integrates with LiteLLM on the VPS for unified API access:
-- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
-- Requests route through orchestrator (port 9000) which handles model loading
-- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
-- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
+The vLLM servers integrate with LiteLLM on the VPS for unified API access:
+- vLLM models (qwen-2.5-7b, llama-3.1-8b) available via direct connections
+- Requests route directly to vLLM servers (ports 8000, 8001)
+- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure connections
+- LiteLLM config uses `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs
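On the VPS side, those bullets translate into LiteLLM `model_list` entries of roughly the following shape, mirroring the `mistral-7b` example later in this diff (the URLs in the comments are examples only):

```yaml
# Sketch only - follows the mistral-7b pattern shown later in this diff.
- model_name: qwen-2.5-7b
  litellm_params:
    model: hosted_vllm/openai/qwen-2.5-7b
    api_base: os.environ/GPU_VLLM_QWEN_URL    # e.g. http://<gpu-tailscale-ip>:8000
    api_key: "EMPTY"
- model_name: llama-3.1-8b
  litellm_params:
    model: hosted_vllm/openai/llama-3.1-8b
    api_base: os.environ/GPU_VLLM_LLAMA_URL   # e.g. http://<gpu-tailscale-ip>:8001
    api_key: "EMPTY"
```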
 ### Testing

 ```bash
-# Health check
-curl http://localhost:9000/health
+# Health check (Qwen vLLM server)
+curl http://localhost:8000/health

 # List available models
-curl http://localhost:9000/v1/models
+curl http://localhost:8000/v1/models

 # Test text generation (streaming)
-curl -s -N -X POST http://localhost:9000/v1/chat/completions \
+curl -s -N -X POST http://localhost:8000/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
     "model": "qwen-2.5-7b",
@@ -221,12 +214,10 @@ curl -s -N -X POST http://localhost:9000/v1/chat/completions \
     "stream": true
   }'

-# Test image generation
-curl -X POST http://localhost:9000/v1/images/generations \
+# Test ComfyUI (image generation)
+curl -X POST http://localhost:8188/prompt \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "flux-schnell",
-    "prompt": "A serene mountain landscape at sunset",
     "size": "1024x1024"
   }'
 ```
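One caveat on the last example: ComfyUI's `/prompt` endpoint is not OpenAI-compatible, so the trimmed request above serves mainly as a connectivity check. The endpoint expects a JSON body whose `prompt` field holds a workflow graph exported from the UI via "Save (API Format)". A more realistic smoke test, with a placeholder workflow file name, looks like:

```bash
# workflow_api.json is a placeholder - export your own workflow from the ComfyUI UI first.
curl -X POST http://localhost:8188/prompt \
  -H 'Content-Type: application/json' \
  -d "{\"prompt\": $(cat workflow_api.json)}"
```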
@@ -255,65 +246,59 @@ curl -X POST http://localhost:9000/v1/images/generations \
 - `systemd` - Configure systemd services (use `never` - not for RunPod)
 - `validate` - Health checks (use `never` - run explicitly)

-### Adding New Models
+### Adding New vLLM Models

-1. **Add model definition to `model-orchestrator/models.yaml`:**
-
-   ```yaml
-   llama-3.1-8b:
-     type: text
-     framework: vllm
-     service_script: models/vllm/server_llama.py
-     port: 8001
-     vram_gb: 17
-     startup_time_seconds: 120
-     endpoint: /v1/chat/completions
-     description: "Llama 3.1 8B Instruct"
-   ```
-
-2. **Create service script** (`models/vllm/server_llama.py`):
+1. **Create dedicated server script** (e.g., `vllm/server_mistral.py`):

    ```python
-   import os
-   from vllm.entrypoints.openai.api_server import run_server
-
-   model = "meta-llama/Llama-3.1-8B-Instruct"
-   port = int(os.getenv("PORT", 8001))
-   run_server(model=model, port=port)
+   #!/usr/bin/env python3
+   from vllm import AsyncLLMEngine, AsyncEngineArgs
+   # ... (copy structure from server_qwen.py)
+
+   model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
+   port = 8002  # Choose unique port
    ```

+2. **Add supervisor configuration** (`supervisord.conf`):
+
+   ```ini
+   [program:vllm-mistral]
+   command=vllm/venv/bin/python vllm/server_mistral.py
+   directory=.
+   autostart=false
+   autorestart=true
+   environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
+   ```
+
 3. **Download model** (handled by Ansible playbook or manually via HuggingFace CLI)

-4. **Restart orchestrator:**
+4. **Add to LiteLLM config** (on VPS):
+
+   ```yaml
+   - model_name: mistral-7b
+     litellm_params:
+       model: hosted_vllm/openai/mistral-7b
+       api_base: os.environ/GPU_VLLM_MISTRAL_URL
+       api_key: "EMPTY"
+   ```
+
+5. **Start the service:**

    ```bash
-   bash scripts/stop-all.sh && bash scripts/start-all.sh
+   supervisorctl start vllm-mistral
    ```

 ## Key Implementation Details

-### Model Switching Logic
+### Direct vLLM Server Architecture

-The orchestrator automatically switches models based on:
-- **Endpoint path** - `/v1/chat/completions` → text models, `/v1/images/generations` → image models
-- **Model name in request** - Matches against model registry
-- **Sequential loading** - Stops current model before starting new one to conserve VRAM
+Each vLLM model runs as a dedicated FastAPI server:
+- Independent process management via Supervisor
+- Direct AsyncLLMEngine access for optimal performance
+- OpenAI-compatible `/v1/chat/completions` endpoint
+- Streaming support with proper delta formatting

-See `orchestrator_subprocess.py:64-100` for process management implementation.
+See `vllm/server_qwen.py` for reference implementation.

-### Model Registry Structure
-
-Each model in `models.yaml` requires:
-- `type` - text, image, or audio
-- `framework` - vllm, openedai-images, audiocraft, comfyui
-- `service_script` - Relative path to Python/shell script
-- `port` - Service port (8000+)
-- `vram_gb` - GPU memory requirement
-- `startup_time_seconds` - Max health check timeout
-- `endpoint` - API endpoint path
-- `description` - Human-readable description
-
-### Environment Variables
-
 Set in `.env` file:
 - `HF_TOKEN` - **Required** - HuggingFace API token for model downloads
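Putting the pieces of the new "Direct vLLM Server Architecture" section together (dedicated AsyncLLMEngine, OpenAI-compatible route, streamed deltas), a server in that style is shaped roughly like the sketch below. It is not the repository's `vllm/server_qwen.py`; the prompt handling, sampling defaults, and model id are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a dedicated vLLM server in the style described above.

Not the repository's server_qwen.py - it only illustrates the pattern: one
AsyncLLMEngine per process, an OpenAI-style /v1/chat/completions route, and
SSE streaming with delta-formatted chunks.
"""
import json
import time
import uuid

import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # assumed HF model id
SERVED_NAME = "qwen-2.5-7b"
PORT = 8000

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_ID, max_model_len=20000)
)
app = FastAPI()


@app.get("/health")
async def health():
    return {"status": "ok", "model": SERVED_NAME}


@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": SERVED_NAME, "object": "model"}]}


@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    # Naive prompt construction; a real server would apply the model's chat template.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in request["messages"])
    params = SamplingParams(
        max_tokens=request.get("max_tokens", 512),
        temperature=request.get("temperature", 0.7),
    )
    request_id = str(uuid.uuid4())
    results = engine.generate(prompt, params, request_id)

    def chunk(delta, finish_reason=None):
        return {
            "id": request_id, "object": "chat.completion.chunk",
            "created": int(time.time()), "model": SERVED_NAME,
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
        }

    async def stream():
        sent = 0
        async for output in results:
            text = output.outputs[0].text      # cumulative text so far
            delta, sent = text[sent:], len(text)
            yield f"data: {json.dumps(chunk({'content': delta}))}\n\n"
        yield f"data: {json.dumps(chunk({}, 'stop'))}\n\n"
        yield "data: [DONE]\n\n"

    if request.get("stream"):
        return StreamingResponse(stream(), media_type="text/event-stream")

    final = None
    async for output in results:
        final = output
    return JSONResponse({
        "id": request_id, "object": "chat.completion",
        "created": int(time.time()), "model": SERVED_NAME,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": final.outputs[0].text},
            "finish_reason": "stop",
        }],
    })


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)
```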