fix: use venv python for vLLM service startup

Commit 3f812704a2 (parent fdd724298a)
Date: 2025-11-23 15:21:52 +01:00
2 changed files with 66 additions and 2 deletions


@@ -143,6 +143,64 @@ arty run services/logs # Follow ComfyUI logs via arty
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
### GPU Memory Management and Mode Switching

**VRAM Constraints (RTX 4090 - 24GB total):**

With only 24GB of VRAM, the GPU cannot hold a large image model and a large language model at once, so services must be switched manually (a sketch for checking free VRAM follows the table):

| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |
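Before switching, it can help to confirm how much VRAM is actually free. A minimal sketch using `nvidia-smi`'s query flags (the 23GB threshold mirrors the FLUX row above; adjust per model):

```python
import subprocess

def gpu_memory_free_mb() -> int:
    """Free VRAM in MiB on the first GPU, via nvidia-smi."""
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0])

free = gpu_memory_free_mb()
print(f"Free VRAM: {free} MiB")
# FLUX Schnell FP16 needs ~23GB, i.e. nearly the whole card
if free < 23_000:
    print("Not enough headroom -- stop the other service first.")
```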

**Mode Switching Workflow:**

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

**Switch to Text Generation Mode (vLLM):**

```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui
# 2. Start orchestrator (manages vLLM models)
supervisorctl start orchestrator
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```

**Switch to Image/Video/Audio Generation Mode (ComfyUI):**

```bash
# 1. Stop orchestrator (stops all vLLM models)
supervisorctl stop orchestrator
# 2. Start ComfyUI
supervisorctl start comfyui
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```
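The two sequences above are symmetric, so they can be wrapped in a small helper. A hedged sketch (the service names match the supervisor config; the helper itself is illustrative, not part of the repo):

```python
import subprocess

MODES = {
    # mode -> (service to stop, service to start)
    'text':  ('comfyui', 'orchestrator'),
    'image': ('orchestrator', 'comfyui'),
}

def switch_mode(mode: str) -> None:
    stop, start = MODES[mode]
    subprocess.run(['supervisorctl', 'stop', stop], check=True)
    subprocess.run(['supervisorctl', 'start', start], check=True)
    # Same verification steps as the manual workflow
    subprocess.run(['supervisorctl', 'status'])
    subprocess.run(['nvidia-smi'])

switch_mode('text')  # switch to vLLM text generation
```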

**Access via Supervisor Web UI:**

You can also switch modes using the Supervisor web interface (the same controls are scriptable over XML-RPC, as shown after this list):

- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service
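Supervisor's web UI sits on top of its XML-RPC interface, so the same start/stop actions can be scripted. A sketch against the direct Tailscale endpoint (assuming supervisor's standard `/RPC2` path is enabled):

```python
from xmlrpc.client import ServerProxy

# Credentials and host from the web UI details above
server = ServerProxy('http://admin:runpod2024@100.114.60.40:9001/RPC2')

for info in server.supervisor.getAllProcessInfo():
    print(info['name'], info['statename'])

# Equivalent to clicking Stop/Start in the UI
server.supervisor.stopProcess('comfyui')
server.supervisor.startProcess('orchestrator')
```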

**Integration with LiteLLM:**

The orchestrator integrates with LiteLLM on the VPS for unified API access (a request sketch follows this list):

- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
- Requests route through orchestrator (port 9000) which handles model loading
- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
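From a client's perspective the models behave like any OpenAI-compatible endpoint once the orchestrator is running. A sketch of a request through LiteLLM (the base URL and API key are placeholders, not this deployment's actual values):

```python
import requests

# Placeholder endpoint and key -- substitute the VPS's real LiteLLM values
resp = requests.post(
    'https://litellm.example.com/v1/chat/completions',
    headers={'Authorization': 'Bearer sk-...'},
    json={
        'model': 'qwen-2.5-7b',  # served only while the orchestrator is up
        'messages': [{'role': 'user', 'content': 'Hello!'}],
    },
    timeout=120,
)
print(resp.json()['choices'][0]['message']['content'])
```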

### Testing


```diff
@@ -102,11 +102,17 @@ async def start_model_process(model_name: str) -> bool:
     env.update({
         'HF_TOKEN': os.getenv('HF_TOKEN', ''),
         'PORT': str(port),
-        'HOST': '0.0.0.0'
+        'HOST': '0.0.0.0',
+        'MODEL_NAME': model_config.get('model_name', model_name)
     })
 
+    # Use venv python if it exists
+    script_dir = script_path.parent
+    venv_python = script_dir / 'venv' / 'bin' / 'python3'
+    python_cmd = str(venv_python) if venv_python.exists() else 'python3'
+
     proc = subprocess.Popen(
-        ['python3', str(script_path)],
+        [python_cmd, str(script_path)],
         env=env,
         stdout=subprocess.PIPE,
         stderr=subprocess.PIPE,
```
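Pulled out of the diff, the fix is a short interpreter-resolution step: prefer the virtualenv bundled next to the model script (where vLLM and its dependencies live) and fall back to the system `python3`. A self-contained sketch of the pattern (the script path is illustrative, not the repo's actual layout):

```python
import subprocess
from pathlib import Path

def resolve_python(script_path: Path) -> str:
    """Prefer the venv interpreter next to the script; fall back to python3."""
    venv_python = script_path.parent / 'venv' / 'bin' / 'python3'
    return str(venv_python) if venv_python.exists() else 'python3'

# Illustrative path only
script_path = Path('/workspace/orchestrator/scripts/serve_model.py')
proc = subprocess.Popen(
    [resolve_python(script_path), str(script_path)],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
```

Without the fallback, the service launches under the system interpreter and fails to import packages installed only inside the venv, which is the failure mode this commit addresses.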