fix: use venv python for vLLM service startup

Commit 3f812704a2 (parent fdd724298a)
Date: 2025-11-23 15:21:52 +01:00
2 changed files with 66 additions and 2 deletions


@@ -143,6 +143,64 @@ arty run services/logs # Follow ComfyUI logs via arty
- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `orchestrator` - Model orchestrator (port 9000, autostart disabled)
### GPU Memory Management and Mode Switching

**VRAM Constraints (RTX 4090 - 24GB total):**

With only 24GB of VRAM, the GPU cannot hold a large image model and a large language model at once, so services must be switched manually (a sketch for checking free VRAM follows the table):

| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |
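Before switching, it can help to confirm how much VRAM is actually free. A minimal sketch using `nvidia-smi`'s query flags (the 23GB threshold mirrors the FLUX row above; adjust per model):

```python
import subprocess

def gpu_memory_free_mb() -> int:
    """Free VRAM in MiB on the first GPU, via nvidia-smi."""
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0])

free = gpu_memory_free_mb()
print(f"Free VRAM: {free} MiB")
# FLUX Schnell FP16 needs ~23GB, i.e. nearly the whole card
if free < 23_000:
    print("Not enough headroom -- stop the other service first.")
```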

**Mode Switching Workflow:**

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

**Switch to Text Generation Mode (vLLM):**

```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui
# 2. Start orchestrator (manages vLLM models)
supervisorctl start orchestrator
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```

**Switch to Image/Video/Audio Generation Mode (ComfyUI):**

```bash
# 1. Stop orchestrator (stops all vLLM models)
supervisorctl stop orchestrator
# 2. Start ComfyUI
supervisorctl start comfyui
# 3. Verify
supervisorctl status
nvidia-smi # Check VRAM usage
```
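The two sequences above are symmetric, so they can be wrapped in a small helper. A hedged sketch (the service names match the supervisor config; the helper itself is illustrative, not part of the repo):

```python
import subprocess

MODES = {
    # mode -> (service to stop, service to start)
    'text':  ('comfyui', 'orchestrator'),
    'image': ('orchestrator', 'comfyui'),
}

def switch_mode(mode: str) -> None:
    stop, start = MODES[mode]
    subprocess.run(['supervisorctl', 'stop', stop], check=True)
    subprocess.run(['supervisorctl', 'start', start], check=True)
    # Same verification steps as the manual workflow
    subprocess.run(['supervisorctl', 'status'])
    subprocess.run(['nvidia-smi'])

switch_mode('text')  # switch to vLLM text generation
```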

**Access via Supervisor Web UI:**

You can also switch modes using the Supervisor web interface (the same controls are scriptable over XML-RPC, as shown after this list):

- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service
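Supervisor's web UI sits on top of its XML-RPC interface, so the same start/stop actions can be scripted. A sketch against the direct Tailscale endpoint (assuming supervisor's standard `/RPC2` path is enabled):

```python
from xmlrpc.client import ServerProxy

# Credentials and host from the web UI details above
server = ServerProxy('http://admin:runpod2024@100.114.60.40:9001/RPC2')

for info in server.supervisor.getAllProcessInfo():
    print(info['name'], info['statename'])

# Equivalent to clicking Stop/Start in the UI
server.supervisor.stopProcess('comfyui')
server.supervisor.startProcess('orchestrator')
```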

**Integration with LiteLLM:**

The orchestrator integrates with LiteLLM on the VPS for unified API access (a request sketch follows this list):

- vLLM models (qwen-2.5-7b, llama-3.1-8b) available when orchestrator is running
- Requests route through orchestrator (port 9000) which handles model loading
- Environment variable `GPU_TAILSCALE_IP` (100.114.60.40) configures connection
- LiteLLM config uses `os.environ/GPU_TAILSCALE_IP` syntax for dynamic IP
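From a client's perspective the models behave like any OpenAI-compatible endpoint once the orchestrator is running. A sketch of a request through LiteLLM (the base URL and API key are placeholders, not this deployment's actual values):

```python
import requests

# Placeholder endpoint and key -- substitute the VPS's real LiteLLM values
resp = requests.post(
    'https://litellm.example.com/v1/chat/completions',
    headers={'Authorization': 'Bearer sk-...'},
    json={
        'model': 'qwen-2.5-7b',  # served only while the orchestrator is up
        'messages': [{'role': 'user', 'content': 'Hello!'}],
    },
    timeout=120,
)
print(resp.json()['choices'][0]['message']['content'])
```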

### Testing


```diff
@@ -102,11 +102,17 @@ async def start_model_process(model_name: str) -> bool:
     env.update({
         'HF_TOKEN': os.getenv('HF_TOKEN', ''),
         'PORT': str(port),
-        'HOST': '0.0.0.0'
+        'HOST': '0.0.0.0',
+        'MODEL_NAME': model_config.get('model_name', model_name)
     })
 
+    # Use venv python if it exists
+    script_dir = script_path.parent
+    venv_python = script_dir / 'venv' / 'bin' / 'python3'
+    python_cmd = str(venv_python) if venv_python.exists() else 'python3'
+
     proc = subprocess.Popen(
-        ['python3', str(script_path)],
+        [python_cmd, str(script_path)],
         env=env,
         stdout=subprocess.PIPE,
         stderr=subprocess.PIPE,
```
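Pulled out of the diff, the fix is a short interpreter-resolution step: prefer the virtualenv bundled next to the model script (where vLLM and its dependencies live) and fall back to the system `python3`. A self-contained sketch of the pattern (the script path is illustrative, not the repo's actual layout):

```python
import subprocess
from pathlib import Path

def resolve_python(script_path: Path) -> str:
    """Prefer the venv interpreter next to the script; fall back to python3."""
    venv_python = script_path.parent / 'venv' / 'bin' / 'python3'
    return str(venv_python) if venv_python.exists() else 'python3'

# Illustrative path only
script_path = Path('/workspace/orchestrator/scripts/serve_model.py')
proc = subprocess.Popen(
    [resolve_python(script_path), str(script_path)],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
```

Without the fallback, the service launches under the system interpreter and fails to import packages installed only inside the venv, which is the failure mode this commit addresses.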