# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a lightweight AI model serving infrastructure for RunPod GPU instances (specifically RTX 4090 with 24GB VRAM). It provides OpenAI-compatible API endpoints for text, image, and audio generation via dedicated vLLM servers and ComfyUI.

**Key Design Philosophy:**

- **Direct vLLM access** - Dedicated FastAPI servers for each text model (no proxy layer)
- **Process-based architecture** - Managed by Supervisor for reliability
- **Simple service management** - Start/stop models independently as needed
- **OpenAI-compatible APIs** - Works seamlessly with LiteLLM proxy and other AI tools

## Architecture

### Core Components

1. **vLLM Servers** (`vllm/server_qwen.py`, `vllm/server_llama.py`)
   - Dedicated FastAPI servers for each text model
   - Qwen 2.5 7B on port 8000, Llama 3.1 8B on port 8001
   - Direct AsyncLLMEngine access for optimal streaming performance
   - `max_model_len=20000`, optimized for 24GB VRAM

2. **ComfyUI** (`comfyui/`)
   - Image/video/audio generation server on port 8188
   - FLUX, SDXL, CogVideoX, MusicGen models
   - Advanced workflow system with custom nodes

3. **Supervisor** (`supervisord.conf`)
   - Process manager for all AI services
   - Auto-restart, centralized logging
   - Web interface on port 9001

4. **Ansible Provisioning** (`playbook.yml`)
   - Complete infrastructure-as-code setup
   - Installs dependencies, downloads models, configures services
   - Supports selective installation via tags

## Common Commands

### Repository Management with Arty

This project uses Arty for repository and deployment management. See `arty.yml` for full configuration.

```bash
# Clone all repositories (fresh deployment)
arty sync --env prod      # Production: Essential nodes only
arty sync --env dev       # Development: All nodes including optional
arty sync --env minimal   # Minimal: Just vLLM + ComfyUI base

# Run deployment scripts
arty run setup/full             # Show setup instructions
arty run models/link-comfyui    # Link downloaded models to ComfyUI
arty run deps/comfyui-nodes     # Install custom node dependencies
arty run services/start         # Start supervisor
arty run services/stop          # Stop all services

# Health checks
arty run health/comfyui   # Check ComfyUI
arty run check/gpu        # nvidia-smi
arty run check/models     # Show cache size
```

### Initial Setup

```bash
# 1. Clone repositories with Arty (fresh RunPod instance)
arty sync --env prod

# 2. Configure environment
cd /workspace/ai
cp .env.example .env
# Edit .env and set HF_TOKEN=your_huggingface_token

# 3. Full deployment with Ansible
ansible-playbook playbook.yml

# 4. Essential ComfyUI setup (faster, ~80GB instead of ~137GB)
ansible-playbook playbook.yml --tags comfyui-essential

# 5. Link models to ComfyUI
arty run models/link-comfyui

# 6. Install custom node dependencies
arty run deps/comfyui-nodes

# 7. Selective installation (base system + Python + vLLM models only)
ansible-playbook playbook.yml --tags base,python,dependencies
```

### Service Management

This project uses **Supervisor** for process management, providing auto-restart, centralized logging, and easy service control.

```bash
# Start all services (with Supervisor)
bash scripts/start-all.sh    # Starts supervisor daemon + services
arty run services/start      # Same via arty

# Stop all services
bash scripts/stop-all.sh     # Stops all services + supervisor
arty run services/stop       # Same via arty

# Check service status
bash scripts/status.sh       # Show all service status
arty run services/status     # Same via arty
supervisorctl status         # Direct supervisor command

# Individual service control
supervisorctl start vllm-qwen        # Start Qwen vLLM server
supervisorctl start vllm-llama       # Start Llama vLLM server
supervisorctl restart comfyui        # Restart ComfyUI
supervisorctl stop vllm-qwen         # Stop Qwen vLLM server
arty run services/restart-comfyui    # Restart ComfyUI via arty

# View logs
supervisorctl tail -f comfyui        # Follow ComfyUI logs
supervisorctl tail -f vllm-qwen      # Follow Qwen vLLM logs
supervisorctl tail -f vllm-llama     # Follow Llama vLLM logs
arty run services/logs               # Follow ComfyUI logs via arty

# Web interface
# Access at http://localhost:9001 (username: admin, password: runpod2024)
```


**Supervisor Configuration:**

- Config file: `/workspace/supervisord.conf`
- Log directory: `/workspace/logs/`
- PID file: `/workspace/supervisord.pid`
- Socket: `/workspace/supervisor.sock`

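For orientation, the daemon-level sections of `supervisord.conf` follow the standard Supervisor layout with the paths, port, and credentials listed above. This is a sketch, not a copy of the real file; the actual config also defines the `[program:...]` blocks for the services listed below.

```ini
[unix_http_server]
file=/workspace/supervisor.sock

[inet_http_server]            ; web UI on port 9001
port=0.0.0.0:9001
username=admin
password=runpod2024

[supervisord]
logfile=/workspace/logs/supervisord.log
pidfile=/workspace/supervisord.pid
childlogdir=/workspace/logs

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///workspace/supervisor.sock
```
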

**Services managed:**

- `comfyui` - ComfyUI server (port 8188, autostart enabled)
- `vllm-qwen` - Qwen 2.5 7B vLLM server (port 8000, autostart disabled)
- `vllm-llama` - Llama 3.1 8B vLLM server (port 8001, autostart disabled)
- `webdav-sync` - WebDAV sync service for ComfyUI outputs

### GPU Memory Management and Mode Switching

**VRAM Constraints (RTX 4090 - 24GB total):**

The GPU has limited memory, which requires manual service switching:

| Service | Model | VRAM Usage | Compatible With |
|---------|-------|------------|-----------------|
| ComfyUI | FLUX Schnell FP16 | ~23GB | None (uses all VRAM) |
| ComfyUI | SDXL Base | ~12GB | Small vLLM models |
| vLLM | Qwen 2.5 7B | ~14GB | None (conflicts with ComfyUI) |
| vLLM | Llama 3.1 8B | ~17GB | None (conflicts with ComfyUI) |

**Mode Switching Workflow:**

Since ComfyUI and vLLM models cannot run simultaneously (they exceed 24GB combined), you must manually switch modes:

**Switch to Text Generation Mode (vLLM):**

```bash
# 1. Stop ComfyUI
supervisorctl stop comfyui

# 2. Start desired vLLM server
supervisorctl start vllm-qwen   # or vllm-llama

# 3. Verify
supervisorctl status
nvidia-smi   # Check VRAM usage
```

**Switch to Image/Video/Audio Generation Mode (ComfyUI):**

```bash
# 1. Stop vLLM servers
supervisorctl stop vllm-qwen vllm-llama

# 2. Start ComfyUI
supervisorctl start comfyui

# 3. Verify
supervisorctl status
nvidia-smi   # Check VRAM usage
```

**Access via Supervisor Web UI:**

You can also switch modes using the Supervisor web interface:

- URL: `https://supervisor.ai.pivoine.art` (via VPS proxy) or `http://100.114.60.40:9001` (direct Tailscale)
- Username: `admin`
- Password: `runpod2024`
- Click "Start" or "Stop" buttons for each service

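Mode switching can also be scripted against Supervisor's XML-RPC interface, which sits behind the same web UI port. A minimal sketch using the credentials above and the process names listed under "Services managed":

```python
from xmlrpc.client import ServerProxy

# Supervisor's XML-RPC endpoint is served behind the web UI (port 9001).
server = ServerProxy("http://admin:runpod2024@localhost:9001/RPC2")

server.supervisor.stopProcess("comfyui")     # free VRAM
server.supervisor.startProcess("vllm-qwen")  # switch to text generation mode
print(server.supervisor.getProcessInfo("vllm-qwen")["statename"])
```
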
**Integration with LiteLLM:**

The vLLM servers integrate with LiteLLM on the VPS for unified API access:

- vLLM models (qwen-2.5-7b, llama-3.1-8b) are exposed via direct connections
- Requests route directly to the vLLM servers (ports 8000, 8001)
- Environment variables `GPU_VLLM_QWEN_URL` and `GPU_VLLM_LLAMA_URL` configure the connections
- The LiteLLM config uses the `os.environ/GPU_VLLM_*_URL` syntax for dynamic URLs, as shown in the sketch below

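On the VPS, an entry using this pattern looks roughly like the following (mirroring the `mistral-7b` example later in this file; the real values live in `litellm-config.yaml`):

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b
      api_base: os.environ/GPU_VLLM_QWEN_URL   # e.g. http://<tailscale-ip>:8000/v1
      api_key: "EMPTY"
      stream: true
```
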
### Testing

```bash
# Health check (Qwen vLLM server)
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test text generation (streaming)
curl -s -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 50,
    "stream": true
  }'

# Test ComfyUI (API reachability)
curl http://localhost:8188/system_stats

# Submitting an actual image generation job requires a full workflow JSON
# (export one from the ComfyUI UI via "Save (API Format)"):
curl -X POST http://localhost:8188/prompt \
  -H 'Content-Type: application/json' \
  -d '{"prompt": { ...exported workflow graph... }}'
```

### Ansible Tags Reference

**System Setup:**

- `base` - Base system packages
- `python` - Python environment setup
- `dependencies` - Install Python packages

**Model Installation:**

- `models` - Download vLLM/FLUX/MusicGen models (legacy)
- `comfyui` - Install ComfyUI base
- `comfyui-essential` - Quick setup (ComfyUI + essential models only, ~80GB)
- `comfyui-models-image` - Image generation models (FLUX, SDXL, SD3.5)
- `comfyui-models-video` - Video generation models (CogVideoX, SVD)
- `comfyui-models-audio` - Audio generation models (MusicGen variants)
- `comfyui-models-support` - CLIP, IP-Adapter, ControlNet models
- `comfyui-models-all` - All ComfyUI models (~137GB)
- `comfyui-nodes` - Install essential custom nodes

**Infrastructure:**

- `tailscale` - Install Tailscale VPN client
- `supervisor` - Install and configure Supervisor process manager
- `systemd` - Configure systemd services (use `never` - not for RunPod)
- `validate` - Health checks (use `never` - run explicitly)

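Tags can be combined in a single run; for example (an illustrative combination, not a prescribed profile):

```bash
# ComfyUI base + image models + custom nodes
ansible-playbook playbook.yml --tags comfyui,comfyui-models-image,comfyui-nodes

# Tags marked `never` only run when requested explicitly by name
ansible-playbook playbook.yml --tags validate
```
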
### Adding New vLLM Models

1. **Create dedicated server script** (e.g., `vllm/server_mistral.py`):

```python
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs
# ... (copy the structure from server_qwen.py)

model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
port = 8002  # Choose a unique port
```

2. **Add supervisor configuration** (`supervisord.conf`):

```ini
[program:vllm-mistral]
command=vllm/venv/bin/python vllm/server_mistral.py
directory=.
autostart=false
autorestart=true
environment=HF_HOME="../huggingface_cache",HF_TOKEN="%(ENV_HF_TOKEN)s"
```

3. **Download model** (handled by the Ansible playbook, or manually via the HuggingFace CLI as sketched below)

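A manual download might look like this (a sketch; the playbook normally handles downloads, and the cache path mirrors the `HF_HOME` location used in `supervisord.conf`):

```bash
HF_HOME=/workspace/huggingface_cache \
  huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3
```
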
4. **Add to LiteLLM config** (on VPS):

```yaml
- model_name: mistral-7b
  litellm_params:
    model: hosted_vllm/openai/mistral-7b
    api_base: os.environ/GPU_VLLM_MISTRAL_URL
    api_key: "EMPTY"
```

5. **Start the service:**

```bash
supervisorctl start vllm-mistral
```

### Downloading NSFW Models from CivitAI

The project includes dedicated tooling for downloading NSFW models from CivitAI, separate from the HuggingFace downloader.

**Files:**

- Configuration: `models_civitai.yaml` - Defines available CivitAI models
- Downloader: `artifact_civitai_download.sh` - CLI tool for downloading models

**Requirements:**

- CivitAI API key (set in `.env` file as `CIVITAI_API_KEY`)
- Get your key from: https://civitai.com/user/account

**Available Models (Essential):**

1. **LUSTIFY v7.0 GGWP** (6.31GB) - Photoreal NSFW checkpoint
2. **Pony Diffusion V6 XL** (6.5GB) - Anime/furry NSFW with balanced content
3. **RealVisXL V5.0** (6.8GB) - Photorealistic NSFW model

**Usage:**

```bash
# Download all essential NSFW models (19.61GB)
artifact_civitai_download.sh

# Download only (no symlinks)
artifact_civitai_download.sh download

# Create symlinks only (models already downloaded)
artifact_civitai_download.sh link

# Custom configuration file
artifact_civitai_download.sh -c /path/to/models_civitai.yaml

# Help
artifact_civitai_download.sh --help
```

**Directory Structure:**

- Models downloaded to: `/workspace/models/civitai/` (cache)
- Symlinks created in: `/workspace/ComfyUI/models/checkpoints/`

**Features:**

- Purple/magenta themed CLI (vs. cyan for the HuggingFace downloader)
- Automatic retry with exponential backoff for rate limiting
- Resume interrupted downloads
- Progress bars and status indicators
- Parallel download support (respects rate limits)

**Configuration Example (`models_civitai.yaml`):**

```yaml
model_categories:
  nsfw_checkpoints:
    - name: lustify-v7-ggwp
      version_id: "1094291"
      model_id: "573152"
      description: "LUSTIFY v7.0 GGWP - Photoreal NSFW checkpoint"
      size_gb: 6.31
      essential: true
      type: checkpoints
```

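Under the hood, each entry resolves to an authenticated request against CivitAI's download endpoint for its `version_id`. A rough sketch of a single download, assuming the public `api/download` route with a token parameter (the script adds retries, rate limiting, and progress handling on top of this):

```bash
source .env   # provides CIVITAI_API_KEY

# Fetch the LUSTIFY checkpoint (version_id 1094291 from the example above)
curl -L --fail --retry 3 \
  -o /workspace/models/civitai/lustify-v7-ggwp.safetensors \
  "https://civitai.com/api/download/models/1094291?token=${CIVITAI_API_KEY}"
```
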
**Installation Profiles:**

- `essential`: 3 models (LUSTIFY, Pony, RealVisXL) - 19.61GB, 30-60 min
- `complete`: All 6 models - 39.21GB, 1-2 hours
- `realistic_only`: LUSTIFY + RealVisXL - 13.11GB, 20-40 min
- `anime_only`: Pony + WAI-NSFW - 13.0GB, 20-40 min

**Important Notes:**

- All models are SDXL-based (require ~12GB VRAM)
- Cannot run alongside FLUX models (FLUX alone uses ~23GB of the 24GB available)
- VRAM usage is comparable to standard SDXL (~12GB)
- Rate limiting: 5-second delay between downloads (configurable)

## Key Implementation Details

### Direct vLLM Server Architecture

Each vLLM model runs as a dedicated FastAPI server:

- Independent process management via Supervisor
- Direct AsyncLLMEngine access for optimal performance
- OpenAI-compatible `/v1/chat/completions` endpoint
- Streaming support with proper delta formatting

See `vllm/server_qwen.py` for the reference implementation; a condensed sketch of the pattern follows.

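The sketch below is illustrative only: the model id, defaults, and the naive prompt construction are assumptions rather than code copied from `server_qwen.py`.

```python
#!/usr/bin/env python3
"""Condensed sketch of a dedicated vLLM server (not the actual server_qwen.py)."""
import json
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"   # assumed HF repo id
SERVED_NAME = "qwen-2.5-7b"

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_ID, max_model_len=20000)
)
app = FastAPI()


@app.get("/health")
async def health():
    return {"status": "ok", "model": SERVED_NAME}


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Naive prompt construction; the real server applies the model's chat template.
    prompt = "\n".join(m["content"] for m in body["messages"])
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))
    request_id = str(uuid.uuid4())

    async def stream():
        sent = 0
        # engine.generate yields cumulative RequestOutput objects; emit only new text.
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            chunk = {
                "id": request_id,
                "object": "chat.completion.chunk",
                "model": SERVED_NAME,
                "choices": [{"index": 0, "delta": {"content": delta}}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
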
### Environment Variables and Model Storage

Set in the `.env` file:

- `HF_TOKEN` - **Required** - HuggingFace API token for model downloads
- `GPU_TAILSCALE_IP` - Optional - Tailscale IP for VPN access

Models are cached in:

- `/workspace/huggingface_cache` - HuggingFace models
- `/workspace/models` - Other model files
- `/workspace/ComfyUI/models` - ComfyUI model directory structure

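An illustrative `.env` built from the variables referenced in this file (all values are placeholders):

```bash
# Required - HuggingFace API token for model downloads
HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Optional - Tailscale IP for VPN access
GPU_TAILSCALE_IP=100.114.60.40

# Required only for the CivitAI downloader (see the CivitAI section above)
CIVITAI_API_KEY=xxxxxxxxxxxxxxxx
```
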
### Integration with LiteLLM

For unified API management through the LiteLLM proxy:

**LiteLLM configuration (`litellm-config.yaml` on VPS):**

```yaml
model_list:
  - model_name: qwen-2.5-7b
    litellm_params:
      model: hosted_vllm/openai/qwen-2.5-7b    # Use hosted_vllm prefix!
      api_base: http://100.121.199.88:9000/v1  # Tailscale VPN IP
      api_key: dummy
      stream: true
      timeout: 600
```

**Critical:** Use the `hosted_vllm/openai/` prefix for vLLM models to enable proper streaming support. The wrong prefix causes empty delta chunks.

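A quick way to confirm streaming end-to-end through LiteLLM, using the standard OpenAI Python client (the proxy URL and key below are placeholders):

```python
from openai import OpenAI

# Point the client at the LiteLLM proxy, not directly at the vLLM server.
client = OpenAI(base_url="https://litellm.example.com/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                      # consistently empty deltas suggest a wrong model prefix
        print(delta, end="", flush=True)
```
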
### ComfyUI Installation

ComfyUI provides advanced image/video/audio generation capabilities:

**Directory structure created:**

```
/workspace/ComfyUI/
├── models/
│   ├── checkpoints/     # FLUX, SDXL, SD3 models
│   ├── clip_vision/     # CLIP vision models
│   ├── video_models/    # CogVideoX, SVD
│   └── audio_models/    # MusicGen
└── custom_nodes/        # Extension nodes
```

**Essential custom nodes installed:**

- ComfyUI-Manager - Model/node management GUI
- ComfyUI-VideoHelperSuite - Video operations
- ComfyUI-AnimateDiff-Evolved - Video generation
- ComfyUI_IPAdapter_plus - Style transfer
- ComfyUI-Impact-Pack - Auto face enhancement
- comfyui-sound-lab - Audio generation

**VRAM requirements for 24GB GPU:**

- FLUX Schnell FP16: 23GB (leaves 1GB)
- SDXL Base: 12GB
- CogVideoX-5B: 12GB (with optimizations)
- MusicGen Medium: 8GB

See `COMFYUI_MODELS.md` for the detailed model catalog and usage examples.

## Deployment Workflow

### RunPod Deployment (Current Setup)

1. **Clone repository:**

```bash
cd /workspace
git clone <repo-url> ai
cd ai
```

2. **Configure environment:**

```bash
cp .env.example .env
# Edit .env, set HF_TOKEN
```

3. **Run Ansible provisioning:**

```bash
ansible-playbook playbook.yml
# Or selective: --tags base,python,comfyui-essential
```

4. **Start services:**

```bash
bash scripts/start-all.sh
```

5. **Verify:**

```bash
curl http://localhost:9000/health
```

### Tailscale VPN Integration

To connect the RunPod GPU to the VPS infrastructure:

```bash
# On the RunPod instance
curl -fsSL https://tailscale.com/install.sh | sh
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up --advertise-tags=tag:gpu
tailscale ip -4   # Get the IP for the LiteLLM config
```

Benefits: secure tunnel, no public exposure, low latency.

## Project Structure

```
runpod/
├── model-orchestrator/
│   ├── orchestrator_subprocess.py   # Main orchestrator (USE THIS)
│   ├── orchestrator.py              # Docker-based version (legacy)
│   ├── models.yaml                  # Model registry
│   └── requirements.txt
├── models/
│   ├── vllm/
│   │   ├── server.py                # vLLM text generation service
│   │   └── requirements.txt
│   └── comfyui/
│       ├── start.sh                 # ComfyUI startup script
│       └── requirements.txt
├── scripts/
│   ├── start-all.sh                 # Start all services with Supervisor
│   ├── stop-all.sh                  # Stop all services
│   └── status.sh                    # Check service status
├── supervisord.conf                 # Supervisor process manager config
├── arty.yml                         # Arty repository manager config
├── playbook.yml                     # Ansible provisioning playbook
├── inventory.yml                    # Ansible inventory (localhost)
├── ansible.cfg                      # Ansible configuration
├── .env.example                     # Environment variables template
├── CLAUDE.md                        # This file
├── COMFYUI_MODELS.md                # ComfyUI models catalog
├── MODELS_LINKED.md                 # Model linkage documentation
├── comfyui_models.yaml              # ComfyUI model configuration
└── README.md                        # User documentation
```

## Troubleshooting

### Model fails to start

- Check VRAM: `nvidia-smi`
- Verify model weights downloaded: `ls -lh /workspace/huggingface_cache`
- Check port conflicts: `lsof -i :9000`
- Test the model directly: `python3 models/vllm/server.py`

### Streaming returns empty deltas

- Use the correct LiteLLM model prefix: `hosted_vllm/openai/model-name`
- Set `stream: true` in the LiteLLM config
- Verify the orchestrator proxies streaming correctly

### HuggingFace download errors

- Check token: `echo $HF_TOKEN`
- Set in `.env`: `HF_TOKEN=your_token_here`
- Re-run Ansible: `ansible-playbook playbook.yml --tags dependencies`

### Out of storage space

- Check disk usage: `df -h /workspace`
- Use essential tags: `--tags comfyui-essential` (~80GB vs ~137GB)
- Clear the cache: `rm -rf /workspace/huggingface_cache`

### Supervisor not running

- Check status: `bash scripts/status.sh`
- View logs: `cat /workspace/logs/supervisord.log`
- Start supervisor: `bash scripts/start-all.sh`
- Check for a stale PID: `rm -f /workspace/supervisord.pid`, then restart

### Service won't start

- Check supervisor status: `supervisorctl status`
- View service logs: `supervisorctl tail -f comfyui` or `supervisorctl tail -f orchestrator`
- Check error logs: `cat /workspace/logs/comfyui.err.log`
- Restart the service: `supervisorctl restart comfyui`
- Check if the port is in use: `ss -tulpn | grep :8188`

### Orchestrator not responding

- Check supervisor status: `supervisorctl status orchestrator`
- View logs: `supervisorctl tail -f orchestrator` or `cat /workspace/logs/orchestrator.err.log`
- Restart: `supervisorctl restart orchestrator`
- Manual start for debugging: `cd /workspace/ai && python3 model-orchestrator/orchestrator_subprocess.py`

## Performance Notes

- **Model switching time:** 30-120 seconds (depends on model size)
- **Text generation:** ~20-40 tokens/second (Qwen 2.5 7B on RTX 4090)
- **Image generation:** 4-5 seconds per image (FLUX Schnell)
- **Music generation:** 60-90 seconds for 30s of audio (MusicGen Medium)

## Important Conventions

- **Always use `orchestrator_subprocess.py`** - Not the Docker version
- **Sequential loading only** - One model active at a time on the 24GB GPU
- **Models downloaded by Ansible** - Use playbook tags, not manual downloads
- **Services run as processes** - Not systemd (RunPod containers don't support it)
- **Environment managed via `.env`** - Required: `HF_TOKEN`
- **Port 9000 for the orchestrator** - Model services use 8000+