# GPU-Enhanced AI Stack Expansion Plan
## Executive Summary
This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:
- **LLM Model Hosting**: Self-hosted models (Llama, Mistral, Qwen, etc.)
- **Model Training**: Fine-tuning and training workflows
- **Image Generation**: Stable Diffusion, FLUX via ComfyUI
- **Video Generation**: AnimateDiff, CogVideo, etc.
**Current Architecture**: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
**Target Architecture**: Hybrid stack with GPU server(s) for self-hosted models + API-based models
---
## Phase 1: Current Stack Analysis
### Existing Components
1. **ai_postgres** (pgvector/pgvector:pg16)
   - PostgreSQL with pgvector for RAG
   - Stores: conversations, embeddings, LiteLLM logs
2. **webui** (Open WebUI)
   - User-facing ChatGPT-like interface
   - URL: https://ai.pivoine.art
   - Features: RAG, web search, document upload
   - Connected to LiteLLM proxy
3. **litellm** (LiteLLM proxy)
   - Currently proxies Anthropic Claude API
   - OpenAI-compatible endpoint at http://litellm:4000
   - Supports multiple providers via config
4. **crawl4ai**
   - Internal web scraping for LLM content prep
   - Port 11235 (internal only)
5. **facefusion** (CPU-only)
   - Face swapping/enhancement
   - Currently CPU-based (slow)
   - Protected by Authelia SSO
### Current Limitations
- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage
---
## Phase 2: GPU Provider Comparison
### Provider Options
#### 1. **RunPod** ⭐ RECOMMENDED
**Pros:**
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence
**Pricing (Approximate):**
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)
**Best for:** On-demand workloads, experimentation, cost-conscious hosting
---
#### 2. **Lambda Labs**
**Pros:**
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network
**Pricing:**
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)
**Best for:** Research, high-utilization workloads
---
#### 3. **Vast.ai**
**Pros:**
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available
**Cons:**
- Variable reliability
- Setup complexity
- Community-hosted machines
**Pricing:**
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour
**Best for:** Budget-conscious, experimental workloads
---
#### 4. **Google Cloud Platform (GCP)**
**Pros:**
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available
**Pricing:**
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available
**Best for:** Enterprise workloads, auto-scaling needs
---
#### 5. **AWS**
**Pros:**
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support
**Pricing:**
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour
**Best for:** Enterprise, existing AWS infrastructure
---
#### 6. **Hugging Face Spaces / Inference Endpoints**
**Pros:**
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models
**Pricing:**
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour
**Best for:** Quick model deployment, serverless inference
---
### Recommendation: **RunPod** for Primary GPU Server
**Rationale:**
1. **Cost-effective**: Pay-per-second billing, ~$0.50/hour for RTX 4090
2. **Docker-native**: Easy integration with existing compose stack
3. **Flexibility**: Start/stop as needed, scale up for training
4. **Community**: Large user base, good documentation
5. **Network**: Built-in HTTP/SSH tunneling
**Supplementary**: Use Hugging Face for specific model hosting if needed
---
## Phase 3: Architecture Design
### Network Topology
```
┌────────────────────────────────────────────────────────────┐
│                pivoine.art VPS (CPU-based)                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐          │
│  │  Open    │─────▶│ LiteLLM  │◀────▶│   ai_    │          │
│  │  WebUI   │      │  Proxy   │      │ postgres │          │
│  └──────────┘      └──────────┘      └──────────┘          │
│       │                 │                                  │
└───────┼─────────────────┼──────────────────────────────────┘
        │                 │
        │                 ▼
        │        ┌─────────────────┐
        │        │  Anthropic API  │
        │        │    (Claude)     │
        │        └─────────────────┘
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│                    GPU Server (RunPod)                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │  vLLM    │  │ ComfyUI  │  │  Model   │  │ JupyterLab │  │
│  │  (LLMs)  │  │ (SD/FLUX)│  │ Training │  │            │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
│       │             │             │              │         │
│       └─────────────┴──────┬──────┴──────────────┘         │
│                            │                               │
│                    ┌───────────────┐                       │
│                    │ Model Storage │                       │
│                    │ (Persistent)  │                       │
│                    └───────────────┘                       │
│                                                            │
└────────────────────────────────────────────────────────────┘
             ▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│ Integration Options:                                       │
├────────────────────────────────────────────────────────────┤
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000)   │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art)    │
│ 3. Model storage synced via rclone/restic                  │
└────────────────────────────────────────────────────────────┘
```
### Connection Methods
#### Option A: WireGuard VPN (RECOMMENDED)
- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack
**Setup:**
1. Deploy WireGuard on GPU server
2. Add GPU server as VPN peer
3. Configure LiteLLM to use VPN IPs
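A minimal sketch of what steps 1-2 might look like from the GPU side, assuming the wg-easy defaults (10.8.0.0/24) and placeholder keys/endpoint; the real values come from wg-easy when the peer is created:
```bash
# /etc/wireguard/wg0.conf on the GPU server -- all values below are placeholders
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
PrivateKey = <gpu-server-private-key>
Address = 10.8.0.10/24

[Peer]
# wg-easy on the pivoine.art VPS
PublicKey = <vps-public-key>
Endpoint = <vps-public-ip>:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25
EOF

wg-quick up wg0      # bring the tunnel up
ping -c 3 10.8.0.1   # verify the VPS end is reachable over the VPN
```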
#### Option B: SSH Tunnel
- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency
#### Option C: Tailscale
- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)
---
## Phase 4: Service Implementation Plans
### 4.1 LLM Hosting with vLLM
**vLLM** is an industry-standard engine for high-performance LLM inference.
#### Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)
#### Docker Compose Configuration:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct  # or any model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Recommended Models for RTX 4090 (24GB):
**Text Generation:**
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)
**Code:**
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)

Note: most of the VRAM figures above assume roughly 8-bit quantized weights. In FP16 a 7-8B model already needs ~14-16GB and a 13-14B model ~26-28GB, so the larger entries only fit on a 24GB card (with headroom for the KV cache) when quantized.
#### Integration with LiteLLM:
Add to `ai/litellm-config.yaml`:
```yaml
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
```
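After LiteLLM reloads the config, the vLLM-backed model is reachable through the same OpenAI-compatible proxy Open WebUI already uses. A quick smoke test from the VPS (the master key variable stands in for whatever key is configured for the proxy):
```bash
curl http://litellm:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50
  }'
```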
---
### 4.2 ComfyUI for Image/Video Generation
**ComfyUI** is a node-based UI for Stable Diffusion with advanced workflows.
#### Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation
#### Docker Compose Configuration:
```yaml
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Model Downloads (via ComfyUI Manager):
**Stable Diffusion Models:**
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)
**ControlNet Models:**
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl
**LoRA Models** (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs
#### Traefik Integration:
Add subdomain routing for ComfyUI:
```yaml
labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
  - 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
  - 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
  - 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
  - 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
```
#### Open WebUI Integration:
ComfyUI has a REST API that can be called from Open WebUI using function calling.
Example workflow API call:
```python
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    workflow = {
        # ComfyUI workflow JSON
    }
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow}
    )
    return response.json()
```
---
### 4.3 Model Training Infrastructure
For fine-tuning LLMs and training custom models.
#### Option A: Axolotl (Recommended)
**Axolotl** is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training
```yaml
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-}  # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Training Workflow:
1. Prepare dataset (JSONL format)
2. Create Axolotl config (LoRA, batch size, epochs)
3. Start training container
4. Monitor via Weights & Biases or TensorBoard
5. Export LoRA adapters
6. Merge with base model or use in vLLM
#### Example Config:
```yaml
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
```
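The config above expects a completion-style dataset at `/workspace/data/train.jsonl`, i.e. one JSON object per line with a `text` field. A minimal sketch (the records themselves are made-up examples):
```bash
cat > /workspace/data/train.jsonl <<'EOF'
{"text": "### Question: What does the LiteLLM proxy do?\n### Answer: It exposes multiple LLM providers behind one OpenAI-compatible endpoint."}
{"text": "### Question: Which GPU is planned for inference?\n### Answer: A single RTX 4090 with 24GB of VRAM."}
EOF
```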
#### Option B: JupyterLab for Custom Training
For research and custom training scripts:
```yaml
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
---
### 4.4 Model Storage Strategy
#### Storage Requirements:
**Per Model Type:**
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each
**Total Estimated:**
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- **Total: 250-300GB minimum**
#### RunPod Storage Options:
1. **Network Volume** (Recommended)
   - Persistent across pod restarts
   - Shared between multiple pods
   - ~$0.10/GB/month
   - 500GB = $50/month
2. **Container Disk**
   - Included with pod
   - Lost when pod stops
   - Good for temporary storage
3. **External Storage (rclone)**
   - Sync to/from VPS or cloud storage
   - Backup models to Backblaze B2 or Wasabi
   - Good for disaster recovery
#### Model Management:
Use **Hugging Face Hub** as model cache:
```bash
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b
# Or let vLLM/ComfyUI auto-download
```
**Model Sync Script:**
```bash
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server
rclone sync \
/mnt/hidrive/AI/models \
gpu:/workspace/models \
--progress \
--transfers 4
```
---
## Phase 5: Implementation Roadmap
### Week 1: Infrastructure Setup
**Day 1-2: RunPod Account & GPU Server**
- [ ] Create RunPod account
- [ ] Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- [ ] Configure persistent network volume (500GB)
- [ ] Set up SSH access
**Day 3-4: Network Configuration**
- [ ] Deploy WireGuard on GPU server
- [ ] Add GPU server as peer to existing VPN (vpn/compose.yaml)
- [ ] Test connectivity between VPS and GPU server
- [ ] Configure firewall rules
**Day 5: Docker Setup on GPU Server**
- [ ] Install Docker + NVIDIA Container Toolkit
- [ ] Create docker-compose.yaml for GPU services
- [ ] Test GPU access in containers
---
### Week 2: LLM Hosting
**Day 1-2: vLLM Deployment**
- [ ] Deploy vLLM container
- [ ] Download Llama 3.1 8B Instruct
- [ ] Test inference locally
- [ ] Benchmark performance (tokens/sec)
**Day 3-4: LiteLLM Integration**
- [ ] Update litellm-config.yaml with vLLM endpoint
- [ ] Test via Open WebUI
- [ ] Configure model routing (cheap models → vLLM, complex → Claude)
- [ ] Set up usage monitoring
**Day 5: Model Expansion**
- [ ] Download Qwen 2.5 14B
- [ ] Download Mistral 7B Instruct
- [ ] Test model switching in Open WebUI
- [ ] Document performance characteristics
---
### Week 3: Image Generation
**Day 1-2: ComfyUI Setup**
- [ ] Deploy ComfyUI container
- [ ] Download FLUX.1-schnell
- [ ] Download SDXL
- [ ] Install ComfyUI Manager
**Day 3-4: Model Downloads**
- [ ] Download ControlNet models
- [ ] Download VAE models
- [ ] Download popular LoRAs from Civitai
- [ ] Organize model directory
**Day 5: Integration & Workflows**
- [ ] Create basic text-to-image workflow
- [ ] Create ControlNet workflow
- [ ] Test API access
- [ ] Add Traefik subdomain (comfy.ai.pivoine.art)
---
### Week 4: Training Infrastructure
**Day 1-2: Axolotl Setup**
- [ ] Deploy Axolotl container
- [ ] Create sample dataset
- [ ] Test LoRA fine-tuning with tiny model
- [ ] Verify GPU utilization
**Day 3-4: JupyterLab Setup**
- [ ] Deploy JupyterLab container
- [ ] Install ML libraries
- [ ] Create example notebooks
- [ ] Test custom training scripts
**Day 5: Documentation & Testing**
- [ ] Write training guides
- [ ] Test end-to-end workflows
- [ ] Benchmark training speeds
- [ ] Document best practices
---
### Ongoing: Optimization & Expansion
**Month 2:**
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user
**Month 3+:**
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
---
## Phase 6: Cost Analysis
### Scenario A: Single RTX 4090 (24/7)
**GPU Server (RunPod):**
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- **Subtotal: $410/month**
**VPS (Existing):**
- No change in cost
**Total: ~$410/month**
**Savings:**
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even if currently spending >$500/month on API calls
---
### Scenario B: Pay-as-you-go (8 hours/day)
**GPU Server (RunPod):**
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- **Subtotal: $170/month**
**Best for:**
- Development/experimentation
- Burst workloads
- Image generation on-demand
---
### Scenario C: Dual GPU (Training + Inference)
**GPU Server 1 (Inference):**
- RTX 4090 24/7: $360/month
**GPU Server 2 (Training - On-demand):**
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions
**Storage:**
- 1TB network volume: $100/month
**Total: ~$520/month**
---
### Cost Optimization Tips
1. **Auto-stop idle pods**: RunPod can auto-stop a pod after a configurable number of idle minutes (see the sketch after this list)
2. **Use spot instances**: ~50% cheaper, but pods can be interrupted
3. **Quantized models**: 4-bit weights use roughly 4x less VRAM than FP16 → cheaper GPUs
4. **Batch processing**: Queue image-generation jobs to maximize GPU usage
5. **Model sharing**: One vLLM instance can serve multiple models via LoRA adapters
6. **Monitoring**: Track per-model costs to optimize routing
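A rough sketch of tip 1 driven from inside the pod, assuming `runpodctl` is available and the `RUNPOD_POD_ID` environment variable is set (both are normally present on RunPod pods); the thresholds are arbitrary:
```bash
#!/bin/bash
# idle-stop.sh -- stop this pod after ~30 minutes of near-zero GPU utilization
IDLE_LIMIT=30
idle=0
while true; do
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  if [ "${util:-0}" -lt 5 ]; then
    idle=$((idle + 1))
  else
    idle=0
  fi
  if [ "$idle" -ge "$IDLE_LIMIT" ]; then
    runpodctl stop pod "$RUNPOD_POD_ID"   # assumes runpodctl and RUNPOD_POD_ID exist in the pod
    exit 0
  fi
  sleep 60
done
```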
---
## Phase 7: Monitoring & Operations
### Metrics to Track
**GPU Utilization:**
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage
**Model Performance:**
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch
**Costs:**
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown
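Most of the GPU-side metrics above can be sampled directly with nvidia-smi's query mode, for example:
```bash
# Log utilization, memory, temperature and power draw every 5 seconds
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5 | tee -a /var/log/gpu-metrics.csv
```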
### Monitoring Stack
**Option A: Netdata (Already deployed)**
Add GPU monitoring to existing Netdata:
```yaml
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: |
      bash -c "
      # Enable nvidia_smi plugin
      /usr/libexec/netdata/plugins.d/charts.d.plugin nvidia_smi
      "
```
**Option B: Prometheus + Grafana**
For detailed metrics:
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
```
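The compose file above mounts a `./prometheus.yml`; a minimal sketch that scrapes the DCGM exporter (which listens on port 9400 by default) could look like this:
```bash
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
EOF
```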
Import Grafana dashboard #12219 for GPU metrics.
---
## Phase 8: Backup & Disaster Recovery
### What to Backup
1. **Models** (250-300GB)
   - Base models can be re-downloaded
   - Custom fine-tuned models: CRITICAL
   - LoRAs: CRITICAL
2. **Training Data** (~10-50GB)
   - Datasets
   - Preprocessing scripts
3. **Configurations** (<1GB)
   - Docker compose files
   - Training configs
   - Workflow JSONs
### Backup Strategy
**Tier 1: Critical (Daily)**
- Fine-tuned models
- Training checkpoints
- Custom datasets
**Backup to:**
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)
```yaml
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
```
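For the Backblaze B2 leg, a hedged restic sketch (bucket name, credentials, and paths are placeholders; the HiDrive path already runs through the existing backrest setup):
```bash
# One-off backup of fine-tuned models and checkpoints to B2
export B2_ACCOUNT_ID=<key-id>            # placeholder
export B2_ACCOUNT_KEY=<application-key>  # placeholder
export RESTIC_PASSWORD=<repo-password>   # placeholder

restic -r b2:gpu-backups:restic init     # first run only
restic -r b2:gpu-backups:restic backup \
  /workspace/output \
  /workspace/models/loras \
  --tag gpu-critical
```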
**Tier 2: Nice-to-have (Weekly)**
- Base models (can re-download)
- ComfyUI outputs
**Tier 3: Ephemeral (No backup)**
- Inference cache
- Temporary generations
---
## Phase 9: Security Considerations
### GPU Server Security
1. **Firewall:**
   - Only allow the WireGuard port (51820)
   - All services accessed via VPN
   - No public exposure (see the ufw sketch after this list)
2. **SSH:**
   - Key-based auth only
   - Disable password auth
   - Change the default port
3. **Docker:**
   - Rootless Docker (optional but recommended)
   - Limited container capabilities
   - No privileged containers except for the NVIDIA runtime
4. **Secrets:**
   - Store API keys in .env
   - Use Docker secrets for sensitive data
   - Rotate keys periodically
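A minimal sketch of the firewall rules from item 1, assuming a host where ufw is available and WireGuard runs on wg0 (on RunPod pods you may not control the host firewall, in which case rely on the pod's port exposure settings instead):
```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow 51820/udp           # WireGuard
ufw allow in on wg0 to any    # services only reachable over the VPN
ufw allow 22/tcp              # keep SSH reachable until the tunnel is verified
ufw enable
```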
### Access Control
- **ComfyUI**: Protected by Authelia SSO (already configured)
- **vLLM**: Internal only, accessed via LiteLLM proxy
- **JupyterLab**: Password-protected or Authelia
- **Training**: No public access, VPN only
---
## Phase 10: Advanced Features (Future)
### Multi-GPU Scaling
**Tensor Parallelism** (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s
```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - '2'  # Use 2 GPUs
```
**Pipeline Parallelism** (training):
- Split model layers across GPUs
- Useful for very large models
### Model Serving Optimization
**vLLM Features:**
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)
**Example multi-LoRA:**
```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - --enable-lora
  - --max-loras
  - '4'
  - --lora-modules
  - customer-support=/models/loras/support-lora
  - creative-writing=/models/loras/writing-lora
```
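With `--lora-modules` registered, vLLM's OpenAI-compatible endpoint lets a client select an adapter by passing its name as the model, for example:
```bash
curl http://gpu.internal:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-support",
    "prompt": "A customer asks how to reset their password.",
    "max_tokens": 100
  }'
```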
### Video Generation
**AnimateDiff in ComfyUI:**
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos
**CogVideo:**
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p
### Voice Synthesis
**XTTS v2:**
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM
**Bark:**
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM
---
## Appendix A: Quick Start Commands
### Initial GPU Server Setup
```bash
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker

# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Deploy vLLM (Quick Test)
```bash
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm

# Run vLLM
docker run -d \
  --name vllm \
  --runtime=nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
```
### Deploy ComfyUI (Quick Test)
```bash
docker run -d \
  --name comfyui \
  --runtime=nvidia \
  --gpus all \
  -v /workspace/comfyui:/data \
  -p 8188:8188 \
  ghcr.io/ai-dock/comfyui:latest

# Access at http://gpu-ip:8188
```
---
## Appendix B: Sample Docker Compose (Full GPU Stack)
```yaml
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
```
---
## Appendix C: Cost Calculator
**Monthly GPU Costs:**
| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|-----------|------|--------|------------|---------|----------|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |
**Storage Costs:**
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month
**Total Estimated Monthly:**
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)
**Break-even Analysis:**
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs
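The same arithmetic as a small script, for plugging in other rates and duty cycles (storage priced at the $0.10/GB/month figure above):
```bash
#!/bin/bash
# gpu-cost.sh <dollars_per_hour> <hours_per_day> [storage_gb]
RATE=${1:-0.50}
HOURS=${2:-24}
STORAGE_GB=${3:-500}
awk -v r="$RATE" -v h="$HOURS" -v s="$STORAGE_GB" 'BEGIN {
  gpu = r * h * 30
  printf "GPU: $%.2f  Storage: $%.2f  Total: $%.2f/month\n", gpu, s * 0.10, gpu + s * 0.10
}'
```
For example, `./gpu-cost.sh 0.50 8 500` reproduces the ~$170/month figure from Scenario B.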
---
## Appendix D: Model Recommendations by Use Case
### General Chat (24/7 Inference)
**Best:** Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning
**Alternative:** Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM
### Code Generation
**Best:** Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support
**Alternative:** DeepSeek Coder 6.7B
- Faster, lighter
### Creative Writing
**Best:** Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well
### Image Generation (Realistic)
**Best:** FLUX.1-dev
- Highest quality
- Best prompt following
**Alternative:** SDXL + RealVisXL LoRA
- Faster generation
- Good quality
### Image Generation (Anime)
**Best:** SDXL + AnimagineXL LoRA
- Anime-specific training
- Vibrant colors
### Video Generation
**Best:** AnimateDiff + SDXL
- 16-frame clips
- Good quality
**Needs:** A100 40GB or better
---
## Next Steps
1. **Review this plan** and provide feedback
2. **Set budget** for GPU infrastructure
3. **Choose provider** (recommend RunPod)
4. **Define priority services** (LLM hosting first? Image gen first?)
5. **Schedule implementation** (4-week timeline above)
Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?
Let me know how you'd like to proceed! 🚀