# GPU-Enhanced AI Stack Expansion Plan
## Executive Summary
This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:
- **LLM Model Hosting**: Self-hosted models (Llama, Mistral, Qwen, etc.)
- **Model Training**: Fine-tuning and training workflows
- **Image Generation**: Stable Diffusion, FLUX via ComfyUI
- **Video Generation**: AnimateDiff, CogVideo, etc.
**Current Architecture**: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
**Target Architecture**: Hybrid stack with GPU server(s) for self-hosted models + API-based models
---
## Phase 1: Current Stack Analysis
### Existing Components
1. **ai_postgres** (pgvector/pgvector:pg16)
   - PostgreSQL with pgvector for RAG
   - Stores: conversations, embeddings, LiteLLM logs
2. **webui** (Open WebUI)
   - User-facing ChatGPT-like interface
   - URL: https://ai.pivoine.art
   - Features: RAG, web search, document upload
   - Connected to LiteLLM proxy
3. **litellm** (LiteLLM proxy)
   - Currently proxies Anthropic Claude API
   - OpenAI-compatible endpoint at http://litellm:4000
   - Supports multiple providers via config
4. **crawl4ai**
   - Internal web scraping for LLM content prep
   - Port 11235 (internal only)
5. **facefusion** (CPU-only)
   - Face swapping/enhancement
   - Currently CPU-based (slow)
   - Protected by Authelia SSO
### Current Limitations
- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage
---
## Phase 2: GPU Provider Comparison
### Provider Options
#### 1. **RunPod** ⭐ RECOMMENDED
**Pros:**
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence
**Pricing (Approximate):**
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)
**Best for:** On-demand workloads, experimentation, cost-conscious hosting
---
#### 2. **Lambda Labs**
**Pros:**
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network
**Pricing:**
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)
**Best for:** Research, high-utilization workloads
---
#### 3. **Vast.ai**
**Pros:**
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available
**Cons:**
- Variable reliability
- Setup complexity
- Community-hosted machines
**Pricing:**
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour
**Best for:** Budget-conscious, experimental workloads
---
#### 4. **Google Cloud Platform (GCP)**
**Pros:**
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available
**Pricing:**
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available
**Best for:** Enterprise workloads, auto-scaling needs
---
#### 5. **AWS**
**Pros:**
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support
**Pricing:**
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour
**Best for:** Enterprise, existing AWS infrastructure
---
#### 6. **Hugging Face Spaces / Inference Endpoints**
**Pros:**
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models
**Pricing:**
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour
**Best for:** Quick model deployment, serverless inference
---
### Recommendation: **RunPod** for Primary GPU Server
**Rationale:**
1. **Cost-effective**: Pay-per-second billing, ~$0.50/hour for RTX 4090
2. **Docker-native**: Easy integration with existing compose stack
3. **Flexibility**: Start/stop as needed, scale up for training
4. **Community**: Large user base, good documentation
5. **Network**: Built-in HTTP/SSH tunneling
**Supplementary**: Use Hugging Face for specific model hosting if needed
---
## Phase 3: Architecture Design
### Network Topology
```
┌─────────────────────────────────────────────────────────────┐
│ pivoine.art VPS (CPU-based) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Open │─────▶│ LiteLLM │◀────▶│ ai_ │ │
│ │ WebUI │ │ Proxy │ │ postgres │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ │ │ │
└───────┼──────────────────┼──────────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Anthropic API │
│ │ (Claude) │
│ └─────────────────┘
┌────────────────────────────────────────────────────────────┐
│ GPU Server (RunPod) │
├────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ vLLM │ │ ComfyUI │ │ Model │ │ JupyterLab│ │
│ │ (LLMs) │ │ (SD/FLUX)│ │ Training │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ └──────────────┴─────────────┴──────────────┘ │
│ │ │
│ ┌───────────────┐ │
│ │ Model Storage │ │
│ │ (Persistent) │ │
│ └───────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│ Integration Options: │
├────────────────────────────────────────────────────────────┤
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000) │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art) │
│ 3. Model storage synced via rclone/restic │
└────────────────────────────────────────────────────────────┘
```
### Connection Methods
#### Option A: WireGuard VPN (RECOMMENDED)
- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack
**Setup:**
1. Deploy WireGuard on GPU server
2. Add GPU server as VPN peer
3. Configure LiteLLM to use VPN IPs
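A minimal peer-config sketch for the GPU side (interface name, keys, and addresses are placeholders; match them to the existing wg-easy subnet):
```ini
# /etc/wireguard/wg0.conf on the GPU server (placeholder keys and IPs)
[Interface]
PrivateKey = <gpu-server-private-key>
Address = 10.8.0.10/24

[Peer]
# The pivoine.art VPS running wg-easy
PublicKey = <vps-public-key>
Endpoint = <vps-public-ip>:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25
```
Bring the tunnel up with `wg-quick up wg0` and verify the handshake with `wg show` before pointing LiteLLM at the private IP.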
#### Option B: SSH Tunnel
- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency
#### Option C: Tailscale
- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)
---
## Phase 4: Service Implementation Plans
### 4.1 LLM Hosting with vLLM
**vLLM** is the industry standard for high-performance LLM inference.
#### Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)
#### Docker Compose Configuration:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct # or any model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Recommended Models for RTX 4090 (24GB):
**Text Generation:**
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)
**Code:**
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)
#### Integration with LiteLLM:
Add to `ai/litellm-config.yaml`:
```yaml
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
```
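Once LiteLLM reloads the config, the new model can be tested end-to-end through the existing proxy. A quick check (the master key variable and model name are assumptions based on the config above):
```bash
# Chat completion routed by LiteLLM to the vLLM backend over the VPN
curl http://litellm:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50
  }'
```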
---
### 4.2 ComfyUI for Image/Video Generation
**ComfyUI** is a node-based UI for Stable Diffusion with advanced workflows.
#### Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation
#### Docker Compose Configuration:
```yaml
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Model Downloads (via ComfyUI Manager):
**Stable Diffusion Models:**
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)
**ControlNet Models:**
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl
**LoRA Models** (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs
#### Traefik Integration:
Add subdomain routing for ComfyUI:
```yaml
labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
  - 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
  - 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
  - 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
  - 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
```
#### Open WebUI Integration:
ComfyUI has a REST API that can be called from Open WebUI using function calling.
Example workflow API call:
```python
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    # Workflow graph exported from ComfyUI ("Save (API Format)"), with the
    # prompt/negative prompt substituted into the relevant text nodes.
    workflow = {
        # ComfyUI workflow JSON
    }
    # Queue the workflow on the ComfyUI server
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow},
    )
    return response.json()
```
---
### 4.3 Model Training Infrastructure
For fine-tuning LLMs and training custom models.
#### Option A: Axolotl (Recommended)
**Axolotl** is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training
```yaml
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-} # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
#### Training Workflow:
1. Prepare dataset (JSONL format)
2. Create Axolotl config (LoRA, batch size, epochs)
3. Start training container
4. Monitor via Weights & Biases or TensorBoard
5. Export LoRA adapters
6. Merge with base model or use in vLLM
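For step 1, a completion-style dataset is just one JSON object per line with a `text` field (matching `type: completion` / `field: text` in the example config below); the rows here are purely illustrative:
```jsonl
{"text": "### Instruction:\nSummarize the care guide for peonies.\n\n### Response:\nWater deeply once a week and deadhead spent blooms."}
{"text": "### Instruction:\nTranslate 'good morning' to French.\n\n### Response:\nBonjour."}
```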
#### Example Config:
```yaml
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
```
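For step 6 of the workflow, one way to merge the adapter so vLLM can load the result as an ordinary checkpoint is via PEFT (paths mirror the config above and are assumptions; run this inside the training container):
```python
# merge_lora.py - fold the trained LoRA adapter into the base model weights
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER = "/workspace/output/llama3-lora"      # adapter dir produced by Axolotl
OUT = "/workspace/output/llama3-lora-merged"   # plain checkpoint for vLLM

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()              # bake LoRA deltas into the base weights
merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)
```
Alternatively, skip the merge and serve the adapter directly via vLLM's multi-LoRA support (see Phase 10).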
#### Option B: JupyterLab for Custom Training
For research and custom training scripts:
```yaml
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
---
### 4.4 Model Storage Strategy
#### Storage Requirements:
**Per Model Type:**
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each
**Total Estimated:**
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- **Total: 250-300GB minimum**
#### RunPod Storage Options:
1. **Network Volume** (Recommended)
   - Persistent across pod restarts
   - Shared between multiple pods
   - ~$0.10/GB/month
   - 500GB = $50/month
2. **Container Disk**
   - Included with pod
   - Lost when pod stops
   - Good for temporary storage
3. **External Storage (rclone)**
   - Sync to/from VPS or cloud storage
   - Backup models to Backblaze B2 or Wasabi
   - Good for disaster recovery
#### Model Management:
Use **Hugging Face Hub** as model cache:
```bash
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b
# Or let vLLM/ComfyUI auto-download
```
**Model Sync Script:**
```bash
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server
rclone sync \
  /mnt/hidrive/AI/models \
  gpu:/workspace/models \
  --progress \
  --transfers 4
```
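The script assumes an rclone remote named `gpu` pointing at the GPU server; a sketch of defining it over SFTP (host, port, and key path are placeholders, and the exact `rclone config create` argument style can vary by rclone version):
```bash
# One-time: define the 'gpu' remote used by sync-models.sh
rclone config create gpu sftp \
  host <gpu-server-ip> \
  user root \
  port 22 \
  key_file ~/.ssh/id_ed25519
```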
---
## Phase 5: Implementation Roadmap
### Week 1: Infrastructure Setup
**Day 1-2: RunPod Account & GPU Server**
- [ ] Create RunPod account
- [ ] Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- [ ] Configure persistent network volume (500GB)
- [ ] Set up SSH access
**Day 3-4: Network Configuration**
- [ ] Deploy WireGuard on GPU server
- [ ] Add GPU server as peer to existing VPN (vpn/compose.yaml)
- [ ] Test connectivity between VPS and GPU server
- [ ] Configure firewall rules
**Day 5: Docker Setup on GPU Server**
- [ ] Install Docker + NVIDIA Container Toolkit
- [ ] Create docker-compose.yaml for GPU services
- [ ] Test GPU access in containers
---
### Week 2: LLM Hosting
**Day 1-2: vLLM Deployment**
- [ ] Deploy vLLM container
- [ ] Download Llama 3.1 8B Instruct
- [ ] Test inference locally
- [ ] Benchmark performance (tokens/sec)
**Day 3-4: LiteLLM Integration**
- [ ] Update litellm-config.yaml with vLLM endpoint
- [ ] Test via Open WebUI
- [ ] Configure model routing (cheap models → vLLM, complex → Claude)
- [ ] Set up usage monitoring
**Day 5: Model Expansion**
- [ ] Download Qwen 2.5 14B
- [ ] Download Mistral 7B Instruct
- [ ] Test model switching in Open WebUI
- [ ] Document performance characteristics
---
### Week 3: Image Generation
**Day 1-2: ComfyUI Setup**
- [ ] Deploy ComfyUI container
- [ ] Download FLUX.1-schnell
- [ ] Download SDXL
- [ ] Install ComfyUI Manager
**Day 3-4: Model Downloads**
- [ ] Download ControlNet models
- [ ] Download VAE models
- [ ] Download popular LoRAs from Civitai
- [ ] Organize model directory
**Day 5: Integration & Workflows**
- [ ] Create basic text-to-image workflow
- [ ] Create ControlNet workflow
- [ ] Test API access
- [ ] Add Traefik subdomain (comfy.ai.pivoine.art)
---
### Week 4: Training Infrastructure
**Day 1-2: Axolotl Setup**
- [ ] Deploy Axolotl container
- [ ] Create sample dataset
- [ ] Test LoRA fine-tuning with tiny model
- [ ] Verify GPU utilization
**Day 3-4: JupyterLab Setup**
- [ ] Deploy JupyterLab container
- [ ] Install ML libraries
- [ ] Create example notebooks
- [ ] Test custom training scripts
**Day 5: Documentation & Testing**
- [ ] Write training guides
- [ ] Test end-to-end workflows
- [ ] Benchmark training speeds
- [ ] Document best practices
---
### Ongoing: Optimization & Expansion
**Month 2:**
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user
**Month 3+:**
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
---
## Phase 6: Cost Analysis
### Scenario A: Single RTX 4090 (24/7)
**GPU Server (RunPod):**
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- **Subtotal: $410/month**
**VPS (Existing):**
- No change in cost
**Total: ~$410/month**
**Savings:**
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even at roughly $500/month of current API spend: shifting ~80% of it to self-hosted models offsets the ~$410/month GPU cost
---
### Scenario B: Pay-as-you-go (8 hours/day)
**GPU Server (RunPod):**
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- **Subtotal: $170/month**
**Best for:**
- Development/experimentation
- Burst workloads
- Image generation on-demand
---
### Scenario C: Dual GPU (Training + Inference)
**GPU Server 1 (Inference):**
- RTX 4090 24/7: $360/month
**GPU Server 2 (Training - On-demand):**
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions
**Storage:**
- 1TB network volume: $100/month
**Total: ~$520/month**
---
### Cost Optimization Tips
1. **Auto-stop idle pods**: RunPod can auto-stop after X minutes idle
2. **Use spot instances**: ~50% cheaper but can be interrupted
3. **Quantized models**: 4-bit models need roughly 4× less VRAM than FP16 → cheaper GPUs (see the AWQ example below)
4. **Batch processing**: Queue image gen jobs to maximize GPU usage
5. **Model sharing**: One vLLM instance can serve multiple LoRA adapters on a single base model
6. **Monitoring**: Track per-model costs to optimize routing
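As a sketch of tip 3, vLLM loads AWQ checkpoints with a single flag; the model repo below is an assumption and any AWQ build can be substituted:
```bash
# Serve a 4-bit AWQ build instead of the FP16 checkpoint (much lower VRAM footprint)
docker run -d --name vllm-awq --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 8192
```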
---
## Phase 7: Monitoring & Operations
### Metrics to Track
**GPU Utilization:**
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage
**Model Performance:**
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch
**Costs:**
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown
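Most of these GPU metrics can be pulled directly from `nvidia-smi` while the monitoring stack is being set up:
```bash
# Utilization, memory, temperature and power, refreshed every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5

# Per-process GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
```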
### Monitoring Stack
**Option A: Netdata (Already deployed)**
Add GPU monitoring to existing Netdata:
```yaml
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # Keep the default Netdata entrypoint; enable the nvidia_smi collector in the
    # agent configuration (it ships with Netdata) rather than overriding `command`.
```
**Option B: Prometheus + Grafana**
For detailed metrics:
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
```
Import Grafana dashboard #12219 for GPU metrics.
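The compose snippet above references `./prometheus.yml`; a minimal scrape config sketch, assuming dcgm-exporter's default metrics port 9400:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
```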
---
## Phase 8: Backup & Disaster Recovery
### What to Backup
1. **Models** (250-300GB)
   - Base models can be re-downloaded
   - Custom fine-tuned models: CRITICAL
   - LoRAs: CRITICAL
2. **Training Data** (~10-50GB)
   - Datasets
   - Preprocessing scripts
3. **Configurations** (<1GB)
   - Docker compose files
   - Training configs
   - Workflow JSONs
### Backup Strategy
**Tier 1: Critical (Daily)**
- Fine-tuned models
- Training checkpoints
- Custom datasets
**Backup to:**
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)
```yaml
# Add to core/compose.yaml backrest config (volumes list)
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
```
**Tier 2: Nice-to-have (Weekly)**
- Base models (can re-download)
- ComfyUI outputs
**Tier 3: Ephemeral (No backup)**
- Inference cache
- Temporary generations
---
## Phase 9: Security Considerations
### GPU Server Security
1. **Firewall:**
   - Only allow the WireGuard port (51820)
   - All services accessed via VPN
   - No public exposure (see the ufw sketch after this list)
2. **SSH:**
   - Key-based auth only
   - Disable password auth
   - Change default port
3. **Docker:**
   - Rootless Docker (optional but recommended)
   - Limited container capabilities
   - No privileged containers except for the NVIDIA runtime
4. **Secrets:**
   - Store API keys in .env
   - Use Docker secrets for sensitive data
   - Rotate keys periodically
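A minimal firewall sketch for item 1, assuming `ufw` is available on the GPU host (the SSH port is a placeholder for whatever non-default port is chosen):
```bash
# Default-deny inbound; allow only WireGuard and the custom SSH port
ufw default deny incoming
ufw default allow outgoing
ufw allow 51820/udp comment 'WireGuard'
ufw allow 2222/tcp comment 'SSH (non-default port)'
ufw enable
```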
### Access Control
- **ComfyUI**: Protected by Authelia SSO (already configured)
- **vLLM**: Internal only, accessed via LiteLLM proxy
- **JupyterLab**: Password-protected or Authelia
- **Training**: No public access, VPN only
---
## Phase 10: Advanced Features (Future)
### Multi-GPU Scaling
**Tensor Parallelism** (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s
```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - '2' # Use 2 GPUs
```
**Pipeline Parallelism** (training):
- Split model layers across GPUs
- Useful for very large models
### Model Serving Optimization
**vLLM Features:**
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)
**Example multi-LoRA:**
```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - --enable-lora
  - --max-loras
  - '4'
  - --lora-modules
  - customer-support=/models/loras/support-lora
  - creative-writing=/models/loras/writing-lora
```
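With multi-LoRA serving enabled, a request selects an adapter by passing its registered name as the model; a quick check against the endpoint above:
```bash
# Route a request through the 'customer-support' LoRA adapter
curl http://gpu.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-support",
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
    "max_tokens": 100
  }'
```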
### Video Generation
**AnimateDiff in ComfyUI:**
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos
**CogVideo:**
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p
### Voice Synthesis
**XTTS v2:**
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM
**Bark:**
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM
---
## Appendix A: Quick Start Commands
### Initial GPU Server Setup
```bash
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Deploy vLLM (Quick Test)
```bash
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm
# Run vLLM
docker run -d \
--name vllm \
--runtime=nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype auto \
--max-model-len 8192
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Once upon a time",
"max_tokens": 50
}'
```
### Deploy ComfyUI (Quick Test)
```bash
docker run -d \
--name comfyui \
--runtime=nvidia \
--gpus all \
-v /workspace/comfyui:/data \
-p 8188:8188 \
ghcr.io/ai-dock/comfyui:latest
# Access at http://gpu-ip:8188
```
---
## Appendix B: Sample Docker Compose (Full GPU Stack)
```yaml
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
```
---
## Appendix C: Cost Calculator
**Monthly GPU Costs:**
| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|-----------|------|--------|------------|---------|----------|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |
**Storage Costs:**
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month
**Total Estimated Monthly:**
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)
**Break-even Analysis:**
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs
---
## Appendix D: Model Recommendations by Use Case
### General Chat (24/7 Inference)
**Best:** Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning
**Alternative:** Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM
### Code Generation
**Best:** Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support
**Alternative:** DeepSeek Coder 6.7B
- Faster, lighter
### Creative Writing
**Best:** Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well
### Image Generation (Realistic)
**Best:** FLUX.1-dev
- Highest quality
- Best prompt following
**Alternative:** SDXL + RealVisXL LoRA
- Faster generation
- Good quality
### Image Generation (Anime)
**Best:** SDXL + AnimagineXL LoRA
- Anime-specific training
- Vibrant colors
### Video Generation
**Best:** AnimateDiff + SDXL
- 16-frame clips
- Good quality
**Needs:** A100 40GB or better
---
## Next Steps
1. **Review this plan** and provide feedback
2. **Set budget** for GPU infrastructure
3. **Choose provider** (recommend RunPod)
4. **Define priority services** (LLM hosting first? Image gen first?)
5. **Schedule implementation** (4-week timeline above)
Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?
Let me know how you'd like to proceed! 🚀