GPU-Enhanced AI Stack Expansion Plan
Executive Summary
This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:
- LLM Model Hosting: Self-hosted models (Llama, Mistral, Qwen, etc.)
- Model Training: Fine-tuning and training workflows
- Image Generation: Stable Diffusion, FLUX via ComfyUI
- Video Generation: AnimateDiff, CogVideo, etc.
Current Architecture: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
Target Architecture: Hybrid stack with GPU server(s) for self-hosted models + API-based models
Phase 1: Current Stack Analysis
Existing Components
- ai_postgres (pgvector/pgvector:pg16)
- PostgreSQL with pgvector for RAG
- Stores: conversations, embeddings, LiteLLM logs
- webui (Open WebUI)
- User-facing ChatGPT-like interface
- URL: https://ai.pivoine.art
- Features: RAG, web search, document upload
- Connected to LiteLLM proxy
- litellm (LiteLLM proxy)
- Currently proxies Anthropic Claude API
- OpenAI-compatible endpoint at http://litellm:4000
- Supports multiple providers via config
- crawl4ai
- Internal web scraping for LLM content prep
- Port 11235 (internal only)
- facefusion (CPU-only)
- Face swapping/enhancement
- Currently CPU-based (slow)
- Protected by Authelia SSO
Current Limitations
- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage
Phase 2: GPU Provider Comparison
Provider Options
1. RunPod ⭐ RECOMMENDED
Pros:
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence
Pricing (Approximate):
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)
Best for: On-demand workloads, experimentation, cost-conscious hosting
2. Lambda Labs
Pros:
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network
Pricing:
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)
Best for: Research, high-utilization workloads
3. Vast.ai
Pros:
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available
Cons:
- Variable reliability
- Setup complexity
- Community-hosted machines
Pricing:
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour
Best for: Budget-conscious, experimental workloads
4. Google Cloud Platform (GCP)
Pros:
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available
Pricing:
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available
Best for: Enterprise workloads, auto-scaling needs
5. AWS
Pros:
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support
Pricing:
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour
Best for: Enterprise, existing AWS infrastructure
6. Hugging Face Spaces / Inference Endpoints
Pros:
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models
Pricing:
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour
Best for: Quick model deployment, serverless inference
Recommendation: RunPod for Primary GPU Server
Rationale:
- Cost-effective: Pay-per-second billing, ~$0.50/hour for RTX 4090
- Docker-native: Easy integration with existing compose stack
- Flexibility: Start/stop as needed, scale up for training
- Community: Large user base, good documentation
- Network: Built-in HTTP/SSH tunneling
Supplementary: Use Hugging Face for specific model hosting if needed
Phase 3: Architecture Design
Network Topology
┌─────────────────────────────────────────────────────────────┐
│ pivoine.art VPS (CPU-based) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Open │─────▶│ LiteLLM │◀────▶│ ai_ │ │
│ │ WebUI │ │ Proxy │ │ postgres │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ │ │ │
└───────┼──────────────────┼──────────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Anthropic API │
│ │ (Claude) │
│ └─────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ GPU Server (RunPod) │
├────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ vLLM │ │ ComfyUI │ │ Model │ │ JupyterLab│ │
│ │ (LLMs) │ │ (SD/FLUX)│ │ Training │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ └──────────────┴─────────────┴──────────────┘ │
│ │ │
│ ┌───────────────┐ │
│ │ Model Storage │ │
│ │ (Persistent) │ │
│ └───────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
│
▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│ Integration Options: │
├────────────────────────────────────────────────────────────┤
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000) │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art) │
│ 3. Model storage synced via rclone/restic │
└────────────────────────────────────────────────────────────┘
Connection Methods
Option A: WireGuard VPN (RECOMMENDED)
- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack
Setup (config sketch below):
- Deploy WireGuard on GPU server
- Add GPU server as VPN peer
- Configure LiteLLM to use VPN IPs
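A minimal config sketch for the GPU-server side, assuming keys and addresses are replaced with real values from the existing wg-easy subnet:
# /etc/wireguard/wg0.conf on the GPU server -- placeholder keys and addresses
[Interface]
PrivateKey = <gpu-server-private-key>
Address = 10.8.0.10/24

[Peer]
# The pivoine.art VPS running wg-easy
PublicKey = <vps-public-key>
Endpoint = <vps-public-ip>:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25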
Option B: SSH Tunnel
- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency
Option C: Tailscale
- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)
Phase 4: Service Implementation Plans
4.1 LLM Hosting with vLLM
vLLM is an industry-standard engine for high-performance LLM inference.
Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)
Docker Compose Configuration:
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct  # or any model
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
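Once the container is running, the OpenAI-compatible endpoints can be sanity-checked directly on the GPU host (the model name must match the --model argument above):
# List the models the server exposes
curl http://localhost:8000/v1/models

# Minimal chat completion against the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}'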
Recommended Models for RTX 4090 (24GB):
Text Generation:
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)
Code:
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)
Note: VRAM figures for the 13B/14B models assume 8-bit or AWQ/GPTQ quantization; unquantized FP16 weights need roughly 2 GB per billion parameters.
Integration with LiteLLM:
Add to ai/litellm-config.yaml:
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
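A quick end-to-end check is to call the LiteLLM proxy with the standard OpenAI Python client, using the new model_name (the base URL and API key below are placeholders for this stack's LiteLLM endpoint and whatever key it is configured to accept):
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(
    base_url="http://litellm:4000/v1",   # LiteLLM's OpenAI-compatible endpoint
    api_key="sk-placeholder",            # key configured in LiteLLM
)

# "llama-3.1-8b" is the model_name defined above; LiteLLM forwards it to vLLM
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)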
4.2 ComfyUI for Image/Video Generation
ComfyUI is a node-based UI for Stable Diffusion with advanced workflows.
Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation
Docker Compose Configuration:
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Model Downloads (via ComfyUI Manager):
Stable Diffusion Models:
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)
ControlNet Models:
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl
LoRA Models (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs
Traefik Integration:
Add subdomain routing for ComfyUI:
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
- 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
- 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
- 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
- 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
Open WebUI Integration:
ComfyUI has a REST API that can be called from Open WebUI using function calling.
Example workflow API call:
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    workflow = {
        # ComfyUI workflow JSON (export from the UI via "Save (API Format)"),
        # with prompt/negative_prompt patched into the relevant text nodes
    }
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow},
    )
    return response.json()  # contains the queued prompt_id
4.3 Model Training Infrastructure
For fine-tuning LLMs and training custom models.
Option A: Axolotl (Recommended)
Axolotl is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-}  # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Training Workflow:
- Prepare dataset (JSONL format; example record below)
- Create Axolotl config (LoRA, batch size, epochs)
- Start training container
- Monitor via Weights & Biases or TensorBoard
- Export LoRA adapters
- Merge with base model or use in vLLM
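For step 1, a completion-type dataset (matching type: completion / field: text in the config below) is plain JSONL, one object per line; two illustrative, made-up records:
{"text": "### Instruction:\nSummarize the support ticket below.\n\n### Ticket:\nCustomer cannot log in after password reset.\n\n### Response:\nThe customer is locked out following a password reset; advise clearing cached credentials and retrying."}
{"text": "### Instruction:\nTranslate to French.\n\n### Input:\nThe server restarts nightly.\n\n### Response:\nLe serveur redémarre chaque nuit."}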
Example Config:
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
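A one-off run against this config could then be launched through the compose service defined earlier (a sketch; service name and paths as used in this plan):
# One-off fine-tuning run via the Axolotl compose service; the LoRA adapter
# ends up in ./training/output/llama3-lora on the host
docker compose run --rm axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-llama3.yaml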
Option B: JupyterLab for Custom Training
For research and custom training scripts:
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
4.4 Model Storage Strategy
Storage Requirements:
Per Model Type:
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each
Total Estimated:
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- Total: 250-300GB minimum
RunPod Storage Options:
- Network Volume (Recommended)
- Persistent across pod restarts
- Shared between multiple pods
- ~$0.10/GB/month
- 500GB = $50/month
- Container Disk
- Included with pod
- Lost when pod stops
- Good for temporary storage
- External Storage (rclone)
- Sync to/from VPS or cloud storage
- Backup models to Backblaze B2 or Wasabi
- Good for disaster recovery
Model Management:
Use Hugging Face Hub as model cache:
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b
# Or let vLLM/ComfyUI auto-download
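If models are pre-downloaded this way, vLLM can also be pointed at the local directory instead of the Hub ID, avoiding re-downloads after pod restarts (paths are illustrative):
# Serve the locally downloaded copy instead of pulling from the Hub
docker run -d --name vllm --runtime=nvidia --gpus all \
  -v /models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/llama-3.1-8b \
  --max-model-len 8192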
Model Sync Script:
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server
rclone sync \
/mnt/hidrive/AI/models \
gpu:/workspace/models \
--progress \
--transfers 4
Phase 5: Implementation Roadmap
Week 1: Infrastructure Setup
Day 1-2: RunPod Account & GPU Server
- Create RunPod account
- Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- Configure persistent network volume (500GB)
- Set up SSH access
Day 3-4: Network Configuration
- Deploy WireGuard on GPU server
- Add GPU server as peer to existing VPN (vpn/compose.yaml)
- Test connectivity between VPS and GPU server
- Configure firewall rules
Day 5: Docker Setup on GPU Server
- Install Docker + NVIDIA Container Toolkit
- Create docker-compose.yaml for GPU services
- Test GPU access in containers
Week 2: LLM Hosting
Day 1-2: vLLM Deployment
- Deploy vLLM container
- Download Llama 3.1 8B Instruct
- Test inference locally
- Benchmark performance (tokens/sec)
Day 3-4: LiteLLM Integration
- Update litellm-config.yaml with vLLM endpoint
- Test via Open WebUI
- Configure model routing (cheap models → vLLM, complex → Claude)
- Set up usage monitoring
Day 5: Model Expansion
- Download Qwen 2.5 14B
- Download Mistral 7B Instruct
- Test model switching in Open WebUI
- Document performance characteristics
Week 3: Image Generation
Day 1-2: ComfyUI Setup
- Deploy ComfyUI container
- Download FLUX.1-schnell
- Download SDXL
- Install ComfyUI Manager
Day 3-4: Model Downloads
- Download ControlNet models
- Download VAE models
- Download popular LoRAs from Civitai
- Organize model directory
Day 5: Integration & Workflows
- Create basic text-to-image workflow
- Create ControlNet workflow
- Test API access
- Add Traefik subdomain (comfy.ai.pivoine.art)
Week 4: Training Infrastructure
Day 1-2: Axolotl Setup
- Deploy Axolotl container
- Create sample dataset
- Test LoRA fine-tuning with tiny model
- Verify GPU utilization
Day 3-4: JupyterLab Setup
- Deploy JupyterLab container
- Install ML libraries
- Create example notebooks
- Test custom training scripts
Day 5: Documentation & Testing
- Write training guides
- Test end-to-end workflows
- Benchmark training speeds
- Document best practices
Ongoing: Optimization & Expansion
Month 2:
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user
Month 3+:
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
Phase 6: Cost Analysis
Scenario A: Single RTX 4090 (24/7)
GPU Server (RunPod):
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- Subtotal: $410/month
VPS (Existing):
- No change in cost
Total: ~$410/month
Savings:
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even if currently spending >$500/month on API calls
Scenario B: Pay-as-you-go (8 hours/day)
GPU Server (RunPod):
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- Subtotal: $170/month
Best for:
- Development/experimentation
- Burst workloads
- Image generation on-demand
Scenario C: Dual GPU (Training + Inference)
GPU Server 1 (Inference):
- RTX 4090 24/7: $360/month
GPU Server 2 (Training - On-demand):
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions
Storage:
- 1TB network volume: $100/month
Total: ~$520/month
Cost Optimization Tips
- Auto-stop idle pods: RunPod can auto-stop after X minutes idle
- Use spot instances: ~50% cheaper but can be interrupted
- Quantized models: 4-bit models use 4x less VRAM → cheaper GPUs
- Batch processing: Queue image gen jobs to maximize GPU usage
- Model sharing: One vLLM instance can serve multiple models via adapters
- Monitoring: Track per-model costs to optimize routing
Phase 7: Monitoring & Operations
Metrics to Track
GPU Utilization:
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage
Model Performance:
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch
Costs:
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown
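Before any dashboards exist, nvidia-smi can already stream the key metrics for ad-hoc checks:
# Log utilization, memory, temperature and power every 5 seconds
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5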
Monitoring Stack
Option A: Netdata (Already deployed)
Add GPU monitoring to existing Netdata:
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # Do not override the entrypoint to run the nvidia_smi plugin by hand;
    # enable Netdata's nvidia_smi collector in its collector config instead
    # (nvidia-smi is injected into the container by the NVIDIA runtime).
Option B: Prometheus + Grafana
For detailed metrics:
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
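A minimal prometheus.yml for this stack only needs a scrape job for the DCGM exporter (it listens on port 9400 by default; the target assumes the compose service name above):
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']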
Import Grafana dashboard #12219 for GPU metrics.
Phase 8: Backup & Disaster Recovery
What to Backup
- Models (250-300GB)
- Base models can be re-downloaded
- Custom fine-tuned models: CRITICAL
- LoRAs: CRITICAL
- Training Data (~10-50GB)
- Datasets
- Preprocessing scripts
- Configurations (<1GB)
- Docker compose files
- Training configs
- Workflow JSONs
Backup Strategy
Tier 1: Critical (Daily)
- Fine-tuned models
- Training checkpoints
- Custom datasets
Backup to:
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
Tier 2: Nice-to-have (Weekly)
- Base models (can re-download)
- ComfyUI outputs
Tier 3: Ephemeral (No backup)
- Inference cache
- Temporary generations
Phase 9: Security Considerations
GPU Server Security
- Firewall:
- Only allow WireGuard port (51820)
- All services accessed via VPN
- No public exposure
- SSH:
- Key-based auth only
- Disable password auth
- Change default port
- Docker:
- Rootless Docker (optional but recommended)
- Limited container capabilities
- No privileged containers except for nvidia-runtime
- Secrets:
- Store API keys in .env
- Use Docker secrets for sensitive data
- Rotate keys periodically
Access Control
- ComfyUI: Protected by Authelia SSO (already configured)
- vLLM: Internal only, accessed via LiteLLM proxy
- JupyterLab: Password-protected or Authelia
- Training: No public access, VPN only
Phase 10: Advanced Features (Future)
Multi-GPU Scaling
Tensor Parallelism (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s
command:
- --model
- meta-llama/Meta-Llama-3.1-70B-Instruct
- --tensor-parallel-size
- '2' # Use 2 GPUs
Pipeline Parallelism (training):
- Split model layers across GPUs
- Useful for very large models
Model Serving Optimization
vLLM Features:
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)
Example multi-LoRA:
command:
- --model
- meta-llama/Meta-Llama-3.1-8B-Instruct
- --enable-lora
- --max-loras
- '4'
- --lora-modules
- customer-support=/models/loras/support-lora
- creative-writing=/models/loras/writing-lora
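With multi-LoRA serving enabled, an adapter is selected per request by using its registered name as the model (names as defined in the flags above):
# Route a request to the "customer-support" adapter on the shared base model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-support",
       "messages": [{"role": "user", "content": "My invoice looks wrong, what should I do?"}]}'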
Video Generation
AnimateDiff in ComfyUI:
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos
CogVideo:
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p
Voice Synthesis
XTTS v2:
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM
Bark:
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM
Appendix A: Quick Start Commands
Initial GPU Server Setup
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Deploy vLLM (Quick Test)
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm
# Run vLLM
docker run -d \
--name vllm \
--runtime=nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype auto \
--max-model-len 8192
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Once upon a time",
"max_tokens": 50
}'
Deploy ComfyUI (Quick Test)
docker run -d \
--name comfyui \
--runtime=nvidia \
--gpus all \
-v /workspace/comfyui:/data \
-p 8188:8188 \
ghcr.io/ai-dock/comfyui:latest
# Access at http://gpu-ip:8188
Appendix B: Sample Docker Compose (Full GPU Stack)
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
Appendix C: Cost Calculator
Monthly GPU Costs:
| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|---|---|---|---|---|---|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |
Storage Costs:
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month
Total Estimated Monthly:
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)
Break-even Analysis (quick calculator below):
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs
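The break-even math can be sanity-checked in a few lines (rates mirror the table above; adjust to current RunPod pricing):
# Rough monthly cost / break-even helper (illustrative rates)
def gpu_month_cost(rate_per_hour: float, hours_per_day: float, storage_gb: int = 500) -> float:
    compute = rate_per_hour * hours_per_day * 30   # ~30 billable days/month
    storage = storage_gb * 0.10                    # $0.10/GB/month network volume
    return compute + storage

current_api_spend = 500.0                          # example monthly API bill

for label, hours in [("24/7", 24), ("8h/day", 8)]:
    cost = gpu_month_cost(0.50, hours)             # RTX 4090 at ~$0.50/hour
    print(f"{label}: ${cost:.0f}/month -> saves ${current_api_spend - cost:.0f} vs API")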
Appendix D: Model Recommendations by Use Case
General Chat (24/7 Inference)
Best: Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning
Alternative: Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM
Code Generation
Best: Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support
Alternative: DeepSeek Coder 6.7B
- Faster, lighter
Creative Writing
Best: Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well
Image Generation (Realistic)
Best: FLUX.1-dev
- Highest quality
- Best prompt following
Alternative: RealVisXL (realism-focused SDXL fine-tune)
- Faster generation
- Good quality
Image Generation (Anime)
Best: Animagine XL (anime-focused SDXL fine-tune)
- Anime-specific training
- Vibrant colors
Video Generation
Best: AnimateDiff + SDXL
- 16-frame clips
- Good quality
Needs: A100 40GB or better
Next Steps
- Review this plan and provide feedback
- Set budget for GPU infrastructure
- Choose provider (recommend RunPod)
- Define priority services (LLM hosting first? Image gen first?)
- Schedule implementation (4-week timeline above)
Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?
Let me know how you'd like to proceed! 🚀