GPU-Enhanced AI Stack Expansion Plan

Executive Summary

This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:

  • LLM Model Hosting: Self-hosted models (Llama, Mistral, Qwen, etc.)
  • Model Training: Fine-tuning and training workflows
  • Image Generation: Stable Diffusion, FLUX via ComfyUI
  • Video Generation: AnimateDiff, CogVideo, etc.

Current Architecture: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
Target Architecture: Hybrid stack with GPU server(s) for self-hosted models + API-based models


Phase 1: Current Stack Analysis

Existing Components

  1. ai_postgres (pgvector/pgvector:pg16)

    • PostgreSQL with pgvector for RAG
    • Stores: conversations, embeddings, LiteLLM logs
  2. webui (Open WebUI)

    • User-facing ChatGPT-like interface
    • URL: https://ai.pivoine.art
    • Features: RAG, web search, document upload
    • Connected to LiteLLM proxy
  3. litellm (LiteLLM proxy)

    • Currently proxies Anthropic Claude API
    • OpenAI-compatible endpoint at http://litellm:4000
    • Supports multiple providers via config
  4. crawl4ai

    • Internal web scraping for LLM content prep
    • Port 11235 (internal only)
  5. facefusion (CPU-only)

    • Face swapping/enhancement
    • Currently CPU-based (slow)
    • Protected by Authelia SSO

Current Limitations

  • No self-hosted LLMs (relies on expensive API calls)
  • No GPU acceleration for facefusion
  • No image generation capabilities
  • No model training/fine-tuning capabilities
  • No video generation
  • High operational costs for API usage

Phase 2: GPU Provider Comparison

Provider Options

1. RunPod

Pros:

  • Pay-per-second GPU billing
  • Wide GPU selection (RTX 4090, A100, H100)
  • Docker-first platform
  • Global locations
  • Easy HTTP/SSH tunneling
  • Volume persistence

Pricing (Approximate):

  • RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
  • RTX 3090 (24GB): ~$0.35/hour ($250/month)
  • A6000 (48GB): ~$0.80/hour ($576/month)
  • A100 (40GB): ~$1.50/hour ($1,080/month)

Best for: On-demand workloads, experimentation, cost-conscious hosting


2. Lambda Labs

Pros:

  • Simple flat-rate pricing
  • High-end GPUs (A100, H100)
  • Jupyter notebooks included
  • Fast network

Pricing:

  • 1x A100 (40GB): $1.10/hour ($792/month)
  • 8x A100 (40GB): $8.00/hour (~$5,760/month)

Best for: Research, high-utilization workloads


3. Vast.ai

Pros:

  • Marketplace model (cheapest)
  • Many GPU options
  • Spot pricing available

Cons:

  • Variable reliability
  • Setup complexity
  • Community-hosted machines

Pricing:

  • RTX 4090: ~$0.25-0.40/hour
  • A100: ~$0.80-1.20/hour

Best for: Budget-conscious, experimental workloads


4. Google Cloud Platform (GCP)

Pros:

  • Enterprise reliability
  • Auto-scaling
  • Integration with Google services
  • Preemptible instances available

Pricing:

  • T4 (16GB): ~$0.35/hour
  • V100 (16GB): ~$2.48/hour
  • A100 (40GB): ~$2.93/hour
  • TPU options available

Best for: Enterprise workloads, auto-scaling needs


5. AWS

Pros:

  • Global infrastructure
  • Broad GPU selection
  • Spot instances for cost savings
  • Enterprise support

Pricing:

  • g4dn.xlarge (T4 16GB): ~$0.526/hour
  • p3.2xlarge (V100 16GB): ~$3.06/hour
  • p4d.24xlarge (8x A100 40GB): ~$32.77/hour

Best for: Enterprise, existing AWS infrastructure


6. Hugging Face Spaces / Inference Endpoints

Pros:

  • Managed model hosting
  • Auto-scaling
  • Simple deployment
  • Community models

Pricing:

  • CPU: $0.03/hour
  • T4: $0.60/hour
  • A10G: $1.00/hour
  • A100: $4.00/hour

Best for: Quick model deployment, serverless inference


Recommendation: RunPod for Primary GPU Server

Rationale:

  1. Cost-effective: Pay-per-second billing, ~$0.50/hour for RTX 4090
  2. Docker-native: Easy integration with existing compose stack
  3. Flexibility: Start/stop as needed, scale up for training
  4. Community: Large user base, good documentation
  5. Network: Built-in HTTP/SSH tunneling

Supplementary: Use Hugging Face for specific model hosting if needed


Phase 3: Architecture Design

Network Topology

┌─────────────────────────────────────────────────────────────┐
│ pivoine.art VPS (CPU-based)                                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐         │
│  │ Open     │─────▶│ LiteLLM  │◀────▶│ ai_      │         │
│  │ WebUI    │      │ Proxy    │      │ postgres │         │
│  └──────────┘      └──────────┘      └──────────┘         │
│       │                  │                                  │
│       │                  │                                  │
└───────┼──────────────────┼──────────────────────────────────┘
        │                  │
        │                  ▼
        │         ┌─────────────────┐
        │         │ Anthropic API   │
        │         │ (Claude)        │
        │         └─────────────────┘
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│ GPU Server (RunPod)                                        │
├────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │ vLLM     │  │ ComfyUI  │  │ Model    │  │ JupyterLab│ │
│  │ (LLMs)   │  │ (SD/FLUX)│  │ Training │  │           │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
│       │              │             │              │       │
│       └──────────────┴─────────────┴──────────────┘       │
│                      │                                     │
│              ┌───────────────┐                            │
│              │ Model Storage │                            │
│              │ (Persistent)  │                            │
│              └───────────────┘                            │
│                                                            │
└────────────────────────────────────────────────────────────┘
         │
         ▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│ Integration Options:                                       │
├────────────────────────────────────────────────────────────┤
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000)  │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art)   │
│ 3. Model storage synced via rclone/restic                 │
└────────────────────────────────────────────────────────────┘

Connection Methods

Option A: WireGuard VPN (Recommended)

  • Create WireGuard tunnel between VPS and GPU server
  • GPU services accessible via private IPs
  • Secure, low overhead, easy to manage
  • Already have wg-easy in your stack

Setup:

  1. Deploy WireGuard on GPU server
  2. Add GPU server as VPN peer
  3. Configure LiteLLM to use VPN IPs
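
A minimal sketch of the GPU-side peer configuration, assuming the wg-easy network on the VPS uses 10.8.0.0/24; the placeholder keys, the 10.8.0.10 address, and the vpn.pivoine.art:51820 endpoint are assumptions to be replaced with the real values from wg-easy:

# Install WireGuard on the GPU server
apt-get update && apt-get install -y wireguard

# Placeholder peer config; copy keys/addresses from the wg-easy admin UI
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
Address = 10.8.0.10/24
PrivateKey = <gpu-server-private-key>

[Peer]
PublicKey = <vps-wireguard-public-key>
Endpoint = vpn.pivoine.art:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25
EOF

wg-quick up wg0
ping -c 3 10.8.0.1   # VPS end of the tunnel (assumed address)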

Option B: SSH Tunnel

  • SSH reverse tunnel from GPU to VPS
  • Simple, no additional software
  • Higher latency

Option C: Tailscale

  • Zero-config VPN mesh
  • Easy setup, good UX
  • Proprietary (but free tier available)

Phase 4: Service Implementation Plans

4.1 LLM Hosting with vLLM

vLLM is an industry-standard engine for high-performance LLM inference.

Features:

  • PagedAttention for efficient KV cache
  • Continuous batching
  • OpenAI-compatible API
  • Tensor parallelism for multi-GPU
  • Quantization support (AWQ, GPTQ)

Docker Compose Configuration:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct  # or any model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
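
Once the container is up, a quick sanity check against the OpenAI-compatible endpoint (run on the GPU server):

# The server only answers once the model weights have finished loading
curl http://localhost:8000/v1/models

# Expect meta-llama/Meta-Llama-3.1-8B-Instruct in the "data" array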

Recommended Models:

Text Generation:

  • Llama 3.1 8B Instruct (8GB VRAM, fast)
  • Qwen2.5 14B Instruct (14GB VRAM, multilingual)
  • Mistral 7B Instruct v0.3 (7GB VRAM)
  • Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)

Code:

  • DeepSeek Coder 6.7B (7GB VRAM)
  • CodeLlama 13B (13GB VRAM)
  • Qwen2.5-Coder 14B (14GB VRAM)

Integration with LiteLLM:

Add to ai/litellm-config.yaml:

model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
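
To verify the routing end to end, send a chat completion through the LiteLLM proxy from a container on the same Docker network (the LITELLM_MASTER_KEY variable name is an assumption; use whatever key the existing proxy is configured with):

curl http://litellm:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in German."}],
    "max_tokens": 50
  }'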

4.2 ComfyUI for Image/Video Generation

ComfyUI is a node-based UI for Stable Diffusion with advanced workflows.

Features:

  • Node-based workflow editor
  • Support for SD 1.5, SDXL, SD3, FLUX
  • ControlNet, LoRA, embeddings
  • Video generation (AnimateDiff, SVD)
  • API for automation

Docker Compose Configuration:

services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Model Downloads (via ComfyUI Manager):

Stable Diffusion Models:

  • FLUX.1-dev (12GB, newest, best quality)
  • FLUX.1-schnell (12GB, fast)
  • SDXL Base 1.0 (6.9GB)
  • SD 1.5 (4GB, fast, wide LoRA support)

ControlNet Models:

  • controlnet-canny-sdxl
  • controlnet-depth-sdxl
  • controlnet-openpose-sdxl

LoRA Models (download from Civitai):

  • Style LoRAs (anime, realistic, etc.)
  • Character LoRAs
  • Concept LoRAs

Traefik Integration:

Add subdomain routing for ComfyUI:

labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
  - 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
  - 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
  - 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
  - 'traefik.http.services.comfyui.loadbalancer.server.port=8188'

Open WebUI Integration:

ComfyUI has a REST API that can be called from Open WebUI using function calling.

Example workflow API call:

import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    # Export a workflow from the ComfyUI editor ("Save (API Format)") and
    # paste it here, substituting prompt/negative_prompt into the relevant
    # node inputs.
    workflow = {
        # ComfyUI workflow JSON
    }
    # Queue the workflow; the response contains a prompt_id that can be
    # polled via GET /history/<prompt_id> to fetch the finished images.
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

4.3 Model Training Infrastructure

For fine-tuning LLMs and training custom models.

Option A: Axolotl (Recommended)

Axolotl is a user-friendly fine-tuning framework supporting:

  • LoRA, QLoRA
  • Full fine-tuning
  • RLHF/DPO
  • Multi-GPU training

Docker Compose Configuration:

services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-}  # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Training Workflow:

  1. Prepare dataset (JSONL format; see the example after this list)
  2. Create Axolotl config (LoRA, batch size, epochs)
  3. Start training container
  4. Monitor via Weights & Biases or TensorBoard
  5. Export LoRA adapters
  6. Merge with base model or use in vLLM
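
A minimal example of the dataset format, matching the type: completion / field: text settings in the config below (the sample text is purely illustrative):

cat > training/data/train.jsonl <<'EOF'
{"text": "### Question: What is WireGuard?\n### Answer: WireGuard is a modern, lightweight VPN protocol."}
{"text": "### Question: What does vLLM do?\n### Answer: vLLM is a high-throughput inference engine for LLMs."}
EOF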

Example Config:

# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002

optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
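
With the config and dataset in place, a one-off training run can be launched without keeping the container running (a sketch; it assumes the compose file above is in the current directory and that the image's entrypoint accepts a plain command):

docker compose run --rm axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-llama3.yaml

# LoRA adapters end up in ./training/output/llama3-lora on the host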

Option B: JupyterLab for Custom Training

For research and custom training scripts:

services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

4.4 Model Storage Strategy

Storage Requirements:

Per Model Type:

  • LLM 7B: ~14GB (FP16)
  • LLM 13B: ~26GB
  • SDXL: ~7GB
  • FLUX: ~12GB
  • ControlNet: ~2.5GB each
  • LoRA: ~100-500MB each

Total Estimated:

  • 3-4 LLMs: ~80GB
  • SD models + LoRAs: ~50GB
  • Training checkpoints: ~100GB
  • Total: 250-300GB minimum

RunPod Storage Options:

  1. Network Volume (Recommended)

    • Persistent across pod restarts
    • Shared between multiple pods
    • ~$0.10/GB/month
    • 500GB = $50/month
  2. Container Disk

    • Included with pod
    • Lost when pod stops
    • Good for temporary storage
  3. External Storage (rclone)

    • Sync to/from VPS or cloud storage
    • Backup models to Backblaze B2 or Wasabi
    • Good for disaster recovery

Model Management:

Use Hugging Face Hub as model cache:

# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --local-dir /models/llama-3.1-8b

# Or let vLLM/ComfyUI auto-download

Model Sync Script:

#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server

rclone sync \
  /mnt/hidrive/AI/models \
  gpu:/workspace/models \
  --progress \
  --transfers 4

Phase 5: Implementation Roadmap

Week 1: Infrastructure Setup

Day 1-2: RunPod Account & GPU Server

  • Create RunPod account
  • Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
  • Configure persistent network volume (500GB)
  • Set up SSH access

Day 3-4: Network Configuration

  • Deploy WireGuard on GPU server
  • Add GPU server as peer to existing VPN (vpn/compose.yaml)
  • Test connectivity between VPS and GPU server
  • Configure firewall rules

Day 5: Docker Setup on GPU Server

  • Install Docker + NVIDIA Container Toolkit
  • Create docker-compose.yaml for GPU services
  • Test GPU access in containers

Week 2: LLM Hosting

Day 1-2: vLLM Deployment

  • Deploy vLLM container
  • Download Llama 3.1 8B Instruct
  • Test inference locally
  • Benchmark performance (tokens/sec)
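
A rough single-request throughput check against the vLLM endpoint (assumes jq and bc are installed; for meaningful numbers, use a proper benchmark with concurrent requests):

START=$(date +%s.%N)
RESPONSE=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "prompt": "Explain PagedAttention in two sentences.",
       "max_tokens": 256}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESPONSE" | jq '.usage.completion_tokens')
echo "$TOKENS tokens, ~$(echo "scale=1; $TOKENS / ($END - $START)" | bc) tokens/sec"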

Day 3-4: LiteLLM Integration

  • Update litellm-config.yaml with vLLM endpoint
  • Test via Open WebUI
  • Configure model routing (cheap models → vLLM, complex → Claude)
  • Set up usage monitoring

Day 5: Model Expansion

  • Download Qwen 2.5 14B
  • Download Mistral 7B Instruct
  • Test model switching in Open WebUI
  • Document performance characteristics

Week 3: Image Generation

Day 1-2: ComfyUI Setup

  • Deploy ComfyUI container
  • Download FLUX.1-schnell
  • Download SDXL
  • Install ComfyUI Manager

Day 3-4: Model Downloads

  • Download ControlNet models
  • Download VAE models
  • Download popular LoRAs from Civitai
  • Organize model directory

Day 5: Integration & Workflows

  • Create basic text-to-image workflow
  • Create ControlNet workflow
  • Test API access
  • Add Traefik subdomain (comfy.ai.pivoine.art)

Week 4: Training Infrastructure

Day 1-2: Axolotl Setup

  • Deploy Axolotl container
  • Create sample dataset
  • Test LoRA fine-tuning with tiny model
  • Verify GPU utilization

Day 3-4: JupyterLab Setup

  • Deploy JupyterLab container
  • Install ML libraries
  • Create example notebooks
  • Test custom training scripts

Day 5: Documentation & Testing

  • Write training guides
  • Test end-to-end workflows
  • Benchmark training speeds
  • Document best practices

Ongoing: Optimization & Expansion

Month 2:

  • Monitor costs and optimize GPU utilization
  • Implement model caching strategies
  • Add more models based on usage patterns
  • Set up automated model updates
  • Implement usage quotas per user

Month 3+:

  • Consider multi-GPU setup for larger models
  • Implement model quantization (AWQ/GPTQ)
  • Add video generation (AnimateDiff, CogVideo)
  • Explore voice synthesis (XTTS, Bark)
  • Custom model training for specific use cases

Phase 6: Cost Analysis

Scenario A: Single RTX 4090 (24/7)

GPU Server (RunPod):

  • RTX 4090 pod: $0.50/hour × 720 hours = $360/month
  • 500GB network volume: $50/month
  • Subtotal: $410/month

VPS (Existing):

  • No change in cost

Total: ~$410/month

Savings:

  • Claude API costs reduced by ~80% (self-hosted for routine tasks)
  • Break-even if currently spending >$500/month on API calls

Scenario B: Pay-as-you-go (8 hours/day)

GPU Server (RunPod):

  • RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
  • Storage: $50/month
  • Subtotal: $170/month

Best for:

  • Development/experimentation
  • Burst workloads
  • Image generation on-demand

Scenario C: Dual GPU (Training + Inference)

GPU Server 1 (Inference):

  • RTX 4090 24/7: $360/month

GPU Server 2 (Training - On-demand):

  • A100 40GB: $1.50/hour × 40 hours/month = $60/month
  • Used only for fine-tuning sessions

Storage:

  • 1TB network volume: $100/month

Total: ~$520/month


Cost Optimization Tips

  1. Auto-stop idle pods: RunPod can auto-stop after X minutes idle
  2. Use spot instances: ~50% cheaper but can be interrupted
  3. Quantized models: 4-bit models use 4x less VRAM → cheaper GPUs
  4. Batch processing: Queue image gen jobs to maximize GPU usage
  5. Model sharing: One vLLM instance can serve multiple models via adapters
  6. Monitoring: Track per-model costs to optimize routing

Phase 7: Monitoring & Operations

Metrics to Track

GPU Utilization:

  • nvidia-smi metrics (utilization %, memory usage)
  • Temperature and power draw
  • Per-process GPU usage
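
The metrics above can be sampled directly with nvidia-smi:

# Utilization, memory, temperature, power every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5

# Per-process GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv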

Model Performance:

  • Tokens per second (LLM inference)
  • Images per second (SD/FLUX)
  • Training time per epoch

Costs:

  • GPU hours consumed
  • Storage usage
  • API vs self-hosted breakdown

Monitoring Stack

Option A: Netdata (Already deployed)

Add GPU monitoring to existing Netdata:

# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # No command override: with the NVIDIA runtime exposing nvidia-smi inside
    # the container, Netdata's nvidia_smi collector should detect the GPU
    # automatically (enable it in go.d.conf if it does not).

Option B: Prometheus + Grafana

For detailed metrics:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
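
A minimal prometheus.yml to pair with the compose snippet above (dcgm-exporter listens on port 9400 by default):

cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
EOF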

Import Grafana dashboard #12219 for GPU metrics.


Phase 8: Backup & Disaster Recovery

What to Backup

  1. Models (250-300GB)

    • Base models can be re-downloaded
    • Custom fine-tuned models: CRITICAL
    • LoRAs: CRITICAL
  2. Training Data (~10-50GB)

    • Datasets
    • Preprocessing scripts
  3. Configurations (<1GB)

    • Docker compose files
    • Training configs
    • Workflow JSONs

Backup Strategy

Tier 1: Critical (Daily)

  • Fine-tuned models
  • Training checkpoints
  • Custom datasets

Backup to:

  • Restic → HiDrive (already configured)
  • Backblaze B2 (~$6/TB/month)
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
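
A sketch of the off-site copy with restic to Backblaze B2 (bucket name and paths are illustrative; B2_ACCOUNT_ID, B2_ACCOUNT_KEY and RESTIC_PASSWORD must be set):

export B2_ACCOUNT_ID=<key-id>
export B2_ACCOUNT_KEY=<application-key>
export RESTIC_PASSWORD=<repo-password>

restic -r b2:ai-gpu-backups:models init     # first run only
restic -r b2:ai-gpu-backups:models backup \
  /workspace/models/loras \
  /workspace/training/output                # fine-tuned adapters + checkpoints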

Tier 2: Nice-to-have (Weekly)

  • Base models (can re-download)
  • ComfyUI outputs

Tier 3: Ephemeral (No backup)

  • Inference cache
  • Temporary generations

Phase 9: Security Considerations

GPU Server Security

  1. Firewall:

    • Only allow WireGuard port (51820)
    • All services accessed via VPN
    • No public exposure (see the firewall sketch after this list)
  2. SSH:

    • Key-based auth only
    • Disable password auth
    • Change default port
  3. Docker:

    • Rootless Docker (optional but recommended)
    • Limited container capabilities
    • No privileged containers except for nvidia-runtime
  4. Secrets:

    • Store API keys in .env
    • Use Docker secrets for sensitive data
    • Rotate keys periodically
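
A minimal ufw ruleset matching the policy above (the VPN subnet 10.8.0.0/24 and the SSH port 2222 are example values):

ufw default deny incoming
ufw default allow outgoing
ufw allow 51820/udp            # WireGuard
ufw allow 2222/tcp             # SSH on a non-default port
ufw allow from 10.8.0.0/24     # internal services reachable only via VPN
ufw --force enable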

Access Control

  • ComfyUI: Protected by Authelia SSO (already configured)
  • vLLM: Internal only, accessed via LiteLLM proxy
  • JupyterLab: Password-protected or Authelia
  • Training: No public access, VPN only

Phase 10: Advanced Features (Future)

Multi-GPU Scaling

Tensor Parallelism (vLLM):

  • Split large models across multiple GPUs
  • Example: 70B model on 2x A100s
command:
  - --model
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - '2'  # Use 2 GPUs

Pipeline Parallelism (training):

  • Split model layers across GPUs
  • Useful for very large models

Model Serving Optimization

vLLM Features:

  • Speculative decoding (faster generation)
  • Prefix caching (faster for repeated prompts)
  • Multi-LoRA serving (multiple adapters, one base model)

Example multi-LoRA:

command:
  - --model
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - --enable-lora
  - --max-loras
  - '4'
  - --lora-modules
  - customer-support=/models/loras/support-lora
  - creative-writing=/models/loras/writing-lora
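
With --lora-modules configured as above, an adapter is selected by using its registered name as the model in the request:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-support",
    "prompt": "Customer: My invoice is wrong.\nAgent:",
    "max_tokens": 100
  }'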

Video Generation

AnimateDiff in ComfyUI:

  • Generate short videos from text prompts
  • Animate static images
  • ~8GB VRAM for 512x512 16-frame videos

CogVideo:

  • High-quality video generation
  • Requires A100 or H100
  • 5-second clips at 720p

Voice Synthesis

XTTS v2:

  • High-quality voice cloning
  • Multi-language support
  • ~6GB VRAM

Bark:

  • Text-to-speech with emotions
  • Sound effects
  • ~10GB VRAM

Appendix A: Quick Start Commands

Initial GPU Server Setup

# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker

# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Deploy vLLM (Quick Test)

# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm

# Run vLLM
docker run -d \
  --name vllm \
  --runtime=nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

Deploy ComfyUI (Quick Test)

docker run -d \
  --name comfyui \
  --runtime=nvidia \
  --gpus all \
  -v /workspace/comfyui:/data \
  -p 8188:8188 \
  ghcr.io/ai-dock/comfyui:latest

# Access at http://gpu-ip:8188

Appendix B: Sample Docker Compose (Full GPU Stack)

# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:

Appendix C: Cost Calculator

Monthly GPU Costs:

| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|-----------|------|--------|------------|---------|----------|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |

Storage Costs:

  • Network volume: $0.10/GB/month
  • 500GB = $50/month
  • 1TB = $100/month

Total Estimated Monthly:

  • RTX 4090 + 500GB storage = $410/month (24/7)
  • RTX 4090 + 500GB storage = $170/month (8hr/day)

Break-even Analysis:

  • If spending >$500/month on API calls → GPU server saves money
  • If spending <$200/month → stick with APIs
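
The same arithmetic as a small shell helper (rates from the table above; the script name gpu-cost.sh is just an example):

#!/bin/bash
# usage: ./gpu-cost.sh <rate_per_hour> <hours_per_day> <storage_gb>
# e.g.:  ./gpu-cost.sh 0.50 8 500   -> RTX 4090, 8 h/day, 500 GB volume
RATE=${1:-0.50}
HOURS=${2:-24}
STORAGE_GB=${3:-500}

GPU_COST=$(echo "$RATE * $HOURS * 30" | bc)
STORAGE_COST=$(echo "$STORAGE_GB * 0.10" | bc)
echo "GPU:     \$$GPU_COST/month"
echo "Storage: \$$STORAGE_COST/month"
echo "Total:   \$$(echo "$GPU_COST + $STORAGE_COST" | bc)/month"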

Appendix D: Model Recommendations by Use Case

General Chat (24/7 Inference)

Best: Qwen 2.5 14B Instruct

  • Excellent multilingual support
  • Fast inference
  • Good reasoning

Alternative: Mistral 7B Instruct v0.3

  • Fastest inference
  • Lower VRAM

Code Generation

Best: Qwen 2.5 Coder 14B

  • SOTA coding performance
  • Multi-language support

Alternative: DeepSeek Coder 6.7B

  • Faster, lighter

Creative Writing

Best: Nous Hermes 2 Mixtral 8x7B (quantized)

  • Creative, engaging
  • Follows instructions well

Image Generation (Realistic)

Best: FLUX.1-dev

  • Highest quality
  • Best prompt following

Alternative: RealVisXL (SDXL fine-tune)

  • Faster generation
  • Good quality

Image Generation (Anime)

Best: Animagine XL (SDXL fine-tune)

  • Anime-specific training
  • Vibrant colors

Video Generation

Best: AnimateDiff + SDXL

  • 16-frame clips
  • Good quality

Needs: A100 40GB or better


Next Steps

  1. Review this plan and provide feedback
  2. Set budget for GPU infrastructure
  3. Choose provider (recommend RunPod)
  4. Define priority services (LLM hosting first? Image gen first?)
  5. Schedule implementation (4-week timeline above)

Possible follow-ups:

  • Detailed Docker Compose configurations for each service
  • A cost estimation spreadsheet
  • Research into specific models for the intended use cases
  • Beginning implementation with Phase 1