GPU-Enhanced AI Stack Expansion Plan
Executive Summary
This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:
- LLM Model Hosting: Self-hosted models (Llama, Mistral, Qwen, etc.)
- Model Training: Fine-tuning and training workflows
- Image Generation: Stable Diffusion, FLUX via ComfyUI
- Video Generation: AnimateDiff, CogVideo, etc.
Current Architecture: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
Target Architecture: Hybrid stack with GPU server(s) for self-hosted models + API-based models
Phase 1: Current Stack Analysis
Existing Components
- ai_postgres (pgvector/pgvector:pg16)
- PostgreSQL with pgvector for RAG
- Stores: conversations, embeddings, LiteLLM logs
- webui (Open WebUI)
- User-facing ChatGPT-like interface
- URL: https://ai.pivoine.art
- Features: RAG, web search, document upload
- Connected to LiteLLM proxy
- litellm (LiteLLM proxy)
- Currently proxies Anthropic Claude API
- OpenAI-compatible endpoint at http://litellm:4000
- Supports multiple providers via config
- crawl4ai
- Internal web scraping for LLM content prep
- Port 11235 (internal only)
- facefusion (CPU-only)
- Face swapping/enhancement
- Currently CPU-based (slow)
- Protected by Authelia SSO
Current Limitations
- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage
Phase 2: GPU Provider Comparison
Provider Options
1. RunPod ⭐ RECOMMENDED
Pros:
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence
Pricing (Approximate):
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)
Best for: On-demand workloads, experimentation, cost-conscious hosting
2. Lambda Labs
Pros:
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network
Pricing:
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)
Best for: Research, high-utilization workloads
3. Vast.ai
Pros:
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available
Cons:
- Variable reliability
- Setup complexity
- Community-hosted machines
Pricing:
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour
Best for: Budget-conscious, experimental workloads
4. Google Cloud Platform (GCP)
Pros:
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available
Pricing:
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available
Best for: Enterprise workloads, auto-scaling needs
5. AWS
Pros:
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support
Pricing:
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour
Best for: Enterprise, existing AWS infrastructure
6. Hugging Face Spaces / Inference Endpoints
Pros:
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models
Pricing:
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour
Best for: Quick model deployment, serverless inference
Recommendation: RunPod for Primary GPU Server
Rationale:
- Cost-effective: Pay-per-second billing, ~$0.50/hour for RTX 4090
- Docker-native: Easy integration with existing compose stack
- Flexibility: Start/stop as needed, scale up for training
- Community: Large user base, good documentation
- Network: Built-in HTTP/SSH tunneling
Supplementary: Use Hugging Face for specific model hosting if needed
Phase 3: Architecture Design
Network Topology
┌─────────────────────────────────────────────────────────────┐
│ pivoine.art VPS (CPU-based) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Open │─────▶│ LiteLLM │◀────▶│ ai_ │ │
│ │ WebUI │ │ Proxy │ │ postgres │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ │ │ │
└───────┼──────────────────┼──────────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Anthropic API │
│ │ (Claude) │
│ └─────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ GPU Server (RunPod) │
├────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ vLLM │ │ ComfyUI │ │ Model │ │ JupyterLab│ │
│ │ (LLMs) │ │ (SD/FLUX)│ │ Training │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ └──────────────┴─────────────┴──────────────┘ │
│ │ │
│ ┌───────────────┐ │
│ │ Model Storage │ │
│ │ (Persistent) │ │
│ └───────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
│
▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│ Integration Options: │
├────────────────────────────────────────────────────────────┤
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000) │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art) │
│ 3. Model storage synced via rclone/restic │
└────────────────────────────────────────────────────────────┘
Connection Methods
Option A: WireGuard VPN (RECOMMENDED)
- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack
Setup (config sketch below):
- Deploy WireGuard on GPU server
- Add GPU server as VPN peer
- Configure LiteLLM to use VPN IPs
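A minimal config sketch for the GPU-server side, assuming keys and addresses are replaced with real values from the existing wg-easy subnet:
# /etc/wireguard/wg0.conf on the GPU server -- placeholder keys and addresses
[Interface]
PrivateKey = <gpu-server-private-key>
Address = 10.8.0.10/24

[Peer]
# The pivoine.art VPS running wg-easy
PublicKey = <vps-public-key>
Endpoint = <vps-public-ip>:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25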
Option B: SSH Tunnel
- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency
Option C: Tailscale
- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)
Phase 4: Service Implementation Plans
4.1 LLM Hosting with vLLM
vLLM is an industry-standard engine for high-performance LLM inference.
Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)
Docker Compose Configuration:
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct  # or any model
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
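Once the container is running, the OpenAI-compatible endpoints can be sanity-checked directly on the GPU host (the model name must match the --model argument above):
# List the models the server exposes
curl http://localhost:8000/v1/models

# Minimal chat completion against the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}'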
Recommended Models for RTX 4090 (24GB):
Text Generation:
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)
Code:
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)
Note: VRAM figures for the 13B/14B models assume 8-bit or AWQ/GPTQ quantization; unquantized FP16 weights need roughly 2 GB per billion parameters.
Integration with LiteLLM:
Add to ai/litellm-config.yaml:
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
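A quick end-to-end check is to call the LiteLLM proxy with the standard OpenAI Python client, using the new model_name (the base URL and API key below are placeholders for this stack's LiteLLM endpoint and whatever key it is configured to accept):
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(
    base_url="http://litellm:4000/v1",   # LiteLLM's OpenAI-compatible endpoint
    api_key="sk-placeholder",            # key configured in LiteLLM
)

# "llama-3.1-8b" is the model_name defined above; LiteLLM forwards it to vLLM
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)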
4.2 ComfyUI for Image/Video Generation
ComfyUI is a node-based UI for Stable Diffusion with advanced workflows.
Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation
Docker Compose Configuration:
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Model Downloads (via ComfyUI Manager):
Stable Diffusion Models:
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)
ControlNet Models:
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl
LoRA Models (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs
Traefik Integration:
Add subdomain routing for ComfyUI:
labels:
- 'traefik.enable=true'
- 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
- 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
- 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
- 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
- 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
Open WebUI Integration:
ComfyUI has a REST API that can be called from Open WebUI using function calling.
Example workflow API call:
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    workflow = {
        # ComfyUI workflow JSON (export from the UI via "Save (API Format)"),
        # with prompt/negative_prompt patched into the relevant text nodes
    }
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow},
    )
    return response.json()  # contains the queued prompt_id
4.3 Model Training Infrastructure
For fine-tuning LLMs and training custom models.
Option A: Axolotl (Recommended)
Axolotl is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-}  # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Training Workflow:
- Prepare dataset (JSONL format; example record below)
- Create Axolotl config (LoRA, batch size, epochs)
- Start training container
- Monitor via Weights & Biases or TensorBoard
- Export LoRA adapters
- Merge with base model or use in vLLM
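For step 1, a completion-type dataset (matching type: completion / field: text in the config below) is plain JSONL, one object per line; two illustrative, made-up records:
{"text": "### Instruction:\nSummarize the support ticket below.\n\n### Ticket:\nCustomer cannot log in after password reset.\n\n### Response:\nThe customer is locked out following a password reset; advise clearing cached credentials and retrying."}
{"text": "### Instruction:\nTranslate to French.\n\n### Input:\nThe server restarts nightly.\n\n### Response:\nLe serveur redémarre chaque nuit."}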
Example Config:
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
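A one-off run against this config could then be launched through the compose service defined earlier (a sketch; service name and paths as used in this plan):
# One-off fine-tuning run via the Axolotl compose service; the LoRA adapter
# ends up in ./training/output/llama3-lora on the host
docker compose run --rm axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-llama3.yaml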
Option B: JupyterLab for Custom Training
For research and custom training scripts:
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
4.4 Model Storage Strategy
Storage Requirements:
Per Model Type:
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each
Total Estimated:
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- Total: 250-300GB minimum
RunPod Storage Options:
- Network Volume (Recommended)
- Persistent across pod restarts
- Shared between multiple pods
- ~$0.10/GB/month
- 500GB = $50/month
- Container Disk
- Included with pod
- Lost when pod stops
- Good for temporary storage
- External Storage (rclone)
- Sync to/from VPS or cloud storage
- Backup models to Backblaze B2 or Wasabi
- Good for disaster recovery
Model Management:
Use Hugging Face Hub as model cache:
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b
# Or let vLLM/ComfyUI auto-download
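If models are pre-downloaded this way, vLLM can also be pointed at the local directory instead of the Hub ID, avoiding re-downloads after pod restarts (paths are illustrative):
# Serve the locally downloaded copy instead of pulling from the Hub
docker run -d --name vllm --runtime=nvidia --gpus all \
  -v /models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/llama-3.1-8b \
  --max-model-len 8192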
Model Sync Script:
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server
rclone sync \
/mnt/hidrive/AI/models \
gpu:/workspace/models \
--progress \
--transfers 4
Phase 5: Implementation Roadmap
Week 1: Infrastructure Setup
Day 1-2: RunPod Account & GPU Server
- Create RunPod account
- Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- Configure persistent network volume (500GB)
- Set up SSH access
Day 3-4: Network Configuration
- Deploy WireGuard on GPU server
- Add GPU server as peer to existing VPN (vpn/compose.yaml)
- Test connectivity between VPS and GPU server
- Configure firewall rules
Day 5: Docker Setup on GPU Server
- Install Docker + NVIDIA Container Toolkit
- Create docker-compose.yaml for GPU services
- Test GPU access in containers
Week 2: LLM Hosting
Day 1-2: vLLM Deployment
- Deploy vLLM container
- Download Llama 3.1 8B Instruct
- Test inference locally
- Benchmark performance (tokens/sec)
Day 3-4: LiteLLM Integration
- Update litellm-config.yaml with vLLM endpoint
- Test via Open WebUI
- Configure model routing (cheap models → vLLM, complex → Claude)
- Set up usage monitoring
Day 5: Model Expansion
- Download Qwen 2.5 14B
- Download Mistral 7B Instruct
- Test model switching in Open WebUI
- Document performance characteristics
Week 3: Image Generation
Day 1-2: ComfyUI Setup
- Deploy ComfyUI container
- Download FLUX.1-schnell
- Download SDXL
- Install ComfyUI Manager
Day 3-4: Model Downloads
- Download ControlNet models
- Download VAE models
- Download popular LoRAs from Civitai
- Organize model directory
Day 5: Integration & Workflows
- Create basic text-to-image workflow
- Create ControlNet workflow
- Test API access
- Add Traefik subdomain (comfy.ai.pivoine.art)
Week 4: Training Infrastructure
Day 1-2: Axolotl Setup
- Deploy Axolotl container
- Create sample dataset
- Test LoRA fine-tuning with tiny model
- Verify GPU utilization
Day 3-4: JupyterLab Setup
- Deploy JupyterLab container
- Install ML libraries
- Create example notebooks
- Test custom training scripts
Day 5: Documentation & Testing
- Write training guides
- Test end-to-end workflows
- Benchmark training speeds
- Document best practices
Ongoing: Optimization & Expansion
Month 2:
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user
Month 3+:
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
Phase 6: Cost Analysis
Scenario A: Single RTX 4090 (24/7)
GPU Server (RunPod):
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- Subtotal: $410/month
VPS (Existing):
- No change in cost
Total: ~$410/month
Savings:
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even if currently spending >$500/month on API calls
Scenario B: Pay-as-you-go (8 hours/day)
GPU Server (RunPod):
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- Subtotal: $170/month
Best for:
- Development/experimentation
- Burst workloads
- Image generation on-demand
Scenario C: Dual GPU (Training + Inference)
GPU Server 1 (Inference):
- RTX 4090 24/7: $360/month
GPU Server 2 (Training - On-demand):
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions
Storage:
- 1TB network volume: $100/month
Total: ~$520/month
Cost Optimization Tips
- Auto-stop idle pods: RunPod can auto-stop after X minutes idle
- Use spot instances: ~50% cheaper but can be interrupted
- Quantized models: 4-bit models use 4x less VRAM → cheaper GPUs
- Batch processing: Queue image gen jobs to maximize GPU usage
- Model sharing: One vLLM instance can serve multiple models via adapters
- Monitoring: Track per-model costs to optimize routing
Phase 7: Monitoring & Operations
Metrics to Track
GPU Utilization:
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage
Model Performance:
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch
Costs:
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown
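Before any dashboards exist, nvidia-smi can already stream the key metrics for ad-hoc checks:
# Log utilization, memory, temperature and power every 5 seconds
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5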
Monitoring Stack
Option A: Netdata (Already deployed)
Add GPU monitoring to existing Netdata:
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # Do not override the entrypoint to run the nvidia_smi plugin by hand;
    # enable Netdata's nvidia_smi collector in its collector config instead
    # (nvidia-smi is injected into the container by the NVIDIA runtime).
Option B: Prometheus + Grafana
For detailed metrics:
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
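A minimal prometheus.yml for this stack only needs a scrape job for the DCGM exporter (it listens on port 9400 by default; the target assumes the compose service name above):
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']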
Import Grafana dashboard #12219 for GPU metrics.
Phase 8: Backup & Disaster Recovery
What to Backup
- Models (250-300GB)
- Base models can be re-downloaded
- Custom fine-tuned models: CRITICAL
- LoRAs: CRITICAL
- Training Data (~10-50GB)
- Datasets
- Preprocessing scripts
- Configurations (<1GB)
- Docker compose files
- Training configs
- Workflow JSONs
Backup Strategy
Tier 1: Critical (Daily)
- Fine-tuned models
- Training checkpoints
- Custom datasets
Backup to:
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
Tier 2: Nice-to-have (Weekly)
- Base models (can re-download)
- ComfyUI outputs
Tier 3: Ephemeral (No backup)
- Inference cache
- Temporary generations
Phase 9: Security Considerations
GPU Server Security
- Firewall:
- Only allow WireGuard port (51820)
- All services accessed via VPN
- No public exposure
- SSH:
- Key-based auth only
- Disable password auth
- Change default port
- Docker:
- Rootless Docker (optional but recommended)
- Limited container capabilities
- No privileged containers except for nvidia-runtime
- Secrets:
- Store API keys in .env
- Use Docker secrets for sensitive data
- Rotate keys periodically
Access Control
- ComfyUI: Protected by Authelia SSO (already configured)
- vLLM: Internal only, accessed via LiteLLM proxy
- JupyterLab: Password-protected or Authelia
- Training: No public access, VPN only
Phase 10: Advanced Features (Future)
Multi-GPU Scaling
Tensor Parallelism (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s
command:
- --model
- meta-llama/Meta-Llama-3.1-70B-Instruct
- --tensor-parallel-size
- '2' # Use 2 GPUs
Pipeline Parallelism (training):
- Split model layers across GPUs
- Useful for very large models
Model Serving Optimization
vLLM Features:
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)
Example multi-LoRA:
command:
- --model
- meta-llama/Meta-Llama-3.1-8B-Instruct
- --enable-lora
- --max-loras
- '4'
- --lora-modules
- customer-support=/models/loras/support-lora
- creative-writing=/models/loras/writing-lora
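With multi-LoRA serving enabled, an adapter is selected per request by using its registered name as the model (names as defined in the flags above):
# Route a request to the "customer-support" adapter on the shared base model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-support",
       "messages": [{"role": "user", "content": "My invoice looks wrong, what should I do?"}]}'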
Video Generation
AnimateDiff in ComfyUI:
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos
CogVideo:
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p
Voice Synthesis
XTTS v2:
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM
Bark:
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM
Appendix A: Quick Start Commands
Initial GPU Server Setup
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Deploy vLLM (Quick Test)
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm
# Run vLLM
docker run -d \
--name vllm \
--runtime=nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype auto \
--max-model-len 8192
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Once upon a time",
"max_tokens": 50
}'
Deploy ComfyUI (Quick Test)
docker run -d \
--name comfyui \
--runtime=nvidia \
--gpus all \
-v /workspace/comfyui:/data \
-p 8188:8188 \
ghcr.io/ai-dock/comfyui:latest
# Access at http://gpu-ip:8188
Appendix B: Sample Docker Compose (Full GPU Stack)
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
Appendix C: Cost Calculator
Monthly GPU Costs:
| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|---|---|---|---|---|---|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |
Storage Costs:
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month
Total Estimated Monthly:
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)
Break-even Analysis (quick calculator below):
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs
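The break-even math can be sanity-checked in a few lines (rates mirror the table above; adjust to current RunPod pricing):
# Rough monthly cost / break-even helper (illustrative rates)
def gpu_month_cost(rate_per_hour: float, hours_per_day: float, storage_gb: int = 500) -> float:
    compute = rate_per_hour * hours_per_day * 30   # ~30 billable days/month
    storage = storage_gb * 0.10                    # $0.10/GB/month network volume
    return compute + storage

current_api_spend = 500.0                          # example monthly API bill

for label, hours in [("24/7", 24), ("8h/day", 8)]:
    cost = gpu_month_cost(0.50, hours)             # RTX 4090 at ~$0.50/hour
    print(f"{label}: ${cost:.0f}/month -> saves ${current_api_spend - cost:.0f} vs API")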
Appendix D: Model Recommendations by Use Case
General Chat (24/7 Inference)
Best: Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning
Alternative: Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM
Code Generation
Best: Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support
Alternative: DeepSeek Coder 6.7B
- Faster, lighter
Creative Writing
Best: Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well
Image Generation (Realistic)
Best: FLUX.1-dev
- Highest quality
- Best prompt following
Alternative: RealVisXL (realism-focused SDXL fine-tune)
- Faster generation
- Good quality
Image Generation (Anime)
Best: Animagine XL (anime-focused SDXL fine-tune)
- Anime-specific training
- Vibrant colors
Video Generation
Best: AnimateDiff + SDXL
- 16-frame clips
- Good quality
Needs: A100 40GB or better
Next Steps
- Review this plan and provide feedback
- Set budget for GPU infrastructure
- Choose provider (recommend RunPod)
- Define priority services (LLM hosting first? Image gen first?)
- Schedule implementation (4-week timeline above)
Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?
Let me know how you'd like to proceed! 🚀