GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

📚 Documentation Files

Planning & Architecture

  • GPU_EXPANSION_PLAN.md - Complete 70-page plan with provider comparison, architecture, and roadmap
  • README_GPU_SETUP.md - This file

Step-by-Step Setup Guides

  1. SETUP_GUIDE.md - Day 1-2: RunPod account & GPU server deployment
  2. TAILSCALE_SETUP.md - Day 3-4: VPN connection between VPS and GPU server
  3. DOCKER_GPU_SETUP.md - Day 5: Docker + NVIDIA Container Toolkit configuration

Configuration Files

  • gpu-server-compose.yaml - Production Docker Compose for GPU server
  • litellm-config-gpu.yaml - Updated LiteLLM config with self-hosted models
  • deploy-gpu-stack.sh - Automated deployment script

🚀 Quick Start (Week 1 Checklist)

Day 1-2: RunPod & GPU Server ✓

  • Create RunPod account at https://www.runpod.io/
  • Add billing method ($50 initial credit recommended)
  • Deploy RTX 4090 pod with PyTorch template
  • Configure 500GB network volume
  • Verify SSH access
  • Test GPU with nvidia-smi (see the sketch below)
  • Guide: SETUP_GUIDE.md
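
For the SSH and GPU checks at the end of this list, verification might look like the following (host, port, and key path are placeholders taken from your pod's connection details):

# SSH into the pod (copy the exact command from the RunPod dashboard)
ssh root@<pod-host> -p <pod-port> -i ~/.ssh/id_ed25519

# Confirm the GPU is visible
nvidia-smi
# Expect one RTX 4090 with ~24 GB VRAM in the table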

Day 3-4: Network Configuration ✓

  • Install Tailscale on VPS
  • Install Tailscale on GPU server
  • Authenticate both devices
  • Test VPN connectivity (commands sketched below)
  • Configure firewall rules
  • Verify VPS can reach GPU server
  • Guide: TAILSCALE_SETUP.md
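
A minimal sketch of those steps (run on both machines; 100.x.x.x stands in for the real Tailscale IP):

# Install and bring up Tailscale (same on VPS and GPU server)
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

# Note each machine's Tailscale IP
tailscale ip -4

# From the VPS, verify the GPU server is reachable
ping 100.x.x.x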

Day 5: Docker & GPU Setup ✓

  • Install Docker on GPU server
  • Install NVIDIA Container Toolkit
  • Test GPU access in containers (commands sketched below)
  • Create /workspace/gpu-stack directory
  • Copy configuration files
  • Guide: DOCKER_GPU_SETUP.md
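
The core commands behind this checklist, assuming an Ubuntu-based pod image (DOCKER_GPU_SETUP.md covers adding NVIDIA's package repository first):

# Install the NVIDIA Container Toolkit
apt update && apt install -y nvidia-container-toolkit

# Register the runtime with Docker and restart the daemon
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Smoke-test GPU access from a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi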

Day 6-7: Deploy Services ✓

  • Copy gpu-server-compose.yaml to GPU server
  • Edit .env with your settings
  • Run ./deploy-gpu-stack.sh
  • Wait for vLLM to load model (~5 minutes)
  • Test vLLM: curl http://localhost:8000/v1/models (a fuller request sketch follows this list)
  • Access ComfyUI: http://[tailscale-ip]:8188
  • Script: deploy-gpu-stack.sh
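
Once the model has loaded, an end-to-end request against vLLM's OpenAI-compatible API might look like this (the model name must match the --model value in gpu-server-compose.yaml):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'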

📦 Services Included

vLLM (http://[tailscale-ip]:8000)

Purpose: High-performance LLM inference
Default model: Llama 3.1 8B Instruct
Performance: 50-80 tokens/second on an RTX 4090
Use for: General chat, Q&A, code generation, summarization

Switch models: edit gpu-server-compose.yaml, change the --model parameter, then recreate the container (a plain restart would keep the old command):

docker compose up -d vllm
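
For example, the relevant fragment of the vllm service might look like this (a sketch; your compose file may carry additional flags):

vllm:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.3   # swap this line to change models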

ComfyUI (http://[tailscale-ip]:8188)

Purpose: Advanced Stable Diffusion interface
Features: FLUX, SDXL, ControlNet, LoRA
Use for: Image generation, img2img, inpainting

Download models: Access web UI → ComfyUI Manager → Install Models

JupyterLab (http://[tailscale-ip]:8888)

Purpose: Interactive development environment
Token: pivoine-ai-2025 (change it in .env)
Use for: Research, experimentation, custom training scripts

Axolotl (Training - on-demand)

Purpose: LLM fine-tuning framework
Start: docker compose --profile training up -d axolotl
Use for: LoRA training, full fine-tuning, RLHF

Netdata (http://[tailscale-ip]:19999)

Purpose: System & GPU monitoring
Features: Real-time metrics, GPU utilization, memory usage
Use for: Performance monitoring, troubleshooting


🔧 Configuration

Environment Variables (.env)

# VPN Network (Tailscale)
VPS_IP=100.x.x.x         # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x         # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x        # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

Updating LiteLLM on VPS

After GPU server is running, update your VPS LiteLLM config:

# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
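
For orientation, the self-hosted entry in litellm-config-gpu.yaml will look roughly like this (a sketch; the file shipped in this directory is authoritative):

model_list:
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct   # vLLM speaks the OpenAI protocol
      api_base: http://100.x.x.x:8000/v1                    # GPU server Tailscale IP
      api_key: none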


💰 Cost Management

Current Costs (24/7 Operation)

  • GPU Server: RTX 4090 @ $0.50/hour = $360/month
  • Storage: 500GB network volume = $50/month
  • Total: $410/month

Cost-Saving Options

1. Pay-as-you-go (8 hours/day)

  • GPU: $0.50 × 8 × 30 = $120/month
  • Storage: $50/month
  • Total: $170/month

2. Auto-stop idle pods

RunPod can auto-stop a pod after a configurable number of idle minutes:

  • Dashboard → Pod Settings → Auto-stop after 30 minutes

3. Use smaller models

  • Mistral 7B instead of Llama 8B: faster, and runs on a cheaper GPU
  • Quantized models: 4-bit uses roughly 1/4 the VRAM of FP16

4. Batch image generation

  • Generate multiple images at once
  • Use scheduled jobs (cron) during off-peak hours (example below)
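
A minimal cron sketch (generate-batch.sh is a hypothetical script that submits your queued prompts to ComfyUI's API):

# Run the batch nightly at 02:00
0 2 * * * /workspace/gpu-stack/generate-batch.sh >> /var/log/batch-gen.log 2>&1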

Cost Tracking

Check GPU usage:

# On RunPod dashboard
Billing → Usage History

# See hourly costs, total spent

Check API vs GPU savings:

# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="

# Count requests to llama-3.1-8b vs claude-*
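
A rough per-model tally from those logs (assumes request lines contain model= as shown above):

docker logs ai_litellm 2>&1 | grep -o 'model=[^ ]*' | sort | uniq -c | sort -rn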

Expected savings:

  • 80% of requests → self-hosted = $0 cost
  • 20% of requests → Claude = API cost
  • Break-even if currently spending >$500/month on APIs

🔍 Monitoring & Troubleshooting

Check Service Status

# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop

Common Issues

vLLM not loading model:

# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try smaller model)
# - Missing HF_TOKEN (for gated models like Llama)

ComfyUI slow/crashing:

# Check GPU memory
nvidia-smi

# If VRAM full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI

Can't access from VPS:

# Test VPN
ping [tailscale-ip]

# If fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status

Docker can't see GPU:

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If fails:
# - Check NVIDIA driver: nvidia-smi
# - Check nvidia-docker: nvidia-ctk --version
# - Restart Docker: systemctl restart docker

📊 Performance Benchmarks

Expected Performance (RTX 4090)

LLM Inference (vLLM):

  • Llama 3.1 8B: 50-80 tokens/second
  • Qwen 2.5 14B: 30-50 tokens/second
  • Batch size 32: ~1500 tokens/second

Image Generation (ComfyUI):

  • SDXL (1024×1024): ~4-6 seconds
  • FLUX (1024×1024): ~8-12 seconds
  • SD 1.5 (512×512): ~1-2 seconds

Training (Axolotl):

  • LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
  • Full fine-tuning: Not recommended on 24GB VRAM

🔐 Security Best Practices

Network Security

  • All services behind Tailscale VPN (end-to-end encrypted)
  • No public exposure (except RunPod's SSH)
  • Firewall configured (no additional open ports needed)

Access Control

  • JupyterLab is password-protected
  • ComfyUI is accessible via VPN only
  • vLLM exposes an internal API (no auth needed)

SSH Security

# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd

Regular Updates

# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d

📈 Scaling Up

When to Add More GPUs

Current limitations (1× RTX 4090):

  • Can run ONE of these at a time:
    • 8B LLM at full speed
    • 14B LLM at moderate speed
    • SDXL image generation
    • Training job

Add 2nd GPU if:

  • You want LLM + image gen simultaneously
  • Training + inference at same time
  • Multiple users with high demand

Multi-GPU options:

  • 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
  • 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
  • Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

Deploying Larger Models

70B models (need 2× A100 or 4× RTX 4090):

# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"  # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2  # Use 2 GPUs
            capabilities: [gpu]

🎯 Next Steps (Week 2+)

Week 2: LLM Production Deployment

  • Test Llama 3.1 8B performance
  • Download additional models (Qwen, Mistral)
  • Configure model routing in LiteLLM
  • Set up usage monitoring
  • Benchmark tokens/second for each model (see the sketch below)
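
A crude tokens/second measurement with curl and jq (whole-second resolution; model name is an example):

START=$(date +%s)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Explain VPNs.", "max_tokens": 256}' \
  | jq '.usage.completion_tokens')
END=$(date +%s)
echo "$TOKENS tokens in $((END - START))s = ~$((TOKENS / (END - START))) tokens/s"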

Week 3: Image Generation

  • Download FLUX and SDXL models
  • Install ComfyUI Manager
  • Download ControlNet models
  • Create sample workflows
  • Test API integration with Open WebUI

Week 4: Training Infrastructure

  • Prepare a sample dataset
  • Test LoRA fine-tuning with Axolotl (config sketch below)
  • Set up Weights & Biases logging
  • Create training documentation
  • Benchmark training speed
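
A minimal QLoRA config sketch for Axolotl (paths and dataset are placeholders; consult Axolotl's docs for the full schema):

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: /workspace/data/sample.jsonl
    type: alpaca
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
output_dir: /workspace/outputs/llama-lora
wandb_project: gpu-stack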

🆘 Getting Help

Support

If you encounter issues:

  1. Check logs: docker compose logs -f [service]
  2. Check GPU: nvidia-smi
  3. Check VPN: tailscale status
  4. Restart service: docker compose restart [service]
  5. Full restart: docker compose down && docker compose up -d

Success Criteria

You're ready to proceed when:

  • GPU server responds to ping [tailscale-ip] from VPS
  • vLLM returns models: curl http://[tailscale-ip]:8000/v1/models
  • ComfyUI web interface loads: http://[tailscale-ip]:8188
  • JupyterLab accessible with token
  • Netdata shows GPU metrics
  • Open WebUI shows both Claude and Llama models

Total setup time: 4-6 hours (if following guides sequentially)
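
A small script to check the success criteria in one pass (a sketch; replace 100.x.x.x with the GPU server's Tailscale IP):

#!/usr/bin/env bash
GPU_IP=100.x.x.x

ping -c 1 "$GPU_IP" >/dev/null && echo "VPN: ok" || echo "VPN: FAIL"
curl -sf "http://$GPU_IP:8000/v1/models" >/dev/null && echo "vLLM: ok" || echo "vLLM: FAIL"
curl -sf "http://$GPU_IP:8188" >/dev/null && echo "ComfyUI: ok" || echo "ComfyUI: FAIL"
curl -sf "http://$GPU_IP:8888" >/dev/null && echo "JupyterLab: ok" || echo "JupyterLab: FAIL"
curl -sf "http://$GPU_IP:19999" >/dev/null && echo "Netdata: ok" || echo "Netdata: FAIL"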


🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:

  • Self-hosted LLM inference (saves $$$)
  • Advanced image generation (FLUX, SDXL)
  • Model training capabilities (LoRA, fine-tuning)
  • Secure VPN connection
  • Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀