GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

📚 Documentation Files

Planning & Architecture

  • GPU_EXPANSION_PLAN.md - Complete 70-page plan with provider comparison, architecture, and roadmap
  • README_GPU_SETUP.md - This file

Step-by-Step Setup Guides

  1. SETUP_GUIDE.md - Day 1-2: RunPod account & GPU server deployment
  2. TAILSCALE_SETUP.md - Day 3-4: VPN connection between VPS and GPU server
  3. DOCKER_GPU_SETUP.md - Day 5: Docker + NVIDIA Container Toolkit configuration

Configuration Files

  • gpu-server-compose.yaml - Production Docker Compose for GPU server
  • litellm-config-gpu.yaml - Updated LiteLLM config with self-hosted models
  • deploy-gpu-stack.sh - Automated deployment script

🚀 Quick Start (Week 1 Checklist)

Day 1-2: RunPod & GPU Server ✓

  • Create RunPod account at https://www.runpod.io/
  • Add billing method ($50 initial credit recommended)
  • Deploy RTX 4090 pod with PyTorch template
  • Configure 500GB network volume
  • Verify SSH access
  • Test GPU with nvidia-smi (see the sketch below)
  • Guide: SETUP_GUIDE.md
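
For the SSH and GPU checks at the end of this list, verification might look like the following (host, port, and key path are placeholders taken from your pod's connection details):

# SSH into the pod (copy the exact command from the RunPod dashboard)
ssh root@<pod-host> -p <pod-port> -i ~/.ssh/id_ed25519

# Confirm the GPU is visible
nvidia-smi
# Expect one RTX 4090 with ~24 GB VRAM in the table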

Day 3-4: Network Configuration ✓

  • Install Tailscale on VPS
  • Install Tailscale on GPU server
  • Authenticate both devices
  • Test VPN connectivity (commands sketched below)
  • Configure firewall rules
  • Verify VPS can reach GPU server
  • Guide: TAILSCALE_SETUP.md
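
A minimal sketch of those steps (run on both machines; 100.x.x.x stands in for the real Tailscale IP):

# Install and bring up Tailscale (same on VPS and GPU server)
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

# Note each machine's Tailscale IP
tailscale ip -4

# From the VPS, verify the GPU server is reachable
ping 100.x.x.x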

Day 5: Docker & GPU Setup ✓

  • Install Docker on GPU server
  • Install NVIDIA Container Toolkit
  • Test GPU access in containers (commands sketched below)
  • Create /workspace/gpu-stack directory
  • Copy configuration files
  • Guide: DOCKER_GPU_SETUP.md
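
The core commands behind this checklist, assuming an Ubuntu-based pod image (DOCKER_GPU_SETUP.md covers adding NVIDIA's package repository first):

# Install the NVIDIA Container Toolkit
apt update && apt install -y nvidia-container-toolkit

# Register the runtime with Docker and restart the daemon
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Smoke-test GPU access from a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi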

Day 6-7: Deploy Services ✓

  • Copy gpu-server-compose.yaml to GPU server
  • Edit .env with your settings
  • Run ./deploy-gpu-stack.sh
  • Wait for vLLM to load model (~5 minutes)
  • Test vLLM: curl http://localhost:8000/v1/models (a fuller request sketch follows this list)
  • Access ComfyUI: http://[tailscale-ip]:8188
  • Script: deploy-gpu-stack.sh
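
Once the model has loaded, an end-to-end request against vLLM's OpenAI-compatible API might look like this (the model name must match the --model value in gpu-server-compose.yaml):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'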

📦 Services Included

vLLM (http://[tailscale-ip]:8000)

Purpose: High-performance LLM inference
Default model: Llama 3.1 8B Instruct
Performance: 50-80 tokens/second on an RTX 4090
Use for: General chat, Q&A, code generation, summarization

Switch models: edit gpu-server-compose.yaml, change the --model parameter, then recreate the container (a plain restart would keep the old command):

docker compose up -d vllm
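
For example, the relevant fragment of the vllm service might look like this (a sketch; your compose file may carry additional flags):

vllm:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.3   # swap this line to change models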

ComfyUI (http://[tailscale-ip]:8188)

Purpose: Advanced Stable Diffusion interface
Features: FLUX, SDXL, ControlNet, LoRA
Use for: Image generation, img2img, inpainting

Download models: Access web UI → ComfyUI Manager → Install Models

JupyterLab (http://[tailscale-ip]:8888)

Purpose: Interactive development environment
Token: pivoine-ai-2025 (change it in .env)
Use for: Research, experimentation, custom training scripts

Axolotl (Training - on-demand)

Purpose: LLM fine-tuning framework
Start: docker compose --profile training up -d axolotl
Use for: LoRA training, full fine-tuning, RLHF

Netdata (http://[tailscale-ip]:19999)

Purpose: System & GPU monitoring
Features: Real-time metrics, GPU utilization, memory usage
Use for: Performance monitoring, troubleshooting


🔧 Configuration

Environment Variables (.env)

# VPN Network (Tailscale)
VPS_IP=100.x.x.x         # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x         # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x        # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui

Updating LiteLLM on VPS

After GPU server is running, update your VPS LiteLLM config:

# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
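
For orientation, the self-hosted entry in litellm-config-gpu.yaml will look roughly like this (a sketch; the file shipped in this directory is authoritative):

model_list:
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct   # vLLM speaks the OpenAI protocol
      api_base: http://100.x.x.x:8000/v1                    # GPU server Tailscale IP
      api_key: none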


💰 Cost Management

Current Costs (24/7 Operation)

  • GPU Server: RTX 4090 @ $0.50/hour = $360/month
  • Storage: 500GB network volume = $50/month
  • Total: $410/month

Cost-Saving Options

1. Pay-as-you-go (8 hours/day)

  • GPU: $0.50 × 8 × 30 = $120/month
  • Storage: $50/month
  • Total: $170/month

2. Auto-stop idle pods

RunPod can auto-stop a pod after a configurable number of idle minutes:

  • Dashboard → Pod Settings → Auto-stop after 30 minutes

3. Use smaller models

  • Mistral 7B instead of Llama 8B: faster, and runs on a cheaper GPU
  • Quantized models: 4-bit uses roughly 1/4 the VRAM of FP16

4. Batch image generation

  • Generate multiple images at once
  • Use scheduled jobs (cron) during off-peak hours (example below)
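
A minimal cron sketch (generate-batch.sh is a hypothetical script that submits your queued prompts to ComfyUI's API):

# Run the batch nightly at 02:00
0 2 * * * /workspace/gpu-stack/generate-batch.sh >> /var/log/batch-gen.log 2>&1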

Cost Tracking

Check GPU usage:

# On RunPod dashboard
Billing → Usage History

# See hourly costs, total spent

Check API vs GPU savings:

# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="

# Count requests to llama-3.1-8b vs claude-*
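
A rough per-model tally from those logs (assumes request lines contain model= as shown above):

docker logs ai_litellm 2>&1 | grep -o 'model=[^ ]*' | sort | uniq -c | sort -rn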

Expected savings:

  • 80% of requests → self-hosted = $0 cost
  • 20% of requests → Claude = API cost
  • Break-even if currently spending >$500/month on APIs

🔍 Monitoring & Troubleshooting

Check Service Status

# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop

Common Issues

vLLM not loading model:

# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try smaller model)
# - Missing HF_TOKEN (for gated models like Llama)

ComfyUI slow/crashing:

# Check GPU memory
nvidia-smi

# If VRAM full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI

Can't access from VPS:

# Test VPN
ping [tailscale-ip]

# If fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status

Docker can't see GPU:

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If fails:
# - Check NVIDIA driver: nvidia-smi
# - Check nvidia-docker: nvidia-ctk --version
# - Restart Docker: systemctl restart docker

📊 Performance Benchmarks

Expected Performance (RTX 4090)

LLM Inference (vLLM):

  • Llama 3.1 8B: 50-80 tokens/second
  • Qwen 2.5 14B: 30-50 tokens/second
  • Batch size 32: ~1500 tokens/second

Image Generation (ComfyUI):

  • SDXL (1024×1024): ~4-6 seconds
  • FLUX (1024×1024): ~8-12 seconds
  • SD 1.5 (512×512): ~1-2 seconds

Training (Axolotl):

  • LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
  • Full fine-tuning: Not recommended on 24GB VRAM

🔐 Security Best Practices

Network Security

  • All services behind Tailscale VPN (end-to-end encrypted)
  • No public exposure (except RunPod's SSH)
  • Firewall configured (no additional open ports needed)

Access Control

  • JupyterLab is password-protected
  • ComfyUI is accessible via VPN only
  • vLLM exposes an internal API (no auth needed)

SSH Security

# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd

Regular Updates

# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d

📈 Scaling Up

When to Add More GPUs

Current limitations (1× RTX 4090):

  • Can run ONE of these at a time:
    • 8B LLM at full speed
    • 14B LLM at moderate speed
    • SDXL image generation
    • Training job

Add 2nd GPU if:

  • You want LLM + image gen simultaneously
  • Training + inference at same time
  • Multiple users with high demand

Multi-GPU options:

  • 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
  • 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
  • Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

Deploying Larger Models

70B models (need 2× A100 or 4× RTX 4090):

# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"  # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2  # Use 2 GPUs
            capabilities: [gpu]

🎯 Next Steps (Week 2+)

Week 2: LLM Production Deployment

  • Test Llama 3.1 8B performance
  • Download additional models (Qwen, Mistral)
  • Configure model routing in LiteLLM
  • Set up usage monitoring
  • Benchmark tokens/second for each model (see the sketch below)
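
A crude tokens/second measurement with curl and jq (whole-second resolution; model name is an example):

START=$(date +%s)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Explain VPNs.", "max_tokens": 256}' \
  | jq '.usage.completion_tokens')
END=$(date +%s)
echo "$TOKENS tokens in $((END - START))s = ~$((TOKENS / (END - START))) tokens/s"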

Week 3: Image Generation

  • Download FLUX and SDXL models
  • Install ComfyUI Manager
  • Download ControlNet models
  • Create sample workflows
  • Test API integration with Open WebUI

Week 4: Training Infrastructure

  • Prepare a sample dataset
  • Test LoRA fine-tuning with Axolotl (config sketch below)
  • Set up Weights & Biases logging
  • Create training documentation
  • Benchmark training speed
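
A minimal QLoRA config sketch for Axolotl (paths and dataset are placeholders; consult Axolotl's docs for the full schema):

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: /workspace/data/sample.jsonl
    type: alpaca
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
output_dir: /workspace/outputs/llama-lora
wandb_project: gpu-stack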

🆘 Getting Help

Support

If you encounter issues:

  1. Check logs: docker compose logs -f [service]
  2. Check GPU: nvidia-smi
  3. Check VPN: tailscale status
  4. Restart service: docker compose restart [service]
  5. Full restart: docker compose down && docker compose up -d

Success Criteria

You're ready to proceed when:

  • GPU server responds to ping [tailscale-ip] from VPS
  • vLLM returns models: curl http://[tailscale-ip]:8000/v1/models
  • ComfyUI web interface loads: http://[tailscale-ip]:8188
  • JupyterLab accessible with token
  • Netdata shows GPU metrics
  • Open WebUI shows both Claude and Llama models

Total setup time: 4-6 hours (if following guides sequentially)
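
A small script to check the success criteria in one pass (a sketch; replace 100.x.x.x with the GPU server's Tailscale IP):

#!/usr/bin/env bash
GPU_IP=100.x.x.x

ping -c 1 "$GPU_IP" >/dev/null && echo "VPN: ok" || echo "VPN: FAIL"
curl -sf "http://$GPU_IP:8000/v1/models" >/dev/null && echo "vLLM: ok" || echo "vLLM: FAIL"
curl -sf "http://$GPU_IP:8188" >/dev/null && echo "ComfyUI: ok" || echo "ComfyUI: FAIL"
curl -sf "http://$GPU_IP:8888" >/dev/null && echo "JupyterLab: ok" || echo "JupyterLab: FAIL"
curl -sf "http://$GPU_IP:19999" >/dev/null && echo "Netdata: ok" || echo "Netdata: FAIL"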


🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:

  • Self-hosted LLM inference (saves $$$)
  • Advanced image generation (FLUX, SDXL)
  • Model training capabilities (LoRA, fine-tuning)
  • Secure VPN connection
  • Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀