# GPU-Enhanced AI Stack - Implementation Guide
Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.
## 📚 Documentation Files

### Planning & Architecture

- `GPU_EXPANSION_PLAN.md` - Complete 70-page plan with provider comparison, architecture, and roadmap
- `README_GPU_SETUP.md` - This file
### Step-by-Step Setup Guides

- `SETUP_GUIDE.md` - Day 1-2: RunPod account & GPU server deployment
- `TAILSCALE_SETUP.md` - Day 3-4: VPN connection between VPS and GPU server
- `DOCKER_GPU_SETUP.md` - Day 5: Docker + NVIDIA Container Toolkit configuration
### Configuration Files

- `gpu-server-compose.yaml` - Production Docker Compose for GPU server
- `litellm-config-gpu.yaml` - Updated LiteLLM config with self-hosted models
- `deploy-gpu-stack.sh` - Automated deployment script

## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓
- Create RunPod account at https://www.runpod.io/
- Add billing method ($50 initial credit recommended)
- Deploy RTX 4090 pod with PyTorch template
- Configure 500GB network volume
- Verify SSH access
- Test GPU with `nvidia-smi` (see the sketch below)

Guide: `SETUP_GUIDE.md`
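A quick way to confirm the last two items from your local machine (a sketch; the host, port, and key are placeholders — use the values RunPod shows on the pod's Connect tab):

```bash
# SSH into the pod (host/port/key are pod-specific)
ssh -i ~/.ssh/id_ed25519 -p <ssh-port> root@<pod-ip>

# Once connected, verify the GPU is visible
nvidia-smi
# Expect one RTX 4090 with ~24 GB VRAM listed
```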
### Day 3-4: Network Configuration ✓
- Install Tailscale on VPS
- Install Tailscale on GPU server
- Authenticate both devices
- Test VPN connectivity
- Configure firewall rules
- Verify VPS can reach GPU server
Guide: `TAILSCALE_SETUP.md`
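The list above condenses to a few commands (a sketch using Tailscale's standard Linux installer; run the first three on both the VPS and the GPU server):

```bash
# Install Tailscale (official convenience script)
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate this machine (prints a login URL)
tailscale up

# Note this machine's Tailscale IP
tailscale ip -4

# From the VPS, confirm it can reach the GPU server
ping -c 3 <gpu-tailscale-ip>
```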
### Day 5: Docker & GPU Setup ✓
- Install Docker on GPU server
- Install NVIDIA Container Toolkit
- Test GPU access in containers
- Create `/workspace/gpu-stack` directory
- Copy configuration files
Guide: `DOCKER_GPU_SETUP.md`
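For orientation, the toolkit install usually looks like this on Ubuntu/Debian (a sketch of NVIDIA's documented apt route; `DOCKER_GPU_SETUP.md` remains the authoritative guide for this stack):

```bash
# Add NVIDIA's package repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register it with Docker
apt-get update && apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
```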
### Day 6-7: Deploy Services ✓
- Copy `gpu-server-compose.yaml` to GPU server
- Edit `.env` with your settings
- Run `./deploy-gpu-stack.sh`
- Wait for vLLM to load the model (~5 minutes)
- Test vLLM: `curl http://localhost:8000/v1/models` (see the example after this list)
- Access ComfyUI: `http://[tailscale-ip]:8188`

Script: `deploy-gpu-stack.sh`
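Beyond listing models, you can push a test request through vLLM's OpenAI-compatible API (a sketch; the model name must match the `--model` value in your compose file):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50
  }'
```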
## 📦 Services Included

### vLLM (`http://[tailscale-ip]:8000`)

- **Purpose:** High-performance LLM inference
- **Default Model:** Llama 3.1 8B Instruct
- **Performance:** 50-80 tokens/second on RTX 4090
- **Use for:** General chat, Q&A, code generation, summarization
**Switch models:** edit `gpu-server-compose.yaml`, change the `--model` parameter, then restart:

```bash
docker compose restart vllm
```
### ComfyUI (`http://[tailscale-ip]:8188`)

- **Purpose:** Advanced Stable Diffusion interface
- **Features:** FLUX, SDXL, ControlNet, LoRA
- **Use for:** Image generation, img2img, inpainting

**Download models:** access the web UI → ComfyUI Manager → Install Models
### JupyterLab (`http://[tailscale-ip]:8888`)

- **Purpose:** Interactive development environment
- **Token:** `pivoine-ai-2025` (change in `.env`)
- **Use for:** Research, experimentation, custom training scripts
### Axolotl (Training - on-demand)

- **Purpose:** LLM fine-tuning framework
- **Start:** `docker compose --profile training up -d axolotl`
- **Use for:** LoRA training, full fine-tuning, RLHF
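A typical training invocation looks like this (a sketch; the config path is illustrative — point it at whichever Axolotl YAML you prepare):

```bash
# Launch a LoRA fine-tuning run in the axolotl container
docker compose --profile training run --rm axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-8b.yml
```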
### Netdata (`http://[tailscale-ip]:19999`)

- **Purpose:** System & GPU monitoring
- **Features:** Real-time metrics, GPU utilization, memory usage
- **Use for:** Performance monitoring, troubleshooting
## 🔧 Configuration

### Environment Variables (`.env`)
```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x        # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x        # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x       # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```
### Updating LiteLLM on VPS
After the GPU server is running, update your VPS LiteLLM config:
```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```
Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
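To confirm the new routes are live, query the LiteLLM proxy directly (a sketch; assumes LiteLLM's default port 4000 and a configured master key):

```bash
# Should list both claude-* and llama-3.1-8b entries
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```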
## 💰 Cost Management

### Current Costs (24/7 Operation)
- GPU Server: RTX 4090 @ $0.50/hour × 24 h × 30 days = $360/month
- Storage: 500GB network volume = $50/month
- Total: $410/month
### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- Total: $170/month
**2. Auto-stop idle pods**

RunPod can auto-stop a pod after a set idle period:
- Dashboard → Pod Settings → Auto-stop after 30 minutes
**3. Use smaller models**
- Mistral 7B instead of Llama 3.1 8B: faster, and runs on cheaper GPUs
- Quantized models: 4-bit needs roughly 1/4 the VRAM
**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours
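Scheduled operation can also apply to the pod itself (a sketch using RunPod's `runpodctl` CLI; the pod ID and hours are placeholders):

```bash
# crontab on a machine that stays online (e.g. the VPS):
# start the pod at 08:00 and stop it at 18:00
0 8  * * * runpodctl start pod <pod-id>
0 18 * * * runpodctl stop pod <pod-id>
```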
### Cost Tracking

**Check GPU usage:** on the RunPod dashboard, go to Billing → Usage History to see hourly costs and the total spent.
**Check API vs GPU savings:**

```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="
# Count requests to llama-3.1-8b vs claude-*
```
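To tally the split per model, aggregate those log lines (a sketch; the exact log format varies by LiteLLM version, so adjust the pattern if needed):

```bash
# Count requests per model name seen in the logs
docker logs ai_litellm 2>&1 \
  | grep -o 'model=[^ ]*' \
  | sort | uniq -c | sort -rn
```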
**Expected savings:**
- 80% of requests → self-hosted = $0 cost
- 20% of requests → Claude = API cost
- Break-even if currently spending >$500/month on APIs
## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```
### Common Issues

**vLLM not loading model:**

```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```
**ComfyUI slow/crashing:**

```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```
**Can't access from VPS:**

```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```
**Docker can't see GPU:**

```bash
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check nvidia-docker: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```
## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second
**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds
**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: Not recommended on 24GB VRAM
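For a rough tokens/second number on your own hardware, time a fixed-length completion (a sketch; `ignore_eos` is a vLLM-specific extension worth verifying on your version — roughly 256 / elapsed seconds ≈ tokens/sec):

```bash
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Write a short story about a robot.",
    "max_tokens": 256,
    "ignore_eos": true
  }' > /dev/null
```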
## 🔐 Security Best Practices

### Network Security

- ✅ All services behind Tailscale VPN (end-to-end encrypted)
- ✅ No public exposure (except RunPod's SSH)
- ✅ Firewall configured (no additional ports needed)
### Access Control

- ✅ JupyterLab password-protected
- ✅ ComfyUI accessible via VPN only
- ✅ vLLM internal API (no auth needed)
### SSH Security

```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd
```
### Regular Updates

```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```
## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
  - 8B LLM at full speed
  - 14B LLM at moderate speed
  - SDXL image generation
  - Training job
**Add a 2nd GPU if:**
- You want LLM + image gen simultaneously
- Training + inference at same time
- Multiple users with high demand
**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)
### Deploying Larger Models

**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"              # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2   # Use 2 GPUs
            capabilities: [gpu]
```
## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment
- Test Llama 3.1 8B performance
- Download additional models (Qwen, Mistral)
- Configure model routing in LiteLLM
- Set up usage monitoring
- Benchmark tokens/second for each model
### Week 3: Image Generation
- Download FLUX and SDXL models
- Install ComfyUI Manager
- Download ControlNet models
- Create sample workflows
- Test API integration with Open WebUI
### Week 4: Training Infrastructure
- Prepare a sample dataset
- Test LoRA fine-tuning with Axolotl
- Set up Weights & Biases logging
- Create training documentation
- Benchmark training speed
## 🆘 Getting Help

### Resources
- RunPod Docs: https://docs.runpod.io/
- vLLM Docs: https://docs.vllm.ai/
- ComfyUI Wiki: https://github.com/comfyanonymous/ComfyUI/wiki
- Axolotl Docs: https://github.com/OpenAccess-AI-Collective/axolotl
### Community
- RunPod Discord: https://discord.gg/runpod
- vLLM Discord: https://discord.gg/vllm
- r/LocalLLaMA: https://reddit.com/r/LocalLLaMA
### Support
If you encounter issues:
- Check logs: `docker compose logs -f [service]`
- Check GPU: `nvidia-smi`
- Check VPN: `tailscale status`
- Restart service: `docker compose restart [service]`
- Full restart: `docker compose down && docker compose up -d`
## ✅ Success Criteria
You're ready to proceed when:
- GPU server responds to `ping [tailscale-ip]` from VPS
- vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- JupyterLab accessible with token
- Netdata shows GPU metrics
- Open WebUI shows both Claude and Llama models
**Total setup time:** 4-6 hours (if following the guides sequentially)
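The criteria above can be scripted as a quick smoke test (a sketch; fill in your GPU server's Tailscale IP):

```bash
#!/usr/bin/env bash
set -e
GPU_IP=<gpu-tailscale-ip>

ping -c 1 "$GPU_IP" > /dev/null                        && echo "VPN: ok"
curl -fsS "http://$GPU_IP:8000/v1/models" > /dev/null  && echo "vLLM: ok"
curl -fsS "http://$GPU_IP:8188" > /dev/null            && echo "ComfyUI: ok"
curl -fsS "http://$GPU_IP:19999" > /dev/null           && echo "Netdata: ok"
echo "All checks passed"
```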
## 🎉 You're All Set!
Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging
Enjoy building with your new AI infrastructure! 🚀