# GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

## 📚 Documentation Files

### Planning & Architecture
- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file

### Step-by-Step Setup Guides
1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: Tailscale VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration

### Configuration Files
- **`gpu-server-compose.yaml`** - Production Docker Compose for the GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script

---

## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓
- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi` (see the quick check below)
- [ ] **Guide:** `SETUP_GUIDE.md`

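A quick way to tick off the last two items in one go, sketched with placeholder connection details - use the host, port, and key shown in your pod's "Connect" tab:

```bash
# Placeholders: replace the key path, port, and host with your pod's SSH details
ssh -i ~/.ssh/id_ed25519 -p 22022 root@<pod-public-ip> \
  "nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv"

# Expect one line listing the RTX 4090 with roughly 24576 MiB of VRAM
```
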
### Day 3-4: Network Configuration ✓
- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity (see the commands below)
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`

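A minimal sketch of the install-and-verify flow. Run the same steps on both the VPS and the GPU server; `TAILSCALE_SETUP.md` covers the details and firewall rules:

```bash
# Install and bring up Tailscale (official one-line installer)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up          # prints an auth URL to join your tailnet

# Note each machine's Tailscale IP for the .env file
tailscale ip -4

# From the VPS, confirm the GPU server is reachable over the VPN
tailscale status
ping -c 3 <gpu-server-tailscale-ip>
```
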
### Day 5: Docker & GPU Setup ✓
- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers (see the check below)
- [ ] Create `/workspace/gpu-stack` directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`

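The container-side GPU check, assuming Docker and the NVIDIA apt repository are already set up as described in `DOCKER_GPU_SETUP.md` (the CUDA image tag is just one example of a recent `nvidia/cuda` base image):

```bash
# Install the toolkit and register the NVIDIA runtime with Docker
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# If this prints the GPU table, containers can see the RTX 4090
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
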
### Day 6-7: Deploy Services ✓
- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load the model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models` (a fuller smoke test follows this list)
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`

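Once the stack is up, a slightly fuller smoke test than the single `curl` above - a sketch that assumes vLLM is serving the default Llama 3.1 8B model on port 8000 (the `model` value must match whatever `/v1/models` actually reports):

```bash
# List the models vLLM has loaded
curl -s http://localhost:8000/v1/models

# Send one OpenAI-compatible chat completion through vLLM
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 50
      }'
```
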
---

## 📦 Services Included

### vLLM (http://[tailscale-ip]:8000)
**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization

**Switch models:**
Edit `gpu-server-compose.yaml`, change the `--model` parameter, then restart:
```bash
docker compose restart vllm
```

### ComfyUI (http://[tailscale-ip]:8188)
**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting

**Download models:**
Access web UI → ComfyUI Manager → Install Models

### JupyterLab (http://[tailscale-ip]:8888)
**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts

### Axolotl (Training - on-demand)
**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF (see the example run below)

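Once the training container is up, kicking off a LoRA run looks roughly like this. It uses Axolotl's standard CLI; the config path `/workspace/configs/llama-8b-lora.yml` is a placeholder for a config you write yourself (Week 4 covers this):

```bash
# Start the on-demand training container
docker compose --profile training up -d axolotl

# Launch a LoRA fine-tune inside it (config path is an example, not a shipped file)
docker compose exec axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/llama-8b-lora.yml
```
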
### Netdata (http://[tailscale-ip]:19999)
**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting

---

## 🔧 Configuration

### Environment Variables (.env)

```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x        # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x        # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x       # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```

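Before running the deploy script, a quick sanity check that the required variables are filled in - a sketch that assumes your `.env` follows the shape above (`WANDB_API_KEY` is optional, so it is skipped):

```bash
# Source the .env and report any required variables that are still empty
set -a; source .env; set +a
for v in VPS_IP GPU_IP MODELS_PATH HF_TOKEN JUPYTER_TOKEN DB_HOST; do
  if [ -n "${!v}" ]; then echo "OK      $v"; else echo "MISSING $v"; fi
done
```
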
### Updating LiteLLM on VPS

After the GPU server is running, update your VPS LiteLLM config:

```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!

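To confirm the routing worked, you can ask LiteLLM for its model list from the VPS. This sketch assumes LiteLLM listens on its default port 4000 and that a master key is configured - adjust both to your setup:

```bash
# On VPS: the LiteLLM proxy exposes an OpenAI-compatible /v1/models endpoint
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
# The response should now list both the claude-* models and the self-hosted Llama entry
```
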
---

## 💰 Cost Management

### Current Costs (24/7 Operation)
- **GPU Server:** RTX 4090 @ $0.50/hour = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**

### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**

**2. Auto-stop idle pods**
RunPod can auto-stop a pod after X minutes idle:
- Dashboard → Pod Settings → Auto-stop after 30 minutes

**3. Use smaller models**
- Mistral 7B instead of Llama 3.1 8B: faster, and fits on a cheaper GPU
- Quantized models: 4-bit weights need roughly 1/4 the VRAM of FP16

**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours

### Cost Tracking

**Check GPU usage:**
On the RunPod dashboard, go to **Billing → Usage History** to see hourly costs and total spend.

**Check API vs GPU savings:**
```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="

# Count requests to llama-3.1-8b vs claude-*
```

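One way to turn that grep into an actual per-model tally - a sketch that assumes the log lines contain `model=<name>` tokens as in the command above:

```bash
# Count requests per model from the LiteLLM container logs
docker logs ai_litellm 2>&1 \
  | grep -o 'model=[A-Za-z0-9._/-]*' \
  | sort | uniq -c | sort -rn
```
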
**Expected savings:**
- 80% of requests → self-hosted = ~$0 marginal cost
- 20% of requests → Claude = API cost
- Break-even if you currently spend more than ~$500/month on APIs

---

## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```

### Common Issues

**vLLM not loading model:**
```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```

**ComfyUI slow/crashing:**
```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```

**Can't access from VPS:**
```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```

**Docker can't see GPU:**
```bash
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check NVIDIA Container Toolkit: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```

---

## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second aggregate throughput

**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds

**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: not recommended on 24GB VRAM

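A rough way to check the single-stream numbers yourself, sketched against vLLM's OpenAI-compatible endpoint. It assumes `jq` and `bc` are installed and that the `model` value matches what `/v1/models` returns:

```bash
# Time one generation and divide completion tokens by elapsed seconds
start=$(date +%s.%N)
resp=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages":[{"role":"user","content":"Write a 300-word story."}],
       "max_tokens":512}')
end=$(date +%s.%N)

tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "scale=1; $tokens / ($end - $start)" | bc   # approximate tokens/second
```
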
---

## 🔐 Security Best Practices

### Network Security
✅ All services behind Tailscale VPN (end-to-end encrypted)
✅ No public exposure (except RunPod's SSH)
✅ Firewall configured (no additional ports needed)

### Access Control
✅ JupyterLab token-protected
✅ ComfyUI accessible via VPN only
✅ vLLM internal API (no auth needed)

### SSH Security
```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd
```

### Regular Updates
```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```

---

## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
  - 8B LLM at full speed
  - 14B LLM at moderate speed
  - SDXL image generation
  - Training job

**Add a 2nd GPU if:**
- You want LLM + image generation simultaneously
- You want training and inference at the same time
- Multiple users with high demand

**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

### Deploying Larger Models

**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"                # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2     # Use 2 GPUs
            capabilities: [gpu]
```

---

## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment
- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model

### Week 3: Image Generation
- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI

### Week 4: Training Infrastructure
- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed

---

## 🆘 Getting Help

### Resources
- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl

### Community
- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA

### Support
If you encounter issues:
1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart the service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`

---

## ✅ Success Criteria

You're ready to proceed when all of the following pass (a combined check follows this list):
- [ ] GPU server responds to `ping [tailscale-ip]` from the VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab accessible with token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models

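A small script that bundles the scriptable checks above, run from the VPS. `GPU_IP` is the GPU server's Tailscale IP from your `.env`; JupyterLab and Open WebUI are easier to verify in the browser:

```bash
#!/usr/bin/env bash
# Run from the VPS; set GPU_IP to the GPU server's Tailscale IP
GPU_IP=100.x.x.x

ping -c 2 "$GPU_IP" >/dev/null                         && echo "OK   VPN reachability" || echo "FAIL VPN reachability"
curl -sf "http://$GPU_IP:8000/v1/models" >/dev/null    && echo "OK   vLLM API"         || echo "FAIL vLLM API"
curl -sf "http://$GPU_IP:8188" >/dev/null              && echo "OK   ComfyUI"          || echo "FAIL ComfyUI"
curl -sf "http://$GPU_IP:19999/api/v1/info" >/dev/null && echo "OK   Netdata"          || echo "FAIL Netdata"
```
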
**Total setup time:** 4-6 hours (if following the guides sequentially)

---

## 🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves on API spend)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀