# GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

## 📚 Documentation Files

### Planning & Architecture
- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file

### Step-by-Step Setup Guides
1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`WIREGUARD_SETUP.md`** - Day 3-4: VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration

### Configuration Files
- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script

---
## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓
- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi`
- [ ] **Guide:** `SETUP_GUIDE.md`

### Day 3-4: Network Configuration ✓
- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity (commands sketched below)
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`
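
If you prefer to do the network steps from a shell, the core commands look roughly like this. This is a sketch; `TAILSCALE_SETUP.md` is the authoritative walkthrough, and the placeholder IP must be replaced with your GPU server's Tailscale address.

```bash
# Install Tailscale on both machines (official install script)
curl -fsSL https://tailscale.com/install.sh | sh

# Bring each node up and authenticate (prints a login URL)
tailscale up

# Note each machine's Tailscale IP (used as VPS_IP / GPU_IP in .env)
tailscale ip -4

# From the VPS, confirm the GPU server is reachable over the tailnet
tailscale ping [gpu-tailscale-ip]
ping [gpu-tailscale-ip]
```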
### Day 5: Docker & GPU Setup ✓
- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers (see the test below)
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`
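
The container GPU test boils down to a few commands. A sketch, assuming the NVIDIA driver is already present (it is on RunPod pods) and the toolkit's apt repository is configured; `DOCKER_GPU_SETUP.md` covers the full procedure:

```bash
# Install the NVIDIA Container Toolkit and register it with Docker
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Verify that containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```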
### Day 6-7: Deploy Services ✓
- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load the model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models` (see the smoke test below)
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`
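
Beyond listing models, a single chat request is a quick end-to-end check of the OpenAI-compatible API that vLLM exposes. The model name below is a placeholder; it must match whatever `--model` the compose file passes to vLLM.

```bash
# List the models vLLM is serving
curl http://localhost:8000/v1/models

# Send a one-off chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```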
---

## 📦 Services Included

### vLLM (http://[tailscale-ip]:8000)
**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization

**Switch models:**
Edit `gpu-server-compose.yaml`, change the `--model` parameter, then recreate the container so the new command takes effect:
```bash
docker compose up -d vllm
```
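
For orientation, the fragment of `gpu-server-compose.yaml` you would edit looks roughly like this (a sketch; the real service definition in your file may carry additional flags, and the model ID shown is just an example):

```yaml
vllm:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.3   # new Hugging Face model ID
    - --max-model-len
    - "8192"
```

The first start after a switch is slow because the new weights have to be downloaded to the model volume.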
### ComfyUI (http://[tailscale-ip]:8188)
**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting

**Download models:**
Access the web UI → ComfyUI Manager → Install Models, or pull checkpoints from the shell (see the sketch below).
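
A hedged command-line alternative: the exact target directory depends on how the compose file mounts ComfyUI's `models/` folder, so adjust the path; the URL is just one example checkpoint.

```bash
# Example: fetch the SDXL base checkpoint directly into the checkpoints folder
wget -P /workspace/comfyui/models/checkpoints \
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors
```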
### JupyterLab (http://[tailscale-ip]:8888)
**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts

### Axolotl (Training - on-demand)
**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF (a sample config sketch follows)
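
For a feel of what a LoRA run takes, here is a heavily abridged, illustrative Axolotl config sketch. Field names follow Axolotl's published examples but should be checked against the version you deploy; the dataset path is a placeholder.

```yaml
# lora-llama31-8b.yaml (illustrative only)
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
datasets:
  - path: /workspace/datasets/my_dataset.jsonl   # placeholder
    type: alpaca
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
output_dir: /workspace/outputs/llama31-8b-lora
```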
### Netdata (http://[tailscale-ip]:19999)
**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting

---

## 🔧 Configuration

### Environment Variables (.env)

```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x              # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x              # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x             # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```
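
After editing `.env`, it is worth confirming that Docker Compose actually picks the values up. A quick check (run from the directory that holds `gpu-server-compose.yaml`):

```bash
# Render the effective configuration with variables substituted;
# the Tailscale IPs from .env should appear in the output
docker compose -f gpu-server-compose.yaml config | grep "100\."
```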
### Updating LiteLLM on VPS

After the GPU server is running, update your VPS LiteLLM config:

```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```
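
For orientation, a self-hosted model entry in a LiteLLM config generally looks something like the sketch below; the model name, IP, and key here are placeholders, so keep whatever `litellm-config-gpu.yaml` already ships with.

```yaml
model_list:
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct   # vLLM speaks the OpenAI protocol
      api_base: http://100.x.x.x:8000/v1                    # GPU server's Tailscale IP
      api_key: "none"
```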
Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!

---

## 💰 Cost Management

### Current Costs (24/7 Operation)
- **GPU Server:** RTX 4090 @ $0.50/hour = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**

### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**

**2. Auto-stop idle pods**
RunPod can auto-stop a pod after a set idle period:
- Dashboard → Pod Settings → Auto-stop after 30 minutes

**3. Use smaller models**
- Mistral 7B instead of Llama 8B: faster, and fits a cheaper GPU
- Quantized models: 4-bit weights need roughly ¼ of the FP16 VRAM

**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours (example below)
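
A minimal sketch of the cron idea; the script name is hypothetical and would wrap whatever queues your prompts against ComfyUI's API:

```bash
# Append a nightly batch run at 03:00 (script path is a placeholder)
( crontab -l 2>/dev/null; echo '0 3 * * * /workspace/gpu-stack/scripts/batch_generate.sh >> /workspace/logs/batch.log 2>&1' ) | crontab -
```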
### Cost Tracking

**Check GPU usage:**
On the RunPod dashboard, open **Billing → Usage History** to see hourly costs and total spend.

**Check API vs GPU savings:**
```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="

# Count requests to llama-3.1-8b vs claude-*
```
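
To turn that into an actual per-model count, one option (this assumes the log lines contain `model=<name>`, as in the grep above):

```bash
docker logs ai_litellm 2>&1 | grep -o 'model=[^ ,"]*' | sort | uniq -c | sort -rn
```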
**Expected savings:**
- 80% of requests → self-hosted = no per-request API cost
- 20% of requests → Claude = API cost
- Break-even if you currently spend more than ~$500/month on APIs: at that level, routing 80% of traffic to the GPU server displaces roughly $400 of API spend, which is in the same range as the ~$410/month GPU bill
---

## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or, for a nicer interactive view:
nvtop
```

### Common Issues

**vLLM not loading model:**
```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model, or see the flags below)
# - Missing HF_TOKEN (for gated models like Llama)
```
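
If VRAM is the bottleneck, two vLLM flags are the usual first knobs. A sketch of how they would appear in the compose `command:`; the values are illustrative:

```yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-8B-Instruct
    - --gpu-memory-utilization
    - "0.85"       # leave VRAM headroom for other services
    - --max-model-len
    - "8192"       # a shorter context means a smaller KV cache
```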
**ComfyUI slow/crashing:**
```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Stop vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```

**Can't access from VPS:**
```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```

**Docker can't see GPU:**
```bash
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check the NVIDIA Container Toolkit: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```

---

## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second aggregate (see the measurement sketch below)
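
To sanity-check the single-stream numbers on your own pod, a rough measurement sketch; it assumes the vLLM endpoint above plus `jq` and `bc`, and the model name is a placeholder:

```bash
START=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "prompt": "Explain what a GPU does, in one paragraph.",
       "max_tokens": 256}' > /tmp/bench.json
END=$(date +%s.%N)
TOKENS=$(jq '.usage.completion_tokens' /tmp/bench.json)
echo "$TOKENS tokens in $(echo "$END - $START" | bc) s"
echo "≈ $(echo "$TOKENS / ($END - $START)" | bc) tokens/second"
```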
**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds

**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: Not recommended on 24GB VRAM

---

## 🔐 Security Best Practices

### Network Security
✅ All services behind Tailscale VPN (end-to-end encrypted)
✅ No public exposure (except RunPod's SSH)
✅ Firewall configured (no additional ports needed)

### Access Control
✅ JupyterLab token-protected
✅ ComfyUI accessible via VPN only
✅ vLLM internal API (no auth needed)

### SSH Security
```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd
```
### Regular Updates
```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```

---

## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
  - 8B LLM at full speed
  - 14B LLM at moderate speed
  - SDXL image generation
  - Training job

**Add 2nd GPU if:**
- You want LLM + image gen simultaneously
- Training + inference at the same time
- Multiple users with high demand

**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

### Deploying Larger Models

**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"                # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2     # Use 2 GPUs
            capabilities: [gpu]
```
---

## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment
- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model

### Week 3: Image Generation
- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI

### Week 4: Training Infrastructure
- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed

---

## 🆘 Getting Help

### Resources
- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl

### Community
- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA

### Support
If you encounter issues:
1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart the service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`

---

## ✅ Success Criteria

You're ready to proceed when all of the following pass (a combined smoke test is sketched below):
- [ ] GPU server responds to `ping [tailscale-ip]` from the VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab accessible with token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models
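
A small script that runs these checks from the VPS; replace the IP placeholder with the GPU server's Tailscale address:

```bash
#!/usr/bin/env bash
GPU_IP=100.x.x.x   # GPU server Tailscale IP

ping -c 2 "$GPU_IP" && echo "OK: VPN reachable"
curl -sf "http://$GPU_IP:8000/v1/models" > /dev/null && echo "OK: vLLM answering"
curl -sf "http://$GPU_IP:8188" > /dev/null && echo "OK: ComfyUI up"
curl -sf "http://$GPU_IP:8888" > /dev/null && echo "OK: JupyterLab up"
curl -sf "http://$GPU_IP:19999/api/v1/info" > /dev/null && echo "OK: Netdata responding"
```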
**Total setup time:** 4-6 hours (if following guides sequentially)

---

## 🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀