# GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

## 📚 Documentation Files

### Planning & Architecture

- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file

### Step-by-Step Setup Guides

1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration

### Configuration Files

- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script

---

## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓

- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi`
- [ ] **Guide:** `SETUP_GUIDE.md`

### Day 3-4: Network Configuration ✓

- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`

### Day 5: Docker & GPU Setup ✓

- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`

### Day 6-7: Deploy Services ✓

- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load the model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models`
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`

---

## 📦 Services Included

### vLLM (http://[tailscale-ip]:8000)

**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization

**Switch models:** Edit `gpu-server-compose.yaml`, change the `--model` parameter (see the sketch below), then restart:

```bash
docker compose restart vllm
```
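For example, swapping in Mistral 7B might look like the following. This is a sketch only: the service name and surrounding fields should match your actual `gpu-server-compose.yaml`, and any Hugging Face model id that fits in 24GB of VRAM will work.

```yaml
# gpu-server-compose.yaml (sketch - only the lines that change)
vllm:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.3  # any HF model id that fits in 24GB VRAM
    - --max-model-len
    - "8192"                              # optional: cap context length to save VRAM
```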
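Once the model has loaded, you can smoke-test vLLM's OpenAI-compatible endpoint directly. The model id below assumes the default Llama 3.1 8B; check `curl http://localhost:8000/v1/models` for the exact id your server registered.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```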
### ComfyUI (http://[tailscale-ip]:8188)

**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting

**Download models:** Access web UI → ComfyUI Manager → Install Models

### JupyterLab (http://[tailscale-ip]:8888)

**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts

### Axolotl (Training - on-demand)

**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF

### Netdata (http://[tailscale-ip]:19999)

**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting

---

## 🔧 Configuration

### Environment Variables (.env)

```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x     # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x     # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x    # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```

### Updating LiteLLM on VPS

After the GPU server is running, update your VPS LiteLLM config:

```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
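For orientation, here is a minimal sketch of how a self-hosted vLLM model sits alongside Claude in a LiteLLM `model_list`. The model names follow the `llama-3.1-8b` / `claude-*` naming used elsewhere in this guide, but treat `litellm-config-gpu.yaml` as the authoritative version:

```yaml
# Sketch of the relevant litellm-config-gpu.yaml entries (verify against the real file)
model_list:
  - model_name: llama-3.1-8b               # self-hosted, routed to the GPU server
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://100.x.x.x:8000/v1   # GPU server's Tailscale IP
      api_key: none                        # vLLM's internal API needs no auth
  - model_name: claude-3-5-sonnet          # paid API, kept for the hard 20%
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
```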
---

## 💰 Cost Management

### Current Costs (24/7 Operation)

- **GPU Server:** RTX 4090 @ $0.50/hour × 24 × 30 = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**

### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**

- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**

**2. Auto-stop idle pods**

RunPod can auto-stop a pod after a set idle period:

- Dashboard → Pod Settings → Auto-stop after 30 minutes

**3. Use smaller models**

- Mistral 7B instead of Llama 8B: faster, and runs on cheaper GPUs
- Quantized models: 4-bit needs roughly 1/4 the VRAM of FP16

**4. Batch image generation**

- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours

### Cost Tracking

**Check GPU usage:** on the RunPod dashboard, Billing → Usage History shows hourly costs and total spent.

**Check API vs GPU savings:**

```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="
# Count requests to llama-3.1-8b vs claude-*
```

**Expected savings:**

- 80% of requests → self-hosted = no per-request cost (the GPU rental is fixed)
- 20% of requests → Claude = API cost
- Break-even if you currently spend >$500/month on APIs (moving 80% of ~$510 in API spend covers the $410/month fixed cost)

---

## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```

### Common Issues

**vLLM not loading model:**

```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```

**ComfyUI slow/crashing:**

```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```

**Can't access from VPS:**

```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```

**Docker can't see GPU:**

```bash
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check NVIDIA Container Toolkit: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```

---

## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**

- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second aggregate

**Image Generation (ComfyUI):**

- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds

**Training (Axolotl):**

- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: not recommended on 24GB VRAM

---

## 🔐 Security Best Practices

### Network Security

- ✅ All services behind Tailscale VPN (end-to-end encrypted)
- ✅ No public exposure (except RunPod's SSH)
- ✅ Firewall configured (no additional ports needed)

### Access Control

- ✅ JupyterLab token-protected
- ✅ ComfyUI accessible via VPN only
- ✅ vLLM internal API (no auth needed)

### SSH Security

```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
#   PermitRootLogin prohibit-password
#   PasswordAuthentication no
#   PubkeyAuthentication yes

systemctl restart sshd
```

### Regular Updates

```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```

---

## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):** it can run ONE of these at a time:

- 8B LLM at full speed
- 14B LLM at moderate speed
- SDXL image generation
- A training job

**Add a 2nd GPU if:**

- You want LLM + image generation simultaneously
- You want training + inference at the same time
- You have multiple users with high demand

**Multi-GPU options:**

- 2× RTX 4090: run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

### Deploying Larger Models

**70B models (need 2× A100 80GB, or 4× RTX 4090 with quantization):**

```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"             # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2  # Use 2 GPUs
            capabilities: [gpu]
```

---

## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment

- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model

### Week 3: Image Generation

- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI

### Week 4: Training Infrastructure

- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed

---

## 🆘 Getting Help

### Resources

- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl

### Community

- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA

### Support

If you encounter issues:

1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`

---

## ✅ Success Criteria

You're ready to proceed when:

- [ ] GPU server responds to `ping [tailscale-ip]` from VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab is accessible with the token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models

You can run most of these checks from the VPS with the script sketched below.
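A small convenience script for those checks, assuming `jq` is installed on the VPS; substitute your GPU server's actual Tailscale IP (the filename `verify-gpu-stack.sh` is just a suggestion):

```bash
#!/usr/bin/env bash
# verify-gpu-stack.sh - run from the VPS to check the success criteria above.
GPU_IP="100.x.x.x"  # your GPU server's Tailscale IP (tailscale ip -4)

echo "1. VPN reachability:"
ping -c 1 -W 2 "$GPU_IP" >/dev/null && echo "   OK" || echo "   FAIL"

echo "2. vLLM models endpoint:"
curl -sf "http://$GPU_IP:8000/v1/models" | jq -r '.data[].id' || echo "   FAIL"

echo "3. ComfyUI web interface:"
curl -sf -o /dev/null "http://$GPU_IP:8188" && echo "   OK" || echo "   FAIL"

echo "4. Netdata metrics:"
curl -sf -o /dev/null "http://$GPU_IP:19999" && echo "   OK" || echo "   FAIL"
```

The JupyterLab token and Open WebUI model-list checks are easiest to do manually in a browser.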
**Total setup time:** 4-6 hours (if following the guides sequentially)

---

## 🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:

- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀