# GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

## 📚 Documentation Files

### Planning & Architecture

- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file

### Step-by-Step Setup Guides

1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration

### Configuration Files

- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script

---

## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓

- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi`
- [ ] **Guide:** `SETUP_GUIDE.md`

### Day 3-4: Network Configuration ✓

- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`

### Day 5: Docker & GPU Setup ✓

- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`

### Day 6-7: Deploy Services ✓

- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load the model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models`
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`

---

## 📦 Services Included

### vLLM (http://[tailscale-ip]:8000)

**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization

**Switch models:** Edit `gpu-server-compose.yaml`, change the `--model` parameter (see the sketch below), then restart:

```bash
docker compose restart vllm
```
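For example, swapping in Mistral 7B might look like the following. This is a sketch only: the service name and surrounding fields should match your actual `gpu-server-compose.yaml`, and any Hugging Face model id that fits in 24GB of VRAM will work.

```yaml
# gpu-server-compose.yaml (sketch - only the lines that change)
vllm:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.3  # any HF model id that fits in 24GB VRAM
    - --max-model-len
    - "8192"                              # optional: cap context length to save VRAM
```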
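Once the model has loaded, you can smoke-test vLLM's OpenAI-compatible endpoint directly. The model id below assumes the default Llama 3.1 8B; check `curl http://localhost:8000/v1/models` for the exact id your server registered.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```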
### ComfyUI (http://[tailscale-ip]:8188)

**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting

**Download models:** Access web UI → ComfyUI Manager → Install Models

### JupyterLab (http://[tailscale-ip]:8888)

**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts

### Axolotl (Training - on-demand)

**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF

### Netdata (http://[tailscale-ip]:19999)

**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting

---

## 🔧 Configuration

### Environment Variables (.env)

```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x     # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x     # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x    # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```

### Updating LiteLLM on VPS

After the GPU server is running, update your VPS LiteLLM config:

```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
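For orientation, here is a minimal sketch of how a self-hosted vLLM model sits alongside Claude in a LiteLLM `model_list`. The model names follow the `llama-3.1-8b` / `claude-*` naming used elsewhere in this guide, but treat `litellm-config-gpu.yaml` as the authoritative version:

```yaml
# Sketch of the relevant litellm-config-gpu.yaml entries (verify against the real file)
model_list:
  - model_name: llama-3.1-8b               # self-hosted, routed to the GPU server
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://100.x.x.x:8000/v1   # GPU server's Tailscale IP
      api_key: none                        # vLLM's internal API needs no auth
  - model_name: claude-3-5-sonnet          # paid API, kept for the hard 20%
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
```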
---

## 💰 Cost Management

### Current Costs (24/7 Operation)

- **GPU Server:** RTX 4090 @ $0.50/hour × 24 × 30 = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**

### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**

- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**

**2. Auto-stop idle pods**

RunPod can auto-stop a pod after a set idle period:

- Dashboard → Pod Settings → Auto-stop after 30 minutes

**3. Use smaller models**

- Mistral 7B instead of Llama 8B: faster, and runs on cheaper GPUs
- Quantized models: 4-bit needs roughly 1/4 the VRAM of FP16

**4. Batch image generation**

- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours

### Cost Tracking

**Check GPU usage:** on the RunPod dashboard, Billing → Usage History shows hourly costs and total spent.

**Check API vs GPU savings:**

```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="
# Count requests to llama-3.1-8b vs claude-*
```

**Expected savings:**

- 80% of requests → self-hosted = no per-request cost (the GPU rental is fixed)
- 20% of requests → Claude = API cost
- Break-even if you currently spend >$500/month on APIs (moving 80% of ~$510 in API spend covers the $410/month fixed cost)

---

## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```

### Common Issues

**vLLM not loading model:**

```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```

**ComfyUI slow/crashing:**

```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```

**Can't access from VPS:**

```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```

**Docker can't see GPU:**

```bash
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check NVIDIA Container Toolkit: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```

---

## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**

- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second aggregate

**Image Generation (ComfyUI):**

- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds

**Training (Axolotl):**

- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: not recommended on 24GB VRAM

---

## 🔐 Security Best Practices

### Network Security

- ✅ All services behind Tailscale VPN (end-to-end encrypted)
- ✅ No public exposure (except RunPod's SSH)
- ✅ Firewall configured (no additional ports needed)

### Access Control

- ✅ JupyterLab token-protected
- ✅ ComfyUI accessible via VPN only
- ✅ vLLM internal API (no auth needed)

### SSH Security

```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
#   PermitRootLogin prohibit-password
#   PasswordAuthentication no
#   PubkeyAuthentication yes

systemctl restart sshd
```

### Regular Updates

```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```

---

## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):** it can run ONE of these at a time:

- 8B LLM at full speed
- 14B LLM at moderate speed
- SDXL image generation
- A training job

**Add a 2nd GPU if:**

- You want LLM + image generation simultaneously
- You want training + inference at the same time
- You have multiple users with high demand

**Multi-GPU options:**

- 2× RTX 4090: run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)

### Deploying Larger Models

**70B models (need 2× A100 80GB, or 4× RTX 4090 with quantization):**

```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"             # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2  # Use 2 GPUs
            capabilities: [gpu]
```

---

## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment

- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model

### Week 3: Image Generation

- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI

### Week 4: Training Infrastructure

- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed

---

## 🆘 Getting Help

### Resources

- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl

### Community

- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA

### Support

If you encounter issues:

1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`

---

## ✅ Success Criteria

You're ready to proceed when:

- [ ] GPU server responds to `ping [tailscale-ip]` from VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab is accessible with the token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models

You can run most of these checks from the VPS with the script sketched below.
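A small convenience script for those checks, assuming `jq` is installed on the VPS; substitute your GPU server's actual Tailscale IP (the filename `verify-gpu-stack.sh` is just a suggestion):

```bash
#!/usr/bin/env bash
# verify-gpu-stack.sh - run from the VPS to check the success criteria above.
GPU_IP="100.x.x.x"  # your GPU server's Tailscale IP (tailscale ip -4)

echo "1. VPN reachability:"
ping -c 1 -W 2 "$GPU_IP" >/dev/null && echo "   OK" || echo "   FAIL"

echo "2. vLLM models endpoint:"
curl -sf "http://$GPU_IP:8000/v1/models" | jq -r '.data[].id' || echo "   FAIL"

echo "3. ComfyUI web interface:"
curl -sf -o /dev/null "http://$GPU_IP:8188" && echo "   OK" || echo "   FAIL"

echo "4. Netdata metrics:"
curl -sf -o /dev/null "http://$GPU_IP:19999" && echo "   OK" || echo "   FAIL"
```

The JupyterLab token and Open WebUI model-list checks are easiest to do manually in a browser.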
**Total setup time:** 4-6 hours (if following the guides sequentially)

---

## 🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:

- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀