Initial commit: RunPod multi-modal AI orchestration stack
- Multi-modal AI infrastructure for RunPod RTX 4090
- Automatic model orchestration (text, image, music)
- Text: vLLM + Qwen 2.5 7B Instruct
- Image: Flux.1 Schnell via OpenEDAI
- Music: MusicGen Medium via AudioCraft
- Cost-optimized sequential loading on a single GPU
- Template preparation scripts for rapid deployment
- Comprehensive documentation (README, DEPLOYMENT, TEMPLATE)
New file: `README_GPU_SETUP.md` (444 lines)
# GPU-Enhanced AI Stack - Implementation Guide

Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.

## 📚 Documentation Files

### Planning & Architecture
- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file
### Step-by-Step Setup Guides
1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration
### Configuration Files
- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script

---
## 🚀 Quick Start (Week 1 Checklist)

### Day 1-2: RunPod & GPU Server ✓
- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi` (quick check sketched below)
- [ ] **Guide:** `SETUP_GUIDE.md`
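If SSH works, a one-liner can double-check that the GPU is visible before moving on. A minimal sketch: the address, port, and key path are placeholders taken from your pod's Connect panel.

```bash
# Placeholders: take IP, port, and key path from the pod's "Connect" details in the RunPod dashboard
ssh -i ~/.ssh/id_ed25519 -p <ssh-port> root@<pod-ip> 'nvidia-smi'
# Expect a table showing the RTX 4090 along with driver and CUDA versions.
```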
### Day 3-4: Network Configuration ✓
- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity (see the sketch below)
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`
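The Tailscale bring-up condenses to a few commands; a sketch using the standard install script, with `TAILSCALE_SETUP.md` as the authoritative walkthrough.

```bash
# Run on both the VPS and the GPU server
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up            # prints an auth URL; approve the device in the admin console
tailscale ip -4              # note each machine's 100.x.x.x address

# From the VPS, confirm the GPU server is reachable over the tailnet
ping -c 3 <gpu-tailscale-ip>
tailscale status             # both machines should show as online
```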
### Day 5: Docker & GPU Setup ✓
- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers (see the sketch below)
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`
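Once Docker and the NVIDIA Container Toolkit are installed (per `DOCKER_GPU_SETUP.md`), registering the runtime and testing GPU access looks roughly like this; the CUDA image tag is just an example of a recent base image.

```bash
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Containers should now see the GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```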
### Day 6-7: Deploy Services ✓
- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh` (see the sketch below)
- [ ] Wait for vLLM to load model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models`
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`
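One possible shape for the deploy step, assuming the compose file, `.env`, and script sit in this directory locally; adjust paths and the Tailscale IP to your setup.

```bash
# From your workstation: copy the stack files onto the GPU server
scp gpu-server-compose.yaml .env deploy-gpu-stack.sh root@<gpu-tailscale-ip>:/workspace/gpu-stack/

# On the GPU server: launch everything and wait for the model to load
ssh root@<gpu-tailscale-ip>
cd /workspace/gpu-stack
chmod +x deploy-gpu-stack.sh
./deploy-gpu-stack.sh

# Should list the served model once vLLM is ready
curl http://localhost:8000/v1/models
```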
---

## 📦 Services Included

### vLLM (http://[tailscale-ip]:8000)
**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization

**Switch models:**
Edit `gpu-server-compose.yaml`, change the `--model` parameter, then restart:

```bash
docker compose restart vllm
```
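Since vLLM serves an OpenAI-compatible API, a plain `curl` chat request works as a quick end-to-end test. A sketch assuming the default model; substitute whatever `/v1/models` reports.

```bash
curl http://[tailscale-ip]:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```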
### ComfyUI (http://[tailscale-ip]:8188)
**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting

**Download models:**
Access web UI → ComfyUI Manager → Install Models
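ComfyUI also exposes a small HTTP API, which is handy for a reachability check from the VPS; the `/system_stats` path is from memory, so treat it as an assumption and verify against your ComfyUI version.

```bash
# Should return a JSON blob with system/VRAM info if ComfyUI is up
# (endpoint path is an assumption -- verify for your ComfyUI version)
curl -s http://[tailscale-ip]:8188/system_stats
```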
### JupyterLab (http://[tailscale-ip]:8888)
**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts

### Axolotl (Training - on-demand)
**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF
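After starting the training profile, a run is typically launched inside the container with Axolotl's CLI. A sketch under assumptions: the compose service is named `axolotl` as above, and `/workspace/configs/lora-example.yml` is a hypothetical config you would write first.

```bash
# Launch a LoRA fine-tune inside the running axolotl container
# (the config path is hypothetical -- point it at your own Axolotl YAML)
docker compose --profile training exec axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-example.yml
```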
### Netdata (http://[tailscale-ip]:19999)
**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting

---
## 🔧 Configuration

### Environment Variables (.env)

```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x              # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x              # GPU server Tailscale IP (get with: tailscale ip -4)

# Model Storage
MODELS_PATH=/workspace/models

# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx

# Weights & Biases (for training logging)
WANDB_API_KEY=

# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025

# PostgreSQL (on VPS)
DB_HOST=100.x.x.x             # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```
### Updating LiteLLM on VPS

After the GPU server is running, update your VPS LiteLLM config:

```bash
# On VPS
cd ~/Projects/docker-compose/ai

# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup

# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml

# Restart LiteLLM
arty restart litellm
```

Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
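For orientation, a self-hosted entry in a LiteLLM config generally looks like the sketch below. This is a hedged example only: the model name, IP, and key are placeholders, and the `litellm-config-gpu.yaml` shipped alongside this README is the source of truth.

```yaml
model_list:
  - model_name: llama-3.1-8b                               # name shown in Open WebUI (placeholder)
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct  # routed as an OpenAI-compatible endpoint
      api_base: http://100.x.x.x:8000/v1                   # GPU server Tailscale IP (placeholder)
      api_key: "none"                                      # vLLM runs without auth on the tailnet
```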
---

## 💰 Cost Management

### Current Costs (24/7 Operation)
- **GPU Server:** RTX 4090 @ $0.50/hour × 24 h × 30 days = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**

### Cost-Saving Options

**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**
**2. Auto-stop idle pods**
RunPod can auto-stop a pod after X minutes idle:
- Dashboard → Pod Settings → Auto-stop after 30 minutes

**3. Use smaller models**
- Mistral 7B instead of Llama 3.1 8B: slightly smaller and faster, leaving more VRAM headroom
- Quantized models: 4-bit weights need roughly 1/4 the VRAM of FP16 (an 8B model drops from ~16GB to ~4-5GB)

**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours
### Cost Tracking

**Check GPU usage:**
On the RunPod dashboard: Billing → Usage History (shows hourly costs and total spent).

**Check API vs GPU savings:**
```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="

# Count requests to llama-3.1-8b vs claude-*
```
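To turn that into actual counts, a small pipeline over the container logs works — a sketch that assumes log lines carry a `model=...` field as grepped above; adjust the pattern to your LiteLLM log format.

```bash
# Tally requests per model from the LiteLLM logs
docker logs ai_litellm 2>&1 \
  | grep -o 'model=[^ ,]*' \
  | sort | uniq -c | sort -rn
```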
**Expected savings:**
- 80% of requests → self-hosted = no per-request cost
- 20% of requests → Claude = API cost
- Break-even if you currently spend >$500/month on APIs

---
## 🔍 Monitoring & Troubleshooting

### Check Service Status

```bash
# On GPU server
cd /workspace/gpu-stack

# View all services
docker compose ps

# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter

# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```
### Common Issues

**vLLM not loading model:**
```bash
# Check logs
docker compose logs vllm

# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try a smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```

**ComfyUI slow/crashing:**
```bash
# Check GPU memory
nvidia-smi

# If VRAM is full:
# - Stop vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```

**Can't access from VPS:**
```bash
# Test VPN
ping [tailscale-ip]

# If it fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```

**Docker can't see GPU:**
```bash
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it fails:
# - Check NVIDIA driver: nvidia-smi
# - Check the container toolkit: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```
---

## 📊 Performance Benchmarks

### Expected Performance (RTX 4090)

**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1,500 tokens/second (aggregate throughput)

**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds

**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: not recommended on 24GB VRAM

---
## 🔐 Security Best Practices

### Network Security
✅ All services behind Tailscale VPN (end-to-end encrypted)
✅ No public exposure (except RunPod's SSH)
✅ Firewall configured (no additional ports needed)

### Access Control
✅ JupyterLab password-protected
✅ ComfyUI accessible via VPN only
✅ vLLM internal API (no auth needed)

### SSH Security
```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config

# Set:
PermitRootLogin prohibit-password
PasswordAuthentication no
PubkeyAuthentication yes

systemctl restart sshd
```
### Regular Updates
```bash
# Weekly updates
apt update && apt upgrade -y

# Update Docker images
docker compose pull
docker compose up -d
```

---
## 📈 Scaling Up

### When to Add More GPUs

**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
  - 8B LLM at full speed
  - 14B LLM at moderate speed
  - SDXL image generation
  - Training job

**Add a 2nd GPU if:**
- You want LLM + image generation simultaneously
- You need training + inference at the same time
- Multiple users with high demand

**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)
### Deploying Larger Models

**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2"                  # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2       # Use 2 GPUs
            capabilities: [gpu]
```

---
## 🎯 Next Steps (Week 2+)

### Week 2: LLM Production Deployment
- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model (see the sketch below)
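For a rough tokens/second number, one request timed by hand is often enough — a sketch assuming `jq` and `bc` are installed and the model name matches `/v1/models`; for rigorous numbers, use vLLM's own benchmarking tools.

```bash
# Rough single-request throughput estimate (requires jq and bc)
MODEL="meta-llama/Meta-Llama-3.1-8B-Instruct"   # placeholder -- use the name reported by /v1/models
START=$(date +%s.%N)
TOKENS=$(curl -s http://[tailscale-ip]:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Write 200 words about GPUs.\"}], \"max_tokens\": 512}" \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "scale=1; $TOKENS / ($END - $START)" | bc   # completion tokens per second
```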
### Week 3: Image Generation
- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI

### Week 4: Training Infrastructure
- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed

---
## 🆘 Getting Help

### Resources
- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl

### Community
- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA

### Support
If you encounter issues:
1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart the service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`

---
## ✅ Success Criteria

You're ready to proceed when (a smoke-test sketch follows this list):
- [ ] GPU server responds to `ping [tailscale-ip]` from VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab accessible with token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models
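A minimal smoke-test sketch, run from the VPS; `GPU_IP` is the GPU server's Tailscale address, and the JupyterLab and Open WebUI items are easiest to confirm in the browser.

```bash
#!/usr/bin/env bash
# Minimal smoke test from the VPS; set GPU_IP to the GPU server's Tailscale address
GPU_IP=100.x.x.x

ping -c 2 "$GPU_IP"                                    && echo "VPN: OK"
curl -sf "http://$GPU_IP:8000/v1/models" >/dev/null    && echo "vLLM: OK"
curl -sf "http://$GPU_IP:8188/" >/dev/null             && echo "ComfyUI: OK"
curl -sf "http://$GPU_IP:19999/" >/dev/null            && echo "Netdata: OK"
```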
**Total setup time:** 4-6 hours (if following the guides sequentially)

---

## 🎉 You're All Set!

Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves on API costs)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging

Enjoy building with your new AI infrastructure! 🚀