Compare commits

...

2 Commits

Author SHA1 Message Date
8de88d96ac docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.)
- Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml)
- Add GPU_DEPLOYMENT_LOG.md with current infrastructure details
- Add GPU_EXPANSION_PLAN.md with complete provider comparison
- Add deploy-gpu-stack.sh automation script

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:57:06 +01:00
c0b1308ffe feat(ai): add GPU server deployment with vLLM and Tailscale
- Add simple_vllm_server.py: Custom AsyncLLMEngine FastAPI server
  - Bypasses multiprocessing issues on RunPod
  - OpenAI-compatible API (/v1/models, /v1/completions, /v1/chat/completions)
  - Uses Qwen 2.5 7B Instruct model

- Add comprehensive setup guides:
  - SETUP_GUIDE.md: RunPod account and GPU server setup
  - TAILSCALE_SETUP.md: VPN configuration (replaces WireGuard)
  - DOCKER_GPU_SETUP.md: Docker + NVIDIA Container Toolkit
  - README_GPU_SETUP.md: Main documentation hub

- Add deployment configurations:
  - litellm-config-gpu.yaml: LiteLLM config with GPU endpoints
  - gpu-server-compose.yaml: Docker Compose for GPU services
  - deploy-gpu-stack.sh: Automated deployment script

- Add GPU_DEPLOYMENT_LOG.md: Current deployment documentation
  - Network: Tailscale IP 100.100.108.13
  - Infrastructure: RunPod RTX 4090, 50GB disk
  - Known issues and troubleshooting guide

- Add GPU_EXPANSION_PLAN.md: 70-page comprehensive expansion plan

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:56:57 +01:00
11 changed files with 4391 additions and 0 deletions

430
ai/DOCKER_GPU_SETUP.md Normal file

@@ -0,0 +1,430 @@
# Docker & NVIDIA Container Toolkit Setup
## Day 5: Docker Configuration on GPU Server
This guide sets up Docker with GPU support on your RunPod server.
---
## Step 1: Install Docker
### Quick Install (Recommended)
```bash
# SSH into GPU server
ssh gpu-pivoine
# Download and run Docker install script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Verify installation
docker --version
docker compose version
```
Expected output:
```
Docker version 24.0.7, build afdd53b
Docker Compose version v2.23.0
```
### Manual Install (Alternative)
```bash
# Add Docker's official GPG key
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
# Add repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Start Docker
systemctl enable docker
systemctl start docker
```
---
## Step 2: Install NVIDIA Container Toolkit
This enables Docker containers to use the GPU.
```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install toolkit
apt-get update
apt-get install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker
# Restart Docker
systemctl restart docker
```
---
## Step 3: Test GPU Access in Docker
### Test 1: Basic CUDA Container
```bash
docker run --rm --runtime=nvidia --gpus all \
nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
Expected output: Same as `nvidia-smi` output showing your RTX 4090.
### Test 2: PyTorch Container
```bash
docker run --rm --runtime=nvidia --gpus all \
pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'Device:', torch.cuda.get_device_name(0))"
```
Expected output:
```
CUDA: True Device: NVIDIA GeForce RTX 4090
```
### Test 3: Multi-GPU Query (if you have multiple GPUs)
```bash
docker run --rm --runtime=nvidia --gpus all \
nvidia/cuda:12.1.0-base-ubuntu22.04 \
bash -c "echo 'GPU Count:' && nvidia-smi --list-gpus"
```
---
## Step 4: Configure Docker Compose with GPU Support
Docker Compose needs to know about the NVIDIA runtime.
### Create daemon.json
```bash
cat > /etc/docker/daemon.json << 'EOF'
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia",
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
EOF
# Restart Docker
systemctl restart docker
```
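To confirm the default runtime change took effect after the restart:
```bash
# Should list nvidia among the runtimes and show it as the default
docker info | grep -i runtime
```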
---
## Step 5: Create GPU Project Structure
```bash
cd /workspace
# Create directory structure
mkdir -p gpu-stack/{vllm,comfyui,training,jupyter,monitoring}
cd gpu-stack
# Create .env file
cat > .env << 'EOF'
# GPU Stack Environment Variables
# Timezone
TIMEZONE=Europe/Berlin
# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2
# Model Storage
MODELS_PATH=/workspace/models
# Hugging Face (optional, for private models)
HF_TOKEN=
# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
# Weights & Biases (optional, for training logging)
WANDB_API_KEY=
EOF
chmod 600 .env
```
---
## Step 6: Test Full Stack (Quick Smoke Test)
Let's deploy a minimal vLLM container to verify everything works:
```bash
cd /workspace/gpu-stack
# Create test compose file
cat > test-compose.yaml << 'EOF'
services:
test-vllm:
image: vllm/vllm-openai:latest
container_name: test_vllm
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: all
command:
- --model
- facebook/opt-125m # Tiny model for testing
- --host
- 0.0.0.0
- --port
- 8000
ports:
- "8000:8000"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
EOF
# Start test
docker compose -f test-compose.yaml up -d
# Wait 30 seconds for model download
sleep 30
# Check logs
docker compose -f test-compose.yaml logs
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "Hello, my name is",
"max_tokens": 10
}'
```
Expected output: a JSON response containing the generated text.
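To pull just the generated text out of that response, a small sketch assuming `jq` is installed (`apt install -y jq`):
```bash
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 10}' \
  | jq -r '.choices[0].text'
```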
**Clean up test:**
```bash
docker compose -f test-compose.yaml down
```
---
## Step 7: Install Additional Tools
```bash
# Python tools
apt install -y python3-pip python3-venv
# Monitoring tools
apt install -y htop nvtop iotop
# Network tools
apt install -y iperf3 tcpdump
# Development tools
apt install -y build-essential
# Git LFS (for large model files)
apt install -y git-lfs
git lfs install
```
---
## Step 8: Configure Automatic Updates (Optional)
```bash
# Install unattended-upgrades
apt install -y unattended-upgrades
# Configure
dpkg-reconfigure -plow unattended-upgrades
# Enable automatic security updates
cat > /etc/apt/apt.conf.d/50unattended-upgrades << 'EOF'
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
EOF
```
---
## Troubleshooting
### Docker can't access GPU
**Problem:** `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`
**Solution:**
```bash
# Verify NVIDIA runtime is configured
docker info | grep -i runtime
# Should show nvidia in runtimes list
# If not, reinstall nvidia-container-toolkit
# Check daemon.json
cat /etc/docker/daemon.json
# Restart Docker
systemctl restart docker
```
### Permission denied on docker commands
**Solution:**
```bash
# Add your user to docker group (if not root)
usermod -aG docker $USER
# Or always use sudo
sudo docker ...
```
### Out of disk space
**Check usage:**
```bash
df -h
du -sh /var/lib/docker
docker system df
```
**Clean up:**
```bash
# Remove unused images
docker image prune -a
# Remove unused volumes
docker volume prune
# Full cleanup
docker system prune -a --volumes
```
---
## Verification Checklist
Before deploying the full stack:
- [ ] Docker installed and running
- [ ] `docker --version` shows 24.x or newer
- [ ] `docker compose version` works
- [ ] NVIDIA Container Toolkit installed
- [ ] `docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi` works
- [ ] PyTorch container can see GPU
- [ ] Test vLLM deployment successful
- [ ] /workspace directory structure created
- [ ] .env file configured with VPN IPs
- [ ] Additional tools installed (nvtop, htop, etc.)
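The checklist above can be scripted; a minimal sketch (paths assume the layout from Step 5):
```bash
docker --version
docker compose version
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi > /dev/null && echo "GPU access in containers: OK"
[ -d /workspace/gpu-stack ] && echo "/workspace/gpu-stack: OK"
[ -f /workspace/gpu-stack/.env ] && echo ".env present: OK"
command -v nvtop > /dev/null && echo "nvtop installed: OK"
```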
---
## Performance Monitoring Commands
**GPU Monitoring:**
```bash
# Real-time GPU stats
watch -n 1 nvidia-smi
# Or with nvtop (prettier)
nvtop
# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
**Docker Stats:**
```bash
# Container resource usage
docker stats
# Specific container
docker stats vllm --no-stream
```
**System Resources:**
```bash
# Overall system
htop
# I/O stats
iotop
# Network
iftop
```
---
## Next: Deploy Production Stack
Now you're ready to deploy the full GPU stack with vLLM, ComfyUI, and training tools.
**Proceed to:** Deploying the production docker-compose.yaml
**Save your progress:**
```bash
cat >> /workspace/SERVER_INFO.md << 'EOF'
## Docker Configuration
- Docker Version: [docker --version]
- NVIDIA Runtime: Enabled
- GPU Access in Containers: ✓
- Test vLLM Deployment: Successful
- Directory: /workspace/gpu-stack
## Tools Installed
- nvtop: GPU monitoring
- htop: System monitoring
- Docker Compose: v2.x
- Git LFS: Large file support
EOF
```

173
ai/GPU_DEPLOYMENT_LOG.md Normal file

@@ -0,0 +1,173 @@
# GPU Server Deployment Log
## Current Deployment (2025-11-21)
### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
### SSH Access
```
Host gpu-pivoine
HostName 213.173.102.232
Port 29695
User root
IdentityFile ~/.ssh/id_ed25519
```
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script
### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
- OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/completions` - Text completion
- `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)
### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)
### Known Issues & Solutions
#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
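If you do want to serve a gated model, pass a Hugging Face token into the environment that launches the server (illustrative sketch; the token value is a placeholder):
```bash
ssh gpu-pivoine "cd /workspace && HF_TOKEN=hf_xxxxxxxxxxxxx nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
```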
#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated anytime.
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
### Monitoring
**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```
**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```
**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```
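The chat endpoint can be exercised the same way (a sketch; use the model ID that `/v1/models` reports, shown here as the configured default):
```bash
curl -s http://100.100.108.13:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```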
### LiteLLM Integration
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
litellm_params:
model: openai/qwen-2.5-7b
api_base: http://100.100.108.13:8000/v1 # Tailscale IP
api_key: dummy
rpm: 1000
tpm: 100000
```
Restart LiteLLM:
```bash
arty restart litellm
```
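After the restart, an end-to-end check from the VPS can look like this (a sketch; drop the Authorization header if your proxy has no master key, and adjust the port if LiteLLM is not on 4000):
```bash
curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "qwen-2.5-7b",
    "messages": [{"role": "user", "content": "Hello from the VPS"}],
    "max_tokens": 32
  }'
```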
### Troubleshooting
**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`
**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
### Next Steps
1. ✅ Deploy vLLM with Qwen 2.5 7B
2. ⏳ Test API endpoints locally and via Tailscale
3. ⏳ Update VPS LiteLLM configuration
4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
5. ⏹️ Monitor performance and costs
6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
7. ⏹️ Set up auto-stop for idle periods to save costs
### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed
### Performance Benchmarks
*To be measured after deployment*
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms

1306
ai/GPU_EXPANSION_PLAN.md Normal file

File diff suppressed because it is too large

444
ai/README_GPU_SETUP.md Normal file

@@ -0,0 +1,444 @@
# GPU-Enhanced AI Stack - Implementation Guide
Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.
## 📚 Documentation Files
### Planning & Architecture
- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file
### Step-by-Step Setup Guides
1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: VPN connection between VPS and GPU server (replaces `WIREGUARD_SETUP.md`, which is kept for reference)
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration
### Configuration Files
- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script
---
## 🚀 Quick Start (Week 1 Checklist)
### Day 1-2: RunPod & GPU Server ✓
- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi`
- [ ] **Guide:** `SETUP_GUIDE.md`
### Day 3-4: Network Configuration ✓
- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`
### Day 5: Docker & GPU Setup ✓
- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`
### Day 6-7: Deploy Services ✓
- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models`
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`
---
## 📦 Services Included
### vLLM (http://[tailscale-ip]:8000)
**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization
**Switch models:**
Edit `gpu-server-compose.yaml`, change `--model` parameter, restart:
```bash
docker compose restart vllm
```
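A quick way to make that edit from the shell (a sketch; on the server the file may be named `docker-compose.yaml`, and the target shown is the no-auth Qwen model used elsewhere in these docs):
```bash
cd /workspace/gpu-stack
# Swap the served model, then recreate the container so the change is picked up
sed -i 's|meta-llama/Meta-Llama-3.1-8B-Instruct|Qwen/Qwen2.5-7B-Instruct|' gpu-server-compose.yaml
docker compose up -d vllm
```
Note that `docker compose restart` alone does not re-read the compose file; `up -d` recreates the container when its configuration has changed.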
### ComfyUI (http://[tailscale-ip]:8188)
**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting
**Download models:**
Access web UI → ComfyUI Manager → Install Models
### JupyterLab (http://[tailscale-ip]:8888)
**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts
### Axolotl (Training - on-demand)
**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF
### Netdata (http://[tailscale-ip]:19999)
**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting
---
## 🔧 Configuration
### Environment Variables (.env)
```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x # GPU server Tailscale IP (get with: tailscale ip -4)
# Model Storage
MODELS_PATH=/workspace/models
# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx
# Weights & Biases (for training logging)
WANDB_API_KEY=
# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025
# PostgreSQL (on VPS)
DB_HOST=100.x.x.x # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```
### Updating LiteLLM on VPS
After GPU server is running, update your VPS LiteLLM config:
```bash
# On VPS
cd ~/Projects/docker-compose/ai
# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup
# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml
# Restart LiteLLM
arty restart litellm
```
Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
---
## 💰 Cost Management
### Current Costs (24/7 Operation)
- **GPU Server:** RTX 4090 @ $0.50/hour = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**
### Cost-Saving Options
**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**
**2. Auto-stop idle pods**
RunPod can auto-stop after X minutes idle:
- Dashboard → Pod Settings → Auto-stop after 30 minutes
**3. Use smaller models**
- Mistral 7B instead of Llama 8B: Faster, cheaper GPU
- Quantized models: 4-bit = 1/4 the VRAM
**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours
### Cost Tracking
**Check GPU usage:**
```bash
# On RunPod dashboard
# Billing → Usage History
# See hourly costs, total spent
```
**Check API vs GPU savings:**
```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="
# Count requests to llama-3.1-8b vs claude-*
```
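One way to get rough per-model counts (a sketch; the exact log format depends on your LiteLLM version):
```bash
docker logs ai_litellm 2>&1 \
  | grep -o 'model=[A-Za-z0-9._-]*' \
  | sort | uniq -c | sort -rn
```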
**Expected savings:**
- 80% of requests → self-hosted = $0 cost
- 20% of requests → Claude = API cost
- Break-even if currently spending >$500/month on APIs
---
## 🔍 Monitoring & Troubleshooting
### Check Service Status
```bash
# On GPU server
cd /workspace/gpu-stack
# View all services
docker compose ps
# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter
# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```
### Common Issues
**vLLM not loading model:**
```bash
# Check logs
docker compose logs vllm
# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```
**ComfyUI slow/crashing:**
```bash
# Check GPU memory
nvidia-smi
# If VRAM full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```
**Can't access from VPS:**
```bash
# Test VPN
ping [tailscale-ip]
# If fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```
**Docker can't see GPU:**
```bash
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# If fails:
# - Check NVIDIA driver: nvidia-smi
# - Check nvidia-docker: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```
---
## 📊 Performance Benchmarks
### Expected Performance (RTX 4090)
**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second
**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds
**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: Not recommended on 24GB VRAM
---
## 🔐 Security Best Practices
### Network Security
- ✅ All services behind Tailscale VPN (end-to-end encrypted)
- ✅ No public exposure (except RunPod's SSH)
- ✅ Firewall configured (no additional ports needed)
### Access Control
- ✅ JupyterLab password-protected
- ✅ ComfyUI accessible via VPN only
- ✅ vLLM internal API (no auth needed)
### SSH Security
```bash
# On GPU server, harden SSH
nano /etc/ssh/sshd_config
# Set the following in sshd_config:
#   PermitRootLogin prohibit-password
#   PasswordAuthentication no
#   PubkeyAuthentication yes
systemctl restart sshd
```
### Regular Updates
```bash
# Weekly updates
apt update && apt upgrade -y
# Update Docker images
docker compose pull
docker compose up -d
```
---
## 📈 Scaling Up
### When to Add More GPUs
**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
- 8B LLM at full speed
- 14B LLM at moderate speed
- SDXL image generation
- Training job
**Add 2nd GPU if:**
- You want LLM + image gen simultaneously
- Training + inference at same time
- Multiple users with high demand
**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)
### Deploying Larger Models
**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
command:
- --model
- meta-llama/Meta-Llama-3.1-70B-Instruct
- --tensor-parallel-size
- "2" # Split across 2 GPUs
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2 # Use 2 GPUs
capabilities: [gpu]
```
---
## 🎯 Next Steps (Week 2+)
### Week 2: LLM Production Deployment
- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model
### Week 3: Image Generation
- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI
### Week 4: Training Infrastructure
- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed
---
## 🆘 Getting Help
### Resources
- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl
### Community
- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA
### Support
If you encounter issues:
1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status` (or `wg show` if using WireGuard)
4. Restart service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`
---
## ✅ Success Criteria
You're ready to proceed when:
- [ ] GPU server responds to `ping [tailscale-ip]` from VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab accessible with token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models
**Total setup time:** 4-6 hours (if following guides sequentially)
---
## 🎉 You're All Set!
Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging
Enjoy building with your new AI infrastructure! 🚀

261
ai/SETUP_GUIDE.md Normal file

@@ -0,0 +1,261 @@
# GPU Server Setup Guide - Week 1
## Day 1-2: RunPod Account & GPU Server
### Step 1: Create RunPod Account
1. **Go to RunPod**: https://www.runpod.io/
2. **Sign up** with email or GitHub
3. **Add billing method**:
- Credit card required
- No charges until you deploy a pod
- Recommended: Add $50 initial credit
4. **Verify email** and complete account setup
### Step 2: Deploy Your First GPU Pod
#### 2.1 Navigate to Pods
1. Click **"Deploy"** in top menu
2. Select **"GPU Pods"**
#### 2.2 Choose GPU Type
**Recommended: RTX 4090**
- 24GB VRAM
- ~$0.50/hour
- Perfect for LLMs up to 14B params
- Great for SDXL/FLUX
**Filter options:**
- GPU Type: RTX 4090
- GPU Count: 1
- Sort by: Price (lowest first)
- Region: Europe (lower latency to Germany)
#### 2.3 Select Template
Choose: **"RunPod PyTorch"** template
- Includes: CUDA, PyTorch, Python
- Pre-configured for GPU workloads
- Docker pre-installed
**Alternative**: "Ubuntu 22.04 with CUDA 12.1" (more control)
#### 2.4 Configure Pod
**Container Settings:**
- **Container Disk**: 50GB (temporary, auto-included)
- **Expose Ports**:
- Add: 22 (SSH)
- Add: 8000 (vLLM)
- Add: 8188 (ComfyUI)
- Add: 8888 (JupyterLab)
**Volume Settings:**
- Click **"+ Network Volume"**
- **Name**: `gpu-models-storage`
- **Size**: 500GB
- **Region**: Same as pod
- **Cost**: ~$50/month
**Environment Variables:**
- Add later (not needed for initial setup)
#### 2.5 Deploy Pod
1. Review configuration
2. Click **"Deploy On-Demand"** (not Spot for reliability)
3. Wait 2-3 minutes for deployment
**Expected cost:**
- GPU: $0.50/hour = $360/month (24/7)
- Storage: $50/month
- **Total: $410/month**
### Step 3: Access Your GPU Server
#### 3.1 Get Connection Info
Once deployed, you'll see:
- **Pod ID**: e.g., `abc123def456`
- **SSH Command**: `ssh root@<pod-id>.runpod.io -p 12345`
- **Public IP**: May not be directly accessible (use SSH)
#### 3.2 SSH Access
RunPod automatically generates SSH keys for you:
```bash
# Copy the SSH command from RunPod dashboard
ssh root@abc123def456.runpod.io -p 12345
# First time: Accept fingerprint
# You should now be in the GPU server!
```
**Verify GPU:**
```bash
nvidia-smi
```
Expected output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx Driver Version: 535.xx CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 45C P0 50W / 450W | 0MiB / 24564MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
```
### Step 4: Initial Server Configuration
#### 4.1 Update System
```bash
# Update package lists
apt update
# Upgrade existing packages
apt upgrade -y
# Install essential tools
apt install -y \
vim \
htop \
tmux \
curl \
wget \
git \
net-tools \
iptables-persistent
```
#### 4.2 Set Timezone
```bash
timedatectl set-timezone Europe/Berlin
date # Verify
```
#### 4.3 Create Working Directory
```bash
# Create workspace
mkdir -p /workspace/{models,configs,data,scripts}
# Check network volume mount
ls -la /workspace
# Should show your 500GB volume
```
#### 4.4 Configure SSH (Optional but Recommended)
**Generate your own SSH key on your local machine:**
```bash
# On your local machine (not GPU server)
ssh-keygen -t ed25519 -C "gpu-server-pivoine" -f ~/.ssh/gpu_pivoine
# Copy public key to GPU server
ssh-copy-id -i ~/.ssh/gpu_pivoine.pub root@abc123def456.runpod.io -p 12345
```
**Add to your local ~/.ssh/config:**
```bash
Host gpu-pivoine
HostName abc123def456.runpod.io
Port 12345
User root
IdentityFile ~/.ssh/gpu_pivoine
```
Now you can connect with: `ssh gpu-pivoine`
### Step 5: Verify GPU Access
Run this test:
```bash
# Test CUDA
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU count:', torch.cuda.device_count())"
```
Expected output:
```
CUDA available: True
GPU count: 1
```
### Troubleshooting
**Problem: Can't connect via SSH**
- Check pod is running (not stopped)
- Verify port number in SSH command
- Try web terminal in RunPod dashboard
**Problem: GPU not detected**
- Run `nvidia-smi`
- Check RunPod selected correct GPU type
- Restart pod if needed
**Problem: Network volume not mounted**
- Check RunPod dashboard → Volume tab
- Verify volume is attached to pod
- Try: `df -h` to see mounts
### Next Steps
Once SSH access works and GPU is verified:
✅ Proceed to **Day 3-4: Network Configuration (Tailscale VPN)**
### Save Important Info
Create a file to track your setup:
```bash
# On GPU server
cat > /workspace/SERVER_INFO.md << 'EOF'
# GPU Server Information
## Connection
- SSH: ssh root@abc123def456.runpod.io -p 12345
- Pod ID: abc123def456
- Region: [YOUR_REGION]
## Hardware
- GPU: RTX 4090 24GB
- CPU: [Check with: lscpu]
- RAM: [Check with: free -h]
- Storage: 500GB network volume at /workspace
## Costs
- GPU: $0.50/hour
- Storage: $50/month
- Total: ~$410/month (24/7)
## Deployed: [DATE]
EOF
```
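To fill in the bracketed placeholders automatically, a small sketch (assumes the usual `lscpu`/`free` output format):
```bash
{
  echo "- CPU: $(lscpu | sed -n 's/^Model name:[[:space:]]*//p')"
  echo "- RAM: $(free -h | awk '/^Mem:/ {print $2}')"
  echo "- Deployed: $(date +%F)"
} >> /workspace/SERVER_INFO.md
```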
---
## Checkpoint ✓
Before moving to Day 3, verify:
- [ ] RunPod account created and billing added
- [ ] RTX 4090 pod deployed successfully
- [ ] 500GB network volume attached
- [ ] SSH access working
- [ ] `nvidia-smi` shows GPU
- [ ] `torch.cuda.is_available()` returns True
- [ ] Timezone set to Europe/Berlin
- [ ] Essential tools installed
**Ready for Tailscale setup? Let's go!**

417
ai/TAILSCALE_SETUP.md Normal file

@@ -0,0 +1,417 @@
# Tailscale VPN Setup - Better Alternative to WireGuard
## Why Tailscale?
RunPod doesn't support UDP ports, which blocks WireGuard. Tailscale solves this by:
- ✅ Works over HTTPS (TCP) - no UDP needed
- ✅ Zero configuration - automatic setup
- ✅ Free for personal use
- ✅ Built on WireGuard (same security)
- ✅ Automatic NAT traversal
- ✅ Peer-to-peer when possible (low latency)
---
## Step 1: Create Tailscale Account
1. Go to: https://tailscale.com/
2. Click **"Get Started"**
3. Sign up with **GitHub** or **Google** (easiest)
4. You'll be redirected to the Tailscale admin console
**No credit card required!** Free tier is perfect for our use case.
---
## Step 2: Install Tailscale on VPS
**SSH into your VPS:**
```bash
ssh root@vps
```
**Install Tailscale:**
```bash
# Download and run install script
curl -fsSL https://tailscale.com/install.sh | sh
# Start Tailscale
tailscale up
# You'll see a URL like:
# https://login.tailscale.com/a/xxxxxxxxxx
```
**Authenticate:**
1. Copy the URL and open in browser
2. Click **"Connect"** to authorize the device
3. Name it: `pivoine-vps`
**Check status:**
```bash
tailscale status
```
You should see your VPS listed with an IP like `100.x.x.x`
**Save your VPS Tailscale IP:**
```bash
tailscale ip -4
# Example output: 100.101.102.103
```
**Write this down - you'll need it!**
---
## Step 3: Install Tailscale on GPU Server
**SSH into your RunPod GPU server:**
```bash
ssh root@abc123def456-12345678.runpod.io -p 12345
```
**Install Tailscale:**
```bash
# Download and run install script
curl -fsSL https://tailscale.com/install.sh | sh
# Start Tailscale
tailscale up --advertise-tags=tag:gpu
# You'll see another URL
```
**Authenticate:**
1. Copy the URL and open in browser
2. Click **"Connect"**
3. Name it: `gpu-runpod`
**Check status:**
```bash
tailscale status
```
You should now see BOTH devices:
- `pivoine-vps` - 100.x.x.x
- `gpu-runpod` - 100.x.x.x
**Save your GPU server Tailscale IP:**
```bash
tailscale ip -4
# Example output: 100.104.105.106
```
---
## Step 4: Test Connectivity
**From VPS, ping GPU server:**
```bash
# SSH into VPS
ssh root@vps
# Ping GPU server (use its Tailscale IP)
ping 100.104.105.106 -c 4
```
Expected output:
```
PING 100.104.105.106 (100.104.105.106) 56(84) bytes of data.
64 bytes from 100.104.105.106: icmp_seq=1 ttl=64 time=15.3 ms
64 bytes from 100.104.105.106: icmp_seq=2 ttl=64 time=14.8 ms
...
```
**From GPU server, ping VPS:**
```bash
# SSH into GPU server
ssh root@abc123def456-12345678.runpod.io -p 12345
# Ping VPS (use its Tailscale IP)
ping 100.101.102.103 -c 4
```
**Both should work!**
---
## Step 5: Update Configuration Files
Now update the IP addresses in your configs to use Tailscale IPs.
### On GPU Server (.env file)
**Edit your .env file:**
```bash
# On GPU server
cd /workspace/gpu-stack
nano .env
```
**Update these lines:**
```bash
# VPN Network (use your actual Tailscale IPs)
VPS_IP=100.101.102.103 # Your VPS Tailscale IP
GPU_IP=100.104.105.106 # Your GPU Tailscale IP
# PostgreSQL (on VPS)
DB_HOST=100.101.102.103 # Your VPS Tailscale IP
DB_PORT=5432
```
Save and exit (Ctrl+X, Y, Enter)
### On VPS (LiteLLM config)
**Edit your LiteLLM config:**
```bash
# On VPS
ssh root@vps
cd ~/Projects/docker-compose/ai
nano litellm-config-gpu.yaml
```
**Update the GPU server IP:**
```yaml
# Find this section and update IP:
- model_name: llama-3.1-8b
litellm_params:
model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
api_base: http://100.104.105.106:8000/v1 # Use GPU Tailscale IP
api_key: dummy
```
Save and exit.
---
## Step 6: Verify PostgreSQL Access
**From GPU server, test database connection:**
```bash
# Install PostgreSQL client
apt install -y postgresql-client
# Test connection (use your VPS Tailscale IP)
psql -h 100.101.102.103 -U valknar -d openwebui -c "SELECT 1;"
```
**If this fails, allow Tailscale network on VPS PostgreSQL:**
```bash
# On VPS
ssh root@vps
# Check if postgres allows Tailscale network
docker exec core_postgres cat /var/lib/postgresql/data/pg_hba.conf | grep 100
# If not present, add it:
docker exec -it core_postgres bash
# Inside container:
echo "host all all 100.0.0.0/8 scram-sha-256" >> /var/lib/postgresql/data/pg_hba.conf
# Restart postgres
exit
docker restart core_postgres
```
Try connecting again - should work now!
---
## Tailscale Management
### View Connected Devices
**Web dashboard:**
https://login.tailscale.com/admin/machines
You'll see all your devices with their Tailscale IPs.
**Command line:**
```bash
tailscale status
```
### Disconnect/Reconnect
```bash
# Stop Tailscale
tailscale down
# Start Tailscale
tailscale up
```
### Remove Device
From web dashboard:
1. Click on device
2. Click "..." menu
3. Select "Disable" or "Delete"
---
## Advantages Over WireGuard
- **Works anywhere** - No UDP ports needed
- **Auto-reconnect** - Survives network changes
- **Multiple devices** - Easy to add laptop, phone, etc.
- **NAT traversal** - Direct peer-to-peer when possible
- **Access Control** - Manage from web dashboard
- **Monitoring** - See connection status in real-time
---
## Security Notes
🔒 **Tailscale is secure:**
- End-to-end encrypted (WireGuard)
- Zero-trust architecture
- No Tailscale servers can see your traffic
- Only authenticated devices can connect
🔒 **Access control:**
- Only devices you authorize can join
- Revoke access anytime from dashboard
- Set ACLs for fine-grained control
---
## Network Reference (Updated)
**Old (WireGuard):**
- VPS: `10.8.0.1`
- GPU: `10.8.0.2`
**New (Tailscale):**
- VPS: `100.101.102.103` (example - use your actual IP)
- GPU: `100.104.105.106` (example - use your actual IP)
**All services now accessible via Tailscale:**
**From VPS to GPU:**
- vLLM: `http://100.104.105.106:8000`
- ComfyUI: `http://100.104.105.106:8188`
- JupyterLab: `http://100.104.105.106:8888`
- Netdata: `http://100.104.105.106:19999`
**From GPU to VPS:**
- PostgreSQL: `100.101.102.103:5432`
- Redis: `100.101.102.103:6379`
- LiteLLM: `http://100.101.102.103:4000`
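A quick reachability sweep of the GPU-side services from the VPS (a sketch; substitute your actual GPU Tailscale IP):
```bash
GPU_IP=100.104.105.106   # replace with your GPU server's Tailscale IP
for port in 8000 8188 8888 19999; do
  printf '%-6s ' "$port"
  curl -s -m 5 -o /dev/null -w '%{http_code}\n' "http://$GPU_IP:$port/"
done
# 200/30x = reachable, 000 = no response
```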
---
## Troubleshooting
### Can't ping between devices
**Check Tailscale status:**
```bash
tailscale status
```
Both devices should show "active" or "online".
**Check connectivity:**
```bash
tailscale ping 100.104.105.106
```
**Restart Tailscale:**
```bash
tailscale down && tailscale up
```
### PostgreSQL connection refused
**Check if postgres is listening on all interfaces:**
```bash
# On VPS
docker exec core_postgres cat /var/lib/postgresql/data/postgresql.conf | grep listen_addresses
```
Should show: `listen_addresses = '*'`
**Check pg_hba.conf allows Tailscale network:**
```bash
docker exec core_postgres cat /var/lib/postgresql/data/pg_hba.conf | grep 100
```
Should have line:
```
host all all 100.0.0.0/8 scram-sha-256
```
### Device not showing in network
**Re-authenticate:**
```bash
tailscale logout
tailscale up
# Click the new URL to re-authenticate
```
---
## Verification Checklist
Before proceeding:
- [ ] Tailscale account created
- [ ] Tailscale installed on VPS
- [ ] Tailscale installed on GPU server
- [ ] Both devices visible in `tailscale status`
- [ ] VPS can ping GPU server (via Tailscale IP)
- [ ] GPU server can ping VPS (via Tailscale IP)
- [ ] PostgreSQL accessible from GPU server
- [ ] .env file updated with Tailscale IPs
- [ ] LiteLLM config updated with GPU Tailscale IP
---
## Next Steps
**Network configured!** Proceed to Docker & GPU setup:
```bash
cat /home/valknar/Projects/docker-compose/ai/DOCKER_GPU_SETUP.md
```
**Your Tailscale IPs (save these!):**
- VPS: `__________________` (from `tailscale ip -4` on VPS)
- GPU: `__________________` (from `tailscale ip -4` on GPU server)
---
## Bonus: Add Your Local Machine
Want to access GPU server from your laptop?
```bash
# On your local machine
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up
# Now you can SSH directly via Tailscale:
ssh root@100.104.105.106
# Or access ComfyUI in browser:
# http://100.104.105.106:8188
```
No more port forwarding needed! 🎉

393
ai/WIREGUARD_SETUP.md Normal file

@@ -0,0 +1,393 @@
# WireGuard VPN Setup - Connecting GPU Server to VPS
## Day 3-4: Network Configuration
This guide connects your RunPod GPU server to your VPS via WireGuard VPN, enabling secure, low-latency communication.
### Architecture
```
┌─────────────────────────────┐ ┌──────────────────────────────┐
│ VPS (pivoine.art) │ │ GPU Server (RunPod) │
│ 10.8.0.1 (WireGuard) │◄───────►│ 10.8.0.2 (WireGuard) │
├─────────────────────────────┤ ├──────────────────────────────┤
│ - LiteLLM Proxy │ │ - vLLM (10.8.0.2:8000) │
│ - Open WebUI │ │ - ComfyUI (10.8.0.2:8188) │
│ - PostgreSQL │ │ - Training │
└─────────────────────────────┘ └──────────────────────────────┘
```
### Prerequisites
- ✅ VPS with root access
- ✅ GPU server with root access
- ✅ Both servers have public IPs
---
## Method 1: Using Existing wg-easy (Recommended)
You already have `wg-easy` running on your VPS. Let's use it!
### Step 1: Access wg-easy Dashboard
**On your local machine:**
1. Open browser: https://vpn.pivoine.art (or whatever your wg-easy URL is)
2. Login with admin password
**Don't have wg-easy set up? Skip to Method 2.**
### Step 2: Create GPU Server Client
1. In wg-easy dashboard, click **"+ New Client"**
2. **Name**: `gpu-server-runpod`
3. Click **"Create"**
4. **Download** configuration file (or copy QR code data)
You'll get a file like: `gpu-server-runpod.conf`
### Step 3: Install WireGuard on GPU Server
**SSH into GPU server:**
```bash
ssh gpu-pivoine # or your SSH command
# Install WireGuard
apt update
apt install -y wireguard wireguard-tools
```
### Step 4: Configure WireGuard on GPU Server
**Upload the config file:**
```bash
# On your local machine, copy the config to GPU server
scp gpu-server-runpod.conf gpu-pivoine:/etc/wireguard/wg0.conf
# Or manually create it on GPU server:
nano /etc/wireguard/wg0.conf
# Paste the configuration from wg-easy
```
**Example config (yours will be different):**
```ini
[Interface]
PrivateKey = <PRIVATE_KEY_FROM_WG_EASY>
Address = 10.8.0.2/24
DNS = 10.8.0.1
[Peer]
PublicKey = <VPS_PUBLIC_KEY_FROM_WG_EASY>
PresharedKey = <PRESHARED_KEY>
AllowedIPs = 10.8.0.0/24
Endpoint = <VPS_PUBLIC_IP>:51820
PersistentKeepalive = 25
```
### Step 5: Start WireGuard
```bash
# Enable IP forwarding
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
# Set permissions
chmod 600 /etc/wireguard/wg0.conf
# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
# Check status
systemctl status wg-quick@wg0
wg show
```
Expected output:
```
interface: wg0
public key: <GPU_SERVER_PUBLIC_KEY>
private key: (hidden)
listening port: 51820
peer: <VPS_PUBLIC_KEY>
endpoint: <VPS_IP>:51820
allowed ips: 10.8.0.0/24
latest handshake: 1 second ago
transfer: 1.2 KiB received, 892 B sent
persistent keepalive: every 25 seconds
```
### Step 6: Test Connectivity
**From GPU server, ping VPS:**
```bash
ping 10.8.0.1 -c 4
```
Expected output:
```
PING 10.8.0.1 (10.8.0.1) 56(84) bytes of data.
64 bytes from 10.8.0.1: icmp_seq=1 ttl=64 time=25.3 ms
64 bytes from 10.8.0.1: icmp_seq=2 ttl=64 time=24.8 ms
...
```
**From VPS, ping GPU server:**
```bash
ssh root@vps
ping 10.8.0.2 -c 4
```
**Test PostgreSQL access from GPU server:**
```bash
# On GPU server
apt install -y postgresql-client
# Try connecting to VPS postgres
psql -h 10.8.0.1 -U valknar -d openwebui -c "SELECT 1;"
# Should work if postgres allows 10.8.0.0/24
```
---
## Method 2: Manual WireGuard Setup (If no wg-easy)
### Step 1: Install WireGuard on Both Servers
**On VPS:**
```bash
ssh root@vps
apt update
apt install -y wireguard wireguard-tools
```
**On GPU Server:**
```bash
ssh gpu-pivoine
apt update
apt install -y wireguard wireguard-tools
```
### Step 2: Generate Keys
**On VPS:**
```bash
cd /etc/wireguard
umask 077
wg genkey | tee vps-private.key | wg pubkey > vps-public.key
```
**On GPU Server:**
```bash
cd /etc/wireguard
umask 077
wg genkey | tee gpu-private.key | wg pubkey > gpu-public.key
```
### Step 3: Create Config on VPS
**On VPS (`/etc/wireguard/wg0.conf`):**
```bash
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
PrivateKey = <VPS_PRIVATE_KEY>
Address = 10.8.0.1/24
ListenPort = 51820
SaveConfig = false
# GPU Server Peer
[Peer]
PublicKey = <GPU_PUBLIC_KEY>
AllowedIPs = 10.8.0.2/32
PersistentKeepalive = 25
EOF
```
Replace `<VPS_PRIVATE_KEY>` with contents of `vps-private.key`
Replace `<GPU_PUBLIC_KEY>` with contents from GPU server's `gpu-public.key`
### Step 4: Create Config on GPU Server
**On GPU Server (`/etc/wireguard/wg0.conf`):**
```bash
cat > /etc/wireguard/wg0.conf << 'EOF'
[Interface]
PrivateKey = <GPU_PRIVATE_KEY>
Address = 10.8.0.2/24
[Peer]
PublicKey = <VPS_PUBLIC_KEY>
AllowedIPs = 10.8.0.0/24
Endpoint = <VPS_PUBLIC_IP>:51820
PersistentKeepalive = 25
EOF
```
Replace:
- `<GPU_PRIVATE_KEY>` with contents of `gpu-private.key`
- `<VPS_PUBLIC_KEY>` with contents from VPS's `vps-public.key`
- `<VPS_PUBLIC_IP>` with your VPS's public IP address
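If you prefer to splice the values in from the shell rather than editing by hand, something like this works on the GPU server (a sketch; paste the VPS values where indicated):
```bash
cd /etc/wireguard
VPS_PUBLIC_KEY='<paste contents of vps-public.key from the VPS>'
VPS_PUBLIC_IP='<your VPS public IP>'
sed -i "s|<GPU_PRIVATE_KEY>|$(cat gpu-private.key)|" wg0.conf
sed -i "s|<VPS_PUBLIC_KEY>|$VPS_PUBLIC_KEY|" wg0.conf
sed -i "s|<VPS_PUBLIC_IP>|$VPS_PUBLIC_IP|" wg0.conf
```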
### Step 5: Start WireGuard on Both
**On VPS:**
```bash
# Enable IP forwarding
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
# Start WireGuard
chmod 600 /etc/wireguard/wg0.conf
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
```
**On GPU Server:**
```bash
# Enable IP forwarding
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
# Start WireGuard
chmod 600 /etc/wireguard/wg0.conf
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
```
### Step 6: Configure Firewall
**On VPS:**
```bash
# Allow WireGuard port
ufw allow 51820/udp
ufw reload
# Or with iptables
iptables -A INPUT -p udp --dport 51820 -j ACCEPT
iptables-save > /etc/iptables/rules.v4
```
**On GPU Server (RunPod):**
```bash
# Allow WireGuard
ufw allow 51820/udp
ufw reload
```
### Step 7: Test Connection
Same as Method 1 Step 6.
---
## Troubleshooting
### No handshake
**Check:**
```bash
wg show
```
If "latest handshake" shows "never":
1. Verify public keys are correct (easy to swap them!)
2. Check firewall allows UDP 51820
3. Verify endpoint IP is correct
4. Check `systemctl status wg-quick@wg0` for errors
### Can ping but can't access services
**On VPS, check PostgreSQL allows 10.8.0.0/24:**
```bash
# PostgreSQL runs in the core_postgres container, so edit its config there
docker exec -it core_postgres bash
# Inside the container, edit postgresql.conf and set:
#   listen_addresses = '*'
nano /var/lib/postgresql/data/postgresql.conf
# Edit pg_hba.conf and add:
#   host all all 10.8.0.0/24 scram-sha-256
nano /var/lib/postgresql/data/pg_hba.conf
# Leave the container, then restart postgres
exit
docker restart core_postgres
```
### WireGuard won't start
```bash
# Check logs
journalctl -u wg-quick@wg0 -n 50
# Common issues:
# - Wrong permissions: chmod 600 /etc/wireguard/wg0.conf
# - Invalid keys: regenerate with wg genkey
# - Port already in use: lsof -i :51820
```
---
## Verification Checklist
Before proceeding to Day 5:
- [ ] WireGuard installed on both VPS and GPU server
- [ ] VPN tunnel established (wg show shows handshake)
- [ ] GPU server can ping VPS (10.8.0.1)
- [ ] VPS can ping GPU server (10.8.0.2)
- [ ] Firewall allows WireGuard (UDP 51820)
- [ ] PostgreSQL accessible from GPU server
- [ ] WireGuard starts on boot (systemctl enable)
---
## Network Reference
**VPN IPs:**
- VPS: `10.8.0.1`
- GPU Server: `10.8.0.2`
**Service Access from GPU Server:**
- PostgreSQL: `postgresql://valknar:password@10.8.0.1:5432/dbname`
- Redis: `10.8.0.1:6379`
- LiteLLM: `http://10.8.0.1:4000`
- Mailpit: `10.8.0.1:1025`
**Service Access from VPS:**
- vLLM: `http://10.8.0.2:8000`
- ComfyUI: `http://10.8.0.2:8188`
- JupyterLab: `http://10.8.0.2:8888`
---
## Next: Docker & GPU Setup
Once VPN is working, proceed to **Day 5: Docker & NVIDIA Container Toolkit Setup**.
**Save connection info:**
```bash
# On GPU server
cat >> /workspace/SERVER_INFO.md << 'EOF'
## VPN Configuration
- VPN IP: 10.8.0.2
- VPS VPN IP: 10.8.0.1
- WireGuard Status: Active
- Latest Handshake: [Check with: wg show]
## Network Access
- Can reach VPS services: ✓
- VPS can reach GPU services: ✓
EOF
```

229
ai/deploy-gpu-stack.sh Executable file

@@ -0,0 +1,229 @@
#!/bin/bash
# GPU Stack Deployment Script
# Run this on the GPU server after SSH access is established
set -e # Exit on error
echo "=================================="
echo "GPU Stack Deployment Script"
echo "=================================="
echo ""
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Functions
print_success() {
echo -e "${GREEN}$1${NC}"
}
print_error() {
echo -e "${RED}$1${NC}"
}
print_info() {
echo -e "${YELLOW}$1${NC}"
}
# Check if running as root
if [[ $EUID -ne 0 ]]; then
print_error "This script must be run as root (use sudo)"
exit 1
fi
# Step 1: Check prerequisites
print_info "Checking prerequisites..."
if ! command -v docker &> /dev/null; then
print_error "Docker is not installed. Please run DOCKER_GPU_SETUP.md first."
exit 1
fi
print_success "Docker installed"
if ! command -v nvidia-smi &> /dev/null; then
print_error "nvidia-smi not found. Is this a GPU server?"
exit 1
fi
print_success "NVIDIA GPU detected"
if ! docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi &> /dev/null; then
print_error "Docker cannot access GPU. Please configure NVIDIA Container Toolkit."
exit 1
fi
print_success "Docker GPU access working"
# Step 2: Create directory structure
print_info "Creating directory structure..."
mkdir -p /workspace/gpu-stack/{vllm,comfyui,training/{configs,data,output},notebooks,monitoring}
cd /workspace/gpu-stack
print_success "Directory structure created"
# Step 3: Create .env file
if [ ! -f .env ]; then
print_info "Creating .env file..."
cat > .env << 'EOF'
# GPU Stack Environment Variables
# Timezone
TIMEZONE=Europe/Berlin
# VPN Network
VPS_IP=10.8.0.1
GPU_IP=10.8.0.2
# Model Storage (network volume)
MODELS_PATH=/workspace/models
# Hugging Face Token (optional, for gated models like Llama)
# Get from: https://huggingface.co/settings/tokens
HF_TOKEN=
# Weights & Biases (optional, for training logging)
# Get from: https://wandb.ai/authorize
WANDB_API_KEY=
# JupyterLab Access Token
JUPYTER_TOKEN=pivoine-ai-2025
# PostgreSQL (on VPS)
DB_HOST=10.8.0.1
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
EOF
chmod 600 .env
print_success ".env file created (please edit with your tokens)"
else
print_success ".env file already exists"
fi
# Step 4: Download docker-compose.yaml
print_info "Downloading docker-compose.yaml..."
# In production, this would be copied from the repo
# For now, assume it's already in the current directory
if [ ! -f docker-compose.yaml ]; then
print_error "docker-compose.yaml not found. Please copy gpu-server-compose.yaml to docker-compose.yaml"
exit 1
fi
print_success "docker-compose.yaml found"
# Step 5: Pre-download models (optional but recommended)
print_info "Do you want to pre-download models? (y/n)"
read -r response
if [[ "$response" =~ ^[Yy]$ ]]; then
print_info "Downloading Llama 3.1 8B Instruct (this will take a while)..."
mkdir -p /workspace/models
# Use huggingface-cli to download
pip install -q huggingface-hub
huggingface-cli download \
meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /workspace/models/Meta-Llama-3.1-8B-Instruct \
--local-dir-use-symlinks False || print_error "Model download failed (may need HF_TOKEN)"
print_success "Model downloaded to /workspace/models"
fi
# Step 6: Start services
print_info "Starting GPU stack services..."
docker compose up -d vllm comfyui jupyter netdata
print_success "Services starting (this may take a few minutes)..."
# Step 7: Wait for services
print_info "Waiting for services to be ready..."
sleep 10
# Check service health
print_info "Checking service status..."
if docker ps | grep -q gpu_vllm; then
print_success "vLLM container running"
else
print_error "vLLM container not running"
fi
if docker ps | grep -q gpu_comfyui; then
print_success "ComfyUI container running"
else
print_error "ComfyUI container not running"
fi
if docker ps | grep -q gpu_jupyter; then
print_success "JupyterLab container running"
else
print_error "JupyterLab container not running"
fi
if docker ps | grep -q gpu_netdata; then
print_success "Netdata container running"
else
print_error "Netdata container not running"
fi
# Step 8: Display access information
echo ""
echo "=================================="
echo "Deployment Complete!"
echo "=================================="
echo ""
echo "Services accessible via VPN (from VPS):"
echo " - vLLM API: http://10.8.0.2:8000"
echo " - ComfyUI: http://10.8.0.2:8188"
echo " - JupyterLab: http://10.8.0.2:8888 (token: pivoine-ai-2025)"
echo " - Netdata: http://10.8.0.2:19999"
echo ""
echo "Local access (from GPU server):"
echo " - vLLM API: http://localhost:8000"
echo " - ComfyUI: http://localhost:8188"
echo " - JupyterLab: http://localhost:8888"
echo " - Netdata: http://localhost:19999"
echo ""
echo "Useful commands:"
echo " - View logs: docker compose logs -f"
echo " - Check status: docker compose ps"
echo " - Stop all: docker compose down"
echo " - Restart service: docker compose restart vllm"
echo " - Start training: docker compose --profile training up -d axolotl"
echo ""
echo "Next steps:"
echo " 1. Wait for vLLM to load model (check logs: docker compose logs -f vllm)"
echo " 2. Test vLLM: curl http://localhost:8000/v1/models"
echo " 3. Configure LiteLLM on VPS to use http://10.8.0.2:8000"
echo " 4. Download ComfyUI models via web interface"
echo ""
# Step 9: Create helpful aliases
print_info "Creating helpful aliases..."
cat >> ~/.bashrc << 'EOF'
# GPU Stack Aliases
alias gpu-logs='cd /workspace/gpu-stack && docker compose logs -f'
alias gpu-ps='cd /workspace/gpu-stack && docker compose ps'
alias gpu-restart='cd /workspace/gpu-stack && docker compose restart'
alias gpu-down='cd /workspace/gpu-stack && docker compose down'
alias gpu-up='cd /workspace/gpu-stack && docker compose up -d'
alias gpu-stats='watch -n 1 nvidia-smi'
alias gpu-top='nvtop'
EOF
print_success "Aliases added to ~/.bashrc (reload with: source ~/.bashrc)"
echo ""
print_success "All done! 🚀"

237
ai/gpu-server-compose.yaml Normal file

@@ -0,0 +1,237 @@
# GPU Server Docker Compose Configuration
# Deploy on RunPod GPU server (10.8.0.2)
# Services accessible from VPS (10.8.0.1) via WireGuard VPN
version: '3.8'
services:
# =============================================================================
# vLLM - High-performance LLM Inference Server
# =============================================================================
vllm:
image: vllm/vllm-openai:latest
container_name: gpu_vllm
restart: unless-stopped
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: all
CUDA_VISIBLE_DEVICES: "0"
HF_TOKEN: ${HF_TOKEN:-}
volumes:
- ${MODELS_PATH:-/workspace/models}:/root/.cache/huggingface
command:
- --model
- meta-llama/Meta-Llama-3.1-8B-Instruct # Change model here
- --host
- 0.0.0.0
- --port
- 8000
- --tensor-parallel-size
- "1"
- --gpu-memory-utilization
- "0.85" # Leave 15% for other tasks
- --max-model-len
- "8192"
- --dtype
- auto
- --trust-remote-code
ports:
- "8000:8000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 120s # Model loading takes time
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
labels:
- "service=vllm"
- "stack=gpu-ai"
# =============================================================================
# ComfyUI - Advanced Stable Diffusion Interface
# =============================================================================
comfyui:
image: ghcr.io/ai-dock/comfyui:latest
container_name: gpu_comfyui
restart: unless-stopped
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: all
TZ: ${TIMEZONE:-Europe/Berlin}
# ComfyUI auto-installs custom nodes on first run
COMFYUI_FLAGS: "--listen 0.0.0.0 --port 8188"
volumes:
- comfyui_data:/data
- ${MODELS_PATH:-/workspace/models}/comfyui:/opt/ComfyUI/models
- comfyui_output:/opt/ComfyUI/output
- comfyui_input:/opt/ComfyUI/input
ports:
- "8188:8188"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8188/"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
labels:
- "service=comfyui"
- "stack=gpu-ai"
# =============================================================================
# Axolotl - LLM Fine-tuning Framework
# =============================================================================
# Note: This service uses "profiles" - only starts when explicitly requested
# Start with: docker compose --profile training up -d axolotl
axolotl:
image: winglian/axolotl:main-py3.11-cu121-2.2.2
container_name: gpu_training
runtime: nvidia
volumes:
- ./training/configs:/workspace/configs
- ./training/data:/workspace/data
- ./training/output:/workspace/output
- ${MODELS_PATH:-/workspace/models}:/workspace/models
- training_cache:/root/.cache
environment:
NVIDIA_VISIBLE_DEVICES: all
WANDB_API_KEY: ${WANDB_API_KEY:-}
HF_TOKEN: ${HF_TOKEN:-}
working_dir: /workspace
# Default command - override when running specific training
command: sleep infinity
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles:
- training
labels:
- "service=axolotl"
- "stack=gpu-ai"
# =============================================================================
# JupyterLab - Interactive Development Environment
# =============================================================================
jupyter:
image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
container_name: gpu_jupyter
restart: unless-stopped
runtime: nvidia
volumes:
- ./notebooks:/workspace/notebooks
- ${MODELS_PATH:-/workspace/models}:/workspace/models
- jupyter_cache:/root/.cache
ports:
- "8888:8888"
environment:
NVIDIA_VISIBLE_DEVICES: all
JUPYTER_ENABLE_LAB: "yes"
JUPYTER_TOKEN: ${JUPYTER_TOKEN:-pivoine-ai-2025}
HF_TOKEN: ${HF_TOKEN:-}
command: |
bash -c "
pip install --quiet jupyterlab transformers datasets accelerate bitsandbytes peft trl sentencepiece protobuf &&
jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser --NotebookApp.token='${JUPYTER_TOKEN:-pivoine-ai-2025}'
"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8888/"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
labels:
- "service=jupyter"
- "stack=gpu-ai"
# =============================================================================
# Netdata - System & GPU Monitoring
# =============================================================================
netdata:
image: netdata/netdata:latest
container_name: gpu_netdata
restart: unless-stopped
runtime: nvidia
hostname: gpu-runpod
cap_add:
- SYS_PTRACE
- SYS_ADMIN
security_opt:
- apparmor:unconfined
environment:
NVIDIA_VISIBLE_DEVICES: all
TZ: ${TIMEZONE:-Europe/Berlin}
volumes:
- /sys:/host/sys:ro
- /proc:/host/proc:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /etc/os-release:/host/etc/os-release:ro
- netdata_config:/etc/netdata
- netdata_cache:/var/cache/netdata
- netdata_lib:/var/lib/netdata
ports:
- "19999:19999"
labels:
- "service=netdata"
- "stack=gpu-ai"
# =============================================================================
# Volumes
# =============================================================================
volumes:
# ComfyUI data
comfyui_data:
driver: local
comfyui_output:
driver: local
comfyui_input:
driver: local
# Training data
training_cache:
driver: local
# Jupyter data
jupyter_cache:
driver: local
# Netdata data
netdata_config:
driver: local
netdata_cache:
driver: local
netdata_lib:
driver: local
# =============================================================================
# Networks
# =============================================================================
networks:
default:
driver: bridge
ipam:
config:
- subnet: 172.25.0.0/24
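A minimal way to exercise this compose file once it is copied to the GPU server (the file name and working directory here are assumptions; adjust to your layout):

```bash
# Bring up vLLM first; model download and loading can take a while
# (hence the 120s start_period on the healthcheck)
docker compose -f gpu-server-compose.yaml up -d vllm
docker compose -f gpu-server-compose.yaml logs -f vllm

# Confirm the OpenAI-compatible endpoint responds locally
curl http://localhost:8000/v1/models

# The Axolotl service sits behind the "training" profile and only starts on request
docker compose -f gpu-server-compose.yaml --profile training up -d axolotl
```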

199
ai/litellm-config-gpu.yaml Normal file
View File

@@ -0,0 +1,199 @@
# LiteLLM Configuration with GPU Server Integration
# This config includes both Anthropic Claude (API) and self-hosted models (vLLM on GPU server)
model_list:
# =============================================================================
# Anthropic Claude Models (API-based, for complex reasoning)
# =============================================================================
- model_name: claude-sonnet-4
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-sonnet-4.5
litellm_params:
model: anthropic/claude-sonnet-4-5-20250929
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-3-5-sonnet
litellm_params:
model: anthropic/claude-3-5-sonnet-20241022
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-3-opus
litellm_params:
model: anthropic/claude-3-opus-20240229
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-3-haiku
litellm_params:
model: anthropic/claude-3-haiku-20240307
api_key: os.environ/ANTHROPIC_API_KEY
# =============================================================================
# Self-Hosted Models (vLLM on GPU server via WireGuard VPN)
# =============================================================================
# Llama 3.1 8B Instruct - Fast, general-purpose, good for routine tasks
- model_name: llama-3.1-8b
litellm_params:
model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
api_base: http://10.8.0.2:8000/v1
api_key: dummy # vLLM doesn't require auth
rpm: 1000 # Rate limit: requests per minute
tpm: 100000 # Rate limit: tokens per minute
# Alternative models (uncomment and configure on GPU server as needed)
# Qwen 2.5 14B Instruct - Excellent multilingual, stronger reasoning
# - model_name: qwen-2.5-14b
# litellm_params:
# model: openai/Qwen/Qwen2.5-14B-Instruct
# api_base: http://10.8.0.2:8000/v1
# api_key: dummy
# rpm: 800
# tpm: 80000
# Mistral 7B Instruct - Very fast, lightweight
# - model_name: mistral-7b
# litellm_params:
# model: openai/mistralai/Mistral-7B-Instruct-v0.3
# api_base: http://10.8.0.2:8000/v1
# api_key: dummy
# rpm: 1200
# tpm: 120000
# DeepSeek Coder 6.7B - Code generation specialist
# - model_name: deepseek-coder-6.7b
# litellm_params:
# model: openai/deepseek-ai/deepseek-coder-6.7b-instruct
# api_base: http://10.8.0.2:8000/v1
# api_key: dummy
# rpm: 1000
# tpm: 100000
# =============================================================================
# Router Settings - Intelligent Model Selection
# =============================================================================
# Model aliases for easy switching in Open WebUI
model_name_map:
# Default model (self-hosted, fast)
gpt-3.5-turbo: llama-3.1-8b
# Power users can use Claude for complex tasks
gpt-4: claude-sonnet-4.5
gpt-4-turbo: claude-sonnet-4.5
# LiteLLM Settings
litellm_settings:
drop_params: true
set_verbose: false # Disable verbose logging for better performance
# Enable caching with Redis for better performance
cache: true
cache_params:
type: redis
host: redis
port: 6379
ttl: 3600 # Cache for 1 hour
# Number of failed calls tolerated before a deployment is cooled down
allowed_fails: 0
# Modify params before sending to provider
modify_params: true
# Enable success and failure logging but minimize overhead
success_callback: [] # Disable all success callbacks to reduce DB writes
failure_callback: [] # Disable all failure callbacks
# Router Settings
router_settings:
allowed_fails: 0
# Routing strategy for load-balancing across deployments of a model group;
# failover from self-hosted models to Claude is handled by the fallback list below
routing_strategy: simple-shuffle
# Cooldown for failed models
cooldown_time: 30 # seconds
# Drop unsupported parameters
default_litellm_params:
drop_params: true
# General Settings
general_settings:
disable_responses_id_security: true
# Spend logging (disabling reduces database overhead)
disable_spend_logs: false # Keep enabled to track API vs GPU costs
# Disable tag tracking
disable_tag_tracking: true
# Daily spend aggregation (disabling reduces database writes)
disable_daily_spend_logs: false # Keep enabled for cost analysis
# Master key for authentication (set via env var)
master_key: os.environ/LITELLM_MASTER_KEY
# Database for logging (optional but recommended for cost tracking)
database_url: os.environ/DATABASE_URL
# Enable OpenAPI docs
docs_url: /docs
# =============================================================================
# Usage Guidelines (for Open WebUI users)
# =============================================================================
#
# Model Selection Guide:
#
# Use llama-3.1-8b for:
# - General chat and Q&A
# - Simple code generation
# - Data extraction
# - Summarization
# - Translation
# - Most routine tasks
# Cost: ~$0/month (self-hosted)
# Speed: ~50-80 tokens/second
#
# Use qwen-2.5-14b for:
# - Complex reasoning
# - Multi-step problems
# - Advanced code generation
# - Multilingual tasks
# Cost: ~$0/month (self-hosted)
# Speed: ~30-50 tokens/second
#
# Use claude-sonnet-4.5 for:
# - Very complex reasoning
# - Long documents (200K context)
# - Production-critical code
# - When quality matters most
# Cost: ~$3/million input tokens, ~$15/million output tokens
# Speed: ~30-40 tokens/second
#
# Use claude-3-haiku for:
# - API fallback (if self-hosted down)
# - Very fast responses needed
# Cost: ~$0.25/million input tokens, ~$1.25/million output tokens
# Speed: ~60-80 tokens/second
#
# =============================================================================
# Health Check Configuration
health_check:
# Check vLLM health endpoint
enabled: true
interval: 30 # seconds
timeout: 5 # seconds
# Fallback Configuration
# If GPU server is down, automatically use Claude
fallback:
- ["llama-3.1-8b", "claude-3-haiku"]
- ["qwen-2.5-14b", "claude-sonnet-4.5"]

302
ai/simple_vllm_server.py Normal file
View File

@@ -0,0 +1,302 @@
#!/usr/bin/env python3
"""
Simple vLLM server using AsyncLLMEngine directly
Bypasses the multiprocessing issues we hit with the default vLLM API server
OpenAI-compatible endpoints: /v1/models, /v1/completions, and /v1/chat/completions
"""
import asyncio
import json
import logging
import os
from typing import AsyncIterator, Dict, List, Optional
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from pydantic import BaseModel, Field
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.utils import random_uuid
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# FastAPI app
app = FastAPI(title="Simple vLLM Server", version="1.0.0")
# Global engine instance
engine: Optional[AsyncLLMEngine] = None
model_name: str = "Qwen/Qwen2.5-7B-Instruct"
# Request/Response models
class CompletionRequest(BaseModel):
"""OpenAI-compatible completion request"""
model: str = Field(default="qwen-2.5-7b")
prompt: str | List[str] = Field(..., description="Text prompt(s)")
max_tokens: int = Field(default=512, ge=1, le=4096)
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
top_p: float = Field(default=1.0, ge=0.0, le=1.0)
n: int = Field(default=1, ge=1, le=10)
stream: bool = Field(default=False)
stop: Optional[str | List[str]] = None
presence_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
frequency_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
class ChatMessage(BaseModel):
"""Chat message format"""
role: str = Field(..., description="Role: system, user, or assistant")
content: str = Field(..., description="Message content")
class ChatCompletionRequest(BaseModel):
"""OpenAI-compatible chat completion request"""
model: str = Field(default="qwen-2.5-7b")
messages: List[ChatMessage] = Field(..., description="Chat messages")
max_tokens: int = Field(default=512, ge=1, le=4096)
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
top_p: float = Field(default=1.0, ge=0.0, le=1.0)
n: int = Field(default=1, ge=1, le=10)
stream: bool = Field(default=False)
stop: Optional[str | List[str]] = None
@app.on_event("startup")
async def startup_event():
"""Initialize vLLM engine on startup"""
global engine, model_name
logger.info(f"Initializing vLLM AsyncLLMEngine with model: {model_name}")
# Configure engine
engine_args = AsyncEngineArgs(
model=model_name,
tensor_parallel_size=1, # Single GPU
gpu_memory_utilization=0.85, # Use 85% of GPU memory
max_model_len=4096, # Context length
dtype="auto", # Auto-detect dtype
download_dir="/workspace/huggingface_cache", # Large disk
trust_remote_code=True, # Some models require this
enforce_eager=False, # Use CUDA graphs for better performance
)
# Create async engine
engine = AsyncLLMEngine.from_engine_args(engine_args)
logger.info("vLLM AsyncLLMEngine initialized successfully")
@app.get("/")
async def root():
"""Health check endpoint"""
return {"status": "ok", "model": model_name}
@app.get("/health")
async def health():
"""Detailed health check"""
return {
"status": "healthy" if engine else "initializing",
"model": model_name,
"ready": engine is not None
}
@app.get("/v1/models")
async def list_models():
"""OpenAI-compatible models endpoint"""
return {
"object": "list",
"data": [
{
"id": "qwen-2.5-7b",
"object": "model",
"created": 1234567890,
"owned_by": "pivoine-gpu",
"permission": [],
"root": model_name,
"parent": None,
}
]
}
def messages_to_prompt(messages: List[ChatMessage]) -> str:
"""Convert chat messages to a single prompt string"""
# Qwen 2.5 chat template format
prompt_parts = []
for msg in messages:
role = msg.role
content = msg.content
if role == "system":
prompt_parts.append(f"<|im_start|>system\n{content}<|im_end|>")
elif role == "user":
prompt_parts.append(f"<|im_start|>user\n{content}<|im_end|>")
elif role == "assistant":
prompt_parts.append(f"<|im_start|>assistant\n{content}<|im_end|>")
# Add final assistant prompt
prompt_parts.append("<|im_start|>assistant\n")
return "\n".join(prompt_parts)
@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
"""OpenAI-compatible completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Handle both single prompt and batch prompts
prompts = [request.prompt] if isinstance(request.prompt, str) else request.prompt
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else [],
presence_penalty=request.presence_penalty,
frequency_penalty=request.frequency_penalty,
)
# Generate completions
results = []
for prompt in prompts:
request_id = random_uuid()
if request.stream:
# Streaming response
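# Note: when stream=true is combined with a batch of prompts, only the
# first prompt is streamed; the function returns before the loop continues.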
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": 0,
"logprobs": None,
"finish_reason": output.outputs[0].finish_reason,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
results.append({
"text": final_output.outputs[0].text,
"index": len(results),
"logprobs": None,
"finish_reason": final_output.outputs[0].finish_reason,
})
return {
"id": random_uuid(),
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": results,
"usage": {
"prompt_tokens": 0, # vLLM doesn't expose this easily
"completion_tokens": 0,
"total_tokens": 0,
}
}
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
"""OpenAI-compatible chat completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Convert messages to prompt
prompt = messages_to_prompt(request.messages)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else ["<|im_end|>"],
)
request_id = random_uuid()
if request.stream:
# Streaming response
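# Note: each update yielded by engine.generate() carries the full text
# generated so far, so the "delta" field below repeats the cumulative
# completion rather than an incremental diff.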
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "chat.completion.chunk",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"delta": {"content": output.outputs[0].text},
"finish_reason": output.outputs[0].finish_reason,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
return {
"id": request_id,
"object": "chat.completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": final_output.outputs[0].text,
},
"finish_reason": final_output.outputs[0].finish_reason,
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0,
}
}
if __name__ == "__main__":
import uvicorn
# Get configuration from environment
host = os.getenv("VLLM_HOST", "0.0.0.0")
port = int(os.getenv("VLLM_PORT", "8000"))
logger.info(f"Starting vLLM server on {host}:{port}")
uvicorn.run(
app,
host=host,
port=port,
log_level="info",
access_log=True,
)
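With the dependencies installed (`vllm`, `fastapi`, `uvicorn`, `pydantic`), a minimal smoke test of this server looks like the following; the port matches the `VLLM_PORT` default above:

```bash
# Start the server (downloads Qwen/Qwen2.5-7B-Instruct to /workspace/huggingface_cache on first run)
python3 simple_vllm_server.py

# In another shell: readiness check, then a short chat completion
curl http://localhost:8000/health
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-2.5-7b",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 64}'
```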