docs(ai): add comprehensive GPU setup documentation and configs

- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.) - Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml) - Add GPU_DEPLOYMENT_LOG.md with current infrastructure details - Add GPU_EXPANSION_PLAN.md with complete provider comparison - Add deploy-gpu-stack.sh automation script 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:57:06 +01:00
parent c0b1308ffe
commit 8de88d96ac
10 changed files with 4089 additions and 0 deletions
--- a/ai/GPU_DEPLOYMENT_LOG.md
+++ b/ai/GPU_DEPLOYMENT_LOG.md
@@ -0,0 +1,173 @@
+# GPU Server Deployment Log
+
+## Current Deployment (2025-11-21)
+
+### Infrastructure
+- **Provider**: RunPod (Spot Instance)
+- **GPU**: NVIDIA RTX 4090 24GB
+- **Disk**: 50GB local SSD (expanded from 20GB)
+- **Network Volume**: 922TB at `/workspace`
+- **Region**: Europe
+- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
+
+### Network Configuration
+- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
+- **GPU Server Tailscale IP**: 100.100.108.13
+- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
+
+### SSH Access
+```
+Host gpu-pivoine
+    HostName 213.173.102.232
+    Port 29695
+    User root
+    IdentityFile ~/.ssh/id_ed25519
+```
+
+**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
+
+### Software Stack
+- **Python**: 3.11.10
+- **vLLM**: 0.6.4.post1 (installed with pip)
+- **PyTorch**: 2.5.1 with CUDA 12.4
+- **Tailscale**: Installed via official script
+
+### vLLM Deployment
+
+**Custom Server**: `ai/simple_vllm_server.py`
+- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
+- OpenAI-compatible API endpoints:
+  - `GET /v1/models` - List available models
+  - `POST /v1/completions` - Text completion
+  - `POST /v1/chat/completions` - Chat completion
+- Default model: Qwen/Qwen2.5-7B-Instruct
+- Cache directory: `/workspace/huggingface_cache`
+
+**Deployment Command**:
+```bash
+# Copy server script to GPU server
+scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
+
+# Start server
+ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
+
+# Check status
+ssh gpu-pivoine "curl http://localhost:8000/v1/models"
+```
+
+**Server Configuration** (environment variables):
+- `VLLM_HOST`: 0.0.0.0 (default)
+- `VLLM_PORT`: 8000 (default)
+
+### Model Configuration
+- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
+- **Context Length**: 4096 tokens
+- **GPU Memory**: 85% utilization
+- **Tensor Parallel**: 1 (single GPU)
+
+### Known Issues & Solutions
+
+#### Issue 1: vLLM Multiprocessing Errors
+**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
+**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
+
+#### Issue 2: Disk Space (Solved)
+**Problem**: Original 20GB disk filled up with Hugging Face cache.
+**Solution**: Expanded to 50GB and use `/workspace` for model cache.
+
+#### Issue 3: Gated Models
+**Problem**: Llama models require Hugging Face authentication.
+**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
+
+#### Issue 4: Spot Instance Volatility
+**Problem**: RunPod Spot instances can be terminated anytime.
+**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
+
+### Monitoring
+
+**Check vLLM logs**:
+```bash
+ssh gpu-pivoine "tail -f /workspace/vllm.log"
+```
+
+**Check GPU usage**:
+```bash
+ssh gpu-pivoine "nvidia-smi"
+```
+
+**Check Tailscale status**:
+```bash
+ssh gpu-pivoine "tailscale status"
+```
+
+**Test API locally (on GPU server)**:
+```bash
+ssh gpu-pivoine "curl http://localhost:8000/v1/models"
+```
+
+**Test API via Tailscale (from VPS)**:
+```bash
+curl http://100.100.108.13:8000/v1/models
+```
+
+### LiteLLM Integration
+
+Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
+
+```yaml
+# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
+- model_name: qwen-2.5-7b
+  litellm_params:
+    model: openai/qwen-2.5-7b
+    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
+    api_key: dummy
+    rpm: 1000
+    tpm: 100000
+```
+
+Restart LiteLLM:
+```bash
+arty restart litellm
+```
+
+### Troubleshooting
+
+**Server not responding**:
+1. Check if process is running: `pgrep -f simple_vllm_server`
+2. Check logs: `tail -100 /workspace/vllm.log`
+3. Check GPU availability: `nvidia-smi`
+4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
+
+**Tailscale not connected**:
+1. Check status: `tailscale status`
+2. Check daemon: `ps aux | grep tailscaled`
+3. Restart: `tailscale down && tailscale up`
+
+**Model download failing**:
+1. Check disk space: `df -h`
+2. Check cache directory: `ls -lah /workspace/huggingface_cache`
+3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
+
+### Next Steps
+1. ✅ Deploy vLLM with Qwen 2.5 7B
+2. ⏳ Test API endpoints locally and via Tailscale
+3. ⏳ Update VPS LiteLLM configuration
+4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
+5. ⏹️ Monitor performance and costs
+6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
+7. ⏹️ Set up auto-stop for idle periods to save costs
+
+### Cost Optimization Ideas
+1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
+2. **Spot Instances**: Already using Spot for 50% cost reduction
+3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
+4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
+5. **Pay-as-you-go**: Manually start/stop pod as needed
+
+### Performance Benchmarks
+*To be measured after deployment*
+
+Expected (based on RTX 4090):
+- Qwen 2.5 7B: 50-80 tokens/second
+- Context processing: ~2-3 seconds for 1000 tokens
+- First token latency: ~200-300ms