docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.) - Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml) - Add GPU_DEPLOYMENT_LOG.md with current infrastructure details - Add GPU_EXPANSION_PLAN.md with complete provider comparison - Add deploy-gpu-stack.sh automation script 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
173
ai/GPU_DEPLOYMENT_LOG.md
Normal file
173
ai/GPU_DEPLOYMENT_LOG.md
Normal file
@@ -0,0 +1,173 @@
|
||||
# GPU Server Deployment Log
|
||||
|
||||
## Current Deployment (2025-11-21)
|
||||
|
||||
### Infrastructure
|
||||
- **Provider**: RunPod (Spot Instance)
|
||||
- **GPU**: NVIDIA RTX 4090 24GB
|
||||
- **Disk**: 50GB local SSD (expanded from 20GB)
|
||||
- **Network Volume**: 922TB at `/workspace`
|
||||
- **Region**: Europe
|
||||
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
|
||||
|
||||
### Network Configuration
|
||||
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
|
||||
- **GPU Server Tailscale IP**: 100.100.108.13
|
||||
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
|
||||
|
||||
### SSH Access
|
||||
```
|
||||
Host gpu-pivoine
|
||||
HostName 213.173.102.232
|
||||
Port 29695
|
||||
User root
|
||||
IdentityFile ~/.ssh/id_ed25519
|
||||
```
|
||||
|
||||
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
|
||||
|
||||
### Software Stack
|
||||
- **Python**: 3.11.10
|
||||
- **vLLM**: 0.6.4.post1 (installed with pip)
|
||||
- **PyTorch**: 2.5.1 with CUDA 12.4
|
||||
- **Tailscale**: Installed via official script
|
||||
|
||||
### vLLM Deployment
|
||||
|
||||
**Custom Server**: `ai/simple_vllm_server.py`
|
||||
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
|
||||
- OpenAI-compatible API endpoints:
|
||||
- `GET /v1/models` - List available models
|
||||
- `POST /v1/completions` - Text completion
|
||||
- `POST /v1/chat/completions` - Chat completion
|
||||
- Default model: Qwen/Qwen2.5-7B-Instruct
|
||||
- Cache directory: `/workspace/huggingface_cache`
|
||||
|
||||
**Deployment Command**:
|
||||
```bash
|
||||
# Copy server script to GPU server
|
||||
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
|
||||
|
||||
# Start server
|
||||
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
|
||||
|
||||
# Check status
|
||||
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
|
||||
```
|
||||
|
||||
**Server Configuration** (environment variables):
|
||||
- `VLLM_HOST`: 0.0.0.0 (default)
|
||||
- `VLLM_PORT`: 8000 (default)
|
||||
|
||||
### Model Configuration
|
||||
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
|
||||
- **Context Length**: 4096 tokens
|
||||
- **GPU Memory**: 85% utilization
|
||||
- **Tensor Parallel**: 1 (single GPU)
|
||||
|
||||
### Known Issues & Solutions
|
||||
|
||||
#### Issue 1: vLLM Multiprocessing Errors
|
||||
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
|
||||
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
|
||||
|
||||
#### Issue 2: Disk Space (Solved)
|
||||
**Problem**: Original 20GB disk filled up with Hugging Face cache.
|
||||
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
|
||||
|
||||
#### Issue 3: Gated Models
|
||||
**Problem**: Llama models require Hugging Face authentication.
|
||||
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
|
||||
|
||||
#### Issue 4: Spot Instance Volatility
|
||||
**Problem**: RunPod Spot instances can be terminated anytime.
|
||||
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
|
||||
|
||||
### Monitoring
|
||||
|
||||
**Check vLLM logs**:
|
||||
```bash
|
||||
ssh gpu-pivoine "tail -f /workspace/vllm.log"
|
||||
```
|
||||
|
||||
**Check GPU usage**:
|
||||
```bash
|
||||
ssh gpu-pivoine "nvidia-smi"
|
||||
```
|
||||
|
||||
**Check Tailscale status**:
|
||||
```bash
|
||||
ssh gpu-pivoine "tailscale status"
|
||||
```
|
||||
|
||||
**Test API locally (on GPU server)**:
|
||||
```bash
|
||||
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
|
||||
```
|
||||
|
||||
**Test API via Tailscale (from VPS)**:
|
||||
```bash
|
||||
curl http://100.100.108.13:8000/v1/models
|
||||
```
|
||||
|
||||
### LiteLLM Integration
|
||||
|
||||
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
|
||||
|
||||
```yaml
|
||||
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
|
||||
- model_name: qwen-2.5-7b
|
||||
litellm_params:
|
||||
model: openai/qwen-2.5-7b
|
||||
api_base: http://100.100.108.13:8000/v1 # Tailscale IP
|
||||
api_key: dummy
|
||||
rpm: 1000
|
||||
tpm: 100000
|
||||
```
|
||||
|
||||
Restart LiteLLM:
|
||||
```bash
|
||||
arty restart litellm
|
||||
```
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
**Server not responding**:
|
||||
1. Check if process is running: `pgrep -f simple_vllm_server`
|
||||
2. Check logs: `tail -100 /workspace/vllm.log`
|
||||
3. Check GPU availability: `nvidia-smi`
|
||||
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
|
||||
|
||||
**Tailscale not connected**:
|
||||
1. Check status: `tailscale status`
|
||||
2. Check daemon: `ps aux | grep tailscaled`
|
||||
3. Restart: `tailscale down && tailscale up`
|
||||
|
||||
**Model download failing**:
|
||||
1. Check disk space: `df -h`
|
||||
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
|
||||
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
|
||||
|
||||
### Next Steps
|
||||
1. ✅ Deploy vLLM with Qwen 2.5 7B
|
||||
2. ⏳ Test API endpoints locally and via Tailscale
|
||||
3. ⏳ Update VPS LiteLLM configuration
|
||||
4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
|
||||
5. ⏹️ Monitor performance and costs
|
||||
6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
|
||||
7. ⏹️ Set up auto-stop for idle periods to save costs
|
||||
|
||||
### Cost Optimization Ideas
|
||||
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
|
||||
2. **Spot Instances**: Already using Spot for 50% cost reduction
|
||||
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
|
||||
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
|
||||
5. **Pay-as-you-go**: Manually start/stop pod as needed
|
||||
|
||||
### Performance Benchmarks
|
||||
*To be measured after deployment*
|
||||
|
||||
Expected (based on RTX 4090):
|
||||
- Qwen 2.5 7B: 50-80 tokens/second
|
||||
- Context processing: ~2-3 seconds for 1000 tokens
|
||||
- First token latency: ~200-300ms
|
||||
Reference in New Issue
Block a user