# GPU Server Deployment Log
## Current Deployment (2025-11-21)
### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
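A minimal sketch of the Tailscale bringup on the GPU server, assuming the standard Tailscale CLI flow and no systemd inside the pod (only the "installed via official script" step is confirmed above):

```bash
# Install Tailscale via the official script, then join the tailnet
curl -fsSL https://tailscale.com/install.sh | sh

# No systemd in the container: start the daemon in the background.
# Add --tun=userspace-networking if /dev/net/tun is unavailable.
nohup tailscaled > /var/log/tailscaled.log 2>&1 &

tailscale up        # prints a login URL; or pass --authkey=<key>
tailscale ip -4     # note the 100.x.x.x address used below
```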
### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
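After a Spot restart, the new public IP and port from the RunPod console can be verified before editing `~/.ssh/config` (the values below are placeholders):

```bash
# Placeholders: take the real values from the RunPod console after a restart
NEW_IP=203.0.113.10
NEW_PORT=29999
ssh -i ~/.ssh/id_ed25519 -p "$NEW_PORT" "root@$NEW_IP" hostname \
  && echo "Reachable - update HostName/Port for gpu-pivoine in ~/.ssh/config"
```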
### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script
### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
- OpenAI-compatible API endpoints:
  - `GET /v1/models` - List available models
  - `POST /v1/completions` - Text completion
  - `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
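For reference, a request to the chat endpoint follows the standard OpenAI schema; the `model` field is assumed to match the served Hugging Face ID (check `GET /v1/models` for the exact name):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.7
      }'
```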
**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)
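To run on a non-default host or port, set these variables when launching, e.g.:

```bash
# Example: serve on port 8001 instead of the default 8000
ssh gpu-pivoine "cd /workspace && VLLM_PORT=8001 nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
```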
### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)
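For reference, these settings correspond to the standard vLLM engine arguments. If the stock OpenAI server is ever retried (it currently hits the multiprocessing issue described under Issue 1 below), the roughly equivalent invocation would be:

```bash
# Stock vLLM OpenAI server with the same settings (not used here; see Issue 1)
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1 \
  --download-dir /workspace/huggingface_cache \
  --port 8000
```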
### Known Issues & Solutions
#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
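To keep downloads off the root disk, the standard Hugging Face cache variable can point at the network volume (assuming the server script does not already pin the cache path itself):

```bash
# Keep model downloads on the network volume rather than the local SSD
export HF_HUB_CACHE=/workspace/huggingface_cache
```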
#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
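If a gated model is ever needed, export the token before starting the server (the token value below is a placeholder):

```bash
# Only needed for gated models such as Llama
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```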
#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated anytime.
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
### Monitoring
**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```
**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```
**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```
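The individual checks above can be rolled into one hypothetical health-check script run from the VPS, for example:

```bash
#!/usr/bin/env bash
# Quick end-to-end health check, run from the VPS
set -u

echo "--- GPU ---"
ssh gpu-pivoine "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader" \
  || echo "SSH unreachable"

echo "--- vLLM API via Tailscale ---"
curl -sf --max-time 10 http://100.100.108.13:8000/v1/models > /dev/null \
  && echo "API OK" || echo "API unreachable"
```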
### LiteLLM Integration
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
```yaml
# Replace the old WireGuard IP (10.8.0.2) with the Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
```
Restart LiteLLM:
```bash
arty restart litellm
```
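After the restart, confirm the route end-to-end through LiteLLM (assuming LiteLLM's default port 4000 on the VPS and an API key if one is configured; adjust both to your setup):

```bash
# Assumes LiteLLM listens on :4000; the key header may be optional
curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_API_KEY" \
  -d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "ping"}]}'
```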
### Troubleshooting
**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server; cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &`
**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`
**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
### Next Steps
1. ✅ Deploy vLLM with Qwen 2.5 7B
2. ⏳ Test API endpoints locally and via Tailscale
3. ⏳ Update VPS LiteLLM configuration
4. ⏳ Test end-to-end: Open WebUI → LiteLLM → vLLM
5. ⏹️ Monitor performance and costs
6. ⏹️ Consider adding more models (Mistral, DeepSeek Coder)
7. ⏹️ Set up auto-stop for idle periods to save costs
### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle (see the sketch after this list)
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed
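A minimal sketch of the auto-stop idea, assuming `runpodctl` is available inside the pod and `RUNPOD_POD_ID` is set in its environment (verify both before relying on this):

```bash
#!/usr/bin/env bash
# Hypothetical idle watchdog (not yet deployed): stop the pod after ~30 minutes
# of near-zero GPU utilization.
# Assumptions: runpodctl is installed and RUNPOD_POD_ID is set in the pod env.
IDLE_LIMIT_MIN=30
idle=0
while true; do
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1 | tr -d ' ')
  if [ "${util:-0}" -lt 5 ]; then
    idle=$((idle + 1))
  else
    idle=0
  fi
  if [ "$idle" -ge "$IDLE_LIMIT_MIN" ]; then
    runpodctl stop pod "$RUNPOD_POD_ID"
    exit 0
  fi
  sleep 60
done
```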
### Performance Benchmarks
*To be measured after deployment*
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms
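Once deployed, a rough way to sanity-check these numbers is to time a fixed-length completion and divide output tokens by elapsed seconds (model name assumed as above; generation may stop before `max_tokens`):

```bash
# Crude throughput check: ~256 output tokens / elapsed seconds ≈ tokens/s
time curl -s http://100.100.108.13:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Write a short story about a GPU.", "max_tokens": 256, "temperature": 0.7}' \
  > /dev/null
```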