docker-compose/ai/SETUP_GUIDE.md
Sebastian Krüger 8de88d96ac docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.)
- Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml)
- Add GPU_DEPLOYMENT_LOG.md with current infrastructure details
- Add GPU_EXPANSION_PLAN.md with complete provider comparison
- Add deploy-gpu-stack.sh automation script

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:57:06 +01:00


# GPU Server Setup Guide - Week 1
## Day 1-2: RunPod Account & GPU Server
### Step 1: Create RunPod Account
1. **Go to RunPod**: https://www.runpod.io/
2. **Sign up** with email or GitHub
3. **Add billing method**:
   - Credit card required
   - No charges until you deploy a pod
   - Recommended: Add $50 initial credit
4. **Verify email** and complete account setup
### Step 2: Deploy Your First GPU Pod
#### 2.1 Navigate to Pods
1. Click **"Deploy"** in top menu
2. Select **"GPU Pods"**
#### 2.2 Choose GPU Type
**Recommended: RTX 4090**
- 24GB VRAM
- ~$0.50/hour
- Suitable for LLMs up to ~14B params (at that size, use 8-bit or 4-bit quantization: FP16 weights alone need about 28GB)
- Great for SDXL/FLUX
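
To judge whether a model fits in 24GB, a rough weight-only estimate is parameters times bytes per parameter. The helper below is a minimal sketch; the function name `estimate_weights_gb` is illustrative, and the figure ignores KV cache, activations, and framework overhead, so real usage will be higher.

```bash
# Rough weight-only VRAM estimate in GB (1 GB = 10^9 bytes).
# Assumption: ignores KV cache, activations, and framework overhead.
estimate_weights_gb() {
    local params_billion="$1"   # e.g. 14 for a 14B model
    local bytes_per_param="$2"  # 2 = FP16/BF16, 1 = 8-bit, 0.5 = 4-bit
    awk -v p="$params_billion" -v b="$bytes_per_param" \
        'BEGIN { printf "%.1f\n", p * b }'
}
```

For example, `estimate_weights_gb 14 2` prints `28.0`, which exceeds the 24GB card, while `estimate_weights_gb 14 0.5` prints `7.0`.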
**Filter options:**
- GPU Type: RTX 4090
- GPU Count: 1
- Sort by: Price (lowest first)
- Region: Europe (lower latency to Germany)
#### 2.3 Select Template
Choose: **"RunPod PyTorch"** template
- Includes: CUDA, PyTorch, Python
- Pre-configured for GPU workloads
- Docker pre-installed
**Alternative**: "Ubuntu 22.04 with CUDA 12.1" (more control)
#### 2.4 Configure Pod
**Container Settings:**
- **Container Disk**: 50GB (temporary, auto-included)
- **Expose Ports**:
  - Add: 22 (SSH)
  - Add: 8000 (vLLM)
  - Add: 8188 (ComfyUI)
  - Add: 8888 (JupyterLab)
**Volume Settings:**
- Click **"+ Network Volume"**
- **Name**: `gpu-models-storage`
- **Size**: 500GB
- **Region**: Same as pod
- **Cost**: ~$50/month
**Environment Variables:**
- Add later (not needed for initial setup)
#### 2.5 Deploy Pod
1. Review configuration
2. Click **"Deploy On-Demand"** (not Spot, for reliability)
3. Wait 2-3 minutes for deployment
**Expected cost:**
- GPU: $0.50/hour × 24 hours × 30 days = $360/month (24/7)
- Storage: $50/month
- **Total: $410/month**
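
The arithmetic above can be reproduced for other prices or usage patterns. This is a sketch under stated assumptions: a 30-day month, storage billed at a flat monthly rate, and an illustrative function name `monthly_cost` with an optional hours-per-day argument that is not part of the original guide.

```bash
# Estimated monthly cost in USD, assuming a 30-day month.
monthly_cost() {
    local gpu_per_hour="$1"        # on-demand GPU price in USD/hour
    local storage_per_month="$2"   # network volume price in USD/month
    local hours_per_day="${3:-24}" # defaults to 24/7 operation
    awk -v g="$gpu_per_hour" -v s="$storage_per_month" -v h="$hours_per_day" \
        'BEGIN { printf "%.2f\n", g * h * 30 + s }'
}
```

`monthly_cost 0.50 50` prints `410.00`, matching the total above; `monthly_cost 0.50 50 8` prints `170.00` for a pod that runs 8 hours per day.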
### Step 3: Access Your GPU Server
#### 3.1 Get Connection Info
Once deployed, you'll see:
- **Pod ID**: e.g., `abc123def456`
- **SSH Command**: `ssh root@<pod-id>.runpod.io -p 12345`
- **Public IP**: May not be directly accessible (use SSH)
#### 3.2 SSH Access
RunPod automatically generates SSH keys for you:
```bash
# Copy the SSH command from RunPod dashboard
ssh root@abc123def456.runpod.io -p 12345
# First time: Accept fingerprint
# You should now be in the GPU server!
```
**Verify GPU:**
```bash
nvidia-smi
```
Expected output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx       Driver Version: 535.xx       CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   45C    P0    50W / 450W |      0MiB / 24564MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
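
For scripted checks, the table is awkward to parse; `nvidia-smi` also offers a CSV query mode. The sketch below uses it to confirm the card reports enough memory. The function names `gpu_summary` and `has_min_vram` are illustrative, and the 24000 MiB threshold in the usage note is an assumption based on the 24564MiB shown above.

```bash
# Print one "name, total MiB" line per GPU; fails if nvidia-smi is missing.
gpu_summary() {
    command -v nvidia-smi >/dev/null 2>&1 || {
        echo "nvidia-smi not found" >&2
        return 1
    }
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
}

# Succeed if a CSV line such as "NVIDIA GeForce RTX 4090, 24564"
# reports at least min_mib MiB of total memory.
has_min_vram() {
    local csv_line="$1" min_mib="$2"
    local total_mib="${csv_line##*, }"  # keep the text after the last ", "
    [ "$total_mib" -ge "$min_mib" ]
}
```

On the pod, `has_min_vram "$(gpu_summary | head -n 1)" 24000 && echo "GPU OK"` would confirm the first GPU.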
### Step 4: Initial Server Configuration
#### 4.1 Update System
```bash
# Update package lists
apt update
# Upgrade existing packages
apt upgrade -y
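# Install essential tools
# (noninteractive avoids the iptables-persistent prompt blocking the install)
DEBIAN_FRONTEND=noninteractive apt install -y \
    vim \
    htop \
    tmux \
    curl \
    wget \
    git \
    net-tools \
    iptables-persistent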
```
#### 4.2 Set Timezone
```bash
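timedatectl set-timezone Europe/Berlin
# If timedatectl fails (common in containers without systemd), link the zone file instead:
# ln -sf /usr/share/zoneinfo/Europe/Berlin /etc/localtime
date # Verify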
```
#### 4.3 Create Working Directory
```bash
# Create workspace
mkdir -p /workspace/{models,configs,data,scripts}
# Check network volume mount
ls -la /workspace
df -h /workspace  # Should show your 500GB volume
```
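
To confirm the layout after a pod restart, a small check like the following may help. It is a sketch: the function name `check_workspace` is illustrative, and it only tests that the four directories created above exist.

```bash
# Report which expected subdirectories exist under a base directory
# (defaults to /workspace). Returns non-zero if any is missing.
check_workspace() {
    local base="${1:-/workspace}"
    local missing=0 dir
    for dir in models configs data scripts; do
        if [ -d "$base/$dir" ]; then
            echo "ok: $base/$dir"
        else
            echo "missing: $base/$dir"
            missing=1
        fi
    done
    return "$missing"
}
```

Run `check_workspace` with no arguments on the GPU server, or pass another base path to test it elsewhere.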
#### 4.4 Configure SSH (Optional but Recommended)
**Generate your own SSH key on your local machine:**
```bash
# On your local machine (not GPU server)
ssh-keygen -t ed25519 -C "gpu-server-pivoine" -f ~/.ssh/gpu_pivoine
# Copy public key to GPU server
ssh-copy-id -i ~/.ssh/gpu_pivoine.pub -p 12345 root@abc123def456.runpod.io
```
**Add to your local ~/.ssh/config:**
```bash
Host gpu-pivoine
    HostName abc123def456.runpod.io
    Port 12345
    User root
    IdentityFile ~/.ssh/gpu_pivoine
```
Now you can connect with: `ssh gpu-pivoine`
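
If the host or port shown in the dashboard changes, the entry has to be rewritten. A helper like the one below can print it from the current values. This is a sketch: `ssh_config_entry` is a hypothetical name, and the arguments are placeholders for your own pod details.

```bash
# Print an ~/.ssh/config entry for the pod.
# Arguments: alias, hostname, port, private key path (all from your own setup).
ssh_config_entry() {
    local alias_name="$1" host="$2" port="$3" key_file="$4"
    printf 'Host %s\n' "$alias_name"
    printf '    HostName %s\n' "$host"
    printf '    Port %s\n' "$port"
    printf '    User root\n'
    printf '    IdentityFile %s\n' "$key_file"
}
```

For example, `ssh_config_entry gpu-pivoine abc123def456.runpod.io 12345 ~/.ssh/gpu_pivoine` prints the entry shown above; review the output before appending it to `~/.ssh/config`.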
### Step 5: Verify GPU Access
Run this test:
```bash
# Test CUDA
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU count:', torch.cuda.device_count())"
```
Expected output:
```
CUDA available: True
GPU count: 1
```
### Troubleshooting
**Problem: Can't connect via SSH**
- Check pod is running (not stopped)
- Verify port number in SSH command
- Try web terminal in RunPod dashboard
**Problem: GPU not detected**
- Run `nvidia-smi`
- Check in the RunPod dashboard that the pod was deployed with the correct GPU type
- Restart pod if needed
**Problem: Network volume not mounted**
- Check RunPod dashboard → Volume tab
- Verify volume is attached to pod
- Try: `df -h` to see mounts
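
The checks above can be bundled into one pass so that every result is visible at once. The following is a sketch, not a definitive diagnostic: `run_check` and `diagnose` are illustrative names, and the three checks simply reuse commands already shown in this guide.

```bash
# Run a command quietly and print PASS or FAIL with a label.
run_check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Run each check in turn; continues past failures so all results are shown.
diagnose() {
    run_check "GPU visible (nvidia-smi)" nvidia-smi
    run_check "PyTorch sees CUDA" python3 -c \
        "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)"
    run_check "/workspace exists" test -d /workspace
}
```

Call `diagnose` on the GPU server; any `FAIL` line points to the matching problem above.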
### Next Steps
Once SSH access works and GPU is verified:
✅ Proceed to **Day 3-4: Network Configuration (Tailscale VPN)**
### Save Important Info
Create a file to track your setup:
```bash
# On GPU server
cat > /workspace/SERVER_INFO.md << 'EOF'
# GPU Server Information
## Connection
- SSH: ssh root@abc123def456.runpod.io -p 12345
- Pod ID: abc123def456
- Region: [YOUR_REGION]
## Hardware
- GPU: RTX 4090 24GB
- CPU: [Check with: lscpu]
- RAM: [Check with: free -h]
- Storage: 500GB network volume at /workspace
## Costs
- GPU: $0.50/hour
- Storage: $50/month
- Total: ~$410/month (24/7)
## Deployed: [DATE]
EOF
```
---
## Checkpoint ✓
Before moving to Day 3, verify:
- [ ] RunPod account created and billing added
- [ ] RTX 4090 pod deployed successfully
- [ ] 500GB network volume attached
- [ ] SSH access working
- [ ] `nvidia-smi` shows GPU
- [ ] `torch.cuda.is_available()` returns True
- [ ] Timezone set to Europe/Berlin
- [ ] Essential tools installed
**Ready for Tailscale setup? Let's go!**