docker-compose/ai/SETUP_GUIDE.md
Sebastian Krüger 8de88d96ac docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.)
- Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml)
- Add GPU_DEPLOYMENT_LOG.md with current infrastructure details
- Add GPU_EXPANSION_PLAN.md with complete provider comparison
- Add deploy-gpu-stack.sh automation script

2025-11-21 12:57:06 +01:00


GPU Server Setup Guide - Week 1

Day 1-2: RunPod Account & GPU Server

Step 1: Create RunPod Account

  1. Go to RunPod: https://www.runpod.io/

  2. Sign up with email or GitHub

  3. Add billing method:

    • Credit card required
    • No charges until you deploy a pod
    • Recommended: Add $50 initial credit
  4. Verify email and complete account setup

Step 2: Deploy Your First GPU Pod

2.1 Navigate to Pods

  1. Click "Deploy" in top menu
  2. Select "GPU Pods"

2.2 Choose GPU Type

Recommended: RTX 4090

  • 24GB VRAM
  • ~$0.50/hour
  • Runs LLMs up to ~14B parameters (quantized)
  • Great for SDXL/FLUX
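The "up to ~14B parameters" figure follows from simple weight-size arithmetic, sketched here (weights only; KV cache and activations need extra headroom on top):

```shell
# Rough VRAM math: weights ~= params x bytes-per-param.
awk 'BEGIN {
  p = 14e9                                        # 14B parameters
  printf "fp16 weights: %.0f GB\n", p * 2 / 1e9   # 2 bytes/param -> ~28 GB
  printf "int8 weights: %.0f GB\n", p * 1 / 1e9   # 1 byte/param  -> ~14 GB
}'
```

So a 14B model fits in 24GB VRAM only when quantized; in fp16, practical sizes top out around 7B-10B.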

Filter options:

  • GPU Type: RTX 4090
  • GPU Count: 1
  • Sort by: Price (lowest first)
  • Region: Europe (lower latency to Germany)

2.3 Select Template

Choose: "RunPod PyTorch" template

  • Includes: CUDA, PyTorch, Python
  • Pre-configured for GPU workloads
  • Docker pre-installed

Alternative: "Ubuntu 22.04 with CUDA 12.1" (more control)

2.4 Configure Pod

Container Settings:

  • Container Disk: 50GB (temporary, auto-included)
  • Expose Ports:
    • Add: 22 (SSH)
    • Add: 8000 (vLLM)
    • Add: 8188 (ComfyUI)
    • Add: 8888 (JupyterLab)
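Once the pod is running, you can probe the exposed ports from your local machine with a small helper; this is a sketch using bash's built-in /dev/tcp (the hostname is the placeholder pod address used later in this guide):

```shell
# Report whether a TCP port on the pod accepts connections.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "port ${port}: OPEN"
  else
    echo "port ${port}: CLOSED"
  fi
}

# Probe the four ports exposed above (SSH, vLLM, ComfyUI, JupyterLab).
for port in 22 8000 8188 8888; do
  check_port abc123def456.runpod.io "$port"
done
```

Ports only show OPEN once a service is actually listening on them, so CLOSED for 8000/8188/8888 is expected until those services are deployed.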

Volume Settings:

  • Click "+ Network Volume"
  • Name: gpu-models-storage
  • Size: 500GB
  • Region: Same as pod
  • Cost: ~$50/month

Environment Variables:

  • Add later (not needed for initial setup)

2.5 Deploy Pod

  1. Review configuration
  2. Click "Deploy On-Demand" (choose On-Demand rather than Spot; Spot pods are cheaper but can be preempted at any time)
  3. Wait 2-3 minutes for deployment

Expected cost:

  • GPU: $0.50/hour = $360/month (24/7)
  • Storage: $50/month
  • Total: $410/month
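The monthly figures above can be reproduced with a quick back-of-envelope calculation (assuming ~720 billable hours per month):

```shell
# Monthly cost from the quoted rates: hourly GPU rate x hours, plus storage.
gpu=$(awk 'BEGIN { print 0.50 * 24 * 30 }')         # $0.50/h x 720 h
total=$(awk 'BEGIN { print 0.50 * 24 * 30 + 50 }')  # plus $50 storage
echo "GPU:   \$${gpu}/month"    # GPU:   $360/month
echo "Total: \$${total}/month"  # Total: $410/month
```

Stopping the pod when idle cuts the GPU line item proportionally; only the network volume keeps billing while stopped.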

Step 3: Access Your GPU Server

3.1 Get Connection Info

Once deployed, you'll see:

  • Pod ID: e.g., abc123def456
  • SSH Command: ssh root@<pod-id>.runpod.io -p 12345
  • Public IP: May not be directly accessible (use SSH)

3.2 SSH Access

RunPod automatically generates SSH keys for you:

```shell
# Copy the SSH command from the RunPod dashboard
ssh root@abc123def456.runpod.io -p 12345

# First time: accept the host key fingerprint
# You should now be on the GPU server
```

Verify GPU:

```shell
nvidia-smi
```

Expected output:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx       Driver Version: 535.xx       CUDA Version: 12.1    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   45C    P0    50W / 450W |      0MiB / 24564MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
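For scripting, nvidia-smi also has a machine-readable query mode; the snippet below shows the command and parses a sample output line with awk (the sample line is illustrative):

```shell
# Machine-readable GPU check:
#   nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv,noheader
# Parse a sample line of that output:
line="NVIDIA GeForce RTX 4090, 24564 MiB, 0 %"
echo "$line" | awk -F', ' '{ printf "GPU: %s | VRAM: %s\n", $1, $2 }'
```

This form is handy for cron jobs or health checks where parsing the full table would be brittle.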

Step 4: Initial Server Configuration

4.1 Update System

```shell
# Update package lists
apt update

# Upgrade existing packages
apt upgrade -y

# Install essential tools
apt install -y \
  vim \
  htop \
  tmux \
  curl \
  wget \
  git \
  net-tools \
  iptables-persistent
```

4.2 Set Timezone

```shell
timedatectl set-timezone Europe/Berlin
date  # Verify
```

4.3 Create Working Directory

```shell
# Create workspace
mkdir -p /workspace/{models,configs,data,scripts}

# Check network volume mount
ls -la /workspace
# Should show your 500GB volume
```

4.4 Set Up a Dedicated SSH Key

Generate your own SSH key on your local machine:

```shell
# On your local machine (not the GPU server)
ssh-keygen -t ed25519 -C "gpu-server-pivoine" -f ~/.ssh/gpu_pivoine

# Copy the public key to the GPU server (options go before the host)
ssh-copy-id -i ~/.ssh/gpu_pivoine.pub -p 12345 root@abc123def456.runpod.io
```

Add to your local ~/.ssh/config:

```
Host gpu-pivoine
    HostName abc123def456.runpod.io
    Port 12345
    User root
    IdentityFile ~/.ssh/gpu_pivoine
```

Now you can connect with: ssh gpu-pivoine
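You can sanity-check the alias without opening a connection: `ssh -G` prints the configuration OpenSSH would resolve for a host (available in OpenSSH 6.8+). The temporary config file here just mirrors the entry above:

```shell
# Show what OpenSSH resolves for the alias, without connecting.
cfg=$(mktemp)
cat > "$cfg" << 'EOF'
Host gpu-pivoine
    HostName abc123def456.runpod.io
    Port 12345
    User root
EOF
ssh -G -F "$cfg" gpu-pivoine | grep -E '^(hostname|port|user) '
rm -f "$cfg"
```

In day-to-day use you would run `ssh -G gpu-pivoine` against your real ~/.ssh/config instead.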

Step 5: Verify GPU Access

Run this test:

```shell
# Test CUDA
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU count:', torch.cuda.device_count())"
```

Expected output:

```
CUDA available: True
GPU count: 1
```

Troubleshooting

Problem: Can't connect via SSH

  • Check pod is running (not stopped)
  • Verify port number in SSH command
  • Try web terminal in RunPod dashboard

Problem: GPU not detected

  • Run nvidia-smi
  • Check RunPod selected correct GPU type
  • Restart pod if needed

Problem: Network volume not mounted

  • Check RunPod dashboard → Volume tab
  • Verify volume is attached to pod
  • Try: df -h to see mounts
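The volume check can also be scripted: a mounted network volume shows up as its own filesystem target, while an unmounted path falls under the container's root filesystem (this sketch assumes GNU df):

```shell
# Scripted mount check for the network volume.
ws=/workspace   # adjust if your volume mounts elsewhere
if [ "$(df --output=target "$ws" 2>/dev/null | tail -n 1)" = "$ws" ]; then
  echo "network volume mounted at $ws"
else
  echo "WARNING: $ws is not a separate mount (or does not exist)"
fi
```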

Next Steps

Once SSH access works and the GPU is verified, proceed to Day 3-4: Network Configuration (Tailscale VPN).

Save Important Info

Create a file to track your setup:

```shell
# On the GPU server
cat > /workspace/SERVER_INFO.md << 'EOF'
# GPU Server Information

## Connection
- SSH: ssh root@abc123def456.runpod.io -p 12345
- Pod ID: abc123def456
- Region: [YOUR_REGION]

## Hardware
- GPU: RTX 4090 24GB
- CPU: [Check with: lscpu]
- RAM: [Check with: free -h]
- Storage: 500GB network volume at /workspace

## Costs
- GPU: $0.50/hour
- Storage: $50/month
- Total: ~$410/month (24/7)

## Deployed: [DATE]
EOF
```

Checkpoint ✓

Before moving to Day 3, verify:

  • RunPod account created and billing added
  • RTX 4090 pod deployed successfully
  • 500GB network volume attached
  • SSH access working
  • nvidia-smi shows GPU
  • torch.cuda.is_available() returns True
  • Timezone set to Europe/Berlin
  • Essential tools installed
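The checklist above can be run as a one-shot script on the GPU server; this is a sketch (the timezone check assumes systemd's timedatectl is available):

```shell
# Print PASS/FAIL for each checkpoint item.
check() {  # check <label> <command...>
  local label=$1; shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
  fi
}

check "nvidia-smi shows GPU"   nvidia-smi
check "PyTorch sees CUDA"      python3 -c 'import torch, sys; sys.exit(0 if torch.cuda.is_available() else 1)'
check "network volume mounted" mountpoint -q /workspace
check "timezone Europe/Berlin" sh -c 'timedatectl show -p Timezone --value | grep -q Europe/Berlin'
```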

Ready for Tailscale setup? Let's go!