# GPU Server Setup Guide - Week 1

## Day 1-2: RunPod Account & GPU Server

### Step 1: Create RunPod Account

1. **Go to RunPod**: https://www.runpod.io/
2. **Sign up** with email or GitHub
3. **Add billing method**:
   - Credit card required
   - No charges until you deploy a pod
   - Recommended: Add $50 initial credit
4. **Verify email** and complete account setup

### Step 2: Deploy Your First GPU Pod

#### 2.1 Navigate to Pods

1. Click **"Deploy"** in top menu
2. Select **"GPU Pods"**

#### 2.2 Choose GPU Type

**Recommended: RTX 4090**
- 24GB VRAM
- ~$0.50/hour
- Handles LLMs up to ~14B params (with quantization; fp16 weights at 14B exceed 24GB)
- Great for SDXL/FLUX

**Filter options:**
- GPU Type: RTX 4090
- GPU Count: 1
- Sort by: Price (lowest first)
- Region: Europe (lower latency to Germany)

#### 2.3 Select Template

Choose: **"RunPod PyTorch"** template
- Includes: CUDA, PyTorch, Python
- Pre-configured for GPU workloads
- Docker pre-installed

**Alternative**: "Ubuntu 22.04 with CUDA 12.1" (more control)

#### 2.4 Configure Pod

**Container Settings:**
- **Container Disk**: 50GB (temporary, auto-included)
- **Expose Ports**:
  - Add: 22 (SSH)
  - Add: 8000 (vLLM)
  - Add: 8188 (ComfyUI)
  - Add: 8888 (JupyterLab)
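
Once the services from later weeks are running, you can confirm from inside the pod that something is actually listening on each exposed port; a quick check with `ss` (part of iproute2, preinstalled on most Ubuntu images). A port that shows nothing simply means that service isn't running yet:

```bash
# List listening TCP sockets and filter for the ports exposed above
ss -tlnp | grep -E ':(22|8000|8188|8888)\b'
```
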
**Volume Settings:**
- Click **"+ Network Volume"**
- **Name**: `gpu-models-storage`
- **Size**: 500GB
- **Region**: Same as pod
- **Cost**: ~$50/month

**Environment Variables:**
- Add later (not needed for initial setup)

#### 2.5 Deploy Pod

1. Review configuration
2. Click **"Deploy On-Demand"** (not Spot, for reliability)
3. Wait 2-3 minutes for deployment

**Expected cost:**
- GPU: $0.50/hour = $360/month (24/7)
- Storage: $50/month
- **Total: ~$410/month**
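
The monthly total above assumes 24/7 on-demand usage. A quick sketch you can rerun with your own numbers (the rates here are this guide's approximate figures, not live RunPod pricing):

```bash
python3 - <<'EOF'
# Approximate figures from this guide, not live RunPod pricing
gpu_rate = 0.50            # $/hour for an RTX 4090 on-demand pod
storage_monthly = 50.0     # $/month for the 500GB network volume
hours = 24 * 30            # 24/7 over a 30-day month

gpu_monthly = gpu_rate * hours
print(f"GPU:     ${gpu_monthly:.0f}/month")
print(f"Storage: ${storage_monthly:.0f}/month")
print(f"Total:   ${gpu_monthly + storage_monthly:.0f}/month")
EOF
```
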
### Step 3: Access Your GPU Server

#### 3.1 Get Connection Info

Once deployed, you'll see:
- **Pod ID**: e.g., `abc123def456`
- **SSH Command**: `ssh root@<pod-id>.runpod.io -p 12345`
- **Public IP**: May not be directly accessible (use SSH)

#### 3.2 SSH Access

RunPod automatically generates SSH keys for you:

```bash
# Copy the SSH command from RunPod dashboard
ssh root@abc123def456.runpod.io -p 12345

# First time: Accept the host key fingerprint
# You should now be on the GPU server!
```

**Verify GPU:**
```bash
nvidia-smi
```

Expected output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx       Driver Version: 535.xx       CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   45C    P0    50W / 450W |      0MiB / 24564MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
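
If you just want the essentials rather than the full table, `nvidia-smi` has a CSV query mode:

```bash
# Compact summary: GPU name, total VRAM, driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```
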
### Step 4: Initial Server Configuration

#### 4.1 Update System

```bash
# Update package lists
apt update

# Upgrade existing packages
apt upgrade -y

# Install essential tools
apt install -y \
    vim \
    htop \
    tmux \
    curl \
    wget \
    git \
    net-tools \
    iptables-persistent
```
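
`tmux` from the list above is worth adopting immediately: long downloads and installs keep running even if your SSH connection drops. A typical pattern:

```bash
# Start a named session for setup work
tmux new -s setup

# ...run long commands inside the session...
# Detach with Ctrl-b then d; the session keeps running.

# Reattach later (e.g., after an SSH disconnect)
tmux attach -t setup
```
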
#### 4.2 Set Timezone

```bash
timedatectl set-timezone Europe/Berlin
# If timedatectl fails (pods are containers and may lack systemd), set it directly:
# ln -sf /usr/share/zoneinfo/Europe/Berlin /etc/localtime
date  # Verify
```

#### 4.3 Create Working Directory

```bash
# Create workspace subdirectories
mkdir -p /workspace/{models,configs,data,scripts}

# Check network volume mount
ls -la /workspace
# Should show your 500GB volume
```
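
Everything outside /workspace lives on the 50GB container disk, which the guide's pod config marks as temporary, so it pays to point model caches at the network volume now. A minimal sketch for Hugging Face downloads (`HF_HOME` is the standard cache variable; the path is just this guide's layout):

```bash
# Keep Hugging Face model downloads on the persistent network volume
mkdir -p /workspace/models/hf
echo 'export HF_HOME=/workspace/models/hf' >> ~/.bashrc
export HF_HOME=/workspace/models/hf   # apply to the current shell too
```
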
#### 4.4 Configure SSH (Optional but Recommended)

**Generate your own SSH key on your local machine:**

```bash
# On your local machine (not the GPU server)
ssh-keygen -t ed25519 -C "gpu-server-pivoine" -f ~/.ssh/gpu_pivoine

# Copy the public key to the GPU server (note: -p goes before the host)
ssh-copy-id -i ~/.ssh/gpu_pivoine.pub -p 12345 root@abc123def456.runpod.io
```

**Add to your local ~/.ssh/config:**

```
Host gpu-pivoine
    HostName abc123def456.runpod.io
    Port 12345
    User root
    IdentityFile ~/.ssh/gpu_pivoine
```

Now you can connect with: `ssh gpu-pivoine`
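
A quick end-to-end check that the alias, key, and port all line up:

```bash
# Should print the GPU table without prompting for a password
ssh gpu-pivoine nvidia-smi
```
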
### Step 5: Verify GPU Access

Run this test:

```bash
# Test CUDA
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU count:', torch.cuda.device_count())"
```

Expected output:
```
CUDA available: True
GPU count: 1
```
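
For a slightly stronger check than the availability flag, run a small computation on the device; a minimal sketch:

```bash
python3 - <<'EOF'
import torch

# Confirm which GPU PyTorch sees, then exercise it with a small matmul
print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()  # wait for the kernel to finish before reporting
print("Matmul OK:", tuple(y.shape))
EOF
```
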
### Troubleshooting

**Problem: Can't connect via SSH**
- Check pod is running (not stopped)
- Verify port number in SSH command
- Try web terminal in RunPod dashboard

**Problem: GPU not detected**
- Run `nvidia-smi`
- Check RunPod selected correct GPU type
- Restart pod if needed

**Problem: Network volume not mounted**
- Check RunPod dashboard → Volume tab
- Verify volume is attached to pod
- Try: `df -h` to see mounts
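
For the volume case, `df -h` output can be noisy; `findmnt` answers the specific question directly:

```bash
# Succeeds and prints the mount if /workspace is a real mount point;
# fails if the volume is missing and /workspace is just a directory
findmnt /workspace || echo "/workspace is not a separate mount - check the volume"
```
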
### Next Steps

Once SSH access works and GPU is verified:

✅ Proceed to **Day 3-4: Network Configuration (Tailscale VPN)**

### Save Important Info

Create a file to track your setup:

```bash
# On GPU server
cat > /workspace/SERVER_INFO.md << 'EOF'
# GPU Server Information

## Connection
- SSH: ssh root@abc123def456.runpod.io -p 12345
- Pod ID: abc123def456
- Region: [YOUR_REGION]

## Hardware
- GPU: RTX 4090 24GB
- CPU: [Check with: lscpu]
- RAM: [Check with: free -h]
- Storage: 500GB network volume at /workspace

## Costs
- GPU: $0.50/hour
- Storage: $50/month
- Total: ~$410/month (24/7)

## Deployed: [DATE]
EOF
```
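
The CPU and RAM placeholders above can also be filled in from live values; a small sketch that appends them to the file (formatting is illustrative):

```bash
# Append actual hardware details to the info file
{
  echo ""
  echo "## Captured $(date -I)"
  lscpu | grep 'Model name'
  free -h | head -2
  df -h /workspace | tail -1
} >> /workspace/SERVER_INFO.md
```
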
---

## Checkpoint ✓

Before moving to Day 3, verify:
- [ ] RunPod account created and billing added
- [ ] RTX 4090 pod deployed successfully
- [ ] 500GB network volume attached
- [ ] SSH access working
- [ ] `nvidia-smi` shows GPU
- [ ] `torch.cuda.is_available()` returns True
- [ ] Timezone set to Europe/Berlin
- [ ] Essential tools installed

**Ready for Tailscale setup? Let's go!**