# GPU-Enhanced AI Stack - Implementation Guide
Welcome to your GPU expansion setup! This directory contains everything you need to deploy a production-ready GPU server for LLM hosting, image generation, and model training.
## 📚 Documentation Files
### Planning & Architecture
- **`GPU_EXPANSION_PLAN.md`** - Complete 70-page plan with provider comparison, architecture, and roadmap
- **`README_GPU_SETUP.md`** - This file
### Step-by-Step Setup Guides
1. **`SETUP_GUIDE.md`** - Day 1-2: RunPod account & GPU server deployment
2. **`TAILSCALE_SETUP.md`** - Day 3-4: Tailscale VPN connection between VPS and GPU server
3. **`DOCKER_GPU_SETUP.md`** - Day 5: Docker + NVIDIA Container Toolkit configuration
### Configuration Files
- **`gpu-server-compose.yaml`** - Production Docker Compose for GPU server
- **`litellm-config-gpu.yaml`** - Updated LiteLLM config with self-hosted models
- **`deploy-gpu-stack.sh`** - Automated deployment script
---
## 🚀 Quick Start (Week 1 Checklist)
### Day 1-2: RunPod & GPU Server ✓
- [ ] Create RunPod account at https://www.runpod.io/
- [ ] Add billing method ($50 initial credit recommended)
- [ ] Deploy RTX 4090 pod with PyTorch template
- [ ] Configure 500GB network volume
- [ ] Verify SSH access
- [ ] Test GPU with `nvidia-smi`
- [ ] **Guide:** `SETUP_GUIDE.md`
### Day 3-4: Network Configuration ✓
- [ ] Install Tailscale on VPS
- [ ] Install Tailscale on GPU server
- [ ] Authenticate both devices
- [ ] Test VPN connectivity
- [ ] Configure firewall rules
- [ ] Verify VPS can reach GPU server
- [ ] **Guide:** `TAILSCALE_SETUP.md`
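The checklist above boils down to a few commands on each machine. A minimal sketch, assuming Ubuntu on both the VPS and the GPU server (the official install script and the login flow are covered in detail in the guide):
```bash
# Run on BOTH the VPS and the GPU server
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up        # prints a login URL; authenticate in the browser
tailscale ip -4     # note this machine's 100.x.x.x address

# From the VPS, verify it can reach the GPU server
ping -c 3 [gpu-tailscale-ip]
tailscale status    # both devices should show as connected
```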
### Day 5: Docker & GPU Setup ✓
- [ ] Install Docker on GPU server
- [ ] Install NVIDIA Container Toolkit
- [ ] Test GPU access in containers
- [ ] Create /workspace/gpu-stack directory
- [ ] Copy configuration files
- [ ] **Guide:** `DOCKER_GPU_SETUP.md`
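A condensed sketch of the Day 5 steps, assuming Ubuntu with the NVIDIA apt repository already configured (the repository setup itself is covered in `DOCKER_GPU_SETUP.md`):
```bash
# Install Docker (convenience script) and the NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Verify containers can see the GPU, then prepare the stack directory
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
mkdir -p /workspace/gpu-stack
```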
### Day 6-7: Deploy Services ✓
- [ ] Copy `gpu-server-compose.yaml` to GPU server
- [ ] Edit `.env` with your settings
- [ ] Run `./deploy-gpu-stack.sh`
- [ ] Wait for vLLM to load model (~5 minutes)
- [ ] Test vLLM: `curl http://localhost:8000/v1/models`
- [ ] Access ComfyUI: `http://[tailscale-ip]:8188`
- [ ] **Script:** `deploy-gpu-stack.sh`
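Concretely, the Day 6-7 steps look roughly like this (the SSH user and destination paths are assumptions; adjust them to your pod). The polling loop simply avoids guessing when vLLM has finished loading the model:
```bash
# From the VPS, copy the stack files over the VPN
scp gpu-server-compose.yaml deploy-gpu-stack.sh root@[gpu-tailscale-ip]:/workspace/gpu-stack/

# On the GPU server
cd /workspace/gpu-stack
nano .env                      # fill in IPs, HF_TOKEN, JUPYTER_TOKEN, ...
./deploy-gpu-stack.sh

# Wait until vLLM has downloaded and loaded the model (~5 minutes)
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 15; done
echo "vLLM is serving models"
```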
---
## 📦 Services Included
### vLLM (http://[tailscale-ip]:8000)
**Purpose:** High-performance LLM inference
**Default Model:** Llama 3.1 8B Instruct
**Performance:** 50-80 tokens/second on RTX 4090
**Use for:** General chat, Q&A, code generation, summarization
**Switch models:**
Edit `gpu-server-compose.yaml`, change `--model` parameter, restart:
```bash
docker compose restart vllm
```
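Beyond listing models, you can exercise vLLM's OpenAI-compatible chat endpoint directly from the VPS. A minimal example, assuming the default Llama 3.1 8B model is still configured (replace the model name to match whatever `--model` is set in the compose file):
```bash
curl http://[tailscale-ip]:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
        "max_tokens": 64
      }'
```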
### ComfyUI (http://[tailscale-ip]:8188)
**Purpose:** Advanced Stable Diffusion interface
**Features:** FLUX, SDXL, ControlNet, LoRA
**Use for:** Image generation, img2img, inpainting
**Download models:**
Access web UI → ComfyUI Manager → Install Models
### JupyterLab (http://[tailscale-ip]:8888)
**Purpose:** Interactive development environment
**Token:** `pivoine-ai-2025` (change in `.env`)
**Use for:** Research, experimentation, custom training scripts
### Axolotl (Training - on-demand)
**Purpose:** LLM fine-tuning framework
**Start:** `docker compose --profile training up -d axolotl`
**Use for:** LoRA training, full fine-tuning, RLHF
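Once the training container is up, a typical LoRA run is launched through Axolotl's CLI. The config path below is a placeholder for your own training config (see the Axolotl docs for the config format):
```bash
docker compose --profile training up -d axolotl
docker compose exec axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/lora-llama-3.1-8b.yaml
```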
### Netdata (http://[tailscale-ip]:19999)
**Purpose:** System & GPU monitoring
**Features:** Real-time metrics, GPU utilization, memory usage
**Use for:** Performance monitoring, troubleshooting
---
## 🔧 Configuration
### Environment Variables (.env)
```bash
# VPN Network (Tailscale)
VPS_IP=100.x.x.x # Your VPS Tailscale IP (get with: tailscale ip -4)
GPU_IP=100.x.x.x # GPU server Tailscale IP (get with: tailscale ip -4)
# Model Storage
MODELS_PATH=/workspace/models
# Hugging Face Token (for gated models like Llama)
HF_TOKEN=hf_xxxxxxxxxxxxx
# Weights & Biases (for training logging)
WANDB_API_KEY=
# JupyterLab Access
JUPYTER_TOKEN=pivoine-ai-2025
# PostgreSQL (on VPS)
DB_HOST=100.x.x.x # Your VPS Tailscale IP
DB_PORT=5432
DB_USER=valknar
DB_PASSWORD=ragnarok98
DB_NAME=openwebui
```
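Both Tailscale IPs come from running `tailscale ip -4` on the respective machine. Since `DB_HOST` points at the VPS, it is also worth confirming the GPU server can reach PostgreSQL over the VPN before deploying; a simple port check (IP is a placeholder):
```bash
# On the VPS and the GPU server, note each machine's VPN address
tailscale ip -4

# From the GPU server, confirm the VPS PostgreSQL port is reachable over the VPN
nc -zv 100.x.x.x 5432    # replace with your VPS Tailscale IP (DB_HOST)
```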
### Updating LiteLLM on VPS
After GPU server is running, update your VPS LiteLLM config:
```bash
# On VPS
cd ~/Projects/docker-compose/ai
# Backup current config
cp litellm-config.yaml litellm-config.yaml.backup
# Copy new config with GPU models
cp litellm-config-gpu.yaml litellm-config.yaml
# Restart LiteLLM
arty restart litellm
```
Now Open WebUI will have access to both Claude (API) and Llama (self-hosted)!
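To confirm the self-hosted models actually show up behind LiteLLM, list them through the proxy. This assumes LiteLLM runs on its default port 4000 and is protected by a master key; adjust both to your setup:
```bash
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
# The list should now include both the claude-* entries and the GPU-hosted Llama model
```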
---
## 💰 Cost Management
### Current Costs (24/7 Operation)
- **GPU Server:** RTX 4090 @ $0.50/hour = $360/month
- **Storage:** 500GB network volume = $50/month
- **Total:** **$410/month**
### Cost-Saving Options
**1. Pay-as-you-go (8 hours/day)**
- GPU: $0.50 × 8 × 30 = $120/month
- Storage: $50/month
- **Total: $170/month**
**2. Auto-stop idle pods**
RunPod can auto-stop after X minutes idle:
- Dashboard → Pod Settings → Auto-stop after 30 minutes
**3. Use smaller or quantized models**
- Mistral 7B instead of Llama 3.1 8B: faster, and runs on cheaper GPUs
- Quantized models: 4-bit needs roughly 1/4 the VRAM of FP16
**4. Batch image generation**
- Generate multiple images at once
- Use scheduled jobs (cron) during off-peak hours
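For option 4, a cron entry is enough to push batch work into off-peak hours; `generate_batch.sh` below is a placeholder for whatever script drives your ComfyUI workflows:
```bash
# Edit the crontab on the GPU server
crontab -e

# Example entry: run the batch job at 03:00 every night (script path is hypothetical)
# 0 3 * * * /workspace/gpu-stack/scripts/generate_batch.sh >> /workspace/logs/batch.log 2>&1
```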
### Cost Tracking
**Check GPU usage:**
On the RunPod dashboard, open **Billing → Usage History** to see hourly costs and total spend.
**Check API vs GPU savings:**
```bash
# On VPS, check LiteLLM logs
docker logs ai_litellm | grep "model="
# Count requests to llama-3.1-8b vs claude-*
```
**Expected savings:**
- 80% of requests → self-hosted = no per-token cost
- 20% of requests → Claude = API cost
- Break-even sits around $500/month of current API spend: at that level you would pay 20% × $500 = $100 in remaining API fees plus $410 for the 24/7 GPU, roughly what you pay today; above it, the GPU saves money
---
## 🔍 Monitoring & Troubleshooting
### Check Service Status
```bash
# On GPU server
cd /workspace/gpu-stack
# View all services
docker compose ps
# Check specific service logs
docker compose logs -f vllm
docker compose logs -f comfyui
docker compose logs -f jupyter
# Check GPU usage
nvidia-smi
# or prettier:
nvtop
```
### Common Issues
**vLLM not loading model:**
```bash
# Check logs
docker compose logs vllm
# Common causes:
# - Model download in progress (wait 5-10 minutes)
# - Out of VRAM (try smaller model)
# - Missing HF_TOKEN (for gated models like Llama)
```
**ComfyUI slow/crashing:**
```bash
# Check GPU memory
nvidia-smi
# If VRAM full:
# - Close vLLM temporarily
# - Use smaller models
# - Reduce batch size in ComfyUI
```
**Can't access from VPS:**
```bash
# Test VPN
ping [tailscale-ip]
# If fails:
# - Check Tailscale status: tailscale status
# - Restart Tailscale: tailscale down && tailscale up
# - Check firewall: ufw status
```
**Docker can't see GPU:**
```bash
# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# If fails:
# - Check NVIDIA driver: nvidia-smi
# - Check nvidia-docker: nvidia-ctk --version
# - Restart Docker: systemctl restart docker
```
---
## 📊 Performance Benchmarks
### Expected Performance (RTX 4090)
**LLM Inference (vLLM):**
- Llama 3.1 8B: 50-80 tokens/second
- Qwen 2.5 14B: 30-50 tokens/second
- Batch size 32: ~1500 tokens/second
**Image Generation (ComfyUI):**
- SDXL (1024×1024): ~4-6 seconds
- FLUX (1024×1024): ~8-12 seconds
- SD 1.5 (512×512): ~1-2 seconds
**Training (Axolotl):**
- LoRA fine-tuning (8B model): ~3-5 hours for 3 epochs
- Full fine-tuning: Not recommended on 24GB VRAM
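To verify these numbers on your own pod, a rough single-request check is enough (not a rigorous benchmark; model name and IP are placeholders, the model may stop before 256 tokens, and real throughput is higher with batching):
```bash
# Time a fixed-length completion and estimate tokens/second
START=$(date +%s.%N)
curl -s http://[tailscale-ip]:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "prompt": "Write a short story about a lighthouse.",
       "max_tokens": 256}' > /dev/null
END=$(date +%s.%N)
echo "~256 tokens in $(echo "$END - $START" | bc) seconds"
```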
---
## 🔐 Security Best Practices
### Network Security
✅ All services behind Tailscale VPN (end-to-end encrypted)
✅ No public exposure (except RunPod's SSH)
✅ Firewall configured (no additional ports needed)
### Access Control
✅ JupyterLab token-protected (`JUPYTER_TOKEN` in `.env`)
✅ ComfyUI accessible via VPN only
✅ vLLM internal API (no auth needed)
### SSH Security
```bash
# On the GPU server, harden SSH
nano /etc/ssh/sshd_config
# Set these directives in sshd_config:
#   PermitRootLogin prohibit-password
#   PasswordAuthentication no
#   PubkeyAuthentication yes
# Then restart SSH to apply
systemctl restart sshd
```
### Regular Updates
```bash
# Weekly updates
apt update && apt upgrade -y
# Update Docker images
docker compose pull
docker compose up -d
```
---
## 📈 Scaling Up
### When to Add More GPUs
**Current limitations (1× RTX 4090):**
- Can run ONE of these at a time:
  - 8B LLM at full speed
  - 14B LLM at moderate speed
  - SDXL image generation
  - Training job
**Add 2nd GPU if:**
- You want LLM + image gen simultaneously
- Training + inference at same time
- Multiple users with high demand
**Multi-GPU options:**
- 2× RTX 4090: Run vLLM + ComfyUI separately ($720/month)
- 1× A100 40GB: Larger models (70B with quantization) ($1,080/month)
- Mix: RTX 4090 (inference) + A100 (training) (~$1,300/month)
### Deploying Larger Models
**70B models (need 2× A100 or 4× RTX 4090):**
```yaml
# In gpu-server-compose.yaml
vllm:
  command:
    - --model
    - meta-llama/Meta-Llama-3.1-70B-Instruct
    - --tensor-parallel-size
    - "2" # Split across 2 GPUs
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2 # Use 2 GPUs
            capabilities: [gpu]
```
---
## 🎯 Next Steps (Week 2+)
### Week 2: LLM Production Deployment
- [ ] Test Llama 3.1 8B performance
- [ ] Download additional models (Qwen, Mistral)
- [ ] Configure model routing in LiteLLM
- [ ] Set up usage monitoring
- [ ] Benchmark tokens/second for each model
### Week 3: Image Generation
- [ ] Download FLUX and SDXL models
- [ ] Install ComfyUI Manager
- [ ] Download ControlNet models
- [ ] Create sample workflows
- [ ] Test API integration with Open WebUI
### Week 4: Training Infrastructure
- [ ] Prepare a sample dataset
- [ ] Test LoRA fine-tuning with Axolotl
- [ ] Set up Weights & Biases logging
- [ ] Create training documentation
- [ ] Benchmark training speed
---
## 🆘 Getting Help
### Resources
- **RunPod Docs:** https://docs.runpod.io/
- **vLLM Docs:** https://docs.vllm.ai/
- **ComfyUI Wiki:** https://github.com/comfyanonymous/ComfyUI/wiki
- **Axolotl Docs:** https://github.com/OpenAccess-AI-Collective/axolotl
### Community
- **RunPod Discord:** https://discord.gg/runpod
- **vLLM Discord:** https://discord.gg/vllm
- **r/LocalLLaMA:** https://reddit.com/r/LocalLLaMA
### Support
If you encounter issues:
1. Check logs: `docker compose logs -f [service]`
2. Check GPU: `nvidia-smi`
3. Check VPN: `tailscale status`
4. Restart service: `docker compose restart [service]`
5. Full restart: `docker compose down && docker compose up -d`
---
## ✅ Success Criteria
You're ready to proceed when:
- [ ] GPU server responds to `ping [tailscale-ip]` from VPS
- [ ] vLLM returns models: `curl http://[tailscale-ip]:8000/v1/models`
- [ ] ComfyUI web interface loads: `http://[tailscale-ip]:8188`
- [ ] JupyterLab accessible with token
- [ ] Netdata shows GPU metrics
- [ ] Open WebUI shows both Claude and Llama models
**Total setup time:** 4-6 hours (if following guides sequentially)
---
## 🎉 You're All Set!
Your GPU-enhanced AI stack is ready. You now have:
- ✅ Self-hosted LLM inference (saves $$$)
- ✅ Advanced image generation (FLUX, SDXL)
- ✅ Model training capabilities (LoRA, fine-tuning)
- ✅ Secure VPN connection
- ✅ Full monitoring and logging
Enjoy building with your new AI infrastructure! 🚀