# GPU-Enhanced AI Stack Expansion Plan

## Executive Summary

This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:
- **LLM Model Hosting**: Self-hosted models (Llama, Mistral, Qwen, etc.)
- **Model Training**: Fine-tuning and training workflows
- **Image Generation**: Stable Diffusion, FLUX via ComfyUI
- **Video Generation**: AnimateDiff, CogVideo, etc.

**Current Architecture**: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
**Target Architecture**: Hybrid stack with GPU server(s) for self-hosted models + API-based models

---

## Phase 1: Current Stack Analysis

### Existing Components

1. **ai_postgres** (pgvector/pgvector:pg16)
   - PostgreSQL with pgvector for RAG
   - Stores: conversations, embeddings, LiteLLM logs

2. **webui** (Open WebUI)
   - User-facing ChatGPT-like interface
   - URL: https://ai.pivoine.art
   - Features: RAG, web search, document upload
   - Connected to LiteLLM proxy

3. **litellm** (LiteLLM proxy)
   - Currently proxies Anthropic Claude API
   - OpenAI-compatible endpoint at http://litellm:4000
   - Supports multiple providers via config

4. **crawl4ai**
   - Internal web scraping for LLM content prep
   - Port 11235 (internal only)

5. **facefusion** (CPU-only)
   - Face swapping/enhancement
   - Currently CPU-based (slow)
   - Protected by Authelia SSO

### Current Limitations

- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage

---
## Phase 2: GPU Provider Comparison

### Provider Options

#### 1. **RunPod** ⭐ RECOMMENDED
**Pros:**
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence

**Pricing (Approximate):**
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)

**Best for:** On-demand workloads, experimentation, cost-conscious hosting

---

#### 2. **Lambda Labs**
**Pros:**
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network

**Pricing:**
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)

**Best for:** Research, high-utilization workloads

---

#### 3. **Vast.ai**
**Pros:**
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available

**Cons:**
- Variable reliability
- Setup complexity
- Community-hosted machines

**Pricing:**
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour

**Best for:** Budget-conscious, experimental workloads

---

#### 4. **Google Cloud Platform (GCP)**
**Pros:**
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available

**Pricing:**
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available

**Best for:** Enterprise workloads, auto-scaling needs

---

#### 5. **AWS**
**Pros:**
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support

**Pricing:**
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour

**Best for:** Enterprise, existing AWS infrastructure

---

#### 6. **Hugging Face Spaces / Inference Endpoints**
**Pros:**
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models

**Pricing:**
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour

**Best for:** Quick model deployment, serverless inference

---

### Recommendation: **RunPod** for Primary GPU Server

**Rationale:**
1. **Cost-effective**: Pay-per-second billing, ~$0.50/hour for RTX 4090
2. **Docker-native**: Easy integration with existing compose stack
3. **Flexibility**: Start/stop as needed, scale up for training
4. **Community**: Large user base, good documentation
5. **Network**: Built-in HTTP/SSH tunneling

**Supplementary**: Use Hugging Face for specific model hosting if needed

---
## Phase 3: Architecture Design

### Network Topology

```
┌────────────────────────────────────────────────────────────┐
│                pivoine.art VPS (CPU-based)                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐          │
│  │  Open    │─────▶│ LiteLLM  │◀────▶│   ai_    │          │
│  │  WebUI   │      │  Proxy   │      │ postgres │          │
│  └──────────┘      └──────────┘      └──────────┘          │
│       │                  │                                 │
│       │                  │                                 │
└───────┼──────────────────┼─────────────────────────────────┘
        │                  │
        │                  ▼
        │          ┌─────────────────┐
        │          │  Anthropic API  │
        │          │    (Claude)     │
        │          └─────────────────┘
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│                    GPU Server (RunPod)                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │  vLLM    │  │ ComfyUI  │  │  Model   │  │JupyterLab│    │
│  │  (LLMs)  │  │ (SD/FLUX)│  │ Training │  │          │    │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
│       │             │             │             │          │
│       └─────────────┴─────────────┴─────────────┘          │
│                         │                                  │
│                 ┌───────────────┐                          │
│                 │ Model Storage │                          │
│                 │ (Persistent)  │                          │
│                 └───────────────┘                          │
│                                                            │
└────────────────────────────────────────────────────────────┘
                             │
                             ▼ (Tunneled via WireGuard or Tailscale)
┌────────────────────────────────────────────────────────────┐
│                    Integration Options:                    │
├────────────────────────────────────────────────────────────┤
│  1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000)  │
│  2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art)   │
│  3. Model storage synced via rclone/restic                 │
└────────────────────────────────────────────────────────────┘
```

### Connection Methods
#### Option A: WireGuard VPN (RECOMMENDED)
- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack

**Setup:**
1. Deploy WireGuard on GPU server
2. Add GPU server as VPN peer
3. Configure LiteLLM to use VPN IPs (a config sketch follows)
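
A minimal sketch of the GPU-side peer setup, assuming the VPS keeps running wg-easy and hands out addresses in its default 10.8.0.0/24 range (keys, endpoint, and addresses below are placeholders to swap for the values wg-easy generates):

```bash
# /etc/wireguard/wg0.conf on the GPU server (placeholder values)
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
PrivateKey = <gpu-server-private-key>
Address = 10.8.0.10/24

[Peer]
PublicKey = <vps-wireguard-public-key>
Endpoint = <vps-public-ip>:51820
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25
EOF

wg-quick up wg0       # bring the tunnel up
ping -c 1 10.8.0.1    # the VPS should answer over the tunnel
```

LiteLLM on the VPS can then point `api_base` at the tunnel address (for example `http://10.8.0.10:8000/v1`) instead of a public hostname.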
#### Option B: SSH Tunnel
- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency

#### Option C: Tailscale
- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)

---

## Phase 4: Service Implementation Plans
### 4.1 LLM Hosting with vLLM

**vLLM** is the industry standard for high-performance LLM inference.

#### Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)

#### Docker Compose Configuration:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct # or any model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Recommended Models for RTX 4090 (24GB):

**Text Generation:**
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)

**Code:**
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)

#### Integration with LiteLLM:

Add to `ai/litellm-config.yaml`:

```yaml
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
```
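
A quick end-to-end check once the proxy reloads this config, going through LiteLLM rather than hitting vLLM directly (the master key variable is a placeholder for however your proxy is secured):

```bash
curl http://litellm:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```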
---

### 4.2 ComfyUI for Image/Video Generation

**ComfyUI** is a node-based UI for Stable Diffusion with advanced workflows.

#### Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation

#### Docker Compose Configuration:

```yaml
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Model Downloads (via ComfyUI Manager):

**Stable Diffusion Models:**
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)

**ControlNet Models:**
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl

**LoRA Models** (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs

#### Traefik Integration:

Add subdomain routing for ComfyUI:

```yaml
labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
  - 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
  - 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
  - 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
  - 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
```

#### Open WebUI Integration:

ComfyUI has a REST API that can be called from Open WebUI using function calling.

Example workflow API call:
```python
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    workflow = {
        # ComfyUI workflow JSON
    }
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow}
    )
    return response.json()
```
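
The call above only queues a job; outputs are retrieved separately. A sketch of the queue-and-poll pattern against ComfyUI's HTTP API (the workflow JSON itself still has to come from an exported ComfyUI workflow):

```python
import time
import requests

COMFY_URL = "http://comfyui:8188"  # internal hostname, adjust to your network

def queue_and_wait(workflow: dict, timeout: int = 300) -> dict:
    """Queue a workflow, then poll /history until its outputs appear."""
    prompt_id = requests.post(
        f"{COMFY_URL}/prompt", json={"prompt": workflow}
    ).json()["prompt_id"]
    deadline = time.time() + timeout
    while time.time() < deadline:
        history = requests.get(f"{COMFY_URL}/history/{prompt_id}").json()
        if prompt_id in history:  # populated once execution has finished
            return history[prompt_id]["outputs"]
        time.sleep(2)
    raise TimeoutError(f"ComfyUI did not finish prompt {prompt_id} in time")
```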
---

### 4.3 Model Training Infrastructure

For fine-tuning LLMs and training custom models.

#### Option A: Axolotl (Recommended)

**Axolotl** is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training

```yaml
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-} # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Training Workflow:
1. Prepare dataset (JSONL format; see the example below)
2. Create Axolotl config (LoRA, batch size, epochs)
3. Start training container
4. Monitor via Weights & Biases or TensorBoard
5. Export LoRA adapters
6. Merge with base model or use in vLLM
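
For the `type: completion` / `field: text` loader used in the config below, the dataset is just one JSON object per line; a minimal illustrative sketch (content is made up):

```jsonl
{"text": "### Instruction:\nSummarize the ticket.\n\n### Ticket:\nVPN drops every night at 2am.\n\n### Summary:\nNightly VPN disconnects, likely a keepalive issue."}
{"text": "### Instruction:\nTranslate to French.\n\n### Input:\nGood morning\n\n### Output:\nBonjour"}
```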
#### Example Config:
```yaml
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002

optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
```

#### Option B: JupyterLab for Custom Training

For research and custom training scripts:

```yaml
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
---

### 4.4 Model Storage Strategy

#### Storage Requirements:

**Per Model Type:**
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each

**Total Estimated:**
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- **Total: 250-300GB minimum**

#### RunPod Storage Options:

1. **Network Volume** (Recommended)
   - Persistent across pod restarts
   - Shared between multiple pods
   - ~$0.10/GB/month
   - 500GB = $50/month

2. **Container Disk**
   - Included with pod
   - Lost when pod stops
   - Good for temporary storage

3. **External Storage (rclone)**
   - Sync to/from VPS or cloud storage
   - Backup models to Backblaze B2 or Wasabi
   - Good for disaster recovery

#### Model Management:

Use **Hugging Face Hub** as model cache:

```bash
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --local-dir /models/llama-3.1-8b

# Or let vLLM/ComfyUI auto-download
```

**Model Sync Script:**
```bash
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server

rclone sync \
  /mnt/hidrive/AI/models \
  gpu:/workspace/models \
  --progress \
  --transfers 4
```
---

## Phase 5: Implementation Roadmap

### Week 1: Infrastructure Setup

**Day 1-2: RunPod Account & GPU Server**
- [ ] Create RunPod account
- [ ] Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- [ ] Configure persistent network volume (500GB)
- [ ] Set up SSH access

**Day 3-4: Network Configuration**
- [ ] Deploy WireGuard on GPU server
- [ ] Add GPU server as peer to existing VPN (vpn/compose.yaml)
- [ ] Test connectivity between VPS and GPU server
- [ ] Configure firewall rules

**Day 5: Docker Setup on GPU Server**
- [ ] Install Docker + NVIDIA Container Toolkit
- [ ] Create docker-compose.yaml for GPU services
- [ ] Test GPU access in containers

---

### Week 2: LLM Hosting

**Day 1-2: vLLM Deployment**
- [ ] Deploy vLLM container
- [ ] Download Llama 3.1 8B Instruct
- [ ] Test inference locally
- [ ] Benchmark performance (tokens/sec)

**Day 3-4: LiteLLM Integration**
- [ ] Update litellm-config.yaml with vLLM endpoint
- [ ] Test via Open WebUI
- [ ] Configure model routing (cheap models → vLLM, complex → Claude)
- [ ] Set up usage monitoring

**Day 5: Model Expansion**
- [ ] Download Qwen 2.5 14B
- [ ] Download Mistral 7B Instruct
- [ ] Test model switching in Open WebUI
- [ ] Document performance characteristics

---

### Week 3: Image Generation

**Day 1-2: ComfyUI Setup**
- [ ] Deploy ComfyUI container
- [ ] Download FLUX.1-schnell
- [ ] Download SDXL
- [ ] Install ComfyUI Manager

**Day 3-4: Model Downloads**
- [ ] Download ControlNet models
- [ ] Download VAE models
- [ ] Download popular LoRAs from Civitai
- [ ] Organize model directory

**Day 5: Integration & Workflows**
- [ ] Create basic text-to-image workflow
- [ ] Create ControlNet workflow
- [ ] Test API access
- [ ] Add Traefik subdomain (comfy.ai.pivoine.art)

---

### Week 4: Training Infrastructure

**Day 1-2: Axolotl Setup**
- [ ] Deploy Axolotl container
- [ ] Create sample dataset
- [ ] Test LoRA fine-tuning with tiny model
- [ ] Verify GPU utilization

**Day 3-4: JupyterLab Setup**
- [ ] Deploy JupyterLab container
- [ ] Install ML libraries
- [ ] Create example notebooks
- [ ] Test custom training scripts

**Day 5: Documentation & Testing**
- [ ] Write training guides
- [ ] Test end-to-end workflows
- [ ] Benchmark training speeds
- [ ] Document best practices

---

### Ongoing: Optimization & Expansion

**Month 2:**
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user

**Month 3+:**
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ; see the serving example below)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
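
For vLLM, quantization is mostly a matter of serving a pre-quantized checkpoint with the matching flag; a sketch using one of the community AWQ builds (swap in whichever quantized model you actually want):

```bash
# Serve a 4-bit AWQ checkpoint in roughly a quarter of the FP16 VRAM
docker run -d --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 8192
```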
---

## Phase 6: Cost Analysis

### Scenario A: Single RTX 4090 (24/7)

**GPU Server (RunPod):**
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- **Subtotal: $410/month**

**VPS (Existing):**
- No change in cost

**Total: ~$410/month**

**Savings:**
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even if currently spending >$500/month on API calls

---

### Scenario B: Pay-as-you-go (8 hours/day)

**GPU Server (RunPod):**
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- **Subtotal: $170/month**

**Best for:**
- Development/experimentation
- Burst workloads
- Image generation on-demand

---

### Scenario C: Dual GPU (Training + Inference)

**GPU Server 1 (Inference):**
- RTX 4090 24/7: $360/month

**GPU Server 2 (Training - On-demand):**
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions

**Storage:**
- 1TB network volume: $100/month

**Total: ~$520/month**

---

### Cost Optimization Tips

1. **Auto-stop idle pods**: RunPod can auto-stop after X minutes idle
2. **Use spot instances**: ~50% cheaper but can be interrupted
3. **Quantized models**: 4-bit models use 4x less VRAM → cheaper GPUs
4. **Batch processing**: Queue image gen jobs to maximize GPU usage
5. **Model sharing**: One vLLM instance can serve multiple models via adapters
6. **Monitoring**: Track per-model costs to optimize routing (a fallback-routing sketch follows this list)
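
One way to encode "self-hosted first, Claude as backup" in the existing proxy is LiteLLM's router fallbacks. A sketch using the model names from section 4.1 (check the exact keys against the LiteLLM version you run):

```yaml
# ai/litellm-config.yaml (excerpt)
router_settings:
  # If the self-hosted model errors out or times out, retry the request on Claude
  fallbacks:
    - llama-3.1-8b: ["claude-sonnet-4-5"]
  num_retries: 2
  timeout: 60
```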
---

## Phase 7: Monitoring & Operations

### Metrics to Track

**GPU Utilization:**
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage

**Model Performance:**
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch

**Costs:**
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown

### Monitoring Stack

**Option A: Netdata (Already deployed)**

Add GPU monitoring to existing Netdata:

```yaml
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # No command override needed: with the NVIDIA runtime injecting nvidia-smi
    # into the container, Netdata's nvidia_smi collector can pick up the GPU
    # automatically (enable it in go.d.conf if your version ships it disabled).
```

**Option B: Prometheus + Grafana**

For detailed metrics:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
```
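
The compose file above mounts a `./prometheus.yml` that is not shown; a minimal version that scrapes the DCGM exporter could look like this (job name is arbitrary, the target matches the service name and dcgm-exporter's default port):

```yaml
# ./prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ["dcgm-exporter:9400"]  # dcgm-exporter listens on 9400 by default
```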
Import Grafana dashboard #12219 for GPU metrics.

---

## Phase 8: Backup & Disaster Recovery

### What to Backup

1. **Models** (250-300GB)
   - Base models can be re-downloaded
   - Custom fine-tuned models: CRITICAL
   - LoRAs: CRITICAL

2. **Training Data** (~10-50GB)
   - Datasets
   - Preprocessing scripts

3. **Configurations** (<1GB)
   - Docker compose files
   - Training configs
   - Workflow JSONs

### Backup Strategy

**Tier 1: Critical (Daily)**
- Fine-tuned models
- Training checkpoints
- Custom datasets

**Backup to:**
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)

```yaml
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
```

**Tier 2: Nice-to-have (Weekly)**
- Base models (can re-download)
- ComfyUI outputs

**Tier 3: Ephemeral (No backup)**
- Inference cache
- Temporary generations

---

## Phase 9: Security Considerations

### GPU Server Security

1. **Firewall** (see the example after this list):
   - Only allow WireGuard port (51820)
   - All services accessed via VPN
   - No public exposure

2. **SSH:**
   - Key-based auth only
   - Disable password auth
   - Change default port

3. **Docker:**
   - Rootless Docker (optional but recommended)
   - Limited container capabilities
   - No privileged containers except for nvidia-runtime

4. **Secrets:**
   - Store API keys in .env
   - Use Docker secrets for sensitive data
   - Rotate keys periodically
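
A minimal host-firewall sketch matching the firewall rules above, assuming plain ufw on a self-managed box (on RunPod pods you may not control the host firewall, in which case rely on the pod's exposed-port settings instead):

```bash
# Deny inbound by default, allow only SSH (non-default port) and WireGuard
ufw default deny incoming
ufw default allow outgoing
ufw allow 2222/tcp    # SSH on a changed port (placeholder)
ufw allow 51820/udp   # WireGuard
ufw enable
```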
### Access Control

- **ComfyUI**: Protected by Authelia SSO (already configured)
- **vLLM**: Internal only, accessed via LiteLLM proxy
- **JupyterLab**: Password-protected or Authelia
- **Training**: No public access, VPN only

---

## Phase 10: Advanced Features (Future)

### Multi-GPU Scaling

**Tensor Parallelism** (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s

```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - '2' # Use 2 GPUs
```

**Pipeline Parallelism** (training):
- Split model layers across GPUs
- Useful for very large models

### Model Serving Optimization

**vLLM Features:**
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)

**Example multi-LoRA:**
```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - --enable-lora
  - --max-loras
  - '4'
  - --lora-modules
  - customer-support=/models/loras/support-lora
  - creative-writing=/models/loras/writing-lora
```
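
With the flags above, a request selects an adapter simply by using its registered name as the `model`; a sketch (the name matches the `--lora-modules` entry, verify behaviour against your vLLM version):

```bash
curl http://gpu.internal:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-support",
    "messages": [{"role": "user", "content": "My invoice is wrong, what now?"}]
  }'
```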
### Video Generation

**AnimateDiff in ComfyUI:**
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos

**CogVideo:**
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p

### Voice Synthesis

**XTTS v2:**
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM

**Bark:**
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM

---
## Appendix A: Quick Start Commands

### Initial GPU Server Setup

```bash
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker

# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

### Deploy vLLM (Quick Test)

```bash
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm

# Run vLLM
docker run -d \
  --name vllm \
  --runtime=nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
```

### Deploy ComfyUI (Quick Test)

```bash
docker run -d \
  --name comfyui \
  --runtime=nvidia \
  --gpus all \
  -v /workspace/comfyui:/data \
  -p 8188:8188 \
  ghcr.io/ai-dock/comfyui:latest

# Access at http://gpu-ip:8188
```

---
## Appendix B: Sample Docker Compose (Full GPU Stack)

```yaml
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: 0
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
```

---
## Appendix C: Cost Calculator

**Monthly GPU Costs:**

| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|-----------|------|--------|------------|---------|----------|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |

**Storage Costs:**
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month

**Total Estimated Monthly:**
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)

**Break-even Analysis:**
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs (see the sketch below for the arithmetic)
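
The arithmetic behind these figures is simple enough to script when comparing options; a small sketch (rates are the RunPod estimates from the table above, not quotes):

```python
def monthly_cost(gpu_per_hour: float, hours_per_day: float = 24,
                 storage_gb: int = 500, storage_per_gb: float = 0.10) -> float:
    """Rough monthly cost: GPU time plus a persistent network volume."""
    gpu = gpu_per_hour * hours_per_day * 30
    storage = storage_gb * storage_per_gb
    return round(gpu + storage, 2)

print(monthly_cost(0.50))                    # RTX 4090, 24/7  -> ~410.0
print(monthly_cost(0.50, hours_per_day=8))   # RTX 4090, 8h/d  -> ~170.0
```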
---

## Appendix D: Model Recommendations by Use Case

### General Chat (24/7 Inference)
**Best:** Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning

**Alternative:** Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM

### Code Generation
**Best:** Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support

**Alternative:** DeepSeek Coder 6.7B
- Faster, lighter

### Creative Writing
**Best:** Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well

### Image Generation (Realistic)
**Best:** FLUX.1-dev
- Highest quality
- Best prompt following

**Alternative:** SDXL + RealVisXL LoRA
- Faster generation
- Good quality

### Image Generation (Anime)
**Best:** SDXL + AnimagineXL LoRA
- Anime-specific training
- Vibrant colors

### Video Generation
**Best:** AnimateDiff + SDXL
- 16-frame clips
- Good quality

**Needs:** A100 40GB or better

---

## Next Steps

1. **Review this plan** and provide feedback
2. **Set budget** for GPU infrastructure
3. **Choose provider** (recommend RunPod)
4. **Define priority services** (LLM hosting first? Image gen first?)
5. **Schedule implementation** (4-week timeline above)

Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?

Let me know how you'd like to proceed! 🚀