# GPU-Enhanced AI Stack Expansion Plan

## Executive Summary

This document outlines a comprehensive plan to extend the current AI stack (LiteLLM, Open WebUI, Crawl4AI) with dedicated GPU hosting capabilities for:

- **LLM Model Hosting**: Self-hosted models (Llama, Mistral, Qwen, etc.)
- **Model Training**: Fine-tuning and training workflows
- **Image Generation**: Stable Diffusion, FLUX via ComfyUI
- **Video Generation**: AnimateDiff, CogVideo, etc.

**Current Architecture**: CPU-based stack on pivoine.art VPS → Claude API via LiteLLM
**Target Architecture**: Hybrid stack with GPU server(s) for self-hosted models + API-based models

---

## Phase 1: Current Stack Analysis

### Existing Components

1. **ai_postgres** (pgvector/pgvector:pg16)
   - PostgreSQL with pgvector for RAG
   - Stores: conversations, embeddings, LiteLLM logs

2. **webui** (Open WebUI)
   - User-facing ChatGPT-like interface
   - URL: https://ai.pivoine.art
   - Features: RAG, web search, document upload
   - Connected to LiteLLM proxy

3. **litellm** (LiteLLM proxy)
   - Currently proxies Anthropic Claude API
   - OpenAI-compatible endpoint at http://litellm:4000
   - Supports multiple providers via config

4. **crawl4ai**
   - Internal web scraping for LLM content prep
   - Port 11235 (internal only)

5. **facefusion** (CPU-only)
   - Face swapping/enhancement
   - Currently CPU-based (slow)
   - Protected by Authelia SSO

### Current Limitations

- ❌ No self-hosted LLMs (relies on expensive API calls)
- ❌ No GPU acceleration for facefusion
- ❌ No image generation capabilities
- ❌ No model training/fine-tuning capabilities
- ❌ No video generation
- ❌ High operational costs for API usage

---

## Phase 2: GPU Provider Comparison

### Provider Options

#### 1. **RunPod** ⭐ RECOMMENDED

**Pros:**
- Pay-per-second GPU billing
- Wide GPU selection (RTX 4090, A100, H100)
- Docker-first platform
- Global locations
- Easy HTTP/SSH tunneling
- Volume persistence

**Pricing (Approximate):**
- RTX 4090 (24GB): ~$0.50/hour ($360/month 24/7)
- RTX 3090 (24GB): ~$0.35/hour ($250/month)
- A6000 (48GB): ~$0.80/hour ($576/month)
- A100 (40GB): ~$1.50/hour ($1,080/month)

**Best for:** On-demand workloads, experimentation, cost-conscious hosting

---

#### 2. **Lambda Labs**

**Pros:**
- Flat monthly pricing
- High-end GPUs (A100, H100)
- Jupyter notebooks included
- Fast network

**Pricing:**
- 1x A100 (40GB): $1.10/hour ($792/month)
- 8x A100 (40GB): $8.00/hour (~$5,760/month)

**Best for:** Research, high-utilization workloads

---

#### 3. **Vast.ai**

**Pros:**
- Marketplace model (cheapest)
- Many GPU options
- Spot pricing available

**Cons:**
- Variable reliability
- Setup complexity
- Community-hosted machines

**Pricing:**
- RTX 4090: ~$0.25-0.40/hour
- A100: ~$0.80-1.20/hour

**Best for:** Budget-conscious, experimental workloads

---

#### 4. **Google Cloud Platform (GCP)**

**Pros:**
- Enterprise reliability
- Auto-scaling
- Integration with Google services
- Preemptible instances available

**Pricing:**
- T4 (16GB): ~$0.35/hour
- V100 (16GB): ~$2.48/hour
- A100 (40GB): ~$2.93/hour
- TPU options available

**Best for:** Enterprise workloads, auto-scaling needs

---

#### 5. **AWS**

**Pros:**
- Global infrastructure
- Broad GPU selection
- Spot instances for cost savings
- Enterprise support

**Pricing:**
- g4dn.xlarge (T4 16GB): ~$0.526/hour
- p3.2xlarge (V100 16GB): ~$3.06/hour
- p4d.24xlarge (8x A100 40GB): ~$32.77/hour

**Best for:** Enterprise, existing AWS infrastructure

---
#### 6. **Hugging Face Spaces / Inference Endpoints**

**Pros:**
- Managed model hosting
- Auto-scaling
- Simple deployment
- Community models

**Pricing:**
- CPU: $0.03/hour
- T4: $0.60/hour
- A10G: $1.00/hour
- A100: $4.00/hour

**Best for:** Quick model deployment, serverless inference

---

### Recommendation: **RunPod** for Primary GPU Server

**Rationale:**
1. **Cost-effective**: Pay-per-second billing, ~$0.50/hour for RTX 4090
2. **Docker-native**: Easy integration with existing compose stack
3. **Flexibility**: Start/stop as needed, scale up for training
4. **Community**: Large user base, good documentation
5. **Network**: Built-in HTTP/SSH tunneling

**Supplementary**: Use Hugging Face for specific model hosting if needed

---

## Phase 3: Architecture Design

### Network Topology

```
┌──────────────────────────────────────────────────────────┐
│                pivoine.art VPS (CPU-based)                │
│                                                           │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐         │
│  │   Open   │─────▶│ LiteLLM  │◀────▶│   ai_    │         │
│  │  WebUI   │      │  Proxy   │      │ postgres │         │
│  └──────────┘      └──────────┘      └──────────┘         │
│                         │                                 │
└─────────────────────────┼─────────────────────────────────┘
                          │
          ┌───────────────┴────────────────┐
          ▼                                ▼
 ┌─────────────────┐          (Tunneled via WireGuard
 │  Anthropic API  │                or Tailscale)
 │    (Claude)     │                       │
 └─────────────────┘                       ▼
        ┌──────────────────────────────────────────────────────┐
        │                 GPU Server (RunPod)                   │
        │                                                       │
        │  ┌────────┐ ┌─────────┐ ┌──────────┐ ┌────────────┐   │
        │  │  vLLM  │ │ ComfyUI │ │  Model   │ │ JupyterLab │   │
        │  │ (LLMs) │ │(SD/FLUX)│ │ Training │ │            │   │
        │  └────────┘ └─────────┘ └──────────┘ └────────────┘   │
        │                          │                            │
        │                          ▼                            │
        │                  ┌───────────────┐                    │
        │                  │ Model Storage │                    │
        │                  │ (Persistent)  │                    │
        │                  └───────────────┘                    │
        └──────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Integration Options:                                        │
│ 1. LiteLLM adds vLLM endpoint (http://gpu.internal:8000)    │
│ 2. ComfyUI exposed via subdomain (comfy.ai.pivoine.art)     │
│ 3. Model storage synced via rclone/restic                   │
└────────────────────────────────────────────────────────────┘
```

### Connection Methods

#### Option A: WireGuard VPN (RECOMMENDED)

- Create WireGuard tunnel between VPS and GPU server
- GPU services accessible via private IPs
- Secure, low overhead, easy to manage
- Already have wg-easy in your stack

**Setup** (a config sketch follows at the end of this section):
1. Deploy WireGuard on GPU server
2. Add GPU server as VPN peer
3. Configure LiteLLM to use VPN IPs

#### Option B: SSH Tunnel

- SSH reverse tunnel from GPU to VPS
- Simple, no additional software
- Higher latency

#### Option C: Tailscale

- Zero-config VPN mesh
- Easy setup, good UX
- Proprietary (but free tier available)
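To make Option A concrete, here is a minimal sketch of the GPU-server-side WireGuard configuration. All keys, addresses, and the endpoint are placeholders: the real values come from the existing wg-easy instance when the peer is created, and the 10.8.0.0/24 subnet is only an assumption.

```bash
# Illustrative only: keys, addresses, and the endpoint are placeholders
# supplied by the wg-easy peer you create on the VPS.
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
# Private key generated on the GPU server (wg genkey)
PrivateKey = <gpu-server-private-key>
# VPN address assigned to the GPU server (assumed subnet)
Address = 10.8.0.10/24

[Peer]
# Public key of the VPS-side WireGuard (wg-easy) instance
PublicKey = <vps-wireguard-public-key>
# Public endpoint of the VPS
Endpoint = <vps-public-ip-or-hostname>:51820
# Route only the VPN subnet through the tunnel
AllowedIPs = 10.8.0.0/24
# Keep the tunnel open behind NAT
PersistentKeepalive = 25
EOF

# Bring the tunnel up and check that the VPS side (assumed 10.8.0.1) answers
wg-quick up wg0
ping -c 1 10.8.0.1
```

Once the tunnel is up, the GPU server's VPN address is what a name like `gpu.internal` in the LiteLLM examples below would point to, e.g. via an /etc/hosts entry or internal DNS on the VPS.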
---

## Phase 4: Service Implementation Plans

### 4.1 LLM Hosting with vLLM

**vLLM** is an industry-standard engine for high-performance LLM inference.

#### Features:
- PagedAttention for efficient KV cache
- Continuous batching
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
- Quantization support (AWQ, GPTQ)

#### Docker Compose Configuration:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: '0'
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct  # or any model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --tensor-parallel-size
      - '1'
      - --gpu-memory-utilization
      - '0.9'
      - --max-model-len
      - '8192'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Recommended Models for RTX 4090 (24GB):

**Text Generation:**
- Llama 3.1 8B Instruct (8GB VRAM, fast)
- Qwen2.5 14B Instruct (14GB VRAM, multilingual)
- Mistral 7B Instruct v0.3 (7GB VRAM)
- Nous Hermes 2 Mixtral 8x7B (with quantization, 16GB)

**Code:**
- DeepSeek Coder 6.7B (7GB VRAM)
- CodeLlama 13B (13GB VRAM)
- Qwen2.5-Coder 14B (14GB VRAM)

#### Integration with LiteLLM:

Add to `ai/litellm-config.yaml`:

```yaml
model_list:
  # Existing Anthropic
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # New vLLM models
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy

  - model_name: qwen-2.5-14b
    litellm_params:
      model: openai/Qwen/Qwen2.5-14B-Instruct
      api_base: http://gpu.internal:8000/v1
      api_key: dummy
```

---

### 4.2 ComfyUI for Image/Video Generation

**ComfyUI** is a node-based UI for Stable Diffusion with advanced workflows.

#### Features:
- Node-based workflow editor
- Support for SD 1.5, SDXL, SD3, FLUX
- ControlNet, LoRA, embeddings
- Video generation (AnimateDiff, SVD)
- API for automation

#### Docker Compose Configuration:

```yaml
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      # Custom nodes auto-install
      COMFYUI_FLAGS: --listen 0.0.0.0 --port 8188
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Model Downloads (via ComfyUI Manager):

**Stable Diffusion Models:**
- FLUX.1-dev (12GB, newest, best quality)
- FLUX.1-schnell (12GB, fast)
- SDXL Base 1.0 (6.9GB)
- SD 1.5 (4GB, fast, wide LoRA support)

**ControlNet Models:**
- controlnet-canny-sdxl
- controlnet-depth-sdxl
- controlnet-openpose-sdxl

**LoRA Models** (download from Civitai):
- Style LoRAs (anime, realistic, etc.)
- Character LoRAs
- Concept LoRAs

#### Traefik Integration:

Add subdomain routing for ComfyUI:

```yaml
labels:
  - 'traefik.enable=true'
  - 'traefik.http.routers.comfyui-web-secure.rule=Host(`comfy.ai.pivoine.art`)'
  - 'traefik.http.routers.comfyui-web-secure.tls.certresolver=resolver'
  - 'traefik.http.routers.comfyui-web-secure.entrypoints=web-secure'
  - 'traefik.http.routers.comfyui-web-secure.middlewares=net-authelia,security-headers@file'
  - 'traefik.http.services.comfyui.loadbalancer.server.port=8188'
```

#### Open WebUI Integration:

ComfyUI has a REST API that can be called from Open WebUI using function calling.
Example workflow API call:

```python
import requests

def generate_image(prompt: str, negative_prompt: str = ""):
    # ComfyUI workflow JSON: the exported node graph, with prompt and
    # negative_prompt inserted into the appropriate text-encode nodes
    workflow = {
        # ComfyUI workflow JSON
    }
    response = requests.post(
        "http://comfyui:8188/prompt",
        json={"prompt": workflow}
    )
    return response.json()
```

---

### 4.3 Model Training Infrastructure

For fine-tuning LLMs and training custom models.

#### Option A: Axolotl (Recommended)

**Axolotl** is a user-friendly fine-tuning framework supporting:
- LoRA, QLoRA
- Full fine-tuning
- RLHF/DPO
- Multi-GPU training

```yaml
services:
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      WANDB_API_KEY: ${WANDB_API_KEY:-}  # Optional: Weights & Biases logging
    command: |
      bash -c "
      accelerate launch -m axolotl.cli.train /workspace/configs/config.yaml
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

#### Training Workflow:

1. Prepare dataset (JSONL format; a sample record follows the example config below)
2. Create Axolotl config (LoRA, batch size, epochs)
3. Start training container
4. Monitor via Weights & Biases or TensorBoard
5. Export LoRA adapters
6. Merge with base model or use in vLLM

#### Example Config:

```yaml
# training/configs/lora-llama3.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /workspace/data/train.jsonl
    type: completion
    field: text

output_dir: /workspace/output/llama3-lora

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 100
```
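For reference, the `completion` dataset type with `field: text` used above expects one JSON object per line, with the raw training text in the `text` field. A hypothetical `train.jsonl` (invented records, purely illustrative) could be created like this; the `training/data/` path matches the volume mount in the Axolotl compose service:

```bash
# Write two illustrative records; the "text" field is what gets tokenized
cat > training/data/train.jsonl <<'EOF'
{"text": "Customer: My invoice shows a double charge.\nAgent: Sorry about that, the duplicate charge has been refunded and should disappear within 3-5 business days."}
{"text": "Customer: How do I reset my password?\nAgent: Use the 'Forgot password' link on the login page and follow the emailed instructions."}
EOF
```

With the `completion` type, Axolotl treats the `text` field as plain text for causal-LM training, so any consistent formatting of the records works.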
#### Option B: JupyterLab for Custom Training

For research and custom training scripts:

```yaml
services:
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

---

### 4.4 Model Storage Strategy

#### Storage Requirements:

**Per Model Type:**
- LLM 7B: ~14GB (FP16)
- LLM 13B: ~26GB
- SDXL: ~7GB
- FLUX: ~12GB
- ControlNet: ~2.5GB each
- LoRA: ~100-500MB each

**Total Estimated:**
- 3-4 LLMs: ~80GB
- SD models + LoRAs: ~50GB
- Training checkpoints: ~100GB
- **Total: 250-300GB minimum**

#### RunPod Storage Options:

1. **Network Volume** (Recommended)
   - Persistent across pod restarts
   - Shared between multiple pods
   - ~$0.10/GB/month
   - 500GB = $50/month

2. **Container Disk**
   - Included with pod
   - Lost when pod stops
   - Good for temporary storage

3. **External Storage (rclone)**
   - Sync to/from VPS or cloud storage
   - Backup models to Backblaze B2 or Wasabi
   - Good for disaster recovery

#### Model Management:

Use **Hugging Face Hub** as model cache:

```bash
# Download models on first run
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --local-dir /models/llama-3.1-8b

# Or let vLLM/ComfyUI auto-download
```

**Model Sync Script:**

```bash
#!/bin/bash
# sync-models.sh - Sync models from VPS to GPU server
rclone sync \
  /mnt/hidrive/AI/models \
  gpu:/workspace/models \
  --progress \
  --transfers 4
```

---

## Phase 5: Implementation Roadmap

### Week 1: Infrastructure Setup

**Day 1-2: RunPod Account & GPU Server**
- [ ] Create RunPod account
- [ ] Deploy RTX 4090 pod with Ubuntu 22.04 + PyTorch template
- [ ] Configure persistent network volume (500GB)
- [ ] Set up SSH access

**Day 3-4: Network Configuration**
- [ ] Deploy WireGuard on GPU server
- [ ] Add GPU server as peer to existing VPN (vpn/compose.yaml)
- [ ] Test connectivity between VPS and GPU server
- [ ] Configure firewall rules

**Day 5: Docker Setup on GPU Server**
- [ ] Install Docker + NVIDIA Container Toolkit
- [ ] Create docker-compose.yaml for GPU services
- [ ] Test GPU access in containers

---

### Week 2: LLM Hosting

**Day 1-2: vLLM Deployment**
- [ ] Deploy vLLM container
- [ ] Download Llama 3.1 8B Instruct
- [ ] Test inference locally
- [ ] Benchmark performance (tokens/sec; see the sketch after this roadmap)

**Day 3-4: LiteLLM Integration**
- [ ] Update litellm-config.yaml with vLLM endpoint
- [ ] Test via Open WebUI
- [ ] Configure model routing (cheap models → vLLM, complex → Claude)
- [ ] Set up usage monitoring

**Day 5: Model Expansion**
- [ ] Download Qwen 2.5 14B
- [ ] Download Mistral 7B Instruct
- [ ] Test model switching in Open WebUI
- [ ] Document performance characteristics

---

### Week 3: Image Generation

**Day 1-2: ComfyUI Setup**
- [ ] Deploy ComfyUI container
- [ ] Download FLUX.1-schnell
- [ ] Download SDXL
- [ ] Install ComfyUI Manager

**Day 3-4: Model Downloads**
- [ ] Download ControlNet models
- [ ] Download VAE models
- [ ] Download popular LoRAs from Civitai
- [ ] Organize model directory

**Day 5: Integration & Workflows**
- [ ] Create basic text-to-image workflow
- [ ] Create ControlNet workflow
- [ ] Test API access
- [ ] Add Traefik subdomain (comfy.ai.pivoine.art)

---

### Week 4: Training Infrastructure

**Day 1-2: Axolotl Setup**
- [ ] Deploy Axolotl container
- [ ] Create sample dataset
- [ ] Test LoRA fine-tuning with tiny model
- [ ] Verify GPU utilization

**Day 3-4: JupyterLab Setup**
- [ ] Deploy JupyterLab container
- [ ] Install ML libraries
- [ ] Create example notebooks
- [ ] Test custom training scripts

**Day 5: Documentation & Testing**
- [ ] Write training guides
- [ ] Test end-to-end workflows
- [ ] Benchmark training speeds
- [ ] Document best practices

---

### Ongoing: Optimization & Expansion

**Month 2:**
- Monitor costs and optimize GPU utilization
- Implement model caching strategies
- Add more models based on usage patterns
- Set up automated model updates
- Implement usage quotas per user

**Month 3+:**
- Consider multi-GPU setup for larger models
- Implement model quantization (AWQ/GPTQ)
- Add video generation (AnimateDiff, CogVideo)
- Explore voice synthesis (XTTS, Bark)
- Custom model training for specific use cases
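As a concrete aid for the Week 2 benchmarking item, the sketch below measures rough tokens/sec against the vLLM OpenAI-compatible endpoint. The base URL and model name are assumptions matching the examples elsewhere in this plan; adjust them to whatever is actually deployed.

```python
# bench_vllm.py - rough tokens/sec measurement against a vLLM endpoint (sketch)
import time
from openai import OpenAI  # pip install openai

# Assumed VPN address of the GPU server; any non-empty api_key works for vLLM
client = OpenAI(base_url="http://gpu.internal:8000/v1", api_key="dummy")

start = time.time()
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching in two paragraphs."}],
    max_tokens=512,
    temperature=0.7,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```

Run it a few times with different prompts and `max_tokens` values; single-request numbers understate what continuous batching delivers under concurrent load.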
---

## Phase 6: Cost Analysis

### Scenario A: Single RTX 4090 (24/7)

**GPU Server (RunPod):**
- RTX 4090 pod: $0.50/hour × 720 hours = $360/month
- 500GB network volume: $50/month
- **Subtotal: $410/month**

**VPS (Existing):**
- No change in cost

**Total: ~$410/month**

**Savings:**
- Claude API costs reduced by ~80% (self-hosted for routine tasks)
- Break-even if currently spending >$500/month on API calls

---

### Scenario B: Pay-as-you-go (8 hours/day)

**GPU Server (RunPod):**
- RTX 4090: $0.50/hour × 8 hours × 30 days = $120/month
- Storage: $50/month
- **Subtotal: $170/month**

**Best for:**
- Development/experimentation
- Burst workloads
- Image generation on-demand

---

### Scenario C: Dual GPU (Training + Inference)

**GPU Server 1 (Inference):**
- RTX 4090 24/7: $360/month

**GPU Server 2 (Training - On-demand):**
- A100 40GB: $1.50/hour × 40 hours/month = $60/month
- Used only for fine-tuning sessions

**Storage:**
- 1TB network volume: $100/month

**Total: ~$520/month**

---

### Cost Optimization Tips

1. **Auto-stop idle pods**: RunPod can auto-stop after X minutes idle
2. **Use spot instances**: ~50% cheaper but can be interrupted
3. **Quantized models**: 4-bit models use 4x less VRAM → cheaper GPUs
4. **Batch processing**: Queue image gen jobs to maximize GPU usage
5. **Model sharing**: One vLLM instance can serve multiple models via adapters
6. **Monitoring**: Track per-model costs to optimize routing

---

## Phase 7: Monitoring & Operations

### Metrics to Track

**GPU Utilization:**
- nvidia-smi metrics (utilization %, memory usage)
- Temperature and power draw
- Per-process GPU usage

**Model Performance:**
- Tokens per second (LLM inference)
- Images per second (SD/FLUX)
- Training time per epoch

**Costs:**
- GPU hours consumed
- Storage usage
- API vs self-hosted breakdown

### Monitoring Stack

**Option A: Netdata (Already deployed)**

Add GPU monitoring to existing Netdata:

```yaml
# On GPU server
services:
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: |
      bash -c "
      # Enable nvidia_smi plugin
      /usr/libexec/netdata/plugins.d/charts.d.plugin nvidia_smi
      "
```

**Option B: Prometheus + Grafana**

For detailed metrics:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
```

Import Grafana dashboard #12219 for GPU metrics.
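The Prometheus service above mounts a `./prometheus.yml` that is not shown here. A minimal sketch, assuming the `dcgm-exporter` service name from the snippet and its default metrics port 9400, might look like this:

```yaml
# prometheus.yml - minimal scrape config for GPU metrics (sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      # dcgm-exporter serves /metrics on port 9400 by default
      - targets: ['dcgm-exporter:9400']
```

Because Prometheus and dcgm-exporter share the compose network, the exporter does not need a published host port; only Grafana is exposed.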
---

## Phase 8: Backup & Disaster Recovery

### What to Backup

1. **Models** (250-300GB)
   - Base models can be re-downloaded
   - Custom fine-tuned models: CRITICAL
   - LoRAs: CRITICAL

2. **Training Data** (~10-50GB)
   - Datasets
   - Preprocessing scripts

3. **Configurations** (<1GB)
   - Docker compose files
   - Training configs
   - Workflow JSONs

### Backup Strategy

**Tier 1: Critical (Daily)**
- Fine-tuned models
- Training checkpoints
- Custom datasets

**Backup to:**
- Restic → HiDrive (already configured)
- Backblaze B2 (~$6/TB/month)

```yaml
# Add to core/compose.yaml backrest config
- gpu_models:/volumes/gpu_models:ro
- gpu_checkpoints:/volumes/gpu_checkpoints:ro
```

**Tier 2: Nice-to-have (Weekly)**
- Base models (can re-download)
- ComfyUI outputs

**Tier 3: Ephemeral (No backup)**
- Inference cache
- Temporary generations

---

## Phase 9: Security Considerations

### GPU Server Security

1. **Firewall:**
   - Only allow WireGuard port (51820)
   - All services accessed via VPN
   - No public exposure

2. **SSH:**
   - Key-based auth only
   - Disable password auth
   - Change default port

3. **Docker:**
   - Rootless Docker (optional but recommended)
   - Limited container capabilities
   - No privileged containers except for nvidia-runtime

4. **Secrets:**
   - Store API keys in .env
   - Use Docker secrets for sensitive data
   - Rotate keys periodically

### Access Control

- **ComfyUI**: Protected by Authelia SSO (already configured)
- **vLLM**: Internal only, accessed via LiteLLM proxy
- **JupyterLab**: Password-protected or Authelia
- **Training**: No public access, VPN only

---

## Phase 10: Advanced Features (Future)

### Multi-GPU Scaling

**Tensor Parallelism** (vLLM):
- Split large models across multiple GPUs
- Example: 70B model on 2x A100s

```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - '2'  # Use 2 GPUs
```

**Pipeline Parallelism** (training):
- Split model layers across GPUs
- Useful for very large models

### Model Serving Optimization

**vLLM Features:**
- Speculative decoding (faster generation)
- Prefix caching (faster for repeated prompts)
- Multi-LoRA serving (multiple adapters, one base model)

**Example multi-LoRA:**

```yaml
command:
  - --model
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - --enable-lora
  - --max-loras
  - '4'
  - --lora-modules
  - customer-support=/models/loras/support-lora
  - creative-writing=/models/loras/writing-lora
```

### Video Generation

**AnimateDiff in ComfyUI:**
- Generate short videos from text prompts
- Animate static images
- ~8GB VRAM for 512x512 16-frame videos

**CogVideo:**
- High-quality video generation
- Requires A100 or H100
- 5-second clips at 720p

### Voice Synthesis

**XTTS v2:**
- High-quality voice cloning
- Multi-language support
- ~6GB VRAM

**Bark:**
- Text-to-speech with emotions
- Sound effects
- ~10GB VRAM

---

## Appendix A: Quick Start Commands

### Initial GPU Server Setup

```bash
# SSH into RunPod instance
ssh root@gpu.runpod.io -p 12345

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt-get update
apt-get install -y nvidia-container-toolkit
systemctl restart docker

# Test GPU access
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

### Deploy vLLM (Quick Test)

```bash
# Create directory
mkdir -p /workspace/vllm
cd /workspace/vllm

# Run vLLM
docker run -d \
  --name vllm \
  --runtime=nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
```

### Deploy ComfyUI (Quick Test)

```bash
docker run -d \
  --name comfyui \
  --runtime=nvidia \
  --gpus all \
  -v /workspace/comfyui:/data \
  -p 8188:8188 \
  ghcr.io/ai-dock/comfyui:latest

# Access at http://gpu-ip:8188
```

---

## Appendix B: Sample Docker Compose (Full GPU Stack)

```yaml
# gpu-server/compose.yaml
version: '3.8'

services:
  # vLLM for LLM inference
  vllm:
    image: vllm/vllm-openai:latest
    container_name: gpu_vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: '0'
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --gpu-memory-utilization
      - '0.9'
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ComfyUI for image generation
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: gpu_comfyui
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - comfyui_data:/data
      - comfyui_models:/opt/ComfyUI/models
      - comfyui_output:/opt/ComfyUI/output
    ports:
      - "8188:8188"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Axolotl for model training
  axolotl:
    image: winglian/axolotl:main-py3.11-cu121-2.2.2
    container_name: gpu_training
    runtime: nvidia
    volumes:
      - ./training/configs:/workspace/configs
      - ./training/data:/workspace/data
      - ./training/output:/workspace/output
      - training_cache:/root/.cache
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Only start when training
    profiles:
      - training

  # JupyterLab for research
  jupyter:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
    container_name: gpu_jupyter
    restart: unless-stopped
    runtime: nvidia
    volumes:
      - ./notebooks:/workspace
      - jupyter_cache:/root/.cache
    ports:
      - "8888:8888"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      JUPYTER_ENABLE_LAB: "yes"
    command: |
      bash -c "
      pip install jupyterlab transformers datasets accelerate bitsandbytes peft &&
      jupyter lab --ip=0.0.0.0 --allow-root --no-browser
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Netdata monitoring
  netdata:
    image: netdata/netdata:latest
    container_name: gpu_netdata
    restart: unless-stopped
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - /sys:/host/sys:ro
      - /proc:/host/proc:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined

volumes:
  vllm_models:
  comfyui_data:
  comfyui_models:
  comfyui_output:
  training_cache:
  jupyter_cache:
```

---

## Appendix C: Cost Calculator

**Monthly GPU Costs:**

| GPU Model | VRAM | $/hour | 24/7 Month | 8hr/day | Use Case |
|-----------|------|--------|------------|---------|----------|
| RTX 3090 | 24GB | $0.35 | $252 | $84 | Development, small models |
| RTX 4090 | 24GB | $0.50 | $360 | $120 | Production inference, SD |
| A6000 | 48GB | $0.80 | $576 | $192 | Large models, training |
| A100 40GB | 40GB | $1.50 | $1,080 | $360 | Enterprise, training |
| A100 80GB | 80GB | $2.50 | $1,800 | $600 | Massive models, research |

**Storage Costs:**
- Network volume: $0.10/GB/month
- 500GB = $50/month
- 1TB = $100/month

**Total Estimated Monthly:**
- RTX 4090 + 500GB storage = $410/month (24/7)
- RTX 4090 + 500GB storage = $170/month (8hr/day)

**Break-even Analysis:**
- If spending >$500/month on API calls → GPU server saves money
- If spending <$200/month → stick with APIs

---

## Appendix D: Model Recommendations by Use Case

### General Chat (24/7 Inference)

**Best:** Qwen 2.5 14B Instruct
- Excellent multilingual support
- Fast inference
- Good reasoning

**Alternative:** Mistral 7B Instruct v0.3
- Fastest inference
- Lower VRAM

### Code Generation

**Best:** Qwen 2.5 Coder 14B
- SOTA coding performance
- Multi-language support

**Alternative:** DeepSeek Coder 6.7B
- Faster, lighter

### Creative Writing

**Best:** Nous Hermes 2 Mixtral 8x7B (quantized)
- Creative, engaging
- Follows instructions well

### Image Generation (Realistic)

**Best:** FLUX.1-dev
- Highest quality
- Best prompt following

**Alternative:** SDXL + RealVisXL LoRA
- Faster generation
- Good quality

### Image Generation (Anime)

**Best:** SDXL + AnimagineXL LoRA
- Anime-specific training
- Vibrant colors

### Video Generation

**Best:** AnimateDiff + SDXL
- 16-frame clips
- Good quality

**Needs:** A100 40GB or better

---

## Next Steps

1. **Review this plan** and provide feedback
2. **Set budget** for GPU infrastructure
3. **Choose provider** (recommend RunPod)
4. **Define priority services** (LLM hosting first? Image gen first?)
5. **Schedule implementation** (4-week timeline above)

Would you like me to:
- Create the detailed Docker Compose configurations?
- Set up a cost estimation spreadsheet?
- Research specific models for your use cases?
- Begin implementation with Phase 1?

Let me know how you'd like to proceed! 🚀