f8694653d0
fix: adjust VRAM for 24K context based on actual usage
Build and Push RunPod Docker Image / build-and-push (push) Successful in 13s
Based on the error output, the model actually uses ~17.5GB (not the 15GB previously estimated).
- Llama: 85% VRAM for 24576 context (3GB KV cache; see the estimate sketched below)
- BGE: 6% VRAM (reduced to fit)
- Total: 91%
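The 3GB KV-cache figure above matches a quick back-of-the-envelope estimate, assuming a Llama-3-8B-style attention layout (32 layers, 8 KV heads, head dim 128, fp16 cache); these architecture numbers are an assumption, since the log does not name the exact model:

```python
# Rough KV-cache estimate for 24576 tokens, assuming a Llama-3-8B-style
# architecture (32 layers, 8 KV heads, head_dim 128) with an fp16 KV cache.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
context_len = 24576

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
kv_cache_gib = bytes_per_token * context_len / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_cache_gib:.1f} GiB for {context_len} tokens")
# -> 128 KiB per token, ~3.0 GiB for 24576 tokens
```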
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:46:34 +01:00
078043e35a
feat: balance Llama 24K context with concurrent BGE
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
Adjusted VRAM allocation for concurrent operation:
- Llama: 80% VRAM, 24576 context
- BGE: 8% VRAM
- Total: 88% of the 24GB RTX 4090 (rough breakdown below)
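As a sanity check, the fractions translate into roughly these absolute numbers on a 24GB card (illustrative arithmetic only; vLLM reserves its fraction of total VRAM):

```python
# Illustrative arithmetic: what the utilization fractions mean on a 24GB RTX 4090.
total_gb = 24
llama_gb = 0.80 * total_gb   # 19.2 GB for Llama weights + KV cache
bge_gb = 0.08 * total_gb     # ~1.9 GB for BAAI/bge-large-en-v1.5
headroom_gb = total_gb - llama_gb - bge_gb
print(f"Llama {llama_gb:.1f} GB, BGE {bge_gb:.2f} GB, "
      f"headroom {headroom_gb:.1f} GB")  # ~2.9 GB left for CUDA context etc.
```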
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:39:08 +01:00
c969d10eaf
feat: increase Llama context to 32K with 95% VRAM
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
For larger input requirements, increased max-model-len from 20480 to 32768.
BGE remains available but cannot run concurrently at this VRAM level.
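Rough numbers behind "cannot run concurrently": at 95% utilization the Llama server claims nearly the whole card, leaving less than the ~1.3GB that the BGE weights alone need (a figure noted in an earlier commit below):

```python
# Why BGE cannot run alongside Llama at this setting (rough arithmetic).
total_gb = 24
llama_reserved_gb = 0.95 * total_gb      # 22.8 GB claimed by the Llama server
free_gb = total_gb - llama_reserved_gb   # ~1.2 GB left over
bge_weights_gb = 1.3                     # bge-large-en-v1.5 weights alone
print(f"free: {free_gb:.1f} GB, BGE needs >= {bge_weights_gb} GB")  # no fit
```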
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:37:11 +01:00
f68bc47915
feat: increase Llama max-model-len to 20480
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
Adjusted VRAM allocation for the larger context window (equivalent vLLM engine arguments sketched below):
- Llama: 90% VRAM, 20480 context (up from 8192)
- BGE: 8% VRAM (down from 10%)
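For orientation, the two numbers being tuned here map directly onto vLLM engine arguments. A minimal sketch using vLLM's offline Python API; the repo drives these values through its config files and the server CLI instead, and the model id below is an assumption, since the log only says "Llama":

```python
from vllm import LLM

# Sketch only: the model id is assumed (not taken from this log); the repo
# sets these values via its vLLM config rather than this offline API.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumption
    gpu_memory_utilization=0.90,  # "90% VRAM"
    max_model_len=20480,          # context window, up from 8192
)
```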
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:15:08 +01:00
b2de3b17ee
fix: adjust VRAM allocation for concurrent Llama+BGE
Build and Push RunPod Docker Image / build-and-push (push) Successful in 13s
- Llama: 85% GPU, 8K context (model needs ~15GB base)
- BGE: 10% GPU (1.3GB model)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 20:16:00 +01:00
f668e06228
feat: add BGE embedding model for concurrent operation with Llama
Build and Push RunPod Docker Image / build-and-push (push) Successful in 36s
- Create config_bge.yaml for BAAI/bge-large-en-v1.5 on port 8002 (smoke test sketched below)
- Reduce Llama VRAM to 70% and context to 16K for concurrent use
- Add BGE service to supervisor with vllm group
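A hypothetical smoke test for the new endpoint, assuming the BGE server exposes vLLM's OpenAI-compatible /v1/embeddings route on port 8002 as config_bge.yaml suggests; the endpoint path and response shape are assumptions, not confirmed by this log:

```python
# Hypothetical smoke test: query the BGE embedding server on port 8002.
# Assumes vLLM's OpenAI-compatible /v1/embeddings route; not confirmed here.
import requests

resp = requests.post(
    "http://localhost:8002/v1/embeddings",
    json={"model": "BAAI/bge-large-en-v1.5", "input": ["hello world"]},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # bge-large-en-v1.5 embeddings are 1024-dimensional
```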
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 19:55:13 +01:00
b9beef283d
fix: remove vllm embedding
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 01:24:05 +01:00
90fa8a073c
fix: remove vllm embedding
Build and Push RunPod Docker Image / build-and-push (push) Successful in 36s
2025-11-27 01:12:57 +01:00
4d7c811a46
fix: vllm gpu utilization 2
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 00:57:14 +01:00
eaa8e0ebab
fix: vllm gpu utilization
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 00:50:42 +01:00
5c61ac5c67
Initial commit
Build and Push RunPod Docker Image / build-and-push (push) Successful in 1m28s
2025-11-26 17:15:08 +01:00