f8694653d0
fix: adjust VRAM for 24K context based on actual usage
Build and Push RunPod Docker Image / build-and-push (push) Successful in 13s
Based on the error output, the model actually uses ~17.5GB (not the 15GB previously estimated).
- Llama: 85% VRAM for 24576 context (3GB KV cache; see the estimate sketched below)
- BGE: 6% VRAM (reduced to fit)
- Total: 91%
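The 3GB KV-cache figure above matches a quick back-of-the-envelope estimate, assuming a Llama-3-8B-style attention layout (32 layers, 8 KV heads, head dim 128, fp16 cache); these architecture numbers are an assumption, since the log does not name the exact model:

```python
# Rough KV-cache estimate for 24576 tokens, assuming a Llama-3-8B-style
# architecture (32 layers, 8 KV heads, head_dim 128) with an fp16 KV cache.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
context_len = 24576

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
kv_cache_gib = bytes_per_token * context_len / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_cache_gib:.1f} GiB for {context_len} tokens")
# -> 128 KiB per token, ~3.0 GiB for 24576 tokens
```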
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:46:34 +01:00
078043e35a
feat: balance Llama 24K context with concurrent BGE
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
Adjusted VRAM allocation for concurrent operation:
- Llama: 80% VRAM, 24576 context
- BGE: 8% VRAM
- Total: 88% of the 24GB RTX 4090 (rough breakdown below)
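As a sanity check, the fractions translate into roughly these absolute numbers on a 24GB card (illustrative arithmetic only; vLLM reserves its fraction of total VRAM):

```python
# Illustrative arithmetic: what the utilization fractions mean on a 24GB RTX 4090.
total_gb = 24
llama_gb = 0.80 * total_gb   # 19.2 GB for Llama weights + KV cache
bge_gb = 0.08 * total_gb     # ~1.9 GB for BAAI/bge-large-en-v1.5
headroom_gb = total_gb - llama_gb - bge_gb
print(f"Llama {llama_gb:.1f} GB, BGE {bge_gb:.2f} GB, "
      f"headroom {headroom_gb:.1f} GB")  # ~2.9 GB left for CUDA context etc.
```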
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:39:08 +01:00
c969d10eaf
feat: increase Llama context to 32K with 95% VRAM
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
For larger input requirements, increased max-model-len from 20480 to 32768.
BGE remains available but cannot run concurrently at this VRAM level.
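Rough numbers behind "cannot run concurrently": at 95% utilization the Llama server claims nearly the whole card, leaving less than the ~1.3GB that the BGE weights alone need (a figure noted in an earlier commit below):

```python
# Why BGE cannot run alongside Llama at this setting (rough arithmetic).
total_gb = 24
llama_reserved_gb = 0.95 * total_gb      # 22.8 GB claimed by the Llama server
free_gb = total_gb - llama_reserved_gb   # ~1.2 GB left over
bge_weights_gb = 1.3                     # bge-large-en-v1.5 weights alone
print(f"free: {free_gb:.1f} GB, BGE needs >= {bge_weights_gb} GB")  # no fit
```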
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:37:11 +01:00
f68bc47915
feat: increase Llama max-model-len to 20480
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
Adjusted VRAM allocation for the larger context window (equivalent vLLM engine arguments sketched below):
- Llama: 90% VRAM, 20480 context (up from 8192)
- BGE: 8% VRAM (down from 10%)
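For orientation, the two numbers being tuned here map directly onto vLLM engine arguments. A minimal sketch using vLLM's offline Python API; the repo drives these values through its config files and the server CLI instead, and the model id below is an assumption, since the log only says "Llama":

```python
from vllm import LLM

# Sketch only: the model id is assumed (not taken from this log); the repo
# sets these values via its vLLM config rather than this offline API.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumption
    gpu_memory_utilization=0.90,  # "90% VRAM"
    max_model_len=20480,          # context window, up from 8192
)
```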
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 22:15:08 +01:00
b2de3b17ee
fix: adjust VRAM allocation for concurrent Llama+BGE
Build and Push RunPod Docker Image / build-and-push (push) Successful in 13s
- Llama: 85% GPU, 8K context (model needs ~15GB base)
- BGE: 10% GPU (1.3GB model)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 20:16:00 +01:00
f668e06228
feat: add BGE embedding model for concurrent operation with Llama
Build and Push RunPod Docker Image / build-and-push (push) Successful in 36s
- Create config_bge.yaml for BAAI/bge-large-en-v1.5 on port 8002 (smoke test sketched below)
- Reduce Llama VRAM to 70% and context to 16K for concurrent use
- Add BGE service to supervisor with vllm group
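A hypothetical smoke test for the new endpoint, assuming the BGE server exposes vLLM's OpenAI-compatible /v1/embeddings route on port 8002 as config_bge.yaml suggests; the endpoint path and response shape are assumptions, not confirmed by this log:

```python
# Hypothetical smoke test: query the BGE embedding server on port 8002.
# Assumes vLLM's OpenAI-compatible /v1/embeddings route; not confirmed here.
import requests

resp = requests.post(
    "http://localhost:8002/v1/embeddings",
    json={"model": "BAAI/bge-large-en-v1.5", "input": ["hello world"]},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # bge-large-en-v1.5 embeddings are 1024-dimensional
```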
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 19:55:13 +01:00
b9beef283d
fix: remove vllm embedding
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 01:24:05 +01:00
90fa8a073c
fix: remove vllm embedding
Build and Push RunPod Docker Image / build-and-push (push) Successful in 36s
2025-11-27 01:12:57 +01:00
4d7c811a46
fix: vllm gpu utilization 2
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 00:57:14 +01:00
eaa8e0ebab
fix: vllm gpu utilization
Build and Push RunPod Docker Image / build-and-push (push) Successful in 14s
2025-11-27 00:50:42 +01:00
5c61ac5c67
Initial commit
Build and Push RunPod Docker Image / build-and-push (push) Successful in 1m28s
2025-11-26 17:15:08 +01:00