Commit Graph

15 Commits

SHA1 Message Date
9947fe37bb fix: properly proxy streaming requests without buffering
The orchestrator was calling response.json(), which buffered the entire
streaming response before returning it. This caused LiteLLM to receive
only one chunk with empty content instead of token-by-token streaming.

Changes:
- Detect streaming requests by parsing request body for 'stream': true
- Use client.stream() with aiter_bytes() for streaming requests
- Return StreamingResponse with proper SSE headers
- Keep original JSONResponse behavior for non-streaming requests

This fixes streaming through the vLLM → orchestrator → LiteLLM chain.
2025-11-21 19:21:56 +01:00
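
A minimal sketch of the pattern this commit describes, assuming a FastAPI orchestrator proxying to vLLM over httpx (the route, backend URL, and header handling are illustrative, not the actual orchestrator code):

```python
import json

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()
backend = httpx.AsyncClient(base_url="http://localhost:8000")  # assumed vLLM address


@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.body()
    payload = json.loads(body)

    if payload.get("stream"):
        # Relay bytes as they arrive instead of buffering via response.json(),
        # so SSE chunks reach the caller (e.g. LiteLLM) token by token.
        async def relay():
            async with backend.stream(
                "POST", "/v1/chat/completions", content=body
            ) as upstream:
                async for chunk in upstream.aiter_bytes():
                    yield chunk

        return StreamingResponse(relay(), media_type="text/event-stream")

    # Non-streaming requests keep the original buffered JSON behavior.
    upstream = await backend.post("/v1/chat/completions", content=body)
    return JSONResponse(upstream.json(), status_code=upstream.status_code)
```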
7f1890517d fix: enable eager execution for proper token streaming in vLLM
- Set enforce_eager=True to disable CUDA graphs, which were batching outputs
- Add disable_log_stats=True for better streaming performance
- This ensures AsyncLLMEngine yields tokens incrementally instead of returning the complete response
2025-11-21 18:25:50 +01:00
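
Roughly what this configuration change amounts to, sketched with vLLM's AsyncEngineArgs (the model id mirrors the repo's Qwen service but is an assumption here):

```python
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="Qwen/Qwen2.5-7B-Instruct",  # assumed model id
        enforce_eager=True,      # skip CUDA graph capture, which batched outputs
        disable_log_stats=True,  # drop per-step stat logging during streaming
    )
)
```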
94080da341 fix: remove incorrect start-vllm.sh that would break orchestrator architecture 2025-11-21 18:10:53 +01:00
6944e4ebd5 feat: add vllm serve script with proper streaming support 2025-11-21 18:08:21 +01:00
d21caa56bc fix: implement incremental streaming deltas for vLLM chat completions
- Track previous_text to compute deltas instead of sending the full accumulated text
- Fixes the WebUI streaming issue where responses appeared empty
- Only send new tokens in each SSE chunk delta
- Restores OpenAI API compatibility for streaming chat completions
2025-11-21 17:23:18 +01:00
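
The delta logic in isolation, as a sketch (sse_delta_chunks is a hypothetical helper; it assumes each engine step yields the full accumulated text so far, which matches vLLM's generate loop):

```python
import json


def sse_delta_chunks(accumulated_texts, model):
    """Turn progressively longer accumulated texts into OpenAI-style SSE deltas."""
    previous_text = ""
    for accumulated in accumulated_texts:
        delta = accumulated[len(previous_text):]  # only the newly generated tokens
        previous_text = accumulated
        if not delta:
            continue
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": delta}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

Sending the full accumulated text in every chunk is what made responses appear empty in the WebUI; slicing off previous_text sends only the new suffix.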
57b706abe6 fix: correct vLLM service port to 8000
- Updated qwen-2.5-7b port from 8001 to 8000 in models.yaml
- Matches the vLLM server's actual default port
- Tested and verified: the orchestrator successfully loaded the model and generated a response
2025-11-21 16:28:54 +01:00
9a637cc4fc refactor: clean Docker files and restore standalone model services
- Remove all Docker-related files (Dockerfiles, compose.yaml)
- Remove documentation files (README, ARCHITECTURE, docs/)
- Remove old core/ directory (base_service, service_manager)
- Update models.yaml with correct service_script paths (models/*/server.py)
- Simplify vLLM requirements.txt to let vLLM manage dependencies
- Restore original standalone vLLM server (no base_service dependency)
- Remove obsolete vllm/, musicgen/, flux/ directories

Process-based architecture is now fully functional on RunPod.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 16:17:38 +01:00
9ee626a78e feat: implement Ansible-based process architecture for RunPod
Major architecture overhaul to address RunPod Docker limitations:

Core Infrastructure:
- Add base_service.py: Abstract base class for all AI services
- Add service_manager.py: Process lifecycle management
- Add core/requirements.txt: Core dependencies

Model Services (Standalone Python):
- Add models/vllm/server.py: Qwen 2.5 7B text generation
- Add models/flux/server.py: Flux.1 Schnell image generation
- Add models/musicgen/server.py: MusicGen Medium music generation
- Each service inherits from GPUService base class
- OpenAI-compatible APIs
- Standalone execution support

Ansible Deployment:
- Add playbook.yml: Comprehensive deployment automation
- Add ansible.cfg: Ansible configuration
- Add inventory.yml: Localhost inventory
- Tags: base, python, dependencies, models, tailscale, validate, cleanup

Scripts:
- Add scripts/install.sh: Full installation wrapper
- Add scripts/download-models.sh: Model download wrapper
- Add scripts/start-all.sh: Start orchestrator
- Add scripts/stop-all.sh: Stop all services

Documentation:
- Update ARCHITECTURE.md: Document distributed VPS+GPU architecture

Benefits:
- No Docker: Avoids RunPod CAP_SYS_ADMIN limitations
- Fully reproducible via Ansible
- Extensible: Add models in 3 steps
- Direct Python execution (no container overhead)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 15:37:18 +01:00
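
The base-class pattern these services share might look like the following sketch (the real base_service.py is not shown in this log, so everything beyond the GPUService name and the inheritance pattern is an assumption):

```python
from abc import ABC, abstractmethod

import uvicorn
from fastapi import FastAPI


class GPUService(ABC):
    """Sketch of a base class for standalone, OpenAI-compatible GPU services."""

    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port
        self.app = FastAPI(title=name)
        self.app.get("/health")(self.health)

    @abstractmethod
    def load_model(self) -> None:
        """Load model weights onto the GPU; each service implements this."""

    async def health(self) -> dict:
        return {"status": "ok", "service": self.name}

    def run(self) -> None:
        # Standalone execution support: load the model, then serve HTTP.
        self.load_model()
        uvicorn.run(self.app, host="0.0.0.0", port=self.port)
```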
03a430894d docs: add clean extensible architecture design
Created comprehensive architecture document for RunPod deployment:

**Key Design Principles:**
- No Docker (direct Python for RunPod compatibility)
- Extensible (add models in 3 simple steps)
- Maintainable (clear structure, base classes)
- Simple (one-command startup)

**Structure:**
- core/ - Base service class + service manager
- model-orchestrator/ - Request routing
- models/ - Service implementations (vllm, flux, musicgen)
- scripts/ - Install, start, stop, template prep
- docs/ - Adding models, deployment, templates

**Adding New Models:**
1. Create server.py inheriting BaseService
2. Add entry to models.yaml
3. Add requirements.txt

That's it! The orchestrator handles the lifecycle automatically.

Next: Implement base_service.py and refactor existing services.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 15:16:51 +01:00
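
Step 1 of the three steps above, sketched as a hypothetical minimal server.py (it reuses the GPUService sketch from the previous commit; EchoService, the port, and the import path are illustrative):

```python
from core.base_service import GPUService  # assumed module path


class EchoService(GPUService):
    """Toy new model service: inherit the base class, implement load_model."""

    def load_model(self) -> None:
        self.model = lambda prompt: prompt  # stand-in for real model weights


if __name__ == "__main__":
    EchoService(name="echo", port=8002).run()
```

Steps 2 and 3 would then register this script in models.yaml and add a requirements.txt next to it.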
31be1932e7 wip: start architecture redesign for RunPod (no Docker)
Started redesigning architecture to run services directly without Docker:

**Completed:**
- Created new process-based orchestrator (orchestrator_subprocess.py)
- Uses subprocess instead of Docker SDK for process management
- Updated models.yaml to reference service_script paths
- vLLM server already standalone-ready

**Still needed:**
- Create/update Flux and MusicGen standalone servers
- Create systemd service files or startup scripts
- Update prepare-template script for Python deployment
- Remove Docker/Compose dependencies
- Test full stack on RunPod
- Update documentation

Reason for change: RunPod's containerized environment doesn't support
Docker-in-Docker (requires CAP_SYS_ADMIN). Direct Python execution is
simpler, faster, and more reliable for RunPod.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 15:09:30 +01:00
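
The subprocess-based lifecycle management might reduce to something like this (a sketch; the actual orchestrator_subprocess.py internals are assumptions):

```python
import signal
import subprocess


class ProcessManager:
    """Manage model services as plain OS processes instead of Docker containers."""

    def __init__(self) -> None:
        self.procs: dict[str, subprocess.Popen] = {}

    def start(self, name: str, service_script: str) -> None:
        if name in self.procs and self.procs[name].poll() is None:
            return  # already running
        self.procs[name] = subprocess.Popen(["python", service_script])

    def stop(self, name: str) -> None:
        proc = self.procs.pop(name, None)
        if proc and proc.poll() is None:
            proc.send_signal(signal.SIGTERM)  # let the service shut down cleanly
            proc.wait(timeout=30)
```

No Docker SDK and no daemon: subprocess.Popen sidesteps the CAP_SYS_ADMIN requirement entirely.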
cd9e2eee2e fix: use legacy Docker builder for RunPod compatibility
- Set DOCKER_BUILDKIT=0 to use legacy builder
- BuildKit has permission issues in RunPod's containerized environment
- Legacy builder works reliably with RunPod's security constraints

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 15:01:16 +01:00
8f1d4bedd2 fix: update Docker daemon startup for RunPod environment
- Changed from systemctl/service to direct dockerd command
- Added --iptables=false --bridge=none flags (required for RunPod)
- Added proper error checking and 10s wait time
- Improved logging with verification step

This fixes Docker startup in RunPod's containerized environment, where
systemd is not available and iptables requires special handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 15:00:42 +01:00
0fa69cae28 refactor: rename docker-compose.gpu.yaml to compose.yaml
Simplified compose file naming to follow Docker Compose best practices:
- Renamed docker-compose.gpu.yaml to compose.yaml
- Updated all references in documentation files (README.md, DEPLOYMENT.md, GPU_DEPLOYMENT_LOG.md, RUNPOD_TEMPLATE.md)
- Updated references in scripts (prepare-template.sh)

This change enables simpler command syntax:
- Before: docker compose -f docker-compose.gpu.yaml up -d orchestrator
- After: docker compose up -d orchestrator

Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 14:49:42 +01:00
cafa0a1147 refactor: clean up runpod repository structure
Removed facefusion and VPS-related files:
- compose.yaml, postgres/, litellm-config.yaml (VPS services)
- Dockerfile, entrypoint.sh, disable-nsfw-filter.patch (facefusion)

Removed outdated documentation:
- DOCKER_GPU_SETUP.md, README_GPU_SETUP.md, SETUP_GUIDE.md
- TAILSCALE_SETUP.md, WIREGUARD_SETUP.md (covered in DEPLOYMENT.md)
- GPU_EXPANSION_PLAN.md (historical planning doc)
- gpu-server-compose.yaml, litellm-config-gpu.yaml (old versions)
- deploy-gpu-stack.sh, simple_vllm_server.py (old scripts)

Organized documentation:
- Created docs/ directory
- Moved DEPLOYMENT.md, RUNPOD_TEMPLATE.md, GPU_DEPLOYMENT_LOG.md to docs/
- Updated all documentation links in README.md

Final structure:
- Clean root directory with only GPU-specific files
- Organized documentation in docs/
- Model services in dedicated directories (model-orchestrator/, vllm/, flux/, musicgen/)
- Automation scripts in scripts/
2025-11-21 14:45:49 +01:00
277f1c95bd Initial commit: RunPod multi-modal AI orchestration stack
- Multi-modal AI infrastructure for RunPod RTX 4090
- Automatic model orchestration (text, image, music)
- Text: vLLM + Qwen 2.5 7B Instruct
- Image: Flux.1 Schnell via OpenEDAI
- Music: MusicGen Medium via AudioCraft
- Cost-optimized sequential loading on single GPU
- Template preparation scripts for rapid deployment
- Comprehensive documentation (README, DEPLOYMENT, TEMPLATE)
2025-11-21 14:34:55 +01:00