The orchestrator was calling `response.json()`, which buffered the entire
streaming response before returning it. This caused LiteLLM to receive
only a single chunk with empty content instead of token-by-token streaming.
Changes:
- Detect streaming requests by parsing the request body for `"stream": true`
- Use `client.stream()` with `aiter_bytes()` for streaming requests
- Return a StreamingResponse with proper SSE headers
- Keep the original JSONResponse behavior for non-streaming requests
This fixes streaming through the vLLM → orchestrator → LiteLLM chain.
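For context, a minimal sketch of the streaming-aware proxy path, assuming the orchestrator is built on FastAPI and httpx; the endpoint URL, function name, and headers are illustrative, not taken from the actual orchestrator code:

```python
import json

import httpx
from fastapi import Request
from fastapi.responses import JSONResponse, StreamingResponse

# Hypothetical upstream vLLM endpoint; the real URL would come from models.yaml.
UPSTREAM_URL = "http://localhost:8000/v1/chat/completions"


async def forward_request(request: Request):
    body = await request.body()
    wants_stream = json.loads(body or b"{}").get("stream", False)

    if not wants_stream:
        # Non-streaming: keep the original buffered JSONResponse behavior.
        async with httpx.AsyncClient(timeout=None) as client:
            upstream = await client.post(UPSTREAM_URL, content=body)
            return JSONResponse(upstream.json(), status_code=upstream.status_code)

    async def relay():
        # Stream bytes from vLLM as they arrive instead of buffering via .json().
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", UPSTREAM_URL, content=body) as upstream:
                async for chunk in upstream.aiter_bytes():
                    yield chunk

    return StreamingResponse(
        relay(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "Connection": "keep-alive"},
    )
```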
- Set `enforce_eager=True` to disable CUDA graphs, which were batching outputs
- Add `disable_log_stats=True` for better streaming performance
- This ensures AsyncLLMEngine yields tokens incrementally instead of returning a complete response (see the sketch below)
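A hedged sketch of the engine construction these flags apply to; the model id is an assumption, and the exact AsyncEngineArgs fields can differ between vLLM versions:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed HF id for qwen-2.5-7b
    enforce_eager=True,                # disable CUDA graphs so outputs aren't batched
    disable_log_stats=True,            # skip periodic stats logging while streaming
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```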
- Track `previous_text` to calculate deltas instead of sending the full accumulated text
- Fixes the WebUI streaming issue where responses appeared empty
- Only send new tokens in each SSE chunk delta
- Restores OpenAI API compatibility for streaming chat completions
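A minimal sketch of the delta calculation, assuming vLLM's RequestOutput exposes the accumulated text as `outputs[0].text` and that `generate()` takes `(prompt, sampling_params, request_id)` in the version in use; names are illustrative:

```python
import json


async def stream_chat(engine, prompt, sampling_params, request_id):
    previous_text = ""
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        full_text = request_output.outputs[0].text   # cumulative text so far
        delta = full_text[len(previous_text):]       # only the newly generated part
        previous_text = full_text
        if not delta:
            continue
        chunk = {
            "object": "chat.completion.chunk",
            "choices": [
                {"index": 0, "delta": {"content": delta}, "finish_reason": None}
            ],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```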
- Updated the qwen-2.5-7b port from 8001 to 8000 in models.yaml
- Matches the vLLM server's actual default port
- Tested and verified: the orchestrator successfully loaded the model and generated a response
Started redesigning the architecture to run services directly without Docker:
**Completed:**
- Created new process-based orchestrator (orchestrator_subprocess.py)
- Uses subprocess instead of the Docker SDK for process management (see the sketch after this list)
- Updated models.yaml to reference `service_script` paths
- The vLLM server is already standalone-ready
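A minimal sketch of the subprocess-based approach, assuming a models.yaml layout with per-model `service_script` and `port` fields; names and structure are illustrative, not the actual orchestrator_subprocess.py:

```python
import subprocess
import sys

import yaml


def start_service(model_name: str, config_path: str = "models.yaml") -> subprocess.Popen:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    entry = config["models"][model_name]  # assumed layout of models.yaml
    cmd = [sys.executable, entry["service_script"], "--port", str(entry["port"])]
    # Launch the service directly as a child process instead of via the Docker SDK.
    return subprocess.Popen(cmd)


def stop_service(proc: subprocess.Popen, timeout: float = 10.0) -> None:
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
```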
**Still needed:**
- Create/update Flux and MusicGen standalone servers
- Create systemd service files or startup scripts
- Update prepare-template script for Python deployment
- Remove Docker/Compose dependencies
- Test full stack on RunPod
- Update documentation
Reason for change: RunPod's containerized environment doesn't support
Docker-in-Docker (it requires CAP_SYS_ADMIN). Direct Python execution is
simpler, faster, and more reliable on RunPod.
- Set `DOCKER_BUILDKIT=0` to use the legacy builder
- BuildKit has permission issues in RunPod's containerized environment
- The legacy builder works reliably within RunPod's security constraints
- Changed from systemctl/service to a direct dockerd command
- Added the --iptables=false and --bridge=none flags (required for RunPod)
- Added proper error checking and a 10s wait
- Improved logging with a verification step
This fixes Docker startup in RunPod's containerized environment, where
systemd is not available and iptables requires special handling.
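The actual change lives in a shell startup script; purely as an illustration of the same sequence, a Python rendering might look like this (paths and log locations are assumptions):

```python
import subprocess
import sys
import time

# dockerd must run without iptables or a bridge network inside RunPod's container.
DOCKERD_CMD = ["dockerd", "--iptables=false", "--bridge=none"]


def start_dockerd(log_path: str = "/var/log/dockerd.log") -> None:
    with open(log_path, "w") as log:
        proc = subprocess.Popen(DOCKERD_CMD, stdout=log, stderr=subprocess.STDOUT)
    time.sleep(10)  # give the daemon time to come up
    if proc.poll() is not None:
        sys.exit(f"dockerd exited early with code {proc.returncode}")
    # Verification step: 'docker info' succeeds only once the daemon is ready.
    if subprocess.run(["docker", "info"], capture_output=True).returncode != 0:
        sys.exit("dockerd started but is not responding to 'docker info'")
```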