The orchestrator was calling response.json(), which buffered the entire streaming response before returning it. As a result, LiteLLM received a single chunk with empty content instead of token-by-token streaming.

Changes:
- Detect streaming requests by parsing the request body for "stream": true
- Use client.stream() with aiter_bytes() for streaming requests
- Return a StreamingResponse with proper SSE headers
- Keep the original JSONResponse behavior for non-streaming requests

This fixes streaming through the vLLM → orchestrator → LiteLLM chain.