runpod

Sebastian Krüger 9947fe37bb fix: properly proxy streaming requests without buffering

The orchestrator was calling response.json(), which buffered the entire
streaming response before returning it. As a result, LiteLLM received
only one chunk with empty content instead of token-by-token streaming.
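
The effect of buffering can be illustrated without any HTTP machinery. The following is a minimal sketch, not the orchestrator's code: `fake_upstream`, `relay_buffered`, `relay_streaming`, and `collect` are hypothetical names, and an async generator stands in for the upstream response body. It models only the chunk count seen by the consumer, not the empty-content symptom described above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_upstream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for an upstream SSE body: one chunk per token."""
    for token in ("Hel", "lo", "!"):
        await asyncio.sleep(0)  # hand control back to the loop, like a network read
        yield f"data: {token}\n\n".encode()


async def relay_buffered(source: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Read the whole body before forwarding, so the consumer sees one chunk."""
    body = b"".join([chunk async for chunk in source])
    yield body


async def relay_streaming(source: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Forward each chunk as soon as it arrives."""
    async for chunk in source:
        yield chunk


async def collect(stream: AsyncIterator[bytes]) -> List[bytes]:
    """Gather everything the consumer would receive, chunk by chunk."""
    return [chunk async for chunk in stream]


buffered = asyncio.run(collect(relay_buffered(fake_upstream())))
streamed = asyncio.run(collect(relay_streaming(fake_upstream())))
print(len(buffered), len(streamed))
```

Both relays deliver the same bytes in total; only the pass-through variant preserves the chunk boundaries that a token-by-token consumer depends on.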

Changes:
- Detect streaming requests by parsing the request body for "stream": true
- Use client.stream() with aiter_bytes() for streaming requests
- Return StreamingResponse with proper SSE headers
- Keep original JSONResponse behavior for non-streaming requests
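
The detection step in the list above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this commit: `is_streaming_request` and `SSE_HEADERS` are hypothetical names, the body is assumed to be a JSON object, and the header values are ones commonly used for server-sent events rather than the ones the orchestrator necessarily sets.

```python
import json


def is_streaming_request(raw_body: bytes) -> bool:
    """Return True only when the JSON body has a top-level "stream": true.

    Hypothetical helper: malformed or non-object bodies are treated as
    non-streaming so they fall through to the regular JSON path.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        return False
    return isinstance(payload, dict) and payload.get("stream") is True


# Assumed header set for an SSE response; the commit does not list the
# exact headers it sends.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx-style proxies not to buffer
}

print(is_streaming_request(b'{"model": "m", "stream": true}'))
print(is_streaming_request(b'{"model": "m"}'))
```

Comparing with `is True` keeps string values such as "true" from being treated as a streaming request, which leaves ambiguous bodies on the existing JSONResponse path.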

This fixes streaming through the vLLM → orchestrator → LiteLLM chain.

2025-11-21 19:21:56 +01:00

models.yaml

fix: correct vLLM service port to 8000

2025-11-21 16:28:54 +01:00

orchestrator_subprocess.py

wip: start architecture redesign for RunPod (no Docker)

2025-11-21 15:09:30 +01:00

orchestrator.py

fix: properly proxy streaming requests without buffering

2025-11-21 19:21:56 +01:00

requirements.txt

Initial commit: RunPod multi-modal AI orchestration stack

2025-11-21 14:34:55 +01:00