- Add OpenAI-compatible /v1/models endpoint to list available models (see the endpoint sketch below this list)
- Create a systemd service file for proper service management (a unit-file sketch also follows)
- Service runs as root with automatic restart on failure
- Logs to systemd journal for easy debugging
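For reference, a minimal sketch of what the /v1/models endpoint could look like, assuming FastAPI; the model id and the `app` object are placeholders, not the actual orchestrator names:

```python
# Sketch of an OpenAI-compatible /v1/models endpoint, assuming FastAPI.
# The model id "my-model" and `app` are illustrative, not the real names.
from fastapi import FastAPI

app = FastAPI()

@app.get("/v1/models")
async def list_models():
    # OpenAI-compatible shape: a "list" object wrapping model entries.
    return {
        "object": "list",
        "data": [
            {"id": "my-model", "object": "model", "owned_by": "orchestrator"},
        ],
    }
```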
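And a sketch of the unit file described above; the service name, paths, and command are assumptions, not the committed values:

```ini
# Hypothetical unit file (e.g. /etc/systemd/system/orchestrator.service);
# name, paths, and ExecStart command are assumptions.
[Unit]
Description=Model orchestrator
After=network-online.target
Wants=network-online.target

[Service]
User=root
ExecStart=/usr/bin/python3 /opt/orchestrator/main.py
Restart=on-failure
RestartSec=5
# stdout/stderr go to the systemd journal by default (journalctl -u orchestrator)

[Install]
WantedBy=multi-user.target
```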
The orchestrator was calling response.json(), which buffered the entire
streaming response before returning it. As a result, LiteLLM received
only a single chunk with empty content instead of token-by-token streaming.
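For context, the buggy proxy path looked roughly like this (a sketch assuming FastAPI and httpx; the route, `client`, and `UPSTREAM_URL` are hypothetical names):

```python
# Sketch of the buggy pattern: awaiting the full JSON body collapses the
# upstream SSE stream into one buffered payload. Names are assumptions.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
client = httpx.AsyncClient(timeout=None)
UPSTREAM_URL = "http://localhost:8000/v1/chat/completions"  # assumed vLLM endpoint

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    resp = await client.post(UPSTREAM_URL, content=await request.body())
    # resp.json() reads the entire body, so streaming chunks never
    # reach the caller incrementally.
    return JSONResponse(content=resp.json(), status_code=resp.status_code)
```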
Changes:
- Detect streaming requests by parsing request body for 'stream': true
- Use client.stream() with aiter_bytes() for streaming requests
- Return StreamingResponse with proper SSE headers
- Keep original JSONResponse behavior for non-streaming requests
This fixes streaming across the vLLM → orchestrator → LiteLLM chain.
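A sketch of the fixed handler under the same assumptions (FastAPI + httpx; the route, `client`, and `UPSTREAM_URL` are illustrative, not the actual orchestrator code):

```python
# Sketch of the fix: detect streaming requests, proxy them chunk-by-chunk,
# and keep the buffered path for everything else. Names are assumptions.
import json

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()
client = httpx.AsyncClient(timeout=None)
UPSTREAM_URL = "http://localhost:8000/v1/chat/completions"  # assumed vLLM endpoint

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.body()
    # Detect streaming requests by parsing the body for "stream": true.
    try:
        is_stream = bool(json.loads(body).get("stream", False))
    except (json.JSONDecodeError, AttributeError):
        is_stream = False

    if not is_stream:
        # Non-streaming: keep the original buffered JSONResponse behavior.
        resp = await client.post(UPSTREAM_URL, content=body,
                                 headers={"Content-Type": "application/json"})
        return JSONResponse(content=resp.json(), status_code=resp.status_code)

    async def relay():
        # Forward bytes as they arrive instead of buffering the whole body.
        async with client.stream("POST", UPSTREAM_URL, content=body,
                                 headers={"Content-Type": "application/json"}) as upstream:
            async for chunk in upstream.aiter_bytes():
                yield chunk

    return StreamingResponse(
        relay(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},  # SSE: avoid intermediary caching
    )
```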