Commit Graph

3 Commits

Author SHA1 Message Date
3ed3e68271 feat(ai): add multi-modal orchestration system for text, image, and music generation
Implemented a cost-optimized AI infrastructure running on single RTX 4090 GPU with
automatic model switching based on request type. This enables text, image, and
music generation on the same hardware with sequential loading.

## New Components

**Model Orchestrator** (ai/model-orchestrator/):
- FastAPI service managing model lifecycle
- Automatic model detection and switching based on request type
- OpenAI-compatible API proxy for all models
- Simple YAML configuration for adding new models
- Docker SDK integration for service management
- Endpoints: /v1/chat/completions, /v1/images/generations, /v1/audio/generations

**Text Generation** (ai/vllm/):
- Reorganized existing vLLM server into proper structure
- Qwen 2.5 7B Instruct (14GB VRAM, ~50 tok/sec)
- Docker containerized with CUDA 12.4 support

**Image Generation** (ai/flux/):
- Flux.1 Schnell for fast, high-quality images
- 14GB VRAM, 4-5 sec per image
- OpenAI DALL-E compatible API
- Pre-built image: ghcr.io/matatonic/openedai-images-flux

**Music Generation** (ai/musicgen/):
- Meta's MusicGen Medium (facebook/musicgen-medium)
- Text-to-music generation (11GB VRAM)
- 60-90 seconds for 30s audio clips
- Custom FastAPI wrapper with AudioCraft (see the sketch below)
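
The wrapper mentioned above might look roughly like this. This is a minimal sketch only: the request schema, field names, and WAV serialization are assumptions, not the actual ai/musicgen/ code.

```python
# Hypothetical minimal version of the MusicGen wrapper: AudioCraft's MusicGen
# behind a FastAPI endpoint. The real service in ai/musicgen/ may differ in
# request schema and response format.
import io

import torchaudio
from audiocraft.models import MusicGen
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()
model = MusicGen.get_pretrained("facebook/musicgen-medium")  # ~11GB VRAM

class MusicRequest(BaseModel):
    prompt: str
    duration: int = 30  # seconds of generated audio

@app.post("/v1/audio/generations")
def generate_music(req: MusicRequest):
    model.set_generation_params(duration=req.duration)
    wav = model.generate([req.prompt])[0].cpu()  # (channels, samples) tensor
    buf = io.BytesIO()
    torchaudio.save(buf, wav, model.sample_rate, format="wav")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```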

## Architecture

```
VPS (LiteLLM) → Tailscale VPN → GPU Orchestrator (Port 9000)
                                       ↓
                       ┌───────────────┼───────────────┐
                  vLLM (8001)    Flux (8002)    MusicGen (8003)
                   [Only ONE active at a time - sequential loading]
```
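
The "only ONE active at a time" policy in the diagram could be implemented along these lines with the Docker SDK. This is an illustrative sketch, not the actual orchestrator code: container names and the endpoint-to-backend mapping are assumptions.

```python
# Minimal sketch of single-GPU sequential switching using the Docker SDK.
# Container names and the endpoint mapping below are illustrative; the real
# logic lives in ai/model-orchestrator/.
import docker

ENDPOINT_TO_CONTAINER = {
    "/v1/chat/completions": "vllm",        # text  -> port 8001
    "/v1/images/generations": "flux",      # image -> port 8002
    "/v1/audio/generations": "musicgen",   # music -> port 8003
}

client = docker.from_env()
active: str | None = None

def ensure_model(endpoint: str) -> str:
    """Stop whichever backend is running, then start the one this request needs."""
    global active
    target = ENDPOINT_TO_CONTAINER[endpoint]
    if target != active:
        if active is not None:
            client.containers.get(active).stop()   # free the 24GB of VRAM first
        client.containers.get(target).start()      # then load the new model
        active = target
    return target
```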

## Configuration Files

- docker-compose.gpu.yaml: Main orchestration file for RunPod deployment
- model-orchestrator/models.yaml: Model registry (easy to add new models)
- .env.example: Environment variable template
- README.md: Comprehensive deployment and usage guide

## Updated Files

- litellm-config.yaml: Updated to route through orchestrator (port 9000)
- GPU_DEPLOYMENT_LOG.md: Documented multi-modal architecture

## Features

- Automatic model switching (30-120s latency)
- Cost-optimized single-GPU deployment (~$0.50/hr vs ~$0.75/hr multi-GPU)
- Easy model addition via YAML configuration
- OpenAI-compatible APIs for all model types
- Centralized routing through LiteLLM proxy
- GPU memory safety (only one model loaded at a time)

## Usage

Deploy to RunPod:
```bash
scp -r ai/* gpu-pivoine:/workspace/ai/
ssh gpu-pivoine "cd /workspace/ai && docker compose -f docker-compose.gpu.yaml up -d orchestrator"
```

Test models:
```bash
# Text
curl http://100.100.108.13:9000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen-2.5-7b","messages":[...]}'

# Image
curl http://100.100.108.13:9000/v1/images/generations -H "Content-Type: application/json" -d '{"model":"flux-schnell","prompt":"..."}'

# Music
curl http://100.100.108.13:9000/v1/audio/generations -H "Content-Type: application/json" -d '{"model":"musicgen-medium","prompt":"..."}'
```
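
The same endpoints can also be exercised from Python with the official `openai` client. This is a sketch: the API key value is a placeholder, and whether a key is required depends on the orchestrator/LiteLLM configuration.

```python
# Point the standard OpenAI client at the orchestrator's OpenAI-compatible API.
# The api_key here is a placeholder; use whatever key the proxy expects.
from openai import OpenAI

client = OpenAI(base_url="http://100.100.108.13:9000/v1", api_key="sk-placeholder")

# Text -> vLLM (qwen-2.5-7b)
chat = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(chat.choices[0].message.content)

# Image -> Flux.1 Schnell (allow 30-120s extra if a model switch is triggered)
image = client.images.generate(model="flux-schnell", prompt="a peony in watercolor")
print(image.data[0].url or image.data[0].b64_json)
```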

All models are available via Open WebUI at https://ai.pivoine.art

## Adding New Models

1. Add entry to models.yaml
2. Define Docker service in docker-compose.gpu.yaml
3. Restart orchestrator

That's it! The orchestrator automatically detects and manages the new model.
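
For illustration, a registry entry might look like the following, shown parsed with PyYAML. The field names are guesses at the schema; the authoritative structure is whatever model-orchestrator/models.yaml already uses.

```python
# Hypothetical shape of a models.yaml entry; field names are illustrative only.
import yaml  # PyYAML

EXAMPLE_ENTRY = """
musicgen-medium:
  service: musicgen        # docker-compose service to start for this model
  port: 8003
  vram_gb: 11
  endpoint: /v1/audio/generations
"""

registry = yaml.safe_load(EXAMPLE_ENTRY)
print(registry["musicgen-medium"]["port"])  # -> 8003
```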

## Performance

| Model | VRAM | Startup | Speed |
|-------|------|---------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio |

Model switching overhead: 30-120 seconds
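
At the quoted ~$0.50/hr spot rate, even a worst-case 120s switch costs only about $0.50 × 120/3600 ≈ $0.017 of GPU time.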

## License Notes

- vLLM: Apache 2.0
- Flux.1: Apache 2.0
- AudioCraft: MIT (code), CC-BY-NC (pre-trained weights - non-commercial)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 14:12:13 +01:00
bb3dabcba7 feat(ai): complete GPU deployment with self-hosted Qwen 2.5 7B model
This commit finalizes the GPU infrastructure deployment on RunPod:

- Added qwen-2.5-7b model to LiteLLM configuration
  - Self-hosted on RunPod RTX 4090 GPU server
  - Connected via Tailscale VPN (100.100.108.13:8000)
  - OpenAI-compatible API endpoint
  - Rate limits: 1000 RPM, 100k TPM

- Marked GPU deployment as COMPLETE in deployment log
  - vLLM 0.6.4.post1 with custom AsyncLLMEngine server
  - Qwen/Qwen2.5-7B-Instruct model (14.25 GB)
  - 85% GPU memory utilization, 4096 context length
  - Successfully integrated with Open WebUI at ai.pivoine.art

Infrastructure:
- Provider: RunPod Spot Instance (~$0.50/hr)
- GPU: NVIDIA RTX 4090 24GB
- Disk: 50GB local SSD + 922GB network volume
- VPN: Tailscale (replaces WireGuard due to RunPod UDP restrictions)

Model now visible and accessible in Open WebUI for end users.
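
A quick way to confirm the backend from any Tailscale peer is to hit the vLLM OpenAI-compatible API directly. This is a sketch: it assumes no API key is enforced, and the served model id may differ from the LiteLLM alias qwen-2.5-7b.

```python
# Sanity check against the self-hosted vLLM endpoint over Tailscale.
# Adjust headers if an auth token is configured on the server.
import requests

BASE = "http://100.100.108.13:8000/v1"

models = requests.get(f"{BASE}/models", timeout=10).json()
print([m["id"] for m in models["data"]])  # expect Qwen/Qwen2.5-7B-Instruct

resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```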

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 13:18:17 +01:00
8de88d96ac docs(ai): add comprehensive GPU setup documentation and configs
- Add setup guides (SETUP_GUIDE, TAILSCALE_SETUP, DOCKER_GPU_SETUP, etc.)
- Add deployment configurations (litellm-config-gpu.yaml, gpu-server-compose.yaml)
- Add GPU_DEPLOYMENT_LOG.md with current infrastructure details
- Add GPU_EXPANSION_PLAN.md with complete provider comparison
- Add deploy-gpu-stack.sh automation script

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 12:57:06 +01:00