refactor: clean Docker files and restore standalone model services

- Remove all Docker-related files (Dockerfiles, compose.yaml)
- Remove documentation files (README, ARCHITECTURE, docs/)
- Remove old core/ directory (base_service, service_manager)
- Update models.yaml with correct service_script paths (models/*/server.py)
- Simplify vLLM requirements.txt to let vLLM manage dependencies
- Restore original standalone vLLM server (no base_service dependency)
- Remove obsolete vllm/, musicgen/, flux/ directories

Process-based architecture is now fully functional on RunPod.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 16:17:38 +01:00
parent 9ee626a78e
commit 9a637cc4fc
20 changed files with 228 additions and 3122 deletions


@@ -1,214 +0,0 @@
# RunPod Multi-Modal AI Architecture
**Clean, extensible distributed AI infrastructure spanning VPS and GPU**
## Design Principles
1. **Distributed** - VPS (UI/proxy) + GPU (models) connected via Tailscale
2. **No Docker on GPU** - Direct Python for RunPod compatibility
3. **Extensible** - Adding new models requires minimal code
4. **Maintainable** - Clear structure and separation of concerns
5. **Simple** - One command to start, easy to debug
6. **OpenAI Compatible** - Works with standard AI tools
## Directory Structure
```
runpod/
├── core/ # Core infrastructure
│ ├── base_service.py # Abstract base class for all services
│ ├── service_manager.py # Process lifecycle management
│ └── requirements.txt # Core dependencies
├── model-orchestrator/ # Request orchestration
│ ├── orchestrator.py # Main orchestrator (process-based)
│ ├── models.yaml # Model registry (simple config)
│ └── requirements.txt # Orchestrator dependencies
├── models/ # Model service implementations
│ ├── vllm/ # Text generation
│ │ ├── server.py # vLLM service (inherits base_service)
│ │ └── requirements.txt # vLLM dependencies
│ │
│ ├── flux/ # Image generation
│ │ ├── server.py # Flux service
│ │ └── requirements.txt # Flux dependencies
│ │
│ └── musicgen/ # Music generation
│ ├── server.py # MusicGen service
│ └── requirements.txt # AudioCraft dependencies
├── scripts/ # Deployment & management
│ ├── install.sh # Install all dependencies
│ ├── download-models.sh # Pre-download models
│ ├── start-all.sh # Start orchestrator + services
│ ├── stop-all.sh # Stop all services
│ └── prepare-template.sh # RunPod template preparation
├── systemd/ # Optional systemd services
│ ├── ai-orchestrator.service
│ └── install-services.sh
└── docs/ # Documentation
├── ADDING_MODELS.md # Guide for adding new models
├── DEPLOYMENT.md # Deployment guide
└── RUNPOD_TEMPLATE.md # Template creation guide
```
## Component Responsibilities
### Core (`core/`)
- **base_service.py**: Abstract base class for all model services
  - Health check endpoint
  - Graceful shutdown
  - Logging configuration
  - Common utilities
- **service_manager.py**: Process lifecycle management
  - Start/stop services
  - Health monitoring
  - Auto-restart on failure
  - Resource cleanup
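
A minimal sketch of how these two pieces are driven together, assuming `core/` is importable; `ServiceConfig`, `ServiceManager` and their methods come from `core/service_manager.py`, while the `main()` wrapper and the chosen script path/port are illustrative:

```python
# Illustrative driver for the core classes described above. ServiceConfig,
# ServiceManager, register_service, start_service and monitor_services all
# exist in core/service_manager.py; this __main__ wrapper does not.
import asyncio
from pathlib import Path
from core.service_manager import ServiceConfig, ServiceManager

async def main():
    manager = ServiceManager()
    manager.register_service(ServiceConfig(
        name="vllm",
        script_path=Path("models/vllm/server.py"),
        port=8001,
        startup_timeout=120,
        auto_restart=True,
    ))
    if await manager.start_service("vllm"):          # blocks until /health passes
        print(manager.get_service_status("vllm"))    # {"status": "running", ...}
    await manager.monitor_services()                 # auto-restart loop until shutdown

if __name__ == "__main__":
    asyncio.run(main())
```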
### Orchestrator (`model-orchestrator/`)
- **orchestrator.py**: Routes requests to the appropriate model
  - Reads `models.yaml` configuration
  - Manages model switching
  - Proxies requests to services
  - OpenAI-compatible API
- **models.yaml**: Simple model registry
```yaml
models:
  model-name:
    type: text|image|audio
    service_script: path/to/server.py
    port: 8001
    startup_time: 120
    endpoint: /v1/chat/completions
```
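
A possible way for the orchestrator to load and sanity-check this registry with PyYAML (already listed in `core/requirements.txt`); the `load_registry` helper and its field checks are illustrative, not code from the repository:

```python
# Hypothetical sketch: load models.yaml and verify each entry has the fields
# shown in the example above.
from pathlib import Path
import yaml

REQUIRED_FIELDS = {"type", "service_script", "port", "startup_time", "endpoint"}

def load_registry(path: str = "model-orchestrator/models.yaml") -> dict:
    """Parse the registry and fail fast on incomplete entries."""
    data = yaml.safe_load(Path(path).read_text())
    models = data.get("models", {})
    for name, cfg in models.items():
        missing = REQUIRED_FIELDS - cfg.keys()
        if missing:
            raise ValueError(f"model '{name}' is missing fields: {sorted(missing)}")
    return models

if __name__ == "__main__":
    for name, cfg in load_registry().items():
        print(f"{name}: {cfg['endpoint']} -> port {cfg['port']}")
```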
### Models (`models/`)
Each model directory contains:
- **server.py**: Service implementation (inherits `BaseService`)
- **requirements.txt**: Model-specific dependencies
Services are standalone - can run independently for testing.
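
Because every service inherits the shared `/health` endpoint from `BaseService`, a standalone smoke test can be as small as the following sketch (using `httpx` from the core requirements; port 8001 follows the registry example above):

```python
# Illustrative standalone check: start a service (e.g. `python3 models/vllm/server.py`)
# in one terminal, then probe its health endpoint from another.
import httpx

def check_service(port: int = 8001) -> bool:
    resp = httpx.get(f"http://localhost:{port}/health", timeout=5.0)
    resp.raise_for_status()
    body = resp.json()  # BaseService returns {"status", "service", "port"}
    return body.get("status") == "healthy"

if __name__ == "__main__":
    print("healthy" if check_service() else "unhealthy")
```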
### Scripts (`scripts/`)
- **install.sh**: Install Python packages for all services
- **download-models.sh**: Pre-download models to `/workspace`
- **start-all.sh**: Start orchestrator (which manages model services)
- **stop-all.sh**: Graceful shutdown of all services
- **prepare-template.sh**: RunPod template preparation
## Adding a New Model (3 steps)
### 1. Create Model Service
```python
# models/mymodel/server.py
from core.base_service import BaseService
class MyModelService(BaseService):
    def __init__(self):
        super().__init__(
            name="mymodel",
            port=8004
        )

    async def initialize(self):
        """Load model"""
        self.model = load_my_model()

    def create_app(self):
        """Define FastAPI routes"""
        @self.app.post("/v1/mymodel/generate")
        async def generate(request: MyRequest):
            return self.model.generate(request.prompt)

if __name__ == "__main__":
    service = MyModelService()
    service.run()
```
### 2. Add to Registry
```yaml
# model-orchestrator/models.yaml
models:
  mymodel:
    type: custom
    service_script: models/mymodel/server.py
    port: 8004
    startup_time: 60
    endpoint: /v1/mymodel/generate
```
### 3. Add Dependencies
```
# models/mymodel/requirements.txt
transformers==4.36.0
torch==2.1.0
```
That's it! The orchestrator handles everything else.
## Request Flow
```
Client Request
      ↓
Orchestrator (port 9000)
      ↓ (determines model from endpoint)
Model Service (port 8001-800X)
      ↓
Response
```
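
A rough sketch of the routing step in this diagram, assuming a registry dict loaded from `models.yaml`; `route_request` is an illustrative name, not the actual orchestrator code:

```python
# Minimal sketch of the routing step shown above: pick the model whose registered
# endpoint matches the request path, then forward the JSON body to that model's
# local port with httpx (a core dependency).
import httpx

async def route_request(path: str, payload: dict, registry: dict) -> dict:
    """registry maps model name -> {"endpoint": ..., "port": ...} as in models.yaml."""
    for name, cfg in registry.items():
        if cfg["endpoint"] == path:
            url = f"http://localhost:{cfg['port']}{path}"
            async with httpx.AsyncClient(timeout=300.0) as client:
                resp = await client.post(url, json=payload)
                resp.raise_for_status()
                return resp.json()
    raise ValueError(f"no model registered for endpoint {path}")
```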
## Startup Flow
1. Run `scripts/start-all.sh`
2. Orchestrator starts on port 9000
3. Orchestrator reads `models.yaml`
4. On first request:
   - Orchestrator starts appropriate model service
   - Waits for health check
   - Proxies request
5. On subsequent requests:
   - If same model: direct proxy
   - If different model: stop current, start new (see the sketch below)
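
Steps 4-5 might reduce to something like the following, reusing `ServiceManager.start_service` / `stop_service` from `core/service_manager.py` (which already waits for the health check); the `switch_to` wrapper and its bookkeeping are illustrative:

```python
# Sketch of the lazy-start / switch flow above. start_service and stop_service
# exist in core/service_manager.py; switch_to() and current_model are not part
# of the repository.
from core.service_manager import ServiceManager

current_model: str | None = None

async def switch_to(manager: ServiceManager, model: str) -> bool:
    """Ensure `model` is the single active service, stopping any other one first."""
    global current_model
    if current_model == model:
        return True                                   # same model: just proxy
    if current_model is not None:
        await manager.stop_service(current_model)     # free GPU memory first
    ok = await manager.start_service(model)           # blocks until /health passes
    current_model = model if ok else None
    return ok
```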
## Benefits
- **Simple**: No Docker complexity, just Python
- **Fast**: No container overhead, direct execution
- **Debuggable**: Standard Python processes, easy to inspect
- **Extensible**: Add models by creating one file + YAML entry
- **Maintainable**: Clear structure, base classes, DRY principles
- **Portable**: Works anywhere Python runs (local, RunPod, other cloud)
## Development Workflow
```bash
# Local development
python3 models/vllm/server.py # Test service directly
python3 model-orchestrator/orchestrator.py # Test orchestrator
# RunPod deployment
./scripts/install.sh # Install dependencies
./scripts/download-models.sh # Pre-download models
./scripts/start-all.sh # Start everything
# Create template
./scripts/prepare-template.sh # Prepare for template save
```
## Future Enhancements
- Load balancing across multiple GPUs
- Model pooling (keep multiple models loaded)
- Batch request queueing
- Metrics and monitoring
- Auto-scaling based on demand

README.md

@@ -1,173 +0,0 @@
# RunPod Multi-Modal AI Stack
**Cost-optimized GPU deployment for text, image, and music generation on RunPod RTX 4090.**
This repository contains everything needed to deploy and manage a multi-modal AI infrastructure on RunPod, featuring intelligent model orchestration that automatically switches between models based on request type.
## Features
- **Text Generation**: Qwen 2.5 7B Instruct via vLLM (~50 tokens/sec)
- **Image Generation**: Flux.1 Schnell (~4-5 seconds per image)
- **Music Generation**: MusicGen Medium (30 seconds of audio in 60-90 seconds)
- **Automatic Model Switching**: Intelligent orchestrator manages sequential model loading
- **OpenAI-Compatible APIs**: Works with existing AI tools and clients
- **Easy Model Addition**: Just edit `model-orchestrator/models.yaml` to add new models
- **Template Support**: Create reusable templates for 2-3 minute deployments (vs 60-90 minutes)
## Quick Start
### Option 1: Deploy from Template (Recommended)
If you've already created a RunPod template:
1. Deploy pod from template in RunPod dashboard
2. SSH to the pod
3. Create `.env` file with your credentials
4. Start orchestrator: `docker compose -f compose.yaml up -d orchestrator`
**See**: [RUNPOD_TEMPLATE.md](RUNPOD_TEMPLATE.md) for template usage instructions.
### Option 2: Fresh Deployment
For first-time setup on a new RunPod instance:
1. Copy files to RunPod: `scp -r * gpu-server:/workspace/ai/`
2. SSH to GPU server: `ssh gpu-server`
3. Run preparation script: `cd /workspace/ai && chmod +x scripts/prepare-template.sh && ./scripts/prepare-template.sh`
**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed deployment guide.
## Architecture
```
VPS (LiteLLM Proxy)
↓ Tailscale VPN
GPU Server (Orchestrator Port 9000)
├── vLLM (Qwen 2.5 7B) - Port 8001
├── Flux.1 Schnell - Port 8002
└── MusicGen Medium - Port 8003
```
All requests route through the orchestrator, which automatically loads the appropriate model. Only one model is active at a time for cost optimization (~$0.50/hr vs ~$0.75/hr for multi-GPU).
## Cost Analysis
**RunPod RTX 4090 Spot Instance**:
- **Hourly**: ~$0.50
- **Monthly (24/7)**: ~$360
- **Monthly (8hr/day)**: ~$120
**Template Benefits**:
- **Without Template**: 60-90 minutes setup per Spot restart
- **With Template**: 2-3 minutes deployment time
- **Spot Restart Frequency**: 2-5 times per week (variable)
## Documentation
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Complete deployment and usage guide
- **[docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md)** - Template creation and usage
- **[docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md)** - Deployment history and technical notes
### Architecture Components
- `model-orchestrator/` - FastAPI orchestrator managing model lifecycle
- `vllm/` - Text generation service (Qwen 2.5 7B)
- `flux/` - Image generation service (Flux.1 Schnell)
- `musicgen/` - Music generation service (MusicGen Medium)
- `scripts/` - Automation scripts
## Creating a RunPod Template
**Why create a template?**
- Save 60-90 minutes on every Spot instance restart
- Pre-downloaded models (~37GB cached)
- Pre-built Docker images
- Ready-to-use configuration
**How to create:**
1. Run `scripts/prepare-template.sh` on a fresh RunPod instance
2. Wait 45-60 minutes for models to download and images to build
3. Save pod as template in RunPod dashboard
4. Name: `multi-modal-ai-v1.0`
**See**: [docs/RUNPOD_TEMPLATE.md](docs/RUNPOD_TEMPLATE.md) for step-by-step guide.
## Adding New Models
Adding models is easy! Just edit `model-orchestrator/models.yaml`:
```yaml
models:
  llama-3.1-8b:              # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
```
Then add the Docker service to `compose.yaml` and restart the orchestrator.
**See**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#adding-new-models) for complete instructions.
## Usage Examples
### Text Generation
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
### Image Generation
```bash
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
### Music Generation
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
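
The same endpoints can be driven from Python with the standard OpenAI client, mirroring the examples in [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) (assumes `openai` and `requests` are installed; the API key is unused but required):

```python
# Equivalent calls from Python. Only the music endpoint is non-standard, so it
# uses plain HTTP.
import openai
import requests

client = openai.OpenAI(base_url="http://100.100.108.13:9000/v1", api_key="dummy")

# Text
chat = client.chat.completions.create(
    model="qwen-2.5-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Image
image = client.images.generate(model="flux-schnell", prompt="a cute cat", size="1024x1024")

# Music (custom endpoint)
music = requests.post(
    "http://100.100.108.13:9000/v1/audio/generations",
    json={"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30},
)
```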
## Infrastructure
**Provider**: RunPod (Spot Instance)
**GPU**: NVIDIA RTX 4090 24GB VRAM
**Region**: Europe
**Network**: Tailscale VPN (100.100.108.13)
**Storage**: 922TB network volume at `/workspace`
## Monitoring
```bash
# Check active model
curl http://100.100.108.13:9000/health
# View orchestrator logs
docker logs -f ai_orchestrator
# GPU usage
nvidia-smi
```
## Support
For issues:
1. Check orchestrator logs: `docker logs ai_orchestrator`
2. Review [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md#troubleshooting)
3. Check [docs/GPU_DEPLOYMENT_LOG.md](docs/GPU_DEPLOYMENT_LOG.md) for deployment history
## License
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC).


@@ -1,104 +0,0 @@
version: '3.8'
# Multi-Modal AI Orchestration for RunPod RTX 4090
# Manages text, image, and music generation with sequential model loading
services:
# ============================================================================
# ORCHESTRATOR (Always Running)
# ============================================================================
orchestrator:
build: ./model-orchestrator
container_name: ai_orchestrator
ports:
- "9000:9000"
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./model-orchestrator/models.yaml:/app/models.yaml:ro
environment:
- MODELS_CONFIG=/app/models.yaml
- COMPOSE_PROJECT_NAME=ai
- GPU_MEMORY_GB=24
restart: unless-stopped
network_mode: host
# ============================================================================
# TEXT GENERATION (vLLM + Qwen 2.5 7B)
# ============================================================================
vllm-qwen:
build: ./vllm
container_name: ai_vllm-qwen_1
ports:
- "8001:8000"
volumes:
- /workspace/huggingface_cache:/workspace/huggingface_cache
environment:
- HF_TOKEN=${HF_TOKEN}
- VLLM_HOST=0.0.0.0
- VLLM_PORT=8000
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles: ["text"] # Only start when requested by orchestrator
restart: "no" # Orchestrator manages lifecycle
# ============================================================================
# IMAGE GENERATION (Flux.1 Schnell)
# ============================================================================
flux:
image: ghcr.io/matatonic/openedai-images-flux:latest
container_name: ai_flux_1
ports:
- "8002:5005"
volumes:
- /workspace/flux/models:/app/models
- ./flux/config:/app/config:ro
environment:
- HF_TOKEN=${HF_TOKEN}
- CONFIG_PATH=/app/config/config.json
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles: ["image"] # Only start when requested by orchestrator
restart: "no" # Orchestrator manages lifecycle
# ============================================================================
# MUSIC GENERATION (MusicGen Medium)
# ============================================================================
musicgen:
build: ./musicgen
container_name: ai_musicgen_1
ports:
- "8003:8000"
volumes:
- /workspace/musicgen/models:/app/models
environment:
- HF_TOKEN=${HF_TOKEN}
- MODEL_NAME=facebook/musicgen-medium
- HOST=0.0.0.0
- PORT=8000
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
profiles: ["audio"] # Only start when requested by orchestrator
restart: "no" # Orchestrator manages lifecycle
# ============================================================================
# VOLUMES
# ============================================================================
# Model caches are stored on RunPod's /workspace directory (922TB network volume)
# This persists across pod restarts and reduces model download times
# No named volumes - using host paths on RunPod /workspace


@@ -1,166 +0,0 @@
#!/usr/bin/env python3
"""
Base Service Class for AI Model Services
Provides common functionality for all model services:
- Health check endpoint
- Graceful shutdown handling
- Logging configuration
- Standard FastAPI setup
"""
import asyncio
import logging
import os
import signal
import sys
from abc import ABC, abstractmethod
from typing import Optional
from fastapi import FastAPI
import uvicorn
class BaseService(ABC):
"""Abstract base class for all AI model services"""
def __init__(self, name: str, port: int, host: str = "0.0.0.0"):
"""
Initialize base service
Args:
name: Service name (for logging)
port: Port to run service on
host: Host to bind to (default: 0.0.0.0)
"""
self.name = name
self.port = port
self.host = host
self.app = FastAPI(title=f"{name} Service", version="1.0.0")
self.logger = self._setup_logging()
self.shutdown_event = asyncio.Event()
# Register standard endpoints
self._register_health_endpoint()
# Register signal handlers for graceful shutdown
self._register_signal_handlers()
# Allow subclasses to add custom routes
self.create_app()
def _setup_logging(self) -> logging.Logger:
"""Configure logging for the service"""
logging.basicConfig(
level=logging.INFO,
format=f'%(asctime)s - {self.name} - %(levelname)s - %(message)s',
handlers=[
logging.StreamHandler(sys.stdout)
]
)
return logging.getLogger(self.name)
def _register_health_endpoint(self):
"""Register standard health check endpoint"""
@self.app.get("/health")
async def health_check():
"""Health check endpoint"""
return {
"status": "healthy",
"service": self.name,
"port": self.port
}
def _register_signal_handlers(self):
"""Register signal handlers for graceful shutdown"""
def signal_handler(sig, frame):
self.logger.info(f"Received signal {sig}, initiating graceful shutdown...")
self.shutdown_event.set()
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
@abstractmethod
def create_app(self):
"""
Create FastAPI routes for this service.
Subclasses must implement this to add their specific endpoints.
Example:
@self.app.post("/v1/generate")
async def generate(request: MyRequest):
return await self.model.generate(request)
"""
pass
async def initialize(self):
"""
Initialize the service (load models, etc.).
Subclasses can override this for custom initialization.
"""
self.logger.info(f"Initializing {self.name} service...")
async def cleanup(self):
"""
Cleanup resources on shutdown.
Subclasses can override this for custom cleanup.
"""
self.logger.info(f"Cleaning up {self.name} service...")
def run(self):
"""
Run the service.
This is the main entry point that starts the FastAPI server.
"""
try:
self.logger.info(f"Starting {self.name} service on {self.host}:{self.port}")
# Run initialization
asyncio.run(self.initialize())
# Start uvicorn server
config = uvicorn.Config(
app=self.app,
host=self.host,
port=self.port,
log_level="info",
access_log=True
)
server = uvicorn.Server(config)
# Run server
asyncio.run(server.serve())
except KeyboardInterrupt:
self.logger.info("Keyboard interrupt received")
except Exception as e:
self.logger.error(f"Error running service: {e}", exc_info=True)
sys.exit(1)
finally:
# Cleanup
asyncio.run(self.cleanup())
self.logger.info(f"{self.name} service stopped")
class GPUService(BaseService):
"""
Base class for GPU-accelerated services.
Provides additional GPU-specific functionality.
"""
def __init__(self, name: str, port: int, host: str = "0.0.0.0"):
super().__init__(name, port, host)
self._check_gpu_availability()
def _check_gpu_availability(self):
"""Check if GPU is available"""
try:
import torch
if torch.cuda.is_available():
gpu_count = torch.cuda.device_count()
gpu_name = torch.cuda.get_device_name(0)
self.logger.info(f"GPU available: {gpu_name} (count: {gpu_count})")
else:
self.logger.warning("No GPU available - service may run slowly")
except ImportError:
self.logger.warning("PyTorch not installed - cannot check GPU availability")


@@ -1,15 +0,0 @@
# Core dependencies for AI service infrastructure
# FastAPI and server
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
# HTTP client for health checks and proxying
httpx==0.25.1
# YAML configuration
pyyaml==6.0.1
# Process management
psutil==5.9.6


@@ -1,301 +0,0 @@
#!/usr/bin/env python3
"""
Service Manager for AI Model Services
Manages lifecycle of model services running as Python processes:
- Start/stop services
- Health monitoring
- Auto-restart on failure
- Resource cleanup
"""
import asyncio
import logging
import os
import signal
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional
import httpx
@dataclass
class ServiceConfig:
"""Configuration for a service"""
name: str
script_path: Path
port: int
startup_timeout: int = 120
health_check_path: str = "/health"
auto_restart: bool = False
env: Optional[Dict[str, str]] = None
class ServiceManager:
"""Manages multiple AI model services as subprocesses"""
def __init__(self):
self.logger = logging.getLogger("ServiceManager")
self.processes: Dict[str, subprocess.Popen] = {}
self.configs: Dict[str, ServiceConfig] = {}
self.shutdown_event = asyncio.Event()
def register_service(self, config: ServiceConfig):
"""Register a service configuration"""
self.configs[config.name] = config
self.logger.info(f"Registered service: {config.name} on port {config.port}")
async def start_service(self, name: str) -> bool:
"""
Start a service by name
Args:
name: Service name to start
Returns:
bool: True if service started successfully
"""
if name not in self.configs:
self.logger.error(f"Service {name} not registered")
return False
if name in self.processes:
proc = self.processes[name]
if proc.poll() is None:
self.logger.info(f"Service {name} already running")
return True
config = self.configs[name]
self.logger.info(f"Starting service {name}...")
try:
# Prepare environment
env = os.environ.copy()
if config.env:
env.update(config.env)
env.update({
'PORT': str(config.port),
'HOST': '0.0.0.0'
})
# Start process
proc = subprocess.Popen(
['python3', str(config.script_path)],
env=env,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
preexec_fn=os.setsid # Create new process group
)
self.processes[name] = proc
self.logger.info(f"Process started for {name} (PID: {proc.pid})")
# Wait for health check
if await self._wait_for_health(name, config):
self.logger.info(f"Service {name} is healthy and ready")
return True
else:
self.logger.error(f"Service {name} failed health check")
await self.stop_service(name)
return False
except Exception as e:
self.logger.error(f"Error starting {name}: {e}", exc_info=True)
return False
async def _wait_for_health(self, name: str, config: ServiceConfig) -> bool:
"""
Wait for service to become healthy
Args:
name: Service name
config: Service configuration
Returns:
bool: True if service becomes healthy within timeout
"""
proc = self.processes.get(name)
if not proc:
return False
start_time = time.time()
url = f"http://localhost:{config.port}{config.health_check_path}"
while time.time() - start_time < config.startup_timeout:
# Check if process is still running
if proc.poll() is not None:
self.logger.error(f"Process for {name} exited prematurely (code: {proc.returncode})")
return False
# Try health check
try:
async with httpx.AsyncClient() as client:
response = await client.get(url, timeout=5.0)
if response.status_code == 200:
return True
except Exception:
pass
await asyncio.sleep(2)
return False
async def stop_service(self, name: str, timeout: int = 10):
"""
Stop a running service
Args:
name: Service name
timeout: Seconds to wait for graceful shutdown
"""
if name not in self.processes:
self.logger.warning(f"Service {name} not in process registry")
return
proc = self.processes[name]
if proc.poll() is None: # Still running
self.logger.info(f"Stopping service {name}...")
try:
# Send SIGTERM to process group
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
# Wait for graceful shutdown
try:
proc.wait(timeout=timeout)
self.logger.info(f"Service {name} stopped gracefully")
except subprocess.TimeoutExpired:
# Force kill if not terminated
self.logger.warning(f"Service {name} did not stop gracefully, forcing kill")
os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
proc.wait()
except Exception as e:
self.logger.error(f"Error stopping {name}: {e}", exc_info=True)
del self.processes[name]
async def restart_service(self, name: str) -> bool:
"""
Restart a service
Args:
name: Service name
Returns:
bool: True if service restarted successfully
"""
self.logger.info(f"Restarting service {name}...")
await self.stop_service(name)
await asyncio.sleep(2) # Brief pause between stop and start
return await self.start_service(name)
async def check_health(self, name: str) -> bool:
"""
Check if a service is healthy
Args:
name: Service name
Returns:
bool: True if service is running and healthy
"""
if name not in self.processes:
return False
proc = self.processes[name]
if proc.poll() is not None:
return False
config = self.configs[name]
url = f"http://localhost:{config.port}{config.health_check_path}"
try:
async with httpx.AsyncClient() as client:
response = await client.get(url, timeout=5.0)
return response.status_code == 200
except Exception:
return False
async def monitor_services(self):
"""
Monitor all services and auto-restart if configured
This runs continuously until shutdown_event is set.
"""
self.logger.info("Starting service monitor...")
while not self.shutdown_event.is_set():
for name, config in self.configs.items():
if not config.auto_restart:
continue
# Check if process exists and is healthy
if name in self.processes:
proc = self.processes[name]
if proc.poll() is not None:
self.logger.warning(f"Service {name} died (code: {proc.returncode}), restarting...")
await self.restart_service(name)
elif not await self.check_health(name):
self.logger.warning(f"Service {name} unhealthy, restarting...")
await self.restart_service(name)
# Wait before next check
try:
await asyncio.wait_for(self.shutdown_event.wait(), timeout=10.0)
except asyncio.TimeoutError:
pass
self.logger.info("Service monitor stopped")
async def stop_all_services(self):
"""Stop all running services"""
self.logger.info("Stopping all services...")
for name in list(self.processes.keys()):
await self.stop_service(name)
self.logger.info("All services stopped")
def get_service_status(self, name: str) -> Dict:
"""
Get status information for a service
Args:
name: Service name
Returns:
dict: Status information
"""
if name not in self.configs:
return {"status": "unknown", "error": "Service not registered"}
if name not in self.processes:
return {"status": "stopped"}
proc = self.processes[name]
if proc.poll() is not None:
return {
"status": "exited",
"exit_code": proc.returncode
}
config = self.configs[name]
return {
"status": "running",
"pid": proc.pid,
"port": config.port
}
def get_all_service_status(self) -> Dict:
"""
Get status for all registered services
Returns:
dict: Service name -> status mapping
"""
return {
name: self.get_service_status(name)
for name in self.configs.keys()
}


@@ -1,467 +0,0 @@
# Multi-Modal AI Orchestration System
**Cost-optimized AI infrastructure running text, image, and music generation on a single RunPod RTX 4090 GPU.**
## Architecture Overview
This system provides a unified API for multiple AI model types with automatic model switching on a single GPU (24GB VRAM). All requests route through an intelligent orchestrator that manages model lifecycle.
### Components
```
┌─────────────────────────────────────────────────────────────────┐
│ VPS (Tailscale: 100.102.217.79) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy (Port 4000) │ │
│ │ Routes to: Claude API + GPU Orchestrator │ │
│ └────────────────────┬──────────────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────────────┘
│ Tailscale VPN
┌───────────────────────┼─────────────────────────────────────────┐
│ RunPod GPU Server (Tailscale: 100.100.108.13) │
│ ┌────────────────────▼──────────────────────────────────────┐ │
│ │ Orchestrator (Port 9000) │ │
│ │ Manages sequential model loading based on request type │ │
│ └─────┬──────────────┬──────────────────┬──────────────────┘ │
│ │ │ │ │
│ ┌─────▼──────┐ ┌────▼────────┐ ┌──────▼───────┐ │
│ │vLLM │ │Flux.1 │ │MusicGen │ │
│ │Qwen 2.5 7B │ │Schnell │ │Medium │ │
│ │Port: 8001 │ │Port: 8002 │ │Port: 8003 │ │
│ │VRAM: 14GB │ │VRAM: 14GB │ │VRAM: 11GB │ │
│ └────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ Only ONE model active at a time (sequential loading) │
└─────────────────────────────────────────────────────────────────┘
```
### Features
- **Automatic Model Switching** - Orchestrator detects request type and loads appropriate model
- **OpenAI-Compatible APIs** - Works with existing OpenAI clients and tools
- **Cost-Optimized** - Sequential loading on single GPU (~$0.50/hr vs ~$0.75/hr for multi-GPU)
- **Easy Model Addition** - Add new models by editing YAML config
- **Centralized Routing** - LiteLLM proxy provides unified API for all models
- **GPU Memory Safe** - Orchestrator ensures only one model loaded at a time
## Supported Model Types
### Text Generation
- **Qwen 2.5 7B Instruct** (Qwen/Qwen2.5-7B-Instruct)
- VRAM: 14GB | Speed: Fast | OpenAI-compatible chat API
### Image Generation
- **Flux.1 Schnell** (black-forest-labs/FLUX.1-schnell)
- VRAM: 14GB | Speed: 4-5 sec/image | OpenAI DALL-E compatible API
### Music Generation
- **MusicGen Medium** (facebook/musicgen-medium)
- VRAM: 11GB | Speed: 60-90 sec for 30s audio | Custom audio API
## Quick Start
### 1. Prerequisites
```bash
# On RunPod GPU server
- RunPod RTX 4090 instance (24GB VRAM)
- Docker & Docker Compose installed
- Tailscale VPN configured
- HuggingFace token (for model downloads)
```
### 2. Clone & Configure
```bash
# On local machine
cd ai/
# Create environment file
cp .env.example .env
# Edit .env and add your HF_TOKEN
```
### 3. Deploy to RunPod
```bash
# Copy all files to RunPod GPU server
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
# Navigate to project
cd /workspace/ai/
# Start orchestrator (always running)
docker compose -f compose.yaml up -d orchestrator
# Orchestrator will automatically manage model services as needed
```
### 4. Test Deployment
```bash
# Check orchestrator health
curl http://100.100.108.13:9000/health
# Test text generation (auto-loads vLLM)
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Test image generation (auto-switches to Flux)
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "flux-schnell",
"prompt": "a cute cat",
"size": "1024x1024"
}'
# Test music generation (auto-switches to MusicGen)
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{
"model": "musicgen-medium",
"prompt": "upbeat electronic dance music",
"duration": 30
}'
```
### 5. Update VPS LiteLLM
```bash
# On VPS, restart LiteLLM to pick up new config
ssh vps
cd ~/Projects/docker-compose
arty restart litellm
```
## Usage Examples
### Via Open WebUI (https://ai.pivoine.art)
**Text Generation:**
1. Select model: `qwen-2.5-7b`
2. Type message and send
3. Orchestrator loads vLLM automatically
**Image Generation:**
1. Select model: `flux-schnell`
2. Enter image prompt
3. Orchestrator switches to Flux.1
**Music Generation:**
1. Select model: `musicgen-medium`
2. Describe the music you want
3. Orchestrator switches to MusicGen
### Via API (Direct)
```python
import openai
# Configure client to use orchestrator
client = openai.OpenAI(
base_url="http://100.100.108.13:9000/v1",
api_key="dummy" # Not used but required
)
# Text generation
response = client.chat.completions.create(
model="qwen-2.5-7b",
messages=[{"role": "user", "content": "Write a haiku"}]
)
# Image generation
image = client.images.generate(
model="flux-schnell",
prompt="a sunset over mountains",
size="1024x1024"
)
# Music generation (custom endpoint)
import requests
music = requests.post(
"http://100.100.108.13:9000/v1/audio/generations",
json={
"model": "musicgen-medium",
"prompt": "calm piano music",
"duration": 30
}
)
```
## Adding New Models
### Step 1: Update `models.yaml`
```yaml
# Add to ai/model-orchestrator/models.yaml
models:
  llama-3.1-8b:              # New model
    type: text
    framework: vllm
    docker_service: vllm-llama
    port: 8004
    vram_gb: 17
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
    description: "Llama 3.1 8B Instruct - Meta's latest model"
```
### Step 2: Add Docker Service
```yaml
# Add to ai/compose.yaml
services:
  vllm-llama:
    build: ./vllm
    container_name: ai_vllm-llama_1
    command: >
      vllm serve meta-llama/Llama-3.1-8B-Instruct
      --port 8000 --dtype bfloat16
    ports:
      - "8004:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles: ["text"]
    restart: "no"
```
### Step 3: Restart Orchestrator
```bash
ssh gpu-pivoine
cd /workspace/ai/
docker compose -f compose.yaml restart orchestrator
```
**That's it!** The orchestrator automatically detects the new model.
## Management Commands
### Orchestrator
```bash
# Start orchestrator
docker compose -f compose.yaml up -d orchestrator
# View orchestrator logs
docker logs -f ai_orchestrator
# Restart orchestrator
docker compose -f compose.yaml restart orchestrator
# Check active model
curl http://100.100.108.13:9000/health
# List all models
curl http://100.100.108.13:9000/models
```
### Manual Model Control
```bash
# Manually switch to specific model
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
# Check which model is running
curl http://100.100.108.13:9000/health | jq '.current_model'
```
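
The same switch can be scripted in Python; whether `/switch` blocks until the new model is loaded is not documented here, so this sketch polls `/health` (the `current_model` field follows the jq example above; `requests` is assumed to be installed):

```python
# Rough Python equivalent of the curl commands above.
import time
import requests

BASE = "http://100.100.108.13:9000"

def switch_model(model: str, timeout_s: int = 180) -> str:
    requests.post(f"{BASE}/switch", json={"model": model}, timeout=30).raise_for_status()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        current = requests.get(f"{BASE}/health", timeout=10).json().get("current_model")
        if current == model:
            return current
        time.sleep(5)                      # model switching takes 30-120 seconds
    raise TimeoutError(f"{model} did not become active within {timeout_s}s")

print(switch_model("flux-schnell"))
```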
### Model Services
```bash
# Manually start a specific model (bypassing orchestrator)
docker compose -f compose.yaml --profile text up -d vllm-qwen
# Stop a model
docker compose -f compose.yaml stop vllm-qwen
# View model logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Monitoring
### GPU Usage
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Model Status
```bash
# Which model is active?
curl http://100.100.108.13:9000/health
# Model memory usage
curl http://100.100.108.13:9000/health | jq '{current: .current_model, vram: .model_info.vram_gb}'
```
### Performance
```bash
# Orchestrator logs (model switching)
docker logs -f ai_orchestrator
# Model-specific logs
docker logs -f ai_vllm-qwen_1
docker logs -f ai_flux_1
docker logs -f ai_musicgen_1
```
## Troubleshooting
### Model Won't Load
```bash
# Check orchestrator logs
docker logs ai_orchestrator
# Check if model service exists
docker compose -f compose.yaml config | grep -A 10 "vllm-qwen"
# Manually test model service
docker compose -f compose.yaml --profile text up -d vllm-qwen
curl http://localhost:8001/health
```
### Orchestrator Can't Connect
```bash
# Check Docker socket permissions
ls -l /var/run/docker.sock
# Restart Docker daemon
sudo systemctl restart docker
# Rebuild orchestrator
docker compose -f compose.yaml build orchestrator
docker compose -f compose.yaml up -d orchestrator
```
### Model Switching Too Slow
```bash
# Check model startup times in models.yaml
# Adjust startup_time_seconds if needed
# Pre-download models to /workspace cache
docker run --rm -it --gpus all \
-v /workspace/huggingface_cache:/cache \
-e HF_HOME=/cache \
nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
huggingface-cli download facebook/musicgen-medium
```
## File Structure
```
ai/
├── compose.yaml # Main orchestration file
├── .env.example # Environment template
├── README.md # This file
├── model-orchestrator/ # Central orchestrator service
│ ├── orchestrator.py # FastAPI app managing models
│ ├── models.yaml # Model registry (EDIT TO ADD MODELS)
│ ├── Dockerfile
│ └── requirements.txt
├── vllm/ # Text generation (vLLM)
│ ├── server.py # Qwen 2.5 7B server
│ ├── Dockerfile
│ └── requirements.txt
├── flux/ # Image generation (Flux.1 Schnell)
│ └── config/
│ └── config.json # Flux configuration
├── musicgen/ # Music generation (MusicGen)
│ ├── server.py # MusicGen API server
│ ├── Dockerfile
│ └── requirements.txt
├── litellm-config.yaml # LiteLLM proxy configuration
└── GPU_DEPLOYMENT_LOG.md # Deployment history and notes
```
## Cost Analysis
### Current Setup (Single GPU)
- **Provider**: RunPod Spot Instance
- **GPU**: RTX 4090 24GB
- **Cost**: ~$0.50/hour
- **Monthly**: ~$360 (if running 24/7)
- **Optimized**: ~$120 (8 hours/day during business hours)
### Alternative: Multi-GPU (All Models Always On)
- **GPUs**: 2× RTX 4090
- **Cost**: ~$0.75/hour
- **Monthly**: ~$540 (if running 24/7)
- **Trade-off**: No switching latency, +$180/month
### Recommendation
Stick with single GPU sequential loading for cost optimization. Model switching (30-120 seconds) is acceptable for most use cases.
## Performance Expectations
| Model | VRAM | Startup Time | Generation Speed |
|-------|------|--------------|------------------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tokens/sec |
| Flux.1 Schnell | 14GB | 60s | ~4-5 sec/image |
| MusicGen Medium | 11GB | 45s | ~60-90 sec for 30s audio |
**Model Switching**: 30-120 seconds (unload current + load new)
## Security Notes
- Orchestrator requires Docker socket access (`/var/run/docker.sock`)
- All services run on private Tailscale network
- No public exposure (only via VPS LiteLLM proxy)
- HuggingFace token stored in `.env` (not committed to git)
## Future Enhancements
1. ⏹️ Add Llama 3.1 8B for alternative text generation
2. ⏹️ Add Whisper Large v3 for speech-to-text
3. ⏹️ Add XTTS v2 for text-to-speech
4. ⏹️ Implement model preloading/caching for faster switching
5. ⏹️ Add usage metrics and cost tracking
6. ⏹️ Auto-stop GPU pod during idle periods
## Support
For issues or questions:
- Check orchestrator logs: `docker logs ai_orchestrator`
- View model-specific logs: `docker logs ai_<service>_1`
- Test direct model access: `curl http://localhost:<port>/health`
- Review GPU deployment log: `GPU_DEPLOYMENT_LOG.md`
## License
Built with:
- [vLLM](https://github.com/vllm-project/vllm) - Apache 2.0
- [AudioCraft](https://github.com/facebookresearch/audiocraft) - MIT (code), CC-BY-NC (weights)
- [Flux.1](https://github.com/black-forest-labs/flux) - Apache 2.0
- [LiteLLM](https://github.com/BerriAI/litellm) - MIT
**Note**: MusicGen pre-trained weights are non-commercial (CC-BY-NC). Train your own models for commercial use with the MIT-licensed code.


@@ -1,421 +0,0 @@
# GPU Server Deployment Log
## Current Deployment (2025-11-21)
### Infrastructure
- **Provider**: RunPod (Spot Instance)
- **GPU**: NVIDIA RTX 4090 24GB
- **Disk**: 50GB local SSD (expanded from 20GB)
- **Network Volume**: 922TB at `/workspace`
- **Region**: Europe
- **Cost**: ~$0.50/hour (~$360/month if running 24/7)
### Network Configuration
- **VPN**: Tailscale (replaces WireGuard due to RunPod UDP restrictions)
- **GPU Server Tailscale IP**: 100.100.108.13
- **VPS Tailscale IP**: (get with `tailscale ip -4` on VPS)
### SSH Access
```
Host gpu-pivoine
    HostName 213.173.102.232
    Port 29695
    User root
    IdentityFile ~/.ssh/id_ed25519
```
**Note**: RunPod Spot instances can be terminated and restarted with new ports/IPs. Update SSH config accordingly.
### Software Stack
- **Python**: 3.11.10
- **vLLM**: 0.6.4.post1 (installed with pip)
- **PyTorch**: 2.5.1 with CUDA 12.4
- **Tailscale**: Installed via official script
### vLLM Deployment
**Custom Server**: `ai/simple_vllm_server.py`
- Uses `AsyncLLMEngine` directly to bypass multiprocessing issues
- OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/completions` - Text completion
- `POST /v1/chat/completions` - Chat completion
- Default model: Qwen/Qwen2.5-7B-Instruct
- Cache directory: `/workspace/huggingface_cache`
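
A condensed sketch of this approach (not the actual `simple_vllm_server.py`): build an `AsyncLLMEngine` once and serve it through FastAPI. The engine settings mirror the configuration documented below; the completion handler is simplified and omits streaming, chat templating and error handling:

```python
# Minimal AsyncLLMEngine-based server sketch (vLLM 0.6.x-style API).
import os, uuid
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

MODEL = "Qwen/Qwen2.5-7B-Instruct"
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model=MODEL,
    gpu_memory_utilization=0.85,                    # 85% of the RTX 4090
    max_model_len=4096,                             # 4096-token context
    download_dir="/workspace/huggingface_cache",
))
app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
    final = None
    async for output in engine.generate(req.prompt, params, request_id=str(uuid.uuid4())):
        final = output                              # keep the last (finished) RequestOutput
    return {"model": MODEL, "choices": [{"text": final.outputs[0].text}]}

@app.get("/v1/models")
async def models():
    return {"object": "list", "data": [{"id": MODEL, "object": "model"}]}

if __name__ == "__main__":
    uvicorn.run(app, host=os.getenv("VLLM_HOST", "0.0.0.0"),
                port=int(os.getenv("VLLM_PORT", "8000")))
```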
**Deployment Command**:
```bash
# Copy server script to GPU server
scp ai/simple_vllm_server.py gpu-pivoine:/workspace/
# Start server
ssh gpu-pivoine "cd /workspace && nohup python3 simple_vllm_server.py > vllm.log 2>&1 &"
# Check status
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Server Configuration** (environment variables):
- `VLLM_HOST`: 0.0.0.0 (default)
- `VLLM_PORT`: 8000 (default)
### Model Configuration
- **Model**: Qwen/Qwen2.5-7B-Instruct (no auth required)
- **Context Length**: 4096 tokens
- **GPU Memory**: 85% utilization
- **Tensor Parallel**: 1 (single GPU)
### Known Issues & Solutions
#### Issue 1: vLLM Multiprocessing Errors
**Problem**: Default vLLM v1 engine fails with ZMQ/CUDA multiprocessing errors on RunPod.
**Solution**: Custom `AsyncLLMEngine` FastAPI server bypasses multiprocessing layer entirely.
#### Issue 2: Disk Space (Solved)
**Problem**: Original 20GB disk filled up with Hugging Face cache.
**Solution**: Expanded to 50GB and use `/workspace` for model cache.
#### Issue 3: Gated Models
**Problem**: Llama models require Hugging Face authentication.
**Solution**: Use Qwen 2.5 7B Instruct (no auth required) or set `HF_TOKEN` environment variable.
#### Issue 4: Spot Instance Volatility
**Problem**: RunPod Spot instances can be terminated anytime.
**Solution**: Accept as trade-off for cost savings. Document SSH details for quick reconnection.
### Monitoring
**Check vLLM logs**:
```bash
ssh gpu-pivoine "tail -f /workspace/vllm.log"
```
**Check GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
**Check Tailscale status**:
```bash
ssh gpu-pivoine "tailscale status"
```
**Test API locally (on GPU server)**:
```bash
ssh gpu-pivoine "curl http://localhost:8000/v1/models"
```
**Test API via Tailscale (from VPS)**:
```bash
curl http://100.100.108.13:8000/v1/models
```
### LiteLLM Integration
Update VPS LiteLLM config at `ai/litellm-config-gpu.yaml`:
```yaml
# Replace old WireGuard IP (10.8.0.2) with Tailscale IP
- model_name: qwen-2.5-7b
  litellm_params:
    model: openai/qwen-2.5-7b
    api_base: http://100.100.108.13:8000/v1  # Tailscale IP
    api_key: dummy
    rpm: 1000
    tpm: 100000
```
Restart LiteLLM:
```bash
arty restart litellm
```
### Troubleshooting
**Server not responding**:
1. Check if process is running: `pgrep -f simple_vllm_server`
2. Check logs: `tail -100 /workspace/vllm.log`
3. Check GPU availability: `nvidia-smi`
4. Restart server: `pkill -f simple_vllm_server && python3 /workspace/simple_vllm_server.py &`
**Tailscale not connected**:
1. Check status: `tailscale status`
2. Check daemon: `ps aux | grep tailscaled`
3. Restart: `tailscale down && tailscale up`
**Model download failing**:
1. Check disk space: `df -h`
2. Check cache directory: `ls -lah /workspace/huggingface_cache`
3. Clear cache if needed: `rm -rf /workspace/huggingface_cache/*`
### Deployment Status ✅ COMPLETE
**Deployment Date**: 2025-11-21
1. ✅ Deploy vLLM with Qwen 2.5 7B - COMPLETE
2. ✅ Test API endpoints locally and via Tailscale - COMPLETE
3. ✅ Update VPS LiteLLM configuration - COMPLETE
4. ✅ Test end-to-end: Open WebUI → LiteLLM → vLLM - COMPLETE
5. ⏳ Monitor performance and costs - ONGOING
**Model Available**: `qwen-2.5-7b` visible in Open WebUI at https://ai.pivoine.art
### Next Steps (2025-11-21 Original)
6. ✅ Consider adding more models → COMPLETE (added Flux.1 Schnell + MusicGen Medium)
7. ⏹️ Set up auto-stop for idle periods to save costs
---
## Multi-Modal Architecture (2025-11-21 Update)
### Overview
Expanded GPU deployment to support **text, image, and music generation** with intelligent model orchestration. All models run sequentially on a single RTX 4090 GPU with automatic switching based on request type.
### Architecture Components
#### 1. **Orchestrator Service** (Port 9000 - Always Running)
- **Location**: `ai/model-orchestrator/`
- **Purpose**: Central service managing model lifecycle
- **Features**:
- Detects request type (text/image/audio)
- Automatically unloads current model
- Loads requested model
- Proxies requests to active model
- Tracks GPU memory usage
- **Technology**: FastAPI + Docker SDK Python
- **Endpoints**:
- `POST /v1/chat/completions` → Routes to text models
- `POST /v1/images/generations` → Routes to image models
- `POST /v1/audio/generations` → Routes to music models
- `GET /health` → Shows active model and status
- `GET /models` → Lists all available models
- `POST /switch` → Manually switch models
#### 2. **Text Generation** (vLLM + Qwen 2.5 7B)
- **Service**: `vllm-qwen` (Port 8001)
- **Location**: `ai/vllm/`
- **Model**: Qwen/Qwen2.5-7B-Instruct
- **VRAM**: 14GB (85% GPU utilization)
- **Speed**: ~50 tokens/second
- **Startup**: 120 seconds
- **Status**: ✅ Working (same as original deployment)
#### 3. **Image Generation** (Flux.1 Schnell)
- **Service**: `flux` (Port 8002)
- **Location**: `ai/flux/`
- **Model**: black-forest-labs/FLUX.1-schnell
- **VRAM**: 14GB with CPU offloading
- **Speed**: 4-5 seconds per image
- **Startup**: 60 seconds
- **Features**: OpenAI DALL-E compatible API
- **Image**: `ghcr.io/matatonic/openedai-images-flux:latest`
#### 4. **Music Generation** (MusicGen Medium)
- **Service**: `musicgen` (Port 8003)
- **Location**: `ai/musicgen/`
- **Model**: facebook/musicgen-medium
- **VRAM**: 11GB
- **Speed**: 60-90 seconds for 30 seconds of audio
- **Startup**: 45 seconds
- **Features**: Text-to-music generation with sampling controls
- **Technology**: Meta's AudioCraft + custom FastAPI wrapper
### Model Registry (`models.yaml`)
Simple configuration file for managing all models:
```yaml
models:
  qwen-2.5-7b:
    type: text
    framework: vllm
    docker_service: vllm-qwen
    port: 8001
    vram_gb: 14
    startup_time_seconds: 120
    endpoint: /v1/chat/completions
  flux-schnell:
    type: image
    framework: openedai-images
    docker_service: flux
    port: 8002
    vram_gb: 14
    startup_time_seconds: 60
    endpoint: /v1/images/generations
  musicgen-medium:
    type: audio
    framework: audiocraft
    docker_service: musicgen
    port: 8003
    vram_gb: 11
    startup_time_seconds: 45
    endpoint: /v1/audio/generations
```
**Adding new models**: Just add a new entry to this file and define the Docker service.
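
For reference, a registry entry maps onto the compose commands used elsewhere in this log roughly as in the sketch below; the real orchestrator talks to Docker via the Python SDK, so this subprocess version is illustrative only:

```python
# Illustrative mapping from a models.yaml entry to the documented compose commands
# (`docker compose --profile <type> up -d <docker_service>` / `stop <docker_service>`).
import subprocess

def start_model(cfg: dict) -> None:
    """cfg is one models.yaml entry, e.g. {"type": "text", "docker_service": "vllm-qwen", ...}."""
    subprocess.run(
        ["docker", "compose", "-f", "compose.yaml",
         "--profile", cfg["type"],                 # text / image / audio
         "up", "-d", cfg["docker_service"]],
        check=True,
    )

def stop_model(cfg: dict) -> None:
    subprocess.run(
        ["docker", "compose", "-f", "compose.yaml", "stop", cfg["docker_service"]],
        check=True,
    )
```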
### Deployment Changes
#### Docker Compose Structure
- **File**: `compose.yaml`
- **Services**: 4 total (1 orchestrator + 3 models)
- **Profiles**: `text`, `image`, `audio` (orchestrator manages activation)
- **Restart Policy**: `no` for models (orchestrator controls lifecycle)
- **Volumes**: All model caches on `/workspace` (922TB network volume)
#### LiteLLM Integration
Updated `litellm-config.yaml` to route all self-hosted models through orchestrator:
```yaml
# Text
- model_name: qwen-2.5-7b
  api_base: http://100.100.108.13:9000/v1  # Orchestrator

# Image
- model_name: flux-schnell
  api_base: http://100.100.108.13:9000/v1  # Orchestrator

# Music
- model_name: musicgen-medium
  api_base: http://100.100.108.13:9000/v1  # Orchestrator
```
All models now available via Open WebUI at https://ai.pivoine.art
### Usage Examples
**Text Generation**:
```bash
curl http://100.100.108.13:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen-2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
**Image Generation**:
```bash
curl http://100.100.108.13:9000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell", "prompt": "a cute cat", "size": "1024x1024"}'
```
**Music Generation**:
```bash
curl http://100.100.108.13:9000/v1/audio/generations \
-H "Content-Type: application/json" \
-d '{"model": "musicgen-medium", "prompt": "upbeat electronic", "duration": 30}'
```
### Deployment Commands
```bash
# Copy all files to RunPod
scp -r ai/* gpu-pivoine:/workspace/ai/
# SSH to GPU server
ssh gpu-pivoine
cd /workspace/ai/
# Start orchestrator (manages everything)
docker compose -f compose.yaml up -d orchestrator
# Check status
curl http://100.100.108.13:9000/health
# View logs
docker logs -f ai_orchestrator
# Manually switch models (optional)
curl -X POST http://100.100.108.13:9000/switch \
-H "Content-Type: application/json" \
-d '{"model": "flux-schnell"}'
```
### Performance Characteristics
| Model | VRAM | Startup Time | Generation Time | Notes |
|-------|------|--------------|-----------------|-------|
| Qwen 2.5 7B | 14GB | 120s | ~50 tok/sec | Fast text generation |
| Flux.1 Schnell | 14GB | 60s | 4-5s/image | High-quality images |
| MusicGen Medium | 11GB | 45s | 60-90s for 30s audio | Text-to-music |
**Model Switching Overhead**: 30-120 seconds (unload + load)
### Cost Analysis
**Current (Single GPU Sequential)**:
- Cost: ~$0.50/hour
- Monthly: ~$360 (24/7) or ~$120 (8hr/day)
- Trade-off: 30-120s switching time
**Alternative (Multi-GPU Concurrent)**:
- Cost: ~$0.75/hour (+50%)
- Monthly: ~$540 (24/7) or ~$180 (8hr/day)
- Benefit: No switching time, all models always available
**Decision**: Stick with single GPU for cost optimization. Switching time is acceptable for most use cases.
### Known Limitations
1. **Sequential Only**: Only one model active at a time
2. **Switching Latency**: 30-120 seconds to change models
3. **MusicGen License**: Pre-trained weights are CC-BY-NC (non-commercial)
4. **Spot Instance Volatility**: Pod can be terminated anytime
### Monitoring
**Check active model**:
```bash
curl http://100.100.108.13:9000/health | jq '{model: .current_model, vram: .model_info.vram_gb}'
```
**View orchestrator logs**:
```bash
docker logs -f ai_orchestrator
```
**GPU usage**:
```bash
ssh gpu-pivoine "nvidia-smi"
```
### Deployment Status ✅ COMPLETE (Multi-Modal)
**Deployment Date**: 2025-11-21
1. ✅ Create model orchestrator service - COMPLETE
2. ✅ Deploy vLLM text generation (Qwen 2.5 7B) - COMPLETE
3. ✅ Deploy Flux.1 Schnell image generation - COMPLETE
4. ✅ Deploy MusicGen Medium music generation - COMPLETE
5. ✅ Update LiteLLM configuration - COMPLETE
6. ✅ Test all three model types via orchestrator - READY FOR TESTING
7. ⏳ Monitor performance and costs - ONGOING
**Models Available**: `qwen-2.5-7b`, `flux-schnell`, `musicgen-medium` via Open WebUI
### Future Model Additions
**Easy to add** (just edit `models.yaml`):
- Llama 3.1 8B Instruct (text, gated model)
- Whisper Large v3 (speech-to-text)
- XTTS v2 (text-to-speech)
- Stable Diffusion XL (alternative image generation)
See `README.md` for detailed instructions on adding new models.
### Cost Optimization Ideas
1. **Auto-stop**: Configure RunPod to auto-stop after 30 minutes idle
2. **Spot Instances**: Already using Spot for 50% cost reduction
3. **Scheduled Operation**: Run only during business hours (8 hours/day = $120/month)
4. **Smaller Models**: Use Mistral 7B or quantized models for lighter workloads
5. **Pay-as-you-go**: Manually start/stop pod as needed
### Performance Benchmarks
*To be measured after deployment*
Expected (based on RTX 4090):
- Qwen 2.5 7B: 50-80 tokens/second
- Context processing: ~2-3 seconds for 1000 tokens
- First token latency: ~200-300ms


@@ -1,416 +0,0 @@
# RunPod Template Creation Guide
This guide shows you how to create a reusable RunPod template so you never have to reinstall everything from scratch when Spot instances restart.
## Why Create a Template?
**Without Template** (Manual Setup Every Time):
- ❌ Install Docker & Docker Compose (10-15 min)
- ❌ Install Tailscale (5 min)
- ❌ Pull Docker images (10-20 min)
- ❌ Download models: Qwen (~14GB), Flux (~12GB), MusicGen (~11GB) = 30-45 min
- ❌ Configure everything (5-10 min)
- **Total: 60-90 minutes per Spot instance restart**
**With Template** (Ready to Go):
- ✅ Everything pre-installed
- ✅ Models cached in `/workspace`
- ✅ Just start orchestrator
- **Total: 2-3 minutes**
## Template Contents
### System Software
- ✅ Docker 24.x + Docker Compose v2
- ✅ Tailscale latest
- ✅ NVIDIA Docker runtime
- ✅ Python 3.11
- ✅ Git, curl, wget, htop, nvtop
### Docker Images (Pre-built)
- `ai_orchestrator` - Model orchestration service
- `ai_vllm-qwen_1` - Text generation (vLLM + Qwen 2.5 7B)
- `ai_musicgen_1` - Music generation (AudioCraft)
- `ghcr.io/matatonic/openedai-images-flux:latest` - Image generation
### Model Cache (/workspace - Persistent)
- ✅ Qwen 2.5 7B Instruct (~14GB)
- ✅ Flux.1 Schnell (~12GB)
- ✅ MusicGen Medium (~11GB)
- **Total: ~37GB cached**
### Project Files (/workspace/ai)
- ✅ All orchestrator code
- ✅ Docker Compose configurations
- ✅ Model service configurations
- ✅ Documentation
---
## Step-by-Step Template Creation
### Prerequisites
1. RunPod account
2. Active RTX 4090 pod (or similar GPU)
3. SSH access to the pod
4. This repository cloned locally
### Step 1: Deploy Fresh Pod
```bash
# Create new RunPod instance:
# - GPU: RTX 4090 (24GB VRAM)
# - Disk: 50GB container disk
# - Network Volume: Attach or create 100GB+ volume
# - Template: Start with official PyTorch or CUDA template
# Note the SSH connection details (host, port, password)
```
### Step 2: Prepare the Instance
Run the automated preparation script:
```bash
# On your local machine, copy everything to RunPod
scp -P <PORT> -r /home/valknar/Projects/runpod/* root@<HOST>:/workspace/ai/
# SSH to the pod
ssh -p <PORT> root@<HOST>
# Run the preparation script
cd /workspace/ai
chmod +x scripts/prepare-template.sh
./scripts/prepare-template.sh
```
**What the script does:**
1. Installs Docker & Docker Compose
2. Installs Tailscale
3. Builds all Docker images
4. Pre-downloads all models
5. Validates everything works
6. Cleans up temporary files
**Estimated time: 45-60 minutes**
### Step 3: Manual Verification
After the script completes, verify everything:
```bash
# Check Docker is installed
docker --version
docker compose version
# Check Tailscale
tailscale version
# Check all images are built
docker images | grep ai_
# Check models are cached
ls -lh /workspace/huggingface_cache/
ls -lh /workspace/flux/models/
ls -lh /workspace/musicgen/models/
# Test orchestrator starts
cd /workspace/ai
docker compose -f compose.yaml up -d orchestrator
docker logs ai_orchestrator
# Test model loading (should be fast since models are cached)
curl http://localhost:9000/health
# Stop orchestrator
docker compose -f compose.yaml down
```
### Step 4: Clean Up Before Saving
**IMPORTANT**: Remove secrets and temporary data before creating template!
```bash
# Remove sensitive data
rm -f /workspace/ai/.env
rm -f /root/.ssh/known_hosts
rm -f /root/.bash_history
# Clear logs
rm -f /var/log/*.log
docker system prune -f --volumes  # Clean Docker build cache and volumes but keep images
# Clear Tailscale state (will re-authenticate on first use)
tailscale logout
# Create template-ready marker
echo "RunPod Multi-Modal AI Template v1.0" > /workspace/TEMPLATE_VERSION
echo "Created: $(date)" >> /workspace/TEMPLATE_VERSION
```
### Step 5: Save Template in RunPod Dashboard
1. **Go to RunPod Dashboard** → "My Pods"
2. **Select your prepared pod**
3. **Click "⋮" menu** → "Save as Template"
4. **Template Configuration**:
- **Name**: `multi-modal-ai-v1.0`
- **Description**:
```
Multi-Modal AI Stack with Orchestrator
- Text: vLLM + Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Models pre-cached (~37GB)
- Ready to deploy in 2-3 minutes
```
- **Category**: `AI/ML`
- **Docker Image**: (auto-detected)
- **Container Disk**: 50GB
- **Expose Ports**: 9000, 8001, 8002, 8003
- **Environment Variables** (optional):
```
HF_TOKEN=<leave empty, user will add>
TAILSCALE_AUTHKEY=<leave empty, user will add>
```
5. **Click "Save Template"**
6. **Wait for template creation** (5-10 minutes)
7. **Test the template** by deploying a new pod with it
---
## Using Your Template
### Deploy New Pod from Template
1. **RunPod Dashboard** → "Deploy"
2. **Select "Community Templates"** or "My Templates"
3. **Choose**: `multi-modal-ai-v1.0`
4. **Configure**:
- GPU: RTX 4090 (or compatible)
- Network Volume: Attach your existing volume with `/workspace` mount
- Environment:
- `HF_TOKEN`: Your Hugging Face token
- (Tailscale will be configured via SSH)
5. **Deploy Pod**
### First-Time Setup (On New Pod)
```bash
# SSH to the new pod
ssh -p <PORT> root@<HOST>
# Navigate to project
cd /workspace/ai
# Create .env file
cat > .env <<EOF
HF_TOKEN=hf_your_token_here
GPU_TAILSCALE_IP=100.100.108.13
EOF
# Configure Tailscale (one-time)
tailscale up --authkey=<YOUR_TAILSCALE_KEY>
# Start orchestrator (models already cached, starts in seconds!)
docker compose -f compose.yaml up -d orchestrator
# Verify
curl http://localhost:9000/health
# Check logs
docker logs -f ai_orchestrator
```
**Total setup time: 2-3 minutes!** 🎉
### Updating SSH Config (If Spot Instance Restarts)
Since Spot instances can restart with new IPs/ports:
```bash
# On your local machine
# Update ~/.ssh/config with new connection details
Host gpu-pivoine
    HostName <NEW_IP>
    Port <NEW_PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
```
---
## Template Maintenance
### Updating the Template
When you add new models or make improvements:
1. Deploy a pod from your existing template
2. Make your changes
3. Test everything
4. Clean up (remove secrets)
5. Save as new template version: `multi-modal-ai-v1.1`
6. Update your documentation
### Version History
Keep track of template versions:
```
v1.0 (2025-11-21) - Initial release
- Text: Qwen 2.5 7B
- Image: Flux.1 Schnell
- Music: MusicGen Medium
- Docker orchestrator
v1.1 (future) - Planned
- Add Llama 3.1 8B
- Add Whisper Large v3
- Optimize model loading
```
---
## Troubleshooting Template Creation
### Models Not Downloading
```bash
# Manually trigger model downloads
docker compose --profile text up -d vllm-qwen
docker logs -f ai_vllm-qwen_1
# Wait for "Model loaded successfully"
docker compose stop vllm-qwen
# Repeat for other models
docker compose --profile image up -d flux
docker compose --profile audio up -d musicgen
```
### Docker Images Not Building
```bash
# Build images one at a time
docker compose -f compose.yaml build orchestrator
docker compose -f compose.yaml build vllm-qwen
docker compose -f compose.yaml build musicgen
# Check build logs for errors
docker compose -f compose.yaml build --no-cache --progress=plain orchestrator
```
### Tailscale Won't Install
```bash
# Manual Tailscale installation
curl -fsSL https://tailscale.com/install.sh | sh
# Start daemon
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
# Test
tailscale version
```
### Template Too Large
RunPod templates have size limits. If your template is too large:
**Option 1**: Use network volume for models
- Move models to network volume: `/workspace/models/`
- Mount volume when deploying from template
- Models persist across pod restarts
**Option 2**: Reduce cached models
- Only cache most-used model (Qwen 2.5 7B)
- Download others on first use
- Accept slightly longer first-time startup
**Option 3**: Use Docker layer optimization
```dockerfile
# Order commands by change frequency: rarely-changing layers first,
# so the dependency layer stays cached across rebuilds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Frequently-changing application code last, so only this layer is rebuilt
COPY orchestrator.py .
```
---
## Cost Analysis
### Template Storage Cost
- RunPod charges for template storage: ~$0.10/GB/month
- This template: ~50GB = **~$5/month**
- **Worth it!** Saves 60-90 minutes per Spot restart
### Time Savings
- Spot instance restarts: 2-5 times per week (highly variable)
- Time saved per restart: 60-90 minutes
- **Total saved per month: roughly 8-30 hours**
- **Value: hours of engineering time recovered each month for about $5 in storage**
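These estimates can be sanity-checked with a few lines of arithmetic; all inputs below are the assumptions stated in this guide, not RunPod's published pricing.
```python
# Back-of-the-envelope check of the estimates above
# (all inputs are this guide's assumptions, not RunPod's published pricing)
storage_gb = 50
price_per_gb_month = 0.10  # ~$0.10/GB/month template storage
print(f"Template storage: ~${storage_gb * price_per_gb_month:.2f}/month")

weeks_per_month = 4
for restarts_per_week, minutes_saved in [(2, 60), (5, 90)]:  # best and worst case
    hours = restarts_per_week * weeks_per_month * minutes_saved / 60
    print(f"{restarts_per_week} restarts/week x {minutes_saved} min: ~{hours:.0f} h/month saved")
```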
---
## Advanced: Automated Template Updates
Create a CI/CD pipeline to automatically update templates:
```bash
# GitHub Actions workflow (future enhancement)
# 1. Deploy pod from template
# 2. Pull latest code
# 3. Rebuild images
# 4. Test
# 5. Save new template version
# 6. Notify team
```
---
## Template Checklist
Before saving your template, verify:
- [ ] All Docker images built and working
- [ ] All models downloaded and cached
- [ ] Tailscale installed (but logged out)
- [ ] Docker Compose files present
- [ ] `.env` file removed (secrets cleared)
- [ ] Logs cleared
- [ ] SSH keys removed
- [ ] Bash history cleared
- [ ] Template version documented
- [ ] Test deployment successful
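The file-level items in this checklist lend themselves to a small pre-save script. The sketch below covers only a few of them, and the exact paths (the `.env` created in `/workspace/ai`, root's bash history, the authorized-keys file) are assumptions based on the setup steps earlier in this guide.
```python
#!/usr/bin/env python3
"""Check a few file-level items from the checklist above (sketch).

Paths are assumptions based on the setup steps earlier in this guide;
extend the list to match your pod layout.
"""
from pathlib import Path

# Files that must NOT exist when the template is saved
must_be_absent = [
    Path("/workspace/ai/.env"),          # secrets removed
    Path("/root/.bash_history"),         # bash history cleared
    Path("/root/.ssh/authorized_keys"),  # SSH keys removed (assumed location)
]

ok = True
for path in must_be_absent:
    if path.exists():
        print(f"FAIL: {path} still present")
        ok = False
    else:
        print(f"OK:   {path} absent")

print("Ready to save the template" if ok else "Clean up before saving the template")
```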
---
## Support
If you have issues creating the template:
1. Check `/workspace/ai/scripts/prepare-template.sh` logs
2. Review Docker build logs: `docker compose build --progress=plain`
3. Check model download logs: `docker logs <container>`
4. Verify disk space: `df -h`
5. Check network volume is mounted: `mount | grep workspace`
For RunPod-specific issues:
- RunPod Docs: https://docs.runpod.io/
- RunPod Discord: https://discord.gg/runpod
---
## Next Steps
After creating your template:
1. ✅ Test deployment from template
2. ✅ Document in `GPU_DEPLOYMENT_LOG.md`
3. ✅ Share template ID with team (if applicable)
4. ✅ Set up monitoring (Netdata, etc.)
5. ✅ Configure auto-stop for cost optimization
6. ✅ Add more models as needed
**Your multi-modal AI infrastructure is now portable and reproducible!** 🚀

View File

@@ -1,13 +0,0 @@
{
"model": "flux-schnell",
"offload": true,
"sequential_cpu_offload": false,
"vae_tiling": true,
"enable_model_cpu_offload": true,
"low_vram_mode": false,
"torch_compile": false,
"safety_checker": false,
"watermark": false,
"flux_device": "cuda",
"compile": false
}

View File

@@ -1,22 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY orchestrator.py .
COPY models.yaml .
# Expose port
EXPOSE 9000
# Run the orchestrator
CMD ["python", "orchestrator.py"]

View File

@@ -6,7 +6,7 @@ models:
qwen-2.5-7b:
type: text
framework: vllm
service_script: vllm/server.py
service_script: models/vllm/server.py
port: 8001
vram_gb: 14
startup_time_seconds: 120
@@ -17,7 +17,7 @@ models:
flux-schnell:
type: image
framework: openedai-images
service_script: flux/server.py
service_script: models/flux/server.py
port: 8002
vram_gb: 14
startup_time_seconds: 60
@@ -28,7 +28,7 @@ models:
musicgen-medium:
type: audio
framework: audiocraft
service_script: musicgen/server.py
service_script: models/musicgen/server.py
port: 8003
vram_gb: 11
startup_time_seconds: 45

View File

@@ -1,13 +1,4 @@
# vLLM Text Generation Service Dependencies
# vLLM engine
# vLLM engine (will install compatible torch, transformers, etc.)
vllm==0.6.4.post1
# PyTorch (required by vLLM)
torch==2.1.0
# Transformers (for model loading)
transformers==4.36.0
# Additional dependencies
accelerate==0.25.0

View File

@@ -1,27 +1,35 @@
#!/usr/bin/env python3
"""
vLLM Text Generation Service
OpenAI-compatible text generation using vLLM and Qwen 2.5 7B Instruct model.
Provides /v1/completions and /v1/chat/completions endpoints.
Simple vLLM server using AsyncLLMEngine directly
Bypasses the multiprocessing issues we hit with the default vLLM API server
OpenAI-compatible endpoints: /v1/models and /v1/completions
"""
import asyncio
import json
import logging
import os
from typing import AsyncIterator, Dict, List, Optional
from fastapi import Request
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from pydantic import BaseModel, Field
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.utils import random_uuid
# Import base service class
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../..'))
from core.base_service import GPUService
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# FastAPI app
app = FastAPI(title="Simple vLLM Server", version="1.0.0")
# Global engine instance
engine: Optional[AsyncLLMEngine] = None
model_name: str = "Qwen/Qwen2.5-7B-Instruct"
# Request/Response models
class CompletionRequest(BaseModel):
@@ -37,13 +45,11 @@ class CompletionRequest(BaseModel):
presence_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
frequency_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
class ChatMessage(BaseModel):
"""Chat message format"""
role: str = Field(..., description="Role: system, user, or assistant")
content: str = Field(..., description="Message content")
class ChatCompletionRequest(BaseModel):
"""OpenAI-compatible chat completion request"""
model: str = Field(default="qwen-2.5-7b")
@@ -55,243 +61,242 @@ class ChatCompletionRequest(BaseModel):
stream: bool = Field(default=False)
stop: Optional[str | List[str]] = None
@app.on_event("startup")
async def startup_event():
"""Initialize vLLM engine on startup"""
global engine, model_name
class VLLMService(GPUService):
"""vLLM text generation service"""
logger.info(f"Initializing vLLM AsyncLLMEngine with model: {model_name}")
def __init__(self):
# Get port from environment or use default
port = int(os.getenv("PORT", "8001"))
super().__init__(name="vllm-qwen", port=port)
# Configure engine
engine_args = AsyncEngineArgs(
model=model_name,
tensor_parallel_size=1, # Single GPU
gpu_memory_utilization=0.85, # Use 85% of GPU memory
max_model_len=4096, # Context length
dtype="auto", # Auto-detect dtype
download_dir="/workspace/huggingface_cache", # Large disk
trust_remote_code=True, # Some models require this
enforce_eager=False, # Use CUDA graphs for better performance
)
# Service-specific attributes
self.engine: Optional[AsyncLLMEngine] = None
self.model_name = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")
# Create async engine
engine = AsyncLLMEngine.from_engine_args(engine_args)
async def initialize(self):
"""Initialize vLLM engine"""
await super().initialize()
logger.info("vLLM AsyncLLMEngine initialized successfully")
self.logger.info(f"Initializing vLLM AsyncLLMEngine with model: {self.model_name}")
@app.get("/")
async def root():
"""Health check endpoint"""
return {"status": "ok", "model": model_name}
# Configure engine
engine_args = AsyncEngineArgs(
model=self.model_name,
tensor_parallel_size=1, # Single GPU
gpu_memory_utilization=0.85, # Use 85% of GPU memory
max_model_len=4096, # Context length
dtype="auto", # Auto-detect dtype
download_dir=os.getenv("HF_CACHE_DIR", "/workspace/huggingface_cache"),
trust_remote_code=True, # Some models require this
enforce_eager=False, # Use CUDA graphs for better performance
@app.get("/health")
async def health():
"""Detailed health check"""
return {
"status": "healthy" if engine else "initializing",
"model": model_name,
"ready": engine is not None
}
@app.get("/v1/models")
async def list_models():
"""OpenAI-compatible models endpoint"""
return {
"object": "list",
"data": [
{
"id": "qwen-2.5-7b",
"object": "model",
"created": 1234567890,
"owned_by": "pivoine-gpu",
"permission": [],
"root": model_name,
"parent": None,
}
]
}
def messages_to_prompt(messages: List[ChatMessage]) -> str:
"""Convert chat messages to a single prompt string"""
# Qwen 2.5 chat template format
prompt_parts = []
for msg in messages:
role = msg.role
content = msg.content
if role == "system":
prompt_parts.append(f"<|im_start|>system\n{content}<|im_end|>")
elif role == "user":
prompt_parts.append(f"<|im_start|>user\n{content}<|im_end|>")
elif role == "assistant":
prompt_parts.append(f"<|im_start|>assistant\n{content}<|im_end|>")
# Add final assistant prompt
prompt_parts.append("<|im_start|>assistant\n")
return "\n".join(prompt_parts)
@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
"""OpenAI-compatible completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Create async engine
self.engine = AsyncLLMEngine.from_engine_args(engine_args)
# Handle both single prompt and batch prompts
prompts = [request.prompt] if isinstance(request.prompt, str) else request.prompt
self.logger.info("vLLM AsyncLLMEngine initialized successfully")
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else [],
presence_penalty=request.presence_penalty,
frequency_penalty=request.frequency_penalty,
)
async def cleanup(self):
"""Cleanup resources"""
await super().cleanup()
if self.engine:
# vLLM doesn't have an explicit shutdown method
self.logger.info("vLLM engine cleanup")
self.engine = None
# Generate completions
results = []
for prompt in prompts:
request_id = random_uuid()
def messages_to_prompt(self, messages: List[ChatMessage]) -> str:
"""Convert chat messages to Qwen 2.5 prompt format"""
prompt_parts = []
for msg in messages:
role = msg.role
content = msg.content
if role == "system":
prompt_parts.append(f"<|im_start|>system\n{content}<|im_end|>")
elif role == "user":
prompt_parts.append(f"<|im_start|>user\n{content}<|im_end|>")
elif role == "assistant":
prompt_parts.append(f"<|im_start|>assistant\n{content}<|im_end|>")
# Add final assistant prompt
prompt_parts.append("<|im_start|>assistant\n")
return "\n".join(prompt_parts)
def create_app(self):
"""Create FastAPI routes"""
@self.app.get("/")
async def root():
"""Root endpoint"""
return {"status": "ok", "model": self.model_name}
@self.app.get("/v1/models")
async def list_models():
"""OpenAI-compatible models endpoint"""
return {
"object": "list",
"data": [
{
"id": "qwen-2.5-7b",
"object": "model",
if request.stream:
# Streaming response
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "text_completion",
"created": 1234567890,
"owned_by": "pivoine-gpu",
"permission": [],
"root": self.model_name,
"parent": None,
}
]
}
@self.app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
"""OpenAI-compatible completion endpoint"""
if not self.engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Handle both single prompt and batch prompts
prompts = [request.prompt] if isinstance(request.prompt, str) else request.prompt
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else [],
presence_penalty=request.presence_penalty,
frequency_penalty=request.frequency_penalty,
)
# Generate completions
results = []
for prompt in prompts:
request_id = random_uuid()
if request.stream:
# Streaming response
async def generate_stream():
async for output in self.engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": 0,
"logprobs": None,
"finish_reason": output.outputs[0].finish_reason,
}
]
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": 0,
"logprobs": None,
"finish_reason": output.outputs[0].finish_reason,
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in self.engine.generate(prompt, sampling_params, request_id):
final_output = output
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
results.append({
"text": final_output.outputs[0].text,
"index": len(results),
"logprobs": None,
"finish_reason": final_output.outputs[0].finish_reason,
})
results.append({
"text": final_output.outputs[0].text,
"index": len(results),
"logprobs": None,
"finish_reason": final_output.outputs[0].finish_reason,
})
return {
"id": random_uuid(),
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": results,
"usage": {
"prompt_tokens": 0, # vLLM doesn't expose this easily
"completion_tokens": 0,
"total_tokens": 0,
}
}
return {
"id": random_uuid(),
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": results,
"usage": {
"prompt_tokens": 0, # vLLM doesn't expose this easily
"completion_tokens": 0,
"total_tokens": 0,
}
}
@self.app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
"""OpenAI-compatible chat completion endpoint"""
if not self.engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
"""OpenAI-compatible chat completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Convert messages to prompt
prompt = self.messages_to_prompt(request.messages)
# Convert messages to prompt
prompt = messages_to_prompt(request.messages)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else ["<|im_end|>"],
)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else ["<|im_end|>"],
)
request_id = random_uuid()
request_id = random_uuid()
if request.stream:
# Streaming response
async def generate_stream():
async for output in self.engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "chat.completion.chunk",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"delta": {"content": output.outputs[0].text},
"finish_reason": output.outputs[0].finish_reason,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in self.engine.generate(prompt, sampling_params, request_id):
final_output = output
return {
if request.stream:
# Streaming response
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "chat.completion",
"object": "chat.completion.chunk",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": final_output.outputs[0].text,
},
"finish_reason": final_output.outputs[0].finish_reason,
"delta": {"content": output.outputs[0].text},
"finish_reason": output.outputs[0].finish_reason,
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
return {
"id": request_id,
"object": "chat.completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": final_output.outputs[0].text,
},
"finish_reason": final_output.outputs[0].finish_reason,
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0,
}
}
if __name__ == "__main__":
service = VLLMService()
service.run()
import uvicorn
# Get configuration from environment
host = os.getenv("VLLM_HOST", "0.0.0.0")
port = int(os.getenv("VLLM_PORT", "8000"))
logger.info(f"Starting vLLM server on {host}:{port}")
uvicorn.run(
app,
host=host,
port=port,
log_level="info",
access_log=True,
)

View File

@@ -1,38 +0,0 @@
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
WORKDIR /app
# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
ffmpeg \
git \
&& rm -rf /var/lib/apt/lists/*
# Upgrade pip
RUN pip3 install --no-cache-dir --upgrade pip
# Install PyTorch with CUDA support
RUN pip3 install --no-cache-dir torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy application code
COPY server.py .
# Create directory for model cache
RUN mkdir -p /app/models
# Environment variables
ENV HF_HOME=/app/models
ENV TORCH_HOME=/app/models
ENV MODEL_NAME=facebook/musicgen-medium
# Expose port
EXPOSE 8000
# Run the server
CMD ["python3", "server.py"]

View File

@@ -1,6 +0,0 @@
torch==2.1.0
torchaudio==2.1.0
audiocraft==1.3.0
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0

View File

@@ -1,194 +0,0 @@
#!/usr/bin/env python3
"""
MusicGen API Server
OpenAI-compatible API for music generation using Meta's MusicGen
Endpoints:
- POST /v1/audio/generations - Generate music from text prompt
- GET /health - Health check
- GET / - Service info
"""
import base64
import io
import logging
import os
import tempfile
from typing import Optional
import torch
import torchaudio
from audiocraft.models import MusicGen
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# FastAPI app
app = FastAPI(title="MusicGen API Server", version="1.0.0")
# Global model instance
model: Optional[MusicGen] = None
model_name: str = os.getenv("MODEL_NAME", "facebook/musicgen-medium")
device: str = "cuda" if torch.cuda.is_available() else "cpu"
class AudioGenerationRequest(BaseModel):
"""Music generation request"""
model: str = Field(default="musicgen-medium", description="Model name")
prompt: str = Field(..., description="Text description of the music to generate")
duration: float = Field(default=30.0, ge=1.0, le=30.0, description="Duration in seconds")
temperature: float = Field(default=1.0, ge=0.1, le=2.0, description="Sampling temperature")
top_k: int = Field(default=250, ge=0, le=500, description="Top-k sampling")
top_p: float = Field(default=0.0, ge=0.0, le=1.0, description="Top-p (nucleus) sampling")
cfg_coef: float = Field(default=3.0, ge=1.0, le=15.0, description="Classifier-free guidance coefficient")
response_format: str = Field(default="wav", description="Audio format (wav or mp3)")
class AudioGenerationResponse(BaseModel):
"""Music generation response"""
audio: str = Field(..., description="Base64-encoded audio data")
format: str = Field(..., description="Audio format (wav or mp3)")
duration: float = Field(..., description="Duration in seconds")
sample_rate: int = Field(..., description="Sample rate in Hz")
@app.on_event("startup")
async def startup_event():
"""Load MusicGen model on startup"""
global model
logger.info(f"Loading MusicGen model: {model_name}")
logger.info(f"Device: {device}")
# Load model
model = MusicGen.get_pretrained(model_name, device=device)
logger.info(f"MusicGen model loaded successfully")
logger.info(f"Max duration: 30 seconds at 32kHz")
@app.get("/")
async def root():
"""Root endpoint"""
return {
"service": "MusicGen API Server",
"model": model_name,
"device": device,
"max_duration": 30.0,
"sample_rate": 32000
}
@app.get("/health")
async def health():
"""Health check endpoint"""
return {
"status": "healthy" if model else "initializing",
"model": model_name,
"device": device,
"ready": model is not None,
"gpu_available": torch.cuda.is_available()
}
@app.post("/v1/audio/generations")
async def generate_audio(request: AudioGenerationRequest) -> AudioGenerationResponse:
"""Generate music from text prompt"""
if not model:
raise HTTPException(status_code=503, detail="Model not initialized")
logger.info(f"Generating music: {request.prompt[:100]}...")
logger.info(f"Duration: {request.duration}s, Temperature: {request.temperature}")
try:
# Set generation parameters
model.set_generation_params(
duration=request.duration,
temperature=request.temperature,
top_k=request.top_k,
top_p=request.top_p,
cfg_coef=request.cfg_coef,
)
# Generate audio
descriptions = [request.prompt]
with torch.no_grad():
wav = model.generate(descriptions)
# wav shape: [batch_size, channels, samples]
# Extract first batch item
audio_data = wav[0].cpu() # [channels, samples]
# Get sample rate
sample_rate = model.sample_rate
# Save to temporary file
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
temp_path = temp_file.name
torchaudio.save(temp_path, audio_data, sample_rate)
# Read audio file and encode to base64
with open(temp_path, 'rb') as f:
audio_bytes = f.read()
# Clean up temporary file
os.unlink(temp_path)
# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
logger.info(f"Generated {request.duration}s of audio")
return AudioGenerationResponse(
audio=audio_base64,
format="wav",
duration=request.duration,
sample_rate=sample_rate
)
except Exception as e:
logger.error(f"Error generating audio: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/v1/models")
async def list_models():
"""List available models (OpenAI-compatible)"""
return {
"object": "list",
"data": [
{
"id": "musicgen-medium",
"object": "model",
"created": 1234567890,
"owned_by": "meta",
"permission": [],
"root": model_name,
"parent": None,
}
]
}
if __name__ == "__main__":
import uvicorn
host = os.getenv("HOST", "0.0.0.0")
port = int(os.getenv("PORT", "8000"))
logger.info(f"Starting MusicGen API server on {host}:{port}")
uvicorn.run(
app,
host=host,
port=port,
log_level="info",
access_log=True,
)

View File

@@ -1,34 +0,0 @@
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
WORKDIR /app
# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
python3.11 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Upgrade pip
RUN pip3 install --no-cache-dir --upgrade pip
# Install vLLM and dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy application code
COPY server.py .
# Create directory for model cache
RUN mkdir -p /workspace/huggingface_cache
# Environment variables
ENV HF_HOME=/workspace/huggingface_cache
ENV VLLM_HOST=0.0.0.0
ENV VLLM_PORT=8000
# Expose port
EXPOSE 8000
# Run the server
CMD ["python3", "server.py"]

View File

@@ -1,4 +0,0 @@
vllm==0.6.4.post1
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0

View File

@@ -1,302 +0,0 @@
#!/usr/bin/env python3
"""
Simple vLLM server using AsyncLLMEngine directly
Bypasses the multiprocessing issues we hit with the default vLLM API server
OpenAI-compatible endpoints: /v1/models and /v1/completions
"""
import asyncio
import json
import logging
import os
from typing import AsyncIterator, Dict, List, Optional
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from pydantic import BaseModel, Field
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.utils import random_uuid
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# FastAPI app
app = FastAPI(title="Simple vLLM Server", version="1.0.0")
# Global engine instance
engine: Optional[AsyncLLMEngine] = None
model_name: str = "Qwen/Qwen2.5-7B-Instruct"
# Request/Response models
class CompletionRequest(BaseModel):
"""OpenAI-compatible completion request"""
model: str = Field(default="qwen-2.5-7b")
prompt: str | List[str] = Field(..., description="Text prompt(s)")
max_tokens: int = Field(default=512, ge=1, le=4096)
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
top_p: float = Field(default=1.0, ge=0.0, le=1.0)
n: int = Field(default=1, ge=1, le=10)
stream: bool = Field(default=False)
stop: Optional[str | List[str]] = None
presence_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
frequency_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
class ChatMessage(BaseModel):
"""Chat message format"""
role: str = Field(..., description="Role: system, user, or assistant")
content: str = Field(..., description="Message content")
class ChatCompletionRequest(BaseModel):
"""OpenAI-compatible chat completion request"""
model: str = Field(default="qwen-2.5-7b")
messages: List[ChatMessage] = Field(..., description="Chat messages")
max_tokens: int = Field(default=512, ge=1, le=4096)
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
top_p: float = Field(default=1.0, ge=0.0, le=1.0)
n: int = Field(default=1, ge=1, le=10)
stream: bool = Field(default=False)
stop: Optional[str | List[str]] = None
@app.on_event("startup")
async def startup_event():
"""Initialize vLLM engine on startup"""
global engine, model_name
logger.info(f"Initializing vLLM AsyncLLMEngine with model: {model_name}")
# Configure engine
engine_args = AsyncEngineArgs(
model=model_name,
tensor_parallel_size=1, # Single GPU
gpu_memory_utilization=0.85, # Use 85% of GPU memory
max_model_len=4096, # Context length
dtype="auto", # Auto-detect dtype
download_dir="/workspace/huggingface_cache", # Large disk
trust_remote_code=True, # Some models require this
enforce_eager=False, # Use CUDA graphs for better performance
)
# Create async engine
engine = AsyncLLMEngine.from_engine_args(engine_args)
logger.info("vLLM AsyncLLMEngine initialized successfully")
@app.get("/")
async def root():
"""Health check endpoint"""
return {"status": "ok", "model": model_name}
@app.get("/health")
async def health():
"""Detailed health check"""
return {
"status": "healthy" if engine else "initializing",
"model": model_name,
"ready": engine is not None
}
@app.get("/v1/models")
async def list_models():
"""OpenAI-compatible models endpoint"""
return {
"object": "list",
"data": [
{
"id": "qwen-2.5-7b",
"object": "model",
"created": 1234567890,
"owned_by": "pivoine-gpu",
"permission": [],
"root": model_name,
"parent": None,
}
]
}
def messages_to_prompt(messages: List[ChatMessage]) -> str:
"""Convert chat messages to a single prompt string"""
# Qwen 2.5 chat template format
prompt_parts = []
for msg in messages:
role = msg.role
content = msg.content
if role == "system":
prompt_parts.append(f"<|im_start|>system\n{content}<|im_end|>")
elif role == "user":
prompt_parts.append(f"<|im_start|>user\n{content}<|im_end|>")
elif role == "assistant":
prompt_parts.append(f"<|im_start|>assistant\n{content}<|im_end|>")
# Add final assistant prompt
prompt_parts.append("<|im_start|>assistant\n")
return "\n".join(prompt_parts)
@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
"""OpenAI-compatible completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Handle both single prompt and batch prompts
prompts = [request.prompt] if isinstance(request.prompt, str) else request.prompt
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else [],
presence_penalty=request.presence_penalty,
frequency_penalty=request.frequency_penalty,
)
# Generate completions
results = []
for prompt in prompts:
request_id = random_uuid()
if request.stream:
# Streaming response
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": 0,
"logprobs": None,
"finish_reason": output.outputs[0].finish_reason,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
results.append({
"text": final_output.outputs[0].text,
"index": len(results),
"logprobs": None,
"finish_reason": final_output.outputs[0].finish_reason,
})
return {
"id": random_uuid(),
"object": "text_completion",
"created": 1234567890,
"model": request.model,
"choices": results,
"usage": {
"prompt_tokens": 0, # vLLM doesn't expose this easily
"completion_tokens": 0,
"total_tokens": 0,
}
}
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
"""OpenAI-compatible chat completion endpoint"""
if not engine:
return JSONResponse(
status_code=503,
content={"error": "Engine not initialized"}
)
# Convert messages to prompt
prompt = messages_to_prompt(request.messages)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
n=request.n,
stop=request.stop if request.stop else ["<|im_end|>"],
)
request_id = random_uuid()
if request.stream:
# Streaming response
async def generate_stream():
async for output in engine.generate(prompt, sampling_params, request_id):
chunk = {
"id": request_id,
"object": "chat.completion.chunk",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"delta": {"content": output.outputs[0].text},
"finish_reason": output.outputs[0].finish_reason,
}
]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
# Non-streaming response
async for output in engine.generate(prompt, sampling_params, request_id):
final_output = output
return {
"id": request_id,
"object": "chat.completion",
"created": 1234567890,
"model": request.model,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": final_output.outputs[0].text,
},
"finish_reason": final_output.outputs[0].finish_reason,
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0,
}
}
if __name__ == "__main__":
import uvicorn
# Get configuration from environment
host = os.getenv("VLLM_HOST", "0.0.0.0")
port = int(os.getenv("VLLM_PORT", "8000"))
logger.info(f"Starting vLLM server on {host}:{port}")
uvicorn.run(
app,
host=host,
port=port,
log_level="info",
access_log=True,
)