feat: integrate ComfyUI_DiffRhythm extension with 7 models and 4 workflows

- Add DiffRhythm to arty.yml references and setup/comfyui-nodes
- Install espeak-ng system dependency for phoneme processing
- Add 7 DiffRhythm models to models_huggingface.yaml with file mappings:
  * ASLP-lab/DiffRhythm-1_2 (95s generation)
  * ASLP-lab/DiffRhythm-full (4m45s generation)
  * ASLP-lab/DiffRhythm-base
  * ASLP-lab/DiffRhythm-vae
  * OpenMuQ/MuQ-MuLan-large
  * OpenMuQ/MuQ-large-msd-iter
  * FacebookAI/xlm-roberta-base
- Create 4 workflows:
  * diffrhythm-simple-t2m-v1.json (basic 95s text-to-music)
  * diffrhythm-full-length-t2m-v1.json (4m45s full-length)
  * diffrhythm-reference-based-v1.json (style transfer with reference audio)
  * diffrhythm-random-generation-v1.json (no-prompt random generation)
- Update storage requirements: essential from 80GB to 90GB, total from 137GB to 149GB
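
The storage figures can be cross-checked against the diff. The following is a minimal Python sketch of that arithmetic, not code from the repository: the numbers are copied from `storage_requirements` and the `size_gb` fields in this commit, while the names `PREVIOUS_TOTALS`, `DIFFRHYTHM_BUDGET`, `MODEL_SIZES`, `new_total`, and `summed_size_gb` are made up for illustration.

```python
# Figures (GB) copied from this commit's models_huggingface.yaml changes.
# The variable and function names are illustrative, not from the repository.
PREVIOUS_TOTALS = {"essential": 80, "all_models": 137}
DIFFRHYTHM_BUDGET = {"essential": 10, "all_models": 12}

# (size_gb, essential) for each of the 7 DiffRhythm entries, in file order.
MODEL_SIZES = [
    (2, True),     # ASLP-lab/DiffRhythm-1_2
    (2, False),    # ASLP-lab/DiffRhythm-full
    (2, False),    # ASLP-lab/DiffRhythm-base
    (1, True),     # ASLP-lab/DiffRhythm-vae
    (3, True),     # OpenMuQ/MuQ-MuLan-large
    (1.2, True),   # OpenMuQ/MuQ-large-msd-iter
    (1.1, True),   # FacebookAI/xlm-roberta-base
]


def new_total(tier):
    """Previous total for a tier plus the DiffRhythm line added to it."""
    return PREVIOUS_TOTALS[tier] + DIFFRHYTHM_BUDGET[tier]


def summed_size_gb(essential_only):
    """Sum the per-model size_gb values, optionally only essential ones."""
    sizes = (s for s, ess in MODEL_SIZES if ess or not essential_only)
    return round(sum(sizes), 1)


print(new_total("essential"), new_total("all_models"))
print(summed_size_gb(True), summed_size_gb(False))
```

The per-model `size_gb` values add up to roughly 8.3 GB (essential) and 12.3 GB (all), so the `diffrhythm: 10` and `diffrhythm: 12` entries look like rounded estimates, not exact sums.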

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 09:50:45 +01:00
parent 9439185b3d
commit f2186db78e
6 changed files with 646 additions and 7 deletions

@@ -485,6 +485,116 @@ model_categories:
         - source: "sdxl_models/ip-adapter-plus-face_sdxl_vit-h.safetensors"
           dest: "ip-adapter-plus-face_sdxl_vit-h.safetensors"
+  # ==========================================================================
+  # DIFFRHYTHM MODELS (Full-length song generation)
+  # ==========================================================================
+  diffrhythm_models:
+    - repo_id: ASLP-lab/DiffRhythm-1_2
+      description: DiffRhythm 1.2 - 95 second generation model
+      size_gb: 2
+      essential: true
+      category: diffrhythm
+      type: TTS/DiffRhythm
+      format: pt
+      vram_gb: 12
+      duration_seconds: 95
+      notes: Latest 95-second generation model
+      files:
+        - source: "cfm_model.pt"
+          dest: "cfm_model_v1_2.pt"
+    - repo_id: ASLP-lab/DiffRhythm-full
+      description: DiffRhythm Full - 4m45s full-length generation
+      size_gb: 2
+      essential: false
+      category: diffrhythm
+      type: TTS/DiffRhythm
+      format: pt
+      vram_gb: 16
+      duration_seconds: 285
+      notes: Full-length 4 minute 45 second music generation
+      files:
+        - source: "cfm_model.pt"
+          dest: "cfm_full_model.pt"
+    - repo_id: ASLP-lab/DiffRhythm-base
+      description: DiffRhythm Base - 95 second base model
+      size_gb: 2
+      essential: false
+      category: diffrhythm
+      type: TTS/DiffRhythm
+      format: pt
+      vram_gb: 12
+      duration_seconds: 95
+      notes: Base 95-second model
+      files:
+        - source: "cfm_model.pt"
+          dest: "cfm_model.pt"
+    - repo_id: ASLP-lab/DiffRhythm-vae
+      description: DiffRhythm VAE - Variational autoencoder
+      size_gb: 1
+      essential: true
+      category: diffrhythm
+      type: TTS/DiffRhythm
+      format: pt
+      vram_gb: 2
+      notes: VAE component fine-tuned from Stable Audio Open (Stability AI Community License)
+      files:
+        - source: "vae_model.pt"
+          dest: "vae_model.pt"
+    - repo_id: OpenMuQ/MuQ-MuLan-large
+      description: MuQ-MuLan-large - Music-text joint embedding (~700M parameters)
+      size_gb: 3
+      essential: true
+      category: diffrhythm
+      type: TTS/DiffRhythm/MuQ-MuLan-large
+      format: bin
+      vram_gb: 4
+      notes: Music-text joint embedding for semantic understanding (English/Chinese)
+      files:
+        - source: "config.json"
+          dest: "config.json"
+        - source: "pytorch_model.bin"
+          dest: "pytorch_model.bin"
+    - repo_id: OpenMuQ/MuQ-large-msd-iter
+      description: MuQ-large-msd-iter - Music representation learning (~300M parameters)
+      size_gb: 1.2
+      essential: true
+      category: diffrhythm
+      type: TTS/DiffRhythm/MuQ-large-msd-iter
+      format: safetensors
+      vram_gb: 2
+      notes: Music representation model trained on Million Song Dataset
+      files:
+        - source: "config.json"
+          dest: "config.json"
+        - source: "model.safetensors"
+          dest: "model.safetensors"
+    - repo_id: FacebookAI/xlm-roberta-base
+      description: XLM-RoBERTa Base - Multilingual text encoder (100 languages, 0.3B params)
+      size_gb: 1.1
+      essential: true
+      category: diffrhythm
+      type: TTS/DiffRhythm/xlm-roberta-base
+      format: safetensors
+      vram_gb: 1
+      notes: Multilingual text encoding for 100 languages
+      files:
+        - source: "config.json"
+          dest: "config.json"
+        - source: "model.safetensors"
+          dest: "model.safetensors"
+        - source: "sentencepiece.bpe.model"
+          dest: "sentencepiece.bpe.model"
+        - source: "tokenizer.json"
+          dest: "tokenizer.json"
+        - source: "tokenizer_config.json"
+          dest: "tokenizer_config.json"
 # ============================================================================
 # STORAGE & VRAM SUMMARIES
 # ============================================================================
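
The three `cfm_model.pt` sources above are each mapped to a different `dest` name, presumably so the checkpoints can sit side by side. The following Python sketch checks that no two mappings write to the same target. It assumes, without confirmation from the repository, that `type` names the destination directory; `MAPPINGS` and `find_collisions` are hypothetical names introduced for illustration.

```python
from collections import Counter

# (type, source, dest) triples copied from the diffrhythm_models entries.
# Treating `type` as the destination directory is an assumption.
MAPPINGS = [
    ("TTS/DiffRhythm", "cfm_model.pt", "cfm_model_v1_2.pt"),
    ("TTS/DiffRhythm", "cfm_model.pt", "cfm_full_model.pt"),
    ("TTS/DiffRhythm", "cfm_model.pt", "cfm_model.pt"),
    ("TTS/DiffRhythm", "vae_model.pt", "vae_model.pt"),
    ("TTS/DiffRhythm/MuQ-MuLan-large", "config.json", "config.json"),
    ("TTS/DiffRhythm/MuQ-MuLan-large", "pytorch_model.bin", "pytorch_model.bin"),
    ("TTS/DiffRhythm/MuQ-large-msd-iter", "config.json", "config.json"),
    ("TTS/DiffRhythm/MuQ-large-msd-iter", "model.safetensors", "model.safetensors"),
    ("TTS/DiffRhythm/xlm-roberta-base", "config.json", "config.json"),
    ("TTS/DiffRhythm/xlm-roberta-base", "model.safetensors", "model.safetensors"),
    ("TTS/DiffRhythm/xlm-roberta-base", "sentencepiece.bpe.model", "sentencepiece.bpe.model"),
    ("TTS/DiffRhythm/xlm-roberta-base", "tokenizer.json", "tokenizer.json"),
    ("TTS/DiffRhythm/xlm-roberta-base", "tokenizer_config.json", "tokenizer_config.json"),
]


def find_collisions(mappings, use_dest=True):
    """Return target paths that more than one mapping would write to.

    With use_dest=False the source filename is used instead, which shows
    what would collide if the files were copied without renaming.
    """
    counts = Counter(
        f"{directory}/{dest if use_dest else source}"
        for directory, source, dest in mappings
    )
    return sorted(path for path, n in counts.items() if n > 1)


print(find_collisions(MAPPINGS))
print(find_collisions(MAPPINGS, use_dest=False))
```

Under that assumption, the renamed `dest` values produce no collisions, while keeping the original source names would put three different checkpoints at `TTS/DiffRhythm/cfm_model.pt`. The repeated `config.json` and `model.safetensors` names do not clash because each of those repos has its own `type` path.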
@@ -495,14 +605,16 @@ storage_requirements:
     video: 28 # CogVideoX + SVD
     audio: 11 # MusicGen Medium
     support: 11 # All 3 CLIP models
-    total: 80 # Total essential storage
+    diffrhythm: 10 # DiffRhythm essential models
+    total: 90 # Total essential storage
   all_models:
     image: 54 # All image models
     video: 36 # All video models
     audio: 36 # All audio models
     support: 11 # All support models
-    total: 137 # Total with optional models
+    diffrhythm: 12 # All DiffRhythm models
+    total: 149 # Total with optional models
 vram_requirements:
   # For 24GB GPU (RTX 4090)