Adds a monkey-patch for DiT.__init__() that configures LlamaConfig with
explicit num_attention_heads and num_key_value_heads parameters, which are
missing in the upstream DiffRhythm code.
Root cause: transformers 4.49.0+ requires these parameters but DiffRhythm's
dit.py only specifies hidden_size, causing the library to incorrectly infer
head_dim as 32 instead of 64, leading to tensor dimension mismatches.
Solution:
- Sets num_attention_heads = hidden_size // 64 (standard Llama architecture)
- Sets num_key_value_heads = num_attention_heads // 4 (GQA configuration)
- Ensures head_dim = 64, fixing the "tensor a (32) vs tensor b (64)" error
This fixes the root cause instead of just downgrading the transformers version.
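The head arithmetic described in the solution can be sketched in plain Python. This is a minimal illustration, not the actual patch: derive_head_config is a hypothetical helper, the hidden_size of 1024 is an assumed example value rather than one taken from DiffRhythm, and the max(1, ...) guard is an addition for small models.

```python
def derive_head_config(hidden_size: int, head_dim: int = 64, gqa_ratio: int = 4) -> dict:
    """Derive explicit head counts so head_dim does not have to be inferred.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if hidden_size % head_dim != 0:
        raise ValueError("hidden_size must be divisible by head_dim")
    # num_attention_heads = hidden_size // 64, as described in the commit
    num_attention_heads = hidden_size // head_dim
    # num_key_value_heads = num_attention_heads // 4 (GQA);
    # max(1, ...) is an added guard so tiny models keep at least one KV head
    num_key_value_heads = max(1, num_attention_heads // gqa_ratio)
    return {
        "hidden_size": hidden_size,
        "num_attention_heads": num_attention_heads,
        "num_key_value_heads": num_key_value_heads,
    }


# Assumed example value, not read from dit.py
cfg = derive_head_config(1024)
resulting_head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
print(cfg, resulting_head_dim)
```

Passing these values explicitly means the per-head dimension follows from the configuration instead of from library defaults.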
References:
- https://github.com/billwuhao/ComfyUI_DiffRhythm/issues/44
- https://github.com/billwuhao/ComfyUI_DiffRhythm/issues/48
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The correct function to patch is decode_audio from infer_utils module,
which is where chunked VAE decoding actually happens. This intercepts
the call at the right level to force chunked=False.
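A minimal sketch of this kind of module-level patch follows. The infer_utils object here is a stand-in namespace, and the decode_audio signature is assumed for illustration; the real function's parameters may differ. The wrapper also assumes chunked is passed by keyword.

```python
import functools
import types

# Stand-in for the real infer_utils module (hypothetical, illustration only).
infer_utils = types.SimpleNamespace()


def _original_decode_audio(latents, vae_model, chunked=False, **kwargs):
    # Assumed signature; simply reports which mode was requested.
    return {"latents": latents, "chunked": chunked}


infer_utils.decode_audio = _original_decode_audio


def force_unchunked(func):
    """Wrap func so that every call runs with chunked=False."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Overrides any keyword value; would conflict with a positional one.
        kwargs["chunked"] = False
        return func(*args, **kwargs)
    return wrapper


# Replace the attribute on the module so later lookups see the wrapper.
infer_utils.decode_audio = force_unchunked(infer_utils.decode_audio)

result = infer_utils.decode_audio("latents", "vae", chunked=True)
print(result["chunked"])
```

Code that ran `from infer_utils import decode_audio` before the patch keeps its reference to the original function, so a patch like this has to be applied before such imports or on the importing module as well.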
The previous approach of overriding diffrhythmgen wasn't working because
ComfyUI doesn't pass the chunked parameter when it's not in INPUT_TYPES.
This fix monkey-patches the infer() function at module level to always
force chunked=False, preventing the tensor dimension mismatch error.
The parameters must match the diffrhythmgen() function signature order,
not the INPUT_TYPES order. The function has 'edit' as the first parameter.
Correct widgets_values order (11 parameters):
0: edit (boolean)
1: model (string)
2: style_prompt (string)
3: lyrics_or_edit_lyrics (string)
4: edit_segments (string)
5: odeint_method (enum)
6: steps (int)
7: cfg (int)
8: quality_or_speed (enum)
9: unload_model (boolean)
10: seed (int)
Note: style_audio_or_edit_song comes from input connection (not in widgets)
Note: chunked parameter is hidden (not in widgets)
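The ordering above can be enforced when building a workflow programmatically. The sketch below is an assumption-laden illustration: build_widgets_values is a hypothetical helper and every value in the example dict is a placeholder, not a real model name or a confirmed enum option.

```python
# Order taken from the list above (function signature order, 11 entries).
WIDGET_ORDER = [
    "edit",
    "model",
    "style_prompt",
    "lyrics_or_edit_lyrics",
    "edit_segments",
    "odeint_method",
    "steps",
    "cfg",
    "quality_or_speed",
    "unload_model",
    "seed",
]


def build_widgets_values(params: dict) -> list:
    """Return widget values in signature order, failing on missing names."""
    missing = [name for name in WIDGET_ORDER if name not in params]
    if missing:
        raise KeyError(f"missing widget parameters: {missing}")
    return [params[name] for name in WIDGET_ORDER]


# Placeholder values for illustration only.
example_params = {
    "edit": False,
    "model": "example-model",
    "style_prompt": "example style",
    "lyrics_or_edit_lyrics": "",
    "edit_segments": "",
    "odeint_method": "euler",
    "steps": 30,
    "cfg": 4,
    "quality_or_speed": "speed",
    "unload_model": True,
    "seed": 42,
}

widgets_values = build_widgets_values(example_params)
print(widgets_values)
```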
Updated workflows:
- diffrhythm-simple-t2m-v1.json
- diffrhythm-random-generation-v1.json
- diffrhythm-reference-based-v1.json
- diffrhythm-full-length-t2m-v1.json
Fix "edit song, edit lyrics, edit segments must be provided" error by adding
the two missing parameters to all three DiffRhythm workflow files:
- diffrhythm-random-generation-v1.json
- diffrhythm-reference-based-v1.json
- diffrhythm-full-length-t2m-v1.json
Added empty string parameters at positions 9 and 10 in widgets_values array:
- edit_song: "" (empty when edit=false)
- edit_lyrics: "" (empty when edit=false)
The DiffRhythmRun node requires 12 parameters total, not 10. These workflows
use edit=false (no editing), so the edit parameters should be empty strings.
Add torchcodec to ComfyUI requirements.txt to fix audio tensor caching
error in DiffRhythm. This package is required for save_with_torchcodec
function used by DiffRhythm audio nodes.
- Add DiffRhythm dependencies to requirements.txt (19 packages)
- Add reference audio placeholder for style transfer workflow
- DiffRhythm nodes now loading in ComfyUI
- All four workflows ready for music generation
Added all required packages for ComfyUI_DiffRhythm extension:
- torchdiffeq: ODE solvers for diffusion models
- x-transformers: Transformer architecture components
- librosa: Audio analysis and feature extraction
- pandas, pyarrow: Data handling
- ema-pytorch, prefigure: Training utilities
- muq: Music quality model
- mutagen: Audio metadata handling
- pykakasi, jieba, cn2an, pypinyin: Chinese/Japanese text processing
- Unidecode, phonemizer, inflect: Text normalization and phonetic conversion
- py3langid: Language identification
These dependencies enable the DiffRhythm node to load and function properly in ComfyUI, fixing the "ModuleNotFoundError: No module named 'infer_utils'" error.
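A quick way to see which of these packages are still missing from an environment is to probe their import names. This is a rough sketch: find_missing is a hypothetical helper, and the import names below are assumptions, since a package's pip name and import name can differ (for example x-transformers versus x_transformers).

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be located."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed import names for the packages listed above.
DIFFRHYTHM_MODULES = [
    "torchdiffeq", "x_transformers", "librosa", "pandas", "pyarrow",
    "ema_pytorch", "prefigure", "muq", "mutagen", "pykakasi", "jieba",
    "cn2an", "pypinyin", "unidecode", "phonemizer", "inflect", "py3langid",
]

missing = find_missing(DIFFRHYTHM_MODULES)
print("missing modules:", missing)
```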
- Changed checkpoint from waiIllustriousSDXL_v150.safetensors to ponyDiffusionV6XL_v6StartWithThisOne.safetensors
- Fixed metadata model reference (was incorrectly referencing LoRA)
- Added files field to models_civitai.yaml for explicit filename mapping
- Aligns workflow with actual Pony Diffusion V6 XL model
Changed checkpoint from 'add-detail-xl.safetensors' (which is a LoRA) to
'waiIllustriousSDXL_v150.safetensors', which is the downloaded anime NSFW model.
The upscale_model input was at index 5 instead of index 12, causing all
widget parameters to be misaligned. Fixed by:
- Updating link target index from 5 to 12 for upscale_model
- Adding explicit entries for widget parameters in inputs array
- Maintaining correct parameter order per custom node definition
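The link retargeting can be sketched as follows, assuming each entry in the workflow's links array has the layout [link_id, origin_node, origin_slot, target_node, target_slot, type]. The helper name retarget_link and the node and link ids in the example are hypothetical.

```python
def retarget_link(workflow: dict, link_id: int, new_target_slot: int) -> list:
    """Point an existing link at a different input slot on its target node.

    Assumes link entries look like:
    [link_id, origin_node, origin_slot, target_node, target_slot, type]
    """
    for link in workflow["links"]:
        if link[0] == link_id:
            link[4] = new_target_slot
            return link
    raise KeyError(f"link {link_id} not found")


# Hypothetical workflow fragment: link 7 feeds upscale_model at slot 5.
workflow = {"links": [[7, 2, 0, 5, 5, "UPSCALE_MODEL"]]}

updated = retarget_link(workflow, link_id=7, new_target_slot=12)
print(updated)
```

The target node's inputs array still needs a matching entry at the new index, which is what the explicit widget parameter entries provide.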
Added example images for testing workflows:
- input_image.png (512x512) - for general upscaling workflows
- input_portrait.png (512x768) - for portrait/face upscaling workflows
Sound Lab's Musicgen_ node outputs AUDIO format that is only compatible with Sound Lab nodes like AudioPlay, not the built-in ComfyUI audio nodes (SaveAudio/PreviewAudio).
SaveAudio was erroring on 'waveform' key - the AUDIO output from
Musicgen_ node has a different internal structure than what SaveAudio
expects. PreviewAudio is more compatible with Sound Lab's AUDIO format.
Files are still saved to ComfyUI output directory, just through
PreviewAudio instead of SaveAudio.
Fixed medium, small, and melody workflows:
- Replaced non-existent nodes with Musicgen_ from Sound Lab
- Added missing links arrays to connect nodes properly
- Updated all metadata and performance specs
Note: Melody workflow simplified to text-only as Sound Lab doesn't
currently support melody conditioning via audio input.
Changed from non-existent nodes to actual Sound Lab nodes:
- Replaced MusicGenLoader/MusicGenTextEncode/MusicGenSampler with Musicgen_
- Replaced custom SaveAudio with standard SaveAudio node
- Added missing links array to connect nodes
- All parameters: prompt, duration, guidance_scale, seed, device
Node is called "Musicgen_" (with underscore) from comfyui-sound-lab.
SD3.5 checkpoint doesn't contain CLIP encoders. Now using:
- CheckpointLoaderSimple for MODEL and VAE
- TripleCLIPLoader for CLIP-L, CLIP-G, and T5-XXL
- Standard CLIPTextEncode for prompts
This fixes the "clip input is invalid: None" error.
Replaced CheckpointLoaderSimple with UNETLoader + DualCLIPLoader.
Replaced CLIPTextEncode with CLIPTextEncodeFlux.
Added proper VAELoader with ae.safetensors.
Added ConditioningZeroOut for empty negative conditioning.
Removed old negative prompt input (FLUX doesn't use it).
Changes match FLUX Dev workflow structure.
Added FLUX VAE (ae.safetensors) to model configuration and updated
workflow to use it instead of non-existent pixel_space VAE.
This fixes the SaveImage data type error reported as "(1, 1, 16), |u1".
Node 3 (CLIPTextEncodeFlux) output feeds both KSampler (link 3) and
ConditioningZeroOut (link 8), so the output links array must include
both links.
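This kind of mismatch can be caught with a small consistency check. The sketch below assumes the workflow JSON stores links as [link_id, origin_node, origin_slot, target_node, target_slot, type] and that each node output carries a links list of ids; missing_output_links is a hypothetical helper and the target node ids are made up.

```python
def missing_output_links(workflow: dict) -> list:
    """Return (node_id, slot, link_id) for links not listed on their origin output."""
    nodes = {node["id"]: node for node in workflow["nodes"]}
    problems = []
    for link_id, origin_node, origin_slot, *_rest in workflow["links"]:
        outputs = nodes[origin_node].get("outputs", [])
        if origin_slot < len(outputs):
            # "links" may be absent or null when the output is unconnected
            listed = outputs[origin_slot].get("links") or []
        else:
            listed = []
        if link_id not in listed:
            problems.append((origin_node, origin_slot, link_id))
    return problems


# Hypothetical fragment: node 3 lists link 3 but not link 8.
workflow = {
    "nodes": [
        {"id": 3, "type": "CLIPTextEncodeFlux",
         "outputs": [{"name": "CONDITIONING", "links": [3]}]},
    ],
    "links": [
        [3, 3, 0, 5, 1, "CONDITIONING"],
        [8, 3, 0, 9, 0, "CONDITIONING"],
    ],
}

problems = missing_output_links(workflow)
print(problems)
```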
FLUX models require negative conditioning even though they don't use it.
Added ConditioningZeroOut node to create empty negative conditioning from
positive output, satisfying KSampler's required negative input.
- Replace CheckpointLoaderSimple with UNETLoader
- Replace CLIPTextEncode with DualCLIPLoader + CLIPTextEncodeFlux
- Add VAELoader with pixel_space
- Remove negative prompt (FLUX uses guidance differently)
- Set CFG to 1.0, guidance in text encoder (3.5)
- Add all node connections in links array
- Replace DiffusersLoader with ImageOnlyCheckpointLoader
- Replace SVDSampler with SVD_img2vid_Conditioning + KSampler
- Add VideoLinearCFGGuidance for temporal consistency
- Add all node connections in links array
- Configure VHS_VideoCombine with correct parameters (25 frames)
- Increase steps to 30 for better quality with longer video
Remove format-specific parameters from widgets_values array.
Only base parameters should be in widgets_values:
- frame_rate, loop_count, filename_prefix, format, pingpong, save_output
Format-specific params (pix_fmt, crf) are added dynamically by ComfyUI.
- Replace DiffusersLoader with ImageOnlyCheckpointLoader
- Replace SVDSampler with SVD_img2vid_Conditioning + KSampler
- Add VideoLinearCFGGuidance for temporal consistency
- Add all node connections in links array
- Configure VHS_VideoCombine with H.264 parameters
- CogVideoX-5b-I2V requires a specific resolution (720x480)
  - Cannot generate videos at different resolutions
- Update placeholder image to match model requirements
- Add enable_sequential_cpu_offload=true to DownloadAndLoadCogVideoModel
  - Reduces VRAM from ~20GB to ~12GB at the cost of slower inference
  - Widget values: [model, precision, quantization, cpu_offload] = ['THUDM/CogVideoX-5b-I2V', 'bf16', 'disabled', true]
  - Necessary for a 24GB GPU with other services running