Adds a monkey-patch for DiT.__init__() to properly configure LlamaConfig with the
num_attention_heads and num_key_value_heads parameters, which are missing
from the upstream DiffRhythm code.
Root cause: transformers 4.49.0+ requires these parameters, but DiffRhythm's
dit.py specifies only hidden_size, so the library incorrectly infers
head_dim as 32 instead of 64, leading to tensor dimension mismatches.
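The misconfiguration is easy to reproduce with transformers alone; the hidden_size value below is illustrative, not DiffRhythm's actual value:

```python
from transformers import LlamaConfig

# What dit.py effectively does: specify hidden_size and nothing else.
cfg = LlamaConfig(hidden_size=1024)

print(cfg.num_attention_heads)                     # 32 -- transformers' Llama default
print(cfg.hidden_size // cfg.num_attention_heads)  # 32, not the intended head_dim of 64
```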
Solution:
- Sets num_attention_heads = hidden_size // 64 (standard Llama architecture)
- Sets num_key_value_heads = num_attention_heads // 4 (GQA configuration)
- Ensures head_dim = 64, fixing the "tensor a (32) vs tensor b (64)" error
This is a proper fix rather than simply downgrading the transformers version.
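A minimal sketch of what the patch can look like. The import path diffrhythm.model.dit and the assumption that dit.py does `from transformers import LlamaConfig` are illustrative, not taken from the repo, so the node's actual patch may differ in detail:

```python
import functools

from diffrhythm.model import dit  # assumed import path for DiffRhythm's dit.py

_orig_dit_init = dit.DiT.__init__
_orig_llama_config = dit.LlamaConfig  # assumes dit.py does `from transformers import LlamaConfig`

def _llama_config_with_heads(**kwargs):
    # Fill in the head counts dit.py omits, so head_dim stays 64 instead of the
    # value transformers 4.49+ infers from hidden_size alone.
    hidden_size = kwargs.get("hidden_size", 4096)  # 4096 is transformers' Llama default
    kwargs.setdefault("num_attention_heads", hidden_size // 64)                   # head_dim = 64
    kwargs.setdefault("num_key_value_heads", kwargs["num_attention_heads"] // 4)  # GQA, 4:1 ratio
    return _orig_llama_config(**kwargs)

@functools.wraps(_orig_dit_init)
def _patched_dit_init(self, *args, **kwargs):
    # Redirect LlamaConfig only while DiT.__init__ runs, then restore it.
    dit.LlamaConfig = _llama_config_with_heads
    try:
        _orig_dit_init(self, *args, **kwargs)
    finally:
        dit.LlamaConfig = _orig_llama_config

dit.DiT.__init__ = _patched_dit_init
```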
References:
- https://github.com/billwuhao/ComfyUI_DiffRhythm/issues/44
- https://github.com/billwuhao/ComfyUI_DiffRhythm/issues/48
The correct function to patch is decode_audio from the infer_utils module,
which is where chunked VAE decoding actually happens. Intercepting the call
at that level makes it possible to force chunked=False.
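A sketch of that interception, assuming infer_utils is importable at the path below and that decode_audio accepts a `chunked` keyword (both taken from this description rather than verified against the repo):

```python
from diffrhythm.infer import infer_utils  # package prefix is illustrative; only infer_utils is named above

_orig_decode_audio = infer_utils.decode_audio

def _decode_audio_unchunked(*args, **kwargs):
    kwargs["chunked"] = False  # force non-chunked VAE decoding regardless of the caller
    return _orig_decode_audio(*args, **kwargs)

infer_utils.decode_audio = _decode_audio_unchunked
```

Rebinding the module attribute only affects callers that look the function up through infer_utils.decode_audio; code that imported the function by value before the patch ran would keep the original reference.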
The previous approach of overriding diffrhythmgen didn't work because
ComfyUI doesn't pass the chunked parameter when it isn't listed in INPUT_TYPES.
This fix monkey-patches the infer() function at module level so it always
forces chunked=False, preventing the tensor dimension mismatch error.
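The same wrapper pattern, applied one level up; the module that defines infer() is not named here, so the import below is a placeholder:

```python
from diffrhythm.infer import infer as infer_module  # placeholder path for wherever infer() lives

_orig_infer = infer_module.infer

def _infer_unchunked(*args, **kwargs):
    # ComfyUI never supplies `chunked` (it is not in INPUT_TYPES), so pin it here.
    kwargs["chunked"] = False
    return _orig_infer(*args, **kwargs)

infer_module.infer = _infer_unchunked
```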