DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching
2025 · Yuepeng Jiang, Huakang Chen, Ziqian Ning, et al.
Abstract
Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocals. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while
Related papers
- DiffRhythm+: Controllable And Flexible Full-length Song Generation With Preference Optimization (2025) 3.58
- DiffRhythm: Blazingly Fast And Embarrassingly Simple End-to-end Full-length Song Generation With Latent Diffusion (2025) 0.00
- Segtune: Structured And Fine-grained Control For Song Generation (2025) 0.00
- Auto-regressive Vs Flow-matching: A Comparative Study Of Modeling Paradigms For Text-to-music Generation (2025) 0.00
- Unsupervised Melody-to-lyric Generation (2023) 0.00
- Joint Learning Of Wording And Formatting For Singable Melody-to-lyric Generation (2023) 0.00
- Diff-a-riff: Musical Accompaniment Co-creation Via Latent Diffusion Models (2024) 0.00
- CSL-L2M: Controllable Song-level Lyric-to-melody Generation Based On Conditional Transformer With Fine-grained Lyric And Musical Controls (2024) 2.26