Editing Music With Melody And Text: Using ControlNet For Diffusion Transformer
2024 · Siyuan Hou, Shansong Liu, Ruibin Yuan, et al.
Abstract
Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-\(k\) constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments hav
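As a rough illustration of the top-\(k\) idea mentioned in the abstract, the sketch below keeps only the indices of the \(k\) strongest frequency bins in each time frame of a constant-Q magnitude matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `top_k_bins`, the frame-major list layout, and the default `k=4` are all hypothetical choices made for this example, and a real pipeline would first compute the CQT with an audio library.

```python
from typing import List, Sequence


def top_k_bins(frames: Sequence[Sequence[float]], k: int = 4) -> List[List[int]]:
    """Return, for each time frame, the indices of its k largest-magnitude bins.

    frames: one sequence of non-negative CQT bin magnitudes per time frame
            (hypothetical layout chosen for this sketch).
    k:      number of bins to keep per frame (illustrative default).
    Indices are ordered from strongest to weakest; ties keep the lower bin first.
    """
    result = []
    for frame in frames:
        if not 1 <= k <= len(frame):
            raise ValueError("k must be between 1 and the number of bins")
        # Rank bin indices by magnitude, strongest first (sorted is stable).
        ranked = sorted(range(len(frame)), key=lambda b: frame[b], reverse=True)
        result.append(ranked[:k])
    return result


# Toy input: 2 frames, 4 bins each (made-up magnitudes, not real audio).
toy_frames = [
    [0.1, 0.7, 0.3, 0.5],
    [0.9, 0.2, 0.8, 0.1],
]
top2 = top_k_bins(toy_frames, k=2)
```

For the toy input, `top2` holds bins 1 and 3 for the first frame and bins 0 and 2 for the second. Keeping several bin indices per frame, rather than folding everything into 12 pitch classes as chroma does, is what lets such a representation distinguish octaves and simultaneous notes.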
Related papers
- Mustango: Toward Controllable Text-to-music Generation (2023)
- Quality-aware Masked Diffusion Transformer For Enhanced Music Generation (2024)
- Neural Music Synthesis For Flexible Timbre Control (2018)
- MusicLDM: Enhancing Novelty In Text-to-music Generation Using Beat-synchronous Mixup Strategies (2023)
- Mel-refine: A Plug-and-play Approach To Refine Mel-spectrogram In Audio Generation (2024)
- DegDiT: Controllable Audio Generation With Dynamic Event Graph Guided Diffusion Transformer (2025)
- DiffRhythm+: Controllable And Flexible Full-length Song Generation With Preference Optimization (2025)
- CSL-L2M: Controllable Song-level Lyric-to-melody Generation Based On Conditional Transformer With Fine-grained Lyric And Musical Controls (2024)