Mustango: Toward Controllable Text-to-music Generation
2023 · Jan Melechovsky, Zixun Guo, Deepanway Ghosal, et al.
Abstract
The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music not only with general text captions, but also with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using stat
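As a rough illustration of the "richer captions" idea in the abstract, the sketch below appends control sentences for chords, beats, tempo, and key to a general caption. The `MusicControls` class, the `enrich_caption` helper, and the sentence templates are all hypothetical and chosen for illustration; the abstract does not specify the caption format the authors actually use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MusicControls:
    """Hypothetical container for the four control types named in the abstract."""
    chords: Optional[List[str]] = None
    beats_per_bar: Optional[int] = None
    tempo_bpm: Optional[float] = None
    key: Optional[str] = None


def enrich_caption(caption: str, controls: MusicControls) -> str:
    """Append one sentence per available control to a general caption.

    The sentence templates are assumptions, not the paper's wording.
    """
    parts = [caption.strip()]
    if controls.chords:
        parts.append("The chord progression is " + ", ".join(controls.chords) + ".")
    if controls.beats_per_bar is not None:
        parts.append(f"The beat count is {controls.beats_per_bar}.")
    if controls.tempo_bpm is not None:
        # ":g" drops a trailing ".0" so 120.0 is rendered as "120"
        parts.append(f"The tempo is {controls.tempo_bpm:g} bpm.")
    if controls.key:
        parts.append(f"The key is {controls.key}.")
    return " ".join(parts)


if __name__ == "__main__":
    controls = MusicControls(
        chords=["Am", "F", "C", "G"], beats_per_bar=4, tempo_bpm=120, key="A minor"
    )
    print(enrich_caption("A mellow piano piece.", controls))
```

Controls left as `None` are skipped, so a plain caption passes through unchanged. This mirrors the abstract's point that the system accepts general captions as well as ones carrying music-specific instructions.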
Related papers
- Quality-aware Masked Diffusion Transformer for Enhanced Music Generation (2024)
- Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer (2024)
- MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies (2023)
- JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models (2023)
- Diffusion Based Text-to-Music Generation with Global and Local Text Based Conditioning (2025)
- Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models (2024)
- ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling (2025)
- DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer (2025)