Quality-aware Masked Diffusion Transformer For Enhanced Music Generation
2024 · Chang Li, Ruoyu Wang, Lijuan Liu, et al.
Abstract
Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets suffer from issues such as low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (S
Related papers
- Mustango: Toward Controllable Text-to-music Generation (2023) 11.67
- 3MDiT: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) 0.00
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025) 0.00
- Diffusion Based Text-to-music Generation With Global And Local Text Based Conditioning (2025) 0.00
- MusicLDM: Enhancing Novelty In Text-to-music Generation Using Beat-synchronous Mixup Strategies (2023) 13.55
- Editing Music With Melody And Text: Using ControlNet For Diffusion Transformer (2024) 5.84
- EzAudio: Enhancing Text-to-audio Generation With Efficient Diffusion Transformer (2024) 7.50
- FakeMusicCaps: A Dataset For Detection And Attribution Of Synthetic Music Generated Via Text-to-music Models (2024) 0.00