MusicLDM: Enhancing Novelty In Text-to-music Generation Using Beat-synchronous Mixup Strategies
2023 · Ke Chen, Yusong Wu, Haohe Liu, et al.
Abstract
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts the Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the HiFi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embedding space, respe
Related papers
- Augment, Drop & Swap: Improving Diversity In LLM Captions For Efficient Music-text Representation Learning (2024)
- Mustango: Toward Controllable Text-to-music Generation (2023)
- Quality-aware Masked Diffusion Transformer For Enhanced Music Generation (2024)
- Diff-A-Riff: Musical Accompaniment Co-creation Via Latent Diffusion Models (2024)
- DiffRhythm: Blazingly Fast And Embarrassingly Simple End-to-end Full-length Song Generation With Latent Diffusion (2025)
- M\(^{2}\)UGen: Multi-modal Music Understanding And Generation With The Power Of Large Language Models (2023)
- Editing Music With Melody And Text: Using ControlNet For Diffusion Transformer (2024)
- AudioToken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023)