MuMu-LLaMA: Multi-modal Music Understanding And Generation Via Large Language Models
2024 · Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, et al.
Abstract
Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.
Related papers
- M\(^{2}\)UGen: Multi-modal Music Understanding And Generation With The Power Of Large Language Models (2023) 0.00
- Gamma: Towards Joint Global-temporal Music Understanding In Large Multimodal Models (2026) 0.00
- MusiLingo: Bridging Music And Text With Pre-trained Language Models For Music Captioning And Query Response (2023) 9.03
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) 5.48
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00
- A Review Of Multi-modal Large Language And Vision Models (2024) 0.00
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59