MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
2023 · Zihao Deng, Yinghao Ma, Yudong Liu, et al.
Abstract
Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the frozen pre-trained music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Because high-quality music Q&A datasets are scarce, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. The MusicInstruct dataset enables notable advancements over previous ones.
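The alignment idea in the abstract, a single trainable projection between a frozen music encoder and a frozen LLM, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class `LinearProjection`, the helper `build_llm_input`, the toy dimensions, and the choice to prepend the projected music frames to the prompt embeddings are all assumptions made for illustration. The lists below only stand in for real MERT features and LLM token embeddings.

```python
import random


class LinearProjection:
    """Hypothetical adapter: maps music-encoder features to the LLM embedding size.

    In a setup like the one the abstract describes, this would be the only
    trainable component; the encoder and the LLM stay frozen.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # One weight row per output dimension, small random initial values.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, frames):
        # Apply y = W x + b independently to every audio frame.
        return [
            [
                sum(w * x for w, x in zip(row, frame)) + b
                for row, b in zip(self.weight, self.bias)
            ]
            for frame in frames
        ]


def build_llm_input(music_frames, prompt_embeddings, projection):
    """Assumed layout: projected music frames first, then the text prompt."""
    return projection(music_frames) + prompt_embeddings


# Toy sizes, chosen only for the example (real models use far larger ones).
MUSIC_DIM, LLM_DIM = 4, 6

projection = LinearProjection(MUSIC_DIM, LLM_DIM)

# Stand-in for the output of a frozen music encoder: 3 frames of MUSIC_DIM features.
music_frames = [[0.1 * (i + j) for j in range(MUSIC_DIM)] for i in range(3)]

# Stand-in for frozen LLM token embeddings of a 2-token text prompt.
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

llm_input = build_llm_input(music_frames, prompt_embeddings, projection)
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is the shape contract: every music frame ends up with the same width as a text token embedding, so the LLM can consume both in one sequence. In practice the projection would be trained with a deep learning framework while the encoder and LLM weights are left untouched.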
Related papers
- M\(^{2}\)UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models (2023)
- MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models (2024)
- MusicTM-Dataset for Joint Representation Learning among Sheet Music, Lyrics, and Musical Audio (2020)
- Gamma: Towards Joint Global-temporal Music Understanding in Large Multimodal Models (2026)
- Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models (2025)
- MusCaps: Generating Captions for Music Audio (2021)
- UniAudio 1.5: Large Language Model-driven Audio Codec Is a Few-shot Audio Task Learner (2024)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)