InspireMusic: Integrating Super Resolution And Large Language Model For High-fidelity Long-form Music Generation
2025 · Chong Zhang, Yukun Ma, Qian Chen, et al.
Abstract
We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation. This unified framework, which combines an autoregressive transformer with a super-resolution flow-matching model, generates high-fidelity music, songs, and audio. It enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that we use an audio tokenizer with a single codebook that contains richer semantic information, thereby reducing training costs and improving efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to \(8\) minutes. An autoregressive transformer model based on Qwen 2.5 then predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Co
Related papers
- MelNet: A Generative Model For Audio In The Frequency Domain (2019) 0.00
- FlowHigh: Towards Efficient And High-quality Audio Super-resolution With Single-step Flow Matching (2025) 5.84
- UniverSR: Unified And Versatile Audio Super-resolution Via Vocoder-free Flow Matching (2025) 0.00
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91
- BigWavGAN: A Wave-to-wave Generative Adversarial Network For Music Super-resolution (2023) 0.00
- M\(^{2}\)UGen: Multi-modal Music Understanding And Generation With The Power Of Large Language Models (2023) 0.00
- Phase-aware Music Super-resolution Using Generative Adversarial Networks (2020) 9.59
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56