SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
2025 · Zihan Liu, Shuangrui Ding, Zhixiong Zhang, et al.
Abstract
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, leading to cumbersome training and inference pipelines, as well as suboptimal overall generation quality due to error accumulation across stages. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibil
Related papers
- Text-to-song: Towards Controllable Music Generation Incorporating Vocals And Accompaniment (2024)
- Segtune: Structured And Fine-grained Control For Song Generation (2025)
- Fastsag: Towards Fast Non-autoregressive Singing Accompaniment Generation (2024)
- Audiogen: Textually Guided Audio Generation (2022)
- Auto-regressive Vs Flow-matching: A Comparative Study Of Modeling Paradigms For Text-to-music Generation (2025)
- Sequence-to-sequence Singing Synthesis Using The Feed-forward Transformer (2019)
- Songtrans: An Unified Song Transcription And Alignment Method For Lyrics And Notes (2024)
- Singgan: Generative Adversarial Network For High-fidelity Singing Voice Generation (2021)