Auto-regressive Vs Flow-matching: A Comparative Study Of Modeling Paradigms For Text-to-music Generation
2025 · Or Tal, Felix Kreuk, Yossi Adi
Abstract
Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments and full compositions, and even to respond to fine-grained control signals, e.g., chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and to identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare arguably the two most common modeling paradigms: auto-regressive decoding and conditional flow-matching. We conduct a controlled comparison by training all models from scratch using identical
Related papers
- DiffRhythm 2: Efficient And High Fidelity Song Generation Via Block Flow Matching (2025) 0.00
- SongGen: A Single Stage Auto-regressive Transformer For Text-to-song Generation (2025) 4.98
- Generative Pre-training For Speech With Flow Matching (2023) 0.00
- Comparing Normalizing Flows And Diffusion Models For Prosody And Acoustic Modelling In Text-to-speech (2023) 0.00
- ETTA: Elucidating The Design Space Of Text-to-audio Models (2024) 0.00
- InspireMusic: Integrating Super Resolution And Large Language Model For High-fidelity Long-form Music Generation (2025) 6.26
- MusicLDM: Enhancing Novelty In Text-to-music Generation Using Beat-synchronous Mixup Strategies (2023) 13.55
- Flowtron: An Autoregressive Flow-based Generative Network For Text-to-speech Synthesis (2020) 5.91