Continuous Audio Language Models
2025 · Simon Rouard, Manu Orsini, Axel Roebel, et al.
Abstract
Audio Language Models (ALMs) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating
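The generation loop described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the weights are random and untrained, the "backbone" is a simple causal average standing in for the Transformer, and the head only mirrors the one-step noise-to-frame sampling interface associated with consistency models. All names and sizes (LATENT_DIM, CONTEXT_DIM, toy_backbone, toy_head, generate) are hypothetical and chosen for illustration.

```python
import math
import random

LATENT_DIM = 4   # hypothetical size of one continuous VAE frame
CONTEXT_DIM = 8  # hypothetical size of the backbone's contextual embedding


def make_matrix(rows, cols, rng):
    """Random weights; a real model would learn these."""
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def toy_backbone(frames, w_in):
    """Stand-in for the causal Transformer: summarizes past frames only."""
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(LATENT_DIM)]
    return [math.tanh(v) for v in matvec(w_in, mean)]


def toy_head(context, noise, w_hidden, w_out):
    """Stand-in for the MLP head: maps (context, noise) to a frame in one step."""
    hidden = [math.tanh(v) for v in matvec(w_hidden, context + noise)]
    return matvec(w_out, hidden)


def generate(num_frames, seed=0):
    """Autoregressively sample continuous frames, one per timestep."""
    rng = random.Random(seed)
    w_in = make_matrix(CONTEXT_DIM, LATENT_DIM, rng)
    w_hidden = make_matrix(CONTEXT_DIM, CONTEXT_DIM + LATENT_DIM, rng)
    w_out = make_matrix(LATENT_DIM, CONTEXT_DIM, rng)
    frames = [[0.0] * LATENT_DIM]  # placeholder start-of-sequence frame
    for _ in range(num_frames):
        context = toy_backbone(frames, w_in)           # embedding for this timestep
        noise = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        frames.append(toy_head(context, noise, w_hidden, w_out))
    return frames[1:]  # drop the start frame


frames = generate(num_frames=5)
print(len(frames), len(frames[0]))  # prints: 5 4
```

Each step emits a vector of floats rather than an index into a codebook, so no quantization happens between the model and the VAE decoder. In a real system the frames would be passed to that decoder to produce a waveform.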
Related papers
- AudioLM: A Language Modeling Approach To Audio Generation (2022)
- CALM: Contrastive Aligned Audio-language Multirate And Multimodal Representations (2022)
- ALMTokenizer: A Low-bitrate And Semantic-rich Audio Codec Tokenizer For Audio Language Modeling (2025)
- Audio Language Modeling Using Perceptually-guided Discrete Representations (2022)
- From Alignment To Advancement: Bootstrapping Audio-language Alignment With Synthetic Data (2025)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024)
- Do Audio-language Models Understand Linguistic Variations? (2024)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)