Long-form Speech Generation With Spoken Language Models
2024 · Se Jin Park, Julian Salazar, Aren Jansen, et al.
Abstract
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form se
Related papers
- Generative Spoken Language Model Based On Continuous Word-sized Audio Tokens (2023) 3.58
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91
- Scaling Spoken Language Models With Syllabic Speech Tokenization (2025) 0.00
- PSLM: Parallel Generation Of Text And Speech With LLMs For Low-latency Spoken Dialogue Systems (2024) 2.26
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024) 0.00
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) 0.00
- Text-free Prosody-aware Generative Spoken Language Modeling (2021) 20.95
- Generative Pre-trained Speech Language Model With Efficient Hierarchical Transformer (2024) 5.96