Scaling Spoken Language Models With Syllabic Speech Tokenization
2025 · Nicholas Lee, Cheol Jun Cho, Alan W Black, et al.
Abstract
Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from self-supervised (SSL) speech models. Because the most successful LMs are based on the Transformer architecture, processing these long token streams is expensive: self-attention scales quadratically with sequence length. Recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable because it compresses token sequences substantially (to 4-5 Hz). Yet the value of these tokens for spoken language modeling has not been fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame-rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syl
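To make the quadratic-scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It compares token counts and the relative self-attention cost for the same audio clip at two token rates. The 5 Hz figure is the upper end of the 4-5 Hz range quoted in the abstract; the 50 Hz baseline rate and the 30-second clip length are assumptions chosen for illustration and are not stated in the abstract. The function names are hypothetical.

```python
def num_tokens(duration_s: float, rate_hz: float) -> int:
    """Number of tokens emitted for a clip of the given duration."""
    return round(duration_s * rate_hz)


def attention_cost_ratio(high_rate_hz: float, low_rate_hz: float) -> float:
    """Ratio of self-attention cost for the same audio at two token rates.

    Self-attention cost grows with the square of sequence length, and
    sequence length is proportional to token rate, so the ratio is the
    squared ratio of the rates.
    """
    return (high_rate_hz / low_rate_hz) ** 2


HIGH_RATE_HZ = 50.0     # assumption: illustrative baseline, not from the abstract
SYLLABIC_RATE_HZ = 5.0  # upper end of the 4-5 Hz range in the abstract
CLIP_S = 30.0           # assumption: illustrative clip length

print(num_tokens(CLIP_S, HIGH_RATE_HZ))                      # 1500
print(num_tokens(CLIP_S, SYLLABIC_RATE_HZ))                  # 150
print(attention_cost_ratio(HIGH_RATE_HZ, SYLLABIC_RATE_HZ))  # 100.0
```

This counts only the attention term. It is not an attempt to reproduce the 2x training-time or 5x FLOPs reductions reported in the abstract, which depend on the full model and the authors' actual baseline.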
Related papers
- Sylber: Syllabic Embedding Representation Of Speech From Raw Audio (2024)
- What Makes A Good Speech Tokenizer For LLM-centric Speech Generation? A Systematic Study (2025)
- A Study On The Integration Of Pre-trained SSL, ASR, LM And SLU Models For Spoken Language Understanding (2022)
- Long-form Speech Generation With Spoken Language Models (2024)
- DiscreteSLU: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- SD-HuBERT: Sentence-level Self-distillation Induces Syllabic Organization In HuBERT (2023)
- FastAdaSP: Multitask-adapted Efficient Inference For Large Speech Language Model (2024)