StableToken: A Noise-robust Semantic Speech Tokenizer For Resilient SpeechLLMs
2025 · Yuhan Song, Linhao Zhang, Chuhan Wu, et al.
Abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates
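The two ideas named in the abstract, bit-wise voting across parallel branches and Unit Edit Distance (UED), can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of integer token IDs as bit patterns, the odd branch count, and the normalization of UED by the clean sequence length are all assumptions made for illustration.

```python
from typing import List, Sequence


def bitwise_majority_vote(branch_tokens: Sequence[int], num_bits: int) -> int:
    """Merge one token per branch into a single token by per-bit majority.

    Assumption: each branch emits an integer whose binary digits are the
    quantized bits; an odd number of branches avoids ties.
    """
    n = len(branch_tokens)
    if n % 2 == 0:
        raise ValueError("use an odd number of branches to avoid ties")
    merged = 0
    for bit in range(num_bits):
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if ones * 2 > n:  # strict majority of branches set this bit
            merged |= 1 << bit
    return merged


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """UED sketch: edit distance normalized by the clean sequence length
    (the normalization is an assumption, not taken from the abstract)."""
    if not clean:
        raise ValueError("clean sequence must be non-empty")
    return edit_distance(clean, noisy) / len(clean)


# Hypothetical data: three branches disagree on one frame, each in a different bit.
branches: List[List[int]] = [[5, 12, 7], [5, 13, 7], [5, 8, 7]]
voted = [bitwise_majority_vote(frame, num_bits=4) for frame in zip(*branches)]
print(voted)
print(unit_edit_distance([5, 12, 7, 3], [5, 9, 7, 3]))
```

In the hypothetical data, only one branch emits the token 12 for the middle frame, yet the per-bit vote still recovers it, because the other two branches err in different bits. A token-level vote could not resolve that frame, which is one plausible reading of why the abstract emphasizes voting at the bit level.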
Related papers
- What Makes A Good Speech Tokenizer For LLM-centric Speech Generation? A Systematic Study (2025) 0.00
- ALMTokenizer: A Low-bitrate And Semantic-rich Audio Codec Tokenizer For Audio Language Modeling (2025) 0.00
- TokenChain: A Discrete Speech Chain Via Semantic Token Modeling (2025) 0.00
- BEST-STD2.0: Balanced And Efficient Speech Tokenizer For Spoken Term Detection (2025) 0.00
- WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer For Audio Language Modeling (2024) 6.22
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024) 0.00
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer Based On Supervised Semantic Tokens (2024) 0.00
- Continuous Speech Tokenizer In Text To Speech (2024) 4.06