An Exploration Of Mamba For Speech Self-supervised Models
2025 · Tzu-Quan Lin, Heng-Cheng Kuo, Tzu-Chieh Wei, et al.
Abstract
While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction. The codebase is availabl
Related papers
- An Investigation Of Incorporating Mamba For Speech Enhancement (2024)
- Mamba-SEUNet: Mamba UNet For Monaural Speech Enhancement (2024)
- Pushing The Limits Of Unsupervised Unit Discovery For SSL Speech Representation (2023)
- Audio Mamba: Selective State Spaces For Self-supervised Audio Representations (2024)
- Speech Slytherin: Examining The Performance And Efficiency Of Mamba For Speech Separation, Recognition, And Synthesis (2024)
- Dual-path Mamba: Short And Long-term Bidirectional Selective Structured State Space Models For Speech Separation (2024)
- SSAMBA: Self-supervised Audio Representation Learning With Mamba State Space Model (2024)
- Schrödinger Bridge Mamba For One-step Speech Enhancement (2025)