Mamba-based Decoder-only Approach With Bidirectional Speech Modeling For Speech Recognition
2024 · Yoshiki Masuyama, Koichi Miyazaki, Masato Murata
Abstract
Selective state space models (SSMs), such as Mamba, have demonstrated computational efficiency and promising results in various tasks, including automatic speech recognition (ASR). Mamba has previously been applied to ASR within the attention-based encoder-decoder framework, in which the cross-attention mechanism between the encoder and decoder is retained. This paper explores the capability of Mamba as a decoder-only architecture for ASR. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens autoregressively. To enhance MADEON, we further propose speech prefixing, which performs bidirectional processing on the speech tokens and thereby enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech prefixing and the recently proposed Mamba-2 yields comparable performance to Transformer-based models on
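The idea of speech prefixing (bidirectional over the speech tokens, causal over the text tokens) can be illustrated with a toy, pure-Python sketch. Everything in it is an assumption for illustration only: the scalar recurrence with B and C fixed to 1, the helper names `softplus`, `selective_scan`, `speech_prefixed_layer`, and `stacked`, and the choice to fuse the two directions by summation are hypothetical and are not taken from the MADEON implementation, whose details the abstract does not give.

```python
import math


def softplus(x):
    # Plain softplus; adequate for the small toy inputs used here.
    return math.log1p(math.exp(x))


def selective_scan(xs, a=-1.0, w_delta=1.0):
    """Hypothetical scalar selective SSM, a heavily simplified Mamba-style scan.

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * x_t,  y_t = h_t
    delta_t = softplus(w_delta * x_t) depends on the current input, which is
    what makes the scan "selective". B and C are fixed to 1 for simplicity.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)
        h = math.exp(delta * a) * h + delta * x
        ys.append(h)
    return ys


def speech_prefixed_layer(speech, text):
    """One toy layer with speech prefixing (assumed fusion: sum of directions).

    Speech positions get forward + backward scans over the speech prefix, so
    every speech state sees the whole utterance. Text positions get only the
    forward scan, so they stay causal and can be predicted autoregressively.
    """
    n = len(speech)
    fwd = selective_scan(speech + text)       # causal over [speech; text]
    bwd = selective_scan(speech[::-1])[::-1]  # anti-causal over speech only
    return [f + b for f, b in zip(fwd[:n], bwd)] + fwd[n:]


def stacked(speech, text, num_layers=2):
    # Feed each layer's outputs into the next, re-splitting prefix and text.
    n = len(speech)
    for _ in range(num_layers):
        out = speech_prefixed_layer(speech, text)
        speech, text = out[:n], out[n:]
    return speech + text


if __name__ == "__main__":
    speech = [0.5, -0.2, 0.8]  # toy "speech token" features
    text = [0.1, 0.3]          # toy "text token" features
    states = stacked(speech, text)
    print(len(states))  # 5
```

Because only the speech prefix is scanned backward, changing a future text token never alters earlier hidden states, while changing the last speech token does alter the state at the first speech position. Summing the two directions is just one simple fusion choice; a real model could instead concatenate, gate, or interleave them.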
Related papers
- Dual-path Mamba: Short And Long-term Bidirectional Selective Structured State Space Models For Speech Separation (2024) · 4.12
- Samba-asr: State-of-the-art Speech Recognition Leveraging Structured State-space Models (2025) · 0.00
- Schrödinger Bridge Mamba For One-step Speech Enhancement (2025) · 0.00
- SSAMBA: Self-supervised Audio Representation Learning With Mamba State Space Model (2024) · 0.00
- An Exploration Of Mamba For Speech Self-supervised Models (2025) · 1.20
- An Investigation Of Incorporating Mamba For Speech Enhancement (2024) · 13.70
- Mamba-seunet: Mamba Unet For Monaural Speech Enhancement (2024) · 7.16
- Audio Mamba: Bidirectional State Space Model For Audio Representation Learning (2024) · 11.58