SSAMBA: Self-supervised Audio Representation Learning With Mamba State Space Model
2024 · Siavash Shams, Sukru Samet Dindar, Xilin Jiang, et al.
Abstract
Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. W
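The complexity contrast drawn in the abstract can be illustrated with a minimal sketch. It assumes a toy diagonal, time-invariant SSM with the same parameters shared by both directions; this is not Mamba's selective, input-dependent parameterization and not the authors' implementation. The names `ssm_scan`, `bidirectional_ssm`, and `attention_score_entries` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Run a toy diagonal linear SSM over a sequence in a single pass.

    x: (T, D) float input sequence; a, b, c: (D,) per-channel parameters.
    Recurrence: h_t = a * h_{t-1} + b * x_t, output y_t = c * h_t.
    The loop does T steps with a state of size D, so cost grows
    linearly with sequence length T.
    """
    h = np.zeros(x.shape[1])
    y = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y


def bidirectional_ssm(x, a, b, c):
    """Sum a forward scan and a time-reversed scan.

    Assumption: both directions share parameters and are combined by
    addition; a real bidirectional block may do this differently.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return forward + backward


def attention_score_entries(seq_len):
    """Number of entries in a full self-attention score matrix (T x T)."""
    return seq_len * seq_len
```

Under these assumptions, doubling the sequence length doubles the number of scan steps, while `attention_score_entries` quadruples, which is the quadratic growth the abstract attributes to transformers.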
Related papers
- Audio Mamba: Bidirectional State Space Model For Audio Representation Learning (2024) · 11.58
- Audio Mamba: Selective State Spaces For Self-supervised Audio Representations (2024) · 9.23
- SAM: A Mamba-2 State-space Audio-language Model (2025) · 0.00
- Samba-ASR: State-of-the-art Speech Recognition Leveraging Structured State-space Models (2025) · 0.00
- Mixture-of-Mamba: Enhancing Multi-modal State-space Models With Modality-aware Sparsity (2025) · 3.42
- An Exploration Of Mamba For Speech Self-supervised Models (2025) · 1.20
- RawBMamba: End-to-end Bidirectional State Space Model For Audio Deepfake Detection (2024) · 10.21
- Mamba-based Decoder-only Approach With Bidirectional Speech Modeling For Speech Recognition (2024) · 0.00