SAM: A Mamba-2 State-space Audio-language Model
2025 · Taehan Lee, Jaehan Jung, Hyukjun Lee
Abstract
We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
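The representation-level analysis in point (1) refers to the rank and similarity of audio token representations. The abstract does not say which metrics are used, so the following is only a minimal sketch under assumptions: it uses the entropy-based effective rank of a token matrix and the mean pairwise cosine similarity between tokens, two common choices for this kind of analysis. The function names `effective_rank` and `mean_pairwise_cosine` and the `(num_tokens, dim)` input layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def effective_rank(tokens: np.ndarray) -> float:
    """Entropy-based effective rank of a (num_tokens, dim) matrix.

    Assumed metric, not necessarily the one used in the paper.
    """
    # Singular values describe how energy is spread across directions.
    s = np.linalg.svd(tokens, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0  # an all-zero matrix has no informative directions
    p = s / total
    p = p[p > 0]  # drop exact zeros so log(p) is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def mean_pairwise_cosine(tokens: np.ndarray, eps: float = 1e-12) -> float:
    """Mean cosine similarity over all distinct token pairs."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, eps)  # guard against zero-norm rows
    sim = normed @ normed.T
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # exclude self-similarity
    return float(off_diag.mean())


if __name__ == "__main__":
    # Hypothetical stand-in for audio encoder outputs: 64 tokens, 32 dims.
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((64, 32))
    print("effective rank:", effective_rank(tokens))
    print("mean pairwise cosine:", mean_pairwise_cosine(tokens))
```

Under these assumed metrics, a low effective rank combined with a high mean cosine similarity would indicate redundant tokens, while a higher rank and lower similarity would indicate that each token carries more distinct information.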