Mixture-of-Mamba: Enhancing Multi-modal State-space Models With Modality-aware Sparsity
2025 · Weixin Liang, Junhong Shen, Genghan Zhang, et al.
Abstract
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting
Related papers
- SSAMBA: Self-supervised Audio Representation Learning With Mamba State Space Model (2024)
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024)
- Dual-path Mamba: Short And Long-term Bidirectional Selective Structured State Space Models For Speech Separation (2024)
- Improving Speech Enhancement By Cross- And Sub-band Processing With State Space Model (2025)
- SAM: A Mamba-2 State-space Audio-language Model (2025)
- Audio Mamba: Bidirectional State Space Model For Audio Representation Learning (2024)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Leveraging Joint Spectral And Spatial Learning With MAMBA For Multichannel Speech Enhancement (2024)