Mamba2 Meets Silence: Robust Vocal Source Separation For Sparse Regions
2025 · Euiyeon Kim, Yong-Hoon Choi
Abstract
We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
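The abstract reports results in cSDR and uSDR without defining them. In music source separation these are usually understood as chunk-level and utterance-level signal-to-distortion ratio. The sketch below illustrates that distinction under simplifying assumptions: mono 1-D signals, a plain energy-ratio SDR, 1-second chunks, and silent reference chunks skipped. The function names are hypothetical, and the paper's exact evaluation protocol (which may rely on a dedicated evaluation toolkit) is not given here.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-8):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite when either power is zero.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


def utterance_sdr(reference, estimate):
    """uSDR-style score (assumed definition): one SDR over the whole track."""
    return float(sdr_db(reference, estimate))


def chunk_sdr(reference, estimate, sample_rate=44100, chunk_seconds=1.0):
    """cSDR-style score (simplified assumption): median SDR over fixed chunks.

    Chunks whose reference is entirely silent are skipped, because the
    ratio is not meaningful there. Trailing samples that do not fill a
    whole chunk are ignored.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    chunk = int(sample_rate * chunk_seconds)
    scores = []
    for start in range(0, len(reference) - chunk + 1, chunk):
        ref = reference[start:start + chunk]
        if not np.any(ref):
            continue  # silent reference chunk
        scores.append(sdr_db(ref, estimate[start:start + chunk]))
    return float(np.median(scores)) if scores else float("nan")


if __name__ == "__main__":
    sr = 44100
    t = np.arange(3 * sr) / sr
    vocals = np.sin(2 * np.pi * 440.0 * t)
    vocals[:sr] = 0.0            # first second has no vocals
    estimate = 0.9 * vocals      # uniform 10% amplitude error
    print(utterance_sdr(vocals, estimate))
    print(chunk_sdr(vocals, estimate, sample_rate=sr))
```

Because a chunk-level score aggregates per-chunk values, it reacts differently to tracks with long silent stretches than a whole-track score does, which is why reporting both is informative for sparsely occurring vocals.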
Related papers
- Dual-path Mamba: Short And Long-term Bidirectional Selective Structured State Space Models For Speech Separation (2024)
- U-Mamba-Net: A Highly Efficient Mamba-based U-net Style Network For Noisy And Reverberant Speech Separation (2024)
- Mamba-SEUNet: Mamba UNet For Monaural Speech Enhancement (2024)
- A Recurrent Encoder-decoder Approach With Skip-filtering Connections For Monaural Singing Voice Separation (2017)
- MaD TwinNet: Masker-denoiser Architecture With Twin Networks For Monaural Sound Source Separation (2018)
- Audio Mamba: Selective State Spaces For Self-supervised Audio Representations (2024)
- Voice And Accompaniment Separation In Music Using Self-attention Convolutional Neural Network (2020)
- HTMD-Net: A Hybrid Masking-denoising Approach To Time-domain Monaural Singing Voice Separation (2021)