Single-microphone Speaker Separation And Voice Activity Detection In Noisy And Reverberant Environments
2024 · Renana Opochinsky, Mordehay Moradi, Sharon Gannot
Abstract
Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed \( \text{Sep-TFAnet}^{\text{VAD}} \), which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for the analysis and synthesis
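To make the fixed analysis and synthesis stage concrete, here is a minimal NumPy sketch of an STFT front end and an iSTFT back end wrapped around a masking step. It is an illustration under stated assumptions, not the authors' implementation: the frame length, hop size, and Hann window are arbitrary choices, the mask-based formulation is assumed by analogy with Conv-TasNet, and the masks are random placeholders standing in for whatever the separation network would produce.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Analysis: windowed frames -> one-sided spectra of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]      # periodic Hann window (assumed choice)
    x = np.pad(x, (n_fft, n_fft))            # pad so every sample is covered by overlapping frames
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=256, hop=128):
    """Synthesis: windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(n_fft + hop * (spec.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)            # avoid division by zero at the padded edges
    return out[n_fft:n_fft + length]         # drop the padding added in stft()

rng = np.random.default_rng(0)
mixture = rng.standard_normal(1024)          # stand-in for a single-microphone mixture
spec = stft(mixture)

# Hypothetical masks: random placeholders, NOT network outputs.
mask_a = rng.uniform(0.0, 1.0, size=spec.shape)
mask_b = 1.0 - mask_a

est_a = istft(mask_a * spec, len(mixture))   # "speaker A" estimate
est_b = istft(mask_b * spec, len(mixture))   # "speaker B" estimate
reconstructed = istft(spec, len(mixture))    # no masking: should recover the mixture
```

Because both transforms are fixed and linear, the unmasked path reconstructs the input, and two masks that sum to one yield estimates that sum back to the mixture. In a trained system only the masks (or whatever the network outputs in the TF domain) would be learned.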
Related papers
- TasNet: Time-domain Audio Separation Network For Real-time, Single-channel Speech Separation (2017) 20.16
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019) 12.47
- Two-stage Model And Optimal SI-SNR For Monaural Multi-speaker Speech Separation In Noisy Environment (2020) 0.00
- Deformable Temporal Convolutional Networks For Monaural Noisy Reverberant Speech Separation (2022) 8.09
- Voice And Accompaniment Separation In Music Using Self-attention Convolutional Neural Network (2020) 0.00
- Deep Attractor Network For Single-microphone Speaker Separation (2016) 17.88
- Dual-path Filter Network: Speaker-aware Modeling For Speech Separation (2021) 3.58
- TF-GridNet: Integrating Full- And Sub-band Modeling For Speech Separation (2022) 0.00