Single-microphone Speaker Separation And Voice Activity Detection In Noisy And Reverberant Environments

Abstract

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, designed for noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed \( \text{Sep-TFAnet}^{\text{VAD}} \), which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for the analysis and synthesis
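
To make the STFT/iSTFT analysis and synthesis idea concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes a square-root Hann window with 50% overlap, a toy frame length of 8 samples, and a per-bin multiplicative mask; the masking step is an assumption borrowed from Conv-TasNet-style separators and is not stated in the abstract. All function names (`stft`, `istft`, `apply_mask`, and so on) are hypothetical, and the all-ones mask stands in for whatever a separation network would estimate.

```python
import cmath
import math


def sqrt_hann(n_fft):
    # Square root of a periodic Hann window. Used for both analysis and
    # synthesis, the squared windows sum to 1 at 50% overlap.
    return [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft))
            for n in range(n_fft)]


def dft(frame):
    # Naive O(N^2) discrete Fourier transform; fine for a toy frame size.
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    # Naive inverse DFT; keeps the real part since the input signal is real.
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def stft(signal, n_fft=8):
    # Analysis: window each frame (hop = n_fft / 2) and transform it.
    hop = n_fft // 2
    win = sqrt_hann(n_fft)
    return [dft([w * s for w, s in zip(win, signal[start:start + n_fft])])
            for start in range(0, len(signal) - n_fft + 1, hop)]


def apply_mask(frames, mask):
    # Assumed separation step: scale every time-frequency bin by a mask value.
    return [[m * x for m, x in zip(mask_row, frame)]
            for mask_row, frame in zip(mask, frames)]


def istft(frames, n_fft=8):
    # Synthesis: inverse-transform each frame, window again, overlap-add.
    hop = n_fft // 2
    win = sqrt_hann(n_fft)
    out = [0.0] * (hop * (len(frames) - 1) + n_fft)
    for i, spec in enumerate(frames):
        for t, value in enumerate(idft(spec)):
            out[i * hop + t] += win[t] * value
    return out


# Toy usage: an all-ones mask (placeholder for a network output) should
# return the input, apart from the first and last half frame.
signal = [math.sin(0.3 * n) for n in range(32)]
frames = stft(signal)
ones_mask = [[1.0] * len(frame) for frame in frames]
reconstructed = istft(apply_mask(frames, ones_mask))
```

With this window choice, every sample covered by two overlapping frames is reconstructed exactly, so any change in the output comes from the mask alone. The first and last half frame are attenuated because they are covered by only one window; padding the signal before analysis is one common way to handle that.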
