X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
2020 · Zining Zhang, Bingsheng He, Zhenjie Zhang
Abstract
Extracting the speech of a target speaker from mixed audio, based on a reference speech sample from that speaker, is a challenging yet powerful technology in speech processing. Recent studies of speaker-independent speech separation, such as TasNet, have shown promising results by applying deep neural networks to the time-domain waveform. Such separation networks do not directly generate reliable and accurate output when target speakers are specified, because they require prior knowledge of the number of speakers and lack robustness when dealing with audio in which a specified speaker is absent. In this paper, we break these limitations by introducing a new speaker-aware speech masking method, called X-TaSNet. Our proposal adopts new strategies, including a distortion-based loss and a corresponding alternating training scheme, to better address the robustness issue. X-TaSNet significantly enhances the extracted speech quality, doubling SDRi and SI-SNRi of the output speech audio over state-of-
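The abstract reports gains in SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the extracted speech over the unprocessed mixture. As a rough illustration of what that metric measures, the sketch below uses the commonly used SI-SNR definition in plain Python. The function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for this example, and it is not taken from the authors' evaluation code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Hypothetical helper for illustration; eps only guards against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to isolate the target component.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(r * r for r in e_noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

A positive `si_snr_improvement` means the extracted signal is closer to the target than the input mixture was. Because of the projection step, rescaling the estimate leaves the score essentially unchanged.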
Related papers
- TasNet: Time-domain Audio Separation Network For Real-time, Single-channel Speech Separation (2017) 20.16
- Conv-TasNet: Surpassing Ideal Time-frequency Magnitude Masking For Speech Separation (2018) 24.08
- X-CrossNet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion (2024) 0.00
- Demystifying TasNet: A Dissecting Approach (2019) 12.10
- SpeakerBeam-SS: Real-time Target Speaker Extraction With Lightweight Conv-TasNet And State Space Modeling (2024) 7.16
- Time Domain Audio Visual Speech Separation (2019) 14.62
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) 5.24
- Target Speech Extraction Based On Blind Source Separation And X-vector-based Speaker Selection Trained With Data Augmentation (2020) 0.00